# Logistic Regression
## 1. Notation
The following notation is used throughout the notebook:
* m: the number of training samples
* n: the number of features per training sample
* $x^{(i)}$: the $i$-th training sample, a vector of shape $(n \times 1)$
* $y^{(i)}$: the label associated with $x^{(i)}$
* X is the training dataset, expressed as the $(n \times m)$ matrix

$\begin{equation}
X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)}\end{bmatrix}
\end{equation}$
* Y represents the labels, generally expressed as the $(1 \times m)$ matrix
$\begin{equation}
Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)}\end{bmatrix}
\end{equation}$
* w represents the weights/parameters to be computed.
* b represents the bias unit, which was previously denoted as $\theta_0$ and associated with the constant extra feature $x_0=1$
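
As a minimal sketch of this layout, the snippet below stacks a few made-up samples column-wise so that $X$ has shape $(n, m)$ and $Y$ has shape $(1, m)$. The toy values and the names `samples` and `labels` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Hypothetical toy data: m = 4 samples with n = 3 features each.
samples = [
    np.array([1.0, 2.0, 3.0]),
    np.array([0.5, 0.1, 0.9]),
    np.array([2.0, 2.5, 0.0]),
    np.array([1.5, 0.3, 1.1]),
]
labels = [0, 1, 1, 0]

# Each sample becomes one column of X, so X has shape (n, m).
X = np.column_stack(samples)

# The labels form a single row, so Y has shape (1, m).
Y = np.array(labels).reshape(1, -1)

n, m = X.shape
print("X shape:", X.shape)
print("Y shape:", Y.shape)
```

With this convention, column `X[:, i]` holds one sample and `Y[0, i]` holds its label (NumPy indices start at 0, while the notation above starts at 1).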

## 2. Technical Details
The main ideas behind the algorithm are covered in the following [notebook](https://github.com/ayhem18/Towards_Data_science/blob/master/Machine_Learning/logistic_regression/logistic_regression.ipynb).

## 3. Implementation
This is a simple implementation of a custom Logistic regression model.

### 3.1 Sigmoid function and parameter initialization

In [1]:
## importing necessary libraries
import numpy as np
import copy 

In [None]:
def sigmoid(x):
    """
    Compute the sigmoid of x

    Arguments:
    x -- A scalar or a numpy array of any shape

    Return:
    sigmoid(x)
    """
    return 1 / (1 + np.exp(-x))
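
For very negative inputs, `np.exp(-x)` can overflow and trigger a NumPy runtime warning, even though the final result is still a valid probability. The sketch below shows one possible alternative, based on the identity $\sigma(x) = \exp(-\log(1 + e^{-x}))$ and NumPy's `np.logaddexp`. The name `stable_sigmoid` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np


def stable_sigmoid(x):
    """
    Sigmoid computed as exp(-log(1 + exp(-x))).

    np.logaddexp(0, -x) evaluates log(exp(0) + exp(-x)) without
    overflowing for large |x|.
    """
    return np.exp(-np.logaddexp(0, -x))


# Works on scalars and on arrays of any shape.
values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)  # approximately [0.0, 0.5, 1.0]
```

For moderate inputs this agrees with the direct formula `1 / (1 + np.exp(-x))`, so either version can be used in the rest of the notebook.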

In [None]:
def initialize_with_zeros(n_features):
    """
    This function creates a vector of zeros of shape (n_features, 1) for w and initializes b to 0.
    
    Argument:
    n_features: number of features for the training examples
    
    Returns:
    w -- initialized vector of shape (n_features, 1)
    b -- initialized scalar (corresponds to the bias) of type float
    """
      
    w = np.zeros((n_features, 1))
    b = 0.0

    return w, b

### 3.2 Output and gradients
Given the matrix $X$, it is necessary to compute:
$\begin{equation}
A = \sigma(w^T X + b) = \begin{bmatrix} a^{(1)} & a^{(2)} & \cdots & a^{(m)}\end{bmatrix}
\end{equation}$
as well as the cost function:
$\begin{equation}
J(w, b) = -\frac{1}{m} \cdot \sum _{i=1}^{m} [y ^ {(i)} \cdot \log(a^{(i)}) + (1 - y^{(i)}) \cdot \log(1 - a^{(i)})]
\end{equation}$
Gradient-based optimization requires the gradients of $J$ with respect to $w$ and $b$:

$\begin{equation}
\begin{aligned}
\frac{\partial J}{\partial w} &= \frac{1}{m} X(A - Y)^T \\
\frac{\partial J}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} (a^{(i)} - y^{(i)})
\end{aligned}
\end{equation}$
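
One way to gain confidence in these formulas is a finite-difference check: perturb each parameter slightly and compare the resulting change in the cost with the analytic gradient. The sketch below does this on random toy data. The helper names (`cost`, `analytic_gradients`, `numerical_gradients`), the data and the step size `eps` are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(w, b, X, Y):
    """Cross-entropy cost J(w, b), following the formula above."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)


def analytic_gradients(w, b, X, Y):
    """Gradients from the closed-form expressions above."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)
    dw = X @ (A - Y).T / m
    db = float(np.sum(A - Y) / m)
    return dw, db


def numerical_gradients(w, b, X, Y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j, 0] = eps
        dw[j, 0] = (cost(w + step, b, X, Y) - cost(w - step, b, X, Y)) / (2 * eps)
    db = (cost(w, b + eps, X, Y) - cost(w, b - eps, X, Y)) / (2 * eps)
    return dw, db


# Hypothetical toy problem: n = 3 features, m = 5 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5))
w = rng.normal(size=(3, 1))
b = 0.1

dw_a, db_a = analytic_gradients(w, b, X, Y)
dw_n, db_n = numerical_gradients(w, b, X, Y)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
print("largest absolute difference:", max_diff)
```

If the analytic expressions are right, the largest difference should be tiny compared with the gradient values themselves.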



In [None]:
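def cost_gradient(w, b, X, Y):
    """
    Compute the cost and its gradients for the parameters (w, b).

    Arguments:
    w, b -- weights of shape (n_features, 1) and scalar bias
    X, Y -- data of shape (n_features, m) and labels of shape (1, m)

    Returns:
    cost, dw, db -- scalar cost, gradient with respect to w (same shape as w), gradient with respect to b
    """
    m = X.shape[1]

    A = sigmoid(np.dot(w.T, X) + b)
    # the dot products sum over the m samples, so cost is a (1, 1) numpy array
    cost = - (np.dot(np.log(A), Y.T) + np.dot(np.log(1 - A), 1 - Y.T)) / m

    dw = np.dot(X, (A - Y).T) / m
    db = np.sum(A - Y) / m
    # convert the cost into a scalar value
    cost = np.squeeze(np.array(cost))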

    return cost, dw, db

### 3.3 Optimization
The parameters are learned with gradient descent: at each iteration, $w := w - \alpha \frac{\partial J}{\partial w}$ and $b := b - \alpha \frac{\partial J}{\partial b}$, where $\alpha$ is the learning rate.

In [None]:
def optimize(w, b, X, Y, num_iterations=200, learning_rate=0.008, print_cost=False, cost_print_cycle=50):
    """
    Learn w and b by running gradient descent on the cost function.

    Arguments:
    w -- weights, a numpy array of shape (n_features, 1)
    b -- bias, a scalar
    X -- data of shape (n_features, m)
    Y -- true "label" vector (containing 0 if negative, 1 if positive), of shape (1, m)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the cost every cost_print_cycle iterations
    cost_print_cycle -- number of iterations between two recorded (and optionally printed) costs
    
    Returns:
    parameters -- list with weights w and bias b
    grads -- list with the last computed gradients of the cost with respect to w and then to b
    costs_history -- list of the costs recorded every cost_print_cycle iterations: can be used for plotting the learning curve.
    
    """
    
    w = copy.deepcopy(w)
    b = copy.deepcopy(b)
    
    costs_history = []
    
    for i in range(num_iterations):
    
        # cost_gradient returns the cost first, then the two gradients
        cost, dw, db = cost_gradient(w, b, X, Y)
        
        # gradient descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record the cost every cost_print_cycle iterations
        if i % cost_print_cycle == 0:
            costs_history.append(cost)
        
            # Optionally print the recorded cost
            if print_cost:
                print("Cost after iteration {}: {}".format(i, cost))
    
    parameters = [w, b]
    grads = [dw, db]

    return parameters, grads, costs_history
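
To illustrate what the optimization loop does, here is a self-contained sketch that applies the same update rule to a made-up dataset of two Gaussian blobs. It inlines the forward pass and the gradients rather than calling the functions above, and the dataset, the learning rate of 0.1 and the 200 iterations are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical toy dataset: two Gaussian blobs with 2 features each.
rng = np.random.default_rng(0)
m = 200
X_neg = rng.normal(loc=-2.0, size=(2, m // 2))  # class 0
X_pos = rng.normal(loc=2.0, size=(2, m // 2))   # class 1
X = np.hstack([X_neg, X_pos])                   # shape (n, m) = (2, 200)
Y = np.hstack([np.zeros((1, m // 2)), np.ones((1, m // 2))])  # shape (1, m)

# Start from zeros, as in initialize_with_zeros.
w = np.zeros((2, 1))
b = 0.0
learning_rate = 0.1

costs = []
for i in range(200):
    # forward pass: predicted probabilities, shape (1, m)
    A = 1 / (1 + np.exp(-(w.T @ X + b)))
    # cross-entropy cost averaged over the m samples
    costs.append(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
    # gradients of the cost
    dw = X @ (A - Y).T / m
    db = np.sum(A - Y) / m
    # gradient descent update
    w = w - learning_rate * dw
    b = b - learning_rate * db

# Accuracy on the training data with a 0.5 threshold.
A = 1 / (1 + np.exp(-(w.T @ X + b)))
accuracy = np.mean((A > 0.5) == Y)
print("first cost:", costs[0], "last cost:", costs[-1])
print("training accuracy:", accuracy)
```

On well-separated data like this, the cost should fall steadily from its starting value of $\log 2$ (the cost when every prediction is 0.5).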

### 3.4 Classification
As logistic regression tackles a classification problem, the final output should be either $1$ or $0$. Given the predicted probability $\hat{y} = \sigma(w^T x + b)$ and a threshold $t$ (chosen to reflect how costly each type of error is), the classification is:
$\begin{equation}
y = \begin{cases}
  1, & \hat{y} > t \\
  0, & \text{otherwise}
\end{cases}
\end{equation}$
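
As a small illustration of how the threshold affects the result, the sketch below applies two different thresholds to the same made-up probabilities. The values in `probabilities` and the threshold 0.8 are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical predicted probabilities for m = 4 samples, shape (1, m).
probabilities = np.array([[0.2, 0.5, 0.7, 0.9]])

# Default threshold: a sample is labeled 1 only if its probability is strictly above 0.5.
default_predictions = (probabilities > 0.5).astype(int)

# A stricter threshold yields fewer positive predictions.
strict_predictions = (probabilities > 0.8).astype(int)

print(default_predictions)  # [[0 0 1 1]]
print(strict_predictions)   # [[0 0 0 1]]
```

Because the comparison is strict, a probability exactly equal to the threshold is classified as 0.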

In [None]:
def predict(w, b, X, threshhold=0.5):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    
    Arguments:
    w -- weights, a numpy array of size (n_features, 1)
    b -- bias, a scalar
    X -- data of size (n_features, m)
    threshhold -- probability above which a sample is classified as 1
    
    Returns:
    Y_prediction -- a numpy array of shape (1, m) containing all predictions (0/1) for the examples in X
    '''
    
    # make sure the weights are a column vector of shape (n_features, 1), not a rank-1 array of shape (n_features,)
    w = w.reshape(X.shape[0], 1)
    
    # compute the probabilities with the learned parameters
    A = sigmoid(np.dot(w.T, X) + b)
    
    # threshold the probabilities and convert the boolean mask into 0/1 labels
    return (A > threshhold).astype(int)
    