# One-node Neural Net through Logistic Regression



In [2]:
import numpy as np
import copy
import matplotlib.pyplot as plt
import scipy



## 1 - Problem Statement ##

Given a dataset containing:  
- a training set of m_train images labeled as True (y=1) or False (y=0)
- a test set of m_test images labeled the same as above
- each image has shape (num_pixel, num_pixel, 3), where 3 is the number of color channels (RGB). Each image is therefore square (height = width = num_pixel).

we want to build a simple image-recognition algorithm that correctly classifies whether a given picture belongs to the class or not.


## 2 - Preprocessing

Given the shape of each image, our hypothetical training and test sets each have shape (num_examples, num_pixel, num_pixel, 3). Before feeding the data to the neural network, we flatten each example into a column vector of length num_pixel * num_pixel * 3, so that each set becomes a matrix with one column per example.

In [5]:
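# Reshape the training and test examples into (num_pixel * num_pixel * 3, num_examples)
# (train_set_x_orig and test_set_x_orig are assumed to be loaded already)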
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T
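
As a quick check of the reshape, the sketch below applies it to a hypothetical batch of 4 random 2x2 RGB images. The names `toy_images` and `toy_flat` are illustrative and not part of the notebook's dataset. Each column of the result holds one flattened image.

```python
import numpy as np

# Hypothetical toy batch: 4 images, 2x2 pixels, 3 channels (not the real dataset)
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(4, 2, 2, 3))

# Same reshape as above: one row per image, then transpose to one column per image
toy_flat = toy_images.reshape(toy_images.shape[0], -1).T

print(toy_flat.shape)  # (12, 4)

# Column i holds the pixels of image i in row-major order
assert np.array_equal(toy_flat[:, 1], toy_images[1].ravel())
```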

During training, we multiply the inputs by weights and add a bias in order to compute the neuron's activation. It is important for all features to have a similar range so that the gradients do not explode. Since the value in each RGB channel ranges from 0 to 255, we divide all values by 255 to rescale the dataset to the range [0, 1].

In [7]:
train_set_x = train_set_x_flatten / 255.
test_set_x = test_set_x_flatten / 255.
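
The effect of the division can be seen on a few hypothetical pixel values. The names `raw` and `scaled` are illustrative, and the `uint8` dtype is an assumption about how the images are stored.

```python
import numpy as np

# Hypothetical raw pixel values (assumed stored as uint8)
raw = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# True division by 255. yields floating-point values in [0, 1]
scaled = raw / 255.

print(scaled.dtype)                # float64
print(scaled.min(), scaled.max())  # 0.0 1.0
```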

## 3 - The learning algorithm ##

The intention is to build a Logistic Regression, using a Neural Network mindset (inspired by deeplearning.ai). See below for a graphical representation showing why Logistic Regression can be viewed as a Neural Network.

<center><img src="model.PNG" width=600></center>  
    
**Mathematical expression of the algorithm**:

For one example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b $$
$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})$$ 
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})$$

The cost is then computed by averaging the loss over all training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})$$
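
As a small worked example of these formulas, the sketch below evaluates the loss and cost for three made-up activations and labels (the values of `a` and `y` are assumptions chosen for illustration).

```python
import numpy as np

# Hypothetical activations and labels for m = 3 examples
a = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])

# Per-example cross-entropy loss L(a, y)
losses = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Cost J is the mean of the per-example losses
J = losses.mean()

print(np.round(losses, 4))  # [0.1054 0.2231 0.5108]
print(round(float(J), 4))   # 0.2798
```

The confident, correct prediction (0.9 for a positive example) contributes the smallest loss, while the hesitant one (0.6) contributes the largest.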


<a name='4'></a>
## 4 - Building the parts of our algorithm ## 

The main steps for building a Neural Network are:
1. Define the model structure (such as number of input features) 
2. Initialize the model's parameters
3. Loop:
    - Calculate current loss (forward propagation)
    - Calculate current gradient (backward propagation)
    - Update parameters (gradient descent)

We will build steps 1-3 separately and then integrate them into a single function, `model()`.

<a name='4-1'></a>
### 4.1 - Helper functions

<a name='ex-3'></a>
### Sigmoid
We use $sigmoid(z) = \frac{1}{1 + e^{-z}}$, with $z = w^T x + b$, to map the linear output to a probability for the final prediction:

In [8]:
def sigmoid(z):
    """
    Compute the sigmoid of z

    Parameters
    ----------
    z : array
        A scalar or numpy array of any size.

    Returns
    -------
    s : array
        sigmoid(z)
    """
    s = 1 / (1 + np.exp(-z))
    
    return s
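
A brief usage sketch follows, with the function restated so the snippet runs on its own. The input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # Same formula as above, repeated so this snippet is self-contained
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))  # 0.5

# Works element-wise on arrays
z = np.array([-2.0, 0.0, 2.0])
s = sigmoid(z)

# Outputs lie strictly between 0 and 1, and sigmoid(-z) = 1 - sigmoid(z)
assert np.all((s > 0) & (s < 1))
assert np.allclose(sigmoid(-z), 1 - s)
```

For very negative `z`, `np.exp(-z)` can overflow and emit a RuntimeWarning. If that becomes a problem, `scipy.special.expit` is one numerically stable alternative.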

### 4.2 - Initializing parameters

The function below initializes w (the weights) as a vector of zeros and b (the bias term) as 0.

In [13]:
def initialize_with_zeros(n):
    """
    This function creates a vector of zeros of shape (n, 1) for w and initializes b to 0.
    
    Parameters
    ----------
    n : int
        size of the w vector
    
    Returns
    -------
    w : array
        initialized vector of shape (n, 1)
    b : float
        initialized scalar of type float (bias term)
    """
    w = np.zeros((n, 1))
    b = 0.

    return w, b


### 4.3 - Forward and Backward propagation

Now that the parameters are initialized, we can implement the "forward" and "backward" propagation steps for learning the parameters.

Forward Propagation:
- Computing $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})$
- Calculating the cost function: $J = -\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}))$

Backward propagation to calculate derivatives (gradients): 

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$
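
For reference, these gradients follow from the chain rule. The sketch below works through a single example, dropping the superscript $(i)$ and using the identity $\sigma'(z) = a(1-a)$:

$$ \frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} $$
$$ \frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \, a(1-a) = -y(1-a) + (1-y)a = a - y $$

Since $z = w^T x + b$, we have $\frac{\partial z}{\partial w} = x$ and $\frac{\partial z}{\partial b} = 1$, so $\frac{\partial \mathcal{L}}{\partial w} = x(a - y)$ and $\frac{\partial \mathcal{L}}{\partial b} = a - y$. Averaging over the $m$ examples gives the two expressions above.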

In [25]:
def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Parameters
    ----------
    w : vector - weights, a numpy array of size (num_pixel * num_pixel * 3, 1)
    b : float - bias, a scalar
    X : 2-d array - data of size (num_pixel * num_pixel * 3, number of examples)
    Y : array - true label of size (1, number of examples)

    Returns
    -------
    grads : dictionary containing the gradients dw and db
        dw : gradient of the cost with respect to w, thus same shape as w
        db : gradient of the cost with respect to b, thus a scalar like b
    cost : negative log-likelihood cost for logistic regression
    """

    m = X.shape[1]

    # FORWARD PROPAGATION
    A = sigmoid(np.dot(w.T, X) + b)  # compute activation
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # calculate cost
    cost = np.squeeze(np.array(cost))

    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = 1 / m * np.dot(X, (A - Y).T)
    db = 1 / m * np.sum(A - Y)

    grads = {"dw": dw, "db": db}

    return grads, cost
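
One way to gain confidence in the gradient formulas is a numerical gradient check, which compares the analytic gradients against central finite differences of the cost. The sketch below does this on small random data. The helpers `cost_fn` and `analytic_grads` are hypothetical names that simply restate the formulas above so the snippet is self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_fn(w, b, X, Y):
    # Cross-entropy cost J, averaged over the examples
    A = sigmoid(np.dot(w.T, X) + b)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def analytic_grads(w, b, X, Y):
    # dJ/dw = X (A - Y)^T / m and dJ/db = sum(A - Y) / m
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    return np.dot(X, (A - Y).T) / m, float(np.sum(A - Y) / m)

# Small random problem: 3 features, 5 examples (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5))
w = rng.normal(size=(3, 1)) * 0.1
b = 0.2

dw, db = analytic_grads(w, b, X, Y)

# Central finite differences, one coordinate at a time
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    dw_num[j, 0] = (cost_fn(w + step, b, X, Y) - cost_fn(w - step, b, X, Y)) / (2 * eps)
db_num = (cost_fn(w, b + eps, X, Y) - cost_fn(w, b - eps, X, Y)) / (2 * eps)

# Both differences should be tiny if the formulas are right
print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```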



### 4.4 - Optimization
Using an optimize function we want to update the parameters using gradient descent. The goal is to learn $w$ and $b$ by minimizing the cost function $J$. For a parameter $\theta$, the update rule is $ \theta = \theta - \alpha \text{ } d\theta$, where $\alpha$ is the learning rate.

In [27]:
def optimize(w, b, X, Y, n_iter=100, alpha=0.009, verbose=False):
    """
    This function optimizes w and b by running a gradient descent algorithm
    
    Parameters
    ----------
    w : vector - weights, a numpy array of size (num_pixel * num_pixel * 3, 1)
    b : float - bias, a scalar
    X : 2-d array - data of size (num_pixel * num_pixel * 3, number of examples)
    Y : array - true label of size (1, number of examples)
    n_iter : number of iterations of the optimization loop
    alpha : learning rate of the gradient descent update rule
    verbose : True to print the cost every 100 iterations
    
    Returns
    -------
    params : dictionary containing the weights w and bias b
    grads : dictionary containing the gradients of the cost with respect to the weights and bias (from the last iteration)
    costs : list of the costs recorded every 100 iterations, used to plot the learning curve.
    """
    
    w = copy.deepcopy(w)
    b = copy.deepcopy(b)
    
    costs = []
    
    for i in range(n_iter):
        # Cost and gradient calculation 
        grads, cost = propagate(w, b, X, Y)

        dw = grads["dw"]
        db = grads["db"]
        
        # Update weight and bias (gradient descent step)
        w = w - alpha * dw
        b = b - alpha * db
           
        # Record the cost every 100 iterations
        if i % 100 == 0:
            costs.append(cost)
            if verbose:
                print("Cost after iteration %i: %f" % (i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs
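
To see the update rule at work, the sketch below runs plain gradient descent on a tiny made-up dataset with a single feature. The data, the learning rate, and the iteration count are all assumptions chosen for illustration, and the loop is a condensed restatement of the logic above rather than a call to `optimize()`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical 1-feature dataset: label is 1 when the feature is positive
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])
m = X.shape[1]

w, b, alpha = np.zeros((1, 1)), 0.0, 0.5
history = []
for i in range(200):
    # Forward pass and cost
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    history.append(float(cost))
    # Gradient descent update: theta = theta - alpha * dtheta
    w = w - alpha * np.dot(X, (A - Y).T) / m
    b = b - alpha * np.sum(A - Y) / m

# The first cost is log(2) because w = 0 and b = 0 give A = 0.5 everywhere
print(history[0], history[-1])
```

With zero-initialized parameters the cost starts at $\log 2 \approx 0.693$ and decreases as the weight grows in the direction that separates the two classes.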


### 4.5 - Prediction
There are two steps to computing predictions:

1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

2. Convert the entries of $A$ into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and store the predictions in a vector `Y_prediction`.

In [31]:
def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)
    
    Parameters
    ----------
    w : vector - weights, a numpy array of size (num_pixel * num_pixel * 3, 1)
    b : float - bias, a scalar
    X : 2-d array - data of size (num_pixel * num_pixel * 3, number of examples)
    
    Returns
    -------
    Y_prediction : array containing all predictions (0/1) for the examples in X
    '''
    
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector A predicting the probabilities
    A = sigmoid(np.dot(w.T, X) + b)
    
    # Convert probabilities to 0/1 predictions: 1 if A > 0.5, else 0
    Y_prediction = (A > 0.5).astype(float)
    
    return Y_prediction
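
To illustrate the thresholding on concrete numbers, the sketch below uses hypothetical parameters `w` and `b` and three made-up examples, with the conversion written as a single array comparison. The third example lands exactly on an activation of 0.5 and is therefore assigned to class 0.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical learned parameters and three examples with two features each
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])

# Step 1: probabilities, shape (1, 3); here z = [2, -2, 0]
A = sigmoid(np.dot(w.T, X) + b)

# Step 2: threshold at 0.5 (an activation of exactly 0.5 maps to 0)
Y_prediction = (A > 0.5).astype(float)

print(Y_prediction)  # [[1. 0. 0.]]
```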

## 5 - All functions into a model ##

The `model()` function uses the following notation:
- `Y_prediction_test` for predictions on the test set
- `Y_prediction_train` for predictions on the train set
- `params`, `grads`, `costs` for the outputs of `optimize()`

In [33]:
def model(X_train, Y_train, X_test, Y_test, n_iter=2000, alpha=0.5, verbose=False):
    """
    Builds the logistic regression model by calling the helper functions
 
    Parameters
    ----------
    X_train : numpy array
        training set of shape (num_pixel * num_pixel * 3, m_train)
    Y_train : numpy array
        training labels of shape (1, m_train)
    X_test : numpy array
        test set of shape (num_pixel * num_pixel * 3, m_test)
    Y_test : numpy array
        test labels of shape (1, m_test)
    n_iter : int 
        number of iterations of the optimization loop
    alpha : float
        learning rate of the gradient descent update rule
    verbose : Boolean
        True to print the loss every 100 steps
    
    Returns
    -------
    d : dictionary 
        containing the recorded costs, the train and test predictions,
        the learned w and b, and the hyperparameters alpha and n_iter.
    """
    
    # initialize parameters with zeros 
    w, b = initialize_with_zeros(X_train.shape[0])
    
    # Gradient descent 
    params, grads, costs = optimize(w, b, X_train, Y_train, n_iter=n_iter, alpha=alpha, verbose=verbose)
    
    # Retrieve parameters w and b from dictionary "params"
    w = params["w"]
    b = params["b"]
    
    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test accuracy
    if verbose:
        print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
        print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "alpha" : alpha,
         "n_iter": n_iter}
    
    return d
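
As an end-to-end illustration, the sketch below trains on synthetic two-feature data instead of images and reports accuracy with the same formula as `model()`. Everything here is an assumption made for the example: the data, the split, the hyperparameters, and the helpers `train_logreg` and `accuracy`, which are condensed, hypothetical stand-ins for the functions defined above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, Y, n_iter=1000, alpha=0.5):
    """Condensed stand-in for initialize_with_zeros + optimize (hypothetical helper)."""
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    for _ in range(n_iter):
        A = sigmoid(np.dot(w.T, X) + b)
        w = w - alpha * np.dot(X, (A - Y).T) / m
        b = b - alpha * np.sum(A - Y) / m
    return w, b

def accuracy(w, b, X, Y):
    """Accuracy in percent, computed as in model() (hypothetical helper)."""
    preds = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(float)
    return 100 - np.mean(np.abs(preds - Y)) * 100

# Synthetic data: label is 1 when the first feature exceeds the second
rng = np.random.default_rng(1)
X_all = rng.uniform(0, 1, size=(2, 300))
Y_all = (X_all[0] > X_all[1]).astype(float).reshape(1, -1)

# Simple split: first 200 examples for training, last 100 for testing
X_train, Y_train = X_all[:, :200], Y_all[:, :200]
X_test, Y_test = X_all[:, 200:], Y_all[:, 200:]

w, b = train_logreg(X_train, Y_train)
train_acc = accuracy(w, b, X_train, Y_train)
test_acc = accuracy(w, b, X_test, Y_test)
print("train accuracy: {} %".format(train_acc))
print("test accuracy: {} %".format(test_acc))
```

Because the true decision boundary is linear in this toy problem, a single logistic unit should separate the classes well. Real image data is usually much harder.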