# Introduction to Neural Networks

## Perceptrons

### Introduction

A single perceptron combines a linear equation (e.g. Wx + b = 0), which defines the decision boundary, with a decision function (the step function):
- The weights of the linear equation label the edges.
- The bias term labels the node.
- The bias can also be treated as a weight on an extra edge whose input is always 1.
- The step function converts the score Wx + b into a binary decision.

### AND, OR, NOT, NAND and XOR perceptrons

**AND**

- Single layer perceptron 
- 1 if both inputs are 1, 0 otherwise 

**OR**

- Single layer perceptron
- 1 if at least one input is 1, 0 otherwise

**NOT**

- Single layer perceptron
- Takes a single input and inverts it: 1 if the input is 0, 0 if the input is 1

**NAND**

- Single layer perceptron: combination of AND and NOT
- Inverse of AND: 0 if both inputs are 1, 1 otherwise

**XOR**

- Two layer perceptron:
    - 1st layer: 2 nodes, NAND and OR
    - 2nd layer: AND of the two first-layer outputs
- 1 if exactly one input is 1, 0 otherwise
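
As a minimal sketch, the gates can be written as perceptrons with hand-picked weights. The helper names `step` and `perceptron` and the specific weight and bias values are illustrative assumptions; many other values produce the same truth tables.

```python
import numpy as np

def step(t):
    """Step activation: 1 if the score is non-negative, else 0."""
    return 1 if t >= 0 else 0

def perceptron(x, w, b):
    """Single perceptron: step function applied to the linear score w.x + b."""
    return step(np.dot(w, x) + b)

# Hand-picked (assumed) weights and biases for each single-layer gate.
def AND(x1, x2):
    return perceptron([x1, x2], [1, 1], -1.5)

def OR(x1, x2):
    return perceptron([x1, x2], [1, 1], -0.5)

def NAND(x1, x2):
    return perceptron([x1, x2], [-1, -1], 1.5)

def NOT(x1):
    return perceptron([x1], [-1], 0.5)

def XOR(x1, x2):
    # Layer 1: NAND and OR on the inputs; layer 2: AND of the two results.
    return AND(NAND(x1, x2), OR(x1, x2))

# Print the truth table: inputs, then AND, OR, NAND, XOR.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NAND(a, b), XOR(a, b))
```

XOR needs the second layer because no single line separates its 1-outputs from its 0-outputs.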

## Perceptron algorithm

The code below implements the perceptron algorithm for a classification problem with a single linear decision boundary.

In [None]:
import numpy as np
# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)

def stepFunction(t):
    if t >= 0:
        return 1
    return 0

def prediction(X, W, b):
    return stepFunction((np.matmul(X,W)+b)[0])

# Implements one pass of the perceptron trick over the data.
# The function receives as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# updates the weights and bias W, b according to the perceptron algorithm,
# and returns W and b.
def perceptronStep(X, y, W, b, learn_rate = 0.01):
    # Loop through each observation
    for i in range(len(X)):
        # Get prediction
        y_hat = prediction(X[i], W, b)
        
        # Determine the sign/value of the adjustment
        delta_mult = 0
        if y[i]-y_hat > 0:
            delta_mult = 1
        elif y[i]-y_hat < 0:
            delta_mult = -1
            
        # Adjust the weights and bias after each prediction
        W += delta_mult*learn_rate*X[i][:, np.newaxis]
        b += delta_mult*learn_rate
    
    return W, b
    
# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate = 0.01, num_epochs = 25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2,1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0]/W[1], -b/W[1]))
    return boundary_lines


## Error function

### Introduction

As a measure of the error in a classification problem, the error function must:
- be continuous and differentiable, so that small changes in the weights produce small changes in the error and gradient descent can be applied
- penalize misclassified points heavily and correctly classified points only lightly
- behave roughly like a distance between the boundary line and the misclassified points

### Sigmoid function

To make the error function continuous, the predictions must be converted from discrete to continuous by changing the activation function from the step function to the sigmoid function. The sigmoid function maps an unbounded input to the range between 0 and 1, so the score of a point is turned into a probability instead of a hard 0/1 prediction.
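
A minimal sketch of the sigmoid function, using the standard definition 1 / (1 + e^(-x)); the function name `sigmoid` and the sample scores are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative scores map close to 0, a score of 0 maps to 0.5,
# and large positive scores map close to 1.
scores = np.array([-5.0, 0.0, 5.0])
print(sigmoid(scores))
```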

### Softmax

To convert the scores of a multi-class classification problem into probabilities, we need a function which:
- Produces probabilities for each class which add up to 1: Divide the class score by the sum of all scores.
- Avoids a division by zero: Convert all scores to positive numbers
- Assigns higher probabilities to higher scores: Use the exponential function to convert the scores.


In [50]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    exp_scores = np.exp(np.array(L))
    sum_exp_scores = np.sum(exp_scores)
    
    return list(exp_scores/sum_exp_scores)
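
For very large scores, `np.exp` can overflow. A common variant, sketched here under the hypothetical name `stable_softmax`, subtracts the maximum score before exponentiating. This leaves the result unchanged because the constant factor cancels in the division.

```python
import numpy as np

def stable_softmax(scores):
    """Softmax that shifts the scores by their maximum to avoid overflow."""
    z = np.asarray(scores, dtype=float)
    exp_z = np.exp(z - np.max(z))  # largest exponent is exp(0) = 1
    return exp_z / np.sum(exp_z)

probs = stable_softmax([2.0, 1.0, 0.0])
print(probs, probs.sum())                 # probabilities that sum to 1
print(stable_softmax([1000.0, 1001.0]))   # stays finite despite large scores
```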

### Maximum likelihood

Maximum likelihood selects the model that maximizes the probability of observing the data. Assuming independent observations, it calculates the joint probability as the product of the probabilities that the model assigns to the labels that were actually observed. For instance:
- If an observation was a 1, what probability does the model assign to this observation being a 1?
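
As a rough illustration, the joint probability can be computed for two hypothetical models on the same labels. The function name `likelihood` and all numbers below are made up for this sketch, and independence between observations is assumed.

```python
import numpy as np

def likelihood(labels, probs_of_one):
    """Joint probability of the labels, assuming independent observations."""
    y = np.asarray(labels, dtype=float)
    p = np.asarray(probs_of_one, dtype=float)
    # Probability the model assigns to the label that was actually observed:
    # p if the label is 1, and 1 - p if the label is 0.
    p_observed = np.where(y == 1, p, 1 - p)
    return np.prod(p_observed)

labels = [1, 1, 0, 0]
model_a = [0.9, 0.8, 0.2, 0.1]  # hypothetical model that fits the labels well
model_b = [0.6, 0.4, 0.7, 0.5]  # hypothetical model that fits them poorly

print(likelihood(labels, model_a), likelihood(labels, model_b))
```

The model with the higher joint probability is the one maximum likelihood prefers.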

### Cross-Entropy

The cross-entropy connects probabilities with error functions. Taking the negative logarithm of the joint probability turns the product of probabilities into a sum of the negative logarithms of the individual probabilities:
- A good model will provide a low cross-entropy (namely, a good model assigns a high probability to the observed data, and the negative of the logarithm of a high probability is a low number)
- A bad model will provide a high cross-entropy

Therefore, the cross-entropy can be read as the size of the error.

In [66]:
import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    Y_array, P_array = np.array(Y), np.array(P)
    
    return -np.sum(Y_array * np.log(P_array) + (1-Y_array)*np.log(1-P_array))
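
The link between cross-entropy and likelihood can be checked numerically. The sketch below redefines the same formula under the name `binary_cross_entropy` so that it runs on its own; the labels and probabilities are made-up values.

```python
import numpy as np

def binary_cross_entropy(Y, P):
    """Sum of negative log-probabilities of the observed binary labels."""
    Y, P = np.asarray(Y, dtype=float), np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

labels = [1, 1, 0]
good = [0.9, 0.8, 0.1]  # hypothetical model close to the labels
bad = [0.3, 0.4, 0.7]   # hypothetical model far from the labels

# Joint probability of the labels under the good model: 0.9 * 0.8 * (1 - 0.1).
likelihood_good = 0.9 * 0.8 * 0.9

# The cross-entropy equals the negative log of the joint probability.
print(binary_cross_entropy(labels, good), -np.log(likelihood_good))
# The worse model has the larger cross-entropy.
print(binary_cross_entropy(labels, bad))
```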

### Logistic regression

- Uses the cross-entropy divided by the number of observations (i.e. the mean cross-entropy) as its error function. Since the model predictions are probabilities which are driven by the weights, the error function is a function of the weights.
- Uses gradient descent to minimize the error function in a step-wise fashion.

## Gradient Descent

- The gradient is the vector of the partial derivatives of the error function with respect to the weights and the bias. Each entry gives the rate of **increase** in the error function per unit change in the corresponding weight.
- The gradient is local: it is only valid at the current values of the weights and must be recomputed after every update.
- The gradient descent algorithm updates the weights in the direction opposite to the gradient (i.e. the direction in which the error function decreases), scaled by the learning rate.
- For logistic regression, the gradient w.r.t. each weight is a scalar times the value of the coordinate (or factor) related to that weight. This scalar is a multiple of the difference between the label and the modelled probability. I.e. the farther away the probability is from the label, the larger the gradient (and vice versa).
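
The points above can be sketched as a gradient descent loop for logistic regression with the mean cross-entropy as error function. The function names, the toy AND dataset, the learning rate and the number of steps are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_cross_entropy(X, y, W, b):
    """Cross-entropy divided by the number of observations."""
    p = sigmoid(X @ W + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_step(X, y, W, b, learn_rate=0.5):
    """One gradient descent update of the weights and bias."""
    p = sigmoid(X @ W + b)
    error = p - y                  # modelled probability minus label
    grad_W = X.T @ error / len(y)  # the scalar times each coordinate, averaged
    grad_b = np.mean(error)
    # Move against the gradient, scaled by the learning rate.
    return W - learn_rate * grad_W, b - learn_rate * grad_b

# Toy data: the AND function on two binary inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

W, b = np.zeros(2), 0.0
error_before = mean_cross_entropy(X, y, W, b)
for _ in range(200):
    W, b = gradient_step(X, y, W, b)
error_after = mean_cross_entropy(X, y, W, b)

print(error_before, error_after)  # the error decreases over the updates
```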

## Neural networks

- Neural networks come into play when the decision boundary needed to classify items is non-linear.
- A neural network effectively combines multiple linear decision boundaries into one model.
- It takes the probability that each 'submodel' assigns to an item, gives each probability a weight, and produces a final probability for the entire network. The node which combines the submodels is itself a model.
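
A minimal sketch of this combination step, assuming two linear submodels and a sigmoid at every node. The function name `combined_model` and all weights and biases are hypothetical values chosen only to make the example run.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combined_model(x, submodels, mix_weights, mix_bias):
    """Weight the submodel probabilities and squash the result with a sigmoid."""
    # Each submodel is a (weights, bias) pair defining one linear boundary.
    probs = np.array([sigmoid(np.dot(w, x) + b) for w, b in submodels])
    return sigmoid(np.dot(mix_weights, probs) + mix_bias)

# Two hypothetical linear submodels on two input factors.
submodels = [
    (np.array([5.0, -2.0]), -8.0),
    (np.array([7.0, -3.0]), 1.0),
]

point = np.array([1.0, 1.0])
p = combined_model(point, submodels, mix_weights=np.array([7.0, 5.0]), mix_bias=-6.0)
print(p)  # final probability for the point
```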

### Layers

- **Input layer**: 
    - First layer containing the inputs
    - Number of **factors** determines the number of nodes
- **Hidden layer**: 
    - Set of linear models built on the input layer
    - Number of **models** determines the number of nodes
- **Output layer**: 
    - Linear models get combined to obtain a non-linear model
    - Number of **classes** determines the number of nodes

A **deep neural network** has multiple hidden layers:
- The first hidden layer consists of linear models, which are combined into nonlinear models.
- Each further hidden layer combines these nonlinear models into even more nonlinear models.

The network thereby splits the n-dimensional input space (n being the number of nodes in the input layer) with a highly nonlinear boundary.

### Feed-forward

Feedforward is the process of turning inputs into outputs through the layers of the network:
- The first step is a matrix multiplication between the matrix of weights (with each row consisting of the weights for one node in the hidden layer) and the vector of inputs. Subsequently, the sigmoid function is applied to the resulting vector to produce probabilities.
- The second step is a vector multiplication between the vector of weights (with each weight corresponding to one edge going into the output layer) and the vector of probabilities resulting from the first step. Again the sigmoid function is applied, to produce the output probability.

Therefore, the feedforward process applies a sequence of linear models and sigmoids to reach the output.
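
The two steps can be sketched with NumPy for a network with 2 inputs, 3 hidden nodes and 1 output. The layer sizes, the random weights and the explicit bias terms `b1` and `b2` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    """Turn an input vector into an output probability."""
    # Step 1: each row of W1 holds the weights of one hidden node.
    hidden = sigmoid(W1 @ x + b1)
    # Step 2: W2 holds one weight per edge going into the output node.
    output = sigmoid(W2 @ hidden + b2)
    return hidden, output

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                      # 2 input factors
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # 3 hidden nodes
W2, b2 = rng.normal(size=3), 0.0               # 1 output node

hidden, output = feedforward(x, W1, b1, W2, b2)
print(hidden, output)
```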

### Backpropagation

In a nutshell, backpropagation will consist of:

- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Using this to update the weights, and get a better model.
- Continuing this until we have a model that is good.

Breaking down backpropagation in a multi-layered neural network:
- The predictions follow from the feed-forward process (i.e. the sequence of matrix multiplications and sigmoids). Therefore, the predictions are a function of all the weights in the network, and so is the error function.
- The gradient is a long vector of the partial derivatives of the error function w.r.t. **all** the weights in the network.
- Gradient descent updates each weight by subtracting the learning rate times the partial derivative of the error function w.r.t. that weight.
- To determine the gradient, the process uses the chain rule to differentiate the composite functions (i.e. the sigmoids and matrix multiplications). The chain rule says that the derivative of a composite function is the product of the derivatives of its parts. Therefore, the partial derivative for each weight is the product of the partial derivatives along the path from the error back to that weight (backwards through the network).
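
For a network with one hidden layer, the chain rule for a first-layer weight can be written out as follows. The notation is an assumption of this sketch: $E$ is the error, $\hat{y} = \sigma(h)$ the prediction, $h$ the score of the output node, $h_1$ the score of the first hidden node, $W^{(1)}_{11}$ a weight feeding that hidden node, and $\alpha$ the learning rate.

```latex
\frac{\partial E}{\partial W^{(1)}_{11}}
  = \frac{\partial E}{\partial \hat{y}}
    \cdot \frac{\partial \hat{y}}{\partial h}
    \cdot \frac{\partial h}{\partial h_1}
    \cdot \frac{\partial h_1}{\partial W^{(1)}_{11}},
\qquad
W^{(1)}_{11} \leftarrow W^{(1)}_{11} - \alpha \, \frac{\partial E}{\partial W^{(1)}_{11}}
```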

Breaking down the layers of the partial derivatives:
- First partial derivative: With a sigmoid output and the cross-entropy error, the partial derivative of the error function with respect to the score of the output node is simply the difference between the probability and the label.
- Second partial derivative: The output score is a linear combination of the probabilities from the hidden layer, each of which is the sigmoid of a hidden node's score. The partial derivative of the output score w.r.t. the score of one hidden node is the weight on the connecting edge times the derivative of the sigmoid at that hidden node's score. The terms for the other hidden nodes drop out, since their sigmoids do not depend on this score and their derivative is 0.
- Third partial derivative: follows in similar fashion.
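
The chain of partial derivatives can be sketched for a small network and checked against a finite-difference estimate. The sketch assumes one hidden layer, no bias terms, a sigmoid output and the cross-entropy error for a single observation; all names and values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    h1 = W1 @ x            # scores of the hidden nodes
    a1 = sigmoid(h1)       # probabilities of the hidden nodes
    y_hat = sigmoid(W2 @ a1)  # output probability
    return a1, y_hat

def loss(x, y, W1, W2):
    y_hat = forward(x, W1, W2)[1]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(x, y, W1, W2):
    a1, y_hat = forward(x, W1, W2)
    d_h = y_hat - y                  # error w.r.t. the output score
    grad_W2 = d_h * a1               # output score w.r.t. W2 is a1
    d_h1 = d_h * W2 * a1 * (1 - a1)  # weight times sigmoid derivative
    grad_W1 = np.outer(d_h1, x)      # hidden score w.r.t. W1 is x
    return grad_W1, grad_W2

rng = np.random.default_rng(1)
x, y = np.array([0.5, -1.0]), 1.0
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=3)
grad_W1, grad_W2 = backprop(x, y, W1, W2)

# Finite-difference estimate of the derivative for one first-layer weight.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (loss(x, y, W1_shift, W2) - loss(x, y, W1, W2)) / eps
print(grad_W1[0, 0], numeric)  # the two values should nearly agree
```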