# Log-Loss vs Mean Squared Error

In [1]:
# Imports 
import numpy as np
import pandas as pd

There are many other error functions used for neural networks. Let me teach you another one, called the mean squared error. As the name says, this one is the mean of the squares of the differences between the predictions and the labels.

## Gradient Descent with Squared Errors

Goal: Make predictions as close as possible to the real values.

To measure how wrong the predictions are, we use an error metric. A common choice is the sum of the squared errors (SSE).

![a](Images/a1.png)

- where y_hat is the prediction and y is the true value.
- you take the sum over all output units j and another sum over all data points μ


- The variable j indexes the output units of the network. For each output unit, find the difference between the true value y and the predicted value from the network y_hat, then square the difference, then sum up all those squares.


- μ indexes the data points. For each data point you calculate the inner sum of the squared differences over the output units, then you add up those inner sums over all data points. That gives you the overall error for all the output predictions for all the data points.
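
As a small illustration, the double sum can be computed directly with NumPy. The arrays below are made-up values (two data points, three output units), not data from this notebook, and the sketch covers only the double sum itself, leaving out any constant factor in front of it.

```python
import numpy as np

# Hypothetical targets and predictions: 2 data points (rows), 3 output units (columns)
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1]])

# Inner sum over the output units j: one value per data point
per_point = np.sum((y - y_hat) ** 2, axis=1)

# Outer sum over the data points mu
sse = np.sum(per_point)

print(per_point)
print(sse)
```

Summing along `axis=1` first mirrors the order described above: output units first, then data points.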

## Why Squared Errors (SSE)

1. The square ensures the error is always positive, so we only consider magnitude


Remember that the output of a neural network, the prediction, depends on weights.

![a](Images/a2.png)

Our goal is to find the weights $w_{ij}$ that minimize the squared error $E$. To do this with a neural network, we use ***gradient descent***.

We take small steps that each reduce the error a little. The direction of each step comes from the gradient of the squared error. Gradient is another term for rate of change or slope.

## Gradient 

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights.

Below I've plotted an example of the error of a neural network with two inputs, and accordingly, two weights. You can read this like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

At each step, you calculate the error and the gradient, then use those to determine how much to change each weight. Repeating this process will eventually find weights that are close to the minimum of the error function, the black dot in the middle.

![a](Images/a3.png)
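
The repeated step can be sketched on a toy error surface with two weights. The bowl-shaped function below is an assumption chosen for illustration (it is not the surface in the plot), and its gradient is worked out by hand.

```python
import numpy as np

# Hypothetical error surface with its minimum at w = (1, -2)
def error(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

# Gradient of the error above, one partial derivative per weight
def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])   # arbitrary starting weights
learnrate = 0.1

for step in range(100):
    # Move a small step against the gradient
    w = w - learnrate * gradient(w)

print(w, error(w))
```

After enough steps the weights settle next to the minimum, which plays the role of the black dot in the contour plot.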

### Caveats / Local Mins

Since the weights will just go wherever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum, illustrated below. There are methods to avoid this, such as using [momentum](https://distill.pub/2017/momentum/).

![a](Images/a4.png)
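
A one-weight sketch shows how the starting point decides which minimum gradient descent reaches. The error function here is a made-up polynomial with two valleys, chosen only for illustration.

```python
# Hypothetical error curve with a shallow valley near w = 1.13
# and a deeper valley near w = -1.30
def error(w):
    return w ** 4 - 3 * w ** 2 + w

# Derivative of the error curve
def slope(w):
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learnrate=0.01, steps=1000):
    # Plain gradient descent from the starting weight w
    for _ in range(steps):
        w = w - learnrate * slope(w)
    return w

w_right = descend(2.0)    # starts on the right-hand side
w_left = descend(-2.0)    # starts on the left-hand side

print(w_right, error(w_right))
print(w_left, error(w_left))
```

Both runs stop where the slope is zero, but only the run that starts on the left ends in the deeper valley.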


## The Math Explained

First we need some measure of how bad our predictions are. The simplest choice is the difference between the target and the output, $E = (y - \hat{y})$. Squaring it removes negative values and penalizes outliers more, so $E = (y - \hat{y})^2$.


Next we sum the error over all data records, denoted by the sum over μ. We also multiply by 1/2, which cancels the factor of 2 that appears when we take the derivative.

$E = (1/2) \sum_{μ}(y^{μ} - \hat{y}^{μ})^2$

Next, substitute the prediction as a function of the weights:

![a](Images/a5.png)

So for a single output: 

![a](Images/a6.png)

Next we take the negative of the gradient, which points in the direction in which the error decreases fastest.

![a](Images/a7.png)

### Update Step

The update step for each weight can be written as follows:

![a](Images/a8.png)

We must use the chain rule to expand the partial derivative. For

![a](Images/a9.png)

The resulting expansion for the partial derivative of the error with respect to the weights:

![a](Images/a10.png)

$\hat{y}$ depends on the weights, so we take the partial derivative of $\hat{y}$ with respect to the weights.

![a](Images/a11.png)

The partial derivative of the sum $\sum_{i} w_{i}x_{i}$ with respect to $w_{i}$ is just $x_{i}$.

![a](Images/a12.png)

Putting it all together, we get

![a](Images/a15.png)

We can simply define an "ERROR TERM" as  $δ = (y - \hat{y}) * f'(h)$

and write our weight update as:

$w_{i}' = w_{i} + η * δ * x_{i}$
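
One way to sanity-check the derivation is to compare the formula with a numerical derivative. The sketch below assumes a single record, a sigmoid activation, and $E = \frac{1}{2}(y - \hat{y})^2$. The input, target and weight values are made up.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def half_squared_error(w, x, y):
    # E = 1/2 (y - y_hat)^2 for one record
    return 0.5 * (y - sigmoid(np.dot(w, x))) ** 2

# Hypothetical record and weights
x = np.array([0.2, -0.4, 0.7])
y = 1.0
w = np.array([0.1, 0.3, -0.5])

# Analytic gradient from the derivation: dE/dw_i = -(y - y_hat) f'(h) x_i = -delta * x_i
y_hat = sigmoid(np.dot(w, x))
delta = (y - y_hat) * y_hat * (1 - y_hat)
analytic = -delta * x

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (half_squared_error(w + step, x, y)
                  - half_squared_error(w - step, x, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, which is what the weight update $w_{i}' = w_{i} + η δ x_{i}$ relies on.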

# Gradient Descent: The Code 

From before we saw that one weight update can be calculated as:

$\Delta w_i = \eta \, \delta x_i$ 

with the error term δ as

$\delta = (y - \hat y) f'(h) = (y - \hat y) f'(\sum w_i x_i)$

Remember, in the above equation $(y - \hat y)$ is the output error, and f'(h) refers to the derivative of the activation function, f(h). We'll call that derivative the output gradient.

Now I'll write this out in code for the case of only one output unit. We'll also be using the sigmoid as the activation function f(h).

So $f(h) = 1/(1+e^{-h})$

$f'(h) = f(h)(1 - f(h)) = 1/(1+e^{-h}) * (1 - 1/(1+e^{-h}))$

In [2]:
# Functions

# Activation Function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of Activation Function
def sigmoid_prime(x):
    return sigmoid(x) * (1 -sigmoid(x))

In [3]:
learnrate = 0.5
x = np.array([1,2,3,4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

# Gradient Descent Step:


# Calculate the node's linear combination of inputs and weights
h = np.dot(x,w)
# note: not h = x * w, which multiplies element-wise without summing

# Calculate the output
y_hat  = sigmoid(h)

# Calculate the error of the output
error = y - y_hat

# Calculate the error term
error_term = error * sigmoid_prime(h)

# Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(y_hat)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)


Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]


# Implementing gradient descent

Okay, now we know how to update our weights:

$\Delta w_{ij} = \eta * \delta_j * x_i$

You've seen how to implement that for a single update, but how do we translate that code to calculate many weight updates so our network will learn?

As an example, I'm going to have you use gradient descent to train a network on graduate school admissions data (found at http://www.ats.ucla.edu/stat/data/binary.csv). This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige, those with rank 4 have the lowest.

![a](Images/a16.png)

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.

# Data Cleanup

You might think there will be three input units, but we actually need to transform the data first. The rank feature is categorical: the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank 2. Instead, we need to use dummy variables to encode rank, splitting the data into four new columns encoded with ones or zeros. Rows with rank 1 have a one in the rank 1 dummy column, and zeros in all other columns. Rows with rank 2 have a one in the rank 2 dummy column, and zeros in all other columns. And so on.
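
A tiny sketch of the dummy-variable encoding, using pandas. The three-row table is invented for illustration and only borrows the column names `gre` and `rank`.

```python
import pandas as pd

# Hypothetical mini-table with the same column names as the admissions data
toy = pd.DataFrame({'gre': [380, 660, 800], 'rank': [3, 3, 1]})

# One 0/1 column per rank value that appears in the data
dummies = pd.get_dummies(toy['rank'], prefix='rank', dtype=int)
encoded = pd.concat([toy.drop('rank', axis=1), dummies], axis=1)

print(encoded)
```

Only `rank_1` and `rank_3` appear here because the toy table contains no other ranks. With all four ranks present there would be four dummy columns.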

We'll also need to standardize the GRE and GPA data, which means to scale the values such that they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes very negative and very large inputs. For such inputs the gradient of the sigmoid is close to zero, which means that the gradient descent step will go to zero too. Since the GRE and GPA values are fairly large, we have to be really careful about how we initialize the weights or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.
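
The effect of the squashing can be seen by evaluating the sigmoid derivative at a large and at a small input. The weight and the two GRE values below are assumptions picked for illustration.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    return sigmoid(h) * (1 - sigmoid(h))

weight = 0.1          # hypothetical weight
raw_gre = 600.0       # hypothetical unscaled GRE score
scaled_gre = 0.5      # hypothetical standardized GRE score

# Input of 60: the sigmoid is saturated and the gradient is essentially zero
print(sigmoid_prime(weight * raw_gre))

# Input of 0.05: the gradient is close to its maximum of 0.25
print(sigmoid_prime(weight * scaled_gre))
```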

Now that the data is ready, we see that there are six input features: gre, gpa, and the four rank dummy variables. Here is a table of the data.

![a](Images/a17.png)

# Mean Square Error

We're going to make a small change to how we calculate the error here. Instead of the SSE, we're going to use the mean of the squared errors (MSE). Now that we're using a lot of data, summing up all the weight steps can lead to really large updates that make the gradient descent diverge.

To compensate for this, you'd need to use a quite small learning rate. Instead, we can just divide by the number of records in our data, $m$, to take the average. This way, no matter how much data we use, our learning rates will typically be in the range of 0.01 to 0.001. Then, we can use the MSE (shown below) to calculate the gradient and the result is the same as before, just averaged instead of summed.

$E = (1/2m) \sum_{\mu} (y^\mu - \hat{y}^\mu)^2 $

Here's the general algorithm for updating the weights with gradient descent:

- Set the weight step to zero: $\Delta w_i = 0$

For each record in the training data:
 1. calculate the output $\hat y = f(\sum_i w_i x_i)$
 2. calculate the error term, $δ=(y − \hat{y}) * f'(\sum_{i} w_{i}x_{i})$
 3. update the weight step, $Δw_{i} = Δw_{i} + δx_{i}$

- Update the weights, $w_{i} = w_{i} + ηΔw_{i} / m$, where η is the learning rate and $m$ is the number of records.
- Repeat for a set number of epochs
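
The steps above can be sketched for one epoch, first record by record and then as a single matrix product. The data here is random and purely illustrative, and the names `X`, `del_w_loop` and `del_w_vec` are hypothetical.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

# Hypothetical data: 5 records, 3 features, binary targets
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
w = rng.normal(scale=3 ** -0.5, size=3)
learnrate = 0.5
m = len(X)

# Record-by-record accumulation of the weight step
del_w_loop = np.zeros(3)
for x_row, y_row in zip(X, y):
    out = sigmoid(np.dot(x_row, w))
    del_w_loop += (y_row - out) * out * (1 - out) * x_row

# The same sum written as one matrix product
out = sigmoid(X @ w)
del_w_vec = ((y - out) * out * (1 - out)) @ X

# Averaged update of the weights
w_new = w + learnrate * del_w_vec / m

print(del_w_loop)
print(del_w_vec)
```

Both forms give the same weight step. The vectorized one simply avoids the Python loop.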

Our Activation Function remains sigmoid 

$f(h) = 1/(1+e^{-h})$ , and 


$f'(h) = f(h)(1 - f(h))$  so $f'(h) = 1/(1+e^{-h}) * (1 - 1/(1+e^{-h}))$

# Using numpy to implement

1. Initialize the weights
       - Must be near 0, so the sigmoid input is not squashed at the high or low end
       - Must be random, so use a normal distribution centered at 0
   
 Use a scale of $1/ \sqrt{n}$, where n is the number of inputs. This keeps the input to the sigmoid small even as the number of inputs grows. So:
 
```python
weights = np.random.normal(scale=1/n_features**.5, size=n_features)
```

 2. Use the np.dot function to compute the dot product, which gives the linear combination of w and x.
 
```python
# input to the output layer
output_in = np.dot(weights, inputs)
```
 
3. Split the data into training and testing sets, 90/10

```python
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.loc[sample], data.drop(sample)
```

4. Split into features and targets
```python
# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
```
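
To see why the $1/\sqrt{n}$ scale in step 1 helps, the sketch below estimates how widely the sigmoid input $h = \sum_i w_i x_i$ spreads for different numbers of inputs. It assumes standardized inputs drawn from a normal distribution, and `input_spread` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def input_spread(n_features, scale, n_trials=2000):
    # Standard deviation of h = w . x over many random draws,
    # assuming inputs with zero mean and unit variance
    x = rng.normal(size=(n_trials, n_features))
    w = rng.normal(scale=scale, size=(n_trials, n_features))
    return np.std(np.sum(w * x, axis=1))

for n in (6, 600):
    # Fixed scale of 1 versus the 1/sqrt(n) scale
    print(n, input_spread(n, scale=1.0), input_spread(n, scale=n ** -0.5))
```

With a fixed scale the spread of $h$ grows with the number of inputs, while the $1/\sqrt{n}$ scale keeps it near 1 under these assumptions.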

# Implementation

In [4]:
# Read Data
admissions = pd.read_csv('intro_to_neural_network_2.csv')

In [5]:
# Make dummy variables for rank (one-hot encoding); dtype=int keeps the new columns as 0/1
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank', dtype=int)], axis=1)
data = data.drop('rank', axis=1)

# Standardize features to zero mean and a standard deviation of 1
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = (data[field] - mean) / std
    
print(data[:10])

   admit       gre       gpa  rank_1  rank_2  rank_3  rank_4
0      0 -1.798011  0.578348       0       0       1       0
1      1  0.625884  0.736008       0       0       1       0
2      1  1.837832  1.603135       1       0       0       0
3      1  0.452749 -0.525269       0       0       0       1
4      0 -0.586063 -1.208461       0       0       0       1
5      1  1.491561 -1.024525       0       1       0       0
6      1 -0.239793 -1.077078       1       0       0       0
7      0 -1.624876 -0.814312       0       1       0       0
8      1 -0.412928  0.000263       0       0       1       0
9      0  0.972155  1.392922       0       1       0       0


In [6]:
# Split 90% to Training and 10% Testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.loc[sample], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

print(features[:5])

print ("\n Targets: \n")

print(targets[:5])

          gre       gpa  rank_1  rank_2  rank_3  rank_4
209 -0.066657  0.289305       0       1       0       0
280  0.625884  1.445476       0       1       0       0
33   1.837832  1.603135       0       0       1       0
210  1.318426 -0.131120       0       0       0       1
93  -0.066657 -1.208461       0       1       0       0

 Targets: 

209    0
280    0
33     1
210    0
93     0
Name: admit, dtype: int64


In [7]:
# Helper Functions

def sigmoid(x): # x is dot product of weights and inputs
    return 1/(1 + np.exp(-x))
    

In [10]:
# Gradient Descent

#Initialize the weights
n_records, n_features = features.shape
weights = np.random.normal(scale = 1/n_features**0.5, size = n_features)

# Training parameters
epochs = 1000
learnrate = 0.5
last_loss = None

for e in range(epochs):
    
    # Create a matrix matching shape of weights, this is delta w
    del_w = np.zeros(weights.shape)
    
    # Loop through all records, x is the input, y is the target
    
    # pandas.values removes any axis titles
    for x,y in zip(features.values, targets):
        
        # Calculate the output
        output = sigmoid(np.dot(x,weights))
        
        # Calculate the error
        error = (y - output)
        
        # Sigmoid Prime
        sigmoid_prime = (output)*(1-output)
        
        # Calculate the error term
        error_term = error * sigmoid_prime
        
        # Weight Change
        del_w += error_term * x
   
    weights += (del_w * learnrate)/n_records
    
    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        
        # Mean Square Error
        loss = np.mean((out - targets) ** 2)
        
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss
        
    
# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))  

Train loss:  0.27086665326589826
Train loss:  0.20672981036936744
Train loss:  0.20074550171822744
Train loss:  0.19871136118765292
Train loss:  0.1978686940782158
Train loss:  0.19746948362907232
Train loss:  0.19726166263589326
Train loss:  0.19714582811250664
Train loss:  0.19707792744606575
Train loss:  0.19703660044970267
Prediction accuracy: 0.725


# Multilayer Perceptrons

When we have multiple input units and multiple hidden units, the weights between them require two indices: $w_{ij}$, where $i$ denotes input units and $j$ denotes hidden units.

For example, the following image shows our network, with its input units labeled $x_1, x_2, x_3$ , and its hidden nodes labeled $h_1$ and $h_2$:

![a](Images/a20.png)

Now to index the weights, we take the input unit number for the i and the hidden unit number for the j. Now, the weights need to be stored in a matrix, indexed as $w_{ij}$. 

- Each row in the matrix will correspond to the weights leading out of a single input unit

- Each column will correspond to the weights leading in to a single hidden unit.

![a](Images/a21.png)

To initialize these weights in NumPy, we have to provide the shape of the matrix. If features is a 2D array containing the input data:

```python
# Number of records and input units
n_records, n_inputs = features.shape
# Number of hidden units
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))
```

So for each hidden layer unit, we need to calculate the following:

$h_{j} = \sum_{i} w_{ij}x_{i}$

In this case, we're multiplying the inputs (a row vector here) by the weights. To do this, you take the dot (inner) product of the inputs with each column in the weights matrix. For example, to calculate the input to the first hidden unit, $j = 1$, you'd take the dot product of the inputs with the first column of the weights matrix, like so:

![a](Images/a22.png)

We get this for the first hidden unit:

$h_{1} = x_{1}w_{11} + x_{2}w_{21} + x_{3}w_{31}$


In NumPy, you can do this for all the inputs and all the outputs at once using np.dot:

```python
hidden_inputs = np.dot(inputs, weights_input_to_hidden)
```
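
A short check that the single `np.dot` call matches the term-by-term sum for $h_1$. The input and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical inputs (3 units) and weights (3 inputs x 2 hidden units)
inputs = np.array([0.5, -0.2, 0.1])
weights_input_to_hidden = np.array([[0.1, 0.4],
                                    [0.2, 0.5],
                                    [0.3, 0.6]])

# h_1 written out term by term, using the first column of the matrix
h_1 = (inputs[0] * weights_input_to_hidden[0, 0]
       + inputs[1] * weights_input_to_hidden[1, 0]
       + inputs[2] * weights_input_to_hidden[2, 0])

# All hidden inputs at once
hidden_inputs = np.dot(inputs, weights_input_to_hidden)

print(h_1)
print(hidden_inputs)
```

The first entry of `hidden_inputs` is the same number as `h_1`. The second entry uses the second column of the weights matrix.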

###  Making a column vector

Sometimes you'll want a column vector, even though by default NumPy arrays work like row vectors. One example is when you need to dot a (2x3) matrix with a (1x3) row, which first has to become a (3x1) column. You can get the transpose of an array with arr.T, but for a 1D array the transpose returns the same 1D array unchanged. Instead, use arr[:,None] to create a column vector.

1. For a 1-D array, use arr[:,None]
2. For an array with two or more dimensions, use the transpose arr.T

In [20]:
# 1-D
one_D = np.array([0.5,0.2,0.3])
print(one_D)

# now we want a column; .T leaves a 1-D array unchanged
print(one_D.T)

#works with arr[:,None]
print(one_D[:,None])

two_D = np.array([[0.5,0.2,0.3],[0.1,2.3,1.3]])
print(two_D)
print(two_D.shape)
print(two_D.T)


[0.5 0.2 0.3]
[0.5 0.2 0.3]
[[0.5]
 [0.2]
 [0.3]]
[[0.5 0.2 0.3]
 [0.1 2.3 1.3]]
(2, 3)
[[0.5 0.1]
 [0.2 2.3]
 [0.3 1.3]]


Below, I'll implement a forward pass through a 4x3x2 network, with sigmoid activation functions for both layers.

In [46]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Network Size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)
print("Here are the input features:")
print(X,"\n")

# Define Random Weights np.random.normal(mean,std dev, dim)
weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
print("Here are the hidden layer input weights: [4x3]\n" , weights_input_to_hidden, "\n")
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

#Make a forward pass through the network

hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print('Hidden-layer Output: Sigmoid([1x4] Features dot [4x3] Weights = [1x3])')
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print("\nHere are the output layer input weights: [3x2]\n" , weights_hidden_to_output)

print('\nOutput-layer Output: Sigmoid([1x3] Features dot [3x2] Weights = [1x2])')
print(output_layer_out)

Here are the input features:
[ 0.49671415 -0.1382643   0.64768854  1.52302986] 

Here are the hidden layer input weights: [4x3]
 [[-0.02341534 -0.0234137   0.15792128]
 [ 0.07674347 -0.04694744  0.054256  ]
 [-0.04634177 -0.04657298  0.02419623]
 [-0.19132802 -0.17249178 -0.05622875]] 

Hidden-layer Output: Sigmoid([1x4] Features dot [4x3] Weights = [1x3])
[0.41492192 0.42604313 0.5002434 ]

Here are the output layer input weights: [3x2]
 [[-0.10128311  0.03142473]
 [-0.09080241 -0.14123037]
 [ 0.14656488 -0.02257763]]

Output-layer Output: Sigmoid([1x3] Features dot [3x2] Weights = [1x2])
[0.49815196 0.48539772]


# Backpropagation

Now we've come to the problem of how to make a multilayer neural network learn. Before, we saw how to update weights with gradient descent. The backpropagation algorithm is just an extension of that, using the chain rule to find the gradient of the error with respect to the weights connecting the input layer to the hidden layer (for a two layer network).

Since we know the error at the output, we can use the weights to work backwards to hidden layers. For example, in the output layer, you have errors $\delta^o_k$ attributed to each output unit k. Then, the error attributed to hidden unit $j$ is the output errors, scaled by the weights between the output and hidden layers (and the gradient):

Error term for hidden unit $j$: $\delta^h_j = \sum_k W_{jk} \delta^o_k f'(h_j)$

Then, the gradient descent step is the same as before, just with the new errors: $\Delta w_{ij} = \eta \delta^h_j x_i$

where $w_{ij}$ are the weights between the inputs and hidden layer and $x_i$ are input unit values. This form holds for however many layers there are. 

The weight steps are equal to the step size times the output error of the layer times the values of the inputs to that layer: $\Delta W_{pq} = \eta  \delta_{output}  V_{in}$

Here, you get the output error, $\delta_{output}$, by propagating the errors backwards from higher layers. The input values, $V_{in}$, are the inputs to the layer, for example the hidden layer activations going into the output unit.


This is way too hard to understand in the abstract, so let's work through an example.

# Example: Backpropagation

Let's walk through the steps of calculating the weight updates for a simple two layer network.

Suppose there are two input values, one hidden unit, and one output unit, with sigmoid activations on the hidden and output units.

![a](Images/a23.png)

1. First calculate the input to hidden layer:
$h=∑w_i x_i = 0.1×0.4−0.2×0.3=−0.02$


2. Calculate the output of hidden layer:
$a=f(h)=sigmoid(−0.02)=0.495$


3. Calculate the input to the output layer, using the hidden-to-output weight $W$:
$W \cdot a = 0.1×0.495= 0.0495$


4. Calculate the output of output layer:
$\hat{y} = sigmoid(0.0495)=0.512$


With the network output, we can start the backwards pass to calculate the weight updates for both layers. For the sigmoid function, we can use the fact that

$f'(W \cdot a) = f(W \cdot a) (1 - f(W \cdot a))$

so, with the target $y = 1$, the error term for the output unit is:

$δ_o =(y−\hat{y})f′(W⋅a)=(1−0.512)×0.512×(1−0.512)=0.122$

Now we need to calculate the error term for the hidden unit with backpropagation. Here we'll scale the error term from the output unit by the weight $W$ connecting it to the hidden unit.

$\delta^h_j = \sum_k W_{jk} \delta^o_k f'(h_j)$, so with a single hidden unit and a single output unit:

$\delta^h = W \delta^o f'(h) = 0.1×(0.122)×[0.495×(1−0.495)] = 0.003$


Now that we have the errors, we can calculate the gradient descent steps. The hidden to output weight step is the learning rate, times the output unit error, times the hidden unit activation value.

$ΔW = ηδ^oa = 0.5×0.122×0.495 = 0.0302$

Then, for the input to hidden weights $w_i$, it's the learning rate times the hidden unit error, times the input values.

$Δw_i = ηδ^hx_i = (0.5×0.003×0.1,0.5×0.003×0.3) = (0.00015,0.00045)$
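
The same walk-through can be written as a few lines of NumPy. The inputs, weights and target below are read off the arithmetic above (inputs 0.1 and 0.3, input-to-hidden weights 0.4 and -0.2, hidden-to-output weight 0.1, target 1, learning rate 0.5). The variable names are illustrative.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

x = np.array([0.1, 0.3])     # input values
w = np.array([0.4, -0.2])    # input-to-hidden weights
W = 0.1                      # hidden-to-output weight
y = 1.0                      # target
learnrate = 0.5

# Forward pass
a = sigmoid(np.dot(w, x))    # hidden unit activation
y_hat = sigmoid(W * a)       # network output

# Backward pass
delta_o = (y - y_hat) * y_hat * (1 - y_hat)   # output error term
delta_h = W * delta_o * a * (1 - a)           # hidden error term

# Weight steps
del_W = learnrate * delta_o * a
del_w = learnrate * delta_h * x

print(a, y_hat)
print(delta_o, delta_h)
print(del_W, del_w)
```

Because the code does not round intermediate results, the last digits can differ slightly from the hand calculation.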

## Vanishing Gradient
From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is 0.25, so the errors in the output layer get reduced by at least 75%, and errors in the hidden layer are scaled down by at least 93.75%! You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input. This is known as the vanishing gradient problem. Later in the course you'll learn about other activation functions that perform better in this regard and are more commonly used in modern network architectures.
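
A quick numerical look at the figures quoted above. The sketch scans the sigmoid derivative on a grid to find its maximum, then multiplies that bound once per layer. It only considers the factor from the sigmoid derivatives and ignores the weights.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    return sigmoid(h) * (1 - sigmoid(h))

# The derivative peaks at h = 0
h = np.linspace(-10, 10, 2001)
max_slope = sigmoid_prime(h).max()
print(max_slope)

# Upper bound on the shrink factor after passing through n sigmoid layers
for n_layers in (1, 2, 5, 10):
    print(n_layers, 0.25 ** n_layers)
```

One layer keeps at most 25% of the error (a reduction of at least 75%) and two layers keep at most 6.25% (a reduction of at least 93.75%), matching the percentages in the text.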

## Implementing in NumPy

Previously we were only dealing with error terms from one unit. Now, in the weight update, we have to consider the error for each unit in the hidden layer, $\delta_j$

$\Delta w_{ij} = \eta\delta_jx_{i}$

Steps: 
1. Calculate the network's output error
2. Calculate the output error term
3. Backpropagate to calculate the hidden layer error term
4. Calculate the weight changes that follow from backpropagation

In [58]:
x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

# Weights of input hidden are [3x2]
weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

# Forward pass
hidden_layer_input = np.dot(x,weights_input_hidden) #[1x3] dot [3x2] = [1x2]
print("hidden layer input:\n", hidden_layer_input)

hidden_layer_output = sigmoid(hidden_layer_input)
print("hidden layer output:\n", hidden_layer_output) 

# [1x2] dot [2] = scalar (one output unit)
output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
print("output layer input:\n", output_layer_in)

output = sigmoid(output_layer_in)
print("output layer output:\n", output)

## Backwards pass

#Calculate output error
error = target - output
print('error from output:')
print(error)

#Calculate error term for output layer
output_error_term = error * (output) * (1-output)
print('output error term:')
print(output_error_term)

#Calculate error term for hidden layer
hidden_error_term = weights_hidden_output * output_error_term * (hidden_layer_output) * (1-hidden_layer_output)
print('hidden error term:')
print(hidden_error_term)

#Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

#Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:,None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

hidden layer input:
 [ 0.24 -0.46]
hidden layer output:
 [0.55971365 0.38698582]
output layer input:
 -0.06012438223148006
output layer output:
 0.48497343084992534
error from output:
0.11502656915007464
output error term:
0.028730669543515018
hidden error term:
[ 0.00070802 -0.00204471]
Change in weights for hidden layer to output layer:
[0.00804047 0.00555918]
Change in weights for input layer to hidden layer:
[[ 1.77005547e-04 -5.11178506e-04]
 [ 3.54011093e-05 -1.02235701e-04]
 [-7.08022187e-05  2.04471402e-04]]
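
As a sanity check on the backward pass, the input-to-hidden weight steps can be compared with a finite-difference estimate of $-\eta \, \partial E / \partial w_{ij}$. The sketch assumes $E = \frac{1}{2}(y - \hat{y})^2$ for the single record used above. `half_squared_error`, `w_ih` and `w_ho` are hypothetical names introduced for this check.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def half_squared_error(w_ih, w_ho, x, target):
    # Forward pass followed by E = 1/2 (target - output)^2
    hidden = sigmoid(np.dot(x, w_ih))
    output = sigmoid(np.dot(hidden, w_ho))
    return 0.5 * (target - output) ** 2

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
w_ih = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])
w_ho = np.array([0.1, -0.3])

# Backpropagation, using the same formulas as the cell above
hidden = sigmoid(np.dot(x, w_ih))
output = sigmoid(np.dot(hidden, w_ho))
output_error_term = (target - output) * output * (1 - output)
hidden_error_term = w_ho * output_error_term * hidden * (1 - hidden)
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

# Finite-difference estimate of -learnrate * dE/dw for every input-to-hidden weight
eps = 1e-6
numeric = np.zeros_like(w_ih)
for i in range(w_ih.shape[0]):
    for j in range(w_ih.shape[1]):
        bump = np.zeros_like(w_ih)
        bump[i, j] = eps
        slope = (half_squared_error(w_ih + bump, w_ho, x, target)
                 - half_squared_error(w_ih - bump, w_ho, x, target)) / (2 * eps)
        numeric[i, j] = -learnrate * slope

print(np.allclose(delta_w_i_h, numeric, atol=1e-9))
```

If the two matrices agree, the hidden error term is propagating the output error correctly.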
