# Log-Loss vs Mean Squared Error

In [2]:
# Imports 
import numpy as np

There are many other error functions used for neural networks. Let me teach you another one, called the mean squared error. As the name says, this one is the mean of the squares of the differences between the predictions and the labels.

## Gradient Descent with Squared Errors

Goal: Make predictions as close as possible to the real values.

To measure how wrong the predictions are, we use an "error" metric. A common choice is the sum of the squared errors (SSE).

![a](Images/a1.png)

- where y_hat is the prediction and y is the true value.
- you take the sum over all output units j and another sum over all data points μ


- The variable j indexes the output units of the network. For each output unit, find the difference between the true value y and the predicted value from the network y_hat, then square the difference, then sum up all those squares.


- The sum over μ runs over all the data points. For each data point, you calculate the inner sum of the squared differences over the output units. Then you add up those inner sums across all data points. That gives you the overall error for all the output predictions for all the data points.
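As a small sketch of the double sum described above, the snippet below computes the sum of squared errors with NumPy. The arrays `y` and `y_hat` hold made-up values, with one row per data point μ and one column per output unit j, and the function name `sum_squared_error` is only illustrative. Any constant factor in front of the sum (such as 1/2) is left out here.

```python
import numpy as np

def sum_squared_error(y, y_hat):
    """Sum of squared differences over all data points and output units."""
    diff = y - y_hat
    # Inner sum over the output units j (axis 1) for each data point
    per_point = np.sum(diff ** 2, axis=1)
    # Outer sum over the data points mu
    return np.sum(per_point)

# Made-up example: 3 data points, 2 output units each
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_hat = np.array([[0.8, 0.1], [0.3, 0.7], [0.9, 0.6]])

sse = sum_squared_error(y, y_hat)
print(sse)
```

The per-point sums here are 0.05, 0.18, and 0.17, so the total comes to about 0.40.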

## Why Squared Errors (SSE)

1. The square ensures the error is never negative, so we only consider its magnitude.


Remember that the output of a neural network, the prediction, depends on weights.

![a](Images/a2.png)

Our goal is to find the weights W_ij that minimize the squared error E. To do this with a neural network, we use ***gradient descent***.

We take small steps each time, in the direction that reduces the error the most. We can find this direction by calculating the gradient of the squared error. Gradient is another term for rate of change or slope.

## Gradient 

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights.

Below I've plotted an example of the error of a neural network with two inputs, and accordingly, two weights. You can read this like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

At each step, you calculate the error and the gradient, then use those to determine how much to change each weight. Repeating this process will eventually find weights that are close to the minimum of the error function, the black dot in the middle.

![a](Images/a3.png)
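To make the step-by-step picture concrete, here is a minimal sketch of gradient descent on a made-up bowl-shaped error with two weights. The error function, the starting point, and the learning rate are all assumptions chosen for illustration; this is not the error surface shown in the plot.

```python
import numpy as np

def error(w):
    # Hypothetical error surface with its minimum at w = (1, -0.5)
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def gradient(w):
    # Partial derivatives of the error with respect to each weight
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([-2.0, 2.0])   # assumed starting weights
learnrate = 0.1             # assumed step size
errors = [error(w)]

for step in range(100):
    # Step against the gradient, i.e. downhill on the error surface
    w = w - learnrate * gradient(w)
    errors.append(error(w))

print(w)
print(errors[0], errors[-1])
```

With these settings the weights end up very close to the minimum at (1, -0.5), and the recorded error shrinks from its starting value toward zero.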

### Caveats / Local Mins

Since the weights will just go wherever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum, illustrated below. There are methods to avoid this, such as using [momentum](https://distill.pub/2017/momentum/).

![a](Images/a4.png)
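The effect of initialization can be sketched with a one-dimensional example. The error function below is made up for illustration: it has a global minimum near w = -1.30 and a shallower local minimum near w = 1.13. The helper `descend`, the learning rate, and the number of steps are likewise assumptions. Plain gradient descent (no momentum) is used.

```python
def error(w):
    # Hypothetical error with two minima
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learnrate=0.01, n_steps=1000):
    # Plain gradient descent from the given starting weight
    for _ in range(n_steps):
        w = w - learnrate * gradient(w)
    return w

w_left = descend(-2.0)   # starts on the side of the global minimum
w_right = descend(2.0)   # starts on the side of the local minimum

print(w_left, error(w_left))
print(w_right, error(w_right))
```

Both runs stop where the gradient is zero, but the run started at 2.0 settles in the local minimum, where the error is higher than at the point the other run finds.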


## The Math Explained

First we need some measure of how bad our predictions are. The simplest choice is the difference in output, $E = (y - \hat{y})$. We can also add a square to remove negatives and penalize outliers more, so $E = (y - \hat{y})^2$.


Next we need to sum up the error for all data records, denoted by the sum over μ. Also multiply by 1/2 to clean up the math later.

$E = (1/2) \sum_{μ}(y - \hat{y})^2$.
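As a quick check of why the factor of 1/2 helps, differentiating a single squared term with respect to the prediction brings down a 2 that cancels it:

$\frac{\partial}{\partial \hat{y}} \left[ \frac{1}{2} (y - \hat{y})^2 \right] = \frac{1}{2} \cdot 2 \, (y - \hat{y}) \cdot (-1) = -(y - \hat{y})$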

Next, substitute in the prediction as a function of the weights:

![a](Images/a5.png)

So for a single output: 

![a](Images/a6.png)

Next we find the negative of the gradient, which points in the direction in weight space where the error decreases fastest.

![a](Images/a7.png)

### Update Step

The update step for each weight parameter is given as follows:

![a](Images/a8.png)

We must use the chain rule to expand the partial derivative. For

![a](Images/a9.png)

The resulting expansion of the partial derivative of the error with respect to the weights is

![a](Images/a10.png)

$\hat{y}$ depends on the weights, so we take the partial derivative of $\hat{y}$ with respect to the weights.

![a](Images/a11.png)

The partial derivative of the sum $\sum_i w_i x_i$ with respect to $w_{i}$ is just $x_{i}$.

![a](Images/a12.png)

Putting it all together, we get

![a](Images/a15.png)

We can simply define an "ERROR TERM" as  $δ = (y - \hat{y}) * f'(h)$

and write our weight update as:

$w_{i}' = w_{i} + η * δ * x_{i}$
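One way to sanity-check this result is to compare the analytic gradient, $-\delta x_i$, with a finite-difference estimate of the same derivative. The sketch below assumes a sigmoid activation for f and uses made-up values for `x`, `y`, and `w`; `numerical_gradient` is a hypothetical helper written for this check, not a library function.

```python
import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

def sigmoid_prime(h):
    return sigmoid(h) * (1 - sigmoid(h))

def error(w, x, y):
    # E = 1/2 * (y - y_hat)^2 for a single data point
    return 0.5 * (y - sigmoid(np.dot(x, w))) ** 2

def numerical_gradient(w, x, y, eps=1e-6):
    # Central finite differences, one weight at a time
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (error(w + step, x, y) - error(w - step, x, y)) / (2 * eps)
    return grad

# Made-up inputs, target, and weights
x = np.array([0.5, -1.0, 2.0])
y = 1.0
w = np.array([0.1, 0.2, -0.3])

h = np.dot(x, w)
delta = (y - sigmoid(h)) * sigmoid_prime(h)   # error term
analytic_gradient = -delta * x                # dE/dw_i = -delta * x_i
numeric = numerical_gradient(w, x, y)

print(analytic_gradient)
print(numeric)
```

The two printed vectors should agree to many decimal places, which is why the update adds $η δ x_i$: it moves each weight against the gradient of the error.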

# Gradient Descent: The Code 

From before we saw that one weight update can be calculated as:

$\Delta w_i = \eta \, \delta x_i$ 

with the error term δ as

$\delta = (y - \hat y) f'(h) = (y - \hat y) f'(\sum w_i x_i)$

Remember, in the above equation $(y - \hat y)$ is the output error, and f'(h) refers to the derivative of the activation function, f(h). We'll call that derivative the output gradient.

Now I'll write this out in code for the case of only one output unit. We'll also be using the sigmoid as the activation function f(h).

So $f(h) = 1/(1 + e^{-h})$

$f'(h) = f(h) (1 - f(h))$

In [7]:
# Functions

# Activation Function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of Activation Function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

In [11]:
learnrate = 0.5
x = np.array([1,2,3,4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

# Gradient Descent Step:


# Calculate the node's linear combination of inputs and weights
h = np.dot(x,w)
# Note: not h = x * w, which multiplies element-wise without summing

# Calculate the output
y_hat  = sigmoid(h)

# Calculate the error of the output
error = y - y_hat

# Calculate the error term
error_term = error * sigmoid_prime(h)

# Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(y_hat)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)


Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
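A single step only nudges the weights. As a rough sketch of what happens when the same step is repeated, the loop below reuses the inputs, target, and starting weights from the cell above and applies the update many times. The number of repetitions, `n_steps`, is an assumption chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

n_steps = 50  # assumed number of repeated updates
for step in range(n_steps):
    h = np.dot(x, w)
    y_hat = sigmoid(h)
    # Error term: output error times the output gradient
    error_term = (y - y_hat) * sigmoid_prime(h)
    # Apply the weight change instead of only printing it
    w = w + learnrate * error_term * x

final_output = sigmoid(np.dot(x, w))
print(final_output)
```

After these repeated updates the network output ends up very close to the target of 0.5.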


# Implementing gradient descent

Okay, now we know how to update our weights:

$\Delta w_{ij} = \eta * \delta_j * x_i$

You've seen how to implement that for a single update, but how do we translate that code to calculate many weight updates so our network will learn?

As an example, I'm going to have you use gradient descent to train a network on graduate school admissions data (found at http://www.ats.ucla.edu/stat/data/binary.csv). This dataset has three input features: GRE score, GPA, and the rank of the undergraduate school (numbered 1 through 4). Institutions with rank 1 have the highest prestige, those with rank 4 have the lowest.

![a](Images/a16.png)

The goal here is to predict if a student will be admitted to a graduate program based on these features. For this, we'll use a network with one output layer with one unit. We'll use a sigmoid function for the output unit activation.

# Data Cleanup

You might think there will be three input units, but we actually need to transform the data first. The rank feature is categorical; the numbers don't encode any sort of relative values. Rank 2 is not twice as much as rank 1, and rank 3 is not 1.5 times rank 2. Instead, we need to use dummy variables to encode rank, splitting the data into four new columns encoded with ones or zeros. Rows with rank 1 have one in the rank 1 dummy column, and zeros in all other columns. Rows with rank 2 have one in the rank 2 dummy column, and zeros in all other columns. And so on.
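The dummy-variable encoding can be sketched with plain NumPy. The `rank` array below holds made-up values for five applicants, and the variable names are illustrative; the actual data_prep.py may do this differently.

```python
import numpy as np

# Made-up rank values for five applicants (1 = highest prestige, 4 = lowest)
rank = np.array([3, 1, 4, 2, 1])

# One dummy column per possible rank value
rank_values = np.array([1, 2, 3, 4])

# Compare every row against every rank value: 1 where they match, else 0
rank_dummies = (rank[:, None] == rank_values[None, :]).astype(int)

print(rank_dummies)
```

Each row of `rank_dummies` contains exactly one 1, in the column for that applicant's rank. The first applicant, with rank 3, becomes `[0 0 1 0]`.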

We'll also need to standardize the GRE and GPA data, which means scaling the values so that they have zero mean and a standard deviation of 1. This is necessary because the sigmoid function squashes very negative and very large inputs. The gradient of the sigmoid at such inputs is close to zero, which means that the gradient descent step will go to zero too. Since the GRE and GPA values are fairly large, we have to be really careful about how we initialize the weights or the gradient descent steps will die off and the network won't train. Instead, if we standardize the data, we can initialize the weights easily and everyone is happy.
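Here is a minimal sketch of standardization and of why it matters for the sigmoid. The GRE scores and the weight of 0.1 are made-up values used only to show the effect.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Made-up GRE scores
gre = np.array([380.0, 660.0, 800.0, 640.0, 520.0])

# Standardize: subtract the mean, divide by the standard deviation
gre_std = (gre - gre.mean()) / gre.std()

weight = 0.1  # assumed initial weight
raw_grad = sigmoid_prime(weight * gre)      # sigmoid is saturated here
std_grad = sigmoid_prime(weight * gre_std)  # sigmoid is near its steepest

print(gre_std.mean(), gre_std.std())
print(raw_grad)
print(std_grad)
```

With the raw scores, the sigmoid gradient is effectively zero for every record, so the weight updates vanish. With the standardized scores, the gradient stays near its maximum of 0.25.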

This is just a brief run-through; you'll learn more about preparing data later. If you're interested in how I did this, check out the data_prep.py file in the programming exercise below.

Now that the data is ready, we see that there are six input features: gre, gpa, and the four rank dummy variables.
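To tie the update rule to many records, here is a rough sketch of a training loop for a single sigmoid output unit with six inputs. It does not load the admissions data: the features and targets are synthetic, generated from hypothetical "true" weights, and the learning rate, number of epochs, weight initialization, and averaging of the update over records are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(42)

# Synthetic stand-in for the prepared data: 200 records, 6 features
n_records, n_features = 200, 6
features = rng.normal(size=(n_records, n_features))
true_w = rng.normal(size=n_features)  # hypothetical generating weights
targets = (sigmoid(features @ true_w) > 0.5).astype(float)

# Small random initial weights (assumed initialization scheme)
weights = rng.normal(scale=1 / n_features ** 0.5, size=n_features)
learnrate = 0.5
epochs = 200

def mse(w):
    # Mean squared error over all records
    return np.mean((targets - sigmoid(features @ w)) ** 2)

initial_loss = mse(weights)

for e in range(epochs):
    del_w = np.zeros(n_features)
    for x, y in zip(features, targets):
        output = sigmoid(np.dot(x, weights))
        error = y - output
        # Error term: error times the sigmoid derivative, output * (1 - output)
        error_term = error * output * (1 - output)
        # Accumulate the weight step over all records
        del_w += error_term * x
    # Apply the averaged step once per pass through the data
    weights += learnrate * del_w / n_records

final_loss = mse(weights)
print(initial_loss, final_loss)
```

The inner loop is the single-record update from earlier, summed over the data set; the outer loop repeats that pass so the error keeps shrinking.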

In [None]:
![a](Images/a15.png)