<a href="https://colab.research.google.com/github/carlosfmorenog/CMM536/blob/master/CMM536_Topic_7/CMM536_T7_Lec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic 7 - Neural Networks
![So easy!](https://www.dropbox.com/s/u7fvqjkpm64z4a9/nn101.jpg?raw=1)

## Aims of the Session

* Learn the basics of neural networks

![Not so fast!](https://www.dropbox.com/s/885mxca7i87rpo7/ML_maths.jpg?raw=1)

## Resources for the Lecture

### Online Courses

* [Deep Learning Specialization by Andrew Ng (Coursera)](https://es.coursera.org/specializations/deep-learning)

## What is a Neural Network (NN)?

* Supervised learning algorithm, used for both classification and regression

* **INSPIRED** by how the brain works

* Its basis traces back to the 1950s

* To understand what an NN does, we need to first understand what a `neuron` does!

* `Neuron`: The most basic component of an NN.

* Suppose that you have collected data on the sizes of different houses and their prices. You can train a neuron to predict the price of new houses.

![Fig. 1. Example of a neuron in a size-price prediction scenario](https://www.dropbox.com/s/zw5ttz6ci7r8age/neuron.jpg?raw=1)

### How does a neuron get trained?

* A neuron is nothing more than a mathematical function!

* If using `linear regression`, you would consider a basic linear function

        y = mx + b

* However, neurons tend to be implemented using other mathematical functions

* For the house price problem, we will set our neuron to use the `Rectified Linear Unit` (ReLU) function, which looks something like this:

![Fig. 2. Example of the ReLU function](https://www.dropbox.com/s/8o8gt3xjksymp8e/relu.jpg?raw=1)

* We use functions like this in our NNs because they are `activation functions` (more on that later)

* Therefore, our house pricing prediction model would look something like this:

![Fig. 3. Fitting a ReLU function to predict house prices](https://www.dropbox.com/s/r3i8kmwv36lqc5j/housepred.jpg?raw=1)
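* The single neuron described above can be sketched in a few lines of Python. This is only an illustration: the function names `relu` and `neuron`, and the values of `w` and `b`, are made up for this example and are not learnt from any real data.

```python
def relu(z):
    """Rectified Linear Unit: returns z when it is positive, otherwise 0."""
    return max(0.0, z)


def neuron(size, w, b):
    """A single neuron: a linear function of the input passed through ReLU."""
    return relu(w * size + b)


# Hypothetical parameters (assumed for illustration, not trained values)
w, b = 2.0, -50.0

for size in (10, 100):
    # Small houses give a negative w * size + b, which ReLU clips to 0
    print(size, neuron(size, w, b))
```

* Notice how ReLU stops the predicted price from ever being negative, which is the flat part of the curve in the figure.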

### Multiple neurons working together

* What would happen if we want to consider more features besides the size?
    * Number of rooms, zip code, wealth, etc.

* We can train a neuron for each feature, but to better understand the relations between them we can interconnect more than one input into different neurons

![Fig. 4. A neural network to predict price based on different variables](https://www.dropbox.com/s/kc3etty14m0xtdd/houseprednn.jpg?raw=1)

* Notice that our NN model is composed of three `layers`:

* `Input`: The features to consider

* `Hidden unit`: The neurons that will fit the functions to predict

* `Output`: The predicted result

* The key to understanding how NNs work is in the **hidden unit**

* Notice that in this case, **all inputs** are connected to all neurons in the first hidden layer (this is also known as a `fully connected NN`) 

* Then, all neurons connect to a single neuron (second hidden layer) which produces the output!

* The art of NNs is to discover the optimal configuration by means of high computational power

* In fact, NNs could potentially find second order relations between the features!

![Fig. 5. Second order relations between features](https://www.dropbox.com/s/ybz9llj5yn6sl92/houseprednn2ndorder.jpg?raw=1)

* Notice that in this new example, not all features are connected to all neurons, i.e. this network is `not fully connected` (which is more often the case!)
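* As a rough sketch of the fully connected network of Fig. 4, the code below passes four input features through a hidden layer of three neurons and then through a single output neuron. All names (`dense_layer`, `hidden_w`, `out_w`, ...) and all numbers are assumptions made for illustration; a trained network would learn its own weights.

```python
def relu(z):
    """Rectified Linear Unit activation."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases):
    """Fully connected layer: every neuron receives every input."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of all inputs plus the bias, then the activation
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(relu(z))
    return outputs


# Hypothetical, already-normalised features: size, rooms, zip code, wealth
x = [1.0, 2.0, 0.5, 1.5]

# Hidden layer: 3 neurons with 4 weights each (made-up values)
hidden_w = [[0.5, 0.1, 0.0, 0.2],
            [0.0, 0.3, 0.4, 0.0],
            [-0.2, 0.0, 0.1, 0.6]]
hidden_b = [0.0, 0.1, -0.1]

# Output layer: 1 neuron with 3 weights (made-up values)
out_w = [[1.0, 2.0, 0.5]]
out_b = [0.0]

hidden = dense_layer(x, hidden_w, hidden_b)
price = dense_layer(hidden, out_w, out_b)[0]
print(hidden, price)
```

* In this sketch, a weight of exactly `0.0` has the same effect as a missing connection, which is one simple way to picture a network that is not fully connected.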

### Why is this concept so trendy?

* Lots of available **LABELLED** data

![Fig. 6 a. Why NNs are so popular from the data perspective?](https://www.dropbox.com/s/o1wazkneh44v8vj/whynns.jpg?raw=1)

* Lots of computational power:
    * You can try different variables, number of hidden layers, activation functions, etc...

![Fig. 6 b. Why NNs are so popular from the computation perspective?](https://www.dropbox.com/s/xt928ukqwy5egi4/whynns2.jpg?raw=1)

## Neural Network Basics

### Binary Classification with NNs

* As we saw in the last example, a single neuron can simulate a linear regression by means of the ReLU function

* Now we are going to see how an NN can produce a **logistic regression**

* We will use colour images as input to predict whether or not they contain a cat!

### Logistic Regression using NNs

* A logistic regression calculates the probability that an input belongs to class $1$ (rather than class $0$)

* A logistic regression requires the following parameters:
    * An input feature vector $x \in \mathbb{R}^{n_x}$ where $n_x$ is the number of features
    * Training labels $y \in \{0,1\}$
    * A **weights** vector $w \in \mathbb{R}^{n_x}$
    * A **bias** $b \in \mathbb{R}$
    * The predicted output:
        * For instance, $\hat{y} = \sigma(w^{T}x + b)$ where $T$ means *transpose*

* Since we need to predict the **probability** of a sample being 0 or 1, we need to set the `activation function` for the predicted output to something more suitable!

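* Common activation functions include `ReLU`, `tanh` and `sigmoid`; of these, only `sigmoid` outputs values between 0 and 1, so it is the one that can be read as a probability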

* In this lecture we will focus on the **sigmoid function**

![Fig. 7. The sigmoid function](https://www.dropbox.com/s/etk4vkxzc54nw7c/sigmoid.jpg?raw=1)

* Notice that by default, the sigmoid function is bounded between 0 and 1

* Some observations of the sigmoid function:
    * $z$ stands for the whole linear function, i.e. $z = w^{T}x + b$, therefore it carries the weights and the bias **of the output** (i.e. final hidden layer)
    * If $z$ is a large positive number, then $\sigma(z) \approx 1$
    * If $z$ is a large negative number, then $\sigma(z) \approx 0$
    * If $z = 0$, then $\sigma(z) = 0.5$
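* The observations above can be checked with a minimal sketch of the sigmoid function and of the prediction $\hat{y} = \sigma(w^{T}x + b)$. The function names and the example weights are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Sigmoid activation, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(x, w, b):
    """Logistic regression output: y_hat = sigmoid(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


print(sigmoid(0))      # exactly 0.5
print(sigmoid(10))     # very close to 1
print(sigmoid(-10))    # very close to 0

# Hypothetical feature vector, weights and bias
print(predict([1.0, 2.0], [0.5, -0.25], 0.1))
```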

### Loss function

* You want to produce a $\hat{y}$ which is as similar as possible to $y$!

* The loss (or error) function helps you find how far you are from the actual values

* You could use the **mean square error** between both outputs: $L(y,\hat{y})=(y-\hat{y})^2$

* This is not convenient in NNs because the resulting cost function becomes **non-convex** (we'll see what this means in a minute!)

* A more popular function is:
    $L(y,\hat{y})=-\left(y\log\hat{y}+(1-y)\log(1-\hat{y})\right)$

* Remember you want the result of the loss function to be **as small as possible**
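* A minimal sketch of this loss in Python is shown below. It uses the standard form with a leading minus sign, so that a better prediction gives a smaller value. The function name and the small constant `eps` (used to avoid computing $\log(0)$) are assumptions added for this example.

```python
import math


def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Loss for one sample: -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    # Keep y_hat strictly inside (0, 1) so that log() is always defined
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# The true label is 1 in both cases
print(cross_entropy_loss(1, 0.9))   # good prediction, small loss
print(cross_entropy_loss(1, 0.1))   # bad prediction, large loss
```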

### Cost function

* Even when you use the sigmoid function, you still need to learn the parameters $w$ and $b$. **Why?**

* The cost function depends on the loss function
    * This means that, in order to find the best weights/bias of the NN, we need to know how good the current output $\hat{y}$ is; this information is then used to iteratively improve our existing NN!

* The cost function is defined as $J(w,b) = \frac{1}{m}\sum_{i=1}^{m}L(y^{(i)},\hat{y}^{(i)})$ where:
    * $m$ is the number of training samples
    * $i$ indexes the training data, so $y^{(i)}$ and $\hat{y}^{(i)}$ are the label and the prediction of the $i^{th}$ sample
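* The cost is simply the average of the per-sample losses, which can be sketched as follows. The function names and the example labels/predictions are made up for illustration, and the loss is written with its usual leading minus sign.

```python
import math


def loss(y, y_hat):
    """Cross-entropy loss for a single sample (assumes 0 < y_hat < 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def cost(ys, y_hats):
    """Cost J: the average loss over all m training samples."""
    m = len(ys)
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / m


# Hypothetical labels and predictions for m = 4 samples
ys = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.4, 0.6, 0.3, 0.7]

print(cost(ys, good))   # lower cost
print(cost(ys, bad))    # higher cost
```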

### Gradient Descent

* You want to find the values of $w$ and $b$ that minimise $J(w,b)$

* If you plot $w$, $b$ and the cost $J(w,b)$ you will obtain something like this:

![Fig. 8. The search space for the values of weight/bias that reduce the cost function](https://www.dropbox.com/s/dt73inxq4i5jkjn/wbcost.jpg?raw=1)

* The previous example was **convex**, which means that it has only **one minimum** (the global minimum), with no other local minima to get stuck in

* This is preferable, as you want a single optimal weight/bias combination, and that's why you **don't use mean square error for the loss!**

* Remember, mean square error combined with the sigmoid function produces a non-convex cost function

#### So what is gradient descent?

* An iterative algorithm that starts at a random configuration of weight/bias and starts taking steps down (hopefully!) towards convergence (i.e. the combination of weight/bias that outputs the minimum value of the cost function)

![Fig. 9 a. Gradient descent](https://www.dropbox.com/s/lyfzx5oruyez6y3/gd.jpg?raw=1)

* where:
    * $\alpha$ is the `learning rate`
    * $\frac{dJ(w)}{dw}$ is the slope of the cost with respect to the weight (and yes, it is a derivative!); multiplied by $\alpha$, it gives the update in the weight value

![Fig. 9 b. Gradient descent explained](https://www.dropbox.com/s/cam9xjqydt6twcf/gd2.jpg?raw=1)

#### SO WHAT HAPPENED TO THE BIAS?

* You also update it! In fact, the real equations are $w:=w-\alpha\frac{dJ(w,b)}{dw}$ and $b:=b-\alpha\frac{dJ(w,b)}{db}$
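* The two update rules can be sketched on a toy problem. The cost used here, $J(w,b) = (w-3)^2 + (b+1)^2$, is **not** the logistic regression cost; it is a made-up convex function whose minimum is known to be at $w=3$, $b=-1$, so it is easy to see whether the steps go in the right direction. The function name `gradient_descent` and its parameters are assumptions for this example.

```python
def gradient_descent(dJ_dw, dJ_db, w, b, alpha=0.1, steps=100):
    """Repeatedly apply w := w - alpha*dJ/dw and b := b - alpha*dJ/db."""
    for _ in range(steps):
        # Evaluate both derivatives at the current point before updating
        grad_w = dJ_dw(w, b)
        grad_b = dJ_db(w, b)
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return w, b


# Derivatives of the toy cost J(w, b) = (w - 3)^2 + (b + 1)^2
def dJ_dw(w, b):
    return 2 * (w - 3)


def dJ_db(w, b):
    return 2 * (b + 1)


w, b = gradient_descent(dJ_dw, dJ_db, w=0.0, b=0.0)
print(w, b)   # should end up very close to 3 and -1
```

* Try changing `alpha`: a very small learning rate needs many more steps, while a very large one can overshoot the minimum.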

### Putting all together (or at least trying to)

* With a single training example with two features $x_1$ and $x_2$, and with $a = \hat{y} = \sigma(z)$ as the output of the neuron, training an NN looks like this:

![Fig. 10. Recap of how to implement logistic regression with NNs.](https://www.dropbox.com/s/34rpu9kptmm73al/recap.jpg?raw=1)

* When you calculate $L(a,y)$ and work backwards through its derivatives to update the weight/bias, it is referred to as `backpropagation`!

* In contrast, when you compute the input to a neuron from the outputs of its predecessor neurons as a **weighted sum**, this is referred to as `forward propagation`

* When doing it for all $m$ examples, the cost is the average of all individual losses!

* Therefore, you need to add up the derivatives of every example and divide by $m$
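* The whole recap can be sketched as a short training loop for two features. Some assumptions to keep in mind: the data is a tiny made-up dataset (the logical AND of two inputs), the parameters start at zero instead of at random values so that the result is repeatable, and the code relies on the known result that, for a sigmoid output with the cross-entropy loss, the derivative of the loss with respect to $z$ is $a - y$. The names `train`, `alpha` and `iterations` are chosen for this example only.

```python
import math


def sigmoid(z):
    """Sigmoid activation, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, Y, alpha=0.5, iterations=2000):
    """Logistic regression with two features trained by gradient descent."""
    w1, w2, b = 0.0, 0.0, 0.0   # zeros instead of random, for repeatability
    m = len(X)
    for _ in range(iterations):
        dw1 = dw2 = db = 0.0
        for (x1, x2), y in zip(X, Y):
            # Forward propagation: weighted sum, then activation
            z = w1 * x1 + w2 * x2 + b
            a = sigmoid(z)
            # Backpropagation: dL/dz = a - y for sigmoid + cross-entropy
            dz = a - y
            dw1 += x1 * dz
            dw2 += x2 * dz
            db += dz
        # Average the accumulated derivatives and take one step
        w1 -= alpha * dw1 / m
        w2 -= alpha * dw2 / m
        b -= alpha * db / m
    return w1, w2, b


# Made-up dataset: the label is 1 only when both features are 1
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]

w1, w2, b = train(X, Y)
predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for x1, x2 in X]
print(predictions)
```

* Here the derivatives were worked out by hand and typed in as formulas, which is manageable for one neuron but quickly becomes tedious for larger networks.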

**DO WE ACTUALLY NEED TO CALCULATE DERIVATIVES IN PYTHON?**

* How many NN models exist?

![Fig. 11. Different NNs.](https://www.dropbox.com/s/dmsz8i5krh6cy8t/nnmodels.jpg?raw=1)

# LAB: TRAINING A LOGISTIC REGRESSION NEURAL NETWORK FROM SCRATCH