# Recap: What we've been doing


`1-Tensors-in-PyTorch.ipynb:` introduces a tensor representation in PyTorch

`2-Neural-networks-in-PyTorch.ipynb:` introduces a basic framework for defining neural networks via the `nn` module


# Training Neural Networks

In this notebook we will explore how to train neural networks.

<span style="color:orange"> **IDEA:** We can think of neural networks as universal function approximators. </span>

Consider the example below. In the middle there is some function, _F(x)_, that maps the input (images of handwritten digits) to the output (probabilities for the different class labels). For instance, if we pass an image of the digit 4 to the network, we would expect to obtain a probability distribution that assigns a high probability to the label 4. The magic of neural networks is that, with non-linear activations, we can train them to approximate this function _F(x)_ successfully.

<img src="assets/function_approx.png" width=600px>


<span style="color:blue"> **GOAL:** We want to train the network by showing it lots of examples of digits and adjusting the weight parameters so that the network approximates this function well.  </span>


So how do we do that? To find the optimal weight parameters, we need to know how well our network is predicting the real outputs. To measure this, we calculate a **loss function** (also called a cost or objective function), which quantifies our prediction error.

There are several different types of loss functions, but one of the most widely used is the **mean squared error** (MSE). MSE is often used in regression and binary classification problems. The formula for MSE is:

$$
\large \ell = \frac{1}{2n}\sum_{i=1}^{n}{\left(y_i - \hat{y}_i\right)^2}
$$

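where $n$ is the number of training examples, $y_i$ are the true labels, and $\hat{y}_i$ are the predicted labels. The factor of $\frac{1}{2}$ is a common convention: it cancels the 2 that appears when the squared term is differentiated.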
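To make the formula concrete, here is a minimal sketch in plain Python (deliberately not PyTorch, so the arithmetic stays visible). The function name `mse_loss` and the sample values are illustrative assumptions, not part of the notebook's code.

```python
def mse_loss(y_true, y_pred):
    """Loss from the formula above: sum of squared errors divided by 2n."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    squared_errors = ((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return sum(squared_errors) / (2 * n)


# Hypothetical labels and predictions, chosen only for illustration.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.5, 0.0, 1.0]

loss = mse_loss(y_true, y_pred)
print(loss)
```

Only the first prediction is off (by 0.5), so the sum of squared errors is 0.25 and the loss is 0.25 divided by 2n = 6. A perfect prediction would give a loss of exactly zero.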


<span style="color:orange"> **IDEA:** We can adjust the weight parameters such that this loss is minimized. </span> Once the loss is minimized, we know that our network is making the best predictions it can.

<span style="color:green"> **CONCEPT:** We find the minimum loss using a process called **gradient descent**. </span> The gradient is the slope of the loss function with respect to the weight parameters. It always points in the direction of steepest increase. For instance, consider the picture of the mountain below:

<img src='assets/gradient_descent.png' width=350px>

In this picture, the gradient is always going to point up the mountain. Imagine that our loss function is approximated by this mountain where we have the highest loss at the peak of the mountain and the lowest loss down in the valley. Therefore, if we want to minimize the loss, we have to go downwards and follow the direction of the negative gradient. You can think of this like descending a mountain by following the steepest slope to the base.
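The descent described above can be sketched for the simplest possible case: a model with a single weight, $\hat{y} = w x$, trained with the $\frac{1}{2n}$ squared-error loss. This is a hand-written illustration in plain Python under assumed toy data; the names `loss_and_grad` and `learning_rate`, and the step size of 0.1, are hypothetical choices rather than anything the notebook prescribes.

```python
def loss_and_grad(w, xs, ys):
    """Return the 1/(2n) squared-error loss and its derivative with respect to w."""
    n = len(xs)
    residuals = [y - w * x for x, y in zip(xs, ys)]
    loss = sum(r ** 2 for r in residuals) / (2 * n)
    # d(loss)/dw = -(1/n) * sum((y_i - w * x_i) * x_i)
    grad = -sum(r * x for r, x in zip(residuals, xs)) / n
    return loss, grad


# Toy data generated by y = 2x, so the best weight is w = 2 (assumed example).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # arbitrary starting point on the "mountain"
learning_rate = 0.1   # step size, an assumed value that works for this data
losses = []

for step in range(50):
    loss, grad = loss_and_grad(w, xs, ys)
    losses.append(loss)
    # Step in the direction of the negative gradient, i.e. downhill.
    w = w - learning_rate * grad

print(w, losses[0], losses[-1])
```

Each update moves `w` against the gradient, so the loss shrinks from step to step and `w` approaches 2. With a learning rate that is too large, the steps would overshoot the valley instead of settling into it.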

## Backpropagation 

