✅ Status:  done

## 📓 Exercise 17

---

Welcome to exercise 17, in which we will talk about `gradient descent`. Again, we start from a theoretical perspective, and then we implement `linear regression` from scratch and find its optimal parameters using gradient descent.

### 🏷 GD: theory

---

> Fundamental information

Let's start with a definition:
> `Gradient descent` is simply a way to find optimal parameters for a given model

Now, how is this way defined? Essentially, by the following formula:

$
\theta_{new} = \theta_{old} - \alpha \nabla L_m
$

The big picture of this formula is as follows: we compute the gradient vector of the loss function (recall that the gradient vector of a function points in the direction of its steepest growth), reverse its direction using the minus sign so that it points in the direction of steepest descent, scale it by a learning rate parameter $\alpha$, and finally use the resulting vector to update the model's parameters. If we break down the formula:
- $\theta_{new}$ and $\theta_{old}$ are the model's parameters after and before making the step in the downward direction (downward in the sense of the loss, since we want to minimize it)
- $\alpha$ is the learning rate, which determines how big a step you take. It is a hyper-parameter, so you need to find a good value by trial and error or grid search. If the steps are too big, you will likely jump over the (local) minimum, whereas if the steps are too small, convergence is slow: you might not reach the optimal set of parameters within your iteration budget, or you might get stuck in a local minimum.
- $\nabla L_m$ is the gradient vector of the `loss function` of the given model, evaluated at $\theta_{old}$. Below, we discuss how to compute it for different models.

Gradient descent is an `iterative algorithm`, which means you apply the above formula several times to update the parameters. We will discuss this more in the section on training algorithms using GD.
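To make the update rule concrete, here is a minimal sketch of the iteration on a toy loss. The function names `gradient_descent` and `toy_grad`, the toy loss $L(\theta) = \lVert\theta - \text{target}\rVert^2$, and the chosen learning rates are all assumptions made for illustration; they are not part of the course material.

```python
import numpy as np


def gradient_descent(grad_fn, theta_init, alpha, n_steps):
    """Repeatedly apply theta_new = theta_old - alpha * grad(theta_old)."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(n_steps):
        theta = theta - alpha * grad_fn(theta)
    return theta


# Toy loss (illustrative assumption): L(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target).
target = np.array([3.0, -1.0])


def toy_grad(theta):
    return 2.0 * (theta - target)


theta_good = gradient_descent(toy_grad, [0.0, 0.0], alpha=0.1, n_steps=100)
theta_slow = gradient_descent(toy_grad, [0.0, 0.0], alpha=0.001, n_steps=100)
theta_big = gradient_descent(toy_grad, [0.0, 0.0], alpha=1.1, n_steps=100)

print("alpha=0.1  :", theta_good)  # ends up very close to the target
print("alpha=0.001:", theta_slow)  # still far from the target after 100 steps
print("alpha=1.1  :", theta_big)   # overshoots further on every step
```

On this toy problem, each step multiplies the distance to the minimum by $|1 - 2\alpha|$, which is why a very small $\alpha$ barely moves and $\alpha > 1$ diverges.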

> Computing the gradient vector of a loss function

We have encountered several parametric models so far. Each of these models has a certain `loss` function, which depends on the model itself as well as on whether we are doing regression or classification, since the two tasks use different loss functions (usually `MSE` for regression and `Cross-entropy` for classification). Let's make a little summary:
- `Linear regression`: here there is a closed formula for the gradient vector of the `MSE` of LR:

$
\nabla L_{LinR} = \frac{2}{n}X^T(X\theta - y)
$

where $X$ is the design matrix (a.k.a. the matrix with training data), $\theta$ is the parameter vector, $y$ is the vector of targets and $n$ is the number of samples. Note that this is a vectorized formula for an arbitrary number of features. Below, I will show how you can derive this formula for just a single feature.
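The vectorized formula translates almost literally into NumPy. The sketch below uses assumed helper names (`mse_loss`, `mse_gradient`) and random data, and compares the formula against a numerical gradient obtained by central finite differences.

```python
import numpy as np


def mse_loss(X, y, theta):
    """Mean squared error of the linear model X @ theta."""
    residual = X @ theta - y
    return np.mean(residual ** 2)


def mse_gradient(X, y, theta):
    """Vectorized gradient: (2 / n) * X^T (X theta - y)."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - y)


# Random data, purely for illustration.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # column of ones = intercept
y = rng.normal(size=50)
theta = rng.normal(size=4)

# Numerical gradient: perturb one parameter at a time (central differences).
eps = 1e-6
numeric = np.array([
    (mse_loss(X, y, theta + eps * e) - mse_loss(X, y, theta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.allclose(mse_gradient(X, y, theta), numeric, atol=1e-5))
```

A finite-difference check like this is a cheap way to catch sign or scaling mistakes before plugging a gradient into the update rule.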

- `Logistic/Softmax regression`: again, we have a closed formula for the gradient vector of the `Cross-entropy` loss function:

$
\nabla L_{LogR}= X^T(h_{\theta}(X) - y)
$

where $h_{\theta}$ is the logistic/softmax model that takes a sample $x$ as input and outputs a $1 \times k$ vector with the probabilities of each of the $k$ classes. Therefore, if you feed the whole matrix $X$ with all $n$ samples into this function, you get back an $n \times k$ matrix. $y$ is then an $n \times k$ matrix in which each row has $1$ at the position of the true class and $0$ everywhere else. In other words, $y$ is one-hot encoded. Note that this is the gradient of the cross-entropy summed over all samples; for the mean cross-entropy, divide by $n$.
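The same formula can be sketched in NumPy for softmax regression. The helper names (`softmax`, `one_hot`, `cross_entropy`, `cross_entropy_gradient`) and the random data are assumptions for illustration. The loss is assumed to be the cross-entropy summed over samples, which is the convention under which the formula has no $\frac{1}{n}$ factor.

```python
import numpy as np


def softmax(Z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)


def one_hot(labels, k):
    """Turn integer labels of shape (n,) into an n x k one-hot matrix."""
    Y = np.zeros((labels.shape[0], k))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y


def cross_entropy(X, Y, Theta):
    """Cross-entropy summed over all samples (assumed convention)."""
    return -np.sum(Y * np.log(softmax(X @ Theta)))


def cross_entropy_gradient(X, Y, Theta):
    """X^T (h_theta(X) - y): a d x k matrix, one column per class."""
    return X.T @ (softmax(X @ Theta) - Y)


# Random data, purely for illustration: n=20 samples, d=3 features, k=4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = one_hot(rng.integers(0, 4, size=20), k=4)
Theta = rng.normal(size=(3, 4))

grad = cross_entropy_gradient(X, Y, Theta)

# Numerical check with central finite differences, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(Theta)
for i in range(Theta.shape[0]):
    for j in range(Theta.shape[1]):
        step = np.zeros_like(Theta)
        step[i, j] = eps
        numeric[i, j] = (cross_entropy(X, Y, Theta + step)
                         - cross_entropy(X, Y, Theta - step)) / (2 * eps)

print(np.allclose(grad, numeric, atol=1e-4))
```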

- `Feed forward neural network`: here there is no closed formula, so we use a technique called `backpropagation` to obtain the gradient vector; more on this in the next exercise session. If you want to learn about it now, I suggest you check section 4.2 of this [note](https://github.com/ludekcizinsky/glass-forensic-analysis/blob/main/ml_report.pdf).

For the sake of completeness, let's see how these closed formulas are derived using a simple example: `linear regression` with a single feature, where $h_{\theta}(x_i) = \theta_1 x_i + \theta_0$. Its loss function looks as follows:

$
J(\theta_0,\theta_1) =  \frac{1}{n} \sum_{i=1}^n(  h_{\theta}(x_i) - y_i )^2
$

Then, to obtain the gradient vector, we need to find the partial derivatives with respect to both parameters:

$
\frac{\partial J}{\partial \theta_0}(\theta_0, \theta_1) = \frac{2}{n}\sum_{i = 1}^{n}(x_i \theta_1 + \theta_0 - y_i )
$

$
\frac{\partial J}{\partial \theta_1}(\theta_0, \theta_1) = \frac{2}{n}\sum_{i = 1}^{n}x_i(x_i \theta_1 + \theta_0 - y_i )
$

If you then vectorize these two expressions (with $X$ consisting of a column of ones and a column of the $x_i$ values), you obtain the closed formula mentioned above.
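As a quick sanity check of this derivation, the sketch below evaluates the two partial derivatives on made-up data and compares them with the vectorized formula. The data, the parameter values and the variable names are illustrative assumptions.

```python
import numpy as np

# Made-up single-feature data, purely for illustration.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=30)
theta_0, theta_1 = 0.5, -0.3
n = x.shape[0]

# Partial derivatives exactly as written in the two formulas above.
residual = x * theta_1 + theta_0 - y
d_theta_0 = (2.0 / n) * np.sum(residual)
d_theta_1 = (2.0 / n) * np.sum(x * residual)

# Vectorized form: the design matrix has a column of ones (for theta_0)
# next to the feature column (for theta_1).
X = np.column_stack([np.ones(n), x])
theta = np.array([theta_0, theta_1])
vectorized = (2.0 / n) * X.T @ (X @ theta - y)

print(np.allclose([d_theta_0, d_theta_1], vectorized))
```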

> Training algorithms using GD

Let's first define some terminology:

**Epoch.** One epoch corresponds to one pass of the model over the whole training dataset. The term is mostly used in the context of neural networks, where you typically feed the whole dataset through the network several times, but it can of course be used for linear or logistic regression as well.

**Batch.** A batch is the subset of the training data used to compute one gradient update. The batch size can range from 1 to N, where N is the number of training data points.

Now, how do you actually choose a proper batch size? There are fundamentally three scenarios:

- Small (if 1, it is called `Stochastic` gradient descent): you are able to compute the gradient faster, but you pay for it with less consistent steps through the cost function's space. In other words, the cost during training varies a lot (the training history line goes up and down a lot).
- `Mini-batch`: lies between the small and large batch sizes; I would say that in general it might be around 5 % of the training data.
- Large (if N, each update uses the whole dataset): each gradient is more expensive to compute, but the cost should decrease slowly and steadily (the training history line is smooth, with no spikes).
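The batch-size choices listed above can be explored with a small mini-batch loop for linear regression. This is only a sketch under stated assumptions: the function name `minibatch_gd`, its default hyper-parameters and the synthetic data are illustrative, and this is not the implementation from the linked repository.

```python
import numpy as np


def minibatch_gd(X, y, alpha=0.05, n_epochs=100, batch_size=16, seed=0):
    """Linear regression trained with (mini-)batch gradient descent on the MSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    history = []  # full-dataset MSE recorded after every epoch
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            grad = (2.0 / len(idx)) * X_b.T @ (X_b @ theta - y_b)
            theta = theta - alpha * grad
        history.append(np.mean((X @ theta - y) ** 2))
    return theta, history


# Synthetic data (illustrative): intercept plus two features, small noise.
rng = np.random.default_rng(42)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# batch_size=1 is stochastic GD, batch_size=n uses the whole dataset per update.
results = {bs: minibatch_gd(X, y, batch_size=bs) for bs in (1, 16, n)}
for bs, (theta, history) in results.items():
    print(f"batch_size={bs:3d}  theta={np.round(theta, 2)}  final MSE={history[-1]:.4f}")
```

Plotting each `history` list is one way to see the difference in how smooth the training curve is for the different batch sizes.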

In addition, in practice you might want **GD to stop early** when the **decrease in the loss** from one iteration to the next becomes negligible.
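One simple way to express such a stopping rule is to compare the loss before and after each update and stop once the improvement falls below a tolerance. The function name `gd_with_early_stopping`, the `tol` parameter and the synthetic data are assumptions for illustration. Other criteria, such as monitoring a validation loss, are equally common.

```python
import numpy as np


def gd_with_early_stopping(X, y, alpha=0.1, max_steps=10_000, tol=1e-8):
    """Full-batch GD on the MSE that stops once the loss improves by less than tol."""
    n, d = X.shape
    theta = np.zeros(d)
    prev_loss = np.mean((X @ theta - y) ** 2)
    for step in range(1, max_steps + 1):
        grad = (2.0 / n) * X.T @ (X @ theta - y)
        theta = theta - alpha * grad
        loss = np.mean((X @ theta - y) ** 2)
        if prev_loss - loss < tol:  # not much of an improvement any more
            break
        prev_loss = loss
    return theta, loss, step


# Synthetic data (illustrative): intercept plus two features, small noise.
rng = np.random.default_rng(7)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

theta, final_loss, steps_used = gd_with_early_stopping(X, y)
print(f"stopped after {steps_used} steps, loss={final_loss:.4f}, theta={np.round(theta, 2)}")
```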

Finally, let's talk about the **pros and cons** of `GD` as an optimizer:

Pros

- Intuitive to understand
- Used by many ML libraries as a default
- Works well for **convex** cost functions (i.e. any local minimum is also the global minimum)

Cons

- Assumes that the cost function is continuous and differentiable
- Can easily get stuck in a local minimum when the cost function is not convex

> Section summary

After this section, you should be able to explain:
- how `GD` works in detail
- how we can use it to train various algorithms such as `Linear` or `Logistic` regression
- how to set up hyper-parameters such as the learning rate or batch size
- pros and cons of `GD`

### 🏷 GD: practice

---

In this section, we focus on using gradient descent in practice. More specifically, we use it to optimize a `Linear regression` model. Therefore:
- see the [`implementation`](https://github.com/ludekcizinsky/nano-learn/blob/main/nnlearn/linear/_linr.py) for an example of how you could do it; note especially the function `_train`, where most of the core functionality lives
- see the [`wine dataset example`](https://github.com/ludekcizinsky/nano-learn/tree/main/examples/linr-wine) to inspect how different hyper-parameters influence the overall result (I suggest you clone the repository and play around with the hyper-parameters if you want)

> Section summary

This section should give you a practical intuition for how `GD` works. It should also give you a more concrete picture of the whole process, including how to choose the number of epochs and the batch size.
