# Introduction to Machine Learning


## Outline:
- Introduce machine learning fundamentals and tools used by the scientific community.
- Introduce the linear regression model and gradient descent optimization.
- Visualize how we can optimize a linear regression model using gradient descent.

## The fundamentals:
Machine learning has become a widely used tool in the scientific community in recent years, with applications to a broad range of problems. In this notebook, we cover the fundamentals of machine learning and demonstrate how to apply it to model 2D data.

Before we can get to modeling data, we need to understand what tools to use and how they work. 


### What is [PyTorch](https://github.com/pytorch/pytorch)?

In short, PyTorch is a Python-based scientific computing package targeted at two sets of audiences:

- A replacement for NumPy to use the power of GPUs
- A deep learning research platform that provides flexibility and speed

To understand and install PyTorch, we recommend going through the [tutorials](https://pytorch.org/tutorials/) and [blogs](https://pytorch.org/blog/).

### What is a [Tensor](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html)?
A tensor is the fundamental data structure of PyTorch. Tensors are much like NumPy arrays, except that they are designed to take advantage of the parallel computation capabilities of a GPU and, more importantly for us, they can keep track of their gradients.
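
As a minimal sketch of both properties, the snippet below converts between a NumPy array and a tensor, then lets PyTorch compute a gradient. The names `a`, `b`, and `loss_demo` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import torch

# A tensor can be created from a NumPy array and converted back.
a = torch.from_numpy(np.array([1.0, 2.0, 3.0]))
print(a.numpy())

# requires_grad=True asks PyTorch to record operations on b
# so that gradients can be computed later.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss_demo = (b ** 2).sum()  # b_0^2 + b_1^2 + b_2^2

# backward() fills b.grad with d(loss_demo)/db = 2 * b
loss_demo.backward()
print(b.grad)  # tensor([2., 4., 6.])
```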

For a single blog post that goes through all of these concepts, we recommend [this one](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/).



Having understood how the tools work, we can now use them to model our data. Training a machine learning model generally follows a recipe that can be summarized in three steps:
1. Define the model representation.
2. Choose a suitable loss function that tells us how far apart our model predictions are from the real data.
3. Update the model representation using an optimization algorithm.

In this notebook, we will be introducing the linear regression model. We will measure loss using the Mean Squared Error (MSE) function and will try to optimize or reduce the loss value using a popular optimization algorithm, gradient descent.

To read further on how to train machine learning models, refer to Prof. Fund's [lecture notes](https://github.com/ffund/intro-ml-tss21) or Prof. Hegde's [blogposts](https://chinmayhegde.github.io/dl-notes/notes/lecture03/).

## Linear regression

One of the simplest machine learning algorithms is [linear regression](https://en.wikipedia.org/wiki/Linear_regression). We will code up linear regression from scratch with a twist: we will use gradient descent, which is also how neural networks learn. Most of this lesson is pretty much stolen from Jeremy Howard's fast.ai [lesson zero](https://www.youtube.com/watch?v=ACU-T9L4_lI).

- In linear regression, we assume that $y = w_0 + w_1 x$, where $y$ is the expected output.
- We look for the $w$ coefficients that give the 'best' prediction ($\tilde{y}$) for the output. The best prediction is defined by minimizing some cost function. For linear regression, we generally minimize the mean squared error between the expected output $y$ and the prediction $\tilde{y}$.
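
To make 'best' concrete, here is a small sketch in plain Python. The three data points and the two candidate weight pairs are made up for illustration, and the helper names `predict` and `mean_squared_error` are hypothetical, not part of this notebook.

```python
def predict(w0, w1, xs):
    """Linear model: y_tilde = w0 + w1 * x for every x."""
    return [w0 + w1 * x for x in xs]


def mean_squared_error(ys, y_tildes):
    """Average of the squared differences between targets and predictions."""
    return sum((y - yt) ** 2 for y, yt in zip(ys, y_tildes)) / len(ys)


# Made-up data generated by y = 3 + 2x
xs = [0.0, 1.0, 2.0]
ys = [3.0, 5.0, 7.0]

good = mean_squared_error(ys, predict(3.0, 2.0, xs))  # the generating weights
bad = mean_squared_error(ys, predict(0.0, 1.0, xs))   # a poor guess

print(good, bad)
```

The weights with the lower mean squared error are the better fit; here the generating weights give an error of exactly zero.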

```{image} https://miro.medium.com/v2/resize:fit:1032/1*WswH2fPx0bf_JFRMm8V-HA.gif
:alt: nn-output-computation
:width: 500px
```

Image is taken from [here](https://blog.insightdatascience.com/a-quick-introduction-to-vanilla-neural-networks-b0998c6216a1).

### Implementation

We will learn the parameters $w_0$ and $w_1$ of a line.

In [None]:
# Import plotting packages
from IPython.display import Image, HTML
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import time
import base64
import numpy as np

# Import machine-learning packages
import torch
from torch import nn

%matplotlib inline

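Set the random seeds for reproducibility. The data below is generated with PyTorch, so we seed it as well as NumPy.


In [None]:
np.random.seed(42)
torch.manual_seed(42)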

Choose the true parameters we want to learn.

In [None]:
w = torch.as_tensor([3.0, 2])
w

### Data
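Create some data points x and y which lie near the line. We add uniform noise in $[0, 1)$ to $y$; since this noise has mean 0.5, the best-fit intercept will be close to 3.5 rather than 3.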



Documentation for:
- [ones()](https://pytorch.org/docs/stable/generated/torch.ones.html)
- [uniform_()](https://pytorch.org/docs/stable/generated/torch.Tensor.uniform_.html)

In [None]:
n = 100
x = torch.ones(n, 2)
# In PyTorch, methods ending in an underscore modify the tensor in place
x[:, 1].uniform_(-1.0, 1)

x[:5]
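
The trailing underscore is PyTorch's naming convention for in-place operations. A short sketch of the difference, using illustrative tensors `t` and `u` that are not part of the notebook:

```python
import torch

t = torch.zeros(3)

# Out-of-place: add() returns a new tensor and leaves t unchanged.
u = t.add(1.0)
print(t)  # still all zeros

# In-place: add_() modifies t itself.
t.add_(1.0)
print(t)  # now all ones
```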

In [None]:
y = x @ w + torch.rand(n)  # @ is the matrix product (equivalent to torch.matmul); torch.rand adds uniform noise in [0, 1)
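
The column of ones in `x` is what lets a single matrix product include the intercept: each row `[1, x_i]` multiplied by `[w_0, w_1]` gives `w_0 + w_1 * x_i`. A small check of this with made-up inputs (`x_demo`, `w_demo`, and `inputs` are illustrative names):

```python
import torch

w_demo = torch.tensor([3.0, 2.0])        # [w_0, w_1]
inputs = torch.tensor([-1.0, 0.0, 0.5])  # made-up x values

# Design matrix: first column is all ones, second column holds the inputs.
x_demo = torch.ones(3, 2)
x_demo[:, 1] = inputs

via_matmul = x_demo @ w_demo
via_formula = w_demo[0] + w_demo[1] * inputs

print(via_matmul)
print(torch.allclose(via_matmul, via_formula))
```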

In [None]:
plt.figure(dpi=150)
plt.scatter(x[:, 1], y)
plt.xlabel("x")
plt.ylabel("y")
plt.show();

In [None]:
# Despite its name, w_real is our initial (deliberately wrong) guess for the weights
w_real = torch.as_tensor([-3.0, -5])

### Loss function
If we can find a way to improve our guess for the weights ($w_0$ and $w_1$) in this simple setting, we can use the exact same method for much more complicated tasks (such as image recognition).

We define our loss function $L$ as the Mean Squared Error (MSE):
\begin{equation*}
L_{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} (y_i - \tilde y_i)^2
\end{equation*}

In [None]:
def mse(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

Written in terms of $w_0$ and $w_1$, our **loss function** $L$ is:

\begin{equation*}
L_{MSE} = \frac{1}{n} \cdot \sum_{i=1}^{n} (y_i - (w_0 + \mathbf{w_1} \cdot x_i))^2
\end{equation*}

In [None]:
y_hat = x @ w_real
# Initial mean-squared error
mse(y, y_hat)

In [None]:
plt.figure(dpi=150)
plt.scatter(x[:, 1], y, label="y")
plt.scatter(x[:, 1], y_hat, label="$\\tilde{y}$")
plt.xlabel("$x$")
plt.legend(fontsize=7);

In [None]:
w = nn.Parameter(w_real)
w

### Gradient descent


In [None]:
# Load the image file
with open("figs/Gradient_descent2.png", "rb") as f:
    image_data = f.read()

# Create the HTML code to display the image
html = f'<div style="text-align:center"><img src="data:image/png;base64,{base64.b64encode(image_data).decode()}" style="max-width:700px;"/></div>'

# Display the HTML code in the notebook
display(HTML(html))

So far, we have specified the *model* (linear regression) and the *evaluation criteria* (or *loss function*). Now we need to handle *optimization*; that is, how do we find the values of the weights ($w_0$, $w_1$) for which the line best fits the data?

To know how to change $w_0$ and $w_1$ to reduce the loss, we compute the derivatives (or gradients):

\begin{equation*}
\frac{\partial L_{MSE}}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n} -2\cdot [y_i - (w_0 + \mathbf{w_1}\cdot x_i)]
\end{equation*}

\begin{equation*}
\frac{\partial L_{MSE}}{\partial \mathbf{w_1}} = \frac{1}{n}\sum_{i=1}^{n} -2\cdot [y_i - (w_0 + \mathbf{w_1}\cdot x_i)] \cdot x_i
\end{equation*}
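
As a sanity check on these two formulas, the sketch below compares them against central finite differences in plain Python. The data, the weights, and the helper names `loss_fn` and `analytic_grads` are made up for illustration.

```python
def loss_fn(w0, w1, xs, ys):
    """Mean squared error of the line w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def analytic_grads(w0, w1, xs, ys):
    """Partial derivatives of the MSE, following the two formulas above."""
    count = len(xs)
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    d_w0 = sum(-2 * r for r in residuals) / count
    d_w1 = sum(-2 * r * x for r, x in zip(residuals, xs)) / count
    return d_w0, d_w1


# Made-up data and weights
xs = [0.0, 1.0, 2.0]
ys = [3.0, 5.0, 7.0]
w0, w1 = 0.0, 1.0

d_w0, d_w1 = analytic_grads(w0, w1, xs, ys)

# Central finite differences as an independent check
eps = 1e-6
num_w0 = (loss_fn(w0 + eps, w1, xs, ys) - loss_fn(w0 - eps, w1, xs, ys)) / (2 * eps)
num_w1 = (loss_fn(w0, w1 + eps, xs, ys) - loss_fn(w0, w1 - eps, xs, ys)) / (2 * eps)

print(d_w0, num_w0)
print(d_w1, num_w1)
```

Both gradients are negative here, which tells us that increasing $w_0$ and $w_1$ would reduce the loss.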

We can reduce the loss by iteratively taking small steps in the direction opposite to the gradient; this procedure is called *gradient descent*. The size of each step is determined by the learning rate ($\eta$):

\begin{equation*}
w_0^{new} = w_0^{current} -   \eta \cdot \frac{\partial L_{MSE}}{\partial w_0}
\end{equation*}

\begin{equation*}
\mathbf{w_1^{new}} = \mathbf{w_1^{current}} -  \eta \cdot \frac{\partial L_{MSE}}{\partial \mathbf{w_1}}
\end{equation*}
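
Applied repeatedly, these two update rules can be sketched in plain Python as follows. This is a hand-written illustration on made-up, noise-free data, with an assumed learning rate of 0.1 and 500 iterations; it is not the PyTorch implementation used in this notebook.

```python
# Made-up data generated by y = 3 + 2x
xs = [-1.0, 0.0, 1.0]
ys = [1.0, 3.0, 5.0]

w0, w1 = 0.0, 0.0  # initial guess
eta = 0.1          # learning rate
count = len(xs)

for _ in range(500):
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    grad_w0 = sum(-2 * r for r in residuals) / count
    grad_w1 = sum(-2 * r * x for r, x in zip(residuals, xs)) / count
    # Step against the gradient, scaled by the learning rate
    w0 = w0 - eta * grad_w0
    w1 = w1 - eta * grad_w1

print(round(w0, 4), round(w1, 4))
```

The weights move towards the generating values $w_0 = 3$ and $w_1 = 2$.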

In [None]:
def step(lr):
    y_hat = x @ w
    loss = mse(y, y_hat)
    # Compute the gradient of the loss with respect to w; it is stored in w.grad
    loss.backward()

    # Inside no_grad, operations are not tracked by autograd: we only want to
    # modify the values of w, not record the update in the computation graph.
    with torch.no_grad():
        # lr is the learning rate; choosing a good value is a key part of training neural networks.
        w.sub_(lr * w.grad)
        # Zero the gradient so it does not accumulate into the next backward pass.
        w.grad.zero_()

    return loss.detach().item(), y_hat.detach().numpy()

In PyTorch, we need to set the gradients to zero before the next backpropagation pass because PyTorch accumulates (sums) the gradients on every `loss.backward()` call. This default is convenient when training Recurrent Neural Networks (RNNs).
Because of it, you should zero out the gradients in every iteration of the training loop so that the parameter update is correct. Otherwise the gradient would mix in contributions from earlier iterations and point in some direction other than the intended one towards the minimum (or maximum, for maximization objectives).
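
The accumulation behaviour is easy to see on a tiny example. In this sketch, `w_demo` is an illustrative tensor that is separate from the notebook's `w`:

```python
import torch

w_demo = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw of sum(w^2) is 2 * w
(w_demo ** 2).sum().backward()
first = w_demo.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on top.
(w_demo ** 2).sum().backward()
accumulated = w_demo.grad.clone()

# Zero the gradient, then run backward again to get a fresh value.
w_demo.grad.zero_()
(w_demo ** 2).sum().backward()
fresh = w_demo.grad.clone()

print(first)        # tensor([2., 4.])
print(accumulated)  # tensor([4., 8.])
print(fresh)        # tensor([2., 4.])
```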

Explanations about how PyTorch calculates the gradients can be found [here](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/).

In [None]:
w = torch.as_tensor([-2.0, -3])
w = nn.Parameter(w)
lr = 0.1
losses = []  # losses[i] is the loss at iteration i, matching y_hats[i]
y_hats = []
epoch = 100

In [None]:
# Train model and perform gradient descent
for _ in range(epoch):
    loss, y_hat = step(lr)
    losses.append(loss)
    y_hats.append(y_hat)

## Visualization

Animation inspired by Prof. Fund's practical [notebooks](https://github.com/ffund/ml-notebooks).

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4), dpi=80)
axs[0].scatter(x[:, 1], y, label="y")
scatter_yhat = axs[0].scatter(x[:, 1], y_hat, label="$\\tilde{y}$")
axs[0].set_xlabel("$x$")
axs[0].legend(fontsize=7)

(line,) = axs[1].plot(range(len(losses)), np.array(losses))
axs[1].set_xlabel("Iteration")
axs[1].set_ylabel("Loss")
axs[1].set_title("Loss vs Iteration")
plt.close()


def animate(i):
    axs[0].set_title("Loss = %.2f" % losses[i])
    scatter_yhat.set_offsets(np.c_[[], []])
    scatter_yhat.set_offsets(np.c_[x[:, 1], y_hats[i]])
    line.set_data(np.array(range(i + 1)), np.array(losses[: (i + 1)]))
    return scatter_yhat, line


# plt.show()

animation = FuncAnimation(fig, animate, frames=epoch, interval=100, blit=True)
# let animation load
time.sleep(1)
plt.show();

If you have difficulty running the cell below (for example, because `ffmpeg` is missing), run the following command in a terminal in the `L96M2lines` environment.
```shell
conda install -c conda-forge ffmpeg
```

In [None]:
display(HTML(f'<div style="text-align:center;">{animation.to_html5_video()}</div>'))

### The complete loss graph

In [None]:
plt.figure(dpi=150)
plt.plot(np.array(losses))
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Loss vs Iteration")
plt.show();

## Further reading
In deep learning, we use a variation of gradient descent called [mini-batch gradient descent](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/). Instead of calculating the gradient over the whole training set before changing the model weights (coefficients), we take a subset (batch) of our data and update the weights after calculating the gradient over this subset.
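
A minimal sketch of the idea in plain Python, applied to the same linear model as above. The data is made up and noise-free, and the batch size of 5, the learning rate of 0.1, and the 200 passes over the data are illustrative choices rather than recommendations.

```python
import random

random.seed(0)

# Made-up, noise-free data on the line y = 3 + 2x
xs = [i / 10 - 1.0 for i in range(20)]  # -1.0, -0.9, ..., 0.9
ys = [3.0 + 2.0 * x for x in xs]

w0, w1 = 0.0, 0.0
eta = 0.1
batch_size = 5

indices = list(range(len(xs)))
for _ in range(200):
    random.shuffle(indices)  # visit the data in a new order on every pass
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient of the MSE computed on this batch only
        residuals = [ys[i] - (w0 + w1 * xs[i]) for i in batch]
        grad_w0 = sum(-2 * r for r in residuals) / len(batch)
        grad_w1 = sum(-2 * r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
        # Update the weights right after each batch
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1

print(round(w0, 3), round(w1, 3))
```

Each pass over the data now performs four weight updates instead of one, at the cost of each update using a noisier estimate of the gradient.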