# Linear Regression

## Notation

- Vectors are in bold lower case letters: $\bm{x}$
- Matrices are in bold upper case letters: $\bm{X}$
- Vector components are in a subscript: $x_2$
- Data element number is in parenthesis in a superscript: $\bm{x}^{(3)}$
- $x$ is input/independent variable/features
- $y$ is output/dependent variable/answer

Let's show the notation with some sample data.

In [None]:
import pandas as pd
example_data = pd.DataFrame()
names = ['Neil', 'Buzz', 'Michael']
experience = [20, 22, 23]
education = [16, 21, 18]
example_data['name'] = names
example_data['experience_years'] = experience
example_data['education_years'] = education
example_data


If we think salary depends linearly on years of experience and years of education, the regression equation is:

$$\hat{y}^{(i)} = b + m_1 x_1^{(i)} + m_2 x_2^{(i)}$$

- $x_1$ is years of experience
- $m_1$ is the coefficient for $x_1$ that we need to solve for/learn
- $x_2$ is years of education
- $m_2$ is the coefficient for $x_2$ that we need to solve for/learn
- $b$ is the y-intercept that we need to solve for/learn

For Buzz's predicted salary:
$$\hat{y}^{(1)} = b + m_1 x_1^{(1)} + m_2 x_2^{(1)}$$
$$\hat{y}^{(1)} = b + m_1 (22) + m_2 (21)$$


We could write this in matrix notation as well (which is more convenient to write and code when dealing with a lot of inputs):

$$ \hat{\bm{y}} = \bm{Xw} $$

- $\bm{X}$ is the input matrix:

$$\bm{X} = \left[\begin{array}{ccc}
    1 & x_1^{(0)} & x_2^{(0)}\\
    1 & x_1^{(1)} & x_2^{(1)}\\
    1 & x_1^{(2)} & x_2^{(2)}\\
   \end{array}\right]$$

- $\bm{w}$ is the vector of linear regression parameters to solve for/learn:

$$\bm{w} = \left[\begin{array}{cc}
    b\\
    m_1\\
    m_2\\
   \end{array}\right]$$

- $\hat{\bm{y}}$ is the output vector:

$$\hat{\bm{y}} = \left[\begin{array}{cc}
    \hat{y}^{(0)} \\
    \hat{y}^{(1)} \\
    \hat{y}^{(2)} \\
   \end{array}\right] $$
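
As a rough illustration of the matrix form above, the sketch below builds $\bm{X}$ from the example table and multiplies it by a weight vector. The weights `[10.0, 2.0, 1.0]` are made up for the example (they are not learned from any salary data), so the predictions only show the mechanics of $\hat{\bm{y}} = \bm{Xw}$.

```python
import torch

# Design matrix: a leading column of ones, then experience and education
# years for Neil, Buzz, and Michael (values from the example table).
X = torch.tensor([[1.0, 20.0, 16.0],
                  [1.0, 22.0, 21.0],
                  [1.0, 23.0, 18.0]])

# Hypothetical weights [b, m_1, m_2]; chosen for illustration, not learned.
w = torch.tensor([10.0, 2.0, 1.0])

y_hat = X @ w  # one prediction per row of X
print(y_hat)   # Buzz's entry is 10 + 2*22 + 1*21 = 75
```

The column of ones is what lets the intercept $b$ be treated as just another weight.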

## What is linear regression?

The goal of linear regression is to find a line that best fits data points.

To keep it simple, we'll start with a regression that has one input, $x$, and one output, $y$.

In [None]:
import torch
import plotly.express as px
import plotly.graph_objects as go
from copy import deepcopy

In [None]:
torch.manual_seed(0)
torch.normal(torch.zeros(5),torch.ones(5))

In [None]:
X = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
Y = X+torch.normal(torch.zeros(5),torch.ones(5))
line = deepcopy(X)

In [None]:
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20), xaxis=dict(title='X'), yaxis=dict(title='Y'))
data_trace = go.Scatter( x=X, y=Y, mode='markers', name='data', marker=dict(size=8, color=px.colors.qualitative.Plotly[2]) )
fig.add_traces(data_trace)

Try to fit a line to the points above. Maybe you draw something that looks like this:

In [None]:
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20), xaxis=dict(title='X'), yaxis=dict(title='Y'))
data_trace = go.Scatter( x=X, y=Y, mode='markers', name='data', marker=dict(size=8, color=px.colors.qualitative.Plotly[2]))
line_trace = go.Scatter( x=X, y=line, mode='lines', name='regression line', line=dict(color=px.colors.qualitative.Plotly[0]))
fig.add_traces([data_trace, line_trace])

## Errors/Residuals

To see how well or badly we did, we subtract the actual values, $\bm{y}$, from our predicted values, $\bm{\hat{y}}$ (the points on the regression line).

This gives us the residuals or errors, $\bm{e}$:
$$\bm{e} = \bm{\hat{y}} - \bm{y}$$
$$\left[\begin{array}{cc}
    e_1 \\ e_2 \\ \vdots \\ e_N 
   \end{array}\right]
   =
   \left[\begin{array}{cc}
    \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N 
   \end{array}\right]
   -
   \left[\begin{array}{cc}
    y_1 \\ y_2 \\ \vdots \\ y_N 
   \end{array}\right]$$

In [None]:
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20), xaxis=dict(title='X'), yaxis=dict(title='Y'))
traces = []
for i, x in enumerate(line):
    if i>0:
        showlegend = False
    else:
        showlegend = True
    residual_trace = go.Scatter( x=[x, x], y=[line[i], Y[i]], line=dict(color=px.colors.qualitative.Plotly[1], dash='dash'), mode='lines', name='residuals/errors', showlegend=showlegend)
    traces.append(residual_trace)
data_trace = go.Scatter( x=X, y=Y, mode='markers', name='data', marker=dict(size=8, color=px.colors.qualitative.Plotly[2]))
line_trace = go.Scatter( x=X, y=line, mode='lines', name='regression line', line=dict(color=px.colors.qualitative.Plotly[0]))
traces.append(data_trace)
traces.append(line_trace)
fig.add_traces(traces)

## Loss/Objective Function: Mean Squared Error

We want to minimize the errors, but simply adding them up doesn't make sense: an error of $-1$ and an error of $+1$ are equally bad, yet their sum cancels to 0, which would wrongly suggest no error at all.

Instead, we square the errors so the negatives become positive before summing them.
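
A quick numeric check of the cancellation argument, using two made-up errors (the values are illustrative, not taken from the data):

```python
# Two hypothetical residuals that are equally bad but opposite in sign.
errors = [-1.0, 1.0]

total = sum(errors)                               # signs cancel
mse = sum(e ** 2 for e in errors) / len(errors)   # squares do not cancel

print(total)  # 0.0
print(mse)    # 1.0
```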

If you average all the squares of the errors you get mean squared error:
$$ \mathrm{MSE} = \frac{\sum_{n=1}^{N} {{e_{n}}^2}}{N} = \frac{\sum_{n=1}^{N} { \left( \hat{y}_n - y_n \right) ^2}}{N} = \frac{\sum_{n=1}^{N} { \left( mx_n + b - y_n \right) ^2}}{N}$$

or if you want to do vector/matrix/tensor math:

$$ \mathrm{MSE} = \frac{ \bm{e}^\top \bm{e}}{N} = \frac{ \left( \hat{\bm{y}}-\bm{y} \right) ^\top \left( \hat{\bm{y}}-\bm{y} \right)}{N} = \frac{ \left( \bm{X}\bm{w} -\bm{y} \right) ^\top \left( \bm{X}\bm{w} -\bm{y} \right)}{N}$$

$$ \bm{X} = \left[\begin{array}{cc}
    1 & x^{(1)} \\
    1 & x^{(2)} \\
    \vdots \\
    1 & x^{(N)} 
   \end{array}\right], \quad
    \bm{w} = \left[\begin{array}{cc}
    b \\
    m
   \end{array}\right] $$
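
The two ways of writing MSE should give the same number. Here is a small sketch that checks this on made-up data with a made-up guess for the weights (`b = 0`, `m = 1`); none of these values come from the notebook's dataset.

```python
import torch

# Hypothetical data and weights, chosen only for illustration.
x = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = torch.tensor([1.0, 3.0, 5.0, 7.0])
b, m = 0.0, 1.0

# Element-wise form: average of the squared residuals.
mse_sum = sum((m * x_n + b - y_n) ** 2 for x_n, y_n in zip(x, y)) / len(x)

# Matrix form: (Xw - y)^T (Xw - y) / N.
X = torch.stack((torch.ones_like(x), x), dim=1)
w = torch.tensor([b, m])
e = X @ w - y
mse_mat = (e @ e) / e.numel()

print(mse_sum.item(), mse_mat.item())  # both 7.5
```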

This is the function we want to minimize to get the regression line. You want to find the slope ($m$) and intercept ($b$) that minimize the MSE.

In machine learning lingo this is called the **loss** or **objective** function.

The variables/parameters you are learning, slope ($m$) and y-intercept ($b$), are called the **weights**.

## Minimizing/optimizing MSE


MSE in linear regression is a convex function, which means its global minimum is where its derivative, with respect to slope and y-intercept, is equal to zero.

You can solve for where linear regression's MSE derivative equals zero using calculus and algebra (analytically), which is usually the best way to do linear regression.
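
For reference, one standard closed form that comes out of setting the derivative to zero is the normal equations, $\bm{X}^\top \bm{X} \bm{w} = \bm{X}^\top \bm{y}$. This notebook does not derive them, so treat the sketch below as an aside. It uses made-up noiseless data on the line $y = 2x + 1$ so the answer is easy to check.

```python
import torch

# Hypothetical noiseless data on the line y = 2x + 1.
x = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

X = torch.stack((torch.ones_like(x), x), dim=1)  # columns: ones, x

# Normal equations: (X^T X) w = X^T y, solved without an explicit inverse.
w = torch.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1., 2.], i.e. b = 1 and m = 2
```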

But for many other machine learning algorithms, the loss/objective function cannot be solved analytically. You need to use numerical methods to find the minimum.

For educational purposes, we'll do the numerical method called **gradient descent** on MSE.

In [None]:
M_surface = torch.linspace(0.0,1.3,100)
B_surface = torch.linspace(-1.0,0.5,100)
def mse_func(pred_Y, true_Y):
    return torch.square(pred_Y - true_Y).mean()
MSE_surface = torch.Tensor([[mse_func(m*X + b, Y) for m in M_surface] for b in B_surface])
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20) )
fig.update_scenes(xaxis_title_text='slope',  
                yaxis_title_text='intercept',  
                zaxis_title_text='MSE',
                aspectratio=dict(x=1, y=1, z=0.5),
                )
surface = go.Surface(x=M_surface, y=B_surface, z=MSE_surface, colorscale='Turbo',
                     cmax=1.0,
                     cmin=MSE_surface.min().item(), showscale=False, opacity=0.5)
fig.add_trace(surface)
fig

### Gradient descent steps

1. Initial guess
    - Guess $m$ (slope) and $b$ (y-intercept)
2. Forward pass
    - Predict/calculate $\hat{y}$ with current $m$ and $b$
3. Evaluate loss function
    - Keep track of the loss to make sure it's improving.
4. Backward pass/backpropagation
    - Take the gradient of MSE with respect to $b$ and $m$: $$ \nabla_{\bm{w}} MSE = \left[\begin{array}{cc}
    \frac{\partial MSE}{\partial b} \\
    \frac{\partial MSE}{\partial m}
   \end{array}\right] $$

        - PyTorch does this for you! Just remember to only do differentiable operations in your forward pass.
    - Take a step in the negative direction of the gradient (direction of steepest descent)
        - You must pick a learning rate $\alpha$ which determines how big of a step to take
        - $b \leftarrow b - \alpha \frac{\partial MSE}{\partial b}$
        - $m \leftarrow m - \alpha \frac{\partial MSE}{\partial m}$
5. Repeat from step 2 until your loss stops decreasing
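
The steps above can be sketched as one compact loop. This is a minimal version under some assumptions: the data is made up (noiseless points on $y = 2x + 1$), the learning rate and the fixed step count of 1000 are arbitrary choices, and the weights are updated in place inside `torch.no_grad()` rather than with the detach-and-clone pattern used in the cells of this notebook.

```python
import torch

# Hypothetical noiseless data on the line y = 2x + 1.
X = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0
X_mat = torch.stack((torch.ones_like(X), X), dim=1)

w = torch.zeros(2, requires_grad=True)  # step 1: initial guess [b, m]
learning_rate = 0.05                     # assumed value
losses = []

for step in range(1000):                 # step 5: repeat (fixed count here)
    y_pred = X_mat @ w                   # step 2: forward pass
    mse = torch.square(y_pred - Y).mean()
    losses.append(mse.item())            # step 3: track the loss
    mse.backward()                       # step 4: gradient of MSE w.r.t. w
    with torch.no_grad():
        w -= learning_rate * w.grad      # step against the gradient
    w.grad.zero_()                       # clear the gradient for the next pass

print("b =", w[0].item(), "m =", w[1].item())
```

Calling `w.grad.zero_()` matters in this variant because PyTorch accumulates gradients into `.grad` across `backward()` calls on the same tensor.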


#### Initial Guess

- Let's simply guess 0s for $m$ and $b$
- We're going to use PyTorch's tensors here
    - A vector is a 1D tensor and a matrix is a 2D tensor. A 3D tensor would be matrices stacked on top of each other coming out of the screen. A tensor is the general term for a vector, matrix, or "matrices" with higher dimensions.
    - PyTorch tensors can keep track of derivatives as you do calculations on them; the `requires_grad` property needs to be `True` for this to happen
        - Google "automatic differentiation" if you want to learn more about how it does this
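
To get a feel for what gradient tracking does, here is a tiny sketch on a scalar function unrelated to the regression problem. The function $y = x^2 + 2x$ is chosen only because its derivative, $2x + 2$, is easy to check by hand.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # ask PyTorch to track operations on x
y = x ** 2 + 2 * x                         # differentiable operations only

y.backward()                               # compute dy/dx and store it in x.grad
print(x.grad)                              # dy/dx = 2x + 2 = 8 at x = 3
```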

In [None]:
m = 0.0 # initial m guess
b = 0.0 # initial b guess
w = torch.tensor([b, m]) # put it in vector
w.requires_grad = True # keeps track of gradient

#### Forward Pass

- Let's compute $\hat{\bm{y}} = \bm{Xw}$
- Need to get $\bm{X}$ in the right form first:
$$ \bm{X} = \left[\begin{array}{cc}
    1 & x^{(1)} \\
    1 & x^{(2)} \\
    \vdots \\
    1 & x^{(N)} 
   \end{array}\right] $$

In [None]:
# puts a column of ones next to a column vector
def add_ones_column(input):
    ones_vec = torch.ones(input.shape) # make a vector of ones the size of input
    output = torch.stack((ones_vec, input), dim=1) # combine the ones vector and input vector, this gives the output matrix
    return output

In [None]:
X_mat = add_ones_column(X) 
X_mat

In [None]:
y_pred = torch.matmul(X_mat, w) # make a prediction

#### Evaluate loss

Let's graph our predictions with the data to see how well we did.

In [None]:
def show_regression(X, Y, y_pred):
    fig = go.FigureWidget()
    fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20), xaxis=dict(title='X'), yaxis=dict(title='Y'))

    fig_line = go.Scatter( x=X, y=y_pred, mode='lines', name='regression line', line=dict(color=px.colors.qualitative.Plotly[0]) )
    fig_data = go.Scatter( x=X, y=Y, mode='markers', name='data', marker=dict(size=8, color=px.colors.qualitative.Plotly[2]) )

    fig.add_traces([fig_data, fig_line])
    return fig

In [None]:
show_regression(X, Y, y_pred.detach())

Now let's calculate MSE.

In [None]:
mse_all = [] # list to store mse history
B = [b] # list to store intercept history
M = [m] # list to store slope history

In [None]:
pred_error = y_pred - Y # calculate residual
mse = torch.matmul(torch.t(pred_error), pred_error)/pred_error.numel() # calculate mse

print('MSE = ', mse.item())
mse_all.append(mse.item())

#### Backward Pass

We need to choose a learning rate, $\alpha$. This is how big of a step you take down the gradient.

In [None]:
learning_rate = 0.05 # choose a learning rate

- `.backward()` computes the gradient
- the gradients are stored in `w.grad`
    - `w.grad[0]` holds $\partial_b MSE$
    - `w.grad[1]` holds $\partial_m MSE$
- update $\bm{w}$: $$ \bm{w} \leftarrow \bm{w} - \alpha \nabla_{\bm{w}} MSE$$
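
If you want to convince yourself that `.backward()` returns the right thing, you can compare it with the gradient worked out by hand. Differentiating the matrix form of MSE gives $\nabla_{\bm{w}} \mathrm{MSE} = \frac{2}{N} \bm{X}^\top (\bm{Xw} - \bm{y})$. The sketch below checks this on made-up data; the numbers are for illustration only.

```python
import torch

# Hypothetical data, for illustration only.
x = torch.tensor([0.0, 1.0, 2.0])
y = torch.tensor([1.0, 3.0, 5.0])
X = torch.stack((torch.ones_like(x), x), dim=1)

w = torch.zeros(2, requires_grad=True)
mse = torch.square(X @ w - y).mean()
mse.backward()  # autograd fills w.grad

# Hand-derived gradient of MSE: (2 / N) * X^T (Xw - y)
with torch.no_grad():
    analytic = 2.0 / len(x) * X.T @ (X @ w - y)

print(w.grad)    # [d MSE / d b, d MSE / d m]
print(analytic)  # should match w.grad
```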

In [None]:
mse.backward() # compute gradient
new_w = w.detach().clone() - w.grad.detach().clone() * learning_rate # take a step
print("b = ", new_w[0].item())
print("m = ", new_w[1].item())

#### Keep repeating Backward and Forward Pass

The graph below will update when you run the block after it.

In [None]:
fig = show_regression(X, Y, y_pred.detach())
fig.update_layout(xaxis=dict(range=[-0.5,4.5]), yaxis=dict(range=[-1.5, 4.5]))
fig_line_data = fig.data[1]
fig

The graph above updates every time you run the block below. You should see the regression line improve quickly, then the changes slow down as it approaches the line that gives the minimal MSE for this data. You can stop re-running the block below when the line is barely changing.

In [None]:
## forward pass
w = new_w
w.requires_grad = True
y_pred = torch.matmul(X_mat, w) # make a prediction
pred_error = y_pred - Y # calculate residual

## evaluate loss
mse = torch.matmul(torch.t(pred_error), pred_error)/pred_error.numel() # calculate mse

print('MSE = ', mse.item())
mse_all.append(mse.item())
B.append(new_w[0].item())
M.append(new_w[1].item())

## backward pass
mse.backward() #
new_w = w.detach().clone() - w.grad.detach().clone() * learning_rate
print("b = ", new_w[0].item())
print("m = ", new_w[1].item())

## update graph
if fig_line_data is not None:
    x_fig = torch.linspace(0,4,10)
    x_fig_mat = add_ones_column(x_fig)
    y_pred_fig = torch.matmul(x_fig_mat, new_w)
    fig_line_data.x = x_fig
    fig_line_data.y = y_pred_fig

#### Plot results

- Plot surface of possible values of MSE for different y-intercepts and slopes
- Plot the path of our gradient descent on top of it

In [None]:
M_surface = torch.linspace(0.0,1.3,100)
B_surface = torch.linspace(-1.0,0.5,100)
def mse_func(pred_Y, true_Y):
    return torch.square(pred_Y - true_Y).mean()
MSE_surface = torch.Tensor([[mse_func(m*X + b, Y) for m in M_surface] for b in B_surface])
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20) )
fig.update_scenes(xaxis_title_text='slope',  
                yaxis_title_text='intercept',  
                zaxis_title_text='MSE',
                # yaxis = dict(range=[-1,1.5]),
                # zaxis = dict(range=[MSE.min().item(),MSE_max_display]),
                aspectratio=dict(x=1, y=1, z=0.5),
                )
surface = go.Surface(x=M_surface, y=B_surface, z=MSE_surface, colorscale='Turbo',
                     cmax=1.0,
                     cmin=MSE_surface.min().item(), showscale=False, opacity=0.5)
fig.add_trace(surface)
gradient_descent = go.Scatter3d(x=M, y=B, z=mse_all, line=dict(color='white'), marker=dict(size=2))
fig.add_trace(gradient_descent)
fig

- The plot above is for educational purposes 
- In most situations you will have more than two unknowns (or weights, parameters, whatever you want to call them) that you are trying to learn, and then you can't make the above plot
- Instead plot the loss vs iterations
- You should see loss decreasing and hopefully converging to a value

In [None]:
fig = px.line( x=range(len(mse_all)), y=mse_all, title='MSE vs. gradient descent steps' )
fig.update_layout( autosize=False, width=750, height=500 )
fig.update_xaxes( title="steps" )
fig.update_yaxes( title='MSE')

## Assignment: follow the same steps on the dataset below

In [None]:
X=torch.load('X_data.pt')
Y=torch.load('Y_data.pt')

In [None]:
fig = go.Figure()
fig.update_layout( height=500, width=750, margin=dict(l=20, r=20, t=20, b=20), xaxis=dict(title='X'), yaxis=dict(title='Y'))
fig_data = go.Scatter( x=X, y=Y, mode='markers', name='data' )

fig.add_traces([fig_data])