# Linear regression

## Welcome!
We are going to dive into the powerful universe of machine learning models.

We will start with one of the easiest ones - linear regression. Though simple, it will introduce you to a number of important concepts that remain valid when studying more sophisticated models such as neural networks.

The idea of learning and the intuition behind it will be the same in almost all models, so make sure that you understand the upcoming concepts first.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, fixed
import ipywidgets as widgets
import sklearn as sk
import solutions

%matplotlib inline

## The problem of regression

Consider two series of numbers:

In [None]:
X = np.array([1, 2, 3, 4, 5, 6, 7])
Y = np.array([4.1, 6.7, 10.8, 14.3, 15.5, 20.0, 21.37])

plt.scatter(X, Y)
plt.show()

The plot shows a clear relationship between $X$ and $Y$. Moreover, this relationship is close to one of the simplest ones - it's linear.

In other words:

## $$y = w_0 + w_1 \cdot x$$

But how do we find good - or, as we'll say more often - **optimal** $w_0$ and $w_1$ for those two sets of data?

## Loss function 

Whenever you set yourself a goal, a good thing to figure out is how you will know whether you're satisfied with your results.

In Machine Learning, the concept of a **loss function** (or cost function) embodies this question. You can think of it as a metric that tells how satisfied you are with your solution. The better your model, the lower the loss.

Of course, you have to evaluate your solution - or **model** - in terms of the data you are interested in. So the loss function has the form

### $$loss(model, input\_data, output\_data)$$

or:

### $$L(W, X, Y)$$

Where:
* $L$ - loss function
* $W$ - the model
* $X$ - the input data
* $Y$ - the output data

In this example, the model $W$ is simply the pair of numbers $(w_0, w_1)$ we want to find.
However, it won't always be this simple!

### Let's define the loss function!
What do you think would be the best way to measure how well some $(w_0, w_1)$ capture the relationship between our $X$ and $Y$?

Some important points to consider:
* the better $(w_0, w_1)$ fit the actual data, the lower the loss
* the loss shouldn't depend on the amount of data, only on how well the model fits it!

In [None]:
def my_loss(w_0: float, w_1: float, X: np.ndarray, Y: np.ndarray):
    # implement your idea for loss function here!
    pass

In [None]:
# if you get stuck, check out the solutions script
# where you will find both the naive and vectorized solutions
# it's more rewarding to figure them out on your own, though!
my_loss = solutions.my_loss
# the second assignment overrides the first; comment it out to use the naive version
my_loss = solutions.my_loss_vectorized
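
One common choice that satisfies both points listed earlier is the mean squared error (MSE). The sketch below is only an illustration under that assumption: the contents of `solutions.my_loss` are not shown here and may differ, and the names `mse_loss_naive` and `mse_loss_vectorized` are hypothetical.

```python
import numpy as np

def mse_loss_naive(w_0, w_1, X, Y):
    # accumulate squared errors one data point at a time
    total = 0.0
    for x, y in zip(X, Y):
        y_pred = w_0 + w_1 * x
        total += (y - y_pred) ** 2
    # dividing by the number of points keeps the loss independent of dataset size
    return total / len(X)

def mse_loss_vectorized(w_0, w_1, X, Y):
    # same computation, expressed with NumPy array operations
    Y_pred = w_0 + w_1 * X
    return float(np.mean((Y - Y_pred) ** 2))

X = np.array([1, 2, 3, 4, 5, 6, 7])
Y = np.array([4.1, 6.7, 10.8, 14.3, 15.5, 20.0, 21.37])

print(mse_loss_naive(1.0, 3.0, X, Y))
print(mse_loss_vectorized(1.0, 3.0, X, Y))
```

Both functions should print the same value. Averaging instead of summing is what makes the result independent of the amount of data: repeating every point twice leaves the loss unchanged.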

## How to find the optimal $(w_0, w_1)$?

Now that we have a way to measure the quality of our model, how can we find an optimal-enough one?

### Manually

In [None]:
def plot_linear_model(w_0: float, w_1: float, X: np.ndarray, Y: np.ndarray):
    Y_pred = w_0 + w_1 * X 
    plt.scatter(X, Y)
    plt.plot(X, Y_pred, 'r')
    plt.show()
    print('w_0:', w_0)
    print('w_1:', w_1)
    print('Loss:', my_loss(w_0, w_1, X, Y))
    
interact(plot_linear_model, 
         w_0=(-5.0, 5.0), 
         w_1=(-5.0,5.0),
         X=fixed(X),
         Y=fixed(Y)
        )

### Analytically

Since our loss function is not *that* complicated, we could take the least-squares approach: calculate its derivatives with respect to $w_0$ and $w_1$ and find the values at which they vanish, which is where the loss is minimal.
In this case, it even works:

In [None]:
w_1, w_0 = np.polyfit(X, Y, deg=1)
plot_linear_model(w_0, w_1, X, Y)
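
To make the analytic route concrete, here is a minimal sketch that assumes the loss is the mean squared error. Setting both partial derivatives to zero then yields the standard closed-form least-squares formulas for a line. The function name `fit_closed_form` is hypothetical, and the result is compared against `np.polyfit` as a sanity check.

```python
import numpy as np

def fit_closed_form(X, Y):
    # slope: covariance of X and Y divided by the variance of X
    x_mean, y_mean = X.mean(), Y.mean()
    w_1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
    # intercept: the fitted line passes through the point of means
    w_0 = y_mean - w_1 * x_mean
    return w_0, w_1

X = np.array([1, 2, 3, 4, 5, 6, 7])
Y = np.array([4.1, 6.7, 10.8, 14.3, 15.5, 20.0, 21.37])

w_0_cf, w_1_cf = fit_closed_form(X, Y)
# np.polyfit returns coefficients from the highest degree down
w_1_np, w_0_np = np.polyfit(X, Y, deg=1)

print('closed form:', w_0_cf, w_1_cf)
print('matches polyfit:', np.allclose([w_0_cf, w_1_cf], [w_0_np, w_1_np]))
```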

In [None]:
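n_cases = 10
w_0_space = np.linspace(0, 5, n_cases)
w_1_space = np.linspace(5, 0, n_cases)  # descending, so larger w_1 values are at the top

# evaluate the loss at every grid point: rows index w_1, columns index w_0
# (indexing the spaces directly avoids overwriting the fitted w_0 and w_1)
loss_grid = np.zeros((n_cases, n_cases))
for i in range(n_cases):
    for j in range(n_cases):
        loss_grid[i, j] = my_loss(w_0_space[j], w_1_space[i], X, Y)

# round the tick labels so the axes stay readable
ax = sns.heatmap(loss_grid, xticklabels=np.round(w_0_space, 2), yticklabels=np.round(w_1_space, 2), annot=True, fmt='.1f')
ax.set_xlabel('w_0')
ax.set_ylabel('w_1')
plt.show()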

However, in tougher cases, there can be many local extrema and a much larger space of $W$ to consider, so neither solving for the minimum by hand nor evaluating the loss on a grid remains practical.

### Using an ML method!

Searching through the whole space of $W$ is computationally expensive. What if there was a technique to navigate through it more intelligently?

Though there is normally no feasible way of generating a loss map such as the one above, we don't really need to know the loss throughout the whole space of solutions - we only want to find a place where the loss is lower than where we currently are.

#### Enter Gradient Descent!

If you know how to calculate the value of $L(w_0, w_1, X, Y)$ for a particular $(w_0, w_1)$, you can get a basic intuition about how that value is expected to change, should you shift $w_0$ or $w_1$ a bit. Do you know a math operation that does that?

It's the partial derivative!

Therefore, calculating the **values** of $\dfrac{\partial L}{\partial w_0}$ and $\dfrac{\partial L}{\partial w_1}$ **specifically at** $(w_0, w_1)$ tells you how the loss is expected to change when $(w_0, w_1)$ shifts slightly.

In particular, for any parameter $w$, if the value of $\dfrac{\partial L}{\partial w}$ is positive, we expect that slightly increasing $w$ will increase $L$ and slightly decreasing $w$ will decrease $L$. If the value is negative, the opposite holds.

Another intuition is that the bigger the absolute value of $\dfrac{\partial L}{\partial w}$, the bigger the (positive or negative) impact a small shift of $w$ will have on $L$.
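
The intuition above can be checked numerically without knowing the derivative formulas, by nudging each parameter a little and watching the loss. The sketch below uses central finite differences and assumes a mean squared error loss; `mse_loss` and `numerical_gradient` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def mse_loss(w_0, w_1, X, Y):
    # assumed loss for this illustration: mean squared error
    return float(np.mean((Y - (w_0 + w_1 * X)) ** 2))

def numerical_gradient(loss, w_0, w_1, X, Y, eps=1e-6):
    # central difference: shift one parameter by +/- eps, keep the other fixed
    d_w0 = (loss(w_0 + eps, w_1, X, Y) - loss(w_0 - eps, w_1, X, Y)) / (2 * eps)
    d_w1 = (loss(w_0, w_1 + eps, X, Y) - loss(w_0, w_1 - eps, X, Y)) / (2 * eps)
    return d_w0, d_w1

X = np.array([1, 2, 3, 4, 5, 6, 7])
Y = np.array([4.1, 6.7, 10.8, 14.3, 15.5, 20.0, 21.37])

g0, g1 = numerical_gradient(mse_loss, 0.0, 0.0, X, Y)
print('dL/dw_0 at (0, 0):', g0)
print('dL/dw_1 at (0, 0):', g1)
```

At $(0, 0)$ both values come out negative for this data, which matches the intuition: the line $y = 0$ lies below every point, so increasing either parameter lowers the loss.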

What are the expressions for $\dfrac{\partial L}{\partial w_0}$ and $\dfrac{\partial L}{\partial w_1}$?

In [None]:
def dLdw0(w_0: float, w_1: float, X: np.ndarray, Y: np.ndarray):
    pass

def dLdw1(w_0: float, w_1: float, X: np.ndarray, Y: np.ndarray):
    pass

In [None]:
dLdw0 = solutions.dLdw0
dLdw1 = solutions.dLdw1
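
For reference, here is how the derivation goes if the loss is assumed to be the mean squared error over $n$ data points. This is an assumption made for illustration; the loss you defined, and the one in the solutions script, may differ.

$$L(w_0, w_1, X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)^2$$

Applying the chain rule to each squared term gives:

$$\frac{\partial L}{\partial w_0} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (w_0 + w_1 x_i)\right)$$

$$\frac{\partial L}{\partial w_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (w_0 + w_1 x_i)\right)$$

Both expressions are averages of the residuals $y_i - (w_0 + w_1 x_i)$, weighted by $1$ for $w_0$ and by $x_i$ for $w_1$.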