# Multivariate Linear Regression

## Welcome!

Today we'll revisit last time's topic - **Linear Regression** - in a more general setting.

You'll see that with just a few changes, we will be able to apply the Linear Regression model to problems much more interesting than plotting straight lines!

In [None]:
# imports
import sys
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact, fixed
import solutions
import importlib.util
import pandas as pd
import sklearn as sk
from typing import Tuple, Optional, List


%load_ext autoreload
%autoreload 2
%matplotlib inline

## Previously

### $$y = w_0 + w_1 \cdot x$$



In [None]:
X = solutions.X_1
Y = solutions.Y_1

plt.scatter(X, Y)

* we want to find the optimal $(w_0, w_1)$ - our **model**

* our model will be able to make a **hypothesis**

### $$ h_W(x) = w_0 + w_1 \cdot x$$

* a **loss function** lets us calculate how well our model **fits** the data. In this case, the loss function could look like this:

### $$L_{prev} = \frac{1}{N}\sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})^2 $$
 
 Though it's good enough for our purposes, we'll divide the loss function by 2. This doesn't change how it ranks models, but it will simplify later computations (can you guess how?)
 
### $$L = \frac{1}{2N}\sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})^2 $$

* We can find $(w_0, w_1)$ using the **Gradient Descent** method. 

* If we calculate the gradients of $L$ with respect to $(w_0, w_1)$, i.e. $\dfrac{\partial L}{\partial w_0}$ and $\dfrac{\partial L}{\partial w_1}$, we will know how to shift the values of $(w_0, w_1)$ so that they fit the data better.

### $$
\dfrac{\partial L}{\partial w_0} = \dfrac{1}{N}\sum_{i=1}^N \left(w_0 + w_1 x^{(i)} - y^{(i)}\right) \\
\dfrac{\partial L}{\partial w_1} = \dfrac{1}{N}\sum_{i=1}^N \left(w_0 + w_1 x^{(i)} - y^{(i)}\right) \cdot x^{(i)}
$$

![The idea of Gradient Descent](../1_regression/img/gradient_descent_0.png)

* We multiply the gradients by a **learning rate** $\alpha$, so that the updates are small and don't overshoot their objective.

### $$
w_0 = w_0 - \dfrac{\partial L}{\partial w_0} \cdot \alpha \\
w_1 = w_1 - \dfrac{\partial L}{\partial w_1} \cdot \alpha
$$
* We repeat that process for an arbitrary number of **epochs**

![Learning rates](../1_regression/img/learning_rate.png)
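To make the recap concrete, here is a minimal sketch of the whole loop on made-up data. The names `halved_mse`, `gradient_descent_1d`, `X_toy` and `Y_toy` are hypothetical and only serve this illustration; they are not the functions from `solutions.py`, and the learning rate was picked by hand for this toy dataset.

```python
import numpy as np

def halved_mse(w_0, w_1, X, Y):
    """Loss L from the recap: mean squared error divided by 2."""
    return float(np.mean((w_0 + w_1 * X - Y) ** 2) / 2)

def gradient_descent_1d(X, Y, learning_rate, num_epochs):
    """Fit y = w_0 + w_1 * x by repeating the two update rules."""
    w_0, w_1 = 0.0, 0.0
    for _ in range(num_epochs):
        errors = w_0 + w_1 * X - Y           # h_W(x) - y for every datapoint
        grad_w_0 = errors.mean()             # dL/dw_0
        grad_w_1 = (errors * X).mean()       # dL/dw_1
        # both weights are updated from gradients computed at the same point
        w_0 = w_0 - learning_rate * grad_w_0
        w_1 = w_1 - learning_rate * grad_w_1
    return w_0, w_1

# made-up, noise-free data generated from y = 3 + 2x
X_toy = np.linspace(0.0, 1.0, 20)
Y_toy = 3.0 + 2.0 * X_toy

w_0_fit, w_1_fit = gradient_descent_1d(X_toy, Y_toy, learning_rate=0.5, num_epochs=1000)
print("w_0:", w_0_fit, "w_1:", w_1_fit)
print("loss:", halved_mse(w_0_fit, w_1_fit, X_toy, Y_toy))
```

Because the toy data has no noise, the fitted weights should end up very close to the values used to generate it.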

In [None]:
spec = importlib.util.spec_from_file_location(
    "solutions", 
    "../1_regression/solutions.py"
)
solutions_1 = importlib.util.module_from_spec(spec)
spec.loader.exec_module(solutions_1)

In [None]:
init_w_0 = np.random.rand()
init_w_1 = np.random.rand()
learning_rate = 0.01
num_iterations = 100

trained_w_0, trained_w_1, loss_history = \
    solutions_1.train_model(init_w_0, init_w_1, X, Y, learning_rate, num_iterations)

plt.plot(list(range(num_iterations)), loss_history)
plt.show()

In [None]:
Y_pred = trained_w_0 + trained_w_1 * X 
plt.scatter(X, Y)
plt.plot(X, Y_pred, 'r')
plt.show()
print('w_0:', trained_w_0)
print('w_1:', trained_w_1)
print('Loss:', solutions_1.my_loss_vectorized(trained_w_0, trained_w_1, X, Y))

## Now let's go bigger!

Today, we'll apply linear regression to real-life data!

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
dataset = load_boston()
print(dataset.DESCR)

In [None]:
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df['target'] = dataset.target
df.head()

In [None]:
df.describe()

We're dealing with a dataset describing houses - each of them has **13** features. Let's see how each of them is related to our target - the price of the house!

In [None]:
for feature in df.columns:
    if feature != 'target':
        print(f"-------{feature}--------")
        plt.scatter(df[feature], df['target'])
        plt.show()

As the plots show, most of the features have some visible relationship with the target.

Now, is it possible to use what we already know to train a model which will make accurate enough predictions?

In [None]:
X = dataset.data
Y = dataset.target

X.shape, Y.shape

The main difference is that previously:

### $$\hat{y} = h_W(x) = w_0 + w_1x$$ 

And today:

### $$\hat{y} = h_W(x_1, x_2, ..., x_k) \\
= w_0 + w_1x_1+ w_2x_2+ w_3x_3+ ... + w_kx_k \\
= w_0 + \sum_{i=1}^k w_i x_i$$ 

As you can see, $w_0$ has been left out from the sum, which makes it sad. Can we do something, which will make it possible to include it there?

The simple solution is to add a *bias feature* to our input dataset - $X$ - i.e. add a column of ones to it.

This way, for each datapoint $x^{(j)}$,  
### $$x_0^{(j)} = 1$$
and 
### $$ x_0^{(j)} \cdot w_0 = w_0 $$

therefore

### $$ w_0 + \sum_{i=1}^k w_i x_i^{(j)} =  \sum_{i=0}^k w_i x_i^{(j)} = h_W(x^{(j)})$$

In [None]:
X = np.column_stack([np.ones(X.shape[0]), X])
X.shape, Y.shape
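As a quick check of the bias-feature trick, the sketch below evaluates both forms of the sum for a single made-up datapoint. The weights and feature values are arbitrary and purely illustrative; plain Python sums are used on purpose so the two formulas can be read off directly.

```python
import numpy as np

w = np.array([0.5, 2.0, -1.0])   # hypothetical weights w_0, w_1, w_2
x = np.array([3.0, 4.0])         # hypothetical datapoint with features x_1, x_2

# w_0 kept outside the sum: w_0 + sum_{i=1}^{k} w_i * x_i
with_separate_bias = w[0] + sum(w[i + 1] * x[i] for i in range(len(x)))

# bias feature x_0 = 1 prepended, so w_0 joins the sum: sum_{i=0}^{k} w_i * x_i
x_with_bias = np.concatenate([[1.0], x])
with_bias_feature = sum(w[i] * x_with_bias[i] for i in range(len(w)))

print(with_separate_bias, with_bias_feature)
```

Both expressions give the same number, which is why the column of ones lets us treat $w_0$ like any other weight.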

Calculating this manually every time is not a good idea. 

Your task now is to implement a function which will compute the hypotheses for given data $(X)$ and model $(w_0, w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9, w_{10}, w_{11}, w_{12}, w_{13})$

In [None]:
def hypothesis(
    X: np.ndarray,
    w_0: float, 
    w_1: float, 
    w_2: float, 
    w_3: float, 
    w_4: float, 
    w_5: float, 
    w_6: float, 
    w_7: float, 
    w_8: float, 
    w_9: float, 
    w_10: float, 
    w_11: float, 
    w_12: float, 
    w_13: float
) -> np.ndarray:
    pass
    # PLEASE DON'T EVER DO THAT

Wow, I got tired of even writing this header!

We obviously need something more elegant. This is why, from now on, we'll always think of particular datapoints not as numbers, but vectors of numbers. Therefore, the whole dataset will be a **vector of vectors** - a matrix.

The same way, we won't care much for every particular weight in our model, we'll treat them as a single vector of numbers.

So: 
### $$
\hat{y}^{(j)} = h_W(x^{(j)}) = \sum_{i=0}^k w_i x_i^{(j)} = W \cdot x^{(j)}
$$

* $W$ has shape $[n_{features}]$
* $X$ has shape $[n_{datapoints}, n_{features}]$
* $Y$ has shape $[n_{datapoints}]$

Please implement it. If you use numpy magic instead of iterating over columns, it should take you just one line of code!

In [None]:
def hypotheses(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    return np.zeros(X.shape[0])  # placeholder - replace with your one-line implementation

In [None]:
hypotheses = solutions.hypotheses

In [None]:
# let's make a sanity check on a few examples!
W = np.random.rand(X.shape[1])
print('your solution:', hypotheses(W, X[:5]))
print('provided solution:', solutions.hypotheses(W, X[:5]))


This also means we have to update the formula for the cost function:

### $$
L(w_0, w_1, \dots, w_k) = L(W) \\ 
= \frac{1}{2N}\sum_{i=1}^N\left(\sum_{j=0}^k w_j x^{(i)}_j - y^{(i)}\right)^2 \\
= \frac{1}{2N}\sum_{i=1}^N \left(h_W(x^{(i)}) - y^{(i)}\right)^2
$$

In [None]:
def loss(W: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    return 0

In [None]:
loss = solutions.loss

In [None]:
W = np.random.rand(X.shape[1])
print('your solution:', loss(W, X, Y))
print('provided solution:', solutions.loss(W, X, Y))

...and Gradient Steps

For every iteration:
* calculate partial derivatives of cost function with respect to every element of W:

### $$\epsilon_j = \frac{\partial}{\partial w_j}L(W) = \frac{1}{N} \sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})x_j^{(i)}$$

* **simultaneously** update every element of W:

### $$w_j = w_j - \alpha \epsilon_j$$ 

Where $\alpha$ is our learning rate.

In [None]:
def gradient_step(
    W: np.ndarray, 
    X: np.ndarray, 
    Y: np.ndarray, 
    learning_rate=0.01
) -> np.ndarray:
    return W

In [None]:
gradient_step = solutions.gradient_step

In [None]:
W = np.random.rand(X.shape[1])
print('your solution:', gradient_step(W, X, Y))
print('provided solution:', solutions.gradient_step(W, X, Y))
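One more way to sanity-check the partial-derivative formula is to compare it with a finite-difference approximation of the loss. The sketch below does this on small random data; the helpers `loss_by_formula`, `gradient_by_formula` and `numerical_gradient` are hypothetical names, written with explicit loops so they mirror the formulas term by term rather than the vectorized solution.

```python
import numpy as np

def loss_by_formula(W, X, Y):
    """L(W) = 1/(2N) * sum_i (h_W(x_i) - y_i)^2, written with explicit loops."""
    n, k = X.shape
    total = 0.0
    for i in range(n):
        h = sum(W[j] * X[i, j] for j in range(k))
        total += (h - Y[i]) ** 2
    return total / (2 * n)

def gradient_by_formula(W, X, Y):
    """epsilon_j = 1/N * sum_i (h_W(x_i) - y_i) * x_ij for every j."""
    n, k = X.shape
    grad = np.zeros(k)
    for j in range(k):
        for i in range(n):
            h = sum(W[m] * X[i, m] for m in range(k))
            grad[j] += (h - Y[i]) * X[i, j]
    return grad / n

def numerical_gradient(W, X, Y, eps=1e-6):
    """Central finite differences of the loss, one weight at a time."""
    grad = np.zeros_like(W)
    for j in range(W.size):
        step = np.zeros_like(W)
        step[j] = eps
        grad[j] = (loss_by_formula(W + step, X, Y)
                   - loss_by_formula(W - step, X, Y)) / (2 * eps)
    return grad

# small random problem: 8 datapoints, bias column plus 3 features
rng = np.random.default_rng(0)
X_check = np.column_stack([np.ones(8), rng.random((8, 3))])
Y_check = rng.random(8)
W_check = rng.random(4)

max_difference = np.max(np.abs(
    gradient_by_formula(W_check, X_check, Y_check)
    - numerical_gradient(W_check, X_check, Y_check)
))
print("largest difference:", max_difference)
```

If the two gradients agree up to a tiny numerical error, the formula and the loss are consistent with each other.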

### With all those tools at our disposal, let's train a model!


In [None]:
def train_model(
    init_W: np.ndarray,
    X: np.ndarray,
    Y: np.ndarray,
    learning_rate: float,
    num_iterations: int
) -> Tuple[np.ndarray, List[float]]:
    return init_W, []

In [None]:
train_model = solutions.train_model

In [None]:
init_W = np.random.rand(X.shape[1])
print('your solution:', train_model(init_W, X, Y, 0.1, 1))
print('provided solution:', solutions.train_model(init_W, X, Y, 0.1, 1))

In [None]:
init_W = np.random.rand(X.shape[1])
num_iterations = 10000
learning_rate = 0.01

trained_W, loss_hist = train_model(init_W, X, Y, learning_rate, num_iterations)

plt.plot(np.arange(num_iterations), loss_hist)
plt.show()

Y_pred = hypotheses(trained_W, X)

print('example targets', Y[:5])
print('example predictions', Y_pred[:5])

### WTF just happened?

Our algorithms seem to be perfect, yet loss has exploded and our trained weights are NaNs! Why is that?

In [None]:
def pretty_format(to_print, name=None):
    if name is not None: print(name)
    print(["%.2f" % x for x in to_print])

pretty_format(X.mean(axis=0), "means")
pretty_format(X.max(axis=0) - X.min(axis=0), "ranges")


Our features span very different orders of magnitude, ranging from $10^0$ to $10^2$. 

Even though the initial weights are very small, you can guess what such large feature values will do to the initial hypotheses, the values of the loss function and its gradients. 

Moreover, due to the imbalance in the scales of features, the process of training itself will be slower, as updates to some weights will outweigh updates to the others. 
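The effect can be reproduced on a tiny made-up problem. In the sketch below, `toy_loss` and `toy_gradient_descent` are hypothetical helpers (not the ones from `solutions.py`), and the single feature ranges from 0 to 100. With the same learning rate of 0.01, the loss grows on the raw feature and shrinks on the mean-normalized one.

```python
import numpy as np

def toy_loss(W, X, Y):
    """Halved mean squared error for weights W."""
    return float(np.mean((X @ W - Y) ** 2) / 2)

def toy_gradient_descent(X, Y, learning_rate, num_iterations):
    """Plain gradient descent starting from all-zero weights."""
    W = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        gradient = X.T @ (X @ W - Y) / X.shape[0]
        W = W - learning_rate * gradient
    return W

# made-up data: one feature between 0 and 100, target y = 2x + 1
x_raw = np.linspace(0.0, 100.0, 50)
Y_toy = 2.0 * x_raw + 1.0

# the same feature, mean-normalized
x_scaled = (x_raw - x_raw.mean()) / (x_raw.max() - x_raw.min())

X_raw = np.column_stack([np.ones_like(x_raw), x_raw])
X_scaled = np.column_stack([np.ones_like(x_scaled), x_scaled])

# with all-zero weights the predictions are 0, so both versions start from this loss
initial_loss = toy_loss(np.zeros(2), X_raw, Y_toy)

W_raw = toy_gradient_descent(X_raw, Y_toy, learning_rate=0.01, num_iterations=20)
W_scaled = toy_gradient_descent(X_scaled, Y_toy, learning_rate=0.01, num_iterations=20)

raw_loss = toy_loss(W_raw, X_raw, Y_toy)
scaled_loss = toy_loss(W_scaled, X_scaled, Y_toy)

print("initial loss:        ", initial_loss)
print("raw feature loss:    ", raw_loss)
print("scaled feature loss: ", scaled_loss)
```

Only 20 iterations are used so that the diverging run stays within floating-point range; run it for longer and the raw-feature weights eventually overflow.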

![Normalization](img/normalization.png)



## Feature scaling to the rescue!

We want all our features to be roughly in the same range, e.g. [-1, 1]. This is called **data normalization**. 

One way to achieve it is **mean normalization**:

$$x_i = \frac{x_i - \mu_i}{max(x_i) - min(x_i)}$$

The exception is the bias feature - $x_0$ - since it's always equal to 1 (just like we want it to be), we don't normalize it!
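As a small worked example of the formula, here it is applied to a single made-up feature column. The values are arbitrary, and this is only the arithmetic for one column, not the full function from the exercise.

```python
import numpy as np

feature = np.array([100.0, 200.0, 600.0])          # hypothetical raw feature values

feature_mean = feature.mean()                       # 300.0
feature_range = feature.max() - feature.min()       # 500.0

normalized = (feature - feature_mean) / feature_range
print(normalized)
```

The normalized values are centered on zero and fit inside [-1, 1], even though the raw values were in the hundreds.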

Now, implement a function which will calculate the mean-normalized Xs (and keep $X_0$ intact).
The function should return normalized X and calculated means and ranges.
We also want to be able to provide the function with pre-calculated means and ranges and use those, instead of calculating them from provided X.

In [None]:
def mean_normalization(
    X: np.ndarray, 
    means: Optional[np.ndarray] = None, 
    ranges: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    return X, means, ranges

In [None]:
mean_normalization = solutions.mean_normalization

In [None]:
X_norm, X_mean, X_range = mean_normalization(X)
X_norm_sol, X_mean_sol, X_range_sol = solutions.mean_normalization(X)

pretty_format(X_mean, "means yours")
pretty_format(X_mean_sol, "means provided")
print()
pretty_format(X_range, "ranges yours")
pretty_format(X_range_sol, "ranges provided")


In [None]:
# do feature matrices have the same shapes?
X.shape, X_norm.shape

## Now that our data has been normalized, let's try to train a model once more

In [None]:
init_W = np.random.rand(X_norm.shape[1])
num_iterations = 1000
learning_rate = 0.01

trained_W, loss_hist = train_model(init_W, X_norm, Y, learning_rate, num_iterations)

plt.plot(np.arange(num_iterations), loss_hist)
plt.show()

Y_pred = hypotheses(trained_W, X_norm)

print('example targets', Y[:5])
print('example predictions', Y_pred[:5])
print('final loss', loss_hist[-1])

### How does it compare to a scikit-learn model?

In [None]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_norm, Y)
Y_pred_regr = regressor.predict(X_norm)
final_regressor_loss = np.mean((Y_pred_regr - Y) ** 2) / 2

pretty_format(Y[:5], 'example targets')
pretty_format(Y_pred_regr[:5], 'example predictions (scikit-learn)')
pretty_format([loss_hist[-1], final_regressor_loss], 'final loss (ours, scikit-learn)')

Not bad!

As the final exercise, let's make things a bit more difficult for our model.

So far, things have been quite easy for our model - it was evaluated on the same data it had been trained on. 

However, this is not the case in real life. What we care about is whether a trained model is able to make accurate predictions on the data it has never seen before.

That's why, when training **any** model on **any** dataset, the first thing you must do is split the dataset into **training**, **validation** and **test** sets.

For simple models, the validation set can be omitted, so today we'll only use the train and test sets.

In [None]:
def train_test_split(
    X: np.ndarray, 
    Y: np.ndarray,
    ratio: float = 0.7
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    n_datapoints = X.shape[0]
    assert(n_datapoints == Y.shape[0])
    shuffled_indices = np.arange(n_datapoints)
    np.random.shuffle(shuffled_indices)
    train_count = int(n_datapoints * ratio)
    train_indices = shuffled_indices[:train_count]
    test_indices = shuffled_indices[train_count:]

    X_train = X[train_indices]
    Y_train = Y[train_indices]
    X_test = X[test_indices]
    Y_test = Y[test_indices]
    
    return X_train, Y_train, X_test, Y_test


In [None]:
X_train, Y_train, X_test, Y_test = train_test_split(X, Y)

### Remember about data normalization!

#### How to normalize previously unseen data?

We have to make an assumption that the distribution of training data is close to the general distribution of data in our domain. In order for our normalization mapping to be coherent, we will normalize any test data using the means and ranges calculated from $X_{train}$
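To illustrate on made-up numbers: in the sketch below the mean and range come from the training values only, and the very same statistics are then applied to the test values. All names and values here are hypothetical.

```python
import numpy as np

train_feature = np.array([0.0, 5.0, 10.0])   # hypothetical training values
test_feature = np.array([-2.0, 12.0])        # hypothetical unseen values

# statistics are computed from the training data only
train_mean = train_feature.mean()
train_range = train_feature.max() - train_feature.min()

train_normalized = (train_feature - train_mean) / train_range
# the test data reuses the training statistics
test_normalized = (test_feature - train_mean) / train_range

print(train_normalized)
print(test_normalized)
```

Test values can land slightly outside the interval covered by the training data. That is expected; what matters is that both sets go through the same mapping.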


In [None]:
X_train, X_mean, X_range = mean_normalization(X_train)
# this is why we keep normalization data
X_test, X_mean, X_range = mean_normalization(X_test, X_mean, X_range)

In [None]:
def train_test_model(
    init_W: np.ndarray,
    X_train: np.ndarray,
    Y_train: np.ndarray,
    X_test: np.ndarray,
    Y_test: np.ndarray,
    learning_rate: float,
    num_iterations: int
) -> Tuple[np.ndarray, List[float], List[float]]:
    
    W = init_W
    train_loss_history = []
    test_loss_history = []
    for i in range(num_iterations):
        train_loss_history.append(loss(W, X_train, Y_train))
        test_loss_history.append(loss(W, X_test, Y_test))
        W = gradient_step(W, X_train, Y_train, learning_rate)
    return W, train_loss_history, test_loss_history

In [None]:
init_W = np.random.rand(X_train.shape[1])
num_iterations = 1000
learning_rate = 0.01

trained_W, train_loss_hist, test_loss_hist = train_test_model(
    init_W, 
    X_train, 
    Y_train, 
    X_test, 
    Y_test, 
    learning_rate, 
    num_iterations
)

plt.plot(np.arange(num_iterations), train_loss_hist)
plt.plot(np.arange(num_iterations), test_loss_hist, color='red')

plt.show()

In [None]:
Y_pred = hypotheses(trained_W, X_test)
# compare against the test targets, since the predictions were made on X_test
pretty_format(Y_test[:5], 'example targets')
pretty_format(Y_pred[:5], 'example_predictions')
pretty_format([train_loss_hist[-1], test_loss_hist[-1]], 'final loss (train, test)')

Our predictions are not quite perfect and the loss leaves something to be desired. 
But we also see that our model has learnt *some* intuition about making predictions from real-life data, including data it has never seen before. 

That's pretty good!