# Multivariate Linear Regression

## Welcome!

Today we'll revisit last time's topic, **Linear Regression**, in a more general form.

You'll see that with just a few changes, we can apply a Linear Regression model to problems much more interesting than fitting straight lines!

In [None]:
# imports
import sys
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, fixed
import ipywidgets as widgets
import solutions
import importlib.util
import pandas as pd
import sklearn as sk

%load_ext autoreload
%autoreload 2
%matplotlib inline

## Previously

### $$y = w_0 + w_1 \cdot x$$



In [None]:
X = solutions.X_1
Y = solutions.Y_1

plt.scatter(X, Y)

* we want to find the optimal $(w_0, w_1)$ - our **model**

* our model will be able to make a **hypothesis**

### $$ h_W(x) = w_0 + w_1 \cdot x$$

* a **loss function** lets us calculate how well our model **fits** the data. In this case, the loss function could look like this:

### $$L_{prev} = \frac{1}{N}\sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})^2 $$
 
 Though it's good enough for our purposes, we'll divide the loss function by 2. This doesn't change how it ranks models, but it will simplify later computations (can you guess how?)
 
### $$L = \frac{1}{2N}\sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})^2 $$

* We can find $(w_0, w_1)$ by using **Gradient Descent** method. 

* If we calculate gradients of $L$ with respect to $(w_0, w_1)$, or $\dfrac{\partial L}{\partial w_0}$ and $\dfrac{\partial L}{\partial w_1}$, we will know how to shift the values of $(w_0, w_1)$, so that they will fit the data better.

### $$
\dfrac{\partial L}{\partial w_0} = \dfrac{1}{N}\sum_{i=1}^N (w_0 + w_1 x^{(i)} - y^{(i)}) \\
\dfrac{\partial L}{\partial w_1} = \dfrac{1}{N}\sum_{i=1}^N (w_0 + w_1 x^{(i)} - y^{(i)}) \cdot x^{(i)}
$$

![The idea of Gradient Descent](../1_regression/img/gradient_descent_0.png)

* We multiply the gradients by a **learning rate** $\alpha$, so that the updates are small and don't overshoot their objective.

### $$
w_0 = w_0 - \dfrac{\partial L}{\partial w_0} \cdot \alpha \\
w_1 = w_1 - \dfrac{\partial L}{\partial w_1} \cdot \alpha
$$
* We repeat that process for an arbitrary number of **epochs**

![Learning rates](../1_regression/img/learning_rate.png)
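
To recap the whole procedure in one place, here is a minimal sketch of gradient descent for the single-feature case. It is not the `train_model` from the previous notebook's solutions; the function name `train_univariate_sketch` and the toy data are made up for illustration.

```python
import numpy as np

def train_univariate_sketch(X, Y, w_0=0.0, w_1=0.0, learning_rate=0.01, num_epochs=100):
    """Plain gradient descent for y = w_0 + w_1 * x (illustrative only)."""
    loss_history = []
    for _ in range(num_epochs):
        errors = w_0 + w_1 * X - Y        # h_W(x) - y for every datapoint
        grad_w_0 = errors.mean()          # dL/dw_0
        grad_w_1 = (errors * X).mean()    # dL/dw_1
        # both gradients were computed from the old weights, so this is a simultaneous update
        w_0 = w_0 - learning_rate * grad_w_0
        w_1 = w_1 - learning_rate * grad_w_1
        # loss L (with the 1/2 factor) after the update
        loss_history.append(((w_0 + w_1 * X - Y) ** 2).mean() / 2)
    return w_0, w_1, loss_history

# hypothetical toy data lying exactly on the line y = 1 + 2x
X_toy = np.linspace(0, 1, 20)
Y_toy = 1 + 2 * X_toy
w_0, w_1, history = train_univariate_sketch(X_toy, Y_toy, learning_rate=0.5, num_epochs=2000)
print(w_0, w_1, history[-1])
```

On this noise-free toy data the weights should approach $w_0 = 1$ and $w_1 = 2$, and the loss should keep shrinking.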

In [None]:
spec = importlib.util.spec_from_file_location(
    "solutions", 
    "../1_regression/solutions.py"
)
solutions_1 = importlib.util.module_from_spec(spec)
spec.loader.exec_module(solutions_1)

In [None]:
init_w_0 = np.random.rand()
init_w_1 = np.random.rand()
learning_rate = 0.01
num_iterations = 100

trained_w_0, trained_w_1, loss_history = \
    solutions_1.train_model(init_w_0, init_w_1, X, Y, learning_rate, num_iterations)

plt.plot(list(range(num_iterations)), loss_history)
plt.show()

In [None]:
Y_pred = trained_w_0 + trained_w_1 * X 
plt.scatter(X, Y)
plt.plot(X, Y_pred, 'r')
plt.show()
print('w_0:', trained_w_0)
print('w_1:', trained_w_1)
print('Loss:', solutions_1.my_loss_vectorized(trained_w_0, trained_w_1, X, Y))

## Now let's go bigger!

Today, we'll apply linear regression to real-life data!

In [None]:
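# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2 to run
from sklearn.datasets import load_boston
dataset = load_boston()
print(dataset.DESCR)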

In [None]:
df = pd.DataFrame(dataset.data)
df.columns = dataset.feature_names
df['target'] = dataset.target
df.head()

In [None]:
df.describe()

We're dealing with a dataset describing houses - each of them has **13** features. Let's see how each of them is related to our target - the price of the house!

In [None]:
for feature in df.columns:
    if feature != 'target':
        print(f"-------{feature}--------")
        plt.scatter(df[feature], df['target'])
        plt.show()

As the plots show, most features have some visible relationship with the target.

Now, can we use what we already know to train a model that makes reasonably accurate predictions?

In [None]:
X = dataset.data
Y = dataset.target

X.shape, Y.shape

The main difference is that previously:

### $$\hat{y} = h_W(x) = w_0 + w_1x$$ 

And today:

### $$\hat{y} = h_W(x_1, x_2, ..., x_k) \\
= w_0 + w_1x_1+ w_2x_2+ w_3x_3+ ... + w_kx_k \\
= w_0 + \sum_{i=1}^k w_i x_i$$ 

As you can see, $w_0$ has been left out of the sum, which makes it sad. Can we do something to include it there?

The simple solution is to add a *bias feature* to our input dataset - $X$ - i.e. add a column of ones to it.

This way, for each datapoint $x^{(j)}$,  
### $$x_0^{(j)} = 1$$
and 
### $$ x_0^{(j)} \cdot w_0 = w_0 $$

therefore

### $$ w_0 + \sum_{i=1}^k w_i x_i^{(j)} =  \sum_{i=0}^k w_i x_i^{(j)} = h_W(x^{(j)})$$

In [None]:
X = np.column_stack([np.ones(X.shape[0]), X])
X.shape, Y.shape

Calculating this manually every time is not a good idea. 

Your task now is to implement a function which will compute the hypotheses for given data $(X)$ and model $(w_0, w_1, w_2, w_3, w_4, w_5, w_6, w_7, w_8, w_9, w_{10}, w_{11}, w_{12}, w_{13})$

In [None]:
def hypothesis(
    X: np.ndarray,
    w_0: float, 
    w_1: float, 
    w_2: float, 
    w_3: float, 
    w_4: float, 
    w_5: float, 
    w_6: float, 
    w_7: float, 
    w_8: float, 
    w_9: float, 
    w_10: float, 
    w_11: float, 
    w_12: float, 
    w_13: float
) -> np.ndarray:
    pass
    # PLEASE DON'T EVER DO THAT

Wow, I got tired of even writing this header!

We obviously need something more elegant. This is why, from now on, we'll always think of particular datapoints not as numbers, but vectors of numbers. Therefore, the whole dataset will be a **vector of vectors** - a matrix.

The same way, we won't care much for every particular weight in our model, we'll treat them as a single vector of numbers.

So: 
### $$
\hat{y}^{(j)} = h_W(x^{(j)}) = \sum_{i=0}^k w_i x_i^{(j)} = W \cdot x^{(j)}
$$

* $W$ has shape $[n_{features}]$
* $X$ has shape $[n_{datapoints}, n_{features}]$
* $Y$ has shape $[n_{datapoints}]$
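
To make the vector view concrete, here is a small sketch for a *single* datapoint, showing that the explicit sum and a dot product give the same number. The values of `x_j` and `W_toy` are made up for illustration; handling the whole matrix $X$ at once is still up to you.

```python
import numpy as np

# hypothetical toy values: one datapoint with a bias feature and two real features
x_j = np.array([1.0, 2.0, 3.0])      # x_0 = 1 is the bias feature
W_toy = np.array([0.5, 1.0, -1.0])   # w_0, w_1, w_2

# the sum from the formula, written out term by term
explicit_sum = sum(w_i * x_i for w_i, x_i in zip(W_toy, x_j))
# the same thing as a dot product of two vectors
dot_product = np.dot(W_toy, x_j)

print(explicit_sum, dot_product)
```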

Please implement it. If you use numpy magic instead of iterating over columns, it should take you just one line of code!

In [None]:
def hypotheses(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    # placeholder: one prediction per datapoint; replace with your implementation
    return np.zeros(X.shape[0])

In [None]:
hypotheses = solutions.hypotheses

In [None]:
# let's make a sanity check on a few examples!
W = np.random.rand(X.shape[1])
print('your solution:', hypotheses(W, X[:5]))
print('provided solution:', solutions.hypotheses(W, X[:5]))


This also means we have to update the formula for the cost function:

### $$
L(w_0, w_1, ..., w_k) = L(W) \\ 
= \frac{1}{2N}\sum_{i=1}^N\left(\sum_{j=0}^k w_j x^{(i)}_j - y^{(i)}\right)^2 \\
= \frac{1}{2N}\sum_{i=1}^N (h_W(x^{(i)}) - y^{(i)})^2
$$
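
If it helps, the same loss can also be written in matrix form. This is only a restatement, assuming $X$ is the $N \times (k+1)$ matrix that includes the bias column and $W$, $Y$ are treated as column vectors:

### $$
L(W) = \frac{1}{2N}(XW - Y)^T(XW - Y)
$$

Here $XW$ is the vector of all $N$ hypotheses, so $XW - Y$ collects the errors for every datapoint.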

In [None]:
def loss(W: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    return 0

In [None]:
loss = solutions.loss

In [None]:
W = np.random.rand(X.shape[1])
print('your solution:', loss(W, X, Y))
print('provided solution:', solutions.loss(W, X, Y))

...and Gradient Steps

For every iteration:
* calculate the partial derivatives of the cost function with respect to every element of W:

$$\epsilon_j = \frac{\partial}{\partial w_j}L(W) = \frac{1}{N} \sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})x_j^{(i)}$$

* **simultaneously** update every element of W:

$$w_j = w_j - \alpha \epsilon_j$$ 

Where $\alpha$ is our learning rate.
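
One way to sanity-check a hand-written gradient is to compare it against a finite-difference approximation. The sketch below is an optional debugging aid, not part of the exercise: `numerical_gradient` and `toy_f` are hypothetical helper names, and the check is shown on a simple function whose gradient is known.

```python
import numpy as np

def numerical_gradient(f, W, eps=1e-6):
    """Approximate the gradient of f at W with central differences."""
    grad = np.zeros_like(W, dtype=float)
    for j in range(W.size):
        step = np.zeros_like(W, dtype=float)
        step[j] = eps
        # slope of f along the j-th coordinate
        grad[j] = (f(W + step) - f(W - step)) / (2 * eps)
    return grad

# toy function with a known gradient: f(W) = sum(W^2) / 2, so grad f(W) = W
def toy_f(W):
    return np.sum(W ** 2) / 2

W_check = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(toy_f, W_check)
print(approx)
```

To check your own `gradient_step`, you could compare `(W - gradient_step(W, X, Y, lr)) / lr` with `numerical_gradient(lambda w: loss(w, X, Y), W)`; the two should roughly agree.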

In [None]:
def gradient_step(
    W: np.ndarray, 
    X: np.ndarray, 
    Y: np.ndarray, 
    learning_rate=0.01
) -> np.ndarray:
    return W

In [None]:
gradient_step = solutions.gradient_step

* gradient descent on multiple values
* notice that the values come from different ranges => normalization
* notice that we are overfitting => train/test split

# Polynomial regression: a possible use case

This is a plot of a secret polynomial:

In [None]:
secret = solutions.secret_polynomial
X = np.arange(-4, 4, 0.7)
Y = [secret(x) for x in X]
plt.scatter(X, Y)
plt.show()

What degree of a polynomial could that be?

In [None]:
def to_poly_features(X, proposed_degree):
    # notice that x ** 0 = 1, so bias feature is already added here
    # the powers go from 0 to proposed_degree - 1, giving proposed_degree features
    return np.array([[x ** n for n in range(proposed_degree)] for x in X])
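
As a side note, NumPy can build the same matrix of powers with `np.vander`. The sketch below repeats `to_poly_features` so that it runs on its own and checks that both approaches agree; `X_demo` is a made-up input.

```python
import numpy as np

def to_poly_features(X, proposed_degree):
    # same definition as in the cell above
    return np.array([[x ** n for n in range(proposed_degree)] for x in X])

X_demo = np.arange(-4, 4, 0.7)
by_hand = to_poly_features(X_demo, 4)
# increasing=True orders the columns as x^0, x^1, x^2, ...
with_vander = np.vander(X_demo, N=4, increasing=True)

print(by_hand.shape, np.allclose(by_hand, with_vander))
```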

In [None]:
proposed_degree = 7
features = to_poly_features(X, proposed_degree)
targets = np.array(Y)
for i in range(5):
    print(features[i])
    print()

## Feature scaling to the rescue!

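We want all our features to be roughly in the same range, e.g. [-1, 1]. This is called **data normalization**. 

One way to achieve it is **mean normalization**, where $\mu_i$ is the mean of feature $i$ and the maximum and minimum are taken over that feature's values across the dataset:

$$x_i = \frac{x_i - \mu_i}{\max(x_i) - \min(x_i)}$$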

Of course, since $x_0$ is always equal to 1, we don't normalize it!
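
Here is a small worked example of the formula on a single, made-up feature column. It is only a sketch of the arithmetic; handling a whole matrix, skipping the bias column and reusing previously computed means and ranges is left for the exercise below.

```python
import numpy as np

# hypothetical single feature column
column = np.array([2.0, 4.0, 6.0, 8.0])

mean = column.mean()                        # average of the feature
value_range = column.max() - column.min()   # spread of the feature
normalized = (column - mean) / value_range

print(normalized)
```

After this transformation the values are centred around zero and span a range of exactly 1.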

In [None]:
def mean_normalization(X, means=None, ranges=None):
    # implement me!
    # X - a matrix of features
    # calculate means and ranges if necessary
    # calculate normalized matrix X using calculated or given means and ranges
    # return X, means and ranges (we may want to reuse them)
    # do not normalize the first column of ones!
    return X, means, ranges  # placeholder so the cell runs; replace with your implementation

In [None]:
mean_normalization = solutions.mean_normalization

In [None]:
features, means, ranges = mean_normalization(features)
for i in range(5):
    print(features[i])
    print()

In [None]:
W = np.random.rand(proposed_degree)
print(W)
costs = []
steps = 100000

for i in range(steps):
    W = gradient_step(W, features, targets, 0.01)
    costs.append(loss(W, features, targets))

# it is always a good idea to plot the cost function to see how learning goes
step_nums = [i for i in range(steps)]
plt.scatter(x=step_nums, y=costs)
plt.show()
print(W)

In [None]:
calculated_targets = hypotheses(W, features)
plt.scatter(X, Y)
plt.plot(X, calculated_targets, color='red')
plt.show()
W

In [None]:
more_X = np.arange(-6, 6, 0.3)
more_Y = [secret(x) for x in more_X]
more_features = to_poly_features(more_X, proposed_degree)
more_features, means, ranges = mean_normalization(more_features, means, ranges)
more_targets = np.array(more_Y)

more_calculated_targets = hypotheses(W, more_features)
plt.scatter(more_X, more_Y)
plt.plot(more_X, more_calculated_targets, color='red')
plt.show()
W



In [None]:
interact(solutions.perform_polynomial_regression,
        steps=widgets.IntSlider(min=100,max=1000000,step=1000,value=1000), 
        degree=widgets.IntSlider(min=1,max=30,step=1,value=1))

# Let's play with real-life data!


In [None]:
houses = np.genfromtxt('houses.csv', delimiter=',')
# area, number of bedrooms, price

houses.shape

In [None]:
prices = houses[:, 2]
# relationship between area and price
areas = houses[:, 0]
plt.scatter(areas, prices)
plt.show()

# relationship between number of bedrooms and price
bedrooms_nos = houses[:, 1]
plt.scatter(bedrooms_nos, prices)
plt.show()

First, data must be normalized

In [None]:
# add the bias column of ones, the same way as for the previous dataset
features = np.column_stack([np.ones(len(houses)), houses[:, :-1]])
targets = houses[:, 2]
features, _, _ = mean_normalization(features)

Now, let's split data into training and test sets

In [None]:
import random

train_size = int(len(houses) * 2/3) 
# randomly pick the row indices that go into the training set
train_numbers = random.sample(range(len(houses)), train_size)

train_features = np.array([features[i] for i in range(len(houses)) if i in train_numbers])
train_targets = np.array([targets[i] for i in range(len(houses)) if i in train_numbers])

test_features = np.array([features[i] for i in range(len(houses)) if i not in train_numbers])
test_targets = np.array([targets[i] for i in range(len(houses)) if i not in train_numbers])

len(train_features), len(train_targets), len(test_features), len(test_targets)
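
For comparison, here is a sketch of an alternative way to do the same kind of split by shuffling row indices once with NumPy. The helper name `train_test_split_sketch`, the fixed seed and the toy arrays are assumptions made for illustration; it is not the method used in the cell above.

```python
import numpy as np

def train_test_split_sketch(features, targets, train_fraction=2/3, seed=0):
    """Shuffle row indices once, then slice them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    train_size = int(len(features) * train_fraction)
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    return features[train_idx], targets[train_idx], features[test_idx], targets[test_idx]

# hypothetical toy data: 9 datapoints with 2 features each
toy_features = np.arange(18, dtype=float).reshape(9, 2)
toy_targets = np.arange(9, dtype=float)
tr_X, tr_Y, te_X, te_Y = train_test_split_sketch(toy_features, toy_targets)
print(tr_X.shape, te_X.shape)
```

Indexing with an array of shuffled indices avoids checking `i in train_numbers` for every row, which can matter for larger datasets.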

Let's train our model. It will see only the training set during training. 

We will plot the cost function to see how well the model performs on the training data, but also, separately, plot cost calculated for test data. This will help us see how well the model generalizes.

In [None]:
W = np.random.rand(3)
train_costs = []
test_costs = []
steps = 1000

for i in range(steps):
    W = gradient_step(W, train_features, train_targets, 0.1)
    train_costs.append(loss(W, train_features, train_targets))
    test_costs.append(loss(W, test_features, test_targets))
    
step_nums = list(range(steps))
# training cost in blue, test cost in red
plt.scatter(x=step_nums, y=train_costs, color='blue')
plt.scatter(x=step_nums, y=test_costs, color='red')

plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(areas, bedrooms_nos, prices)
ax.view_init(40, 100)

ax.set_xlabel('area')
ax.set_ylabel('bedrooms')
ax.set_zlabel('price')
plt.show()