# Welcome!

Today we're going to revisit last time's topic, **Linear Regression**, in a more general form. You'll see that the algorithm barely changes when we fit the model to more than one feature!

This is also a chance to introduce and tackle some problems that data scientists face on a daily basis, such as:
- data normalization
- regularization of the cost function
- overfitting
- dividing the data into training and test sets

In [2]:
# imports
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, fixed
import ipywidgets as widgets
import solutions

%matplotlib inline

# Problem

## Previously

### $$\hat{y} = h_W(x) = w_0 + w_1x$$ 

## Today

### $$\hat{y} = h_W(x_1, x_2, ..., x_k) = w_0 + w_1x_1+ w_2x_2+ w_3x_3+ ... + w_kx_k = w_0 + \sum_{i=1}^k w_i x_i$$ 

## or:

### $W$ - vector of coefficients or 'weights'

$$ W = [w_0, w_1, ..., w_k]$$

### $X^{(i)}$ - vector of features of the i-th example
$$ X^{(i)} = [x^{(i)}_0, x^{(i)}_1, ..., x^{(i)}_k] $$
... where $x^{(i)}_0 = 1$, so that
$$h_W(X^{(i)}) = \sum_{j=0}^k w_j x^{(i)}_j = W \cdot X^{(i)}$$


In [3]:
def add_bias_feature(X):
    return np.c_[np.ones(len(X)), X]

X = np.array([[1,2,3], [4,5,6]])
print(X)
print(add_bias_feature(X))

[[1 2 3]
 [4 5 6]]
[[ 1.  1.  2.  3.]
 [ 1.  4.  5.  6.]]


In [4]:
def hypotheses(W, X):
    # W: a vector of weights
    # X: a list of feature vectors of some objects (so effectively a matrix)
    # return a vector of hypotheses for *all* x-s 
    return(np.dot(X, W))

In [7]:
hypotheses = solutions.hypotheses
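
As a quick sanity check, the sketch below compares the vectorized hypothesis with the explicit sum $\sum_{j=0}^k w_j x^{(i)}_j$. The two helper functions are repeated so the snippet runs on its own, and the weights in `W` are arbitrary values chosen only for illustration.

```python
import numpy as np

def add_bias_feature(X):
    # Prepend a column of ones so that w_0 acts as the intercept.
    return np.c_[np.ones(len(X)), X]

def hypotheses(W, X):
    # One prediction per row of X.
    return np.dot(X, W)

# Example data: 2 examples with 3 features each, plus the bias feature.
X = add_bias_feature(np.array([[1, 2, 3], [4, 5, 6]]))
W = np.array([0.5, 1.0, -1.0, 2.0])  # arbitrary illustrative weights, w_0 first

vectorized = hypotheses(W, X)

# The same computation written as the explicit sum over j.
explicit = np.array([sum(W[j] * x[j] for j in range(len(W))) for x in X])

print(vectorized)
print(explicit)
```

Both lines should print the same two values, one prediction per row of `X`.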

# Cost function

### $$L = \frac{1}{2N}\sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})^2 $$

## Previously

### $$L(w_0, w_1) = \frac{1}{2N}\sum_{i=1}^N(w_0 + w_1x^{(i)} - y^{(i)})^2 $$

## Today

### $$L(w_0, w_1, ..., w_k) = L(W) = \frac{1}{2N}\sum_{i=1}^N\left(\sum_{j=0}^k w_j x^{(i)}_j - y^{(i)}\right)^2 = \frac{1}{2N}\sum_{i=1}^N (h_W(x^{(i)}) - y^{(i)})^2$$

In [8]:
def cost(W, X, Y):
    # W: a vector of weights
    # X: a list of feature vectors of some objects (so effectively a matrix)
    # Y: a vector of our values
    # return cost (a scalar)
    differences = hypotheses(W, X) - Y
    return(1 / (2 * X.shape[0]) * np.dot(differences, differences))

In [9]:
cost = solutions.cost
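
To get a feel for the cost function, here is a minimal sketch on a tiny made-up dataset that follows $y = 1 + 2x$ exactly. The data and both weight vectors are assumptions chosen for illustration; `cost` is repeated so the snippet is self-contained.

```python
import numpy as np

def cost(W, X, Y):
    # Half of the mean squared difference between predictions and targets.
    differences = np.dot(X, W) - Y
    return np.dot(differences, differences) / (2 * X.shape[0])

# Bias column plus one feature; the targets follow y = 1 + 2x exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])

perfect_cost = cost(np.array([1.0, 2.0]), X, Y)  # weights that fit the data exactly
zero_cost = cost(np.array([0.0, 0.0]), X, Y)     # all-zero weights

print(perfect_cost)
print(zero_cost)
```

The weights that reproduce the data give a cost of zero, while the all-zero weights give $(1^2 + 3^2 + 5^2) / (2 \cdot 3) = 35/6$.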

# Gradient descent

In every iteration:
* calculate the partial derivatives of the cost function with respect to every element of $W$:

$$\epsilon_j = \frac{\partial}{\partial w_j}L(W) = \frac{1}{N} \sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})x_j^{(i)}$$

* **simultaneously** update every element of W:

$$w_j = w_j - \alpha \epsilon_j$$ 

Here $\alpha$ is our learning rate.
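
The formula for $\epsilon_j$ follows from the chain rule applied to the cost function, using $\frac{\partial}{\partial w_j} h_W(x^{(i)}) = x_j^{(i)}$:

$$\frac{\partial}{\partial w_j}L(W) = \frac{1}{2N}\sum_{i=1}^N 2\,(h_W(x^{(i)}) - y^{(i)})\,\frac{\partial}{\partial w_j}h_W(x^{(i)}) = \frac{1}{N} \sum_{i=1}^N(h_W(x^{(i)}) - y^{(i)})x_j^{(i)}$$

Assuming $X$ denotes the $N \times (k+1)$ matrix whose rows are the vectors $X^{(i)}$ (bias feature included) and $Y$ the vector of target values, all partial derivatives can be collected into a single vector:

$$\epsilon = \frac{1}{N} X^T (XW - Y)$$

Computing the whole vector $\epsilon$ before changing any element of $W$ is what makes the update simultaneous.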

In [10]:
def gradient_step(W, X, Y, learning_rate=0.1):
    # W: a vector of weights
    # X: a list of feature vectors of some objects (so effectively a matrix)
    # Y: a vector of our values
    # return a vector of new values of W
    differences = hypotheses(W, X) - Y
    eps = 1 / X.shape[0] * np.dot(np.transpose(X), differences)
    return(W - learning_rate * eps)

In [11]:
gradient_step = solutions.gradient_step
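
Repeating the gradient step many times gives the full training loop. The sketch below is one possible way to do it, assuming noiseless synthetic data generated from hypothetical weights `true_W`; the number of iterations and the learning rate are illustrative choices, not tuned values.

```python
import numpy as np

def gradient_step(W, X, Y, learning_rate=0.1):
    # Compute the full gradient first, then update all weights at once.
    differences = np.dot(X, W) - Y
    eps = np.dot(X.T, differences) / X.shape[0]
    return W - learning_rate * eps

# Synthetic data: 100 examples, 2 features drawn uniformly from [-1, 1].
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(100, 2))
X = np.c_[np.ones(len(features)), features]  # add the bias feature x_0 = 1

true_W = np.array([1.0, 2.0, -3.0])  # hypothetical weights used to generate Y
Y = np.dot(X, true_W)                # noiseless targets

# Start from all-zero weights and take repeated gradient steps.
W = np.zeros(X.shape[1])
for _ in range(2000):
    W = gradient_step(W, X, Y, learning_rate=0.1)

print(W)
```

Because the targets contain no noise, the learned `W` should end up very close to `true_W`. With real data, the features usually need to be normalized first for a single learning rate to work well.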