<a href="https://colab.research.google.com/github/ch00226855/CMP414765Spring2021/blob/main/Week06_MultilinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 6
# Multilinear Regression

Last time we looked at a simple linear regression model $sales = \beta_0 + \beta_1\cdot\textit{TV advertising budget}$. More generally, a linear model makes a prediction by computing a weighted sum of its input features (plus a constant).

**Reading: Chapter 4**

## Multilinear Regression: Model Assumptions
**Model**:

$\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 +\cdots + \theta_nx_n$
1. $\hat{y}$ is the predicted value.
2. $n$ is the number of features.
3. $x_j$ is the j-th feature value.
4. $\theta_j$ is the j-th model parameter: $\theta_0$ is the constant (intercept) term, and $\theta_j$ for $j \geq 1$ is the weight associated with $x_j$.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Toy example
columns = ['Homework', 'Midterm', 'Final']
data = pd.DataFrame({
    "Homework": [95, 70, 80, 100, 70],
    "Midterm": [90, 60, 80, 80, 85],
    "Final": [93, 66, 85, 60, 90]
}, index=["Alice", "Bob", "Clare", "David", "Eve"])

data.head()

| | Homework | Midterm | Final |
|---|---|---|---|
| Alice | 95 | 90 | 93 |
| Bob | 70 | 60 | 66 |
| Clare | 80 | 80 | 85 |
| David | 100 | 80 | 60 |
| Eve | 70 | 85 | 90 |


In this case:
- $x_1$ is the homework feature
- $x_2$ is the midterm feature
- $y$ is the final exam score (the target value to predict)
- The model is: $\widehat{final} = \theta_0 + \theta_1 \cdot homework + \theta_2 \cdot midterm$
- We need to come up with values for $\theta_0, \theta_1, \theta_2$ to complete the model.

**Objective**: Suppose that another student, Fred, has a Homework score of 85 and a Midterm score of 80. What is the predicted value of his final exam score?

## Multilinear Regression: Vectorized form

The multilinear model can also be written as:

**$\hat{y} = \theta\cdot\textbf{x}$**.
1. $\theta = (\theta_0, \theta_1, ..., \theta_n)$ is the parameter vector.
2. $\textbf{x} = (1, x_1, ..., x_n)$ is the feature vector.
3. The symbol $\cdot$ represents the inner product of two vectors. For example, $(1, 2, 3)\cdot (4, 5, 6) = 1\times 4 + 2\times 5 + 3\times 6 = 32$.

**Why is the expression $\theta\cdot\textbf{x}$ equivalent to $\theta_0 + \theta_1x_1 + \theta_2x_2 +\cdots + \theta_nx_n$?**
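
One way to explore this question numerically is to compute both expressions and compare them. The sketch below uses made-up parameter values (they are not fitted to any data) together with Fred's scores. The leading 1 in $\textbf{x}$ is what pairs with $\theta_0$ in the inner product.

```python
import numpy as np

# Hypothetical parameter values, chosen only to illustrate the arithmetic.
theta = np.array([10.0, 0.25, 0.75])   # (theta_0, theta_1, theta_2)
features = np.array([85.0, 80.0])      # (x_1, x_2): Fred's homework and midterm scores

# Prepend the constant 1 so that theta_0 is multiplied by 1 in the inner product.
x = np.concatenate(([1.0], features))

# Expanded form: theta_0 + theta_1 * x_1 + theta_2 * x_2
expanded = theta[0] + theta[1] * features[0] + theta[2] * features[1]

# Vectorized form: theta . x
vectorized = np.dot(theta, x)

print(expanded, vectorized)
```

Both numbers should agree, because the first term of the inner product is $\theta_0 \times 1 = \theta_0$.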

In [None]:
# Let's apply the linear regression tool in scikit-learn to the toy example



In [None]:
# Retrieve the estimated parameter values.
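
A minimal sketch of how these two cells might be filled in, assuming scikit-learn's `LinearRegression` class is the tool in question. The variable names `features`, `target`, `model`, `fred` and `fred_prediction` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Recreate the toy data so this sketch runs on its own.
data = pd.DataFrame({
    "Homework": [95, 70, 80, 100, 70],
    "Midterm": [90, 60, 80, 80, 85],
    "Final": [93, 66, 85, 60, 90]
}, index=["Alice", "Bob", "Clare", "David", "Eve"])

features = data[["Homework", "Midterm"]]   # x_1 and x_2
target = data["Final"]                     # y

# Fit the model; scikit-learn adds the constant term theta_0 itself.
model = LinearRegression()
model.fit(features, target)

# theta_0 is stored in intercept_, (theta_1, theta_2) in coef_.
print("theta_0:", model.intercept_)
print("theta_1, theta_2:", model.coef_)

# Predict Fred's final score from his homework and midterm scores.
fred = pd.DataFrame({"Homework": [85], "Midterm": [80]})
fred_prediction = model.predict(fred)
print("Predicted final score for Fred:", fred_prediction[0])
```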



## Multilinear Regression: Cost Function
In order to calculate the best value for each parameter, we need a **cost function** that evaluates the errors made by a given set of parameter values. Here we use the **mean squared error (MSE)** function as the cost function:

$J(\textbf{X}, \theta) = \frac{1}{m}\sum_{i=1}^{m}\big(\theta\cdot\textbf{x}^{(i)} - y^{(i)}\big)^2$

Here $(\textbf{x}^{(i)}, y^{(i)})$ represents the i-th training example, and $m$ is the number of training examples.
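
The MSE formula can be sketched in NumPy as follows. The helper name `mse_cost` and the parameter values in `theta_guess` are assumptions made for illustration; the guess is arbitrary and is not the fitted model. Each row of `X` is a feature vector $(1, x_1, x_2)$ from the toy example, so `X @ theta` computes $\theta\cdot\textbf{x}^{(i)}$ for every student at once.

```python
import numpy as np

def mse_cost(X, y, theta):
    """Mean squared error of the predictions X @ theta against targets y."""
    residuals = X @ theta - y          # theta . x^(i) - y^(i) for every i
    return np.mean(residuals ** 2)     # (1/m) * sum of squared residuals

# Toy data: each row is (1, homework, midterm); y holds the final scores.
X = np.array([
    [1, 95, 90],
    [1, 70, 60],
    [1, 80, 80],
    [1, 100, 80],
    [1, 70, 85],
], dtype=float)
y = np.array([93, 66, 85, 60, 90], dtype=float)

# Hypothetical guess: predict the final as the average of homework and midterm.
theta_guess = np.array([0.0, 0.5, 0.5])
print(mse_cost(X, y, theta_guess))
```

Replacing `theta_guess` with the parameters estimated by scikit-learn gives the cost of the fitted model.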

In [3]:
# Calculate the MSE cost of the toy example for the parameter values given by scikit-learn.



## Multilinear Regression: Training Algorithm 1
The value of $\theta$ that minimizes the cost function is given by the following **normal equation**:

$\hat{\theta} = \big(\textbf{X}^T\cdot\textbf{X}\big)^{-1}\cdot\textbf{X}^T\cdot\textbf{y}$.

1. $\textbf{X}$ is an $m\times (n+1)$ matrix whose i-th row is $\textbf{x}^{(i)}$.
$$\textbf{X} = \begin{pmatrix}
1 & x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\
1 & x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\
\vdots & \vdots &\vdots & \ddots & \vdots \\
1 & x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n \\
\end{pmatrix}$$
2. $$\textbf{y} = \begin{pmatrix}y^{(1)} \\ \vdots \\ y^{(m)}\end{pmatrix}$$.
3. The cost function $J(\theta)$ also has a matrix expression
$$J(\theta) = \frac{1}{m}(\textbf{X}\cdot\theta - \textbf{y})^T\cdot (\textbf{X}\cdot\theta - \textbf{y})$$
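
The normal equation can be applied to the toy example along the following lines. This is a sketch under the assumption that $\textbf{X}^T\cdot\textbf{X}$ is invertible for this data; the names `scores`, `theta_hat` and `fred` are illustrative.

```python
import numpy as np

# Homework and midterm scores (one row per student) and final scores.
scores = np.array([
    [95, 90],
    [70, 60],
    [80, 80],
    [100, 80],
    [70, 85],
], dtype=float)
y = np.array([93, 66, 85, 60, 90], dtype=float)

# Build X by placing a column of ones in front of the feature columns.
X = np.hstack([np.ones((scores.shape[0], 1)), scores])

# Normal equation: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("theta_hat:", theta_hat)

# Prediction for Fred (homework 85, midterm 80), with the leading 1 included.
fred = np.array([1.0, 85.0, 80.0])
print("Predicted final score for Fred:", fred @ theta_hat)
```

The explicit inverse mirrors the formula above. In practice `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq(X, y, rcond=None)` is usually preferred for numerical stability.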

In [4]:
# Construct matrix X using np.hstack(), np.ones()



In [5]:
# Construct vector y



In [6]:
# Apply the normal equation to find theta



## Multilinear Regression: Training Algorithm 2
The normal equation is not applicable when $\textbf{X}^T\cdot\textbf{X}$ is not invertible. This happens when:
- Several features are linearly dependent (for example, feature3 = feature1 + feature2)
- The number of features is greater than the number of training examples (for example, DNA data)
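
Both situations can be checked numerically by looking at the rank of $\textbf{X}^T\cdot\textbf{X}$: the matrix is invertible only if its rank equals its number of columns. The small arrays below are made up for illustration.

```python
import numpy as np

# Case 1: feature3 = feature1 + feature2 (linearly dependent features).
feature1 = np.array([1, 2, 3, 4, 5], dtype=float)
feature2 = np.array([2, 1, 4, 3, 6], dtype=float)
feature3 = feature1 + feature2
X = np.column_stack([np.ones(5), feature1, feature2, feature3])

gram = X.T @ X                          # 4 x 4 matrix
rank_dependent = np.linalg.matrix_rank(gram)
print(gram.shape, rank_dependent)       # rank is smaller than 4

# Case 2: more features (3) than training examples (2).
X_wide = np.array([
    [1, 1, 2, 3],
    [1, 4, 5, 7],
], dtype=float)
gram_wide = X_wide.T @ X_wide           # 4 x 4 matrix built from only 2 rows
rank_wide = np.linalg.matrix_rank(gram_wide)
print(gram_wide.shape, rank_wide)       # rank is at most 2
```

In both cases the rank is below 4, so the $4\times 4$ matrix has no inverse and the normal equation cannot be evaluated as written.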

When the matrix $\textbf{X}$ has a very large number of features, the normal equation may take too long to compute, since it requires forming and inverting the $(n+1)\times(n+1)$ matrix $\textbf{X}^T\cdot\textbf{X}$.

In these cases, we can use the **gradient descent** method to minimize the cost function instead.

Gradient descent with one variable ideally looks like this:
<img src="https://cdn-images-1.medium.com/max/1600/0*fU8XFt-NCMZGAWND." width="600">

Gradient descent with two variables ideally looks like this:
<img src="https://blog.paperspace.com/content/images/2019/09/F1-02.large.jpg" width="600">

Gradient descent is an iterative algorithm for finding a **local minimum** of a differentiable function. The MSE cost function is convex, so any local minimum it has is also a global minimum.
- Choose an initial value of $\hat{\theta}$ and a **learning rate** $r$.
- For each iteration $k$, do:
$$\hat{\theta} \leftarrow \hat{\theta} - r\cdot\frac{\partial J(\hat{\theta})}{\partial \theta}.$$
- The gradient of the cost function (the vector of its partial derivatives) is given by
$$
\frac{\partial J(\hat{\theta})}{\partial \theta} = \frac{2}{m}\cdot\textbf{X}^T\cdot(\textbf{X}\cdot\hat{\theta} - \textbf{y}).
$$
- **Verify the formula for the partial derivative, assuming there is one input feature.**

- End the iteration when a stopping criterion is met, such as:
    - The value of $\hat{\theta}$ becomes stable.
    - A maximum number of iterations is reached.
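
The steps above can be sketched for the toy example as follows. The learning rate `r = 1e-5`, the limit of 1000 iterations, the tolerance and the random seed are all assumed values picked for illustration, not recommendations from the course. The helper names `mse_cost` and `gradient` are likewise illustrative.

```python
import numpy as np

# Toy data: each row of X is (1, homework, midterm); y holds the final scores.
X = np.array([
    [1, 95, 90],
    [1, 70, 60],
    [1, 80, 80],
    [1, 100, 80],
    [1, 70, 85],
], dtype=float)
y = np.array([93, 66, 85, 60, 90], dtype=float)
m = X.shape[0]

def mse_cost(theta):
    """MSE cost J(theta) on the toy data."""
    return np.mean((X @ theta - y) ** 2)

def gradient(theta):
    """Gradient of the MSE cost: (2/m) * X^T (X theta - y)."""
    return (2 / m) * X.T @ (X @ theta - y)

# Choose a random initial value and a learning rate (both are assumptions).
rng = np.random.default_rng(42)
theta_hat = rng.normal(size=X.shape[1])
r = 1e-5
max_iterations = 1000
tolerance = 1e-9

costs = [mse_cost(theta_hat)]            # cost history for a training curve
for k in range(max_iterations):
    step = r * gradient(theta_hat)
    theta_hat = theta_hat - step         # update rule
    costs.append(mse_cost(theta_hat))
    if np.linalg.norm(step) < tolerance: # theta_hat has become stable
        break

print("iterations run:", len(costs) - 1)
print("initial cost:", costs[0], "final cost:", costs[-1])
```

The list `costs` can be passed to `plt.plot` to draw the training curve. The learning rate is small here because the raw scores are large numbers; rescaling the features would allow a larger rate.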

In [7]:
# Choose a random initial value for each parameter.



In [8]:
# Perform gradient descent once.
# Choose a learning rate r


# 1. Calculate the gradient


# 2. Update the parameters


# 3. (optional) Show the MSE cost with new parameter values



In [None]:
# Perform gradient descent multiple times



In [None]:
# Plot the training curve.



**Discussion**
1. Change $r$ to 0.000001 and 1. Observe the MSE curve.
2. Do the initial parameter values matter?
3. How do you determine when to stop the iteration?