# Week 5
# Simple Linear Regression and the Normal Equation (Chapter 4)

So far we have treated machine learning models and their training algorithms mostly as black boxes. Starting with Chapter 4, we will look into the mechanisms of several popular machine learning models, analyze them mathematically, and learn how to implement them from scratch. Let's start with the linear regression model.

## Simple Linear Regression: Sales Prediction

To put things into context, let's look at a dataset that contains the sales revenue and the advertising budgets of a company in 200 different markets.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
url = "http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv"
advertising = pd.read_csv(url, index_col=0)
advertising.head()

Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [4]:
# plot TV vs. sales

In [None]:
# plot radio vs. sales

In [None]:
# plot newspaper vs. sales

For simplicity, we will only use `TV` as a predictor of `sales`.

In [5]:
data = advertising[['TV', 'sales']]
data.head()

Unnamed: 0,TV,sales
1,230.1,22.1
2,44.5,10.4
3,17.2,9.3
4,151.5,18.5
5,180.8,12.9


## Simple Linear Regression: Model Representation

To describe the model mathematically, we need some notation:
- The input feature `TV` is represented as variable $X$.
- The output/response feature `sales` is represented as variable $Y$.
- Each instance of data is represented as $(x_i, y_i)$, where $i$ is the row index, $x_i$ is the value corresponding to $X$, and $y_i$ is the value corresponding to $Y$. For example, $(x_1, y_1) = (230.1, 22.1)$.

The **simple linear regression** model assumes that the relationship between $X$ and $Y$ is
$$Y \approx f(X) = \beta_0 + \beta_1 X.$$

- $\beta_0$ and $\beta_1$ are called **model parameters**. For simple linear regression, the relationship is characterized as a straight line with slope $\beta_1$ and y-intercept $\beta_0$.
- We need a **cost function** (sometimes also called a **loss function**) that measures how well a given line fits the data.
- We also need a **training algorithm** that finds values of parameters so that the line fits the data well (usually "fitting the data well" means "minimizing the cost").
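As a minimal sketch of the model above, the cell below evaluates a candidate line on the first five rows shown in the `head()` output. The helper name `predict` and the parameter values $\beta_0 = 5$, $\beta_1 = 0.1$ are assumptions chosen only for illustration; they are not fitted values.

```python
import numpy as np

# First five rows of the dataset, copied from the head() output above.
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

def predict(x, beta0, beta1):
    """Evaluate the line f(x) = beta0 + beta1 * x (hypothetical helper)."""
    return beta0 + beta1 * x

# Candidate parameters, picked for illustration only.
predictions = predict(tv, beta0=5.0, beta1=0.1)

# Residuals y_i - f(x_i): how far each observed value is from the line.
residuals = sales - predictions

print(predictions)
print(residuals)
```

A residual is positive when the line underestimates `sales` for that row and negative when it overestimates.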


## Simple Linear Regression: Cost Function
A common choice of cost function is the **mean squared error (MSE) function**. It is defined as
$$\begin{align}
MSE(\beta_0, \beta_1) =& 
\frac{1}{N}\sum_{i=1}^N (y_i - f(x_i))^2 \\
=& \frac{1}{N}\sum_{i=1}^N\big(y_i - \beta_0 - \beta_1x_i\big)^2.
\end{align}$$
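To make the formula concrete, here is a small sketch that evaluates the MSE on only the first five rows from the `head()` output, not on all 200 markets, so the numbers will differ from those on the full dataset. The function name `mse_on_sample` is a hypothetical helper introduced for this illustration.

```python
import numpy as np

# First five rows of the dataset, copied from the head() output above.
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

def mse_on_sample(x, y, beta0, beta1):
    """Mean of the squared residuals (y_i - beta0 - beta1 * x_i)^2."""
    residuals = y - (beta0 + beta1 * x)
    return np.mean(residuals ** 2)

# Compare two candidate lines on the five-row sample.
mse_a = mse_on_sample(tv, sales, beta0=5.0, beta1=0.1)
mse_b = mse_on_sample(tv, sales, beta0=7.0, beta1=0.05)
print(mse_a, mse_b)
```

Because each residual is squared before averaging, rows that lie far from the line contribute much more to the cost than rows that lie close to it.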

To better understand the MSE function, let's calculate the value $MSE(5, 0.1)$.

In [6]:
# Calculate (y_1 - f(x_1))^2 with beta0 = 5 and beta1 = 0.1.


In [7]:
# Create a list that contains the value of (y_i - f(x_i))^2 for i=1,...,200.


In [8]:
# Calculate MSE(5, 0.1)


In [9]:
# Write a function MSE(beta0, beta1) that returns the value of MSE with given beta0 and beta1.


In [10]:
# Calculate MSE(7, 0.05) using MSE().


In [None]:
# Plot the data as a scatter plot.

# Plot the line y = 5 + 0.1x.

# Plot the line y = 7 + 0.05x.


**Discussion**: 
- Which line fits the data better? 
- Which line has a smaller MSE value?
- What is the geometric meaning of the MSE function?

## Simple Linear Regression: Training Algorithm
To find the values of $\beta_0$ and $\beta_1$ that minimize the MSE cost function, we can use a formula called the **normal equation**, which gives the result directly:

$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = (\textbf{X}^T\cdot\textbf{X})^{-1}\cdot\textbf{X}^T\cdot\textbf{y}.$$

- $\textbf{X}$ is the matrix formed as 
$$\textbf{X} = \begin{pmatrix} 
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_N \\
\end{pmatrix}.$$
- $\textbf{X}^T$ is the **matrix transpose** of $\textbf{X}$.
- $\cdot$ is **matrix multiplication**.
- $^{-1}$ is **matrix inverse**.
- $\textbf{y}$ is the vector of target values
$$\textbf{y} = \begin{pmatrix} 
y_1 \\
y_2 \\
\vdots \\
y_N \\
\end{pmatrix}.$$
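For readers who want to see where the formula comes from, here is a brief derivation sketch. It introduces the shorthand $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$, which is notation assumed for this sketch only. Since the $i$-th entry of $\textbf{X}\cdot\boldsymbol{\beta}$ is $\beta_0 + \beta_1 x_i$, the MSE can be written in matrix form, and setting its gradient to zero gives

$$\begin{align}
MSE(\boldsymbol{\beta}) =& \frac{1}{N}(\textbf{y} - \textbf{X}\cdot\boldsymbol{\beta})^T\cdot(\textbf{y} - \textbf{X}\cdot\boldsymbol{\beta}), \\
\nabla MSE(\boldsymbol{\beta}) =& -\frac{2}{N}\textbf{X}^T\cdot(\textbf{y} - \textbf{X}\cdot\boldsymbol{\beta}) = \textbf{0}, \\
\textbf{X}^T\cdot\textbf{X}\cdot\boldsymbol{\beta} =& \textbf{X}^T\cdot\textbf{y}, \\
\boldsymbol{\beta} =& (\textbf{X}^T\cdot\textbf{X})^{-1}\cdot\textbf{X}^T\cdot\textbf{y}.
\end{align}$$

The last step requires $\textbf{X}^T\cdot\textbf{X}$ to be invertible, which holds for simple linear regression as long as the $x_i$ are not all equal. Because the MSE is a convex function of $\boldsymbol{\beta}$, this stationary point is a minimum.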

Let's apply the normal equation and find the best parameter values.

In [17]:
# Construct X and y as numpy arrays
X = np.hstack([np.ones([len(data), 1]), data[['TV']].values])
# print(X)
y = data[['sales']].values

beta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(beta)

[[7.03259355]
 [0.04753664]]
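As a sanity check on the normal equation, the sketch below compares three ways of computing the least-squares line on the five rows shown in the `head()` output (not the full dataset, so the values will differ from the output above). The variable names are assumptions for this illustration. It solves the linear system $\textbf{X}^T\textbf{X}\boldsymbol{\beta} = \textbf{X}^T\textbf{y}$ with `np.linalg.solve` instead of forming the inverse explicitly, which is generally the numerically safer choice.

```python
import numpy as np

# First five rows of the dataset, copied from the head() output above.
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(tv), tv])

# 1) Normal equation, solved as a linear system rather than via an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ sales)

# 2) NumPy's general least-squares solver.
beta_lstsq, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)

# 3) Textbook closed form for a single predictor.
x_bar, y_bar = tv.mean(), sales.mean()
slope = np.sum((tv - x_bar) * (sales - y_bar)) / np.sum((tv - x_bar) ** 2)
intercept = y_bar - slope * x_bar

print(beta_normal)
print(beta_lstsq)
print(intercept, slope)
```

All three approaches should agree up to floating-point rounding, since they minimize the same MSE cost.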


In [None]:
# Plot the data points and the optimal regression line.


In [18]:
# Find the results using LinearRegression class from sklearn.linear_model


**Discussion**:
- How can we interpret the value of $\beta_1$?
- How can we interpret the value of $\beta_0$?

# Homework

1. Build a simple linear regression model that uses `radio` to predict `sales`. 