*This notebook is part of the course materials for CS 345: Machine Learning Foundations and Practice at Colorado State University.
Original versions were created by Asa Ben-Hur.
The content is available [on GitHub](https://github.com/asabenhur/CS345).*

*The text is released under the [CC BY-SA license](https://creativecommons.org/licenses/by-sa/4.0/), and code is released under the [MIT license](https://opensource.org/licenses/MIT).*

<a href="https://colab.research.google.com/github//asabenhur/CS345/blob/master/fall22/notebooks/module03_04_univariate_linear_regression_01.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Linear Regression

*Adapted from Chapter 3 of [An Introduction to Statistical Learning](https://www.statlearning.com/)*

In [3]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
import matplotlib.pyplot as plt

### Multivariate linear regression

Univariate linear regression can easily be extended to include multiple features, which is called **multivariate linear regression**.  Instead of using a single variable to make a prediction, we use a vector of variables:

$$
\hat{y} = w_1x_1 + \ldots + w_dx_d + b = \mathbf{w}^\top \mathbf{x} + b
$$

Each $x_i$ represents a different feature, and each feature has its own parameter $w_i$.  In our advertising data, for example, $\mathbf{x} = (x_1,x_2, x_3)^\top$ and $\mathbf{w} = (w_1,w_2, w_3)^\top$, with one component each for TV, radio and newspaper advertising.

As in the univariate case, the parameters are chosen to minimize the sum-squared error:
$$
J( \mathbf{w},b ) = \sum_{i=1}^N (y_i - \hat{y}_i)^2,
$$
where 
$$
\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b.
$$
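
As a minimal sketch of these two formulas, the cell below evaluates $\hat{y}_i$ and $J(\mathbf{w}, b)$ with NumPy. The tiny data matrix, the weights, and the helper names `predict` and `sum_squared_error` are made up for illustration and are not part of the advertising example.

```python
import numpy as np

def predict(X, w, b):
    """Return the vector of predictions X @ w + b (one per row of X)."""
    return X @ w + b

def sum_squared_error(X, y, w, b):
    """Return J(w, b): the sum of squared residuals over all examples."""
    residuals = y - predict(X, w, b)
    return np.sum(residuals ** 2)

# toy data (hypothetical): 3 examples, 2 features
X_toy = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [2.0, 0.0]])
y_toy = np.array([5.0, 2.0, 3.0])
w_toy = np.array([1.0, 1.5])
b_toy = 0.5

print(predict(X_toy, w_toy, b_toy))
print(sum_squared_error(X_toy, y_toy, w_toy, b_toy))
```

Fitting a linear regression model amounts to searching for the `w` and `b` that make this quantity as small as possible.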


In [4]:
# read data into a pandas DataFrame
data = pd.read_csv('https://www.statlearning.com/s/Advertising.csv', index_col=0)
data.head()

|   | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [5]:
from sklearn.linear_model import LinearRegression

# create X and y
X = data[['TV', 'radio', 'newspaper']].values
y = data['sales'].values

# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

# print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

2.9388893694594067
[ 0.0458  0.1885 -0.001 ]


For a given amount of Radio and Newspaper spending, an increase of $1000 in **TV** spending is associated with an **increase in sales of 45.8 widgets**.

For a given amount of TV and Newspaper spending, an increase of $1000 in **Radio** spending is associated with an **increase in sales of 188.5 widgets**.

For a given amount of TV and Radio spending, an increase of $1000 in **Newspaper** spending is associated with a **decrease in sales of 1.0 widgets**. How could that be?
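
One way to check this reading of the coefficients is to compare two predictions that differ only in TV spending. The sketch below hard-codes the rounded intercept and coefficients printed above (so results are approximate), and the baseline budget is an arbitrary, made-up example. It assumes, consistent with the interpretation above, that spending is recorded in thousands of dollars and sales in thousands of units.

```python
import numpy as np

# rounded values copied from the fitted model's printed output above
intercept = 2.9389
coef = np.array([0.0458, 0.1885, -0.001])   # TV, radio, newspaper

# hypothetical baseline budget, in thousands of dollars
baseline = np.array([100.0, 20.0, 30.0])

# same budget with TV spending raised by 1 (i.e. $1000)
more_tv = baseline + np.array([1.0, 0.0, 0.0])

pred_baseline = intercept + coef @ baseline
pred_more_tv = intercept + coef @ more_tv

# the difference equals the TV coefficient, regardless of the baseline
change = pred_more_tv - pred_baseline
print(f"change in predicted sales: {change:.4f} (thousands of units)")
```

Because the model is linear, the same change is obtained for any choice of baseline budget.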

## Evaluation metrics for regression problems

We just introduced the concept of fitting a multivariate linear model. Let us now take a moment to ask how we can judge the quality of a model after fitting.

Evaluation metrics for classification problems, such as **accuracy**, are not useful for regression problems. We need evaluation metrics designed for comparing **continuous values**.

Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

In [6]:
# define true and predicted response values
y = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1N\sum_{i=1}^N |y_i-\hat{y}_i|$$

It is implemented in scikit-learn:

In [7]:
from sklearn import metrics

metrics.mean_absolute_error(y, y_pred)

10.0

But it's just as easy to implement it yourself:

In [6]:
def mean_absolute_error(y, y_pred):
    return np.mean(np.absolute(np.asarray(y) - np.asarray(y_pred)))

mean_absolute_error(y, y_pred)

10.0

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1N\sum_{i=1}^N(y_i-\hat{y}_i)^2$$

In [7]:
metrics.mean_squared_error(y, y_pred)

150.0

**Root Mean Squared Error** (RMSE) is the square root of the MSE:

$$\sqrt{\frac 1N\sum_{i=1}^N(y_i-\hat{y}_i)^2}$$

In [8]:
print(np.sqrt(metrics.mean_squared_error(y, y_pred)))

12.24744871391589
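
As with MAE, both metrics are short to write by hand. This is a minimal NumPy sketch; the function names `mean_squared_error` and `root_mean_squared_error` are chosen here for illustration, and the example values repeat the ones defined above.

```python
import numpy as np

def mean_squared_error(y, y_pred):
    """Mean of the squared differences between true and predicted values."""
    errors = np.asarray(y) - np.asarray(y_pred)
    return np.mean(errors ** 2)

def root_mean_squared_error(y, y_pred):
    """Square root of the MSE, in the same units as y."""
    return np.sqrt(mean_squared_error(y, y_pred))

y_example = [100, 50, 30, 20]
y_pred_example = [90, 50, 50, 30]

mse = mean_squared_error(y_example, y_pred_example)
rmse = root_mean_squared_error(y_example, y_pred_example)
print(mse)
print(rmse)
```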


Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average absolute error.
- **MSE** "punishes" larger errors, because each error is squared before averaging.
- **RMSE** is easier to interpret than MSE, because it is in the same units as $y$.

All of these are measures of error (loss), so lower is better.

Here's an example, to demonstrate how MSE/RMSE punish larger errors:

In [8]:
# same true values as above
y_true = [100, 50, 30, 20]

# new set of predicted values
y_pred = [60, 50, 30, 20]

# the previous values were:  y_pred = [90, 50, 50, 30]

# MAE is the same as before
print(f"MAE is  : {metrics.mean_absolute_error(y_true, y_pred):6.3f}")

# RMSE is larger than before
print(f"RMSE is : {np.sqrt(metrics.mean_squared_error(y_true, y_pred)):6.3f}")

MAE is  : 10.000
RMSE is : 20.000
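
To see where the difference comes from, it can help to look at the per-example errors. The sketch below (variable names are illustrative) shows that both sets of predictions have the same total absolute error, while the single large miss dominates the sum of squared errors.

```python
import numpy as np

y_true_arr = np.array([100, 50, 30, 20])
pred_spread = np.array([90, 50, 50, 30])   # several moderate errors
pred_single = np.array([60, 50, 30, 20])   # one large error

for name, pred in [("spread", pred_spread), ("single", pred_single)]:
    abs_err = np.abs(y_true_arr - pred)
    sq_err = (y_true_arr - pred) ** 2
    print(name, "absolute errors:", abs_err, "sum =", abs_err.sum())
    print(name, "squared errors :", sq_err, "sum =", sq_err.sum())
```

Squaring turns one error of 40 into 1600, which is more than the three smaller errors contribute together (100 + 400 + 100 = 600).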


While the above were all measures of error, the following is a measure of the success of the model (higher is better).

**Coefficient of Determination** ($R^{2}$) measures how much of the variance in the data is explained by the model. Although this one-sentence description is simple, the measure takes more care to interpret than the error metrics above, so here is its mathematical definition.  First let's define the total sum of squares:

$$ 
\mathrm{SS}_{\mathrm{tot}} = \sum_{i=1}^N \left( y_{i} - \bar{y} \right)^{2} 
$$

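Here $\bar{y}$ is the mean of the labels, so this quantity is proportional to the variance of the labels: it captures the variability inherent in the data.  We also define the residual sum of squares: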

$$ 
\mathrm{SS}_{\mathrm{res}} = \sum_{i=1}^N \left( y_{i} - \hat{y}_{i} \right)^{2},
$$

where $\hat{y}_{i}$ is the predicted label of $\mathbf{x}_i$ from our linear regression model.
We are now ready to define

$$
R^{2} = \frac{\mathrm{SS}_{\mathrm{tot}} - \mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}} = 1 - \frac{\mathrm{SS}_{\mathrm{res}}}{\mathrm{SS}_{\mathrm{tot}}}
$$

You can learn more about $R^{2}$ on its [Wikipedia page](https://en.wikipedia.org/wiki/Coefficient_of_determination).
And as you would expect, there is a scikit-learn [implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).
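
The definition also translates directly into a few lines of NumPy. This is a sketch under the definitions above; the function name `r2_by_hand` is made up, and the example reuses the small set of true and predicted values from earlier in this section.

```python
import numpy as np

def r2_by_hand(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
    return 1 - ss_res / ss_tot

r2_example = r2_by_hand([100, 50, 30, 20], [90, 50, 50, 30])
print(f"{r2_example:.3f}")

# predicting the mean for every example gives SS_res == SS_tot, so R^2 is 0
r2_mean = r2_by_hand([100, 50, 30, 20], [50, 50, 50, 50])
print(f"{r2_mean:.3f}")
```

A model that does worse than always predicting the mean has $\mathrm{SS}_{\mathrm{res}} > \mathrm{SS}_{\mathrm{tot}}$, so $R^{2}$ can be negative. The sketch does not handle the case where all true values are identical ($\mathrm{SS}_{\mathrm{tot}} = 0$).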


### Evaluating multivariate regression on the advertising data


In [27]:
from sklearn.model_selection import train_test_split

X = data[['TV', 'radio', 'newspaper']].values
y = data['sales'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred = linreg.predict(X_test)

mae_test = metrics.mean_absolute_error(y_test, y_pred)
print(f"MAE on the test set : {mae_test:6.3f}")
r2_test = metrics.r2_score(y_test, y_pred)
print(f"R2 on the test set : {r2_test:6.3f}")

MAE on the test set :  1.216
R2 on the test set :  0.887


In [29]:
y_train_pred = linreg.predict(X_train)

mae_train = metrics.mean_absolute_error(y_train, y_train_pred)
print(f"MAE on the training set : {mae_train:6.3f}")
r2_train = metrics.r2_score(y_train, y_train_pred)
print(f"R2 on the training set : {r2_train:6.3f}")


MAE on the training set :  1.259
R2 on the training set :  0.900


### Exercise:  Selecting good features

Split the data into train/test sets and use RMSE to decide whether the newspaper feature should be used for our model.  You will need to train/test two versions - with and without that feature.

In [None]:
# for convenience, here's the data again:

X = data[['TV', 'radio', 'newspaper']].values
y = data['sales'].values


### Advantages/disadvantages of linear regression

Advantages of linear regression:

- Simple to explain
- Highly interpretable
- Model training and prediction are fast
- No tuning is required 
- Can perform well with a small number of observations

Disadvantages of linear regression:

- Presumes a linear relationship between the features and the labels
- Performance is (generally) not competitive with the best regression methods
- Can be sensitive to irrelevant features and outliers

Linear regression is a **parametric method**: it assumes a specific functional form (a line/hyperplane) with a fixed number of parameters, so its success depends on the relationship between the features and the labels being approximately linear.
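
A small synthetic example illustrates the linearity assumption. The sketch below uses made-up data in which the label is the square of a single feature, and fits a line by least squares with `np.linalg.lstsq` (used here in place of scikit-learn to keep the example self-contained).

```python
import numpy as np

# synthetic data (hypothetical): y depends on x, but not linearly
x = np.linspace(-3, 3, 61)
y_quad = x ** 2

# least-squares fit of y = w*x + b via a design matrix with a column of ones
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y_quad, rcond=None)

# R^2 of the fitted line on the same data
y_fit = w * x + b
ss_res = np.sum((y_quad - y_fit) ** 2)
ss_tot = np.sum((y_quad - y_quad.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"slope = {w:.3f}, intercept = {b:.3f}, R^2 = {r2:.3f}")
```

Because the data are symmetric around zero, the best-fitting line is essentially flat and explains almost none of the variance, even though the label is fully determined by the feature.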