###### Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2021 Lorena A. Barba, Tingyu Wang

# Multiple linear regression

So far, we have only modeled the relationship between one input variable (or feature) $x$ and one output variable $y$. More often than not, real-world model-fitting problems involve making predictions using more than one feature. For example, you can predict the box office gross of Hollywood movies using trailer views, Wikipedia page views, critic ratings and time of release; or predict the annual energy consumption of a building using its occupancy, structural information, weather data and so on. In this notebook, we are going to extend the linear regression model to a multiple linear regression model, which can predict one output variable using multiple features.

To have some data to work with, we grabbed the [auto miles per gallon (MPG) dataset](http://archive.ics.uci.edu/ml/datasets/Auto+MPG) from the UCI Machine Learning Repository, removed the missing data and formatted it as a CSV file. Our goal is to predict the MPG (fuel efficiency) of a car using its technical specs.

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.set()

Let's read in the data and take a glance.

In [None]:
mpg_data = pd.read_csv('../data/auto_mpg.csv')
mpg_data.head()

And `pandas.DataFrame.info()` gives us a quick summary of the data.

In [None]:
mpg_data.info()

Here we have 392 entries, each associated with a specific car model. There are 9 columns altogether. Except for `car name`, the values in all columns are numeric. The name of a car won't directly affect its MPG. Despite being numeric, the `origin` column, indicating the country of origin, is really categorical. For simplicity, we don't include `car name` and `origin` as features to predict the MPG. Now, let's define the feature columns, `x_cols`, and the output column, `y_col`.

In [None]:
y_col = 'mpg'
x_cols = mpg_data.columns.drop(['car name', 'origin', 'mpg'])  # also drop mpg column

print(x_cols)

We end up keeping 6 features (all technical specs of a car) that are seemingly correlated with MPG.

## Data exploration

Understanding the data before choosing a model to fit it is just as important as the fitting itself, yet it is often overlooked. The best way to start is to visualize the relationship between input and output variables.

Recall that when we had only one feature $x$, we made a scatter plot to observe its relationship with $y$. Since we are now dealing with 6 features, we want to make such a plot for every one of them. The data visualization package `seaborn` provides a handy function, [`seaborn.pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html), to plot these 6 figures in one go. Double-click the figure to expand the view.

In [None]:
sns.pairplot(data=mpg_data, height=5, aspect=1,
             x_vars=x_cols,
             y_vars=y_col);

The features `model_year` and `acceleration` show a positive correlation with `mpg`, while the rest show a negative correlation with `mpg`.
All six features show a more or less linear relationship with our output variable, which is why a linear regression model suits this problem well.

## Multiple linear regression

If every feature is correlated with $y$ individually, it is natural to think that combining them linearly would be a good fit for $y$. Formally, the multiple linear regression model for $d$ input variables can be written as:

$$
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d, 
$$

where the "hat" on $y$ denotes a predicted value.
Notice that we have $d+1$ weights for $d$ features, and $w_0$ is the intercept term. By letting $x_0 = 1$ for all data points, we can simplify the notation as:

$$
\hat{y} = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x}, 
$$

where $\mathbf{w} = (w_0, w_1, \ldots, w_d)^T$ is the vector of weights, and $\mathbf{x} = (x_0, x_1, \ldots, x_d)^T$ the vector of input variables.

Since we've used subscripts to denote features, let's index our dataset entries with superscripts. For example, $x_1^{(i)}$ represents the `cylinders` value (the first feature) of the $i$-th car model.

Suppose our dataset has $N$ entries. Writing out our model for each entry, we have:

\begin{align*}
\hat{y}^{(1)} & = w_0 x_0^{(1)} + w_1 x_1^{(1)} + w_2 x_2^{(1)} + \ldots + w_d x_d^{(1)} \\
\hat{y}^{(2)} & = w_0 x_0^{(2)} + w_1 x_1^{(2)} + w_2 x_2^{(2)} + \ldots + w_d x_d^{(2)} \\
&\vdots \\
\hat{y}^{(N)} & = w_0 x_0^{(N)} + w_1 x_1^{(N)} + w_2 x_2^{(N)} + \ldots + w_d x_d^{(N)}  \\
\end{align*}

Finally, we arrive at the matrix form of the multiple linear regression model:

$$
\hat{\mathbf{y}} = X\mathbf{w}
$$

Here $X$ is the matrix of our input variables. To form $X$, we need to pad a column of $1$s to the left of our original data as the dummy feature corresponding to the intercept $w_0$. We use $\hat{\mathbf{y}}$ to represent the vector of predicted output variables, and $\mathbf{y}$ to represent the vector of observed (true) output variables.
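
To make the matrix form concrete, here is a minimal sketch on a made-up dataset with $N=3$ entries and $d=2$ features. The numbers and the names `X_raw_toy`, `X_toy` and `w_toy` are illustrative assumptions, not part of the MPG data, and plain NumPy is imported under the alias `onp` only to avoid clashing with any `np` defined elsewhere in the notebook. It checks that one matrix-vector product reproduces the entry-by-entry sums.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

# Toy data: 3 entries, 2 features (illustrative values only)
X_raw_toy = onp.array([[1.0, 2.0],
                       [3.0, 4.0],
                       [5.0, 6.0]])
w_toy = onp.array([0.5, 2.0, -1.0])   # (w_0, w_1, w_2)

# Pad a column of 1s on the left so that w_0 acts as the intercept
X_toy = onp.hstack((onp.ones((X_raw_toy.shape[0], 1)), X_raw_toy))

# Matrix form: a single product gives all N predictions
y_hat_matrix = X_toy @ w_toy

# Entry-by-entry form: w_0 + w_1*x_1 + w_2*x_2 for each row
y_hat_rows = onp.array([w_toy[0] + w_toy[1]*x1 + w_toy[2]*x2
                        for x1, x2 in X_raw_toy])

print(y_hat_matrix)                            # [0.5 2.5 4.5]
print(onp.allclose(y_hat_matrix, y_hat_rows))  # True
```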

Before coding our model, let's import the automatic differentiation library `autograd`.

In [None]:
import autograd.numpy as np
from autograd import grad

Let's prepare the input matrix $X$ and the vector $\mathbf{y}$ directly from our dataset.

In [None]:
X = mpg_data[x_cols].values
X = np.hstack((np.ones((X.shape[0], 1)), X))  # pad 1s to the left of input matrix
y = mpg_data[y_col].values

print(f"{X.shape = }, {y.shape = }")

Similar to the single-variable linear regression model in lesson 1, let's use the same loss function: $L(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{N} \sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2$. It is also called the **mean squared error** loss function.

In [None]:
def linear_regression(params, X):
    """
    The linear regression model in matrix form.
    """
    return np.dot(X, params)

def mse_loss(params, model, X, y):
    """
    The mean squared error loss function.
    """
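    y_pred = model(params, X)
    # Caution: np.sum already reduces the squared errors to a single number, so the
    # outer np.mean has no effect. The value returned is the summed squared error
    # (N times the MSE): same minimizer, but the gradient is N times larger.
    return np.mean( np.sum((y-y_pred)**2) )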

## Gradient descent

Following lesson 2, we know that `autograd.grad()` would give us the function to compute the derivatives required in gradient descent.

In [None]:
gradient = grad(mse_loss)
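
As a cross-check on what automatic differentiation is expected to produce, the gradient of the mean squared error (as written in the formula above, with the $1/N$ factor) has the closed form $\nabla_{\mathbf{w}} L = \frac{2}{N} X^T (X\mathbf{w} - \mathbf{y})$. The sketch below is an illustration under stated assumptions: it uses synthetic data, plain NumPy under the alias `onp`, and hypothetical helper names (`mse_demo`, `mse_grad_demo`, `finite_diff_grad`), and it compares the closed form with central finite differences.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

rng = onp.random.default_rng(0)
N_demo, d_demo = 50, 3
# Synthetic inputs with a leading column of 1s, plus random targets and weights
X_demo = onp.hstack((onp.ones((N_demo, 1)), rng.random((N_demo, d_demo))))
y_demo = rng.random(N_demo)
w_demo = rng.random(d_demo + 1)

def mse_demo(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return onp.mean((y - X @ w)**2)

def mse_grad_demo(w, X, y):
    """Closed-form gradient of the MSE: (2/N) X^T (X w - y)."""
    return 2.0 / len(y) * (X.T @ (X @ w - y))

def finite_diff_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = onp.zeros_like(w)
    for j in range(len(w)):
        step = onp.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

g_exact = mse_grad_demo(w_demo, X_demo, y_demo)
g_fd = finite_diff_grad(lambda w: mse_demo(w, X_demo, y_demo), w_demo)
print(onp.allclose(g_exact, g_fd, atol=1e-6))
```

The same finite-difference helper can be pointed at any scalar loss, which makes it a handy sanity check for gradients obtained by other means.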

Let's test the function with a random initial guess.

In [None]:
gradient(np.random.rand(X.shape[1]), linear_regression, X, y)

Oops, that does not look nice. With random weights, the gradient values are huge. Let's try a few iterations of gradient descent.

In [None]:
max_iter = 30
alpha = 0.001
params = np.zeros(X.shape[1])

for i in range(max_iter):
    descent = gradient(params, linear_regression, X, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X, y)
    if i%5 == 0:
        print(f"iteration {i:3}, {loss = }")

With features on these scales, you have to choose a very tiny learning rate to keep gradient descent from blowing up. This is because of the big numbers in certain columns, for instance, the `weight` column. In addition, features with widely varying magnitudes lead to slow convergence in gradient descent. Therefore, it is critical to make sure that all features are on a similar scale. This step is called **feature scaling** or **data normalization**.
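
One way to quantify this, sketched here under stated assumptions: for the mean squared error of a linear model, the Hessian is $H = \frac{2}{N} X^T X$, gradient descent with a constant learning rate $\alpha$ stays stable only if $\alpha < 2/\lambda_{\max}(H)$, and the ratio $\lambda_{\max}/\lambda_{\min}$ (the condition number) governs how slowly it converges. The two features below are made up, with ranges chosen only to differ by orders of magnitude; they are not taken from the MPG data, and all names are hypothetical.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

rng = onp.random.default_rng(0)
N_demo = 200

# Two made-up features on very different scales, plus the column of 1s
feat_small = rng.uniform(8.0, 25.0, size=N_demo)
feat_large = rng.uniform(1500.0, 5000.0, size=N_demo)
X_raw_demo = onp.column_stack((onp.ones(N_demo), feat_small, feat_large))

def min_max(col):
    """Rescale a 1D array to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

X_scaled_demo = onp.column_stack((onp.ones(N_demo),
                                  min_max(feat_small),
                                  min_max(feat_large)))

def step_limit_and_condition(X):
    """Largest stable learning rate and condition number for the MSE loss."""
    hessian = 2.0 / X.shape[0] * (X.T @ X)    # Hessian of (1/N) * ||Xw - y||^2
    eigvals = onp.linalg.eigvalsh(hessian)    # eigenvalues in ascending order
    return 2.0 / eigvals[-1], eigvals[-1] / eigvals[0]

alpha_raw, cond_raw = step_limit_and_condition(X_raw_demo)
alpha_scaled, cond_scaled = step_limit_and_condition(X_scaled_demo)
print(f"raw:    alpha < {alpha_raw:.2e}, condition number {cond_raw:.2e}")
print(f"scaled: alpha < {alpha_scaled:.2e}, condition number {cond_scaled:.2e}")
```

On data like this, the admissible learning rate grows by several orders of magnitude after scaling, and the condition number drops accordingly.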

Let's check the range of our features.

In [None]:
mpg_data[x_cols].describe().loc[['max', 'min']]

One commonly used feature scaling technique is min-max scaling, which rescales each feature to the range $[0,1]$. If $x$ is the original value of a feature, its scaled (normalized) value $x^{\prime}$ is given as:

$$
x^{\prime}=\frac{x-\min (x)}{\max (x)-\min (x)}
$$
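
Applied column by column, the formula is a one-liner with array broadcasting. The following is a minimal sketch on a made-up matrix (the values and the names `X_mm_demo`, `col_min`, `col_max` are illustrative assumptions), using plain NumPy under the alias `onp`.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

# Toy feature matrix: two columns on very different scales (illustrative values)
X_mm_demo = onp.array([[4.0, 2000.0],
                       [6.0, 3500.0],
                       [8.0, 5000.0]])

col_min = X_mm_demo.min(axis=0)   # per-column minimum
col_max = X_mm_demo.max(axis=0)   # per-column maximum

# Broadcasting applies x' = (x - min) / (max - min) to every column at once
X_mm_scaled = (X_mm_demo - col_min) / (col_max - col_min)
print(X_mm_scaled)
```

Note that a column holding a single constant value would make the denominator zero; this sketch does not handle that case.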

Here, let's use the [`sklearn.preprocessing.MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class to rescale our $X$, and then check the range of each column of $X$ again.

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(mpg_data[x_cols])
X_scaled = np.hstack((np.ones((X_scaled.shape[0], 1)), X_scaled))    # add the column for intercept

pd.DataFrame(X_scaled).describe().loc[['max', 'min']]

Notice that column "0" corresponds to the dummy data for the intercept. All values in that column are 1.

Finally, we are ready to run gradient descent to find the optimal parameters for our linear regression model.

In [None]:
max_iter = 1000
alpha = 0.001
params = np.zeros(X_scaled.shape[1])  # one weight per column of the scaled input matrix

for i in range(max_iter):
    descent = gradient(params, linear_regression, X_scaled, y)
    params = params - descent * alpha
    loss = mse_loss(params, linear_regression, X_scaled, y)
    if i%100 == 0:
        print(f"iteration {i:3}, {loss = }")

Let's print out the trained weights. Recall that the first element is the intercept, and the rest correspond to the 6 features respectively.

In [None]:
params

Now, we can make predictions with our model.

In [None]:
y_pred_gd = X_scaled @ params

One thing that we haven't discussed till now is how to quantify the accuracy of a model. For regression problems, two basic metrics are the mean absolute error (MAE) and the root mean squared error (RMSE). The latter is just the square root of the MSE loss function that we used.

$$
\mathrm{MAE}(\mathbf{y}, \hat{\mathbf{y}})=\frac{1}{N} \sum_{i=1}^{N}\left|y^{(i)}-\hat{y}^{(i)}\right|
$$

$$
\mathrm{RMSE}(\mathbf{y}, \hat{\mathbf{y}})=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(y^{(i)}-\hat{y}^{(i)}\right)^{2}}
$$
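
Both formulas translate directly into array operations. Here is a minimal sketch in plain NumPy (imported as `onp`); the helper names `mae_numpy` and `rmse_numpy` and the four sample values are hypothetical, chosen so the results are easy to verify by hand.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

def mae_numpy(y_true, y_hat):
    """Mean absolute error: average of |y - y_hat|."""
    return onp.mean(onp.abs(y_true - y_hat))

def rmse_numpy(y_true, y_hat):
    """Root mean squared error: square root of the average of (y - y_hat)^2."""
    return onp.sqrt(onp.mean((y_true - y_hat)**2))

# Made-up observations and predictions; the errors are 1, 0, -2, 0
y_true_demo = onp.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = onp.array([2.0, 5.0, 9.0, 9.0])

print(mae_numpy(y_true_demo, y_hat_demo))    # (1 + 0 + 2 + 0) / 4 = 0.75
print(rmse_numpy(y_true_demo, y_hat_demo))   # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25)
```

Because the errors are squared before averaging, the RMSE weighs large errors more heavily and is never smaller than the MAE.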

Most common metrics are available in scikit-learn. Let's compute both errors using the corresponding functions in the [`sklearn.metrics`](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) module.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y, y_pred_gd)
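# RMSE as the square root of the MSE; this avoids the `squared` argument,
# which newer scikit-learn releases no longer accept
rmse = np.sqrt(mean_squared_error(y, y_pred_gd))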
print(f"gradient descent")
print(f"{mae  = }")
print(f"{rmse = }")

### Linear regression with scikit-learn

We want to mention that the `LinearRegression` class in scikit-learn offers the same capability. Now, with a thorough understanding of the model, you should feel more comfortable using these black boxes. Let's try it out.

In [None]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression(fit_intercept=False).fit(X, y)
print("params:")
print(regressor.coef_)  # fitted weights; the first one multiplies the column of 1s

y_pred_sklearn = regressor.predict(X)

mae = mean_absolute_error(y, y_pred_sklearn)
# RMSE as the square root of the MSE (no `squared` argument needed)
rmse = np.sqrt(mean_squared_error(y, y_pred_sklearn))
print(f"scikit-learn linear regression")
print(f"{mae  = }")
print(f"{rmse = }")

### Linear regression with pseudo-inverse

We want to conclude this notebook with a callback to the final lesson in the linear algebra module. Recall that we can use SVD to obtain the pseudo-inverse of a matrix, and that the pseudo-inverse offers a least squares solution of the corresponding linear system. Given $X$ and $\mathbf{y}$, finding the linear regression weights $\mathbf{w}$ that minimize the MSE loss function is exactly a least squares problem.
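
As a reminder of how the pieces fit, the sketch below assembles the pseudo-inverse from the SVD, $A^{+} = V \Sigma^{-1} U^T$, and uses it to solve a small least squares problem. It is an illustration under stated assumptions: the data are synthetic, the matrix is assumed to have full column rank (so no singular value is zero), and the names `A_demo`, `b_demo`, `w_svd` are hypothetical. Plain NumPy is imported under the alias `onp`.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

rng = onp.random.default_rng(1)
N_demo, d_demo = 40, 3
# Synthetic inputs with a leading column of 1s, and noisy linear targets
A_demo = onp.hstack((onp.ones((N_demo, 1)), rng.random((N_demo, d_demo))))
w_true = onp.array([1.0, -2.0, 0.5, 3.0])
b_demo = A_demo @ w_true + 0.01 * rng.standard_normal(N_demo)

# Reduced SVD: A = U diag(s) V^T
U, s, Vt = onp.linalg.svd(A_demo, full_matrices=False)

# Pseudo-inverse: V diag(1/s) U^T (valid here because all singular values are nonzero)
A_pinv = Vt.T @ onp.diag(1.0 / s) @ U.T

w_svd = A_pinv @ b_demo                                       # least squares weights
w_lstsq, *_ = onp.linalg.lstsq(A_demo, b_demo, rcond=None)    # reference solution

print(onp.allclose(w_svd, w_lstsq))                  # same weights as the library solver
print(onp.allclose(A_pinv, onp.linalg.pinv(A_demo)))  # same matrix as the library pinv
```

For rank-deficient matrices, library routines additionally discard singular values below a tolerance instead of inverting them, which this sketch omits.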

Performing an SVD on a large dataset might not be ideal, but let's try it on the MPG data.

In [None]:
from scipy.linalg import pinv

params = pinv(X) @ y
y_pred_svd = X @ params

mae = mean_absolute_error(y, y_pred_svd)
# RMSE as the square root of the MSE (no `squared` argument needed)
rmse = np.sqrt(mean_squared_error(y, y_pred_svd))
print(f"linear regression using pseudo inverse")
print(f"{mae  = }")
print(f"{rmse = }")

If you look closely, you will notice that the error from using the pseudo-inverse is almost identical to the error from using scikit-learn. That is no coincidence: a closed-form solution is available, and scikit-learn's `LinearRegression` solves the same least squares problem directly (it wraps `scipy.linalg.lstsq`) rather than iterating with gradient descent. However, for more complicated models, where no closed-form solution exists, we have to use gradient descent.
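
To see the two routes side by side, here is a minimal sketch on synthetic data whose features already lie in $[0,1]$. It runs gradient descent on the mean squared error, using the closed-form gradient $\frac{2}{N} X^T (X\mathbf{w} - \mathbf{y})$, and compares the result with a direct least squares solve. The data, the step size `0.3`, the iteration count and all names are assumptions picked for this illustration, not settings from the notebook.

```python
import numpy as onp  # plain NumPy under its own alias (an assumption of this sketch)

rng = onp.random.default_rng(2)
N_gd, d_gd = 100, 2
# Synthetic inputs (column of 1s plus features in [0, 1]) and noisy linear targets
A_gd = onp.hstack((onp.ones((N_gd, 1)), rng.random((N_gd, d_gd))))
b_gd = A_gd @ onp.array([2.0, -1.0, 3.0]) + 0.1 * rng.standard_normal(N_gd)

# Route 1: direct least squares solution
w_closed, *_ = onp.linalg.lstsq(A_gd, b_gd, rcond=None)

# Route 2: gradient descent on the MSE with its closed-form gradient
w_gd = onp.zeros(A_gd.shape[1])
step_size = 0.3
for _ in range(5000):
    grad_mse = 2.0 / N_gd * (A_gd.T @ (A_gd @ w_gd - b_gd))
    w_gd = w_gd - step_size * grad_mse

print(w_closed)
print(w_gd)
print(onp.allclose(w_gd, w_closed, atol=1e-6))
```

Given enough iterations and a stable step size, gradient descent lands on the same weights as the direct solve; it simply takes many cheap steps instead of one factorization.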

In [None]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.core.display import HTML
css_file = '../style/custom.css'
HTML(open(css_file, "r").read())