# Multiple Linear Regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

%matplotlib notebook

Here is a data set of sales figures from different stores.

In [None]:
data = pd.read_csv('sales.csv')
data

Let's try to predict net sales from two variables: the square footage (size) of the store, and the number of competing stores in the area. Our model will be:

$$
\text{net sales} \approx b_0 + b_1 \times \text{sqft} + b_2 \times \text{competitors}
$$

Do you expect $b_1$ to be positive or negative? What about $b_2$?

Let's plot the data.

**Note**: the plot below is interactive. Try clicking and dragging to move the camera.

In [None]:
sq_ft = np.asarray(data['sq_ft'])
competing = np.asarray(data['competing_stores'])
net_sales = np.asarray(data['net_sales'])

%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(sq_ft, competing, net_sales)
ax.set_xlabel('sq_ft')
ax.set_ylabel('competing_stores')
ax.set_zlabel('net_sales')

Our design matrix is:
    
$$
\begin{pmatrix}
 1 & s_1 & c_1\\
 1 & s_2 & c_2\\
 \vdots & \vdots & \vdots\\
 1 & s_n & c_n
\end{pmatrix}
$$

where $s_i$ is the size of the $i$th store, and $c_i$ is the number of stores competing with it. In code:

In [None]:
X = np.column_stack((
    np.ones_like(sq_ft),
    sq_ft,
    competing
))
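
To make the layout concrete, here is a minimal sketch on made-up numbers (the three stores below are hypothetical and are not rows of `sales.csv`). Each row describes one store, and the leading column of ones is what multiplies the intercept $b_0$.

```python
import numpy as np

# Hypothetical values for three stores (not taken from sales.csv)
toy_sq_ft = np.array([2.0, 5.0, 8.0])
toy_competing = np.array([10.0, 4.0, 1.0])

# One row per store: [1, s_i, c_i]
toy_X = np.column_stack((
    np.ones_like(toy_sq_ft),
    toy_sq_ft,
    toy_competing
))

print(toy_X.shape)  # n rows, one column per parameter
print(toy_X)
```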

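We find the parameter vector $\vec b$ by solving the normal equations $X^\intercal X \vec b = X^\intercal \vec y$, where $\vec y$ is the vector of observed net sales: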

In [None]:
b = np.linalg.solve(X.T @ X, X.T @ net_sales)
b
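
As a sanity check, the same parameters can be obtained from NumPy's least-squares routine, which avoids forming $X^\intercal X$ explicitly. The sketch below uses synthetic data (the names `toy_X`, `toy_y`, and `true_b` and all of the numbers are made up for illustration), so it only demonstrates that the two approaches agree, not anything about the sales data.

```python
import numpy as np

# Synthetic data: an intercept column plus two made-up predictors
rng = np.random.default_rng(0)
n = 50
toy_X = np.column_stack((
    np.ones(n),
    rng.uniform(1, 10, n),
    rng.integers(0, 16, n)
))
true_b = np.array([100.0, 20.0, -5.0])        # hypothetical parameters
toy_y = toy_X @ true_b + rng.normal(0, 1, n)  # noisy observations

# Approach 1: solve the normal equations, as in the cell above
b_normal = np.linalg.solve(toy_X.T @ toy_X, toy_X.T @ toy_y)

# Approach 2: let NumPy minimize the squared error directly
b_lstsq, *_ = np.linalg.lstsq(toy_X, toy_y, rcond=None)

print(b_normal)
print(b_lstsq)
print(np.allclose(b_normal, b_lstsq))
```

In the notebook, the equivalent call would be `np.linalg.lstsq(X, net_sales, rcond=None)`; its first return value should match `b` up to rounding.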

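With two predictor variables, the prediction function $h$ that we have fit, $h = b_0 + b_1 \times \text{sqft} + b_2 \times \text{competitors}$, is not a line; it is a plane: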

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(sq_ft, competing, net_sales)
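ax.set_xlabel('sq_ft')
ax.set_ylabel('competing_stores')
ax.set_zlabel('net_sales')

# Evaluate the fitted plane on a coarse grid (the grid ranges are hard-coded)
XX, YY = np.mgrid[1:10:2, 0:16:2]
Z = b[0] + b[1]*XX + b[2]*YY
ax.plot_wireframe(XX, YY, Z, color='black', alpha=.5)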

Now we will compare our predictions to the observed sales. The plot below shows the actual sales for store $i$ as a red dot at $(i, y_i)$, and the predicted sales as a blue dot at $(i, h_i)$, where $h_i$ is the $i$th entry of the prediction vector $\vec h = X \vec b$.

In [None]:
np.arange(len(net_sales)), net_sales

In [None]:
plt.figure()
plt.scatter(np.arange(len(net_sales)), X @ b, label='Prediction', color='C0')
plt.scatter(np.arange(len(net_sales)), net_sales, label='Observation', color='C3')
plt.legend(loc='upper left')
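
A plot like this can be summarized by a single number, the mean squared error between predictions and observations. The helper below is a minimal sketch (the name `mean_squared_error` and the toy arrays are introduced here for illustration and are not part of the original notebook).

```python
import numpy as np

def mean_squared_error(X, b, y):
    """Average squared difference between predictions X @ b and observations y."""
    residuals = y - X @ b
    return np.mean(residuals ** 2)

# Toy check with made-up numbers:
# predictions are [1, 2, 3] and observations are [1, 2, 5]
toy_X = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
toy_b = np.array([1.0, 1.0])
toy_y = np.array([1.0, 2.0, 5.0])

# Residuals are [0, 0, 2], so the result is (0 + 0 + 4) / 3
print(mean_squared_error(toy_X, toy_b, toy_y))
```

In the notebook, `mean_squared_error(X, b, net_sales)` would give this summary for the two-predictor fit.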

The predictions were made using two predictor variables: the size of the store and the number of competitors in the store's district. Now let's try to use all of the predictor variables available to us in order to make more accurate predictions.

First, we make the design matrix. It consists of every column of `data` except the first, which contains the observed net sales.

In [None]:
X_full = data.iloc[:,1:].values

Now we solve for the parameter vector, $\vec b$:

In [None]:
b_full = np.linalg.solve(X_full.T @ X_full, X_full.T @ net_sales)
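
One thing to check: `X_full` is built directly from the columns of `data`, so it contains a column of ones only if `sales.csv` itself provides one. If it does not, the fit above has no intercept term $b_0$, unlike the two-predictor model. A minimal sketch of prepending one, shown on made-up numbers (the helper name `add_intercept` is hypothetical):

```python
import numpy as np

def add_intercept(features):
    """Return a design matrix with a leading column of ones."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack((ones, features))

# Made-up predictor values for four stores and three predictors
toy_features = np.array([[2.0, 10.0, 3.0],
                         [5.0,  4.0, 1.0],
                         [8.0,  1.0, 7.0],
                         [3.0,  6.0, 2.0]])

toy_X_full = add_intercept(toy_features)
print(toy_X_full.shape)  # one extra column for the intercept
print(toy_X_full)
```

Applied to the notebook, this would read `add_intercept(data.iloc[:, 1:].values)`, assuming the data has no constant column already.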

Now we'll make the same plot as above, but with a green dot for every store showing the prediction made with all of the predictor variables. The red dot is again the observed value. Uncomment the commented-out line in the cell below to compare these new predictions with the old ones.

In [None]:
plt.figure()
plt.scatter(np.arange(len(net_sales)), X_full @ b_full, label='All Predictors', color='C2')
# plt.scatter(np.arange(len(net_sales)), X @ b, label='Two Predictors', color='C0')
plt.scatter(np.arange(len(net_sales)), net_sales, label='Observations', color='C3')
plt.legend(loc='upper left')