<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_II_7_100mDashNotebook_withPredUnc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multivariate Ordinary Least-Squares (OLS) Regression

In this module, we introduced linear regression modeling with $p$ predictors.  Hence, the data set will include $n$ observations $(y_i,x_{i,1},...,x_{i,p})$, $1 \leq i \leq n$, and we assume:
$$
y_i = f(x_i) + \varepsilon_i = \beta_0+\sum_{j=1}^p \beta_j\,x_{i,j} + \varepsilon_i.
$$
As in the univariate case, OLS regression determines the estimate $\hat{\beta}$ that best approximates the training data in the *least-squares sense*:
$$
\hat{\beta}^{\text{OLS}} = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 \right\}.
$$
The OLS estimate also retains the desirable properties we discussed in this module.
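
To make the minimization concrete, here is a minimal sketch that computes the OLS estimate directly with NumPy. The data are synthetic (an assumption for illustration, not the 100m dash data), and the names `X_design`, `beta_lstsq`, and `beta_normal` are made up for this example.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the course data set)
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])  # (beta_0, beta_1, beta_2)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution of min_beta ||y - X_design @ beta||^2
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The same estimate from the normal equations (X'X) beta = X'y
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print(beta_lstsq)
print(beta_normal)
```

Both routes should give the same coefficients up to numerical precision, and with this little noise they should land close to `beta_true`.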

Let's evaluate multivariate regression in our example setting of 100m dash times.

We start by loading libraries. Importantly, we consider two libraries for running linear regressions: [statsmodels](https://www.statsmodels.org/stable/index.html) and [scikit-learn](https://scikit-learn.org/stable/).

Statsmodels is a package that produces nicely formatted regression output and allows for comfortable input (e.g., formula-based input similar to the look and feel of `R`). Scikit-learn is one of the most popular predictive modeling toolboxes, and we will use it for many (!) models/algorithms throughout this course:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

## Getting the data

To make the data available, we can clone the GitHub repository for this class into your Colab notebook:

In [2]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

Cloning into 'MSDIA_PredictiveModelingAndMachineLearning'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 34 (delta 14), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (34/34), 53.59 KiB | 2.23 MiB/s, done.
Resolving deltas: 100% (14/14), done.


If you now list the content...

In [None]:
!ls

you should see `MSDIA_PredictiveModelingAndMachineLearning` listed. And we can pull the data from there:

In [None]:
dash_data = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_II_3_100mDashData.csv')
dash_data.head()

We can take a first look at the data. For example, we can calculate summary statistics:

In [None]:
dash_data.describe()

Or we can plot some key pairwise relationships between the variables:

In [None]:
# @title Hair Length vs Height

from matplotlib import pyplot as plt
dash_data.plot(kind='scatter', x='Hair Length', y='Height', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# @title Height vs Time

from matplotlib import pyplot as plt
dash_data.plot(kind='scatter', x='Height', y='Time', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
# @title Hair Length vs Time

from matplotlib import pyplot as plt
dash_data.plot(kind='scatter', x='Hair Length', y='Time', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

## Running Regression Models

### Via Statsmodels

Let's run our linear regression via statsmodels:

In [None]:
y = dash_data['Time']
X = dash_data.drop(columns=['Time'])
X = sm.add_constant(X) # Add a constant term as the default model doesn't include one
model_sm = sm.OLS(y, X).fit()
# Check regression results
model_sm.summary()

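In [None]:
# Point prediction for a new runner; the row must follow the column order of X,
# assumed here to be [const, Hair Length, Height]
model_sm.predict([[1,4, 180]])

In [11]:
# get_prediction returns an object that also carries uncertainty estimates
predob = model_sm.get_prediction([[1,4, 180]])

In [None]:
# Point prediction (same value as model_sm.predict above)
predob.predicted_mean

In [None]:
# Standard error of the estimated mean response
predob.se_mean

In [None]:
# 95% confidence interval for the mean response
predob.conf_int(alpha=0.05)

In [None]:
# Standard error for a single new observation (includes the residual variance)
predob.se_obs

In [None]:
# Mean, standard error, and both confidence and prediction intervals in one table
predob.summary_frame(alpha=0.05)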
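
To see where the two standard errors come from, here is a minimal NumPy sketch of the textbook formulas, which should correspond to what statsmodels reports as `se_mean` and `se_obs`. The data are synthetic (an assumption for illustration), and `x0`, `leverage`, and `sigma2_hat` are names chosen for this example.

```python
import numpy as np

# Synthetic design matrix with columns [const, hair length, height] (illustrative only)
rng = np.random.default_rng(2)
n = 100
X_syn = np.column_stack([
    np.ones(n),
    rng.uniform(0, 30, n),
    rng.uniform(160, 195, n),
])
y_syn = X_syn @ np.array([20.0, 0.01, -0.04]) + rng.normal(scale=0.5, size=n)

# OLS fit and unbiased estimate of the residual variance
beta_hat = np.linalg.solve(X_syn.T @ X_syn, X_syn.T @ y_syn)
resid = y_syn - X_syn @ beta_hat
dof = n - X_syn.shape[1]
sigma2_hat = resid @ resid / dof

# New point in the same column order as the design matrix
x0 = np.array([1.0, 4.0, 180.0])
leverage = x0 @ np.linalg.inv(X_syn.T @ X_syn) @ x0

# Uncertainty about the mean response at x0
se_mean = np.sqrt(sigma2_hat * leverage)
# Uncertainty about a single new observation at x0: adds the residual variance
se_obs = np.sqrt(sigma2_hat * (1.0 + leverage))

print(se_mean, se_obs)
```

The prediction standard error is always the larger of the two, because a new observation carries its own noise on top of the estimation error in the mean. The corresponding intervals are the predicted mean plus or minus a $t$-quantile (with `dof` degrees of freedom) times the respective standard error.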

### Via scikit-learn

Now let's run the same linear regression via scikit-learn:

In [None]:
y = dash_data['Time']
X = dash_data.drop(columns=['Time'])
model_sk = LinearRegression()
model_sk.fit(X, y)
print(model_sk.intercept_)
print(model_sk.coef_)
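
Note that no constant column was added here: `LinearRegression` fits an intercept by default and stores it separately from the slopes. The following sketch, on synthetic data (an assumption for illustration), checks that this matches a least-squares fit with an explicit column of ones, which is the convention used in the statsmodels cell above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(3)
X_syn = rng.normal(size=(80, 2))
y_syn = 3.0 + X_syn @ np.array([1.5, -2.0]) + rng.normal(scale=0.2, size=80)

# scikit-learn: intercept handled internally (fit_intercept=True is the default)
model = LinearRegression().fit(X_syn, y_syn)

# Explicit design matrix with a column of ones
X_design = np.column_stack([np.ones(len(X_syn)), X_syn])
beta, *_ = np.linalg.lstsq(X_design, y_syn, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

Here `beta[0]` should match `model.intercept_` and `beta[1:]` should match `model.coef_`.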

## Predict

Let's use our model to generate in-sample predictions:

In [None]:
y_hat = model_sk.predict(X)
y_hat

with prediction errors (predicted minus observed times):

In [None]:
y_hat - y

Some error metrics:

1) Mean-Absolute Error (MAE):

In [None]:
np.average(np.abs(y_hat - y))

2) Mean-Absolute Percentage Error (MAPE):

In [None]:
np.average(np.abs((y_hat - y)/y))

3) Root Mean-squared Error (RMSE):

In [None]:
np.sqrt(np.average((y_hat - y)**2))
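
The three metrics can also be wrapped in one small helper. This is a sketch with a hypothetical function name (`error_metrics`) and made-up numbers chosen so that the results are easy to check by hand.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Return MAE, MAPE, and RMSE for two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)),  # requires y_true != 0
        "RMSE": np.sqrt(np.mean(err**2)),
    }

# Made-up values: errors are 1, -2, 4, 0
metrics = error_metrics([10, 20, 40, 50], [11, 18, 44, 50])
print(metrics)
```

For these numbers the MAE is $7/4 = 1.75$, the MAPE is $(0.1 + 0.1 + 0.1 + 0)/4 = 0.075$, and the RMSE is $\sqrt{21/4} \approx 2.29$. Because RMSE squares the errors before averaging, the single error of 4 pulls it above the MAE.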

In [None]:
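# Average predicted time for three new runners, given as (Hair Length, Height) pairs;
# reusing X's column names avoids scikit-learn's feature-name warning
X_new = pd.DataFrame([[4, 180], [20, 170], [8, 175]], columns=X.columns)
model_sk.predict(X_new).mean()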