<a href="https://colab.research.google.com/github/danielbauer1979/ML_656/blob/main/Module1_WageRegressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multivariate Ordinary Least-Squares (OLS) Regression

In this module, we introduced univariate regression and then extended regression modeling to include $p$ predictors. Hence, the data set includes $n$ observations $(y_i,x_{i,1},...,x_{i,p})$, $1 \leq i \leq n$, and we assume:
$$
y_i = f(x_i) + \varepsilon_i = \beta_0+\sum_{j=1}^p \beta_j\,x_{i,j} + \varepsilon_i.
$$
As in the univariate case, OLS regression determines the estimate $\hat{\beta}$ that best approximates the training data in the *least-squares sense*:
$$
\hat{\beta}^{\text{OLS}} = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 \right\}.
$$
The OLS estimate still has the nice properties that we discussed in this module.
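
As a minimal sketch of what the least-squares criterion above does, the snippet below solves the minimization directly with NumPy and compares the result with scikit-learn. The data are simulated purely for illustration (the names `X_sim`, `y_sim`, and `beta_true` are assumptions, not part of the wage data). When the design matrix $X$ (with a leading column of ones) has full column rank, the minimizer can also be written as $\hat{\beta}^{\text{OLS}} = (X^\top X)^{-1}X^\top y$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data (an assumption for illustration, not the wage data)
rng = np.random.default_rng(0)
n, p = 200, 3
X_sim = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y_sim = 4.0 + X_sim @ beta_true + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so that beta_0 is estimated along with the slopes
X_design = np.column_stack([np.ones(n), X_sim])

# Minimize the sum of squared residuals directly
beta_hat, *_ = np.linalg.lstsq(X_design, y_sim, rcond=None)

# scikit-learn solves the same least-squares problem
ols = LinearRegression(fit_intercept=True).fit(X_sim, y_sim)

print(beta_hat)                   # [beta_0, beta_1, ..., beta_p]
print(ols.intercept_, ols.coef_)  # same values, split into intercept and slopes
```

Both approaches should agree up to numerical precision, since they minimize the same objective.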

Let's evaluate multivariate regression in an example setting.

## Wage Regression

We have a rather old dataset, namely a cross-sectional sample from the May 1985 Current Population Survey by the US Census Bureau. The data include hourly wages for 534 individuals, along with information on age, sex (0 for male, 1 for female), race (H for Hispanic, W for White, O for Other), years of education, et cetera. Let's take a look:


We start by loading libraries. Importantly, we will rely on [scikit-learn](https://scikit-learn.org/stable/) for running the regression. While there are more comfortable packages for the specific task of running a linear regression (formula-based, closer to the look and feel of `R`), scikit-learn is one of the most popular predictive modeling toolboxes and we will use it for many (!) models/algorithms throughout this course:

In [None]:
import numpy as np 
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import scipy.stats as st

To make the data available, you can clone my GitHub repository into your Colab notebook via:

In [None]:
!git clone https://github.com/danielbauer1979/ML_656.git

If you now list the content...

In [None]:
!ls

you should see `ML_656` listed. And we can pull the data from there:

In [None]:
wage_data = pd.read_csv('ML_656/Wages_1985_Current_Population_Survey.csv')
wage_data.head()  # the method-call syntax makes clear that wage_data is an object!

So we have the data available. Let's look at aggregate stats.

In [None]:
wage_data.describe()

OK, so let's run the wage regression using sklearn. For that we first have to recode our categorical variables.

In [None]:
wage_data['Race'].value_counts()

In [None]:
wage_data['Race_H'] = np.where(wage_data['Race'] == "H", 1, 0)
wage_data['Race_O'] = np.where(wage_data['Race'] == "O", 1, 0)

A more streamlined way of doing this is via `get_dummies` in pandas:

In [None]:
pd.get_dummies(wage_data['Race']).head()

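We could then add these in via `wage_data['Race_H'] = pd.get_dummies(wage_data['Race'])['H']` etc. Or we can get all our dummies at once via `X = pd.get_dummies(wage_data['Race'], drop_first=True)`. Note that `drop_first=True` drops the first level in sorted order (here `H`), so the baseline category differs from the manual coding above, which uses `W` as the baseline.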
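
To make the effect of `drop_first` concrete, here is a small sketch on a toy column that uses the same three race codes. The values in `toy` are made up for illustration, and the `prefix='Race'` argument is only there to give the columns readable names.

```python
import pandas as pd

# Toy column with the three race codes from the text (values are made up)
toy = pd.DataFrame({'Race': ['W', 'H', 'O', 'W', 'H']})

# One indicator column per level
all_levels = pd.get_dummies(toy['Race'], prefix='Race')

# drop_first=True removes the first level in sorted order (here 'H')
dropped = pd.get_dummies(toy['Race'], prefix='Race', drop_first=True)

# To use 'W' as the baseline instead, drop its column explicitly
w_baseline = all_levels.drop(columns='Race_W')

print(list(all_levels.columns))  # ['Race_H', 'Race_O', 'Race_W']
print(list(dropped.columns))     # ['Race_O', 'Race_W']
print(list(w_baseline.columns))  # ['Race_H', 'Race_O']
```

Whichever level is dropped serves as the reference category, so the remaining coefficients are interpreted relative to it.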

So let's define our features:



In [None]:
X1 = wage_data[['Sex','Age','Race_H','Race_O']]
y = wage_data['Wage']

Let's run our first linear regression model:

In [None]:
model1 = LinearRegression(fit_intercept=True)
model1.fit(X1, y)
print(model1.intercept_)
print(model1.coef_)

So the presentation of the regression results is not as neat as when using R or statsmodels. But as we will see `sklearn` will make it easy to build more advanced models. Again, that's its purpose. 
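
One way to make the output easier to read is to attach the feature names to the coefficients. The sketch below does this on simulated data; the frame `toy_X`, the response `toy_y`, and the coefficient values used to generate them are assumptions for illustration only, not estimates from the wage data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated features and response (hypothetical values, not the wage data)
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    'Sex': rng.integers(0, 2, size=100),
    'Age': rng.integers(18, 65, size=100),
})
toy_y = 5.0 - 2.0 * toy_X['Sex'] + 0.1 * toy_X['Age'] + rng.normal(size=100)

toy_model = LinearRegression(fit_intercept=True).fit(toy_X, toy_y)

# coef_ follows the column order of the training frame, so the names line up
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns, name='coefficient')
coef_table.loc['Intercept'] = toy_model.intercept_
print(coef_table)
```

The same pattern should carry over to `model1` and `X1` by replacing `toy_model` and `toy_X`.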

Let's run a second regression with additional features:

In [None]:
X2 = wage_data[['Sex','Age','Race_H','Race_O','Yrs_Ed','Sthrn_Rgn']]

In [None]:
model2 = LinearRegression(fit_intercept=True)
model2.fit(X2, y)
print(model2.intercept_)
print(model2.coef_)

Let's run the full regression:

In [None]:
Occup_d = pd.get_dummies(wage_data['Occup'], prefix='Occup', drop_first=True)
Race_d = pd.get_dummies(wage_data['Race'], prefix='Race', drop_first=True)
Sect_d = pd.get_dummies(wage_data['Sect'], prefix='Sect', drop_first=True)

In [None]:
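# Also drop the manually coded Race_H/Race_O columns added above; otherwise they
# duplicate the Race dummies below and become perfectly collinear with the intercept
X = wage_data.drop(columns=['Wage','Occup','Race','Sect','Race_H','Race_O'])
X = pd.concat([X,Occup_d,Race_d,Sect_d], axis=1)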

In [None]:
model3 = LinearRegression(fit_intercept=True)
model3.fit(X, y)
print(model3.intercept_)
print(model3.coef_)
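
A quick way to compare nested specifications like the ones above is the in-sample $R^2$, which `LinearRegression.score` returns. The sketch below uses simulated data (the frame `sim` and its coefficients are assumptions for illustration) to show that adding a feature cannot lower the in-sample $R^2$ of an OLS fit with an intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated data with hypothetical coefficients (not the wage data)
rng = np.random.default_rng(2)
n_obs = 300
sim = pd.DataFrame({
    'Age': rng.integers(18, 65, size=n_obs),
    'Yrs_Ed': rng.integers(8, 19, size=n_obs),
})
sim['Wage'] = (1.0 + 0.05 * sim['Age'] + 0.6 * sim['Yrs_Ed']
               + rng.normal(scale=2.0, size=n_obs))

# Smaller model: Age only; larger model: Age and Yrs_Ed
small = LinearRegression().fit(sim[['Age']], sim['Wage'])
large = LinearRegression().fit(sim[['Age', 'Yrs_Ed']], sim['Wage'])

# score() returns R^2 on the data passed in (here: the training data)
r2_small = small.score(sim[['Age']], sim['Wage'])
r2_large = large.score(sim[['Age', 'Yrs_Ed']], sim['Wage'])
print(f"R^2 small: {r2_small:.3f}, R^2 large: {r2_large:.3f}")
```

Because in-sample $R^2$ never decreases as features are added, it says little on its own about predictive performance on new data.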

And let's check the residuals.

In [None]:
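ypred = model3.predict(X)  # in-sample fitted values
eps = y - ypred            # residuals
plt.hist(eps, bins=30)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()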
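
Beyond the histogram, a few numerical checks on the residuals can be useful. The sketch below uses simulated data with deliberately right-skewed noise (an assumption for illustration, not a statement about the wage data). It shows two properties that hold for any OLS fit with an intercept: the residuals average to zero and are orthogonal to each feature. It also computes the sample skewness via `scipy.stats`.

```python
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LinearRegression

# Simulated data with right-skewed, mean-zero noise (hypothetical setup)
rng = np.random.default_rng(3)
n_obs = 500
X_sim = rng.normal(size=(n_obs, 2))
noise = rng.exponential(scale=1.0, size=n_obs) - 1.0
y_sim = 2.0 + X_sim @ np.array([1.0, -0.5]) + noise

fit = LinearRegression(fit_intercept=True).fit(X_sim, y_sim)
resid = y_sim - fit.predict(X_sim)

# With an intercept, OLS residuals average to zero (up to rounding)
resid_mean = resid.mean()

# OLS residuals are orthogonal to every feature column
cross_products = X_sim.T @ resid

# Sample skewness; values far from zero suggest asymmetric errors
resid_skew = st.skew(resid)

print(resid_mean, cross_products, resid_skew)
```

The same checks should apply to `eps` and `X` from the full regression, keeping in mind that the first two hold by construction and only the skewness is informative about the error distribution.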