<table style="width: 100%;" id="nb-header">
    <tr style="background-color: transparent;"><td>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </td><td>
        <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, EdX<br>
            Dr. Eric Van Dusen <br>
            Vaidehi Bulusu <br>
            Akhil Venkatesh</p>
    </td></tr>
</table>

# Lecture Notebook 8.2: Multiple Linear Regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
import statsmodels.api as sm
import warnings
warnings.simplefilter("ignore")
from datascience import *
%matplotlib inline
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
from ipywidgets import interact

# Multiple Linear Regression

In simple linear regression, we're only considering one independent variable. We're essentially assuming that this is the only variable that affects our dependent variable. Going back to the price and quantity demanded example, while price does have a huge effect on demand, there are other factors that also affect demand including income taxes, sales taxes, advertising, prices of related goods, etc. In the context of regression, because we are not including these variables in our model, they are called **omitted variables**.

Omitted variables cause 2 main problems:

- The regression parameters end up being inaccurate (biased) because of something called omitted variable bias. Your regression estimates for the slope and intercept are higher or lower than the actual values because of omitted variables (we are generally only concerned about the slope).


- They prevent you from inferring a causal association between the independent and dependent variables. In other words, you can't say that it's *because* of an increase in price that your quantity demanded decreased as there are so many other factors – like the change in the price of a related good – that could be causing the decrease in demand.

To reduce omitted variable bias, we take simple linear regression a step further: to multiple linear regression. In multiple linear regression, we include more independent variables – variables we think are confounding variables that we've omitted – so that their effects are no longer wrongly attributed to the variable we care about.
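To make omitted variable bias concrete, here is a minimal simulation sketch. Everything in it is hypothetical and not part of the CPS data: the variables `x1`, `x2`, the helper `ols`, and the true coefficients (2 on `x1`, 3 on `x2`) are made up for illustration. Because `x2` is correlated with `x1`, leaving it out pushes the estimated slope on `x1` away from 2 (toward roughly 2 + 3 × 0.8 = 4.4), while including it brings the estimate back close to 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: x2 is a confounder that is correlated with x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, *columns):
    """Return OLS coefficients (intercept first) using least squares."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

short = ols(y, x1)       # simple regression: x2 is omitted
full = ols(y, x1, x2)    # multiple regression: x2 is included

print("slope on x1 with x2 omitted: ", short[1])
print("slope on x1 with x2 included:", full[1])
```

The simple regression credits `x1` with part of the effect of `x2`, which is exactly the bias described above.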

Let's look at multiple linear regression in Python using a new dataset on earnings and various other factors (check out the [data description](https://wps.pearsoned.com/wps/media/objects/11422/11696965/empirical/empex_tb/CPS08_Description.pdf)).

In [None]:
cps = Table.read_table("CPS.csv")
cps.show(5)

Say we want to look at the relationship between age and earnings (the `ahe` column which is the average hourly earnings). We would expect whether or not a person has a bachelor's degree to be a confounding variable as those with a bachelor's degree typically earn more than those with only a high school degree. This is how we would do multiple linear regression:

In [None]:
x_2 = cps.select("bachelor", "age").values # This is how we include multiple independent variables in our model
y_2 = cps.column("ahe")
model_2 = sm.OLS(y_2, sm.add_constant(x_2))
result_2 = model_2.fit()
result_2.summary()
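Written out, the regression fitted above corresponds to the model below. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and the error term $\varepsilon$ are notation introduced here for illustration; they do not appear in the dataset.

$$ ahe = \beta_0 + \beta_1 \cdot bachelor + \beta_2 \cdot age + \varepsilon $$

Since `bachelor` comes first in `x_2`, `result_2.params[1]` is the estimate of $\beta_1$ (the difference in hourly earnings associated with a bachelor's degree, holding age fixed) and `result_2.params[2]` is the estimate of $\beta_2$ (the change in hourly earnings associated with one more year of age, holding degree status fixed).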

## Dummy Variables

One type of variable commonly used in econometrics is the dummy variable. A dummy variable takes a value of either 0 or 1 to indicate the presence or absence of a category. For example, take a variable `col`: it takes the value 1 to indicate that a person went to college and 0 to indicate that a person didn't go to college.

Let's do a regression of `ahe` on only `bachelor` (the dummy variable).

In [None]:
x_3 = cps.select("bachelor").values
y_3 = cps.column("ahe")
model_3 = sm.OLS(y_3, sm.add_constant(x_3))
result_3 = model_3.fit()
result_3.summary()

The coefficient (or slope) on `bachelor` is this:

In [None]:
result_3.params[1]

This means that the people in our sample with a bachelor's degree earn around $6.583 more per hour than those with only a high school degree.

An interesting fact about dummy variables is that we can calculate this coefficient another way:

In [None]:
# Filter for bachelor = 1 and find mean earnings
b_1_mean = np.mean(cps.where("bachelor", 1).column("ahe"))

# Filter for bachelor = 0 and find mean earnings
b_0_mean = np.mean(cps.where("bachelor", 0).column("ahe"))

# Take the difference in the mean earnings
diff = b_1_mean - b_0_mean
diff

In [None]:
np.round(result_3.params[1], 5) == np.round(diff, 5)

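These two values are pretty much the same! This is because, in a regression with an intercept and a single dummy x-variable, the coefficient on the dummy is equal to the mean of the y-variable when x = 1 minus the mean of the y-variable when x = 0. The intercept is the mean of the y-variable when x = 0.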
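The same fact can be checked on made-up data without any regression library. The sketch below is only an illustration: the dummy `d`, the outcome `y`, and the numbers used to generate them are hypothetical. It fits the regression with `np.linalg.lstsq` and compares the slope to the difference in group means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical 0/1 dummy and an outcome that depends on it
d = rng.integers(0, 2, size=n)
y = 15.0 + 6.0 * d + rng.normal(scale=4.0, size=n)

# Regress y on a constant and the dummy
X = np.column_stack([np.ones(n), d])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

# Difference in group means, computed directly
mean_1 = y[d == 1].mean()
mean_0 = y[d == 0].mean()
diff = mean_1 - mean_0

print(np.isclose(slope, diff))        # slope equals the difference in means
print(np.isclose(intercept, mean_0))  # intercept equals the mean of the d = 0 group
```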

### `pd.get_dummies()`

We can convert categorical variables in our dataset to dummy variables using the `pd.get_dummies()` function. This function returns a table with one column per category, where each row has a 1 in the column for the category that observation belongs to and a 0 everywhere else. Here's an example of converting the year values to dummies. A 1 indicates that the person was surveyed in that year (e.g. the first 4 people in the dataset were surveyed in 1992 according to the table below).

In [None]:
year = cps.column("year")
np.unique(year)

In [None]:
year_dummies = pd.get_dummies(year)
year_dummies.head()

Let's do the same for the age variable. Take the first row as an example: since the 1 is in the column labeled 29, that person is 29 years old.

In [None]:
age = cps.column("age")
np.unique(age)

In [None]:
age_dummies = pd.get_dummies(age)
age_dummies.head()

### Dummy Variable Trap

One problem you may run into when using dummy variables is called the dummy variable trap. This happens when you include a dummy variable for every category of a categorical variable: for example, you include `col`, which takes on a value of 1 if the person went to college, as well as `notcol`, which takes on a value of 1 if the person didn't go to college.

This causes redundancy as you can express one independent variable as a linear combination of another independent variable. In this case:

$$ notcol = 1 - col$$

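This means that there is a perfect (negative) correlation between these two independent variables, which results in perfect multicollinearity. With perfect multicollinearity, the regression cannot separate the effects of the redundant variables, so the coefficients are not uniquely determined. More generally, multicollinearity occurs any time there is a high correlation between the independent variables in your model. It makes your regression estimates imprecise and unstable.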
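As a small illustration of this redundancy, the sketch below builds a made-up `col` column and checks the rank of the design matrix with and without `notcol`. The six observations are hypothetical. A design matrix with three columns but rank 2 means one column carries no new information.

```python
import numpy as np

# Hypothetical dummy: 1 if the person went to college, 0 otherwise
col = np.array([1, 0, 1, 1, 0, 0])
notcol = 1 - col
const = np.ones_like(col)

# Design matrix that falls into the dummy variable trap
X_trap = np.column_stack([const, col, notcol])
# Design matrix with one dummy dropped
X_ok = np.column_stack([const, col])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
corr = np.corrcoef(col, notcol)[0, 1]

print(X_trap.shape[1], rank_trap)  # 3 columns, but rank 2
print(X_ok.shape[1], rank_ok)      # 2 columns, rank 2
print(corr)                        # correlation between col and notcol
```

Note that `const` equals `col + notcol` in every row, which is why adding `notcol` to a model that already has an intercept and `col` adds nothing.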

Let's see what happens in Python when multicollinearity happens:

In [None]:
year_dummies["ahe"] = np.array(cps.column("ahe"))
year_dummies.head()

In [None]:
x_3 = year_dummies[[1992, 2008]] # We are including dummy variables for each value of year
y_3 = year_dummies["ahe"]
model_3 = sm.OLS(y_3, sm.add_constant(x_3))
result_3 = model_3.fit()
result_3.summary()

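`statsmodels` detected the multicollinearity and gave you a warning in the regression summary. A solution is to just drop one of the dummy variables. The dropped category then serves as the baseline, and the coefficient on each remaining dummy is measured relative to it.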
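One way to drop a dummy is to pass `drop_first=True` to `pd.get_dummies()`. The sketch below uses a short made-up `year` series rather than the CPS data, so the values are hypothetical. It shows that only the 2008 column is kept and that the resulting design matrix (constant plus one dummy) has full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical survey years, standing in for the year column
year = pd.Series([1992, 1992, 2008, 2008, 1992, 2008], name="year")

all_dummies = pd.get_dummies(year)               # one column per year
kept = pd.get_dummies(year, drop_first=True)     # first category (1992) is dropped

# Design matrix: constant plus the remaining dummy
X = np.column_stack([np.ones(len(year)), kept.to_numpy(dtype=float)])

print(list(all_dummies.columns))
print(list(kept.columns))
print(np.linalg.matrix_rank(X) == X.shape[1])    # full column rank
```

In the regression above, the equivalent fix would be to keep only one of the two year columns as the independent variable, so that the omitted year becomes the baseline.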