# Multiple Linear Regression

In this notebook are some exercises to gain more experience with the material presented in the Multiple Linear Regression lecture. You'll get some practice fitting models, and gain a stronger theoretical understanding of the technique as well. We'll also introduce some new important concepts that weren't explicitly covered in the lecture.

In [2]:
# import the packages we'll use
## For data handling
import pandas as pd
import numpy as np
from numpy import meshgrid

## For plotting
import matplotlib.pyplot as plt
import seaborn as sns

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

from pandas.plotting import scatter_matrix

import statsmodels.api as sm

## Theoretical Questions

##### 1. Gradient Descent

While the normal equation gives the OLS estimate of $\hat{\beta}$, it is not always ideal to use it directly. If there are many features, computing the matrix inverse can be computationally costly.

One alternative to the normal equation is to perform gradient descent.

Let $\ell(\beta)$ denote a loss function. 

If we remember some Calculus III, we'll recall that for a particular value of $\beta$, say $\beta^*$, the direction of greatest descent for $\ell$ at $\beta^*$, i.e. the direction in which $\ell$ decreases most quickly, is the opposite direction of the gradient, $-\nabla \ell(\beta^*)$. You can thus approach the value of $\beta$ that minimizes $\ell$ by iteratively updating $\beta$, moving in $\alpha$-sized steps in the direction of greatest descent.
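As a toy illustration of that idea (deliberately not the regression loss, so the exercise that follows stays open), this sketch runs fixed-step gradient descent on $\ell(\beta) = \|\beta - c\|^2$, whose gradient is $2(\beta - c)$. The function name `gradient_descent`, the point `c`, and the values of `alpha` and `n_steps` are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad, beta_start, alpha=0.1, n_steps=200):
    """Repeatedly move against the gradient, scaled by the step size alpha."""
    beta = np.asarray(beta_start, dtype=float)
    for _ in range(n_steps):
        beta = beta - alpha * grad(beta)
    return beta

# Hypothetical toy loss: l(beta) = ||beta - c||^2, minimized at beta = c
c = np.array([3.0, -1.0])

def toy_grad(beta):
    # Gradient of ||beta - c||^2 with respect to beta
    return 2 * (beta - c)

beta_min = gradient_descent(toy_grad, beta_start=[0.0, 0.0])
print(beta_min)
```

With a step size that is too large the iterates can overshoot and diverge, so `alpha` usually needs some tuning for a given loss.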

Write out an algorithm (in mathematical symbols not code) that leverages the gradient of the loss function to find the optimal $\hat{\beta}$ for multiple linear regression.

##### 2. Statistical Significance of the Model

One kind of explanatory modeling question we may be interested in is whether or not the model is statistically significant. Assuming that we have $m$ features this corresponds to the following hypothesis test:
$$
\text{H}_0: \beta_1 = \beta_2 = \dots = \beta_m = 0 \text{ vs.}
$$

$$
\text{H}_1: \text{ at least one of } \beta_i \neq 0.
$$
This test allows you to say whether any of your predictors are significantly associated with the target $y$, when compared to the baseline model of $y=E(y)$. In non-statistical terms we're asking the question: "Does my regression model contain at least one feature that helps explain what I see in the target variable?"

Let's see how we can perform this test using `statsmodels`.

Suppose I fit a multiple linear regression for the following data:

In [2]:
X = np.random.randn(100,2)
y = 2 + 3*X[:,0] + np.random.randn(100) 

In [3]:
fit = sm.OLS(y, sm.add_constant(X)).fit()

print(fit.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     471.9
Date:                Sat, 03 Oct 2020   Prob (F-statistic):           1.04e-50
Time:                        11:32:53   Log-Likelihood:                -137.07
No. Observations:                 100   AIC:                             280.1
Df Residuals:                      97   BIC:                             288.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.9121      0.097     19.685      0.0

Let's take a closer look at the table.

<img src = "F_stat.png" style="width:60%"></img>

The circled portion of the table shows the $F$-statistic for the above hypothesis test and its associated $p$-value. As we should expect here, the $p$-value is extremely low, so we reject the null hypothesis in favor of the alternative.
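To see where such a number comes from, here is a minimal `numpy` sketch that recomputes the overall $F$-statistic by comparing the fitted model with the baseline that always predicts $\overline{y}$: the drop in squared error per feature, divided by the residual squared error per residual degree of freedom. The data are freshly simulated with an assumed seed, so the value will not match the table above. On a fitted `statsmodels` results object the corresponding quantities are available as `fit.fvalue` and `fit.f_pvalue`.

```python
import numpy as np

rng = np.random.default_rng(440)  # assumed seed, for reproducibility only
n, m = 100, 2
X = rng.standard_normal((n, m))
y = 2 + 3 * X[:, 0] + rng.standard_normal(n)

# Least squares fit with an intercept column
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta_hat

ss_total = np.sum((y - y.mean()) ** 2)  # squared error of the baseline model
ss_resid = np.sum(resid ** 2)           # squared error of the fitted model

# m numerator degrees of freedom, n - m - 1 residual degrees of freedom
f_stat = ((ss_total - ss_resid) / m) / (ss_resid / (n - m - 1))
r_squared = 1 - ss_resid / ss_total
print(f_stat, r_squared)
```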

Now return to the beer data and use `statsmodels` to fit a multiple linear regression for the following model:
$$
\text{IBU} = \beta_0 + \beta_1 \text{ABV} + \beta_2 \text{Stout} + \beta_3 \text{Stout} \times \text{ABV}.
$$

Perform the above hypothesis test and interpret the results.

In [None]:
## Code here




##### 3. The General Linear $F$-Test or Partial $F$-Test

The statistical test we showed in problem 2. is not all that useful for determining if specific variables are useful for explaining the variance we see in the target data.

You can always perform a hypothesis test for a single coefficient ($\text{H}_0: \beta_i = 0$ vs. $\text{H}_1: \beta_i \neq 0$) by examining the confidence interval associated with that coefficient. If the interval contains $0$ you know that the variable is not statistically significant at that confidence level (standard level is $95\%$).

Another test you may wish to perform is the General Linear $F$-Test (also called partial $F$-Test). In this test you are asking the question "Is my target related to this collection of variables?". For example suppose you're interested in testing $\beta_3$, $\beta_4$, and $\beta_5$:
$$
\text{H}_0: \beta_3 = \beta_4 = \beta_5 = 0 \text{ vs.}
$$

$$
\text{H}_1: \text{ at least one of } \beta_i, \ i = 3,4,5 \text{ is nonzero}
$$

This test is useful when you're interested in the effect of a categorical variable with more than two categories. Such a variable enters the model as several dummy variables, so you can't just test whether one of them has a significant effect; you must test all of them concurrently. Thus the confidence interval procedure mentioned above doesn't work.

In order to perform this test you need two models, a <i>full model</i> that must at least contain the variables you're interested in testing, and a <i>reduced model</i> that is the full model with the variables you're interested in removed.
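Given the two fits, the test statistic is
$$
F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}})/q}{\text{RSS}_{\text{full}}/(n-p-1)},
$$
where $\text{RSS}$ is a model's residual sum of squares, $q$ is the number of coefficients being tested, and $p$ is the number of features in the full model. Here is a minimal `numpy` sketch on simulated data; the helper `rss`, the seed, and the data-generating choices are assumptions made for the example.

```python
import numpy as np

def rss(features, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), features])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta_hat) ** 2)

rng = np.random.default_rng(440)  # assumed seed
n = 600
x1 = 2 * rng.standard_normal(n) - 1
group = np.repeat(["a", "b", "c"], n // 3)
a = (group == "a").astype(float)  # dummy for category a
b = (group == "b").astype(float)  # dummy for category b
y = 1 + 2 * x1 - 2 * a + rng.standard_normal(n)

rss_full = rss(np.column_stack([x1, a, b]), y)  # full model, p = 3 features
rss_reduced = rss(x1.reshape(-1, 1), y)         # dummies removed, q = 2
p, q = 3, 2

f_partial = ((rss_reduced - rss_full) / q) / (rss_full / (n - p - 1))
print(f_partial)
```

A large value means that dropping the tested variables makes the fit noticeably worse than expected by chance.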

We'll now show how to do this test with `statsmodels`. 

For this example the full model is:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 \text{a} + \beta_3 \text{b}.
$$

The reduced model is:
$$
y = \beta_0 + \beta_1 x_1.
$$

In [4]:
# Here I create some fake data
X = np.zeros((600,2))
np.random.seed(440)

# The first column is a continuous feature
X[:,0] = 2*np.random.randn(600)-1

# The second is categorical
X[:200,1] = 1
X[200:400,1] = 2
X[400:,1] = 3

# y = 1 + 2*x_1 - 2*1_{x_2 == 1} + epsilon
y = 1 + 2*X[:,0] + np.random.randn(600)
y[X[:,1] == 1] = y[X[:,1] == 1] - 2

In [5]:
# Now I put it in a dataframe
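df = pd.DataFrame({'y':y,'x1':X[:,0],'x2':X[:,1]})

# Relabel the numeric codes 1, 2, 3 as the categories a, b, c
df['x2'] = df['x2'].map({1.0:'a', 2.0:'b', 3.0:'c'})

# Make dummy variables
# (cast to int so that recent pandas versions give numeric, not boolean, columns)
df[['a','b']] = pd.get_dummies(df['x2'])[['a','b']].astype(int)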

In [6]:
# First you fit the full model
fit = sm.OLS(df['y'],sm.add_constant(df[['x1','a','b']])).fit()

# Then we type out the specific hypothesis we're
# testing for the reduced model as a string
hypotheses = 'a=b=0'

# You can then call the f_test with that hypothesis
f_test = fit.f_test(hypotheses)

print("The F-statistic is",f_test.fvalue, 
      "which has an associated p-value of", f_test.pvalue)


The F-statistic is [[244.13578899]] which has an associated p-value of 3.566476961458313e-78


The results of this test inform us that there is very strong statistical evidence of the categorical variable having an effect on the target `y`. Thus it would be wise to leave it in the model for explanatory purposes.

Return to the `carseats` data set and build a full model that includes the `ShelveLoc` variable. Then perform a partial $F$-test to see if there is evidence that shelf location has an effect on `Sales`.

In [1]:
## Code here or write here





##### 4. Sum of Squares Table

Recall that the goal of explanatory modeling is to help explain the target data. This means trying to explain the variance of the target data.

One way we can examine this is by looking at the <i>Sum of Squares Table</i>.

Recall that the variation of $y$ around its mean $\overline{y}$ is measured by the total sum of squares,
$$
\sum_{i=1}^n \left(y_i - \overline{y}\right)^2 = \text{SST}.
$$
It can be shown that:
$$
\text{SST} = \sum_{i=1}^n \left(\hat{y}_i - \overline{y} \right)^2 + \sum_{i=1}^n\left(\hat{y}_i - y_i\right)^2 = \text{SSM} + \text{SSR},
$$
where $\text{SSM}$ denotes the model sum of squares and $\text{SSR}$ denotes the residual sum of squares.

Note that $\text{SSR} = n \, \text{MSE}$, so $\text{SSR}/n$ estimates $\sigma^2$. The $\text{SSM}$ term measures the variation in $y$ that is explained by the model.
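The decomposition is easy to verify numerically. This sketch, on simulated data with an assumed seed, fits OLS with an intercept (which the identity relies on) and checks that $\text{SST} = \text{SSM} + \text{SSR}$ up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n = 200
X = rng.standard_normal((n, 2))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)  # noise variance is 1

# OLS fit with an intercept column
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta_hat

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssm = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares
ssr = np.sum((y_hat - y) ** 2)         # residual sum of squares
mse = ssr / n

print(sst, ssm + ssr)
print(mse)
```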

Further, for multiple linear regression $\text{SSM}$ can be broken down into contributions to the variance from each individual variable.

Let's demonstrate how to get the sum of squares table using `statsmodels`.


In [8]:
# Here I create some fake data
np.random.seed(440)
X = np.zeros((600,2))

# The first column is a continuous feature
X[:,0] = 2*np.random.randn(600)-1

# The second is categorical
X[:200,1] = 1
X[200:400,1] = 2
X[400:,1] = 3

# y = 1 + 2*x_1 - 2*1_{x_2 == 1} + epsilon
y = 1 + 2*X[:,0] + np.random.randn(600)
y[X[:,1] == 1] = y[X[:,1] == 1] - 2

In [9]:
# Now I put it in a dataframe
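df = pd.DataFrame({'y':y,'x1':X[:,0],'x2':X[:,1]})

# Relabel the numeric codes 1, 2, 3 as the categories a, b, c
df['x2'] = df['x2'].map({1.0:'a', 2.0:'b', 3.0:'c'})

# Make dummy variables
# (cast to int so that recent pandas versions give numeric, not boolean, columns)
df[['a','b']] = pd.get_dummies(df['x2'])[['a','b']].astype(int)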

In [10]:
# First you fit the model
# note that we use ols instead of sm.OLS
from statsmodels.formula.api import ols

# for ols you write the regression formula
# as a string
fit = ols('y ~ x1 + a + b', df).fit()


sm.stats.anova_lm(fit)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| x1 | 1.0 | 9821.083158 | 9821.083158 | 9566.082414 | 0.0 |
| a | 1.0 | 500.690058 | 500.690058 | 487.689829 | 2.033895e-79 |
| b | 1.0 | 0.597257 | 0.597257 | 0.581749 | 0.4459292 |
| Residual | 596.0 | 611.887428 | 1.026657 | NaN | NaN |


In the above table the `sum_sq` column sums to $\text{SST}$. Each entry in that column is the contribution of that variable (or of the residuals) to $\text{SST}$, where each variable is credited with the variation it explains beyond the variables listed above it.
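The rows of such a table can be reproduced by fitting nested models: a term's `sum_sq` is the drop in residual sum of squares when that term is added to the ones before it, which is how `anova_lm` builds its default (type I) table. Here is a minimal `numpy` sketch of that bookkeeping; the helper `rss`, the seed, and the simulated data are assumptions, so the numbers differ from the table above.

```python
import numpy as np

def rss(columns, y):
    """Residual sum of squares of an OLS fit on an intercept plus `columns`."""
    A = np.column_stack([np.ones(len(y))] + list(columns))
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta_hat) ** 2)

rng = np.random.default_rng(440)  # assumed seed
n = 600
x1 = 2 * rng.standard_normal(n) - 1
a = np.repeat([1.0, 0.0, 0.0], n // 3)  # dummy for the first 200 rows
b = np.repeat([0.0, 1.0, 0.0], n // 3)  # dummy for the middle 200 rows
y = 1 + 2 * x1 - 2 * a + rng.standard_normal(n)

sst = rss([], y)  # intercept-only model: its RSS is SST
terms = {"x1": x1, "a": a, "b": b}

seq_ss, used, previous = {}, [], sst
for name, column in terms.items():
    used.append(column)
    current = rss(used, y)
    seq_ss[name] = previous - current  # extra variation explained by this term
    previous = current
seq_ss["Residual"] = previous

print(seq_ss)
```

Because the contributions are sequential, listing the terms in a different order can change how the explained variation is split among them, while the total stays the same.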

<b>Note: Sums of squares are quite susceptible to the scale of the data. So a variable with a very large scale (say the 1000s) may seem more important than another variable with a very small scale (say 1/10s). Thus it is important to scale your data before fitting the model.</b>

Return to the `beer` data, then create the sum of squares table for the following model:
$$
\text{IBU} = \beta_0 + \beta_1 \text{ABV} + \beta_2 \text{Stout} + \beta_3 \text{ABV} \times \text{Stout} + \epsilon.
$$

In [11]:
## Code here or write here




In [12]:
## Code here or write here








                            OLS Regression Results                            
Dep. Variable:                    IBU   R-squared:                       0.493
Model:                            OLS   Adj. R-squared:                  0.489
Method:                 Least Squares   F-statistic:                     111.3
Date:                Sun, 25 Apr 2021   Prob (F-statistic):           2.39e-50
Time:                        16:49:34   Log-Likelihood:                -1478.1
No. Observations:                 347   AIC:                             2964.
Df Residuals:                     343   BIC:                             2980.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.4825      6.010      1.744      0.0

##### 5. Interpreting Interaction Terms

Look at the model you just fit for the `beer` data. Try to interpret the estimates of the various coefficients.

##### 6. Model Selection Algorithms

Here we'll describe two additional model selection algorithms.

##### Forwards Selection

Say you have $m$ predictors $X_1,\dots,X_m$, and a target $y$. 

Starting with an empty model you build $m$ simple linear regression models and then choose the one with lowest testing error (for instance by looking at the average cv error). Call this model $1$. 

Take model $1$ and go through the remaining $m-1$ features and add them one at a time to model $1$. This will give you $m-1$ two feature models. Look at the one with lowest testing error, call it model $2$. If model $2$ has lower testing error than model $1$ continue in this way and look at the remaining $m-2$ predictors. If model $1$ has the lower testing error you stop and model $1$ is the model you choose.

You continue until you either find a model with lowest testing error (for example, if model $3$ had lower testing error than model $4$ you choose model $3$), or until you have built the model regressing $y$ on all of $X_1,\dots,X_m$.

##### Backwards Selection

This algorithm is sort of the opposite of forwards selection.

Again say you have $m$ predictors $X_1,\dots,X_m$ and a target $y$.

Starting with the model regressing $y$ on all of $X_1,\dots,X_m$, remove each of the $X_i$ predictors one at a time, regressing $y$ on the remaining $m-1$ features. If one of these models has lower testing error than the full model take it and call it model $1$. If none of those models has lower testing error than the full model stick with the full model.

Take model $1$ and remove each of its remaining $m-1$ predictors one at a time, regressing $y$ on the remaining $m-2$ features. If one of those models has lower testing error than model $1$ take it and call it model $2$. If none of those models has lower testing error than model $1$ stick with model $1$.

Continue in this way until you have a reduced model with lowest testing error, or until you end up with the model with no predictors, i.e. $y = E(y)$.
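Both procedures repeatedly ask for the testing error of a candidate set of features. One way to estimate it is $k$-fold cross-validation; here is a minimal `numpy` sketch of such a scoring function. The name `cv_mse`, the choice of 5 folds, the seeds, and the simulated data are assumptions for illustration, and the selection loop itself is left to you.

```python
import numpy as np

def cv_mse(X, y, feature_idx, n_folds=5, seed=440):
    """Average k-fold CV mean squared error of OLS on the chosen columns."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = []
    for holdout in folds:
        train = np.setdiff1d(np.arange(n), holdout)
        # Design matrices: intercept plus the selected feature columns
        A_train = np.column_stack([np.ones(len(train)), X[train][:, feature_idx]])
        A_test = np.column_stack([np.ones(len(holdout)), X[holdout][:, feature_idx]])
        beta_hat, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[holdout] - A_test @ beta_hat) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
y = 1 + 2 * X[:, 0] + rng.standard_normal(300)  # only column 0 matters

print(cv_mse(X, y, [0]), cv_mse(X, y, [1]), cv_mse(X, y, [0, 1, 2]))
```

Using the same `seed` for every candidate keeps the folds identical, so differences in error come from the features and not from the split.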

###### Greedy Algorithms

These are both <i>greedy algorithms</i> because at each step you take the move that benefits you the most in the moment, but you don't explore suboptimal paths that may be better in the long run. While you may not get the best model, you accept a model that is close to the best in exchange for a much shorter search. Forwards selection fits at most $m + (m-1) + \dots + 1 = m(m+1)/2$ models, and backwards selection a comparable number, as opposed to the $2^m$ models required for the brute force approach.

#### The Problem

Choose one of either forwards or backwards selection and program the algorithm to build a model to predict `Sales` from the `Advertising` data.

## Applied Questions

##### 1. Gradient Descent in Action

Using your answer to question 1 under Theoretical Questions, use `numpy` to perform gradient descent and fit the multiple linear regression model for the following data. <i>Note: If you didn't do that question, you can wait to try this one until after I provide solutions :)</i>

In [None]:
## Code here





In [None]:
## Code here





In [None]:
## Code here





In [None]:
## Code here





Now compare the output of your code to the model you get using `SGDRegressor`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html">https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html</a>.

In [None]:
## Code here





In [None]:
## Code here





##### 2. Build the Best Predictive Model You Can

Return to the data set from `PredictiveModelingAssessmentData.csv`. Use cross-validation to build the best predictive model you can to predict the `y` variable. As a hint, theoretically the best you can do is an average cross-validation root mean squared error of $1$.

In [None]:
## Code here





In [None]:
## Code here





In [None]:
## Code here





In [None]:
## Code here





This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)