# 3: Multiple Linear Regression Exercise

### Getting Started
#### Import Libraries 
We import our standard libraries and the specific objects we need at the top level of our notebook. By importing only specific objects from key modules, such as `ISLP.models`, we keep our *namespace* more organized.

In [None]:
# Import libraries and objects
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings # for muting warning messages
# mute warning messages
warnings.filterwarnings('ignore')
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)

Let's take a look at the `Boston` data set.

In [None]:
# Load the "Boston" dataset using the "load_data" function from the ISLP package
Boston = load_data('Boston')
Boston

Hint: Use `Boston.info()` or `Boston.describe()` to find out more about the dataset.

### Multiple Linear Regression

To fit a multiple linear regression model using least squares, we again use the `ModelSpec()` transform to construct the required model matrix and response. The arguments to `ModelSpec()` can be quite general, but in this case a list of column names is fine. We consider a fit here with the two variables `rm` and `nox`.

In [None]:
y = Boston['medv']
X = MS(['rm', 'nox']).fit_transform(Boston)
model1 = sm.OLS(y, X)
results1 = model1.fit()
summarize(results1)

Notice how we have compacted the construction of `X` into a single succinct expression on the second line.
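To make the role of the model matrix concrete, here is a minimal sketch of what a model matrix with an intercept looks like and how least squares uses it. It relies on NumPy and a small synthetic data set rather than `ModelSpec()` and `Boston`; the names `x1`, `x2`, and `beta_true` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration, not the Boston data)
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
beta_true = np.array([2.0, 1.5, -0.5])  # intercept, slope for x1, slope for x2

# Model matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones(n), x1, x2])
y = X @ beta_true

# Least squares estimate: the beta that minimizes ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 3))
```

Because the synthetic response has no noise, the estimate should match `beta_true` up to floating-point error. With real data such as `Boston`, the estimates carry sampling error, which is what the standard errors in `summarize()` quantify.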

The `Boston` data set contains 12 predictors, so it would be cumbersome to type all of them in order to perform a regression using every predictor. Instead, we can use the following shorthand:

In [None]:
terms = Boston.columns.drop('medv')
terms

We can now fit the model with all the variables in `terms` using the same model matrix builder.

In [None]:
X = MS(terms).fit_transform(Boston)
model = sm.OLS(y, X)
results = model.fit()
summarize(results)

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, `age` has a high $p$-value, so we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except `age`.

In [None]:
minus_age = Boston.columns.drop(['medv', 'age']) 
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())

### Qualitative Predictors

Here we use the `Boston` data again. 

We can examine the relationship between `medv` and `chas`, where

\begin{align*}
\text{chas} = \left\{\begin{array}{ll}
1 & \text { if tract bounds Charles River} \\
0 & \text { if not}
\end{array}\right.
\end{align*}


In [None]:
# Perform regression
model = sm.OLS.from_formula('medv ~ chas', data=Boston)
result = model.fit()

# Print the summary of the regression
print(result.summary())

$\hat \beta_0 = 22.094$: the average median house value for tracts that do not bound the Charles River.

$\hat \beta_1 = 6.346$: the difference in the average median house value between tracts that bound the Charles River and those that do not.
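This interpretation holds for any regression on a single 0/1 predictor: the intercept equals the mean response of the group coded 0, and the slope equals the difference in group means. The sketch below checks this on a small made-up data set (the arrays `d` and `y` are hypothetical, not taken from `Boston`).

```python
import numpy as np

# Hypothetical toy data: a 0/1 indicator and a numeric response
d = np.array([0, 0, 0, 0, 1, 1, 1])
y = np.array([20.0, 22.0, 24.0, 22.0, 27.0, 29.0, 28.0])

# Model matrix with an intercept column and the dummy column
X = np.column_stack([np.ones(len(d)), d])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

mean0 = y[d == 0].mean()  # average response when the dummy is 0
mean1 = y[d == 1].mean()  # average response when the dummy is 1

print(b0, mean0)          # intercept vs. group-0 mean
print(b1, mean1 - mean0)  # slope vs. difference in group means
```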

### Interaction Term

Let's look at the relationship between the response `medv` and the predictors `lstat` (the percent of households with low socioeconomic status) and `age` (the percent of homes built prior to 1940). We can also include the interaction term between `lstat` and `age`.

The formula syntax used to implement this is `'y ~ x1 + x2 + x1:x2'`, or `'y ~ x1 * x2'` for shorthand, passed to `sm.OLS.from_formula()` together with the data.

In [None]:
model = sm.OLS.from_formula('medv ~ lstat * age', data=Boston)
result = model.fit()

# Print the summary of the regression
print(result.summary())

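The interaction term has a $p$-value of $0.025$. Even though the coefficient for `age` is not significant on its own, we still include it in our model because of the hierarchical principle: if a model contains an interaction, it should also contain the corresponding main effects.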
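To see what the `*` shorthand expands to, the following sketch builds the interaction column by hand with NumPy. The data are synthetic and noise-free, and the names `x1`, `x2`, `beta_true`, and `slope_x1` are hypothetical; this illustrates the idea and is not how `from_formula()` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# 'y ~ x1 * x2' corresponds to the columns: intercept, x1, x2, and x1:x2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true  # noise-free response, for illustration only

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def slope_x1(x2_value):
    """Effect of a one-unit increase in x1 when x2 is held at x2_value."""
    return beta_hat[1] + beta_hat[3] * x2_value


print(slope_x1(0.0), slope_x1(2.0))
```

With an interaction in the model, the effect of `x1` is no longer a single number: it changes with the value of `x2`, which is why the main effects are kept alongside the interaction.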

### Helpful plots

There are a few plots that we discussed that can help to identify problems with our data or with our fit.

In [None]:
model = sm.OLS.from_formula('medv ~ rm', data=Boston)
result = model.fit()

# Plot the diagnostic plots side by side
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs. Fitted Values Plot
ax[0].scatter(result.fittedvalues, result.resid, alpha=0.8)
ax[0].axhline(0, color='k', ls='--')  # reference line at zero
ax[0].set_xlabel('Fitted Values')
ax[0].set_ylabel('Residuals')

# Studentized Residuals vs. Fitted Values Plot
ax[1].scatter(result.fittedvalues, result.get_influence().resid_studentized_internal, alpha=0.8)
ax[1].set_xlabel('Fitted Values')
ax[1].set_ylabel('Studentized Residuals')

plt.show()

In [None]:
# Fit a linear regression model
model = sm.OLS.from_formula('medv ~ rm', data=Boston)
result = model.fit()

# Get the predicted values and studentized residuals
predicted_values = result.predict()
studentized_residuals = result.get_influence().resid_studentized_internal

# Plot the Studentized Residuals vs. Fitted Values
plt.scatter(predicted_values, studentized_residuals, alpha=0.8)
plt.xlabel('Fitted Values')
plt.ylabel('Studentized Residuals')
plt.title('Studentized Residuals vs. Fitted Values')
plt.show()

***What information about our fitted model can you gather from these plots? Are there any outliers or high leverage points?***
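As background for the question above, leverage and internally studentized residuals can be computed by hand from the hat matrix. The sketch below does this with NumPy on synthetic data in which one observation is deliberately placed far from the rest in `x`; the data and variable names are hypothetical. In statsmodels, the corresponding quantities are available from `result.get_influence()` as `hat_matrix_diag` and `resid_studentized_internal`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
x[0] = 8.0  # one point placed far from the others in x (hypothetical)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
p = X.shape[1]  # number of estimated coefficients, including the intercept

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

resid = y - H @ y                             # ordinary residuals
sigma_hat = np.sqrt(resid @ resid / (n - p))  # residual standard error

# Internally studentized residuals: each residual scaled by its own standard error
studentized = resid / (sigma_hat * np.sqrt(1.0 - leverage))

print(leverage.argmax(), leverage.sum())
```

The leverages always sum to the number of coefficients, so the average leverage is `p / n`; observations well above that average deserve a closer look. Studentized residuals with large absolute values (a common rule of thumb is above 3) suggest possible outliers.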

***Fit a linear regression model on the `Boston` data set including all the predictors, using `MS(Boston.columns.drop('medv'))` as shown earlier. Interpret the summary, including the hypothesis tests for the coefficients and the RSE and $R^2$ values. Make plots of the fit including confidence intervals for the fitted line. Recreate and interpret the diagnostic plots we have just made using your new fit.***
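As a reminder of what the RSE and $R^2$ measure, the sketch below computes both from the residual and total sums of squares. It uses NumPy and synthetic data (all names and values are hypothetical), so it does not give the answer for `Boston`. For a fitted statsmodels OLS result, the same quantities should be available as `np.sqrt(results.scale)` and `results.rsquared`.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 2  # p counts predictors, not the intercept
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Z])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

rss = np.sum((y - fitted) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares

rse = np.sqrt(rss / (n - p - 1))   # residual standard error
r_squared = 1.0 - rss / tss        # proportion of variance explained
print(rse, r_squared)
```

The RSE is in the units of the response and estimates the typical size of a residual, while $R^2$ is unitless and lies between 0 and 1 for a least squares fit with an intercept.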

Ask for help if you get stuck!


*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*. Springer, 2023.