# Lecture 23: Inference in Multiple Linear Regression 
***

We'll need NumPy, Matplotlib, Pandas, scipy.stats, and statsmodels for this notebook, so let's load them. 

In [None]:
import numpy as np 
from scipy import stats
import statsmodels.api as sm 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

### Exercise 1 - The Problem of Multiple Comparisons 
*** 

In this exercise we'll explore the **Problem of Multiple Comparisons** (PMC).  In multiple linear regression, the PMC rears its ugly head when you perform inference on a large number of parameters using a single data set.  In particular, if we test a large number of features for a statistically significant relationship with the response, we're very likely to find evidence that at least one feature is important purely by random chance, even when none of the features are important. This is what motivates the use of the full and partial $F$-tests instead. 
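
To get a feel for the size of the effect, here is a minimal sketch of the chance of at least one false positive when every null hypothesis is true. It assumes the $m$ tests are independent and each is run at level $\alpha$, so the probability is $1 - (1-\alpha)^m$. The individual $t$-tests in a regression are not exactly independent, so treat this as an approximation; the helper name `familywise_error_rate` is ours, not part of any library.

```python
# Probability of at least one false positive among m independent tests,
# each run at significance level alpha, when every null hypothesis is true.
alpha = 0.05

def familywise_error_rate(m, alpha=alpha):
    """Return 1 - (1 - alpha)^m (assumes independent tests)."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 10, 20):
    rate = familywise_error_rate(m)
    print(f"m = {m:2d} tests -> P(at least one false positive) = {rate:.3f}")
```

With $m = 20$ features and $\alpha = 0.05$ this comes out to roughly 0.64, so a spurious "significant" feature is more likely than not.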

We'll explore this phenomenon using the data in pmc_data.csv. The data contains $n=200$ observations for a response $y$ and $p=20$ features $X_1, X_2, \ldots, X_{20}$.  Load this data into a Pandas DataFrame. 

In [None]:
dfPMC = pd.read_csv("data/pmc_data.csv")
dfPMC.head()

**Part A**: Let's start by doing some graphical exploration.  To look for a relationship between an individual feature and the response we can make scatter plots.  Try several different features and see whether any of them appears to have a real relationship with the response. 

In [None]:
# Extract the response and a single feature 
y, xk = dfPMC["y"], dfPMC["X1"]

# Make Scatterplot 
fig, ax = plt.subplots(nrows=1,ncols=1,figsize=(10,8))
ax.scatter(xk, y, color="steelblue", s=100, alpha=0.75)
ax.set_xlabel("x",fontsize=16)
ax.set_ylabel("y",fontsize=16)
ax.grid(alpha=0.25)

**Part B**: Did you find any relevant relationships?  Probably not, because there aren't any.  In fact, the response $y$ was generated using the model 

$$
Y = \epsilon \quad \textrm{where} \quad \epsilon \sim N(0,\sigma^2)
$$

OK, but pretend we don't know this.  Let's fit a multiple linear regression model to the data and examine the hypothesis tests for the individual parameters reported by the `summary` method of the fitted statsmodels model. 

In [None]:
# Extract response vector y and feature matrix X from DataFrame
y, X = dfPMC["y"], dfPMC.iloc[:,1:]

# Add constant to X 
X = sm.add_constant(X)

# Fit OLS model 
model = sm.OLS(y, X).fit() 

# Print model summary 
model.summary()

**Part C**: Examine the p-values and 95% confidence intervals associated with each model parameter.  Do the tests indicate that any of the features have a statistically significant relationship with the response? 
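
If you would rather check the p-values programmatically than read them off the table, the fitted statsmodels results object exposes them as `model.pvalues` and the intervals via `model.conf_int()`. The sketch below shows where those numbers come from by computing the individual $t$-tests by hand. It does not use `pmc_data.csv`; instead it simulates a stand-in data set under the same null model ($n=200$, $p=20$, response is pure noise), and the seed and standard-normal features are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the exercise data (assumption: standard-normal
# features, response is pure noise, arbitrary seed).
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Design matrix with an intercept column, then the least-squares fit
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residual variance estimate and standard errors of the coefficients
resid = y - A @ beta_hat
dof = n - (p + 1)
sigma2_hat = resid @ resid / dof
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(A.T @ A)))

# Two-sided t-test of H0: beta_k = 0 for every coefficient
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=dof)

# Indices (1-based) of features whose p-value falls below 0.05
significant = np.flatnonzero(p_values[1:] < 0.05) + 1
print("Features flagged at the 0.05 level:", significant)
```

Any feature that shows up in `significant` here is a false positive by construction, since the simulated response does not depend on the features at all.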

**Part D**: Now, look at the $F$-statistic and associated p-value for the MLR model.  Do these values indicate that at least one of the features has a statistically significant relationship with the response? How can you reconcile the result of **Part C** with this result? 
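
One way to build intuition for the difference between the two approaches is a small Monte Carlo experiment. The sketch below repeatedly simulates data under the null model (response is pure noise), then records how often at least one individual $t$-test rejects at $\alpha = 0.05$ versus how often the full $F$-test rejects. The full $F$-statistic is computed as $F = \frac{(SST - SSE)/p}{SSE/(n-p-1)}$. The function name, seed, number of simulations, and standard-normal features are all assumptions chosen for illustration, not part of the original exercise.

```python
import numpy as np
from scipy import stats

def null_regression_pvalues(rng, n=200, p=20):
    """Simulate y = noise with p irrelevant features; return the p
    individual t-test p-values and the full F-test p-value."""
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

    resid = y - A @ beta_hat
    dof = n - p - 1
    sse = resid @ resid                      # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares

    # Individual t-tests (drop the intercept, keep the p features)
    se = np.sqrt(sse / dof * np.diag(np.linalg.inv(A.T @ A)))
    t_pvals = 2 * stats.t.sf(np.abs(beta_hat / se), df=dof)[1:]

    # Full F-test of H0: beta_1 = ... = beta_p = 0
    f_stat = ((sst - sse) / p) / (sse / dof)
    f_pval = stats.f.sf(f_stat, p, dof)
    return t_pvals, f_pval

rng = np.random.default_rng(23)
n_sims, alpha = 500, 0.05
any_t_rejects, f_rejects = 0, 0
for _ in range(n_sims):
    t_pvals, f_pval = null_regression_pvalues(rng)
    any_t_rejects += np.any(t_pvals < alpha)
    f_rejects += f_pval < alpha

any_t_rate = any_t_rejects / n_sims
f_rate = f_rejects / n_sims
print(f"At least one t-test rejects: {any_t_rate:.3f}")
print(f"Full F-test rejects:         {f_rate:.3f}")
```

The first rate should land well above $\alpha$, while the second should stay close to $\alpha$, which is the behavior the full $F$-test is designed to have.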