# Multiple Linear Regression

In [None]:
import random
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics

from sklearn.datasets import make_regression
from mpl_toolkits.mplot3d import Axes3D

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Make the notebook reproducible.
seed = 42
random.seed(seed)
np.random.seed(seed=seed)

In [None]:
#@title Dataset generation (keep hidden until later)

X, y, coef = make_regression(
    n_samples=100, 
    n_features=2, 
    n_informative=2, 
    noise=30.0, 
    bias=50.0, 
    shuffle=True, 
    coef=True, 
    random_state=seed)

Our model: 

$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 x_2, \ \sigma)$

Show our dataset in a 3D scatterplot.
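One way to draw this scatterplot is sketched below. It regenerates the two-feature dataset with the same `make_regression` settings so the cell runs on its own; the figure size, colormap, and axis labels are illustrative choices, not part of the original notebook.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
from sklearn.datasets import make_regression

# Regenerate the two-feature dataset so this sketch is self-contained.
X, y = make_regression(n_samples=100, n_features=2, n_informative=2,
                       noise=30.0, bias=50.0, random_state=42)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# One point per sample: the two features on the horizontal axes, the target on the vertical axis.
ax.scatter(X[:, 0], X[:, 1], y, c=y, cmap="viridis")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
ax.set_title("Dataset with two features")
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.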

Creating our multiple linear regression (MLR) model with the normal equation, where $X$ includes a leading column of ones for the intercept $\beta_0$:

$\hat \beta = (X^TX)^{-1}X^Ty$
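The normal equation above can be sketched directly in NumPy. The names `X_design`, `beta_hat`, and `beta_lstsq` are illustrative. The sketch solves the linear system instead of forming the inverse explicitly, which is usually the more stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np
from sklearn.datasets import make_regression

# Regenerate the two-feature dataset so this sketch is self-contained.
X, y = make_regression(n_samples=100, n_features=2, n_informative=2,
                       noise=30.0, bias=50.0, random_state=42)

# Prepend a column of ones so that beta_hat[0] plays the role of the intercept beta_0.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve (X^T X) beta = X^T y rather than inverting X^T X.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("beta_hat (normal equation):", np.round(beta_hat, 3))
print("beta_hat (lstsq):          ", np.round(beta_lstsq, 3))
```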

Plot the plane of best fit.
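A possible way to draw the fitted plane is to evaluate $\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$ on a grid and overlay it on the data. The grid resolution (20 by 20), colors, and transparency below are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)
from sklearn.datasets import make_regression

# Regenerate the data and refit so this sketch is self-contained.
X, y = make_regression(n_samples=100, n_features=2, n_informative=2,
                       noise=30.0, bias=50.0, random_state=42)
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Evaluate the fitted plane on a regular grid spanning the observed feature range.
x1_grid, x2_grid = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 20),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 20),
)
y_grid = beta_hat[0] + beta_hat[1] * x1_grid + beta_hat[2] * x2_grid

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], y, color="tab:blue")
ax.plot_surface(x1_grid, x2_grid, y_grid, color="tab:orange", alpha=0.4)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
ax.set_title("Plane of best fit")
```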

In [None]:
#@title Dataset generation (keep hidden until later)

X, y, coef = make_regression(
    n_samples=100, 
    n_features=1, 
    n_informative=1, 
    noise=30.0, 
    bias=25.0, 
    shuffle=True, 
    coef=True, 
    random_state=seed)

dummy = np.random.choice([0.0, 1.0], size=(100, 1))
X = np.hstack((X, dummy))

y[dummy.squeeze() == 1.0] += 100

MLR with categorical variables, where $d \in \{0, 1\}$ is a dummy (indicator) variable:

$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 d, \ \sigma)$ 
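The dummy-variable model can be fitted like any other MLR model, with $d$ as one more column of the design matrix. The sketch below rebuilds a dataset of the same shape; it uses its own random generator (`rng`, an assumption made for self-containment), so the dummy assignments will not match the hidden cell exactly. The estimate for $\beta_2$ should land near the shift of 100 that was added to the $d = 1$ group.

```python
import numpy as np
from sklearn.datasets import make_regression

rng = np.random.default_rng(42)  # local generator, used only in this sketch

# One numerical feature plus a random 0/1 dummy variable.
X, y = make_regression(n_samples=100, n_features=1, n_informative=1,
                       noise=30.0, bias=25.0, random_state=42)
dummy = rng.choice([0.0, 1.0], size=100)

# Samples with d = 1 are shifted upwards by 100.
y = y + 100.0 * dummy

# Design matrix columns: intercept, x1, d.
X_design = np.column_stack([np.ones(len(X)), X[:, 0], dummy])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept (beta_0):", round(beta_hat[0], 2))
print("slope     (beta_1):", round(beta_hat[1], 2))
print("group gap (beta_2):", round(beta_hat[2], 2))
```

Geometrically, the fit is two parallel lines with the same slope $\beta_1$, separated vertically by $\beta_2$.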

Feature ranking and standardization.

In [None]:
#@title Dataset generation (keep hidden until later)

X, y, coef = make_regression(
    n_samples=100, 
    n_features=4, 
    n_informative=4, 
    noise=30.0, 
    bias=30.0, 
    shuffle=True, 
    coef=True, 
    random_state=seed)

x_mu = np.array([100.0, 10.0, 1.0, 0.1])
x_sigma = np.array([50, 3.2, 0.4, 0.025])
X = X * x_sigma + x_mu

df = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4'])
df['y'] = y

MLR model with 4 numerical variables: 

$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4,\ \sigma)$ 

Show the correlations between variables.
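A correlation heatmap is one common way to show this. The sketch below rebuilds the four-feature `df` with the same settings as the generation cell and uses `DataFrame.corr` (Pearson by default) with `seaborn.heatmap`; the color scale and annotation format are illustrative choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression

# Rebuild the four-feature dataset so this sketch is self-contained.
X, y = make_regression(n_samples=100, n_features=4, n_informative=4,
                       noise=30.0, bias=30.0, random_state=42)
x_mu = np.array([100.0, 10.0, 1.0, 0.1])
x_sigma = np.array([50, 3.2, 0.4, 0.025])
X = X * x_sigma + x_mu

df = pd.DataFrame(X, columns=["x1", "x2", "x3", "x4"])
df["y"] = y

# Pairwise Pearson correlations between all features and the target.
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
```

Note that correlations are unaffected by the scale and offset applied to each feature, so rescaling the columns would not change this matrix.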

Create our model and rank the features by importance.
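One way to rank features is to standardize them first, so that each coefficient measures the change in $y$ per standard deviation of its feature, and then sort by absolute coefficient. This is a sketch under that assumption; using `StandardScaler` and `LinearRegression` from scikit-learn is a choice made here, not something the notebook prescribes. The raw-scale coefficients are shown for contrast, since they mostly reflect each feature's units.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Rebuild the four-feature dataset so this sketch is self-contained.
X, y = make_regression(n_samples=100, n_features=4, n_informative=4,
                       noise=30.0, bias=30.0, random_state=42)
X = X * np.array([50, 3.2, 0.4, 0.025]) + np.array([100.0, 10.0, 1.0, 0.1])
features = ["x1", "x2", "x3", "x4"]
df = pd.DataFrame(X, columns=features)
df["y"] = y

# Fit on the raw features: coefficient sizes depend on each feature's scale.
raw_model = LinearRegression().fit(df[features], df["y"])

# Fit on standardized features (zero mean, unit variance): coefficients become comparable.
X_std = StandardScaler().fit_transform(df[features])
std_model = LinearRegression().fit(X_std, df["y"])

ranking = pd.Series(np.abs(std_model.coef_), index=features).sort_values(ascending=False)

print("raw coefficients:         ", dict(zip(features, np.round(raw_model.coef_, 2))))
print("standardized coefficients:", dict(zip(features, np.round(std_model.coef_, 2))))
print("ranking by |standardized coefficient|:")
print(ranking.round(2))
```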

Diagnosing multicollinearity in our MLR models

In [None]:
#@title Dataset generation (keep hidden until later)

X, y, coef = make_regression(
    n_samples=1000, 
    n_features=2, 
    n_informative=2, 
    noise=20.0, 
    bias=50.0, 
    shuffle=True, 
    coef=True, 
    random_state=seed)

x3 = X[:, 0] + X[:, 1] + np.random.normal(0, 0.05, (1000,))
X = np.hstack((X, x3[..., np.newaxis]))

df = pd.DataFrame(X, columns=['x1', 'x2', 'x3'])
df['y'] = y

MLR model with 3 numerical variables: 

$y \sim \mathcal{N}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,\ \sigma)$
<br><br>
$x_3$ presents multicollinearity of the form:

$x_{3i} = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \delta_i$

Compute the $\beta$ parameters for 10 subsets of our data separately.
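The subset experiment might look like the sketch below: split the 1000 samples into 10 chunks of 100, fit each chunk separately, and compare the estimates. The dataset is rebuilt with a local generator (`rng`, an assumption for self-containment). Because $x_3 \approx x_1 + x_2$, the individual coefficients are expected to swing widely between subsets, while the combination $\beta_1 + \beta_3$ (the total effect of $x_1$) stays comparatively stable.

```python
import numpy as np
from sklearn.datasets import make_regression

rng = np.random.default_rng(42)  # local generator, used only in this sketch

# Rebuild the collinear dataset: x3 is almost exactly x1 + x2.
X, y = make_regression(n_samples=1000, n_features=2, n_informative=2,
                       noise=20.0, bias=50.0, random_state=42)
x3 = X[:, 0] + X[:, 1] + rng.normal(0, 0.05, size=1000)
X_design = np.column_stack([np.ones(len(X)), X, x3])

# Fit one model per subset of 100 samples (the data are already shuffled).
chunks = np.array_split(np.arange(len(X)), 10)
betas = np.array([
    np.linalg.lstsq(X_design[idx], y[idx], rcond=None)[0]
    for idx in chunks
])

print("estimates per subset (beta_0, beta_1, beta_2, beta_3):")
print(np.round(betas, 1))
print("std across subsets:  ", np.round(betas.std(axis=0), 2))
print("std of beta_1+beta_3:", round((betas[:, 1] + betas[:, 3]).std(), 2))
```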

Treating multicollinearity in our MLR models

Compute the $\beta$ parameters for 10 subsets of our data separately.
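As one possible treatment (an assumption here; regularization such as ridge regression, or combining the correlated features, are alternatives), the sketch below drops the redundant feature $x_3$ and repeats the 10-subset experiment. The dataset is rebuilt with a local generator (`rng`) for self-containment. With $x_3$ removed, the estimates for $\beta_1$ and $\beta_2$ should vary far less from subset to subset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

rng = np.random.default_rng(42)  # local generator, used only in this sketch

# Rebuild the collinear dataset: x3 is almost exactly x1 + x2.
X, y = make_regression(n_samples=1000, n_features=2, n_informative=2,
                       noise=20.0, bias=50.0, random_state=42)
x3 = X[:, 0] + X[:, 1] + rng.normal(0, 0.05, size=1000)
df = pd.DataFrame(np.column_stack([X, x3]), columns=["x1", "x2", "x3"])

# Treatment used in this sketch: drop the redundant feature x3.
df_treated = df.drop(columns=["x3"])
X_design = np.column_stack([np.ones(len(df_treated)), df_treated.values])

# Refit on the same 10 subsets of 100 samples.
chunks = np.array_split(np.arange(len(df_treated)), 10)
betas = np.array([
    np.linalg.lstsq(X_design[idx], y[idx], rcond=None)[0]
    for idx in chunks
])

print("estimates per subset (beta_0, beta_1, beta_2):")
print(np.round(betas, 1))
print("std across subsets:", np.round(betas.std(axis=0), 2))
```

The price of this treatment is interpretive: the coefficients of $x_1$ and $x_2$ now absorb whatever effect was previously attributed to $x_3$.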