# Multiple Linear Regression - Raw Features

## Objectives

- Conduct multiple linear regressions in `statsmodels`
- Use standard scaling for linear regression for better interpretation

Bonus: comparing statsmodels and sklearn

## Regression with Multiple Predictors

> It's all a bunch of dials

<img width='450px' src='images/dials.png'/>

The main idea here is pretty simple. Whereas in simple linear regression we took our dependent variable to be a function of a single independent variable, here we'll take the dependent variable to be a function of multiple independent variables.

## Expanding Simple Linear Regression

Our regression equation, then, instead of looking like $\hat{y} = mx + b$, will now look like:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + ... + \hat{\beta}_nx_n$.

Remember that the hats ( $\hat{}$ ) indicate parameters that are estimated.
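
As a small numeric illustration of the equation above, here is a sketch that evaluates $\hat{y}$ for a few rows at once. The parameter values and observations are made up for the example; they are not estimated from any dataset in this lesson.

```python
import numpy as np

# Hypothetical estimated parameters: intercept (beta_0) and two slopes
beta_0 = 2.0
betas = np.array([0.5, -1.5])

# Three made-up observations: one row per observation, one column per feature
X_demo = np.array([[1.0, 2.0],
                   [3.0, 0.0],
                   [0.0, 4.0]])

# y_hat = beta_0 + beta_1 * x_1 + beta_2 * x_2, computed for every row at once
y_hat = beta_0 + X_demo @ betas
print(y_hat)
```

Each prediction is just the intercept plus a weighted sum of that row's feature values, which is why adding more predictors only adds more terms to the sum.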

Is this still a best-fit *line*? Well, no. What does the graph of, say, $z = x + y$ look like? [Here's](https://academo.org/demos/3d-surface-plotter/) a 3D plotter. (Of course, once we get beyond two input variables it's going to be very hard to visualize. But in practice linear regressions can make use of dozens or even hundreds of independent variables!)

## Confounding Variables

Suppose I have a simple linear regression that models the growth of corn plants as a function of the average temperature of the ambient air. And suppose there is a noticeable positive correlation between temperature and plant height.

In [None]:
# Imports!
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

In [None]:
corn = pd.read_csv('data/corn.csv', index_col=0)

In [None]:
corn.head()

In [None]:
sns.lmplot(data=corn, x='temp', y='height')
plt.xlabel(r'Temperature ($\degree$ F)')  # raw string so \d isn't read as an escape
plt.ylabel('Height (cm)')
plt.title('Corn plant height as a function of temperature');

It seems that higher temperatures lead to taller corn plants. But it's hard to know for sure. One **confounding variable** might be *humidity*. If we haven't controlled for humidity, then it's difficult to draw conclusions.

In [None]:
sns.lmplot(data=corn, x='humid', y='height')
plt.xlabel('Humidity (%)')
plt.ylabel('Height (cm)')
plt.title('Corn plant height as a function of humidity');

One solution is to use **both features** in a single model.

In [None]:
ax = plt.figure(figsize=(8, 6)).add_subplot(111, projection='3d')
ax.scatter(corn['temp'], corn['humid'], corn['height'],
           depthshade=True, s=40, color='#ff0000')
# create x,y
xx, yy = np.meshgrid(corn['temp'], corn['humid'])

# multiple linear regression model with both inputs
results = sm.OLS(corn['height'], sm.add_constant(corn[['temp', 'humid']])).fit()
# calculate corresponding z using parameters from the above model
z = results.params['temp'] * xx + results.params['humid'] * yy + results.params['const']

# plot the surface
ax.plot_surface(xx, yy, z, alpha=0.01, color='#00ff00')

ax.view_init(30, azim=240)
ax.set_xlabel(r'Temperature ($\degree$ F)')  # raw string so \d isn't read as an escape
ax.set_ylabel('Humidity (%)')
ax.set_zlabel('Height (cm)')
plt.title('Corn plant height as a function of both temperature and humidity');

## Multiple Regression in `statsmodels` - Let's Practice!

### Diamonds Dataset

Our goal is to predict the sale price of diamonds. First, let's look at the data:

In [None]:
# Only loading in our numerical features - we'll learn about categorical features later
data = sns.load_dataset('diamonds').drop(['cut', 'color', 'clarity'], axis=1)

In [None]:
data.head()

In [None]:
data.describe()

#### Model-Less Baseline

Without modeling, what is a simple way we could predict the sale price of diamonds?

- 


In [None]:
# Code here to do that!
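
One possible baseline, sketched here as an assumption rather than the intended answer, is to predict the mean of the target for every row. The example uses a synthetic `price` array (not the real diamonds data) so that it runs on its own.

```python
import numpy as np

# Synthetic stand-in for a price column (NOT the real diamonds data)
rng = np.random.default_rng(42)
price = rng.gamma(shape=2.0, scale=2000.0, size=500)

# The model-less baseline predicts the same value, the mean, for every row
baseline_preds = np.full_like(price, price.mean())

# R^2 = 1 - SS_res / SS_tot; for the mean-only baseline SS_res equals SS_tot
ss_res = np.sum((price - baseline_preds) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
baseline_r2 = 1 - ss_res / ss_tot

# The baseline's root mean squared error equals the standard deviation of price
baseline_rmse = np.sqrt(np.mean((price - baseline_preds) ** 2))
print(baseline_r2, baseline_rmse)
```

A mean-only prediction has an $R^2$ of zero by construction, which gives any fitted model a clear bar to beat.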

#### Kitchen Sink Approach

One valid way to approach a regression problem like this is to just throw everything into the regression model and see how it does compared to a model-less baseline (what I call the Kitchen Sink approach). We know this will likely violate some linear regression assumptions, but it's often easier to start here and then iterate to improve!

> You can contrast this against the approach of starting from a single variable and adding more in one by one

In [None]:
# Grab our X and y variables
X = None
y = None

In [None]:
# Create and fit our model
# Don't forget to add a constant!


In [None]:
# Check out our results


#### Evaluate

How'd we do?

- 


#### Another way to evaluate: Statistically Significant Models

A quick note - we discussed some of the pieces of the statsmodels output yesterday, but I want to highlight the F-statistic (and its related p-value). This F-test measures the significance of your model relative to an intercept-only model, i.e. one in which all of the predictors' coefficients are 0, which says there is no linear relationship whatever between the predictors and the target.

Is our model statistically significant, at $\alpha = .05$ ?

- 


### Now What?

Let's brainstorm: what would be a good next step if we wanted to do one thing to improve our model?

- 


In [None]:
# Code here to do that next step

# Scaling - The Missing & Helpful Step

When you looked at the summary after we did the linear regression, you might have noticed something interesting.

Observing the coefficients, you might notice there are two relatively large coefficients, and then two others smaller than 100:

In [None]:
# May need to change this variable if you didn't name your fit model 'model'
model.params

And if we go back and describe our X variables, you can check out each column's min and max values, and see that they're all on different scales:

In [None]:
X.describe()

## What's Going on Here?

In short, it's useful to have all of our variables on the same scale, so that the resulting coefficients are easier to interpret. If the scales of the variables differ greatly from one another, then some of the coefficients may end up very large or very tiny.

This happens since the coefficients will effectively attempt to "shrink" or "expand" the features before factoring their importance to the model.

This can make the model harder to interpret, and make it harder to identify the coefficients with the most "effect" on the prediction.

For more on this, see [this post](https://stats.stackexchange.com/questions/32649/some-of-my-predictors-are-on-very-different-scales-do-i-need-to-transform-them).
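
The "shrink or expand" effect can be seen directly by fitting the same feature in two different units. This sketch uses plain NumPy least squares on synthetic data; the feature (a weight in grams versus kilograms) and the helper `fit_slope` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
grams = rng.uniform(100, 1000, size=n)   # a made-up feature measured in grams
y_demo = 5.0 + 0.01 * grams + rng.normal(scale=0.5, size=n)


def fit_slope(x, y):
    """Return the OLS slope for a single feature plus an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]


slope_grams = fit_slope(grams, y_demo)
slope_kg = fit_slope(grams / 1000, y_demo)   # same feature, now in kilograms

print(slope_grams, slope_kg)
```

The two fits make identical predictions, but the coefficient is 1000 times larger when the feature is expressed in kilograms. The size of a raw coefficient therefore says as much about the feature's units as about its importance.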

## A Solution: Standard Scaling

One solution is to *scale* our features. There are a few ways to do this but we'll focus on **standard scaling**.

When we do **standard scaling**, we replace each feature's values with their $z$-scores: subtract the feature's mean, then divide by its standard deviation.

Benefits:

- This tends to make values relatively small: each scaled feature has mean $0$, and a value one standard deviation $\sigma$ away from the mean becomes $1$.
- Easier interpretation: coefficients with larger absolute values tend to be more influential

Let's model our data again, but let's *scale* our columns as $z$-scores first. 
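
As a quick check of what "scaling to $z$-scores" means, this sketch computes them by hand and compares against scikit-learn's `StandardScaler`. The two-column array `X_demo` is synthetic and only stands in for real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales
rng = np.random.default_rng(3)
X_demo = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(300, 2))

# By hand: subtract each column's mean, divide by its standard deviation
# (np.std defaults to the population formula, which StandardScaler also uses)
z_by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(z_by_hand, z_sklearn))
print(z_sklearn.mean(axis=0).round(6), z_sklearn.std(axis=0).round(6))
```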

##  Redoing with Standard Scaling

Let's try standard scaling the features of our diamonds dataset now.

In [None]:
# First need to import the StandardScaler from sklearn


In [None]:
# Need to instantiate our scaler

# Then we can fit it

# And transform our X values to scaled X values
X_scaled = None

In [None]:
# Check it out
pd.DataFrame(X_scaled, columns=X.columns).describe()

In [None]:
# Now let's model (don't forget to add a constant!)


In [None]:
# Check our results


#### Evaluate

Compare how well this model did with the one before scaling. Does it perform any differently?

- 


## Other Scalers in SKLearn

#### [Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

The most common method of scaling is standardization. In this method we center the data, then divide by the standard deviation so that the scaled variable has a standard deviation of one.

#### [MinMax Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

From the documentation:

> This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

#### [Robust Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)

From the documentation:

> Scale features using statistics that are robust to outliers.
>
> This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

In other words: like a standard scaler, but using the median and IQR instead of the mean and standard deviation.
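
The formulas behind the min-max and robust scalers can be checked by hand. This sketch uses a tiny made-up column with one outlier and assumes the default settings of both scalers (a target range of 0 to 1, and the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# A made-up single column with one obvious outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

# Min-max: shift so the minimum is 0, then divide by the range
minmax_by_hand = (x - x.min()) / (x.max() - x.min())

# Robust: subtract the median, then divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust_by_hand = (x - median) / (q3 - q1)

print(np.allclose(minmax_by_hand, MinMaxScaler().fit_transform(x)))
print(np.allclose(robust_by_hand, RobustScaler().fit_transform(x)))
print(minmax_by_hand.ravel().round(3))
print(robust_by_hand.ravel())
```

With min-max scaling the outlier squeezes the other four values into a narrow band near 0, while the robust scaler keeps them evenly spread around 0 and leaves the outlier far away.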

In [None]:
# Importing the other options so we can check out the differences between them
# (StandardScaler is included again in case it wasn't imported above)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

In [None]:
# Instantiating our different scalers
stdscaler = StandardScaler()
minmaxscaler = MinMaxScaler()
robscaler = RobustScaler()

# Creating scaled versions of one column
X_scaled_std = stdscaler.fit_transform(X['carat'].values.reshape(-1, 1))
X_scaled_mm = minmaxscaler.fit_transform(X['carat'].values.reshape(-1, 1))
X_scaled_rob = robscaler.fit_transform(X['carat'].values.reshape(-1, 1))
# why fit_transform? We'll discuss in a second

# defining a dictionary of these things to better visualize
scalers = {'Original': X['carat'].values, 
           'Standard Scaler': X_scaled_std, 
           'Min Max Scaler': X_scaled_mm,
           'Robust Scaler': X_scaled_rob}
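
On the `fit_transform` question raised in the comment above: `fit` learns the scaling statistics from the data, `transform` applies them, and `fit_transform` does both in one call. The sketch below shows the equivalence on a synthetic column and reuses the fitted scaler on two hypothetical new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single column
rng = np.random.default_rng(4)
col = rng.normal(loc=10.0, scale=2.0, size=(100, 1))

# One step: learn the mean and standard deviation, then apply them
one_step = StandardScaler().fit_transform(col)

# Two steps: the same thing, spelled out
scaler = StandardScaler()
scaler.fit(col)                      # stores mean_ and scale_
two_steps = scaler.transform(col)    # applies the stored statistics

# A fitted scaler can be reused on rows it has not seen (hypothetical values)
new_rows = np.array([[10.0], [14.0]])
new_scaled = scaler.transform(new_rows)

print(np.allclose(one_step, two_steps))
print(new_scaled.ravel())
```

Keeping the fitted scaler around matters whenever new data must be put on the same scale as the data the model was fit on.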

In [None]:
# visualize it!
# (the loop variable is named `values` so it doesn't overwrite the `data` DataFrame)
for title, values in scalers.items():
    plt.hist(values, bins=20)
    plt.title(title)
    plt.show()

### Discuss:

What differences do you see between these?

- 


### Recap: Why do we need to use feature scaling?

- To compare the magnitudes of coefficients, which makes the coefficients easier to interpret
- To handle disparities in units
- Some models use Euclidean distance in their computations
- Some models require features to be on equivalent scales
- In machine learning, scaling can improve model performance and keep values (and fitted models) from varying widely
- Some algorithms are sensitive to the scale of the data

## Modeling Libraries: Statsmodels vs. Scikit-Learn

## Statsmodels' `OLS`

Aka y vs X version - what we've been doing so far

In [None]:
# Import
import statsmodels.api as sm

In [None]:
# Now we'll use our X_scaled and y!
# Note the add_constant
model_OLS = sm.OLS(endog=y, exog=sm.add_constant(X_scaled)).fit()

In [None]:
# Check your results!
model_OLS.summary()

## And Now - SKLearn!

Aka the no-summary version

In [None]:
# Import
from sklearn.linear_model import LinearRegression

In [None]:
# Instantiate our model
model_sk = LinearRegression()

In [None]:
# Fit our model - X FIRST THEN Y!  <<<
model_sk.fit(X_scaled, y)

In [None]:
# Get our R2 score
model_sk.score(X_scaled, y)

In [None]:
# Can also use r2_score from sklearn.metrics:
from sklearn.metrics import r2_score

y_preds = model_sk.predict(X_scaled)

r2_score(y, y_preds)

In [None]:
# Check our coefficients
model_sk.coef_

In [None]:
# Add the column names to look at
dict(zip(X.columns, model_sk.coef_))

#### So What?

Feel free to use either implementation of Ordinary Least Squares regression on your projects - but please follow instructions on checkpoints and code challenges!

StatsModels has a great summary to help us get a feel for our models. BUT SKLearn is a much more robust library of machine learning models, and is considered the industry standard. 