# Simple and Multiple Linear Regression


AKA - Welcome to statistical modeling! Could also say - welcome to **Supervised Machine Learning**.

What do I mean by 'Supervised' ?

![Types of machine learning, broken down](images/machinelearning_supervisedunsupervised.png)
 
[Image Source](https://fr.mathworks.com/help/stats/machine-learning-in-matlab.html)

## Today's Goals:

- Recognize the importance of model validation
- Implement a train-test split
- Evaluate a simple linear regression model
- Add additional variables to create a multiple linear regression model
- Scale features appropriately, based on training data, for a multiple linear regression model

Bonus, if we have time: other useful metrics for regression models!

### First: Set Up

In [None]:
# Basic imports
import numpy as np
import pandas as pd
# Data visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Modeling!
from statsmodels.formula.api import ols

Credit data from https://www.kaggle.com/avikpaul4u/credit-card-balance

Target: `Balance`

In [None]:
# Data
df = pd.read_csv('data/Credit.csv', 
                 usecols=['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Balance'])

In [None]:
df.head()

In [None]:
df.describe()

## Model Validation - AKA How to Build Generalizable Models

![validation gif from giphy](https://media.giphy.com/media/242wLqQerWkxd6GgHB/giphy.gif)

### The Bias-Variance Trade Off

<img alt="original image from https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/" src="images/underfit-goodfit-overfit.png" width="750" height="350">

## How To Minimize Bias and Variance

### Combat Underfitting (Bias)

**Bias**: Error introduced by approximating a real-life problem (which may be extremely complicated) by a much simpler model (because the model is too simple to capture the underlying pattern)

**The Solution:** evaluate the performance of our models, using a scoring metric, which will help us catch if a model is underfit - if it's performing quite poorly, it probably isn't capturing the relationship in our data! 

### Combat Overfitting (Variance)

**Variance**: Amount by which our model would change if we estimated it using a different training dataset (because the model is over-learning from the training data)

**The Solution:** don't train your model on ALL of your data, but keep some of it in reserve to test on, in order to simulate how it will work on new/incoming data.


<img alt="original image from https://www.dataquest.io/wp-content/uploads/kaggle_train_test_split.svg plus some added commentary" src="images/traintestsplit_80-20.png" width="650" height="150">

How does this fight against overfitting? By withholding data from the training process, we are testing whether the model actually _generalizes_ well. If it scores well on the train set but does poorly on the test set, that's a strong hint that our model learned too much noise from the train set and is overfit!
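To make the idea concrete, here is a minimal sketch of an 80/20 split on made-up data. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are placeholders for illustration only, and the data is synthetic rather than the Credit dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Credit dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series(3 * X_demo["feature"] + rng.normal(size=100), name="target")

# Hold out 20% of the rows; random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 1) (20, 1)
```

Every row lands in exactly one of the two sets, so the model never sees the held-out rows during training.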

![arrested development gif, found by Andy](https://heavy.com/wp-content/uploads/2013/05/tumblr_mjm9fqhrle1rvnnvyo6_250.gif)

#### Practice:

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
# Importing the train_test_split function from sklearn


In [None]:
# Need to define our X and y
X = None
y = None

In [None]:
print(X.shape)
X.head()

In [None]:
print(y.shape)
y.head()

In [None]:
# Train test split here!
# Set test_size = .33
# Set random_state = 42


What did that do?

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Row counts of the two pieces should add up to the original
len(X_train) + len(X_test) == len(X)

In [None]:
X_train.head()

**YOU SHOULD ALWAYS START WITH A TRAIN TEST SPLIT.**

**FOR THE PROJECT, YOU WILL BE _REQUIRED_ TO WORK WITH A TRAIN TEST SPLIT**

Note - for the checkpoints and code challenge, follow the instructions given - they might not require a train/test split as they attempt to keep things simple.

BUT we're going to use it in this session! Let's see what this looks like in practice!

In [None]:
# For statsmodels, we'll create a train_df and test_df
train_df = pd.concat([X_train, y_train], axis=1)

test_df = pd.concat([X_test, y_test], axis=1)

In [None]:
# We'll use the training data to make any modeling decisions!
train_df.head()

## Simple Linear Regression

Let's start off with one variable - which should we choose?

In [None]:
# Code to evaluate our options


Our choice:

- 


### Time to Model!

In [None]:
# Set up your formula


In [None]:
# Set up and fit your model


In [None]:
# Check your results!


### Evaluate!

- 


### Now what?

We have a trained model... what can we do with it?

In [None]:
# Get our predictions!


In [None]:
# Just looking at two variables... we can visualize this!

# Plot our points as a scatterplot

# Plot the line of best fit!

# plt.ylabel('')
# plt.xlabel('')
# plt.title('')
# plt.show()

In [None]:
# Compare to our actual train values...


We can score our models without relying on the statsmodels output!

https://scikit-learn.org/stable/modules/model_evaluation.html
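As a quick illustration of what an R2 score measures, here is a small sketch on toy numbers (the values are made up). It computes R2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares, then checks it against sklearn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (hypothetical), just to show the calculation
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)          # 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20.0
r2_manual = 1 - ss_res / ss_tot

# Argument order matters: y_true first, then y_pred
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

Both values should agree, which is why the sklearn score on the training data should match the R-squared in the statsmodels summary.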

In [None]:
# Can use sklearn to score our model, too:
from sklearn.metrics import r2_score

# Score our training data - the same score as from our summary!
# This function requires two inputs: y_true and y_pred:


In [None]:
# Can now predict for our test set!


In [None]:
# Score our testing data


### Evaluate:

- 


In [None]:
# One last thing - can visualize both train and test set!

# Plot our training data

# Plot our testing data


# Plot the line of best fit

# Plotting for the test data just to show it's the same!

# plt.ylabel('')
# plt.xlabel('')
# plt.title('')
# plt.legend()
# plt.show()

## Multiple Linear Regression

Same as simple linear regression, but with more inputs!
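Here is a rough sketch of the pattern on synthetic data, assuming the same statsmodels `ols` formula interface imported in the setup. The column names `x1`, `x2`, and `y`, and the variable names `demo`, `predictors`, `demo_formula`, and `demo_model`, are invented for this illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (hypothetical columns x1, x2, y)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
demo["y"] = 2 * demo["x1"] - 1 * demo["x2"] + rng.normal(scale=0.1, size=200)

# Build the string "y ~ x1 + x2" from a list of predictor names
predictors = ["x1", "x2"]
demo_formula = "y ~ " + " + ".join(predictors)

# Fit the model: one coefficient per predictor, plus an intercept
demo_model = ols(demo_formula, data=demo).fit()

print(demo_formula)
print(demo_model.params)
```

Because the data was generated with slopes of 2 and -1, the fitted coefficients should land close to those values.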

In [None]:
# Define our formula
# First, define a list of X columns...
X_cols = None

In [None]:
# ... Then use that list to make our formula, using join
formula = None

In [None]:
# Set up and fit your model


In [None]:
# Check your results!


#### Observation time!

How'd we do? What looks different from the simple linear regression output? What in the world can we do with those coefficients?

- 


## Standardization, AKA Feature Scaling and Centering

Scaling data is the process of **increasing or decreasing the magnitude according to a fixed ratio.** You change the size but not the shape of the data. Often, this involves dividing features by their standard deviation.

Centering also does not change the shape of the data, but instead simply **removes the mean value of each feature** so that each feature is centered around zero instead of its original mean.

The idea is that you can standardize data so that a model can interpret each individual feature more consistently.
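A small sketch of both steps on a toy column (the ages are made up): center by subtracting the mean, scale by dividing by the standard deviation, then compare against sklearn's `StandardScaler`. One assumption worth flagging: `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of ages (hypothetical values), shaped as a single-column 2D array
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

# Center (subtract the mean), then scale (divide by the standard deviation)
manual = (ages - ages.mean()) / ages.std()  # np.std defaults to ddof=0

# StandardScaler applies the same two steps, column by column
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))  # True
```

After scaling, the column has a mean of roughly 0 and a standard deviation of roughly 1, but the shape of its distribution is unchanged.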

Documentation: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler

We've seen one of these options before - when else did we remove the mean and then divide by the standard deviation?

- 


In [None]:
# Import the relevant scaler


In [None]:
# Instantiating our scaler
stdscaler = None

# Creating scaled versions of one column
age_scaled = stdscaler.fit_transform(train_df[['Age']]) # pass Age col as a df
# Why fit_transform? We'll discuss in a second

In [None]:
# Visualize them!
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10,4))
# Original data
ax1.hist(train_df['Age'], bins=20)
ax1.set_title("Original Age Data")
# Scaled data
ax2.hist(age_scaled, bins=20)
ax2.set_title("Standard Scaled Age Data")

plt.show()

### What Changed?

- 


### Why do we need to use feature scaling?

- Comparing the magnitude of coefficients, which makes the coefficients easier to interpret
- Handling disparities in units
- Some models use Euclidean distance in their computations
- Some models require features to be on equivalent scales
- In the machine learning space, it can improve model performance and keep feature values from varying widely
- Some algorithms are sensitive to the scale of the data

### `fit_transform` ?

Note above we used `fit_transform` - why is that? **Consistency!**

`fit` allows the scaler to learn the patterns _only from the training data_! 

Why does that matter?

- mean and standard deviation are affected by the data in the rows - so, if we fit to the whole dataset instead of just the training data, the scaler would learn patterns influenced by the test data!

`transform` allows the scaler to apply the pattern it learns - can do so on both the train and test sets, so long as they're equivalent (aka have the same columns and were prepared the same way!)

**ALWAYS FIT ON TRAINING DATA, THEN TRANSFORM THE TRAIN AND TEST SETS!**
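A minimal sketch of that rule on toy arrays (the values and the names `X_tr`, `X_te`, and `scaler` are made up for illustration). The scaler learns its mean and standard deviation from the training rows only, then reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test arrays (hypothetical values)
X_tr = np.array([[10.0], [20.0], [30.0]])
X_te = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(X_tr)                       # learns mean and std from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(X_te_scaled.ravel())
```

Note that the scaled test set is not centered at zero. That is expected, since it was shifted by the training mean (20) rather than its own mean (30).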

### Let's Try It

In [None]:
# Instantiate a new scaler

# Learn the pattern from the training data

# Apply the pattern to the training and testing data
X_train_scaled = None
X_test_scaled = None

In [None]:
# What is this object?
X_train_scaled

In [None]:
# What's that look like as a dataframe?
X_train_scaled = pd.DataFrame(X_train_scaled,
                              columns=X_train.columns,
                              index=X_train.index)

X_train_scaled.head()

In [None]:
# Let's make train_df_scaled, which includes the target variable, for statsmodels
train_df_scaled = pd.concat([X_train_scaled, y_train], axis=1)

In [None]:
# Our formula stays the same
formula

In [None]:
# Set up and fit your model


In [None]:
# Check your results!


### Evaluate:

What changed?

- 


### Also - how do we interpret these coefficients, or those p-values in the summary?

Discuss:

- 


### Next Step?

What could we do next to improve this model?

- 


In [None]:
# Check out a potential issue in our model?


An important note to keep in mind from now on:

!["all models are wrong but some are useful" quote picture](images/allmodelsarewrong.jpg)

[Image Source](https://twitter.com/cwodtke/status/1244433603666178049)

### Additional Resources:

- [Excellent statistical writeup about how to interpret Linear Regression coefficients, and their p-values](https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/)
- [Great bias/variance infographic](https://elitedatascience.com/bias-variance-tradeoff) from Elite Data Science

## Extra Credit: Beyond the $R^2$ Score

There are other metrics! 

#### Mean Absolute Error (MAE)

$$\text{MAE}(y, y_\text{pred}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - y_{\text{pred},i} \right|$$

- Measures the average magnitude of errors regardless of direction, by calculating the sum of the absolute errors and dividing by the number of samples (number of predictions made)
- **This error term is in the same units as the target!**

#### Mean Squared Error (MSE)

$$\text{MSE}(y, y_\text{pred}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_{\text{pred},i})^2$$

- Measures the average squared error, by calculating the sum of squared errors for all predictions then dividing by the number of samples (number of predictions)
- In other words - this is the Residual Sum of Squares (RSS) divided by the number of predictions!
- This error term is **NOT** in the same units as the target!

#### Root Mean Squared Error (RMSE)

$$\text{RMSE}(y, y_\text{pred}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y_{\text{pred},i})^2}$$

- Measures the square root of the average squared error, by calculating the sum of squared errors for all predictions then dividing by the number of samples (number of predictions), then taking the square root of all that
- **This error term is in the same units as the target!**

Note - before, we were _maximizing_ R2 (best fit = largest R2 score). But we'd want to minimize these other error metrics.
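To see how the three formulas relate, here is a small sketch that computes each one by hand with NumPy on toy numbers (the values are made up) and checks the results against sklearn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets and predictions (hypothetical values)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred            # [-10, 10, -30, 0]

mae = np.mean(np.abs(errors))       # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)          # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                 # square root puts us back in target units

print(mae, mse, rmse)
print(mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred))
```

MAE and RMSE come out in the same units as the target, while MSE is in squared units.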

Documentation: 
- [Regression Metrics in sklearn](https://scikit-learn.org/stable/modules/classes.html#regression-metrics)
- [User Guide for Regression Metrics in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
# Grab training set predictions
train_preds = model_scaled.predict(X_train_scaled)
train_preds

In [None]:
print("Metrics:")
# R2
print(f"R2: {r2_score(y_train, train_preds):.3f}")
# MAE
print(f"Mean Absolute Error: {mean_absolute_error(y_train, train_preds):.3f}")
# MSE
print(f"Mean Squared Error: {mean_squared_error(y_train, train_preds):.3f}")
# RMSE - just the square root of MSE
# (np.sqrt works across sklearn versions; the squared=False argument is deprecated in recent releases)
print(f"Root Mean Squared Error: {np.sqrt(mean_squared_error(y_train, train_preds)):.3f}")

Note that I said that MAE and RMSE are both in the same units as our target, but you'll see that they are different here. What's the difference?

> "Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable."

-- Source: ["MAE and RMSE — Which Metric is Better?"](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d)
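A tiny sketch of that point, using two made-up sets of errors with the same total size: one spread evenly, one concentrated in a single large miss. The helper names `mae` and `rmse` are defined here for illustration only.

```python
import numpy as np

def mae(errors):
    """Mean absolute error, given an array of errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error, given an array of errors."""
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical errors: same total miss (40), distributed differently
errors_even = np.array([10.0, 10.0, 10.0, 10.0])
errors_spiky = np.array([0.0, 0.0, 0.0, 40.0])

print(mae(errors_even), rmse(errors_even))    # 10.0 10.0
print(mae(errors_spiky), rmse(errors_spiky))  # 10.0 20.0
```

MAE treats the two cases identically, while RMSE doubles for the case with one big miss. That gap is what the quote means by giving "a relatively high weight to large errors".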

How can we interpret these?

- R2: "Our model accounts for 61.2% of the variance in our target"
- MAE/RMSE: "Our model's predictions are, on average, about __ off from our actual target values" (here, balance is likely in dollars - so $___ off)