# Topic 19: Multiple Linear Regression

- onl01-dtsc-ft-022221
- 04/08/21

## Resources:

- **[OSEMN Data Science Workflow Notebook](https://github.com/jirvingphd/fsds-osemn-workflow)**
    - `student_OSEMN.ipynb`: also included in notes repo

## LEARNING OBJECTIVES

- Learn how to expand our last lesson to include multiple independent variables.
- Learn ways to deal with categorical variables.
- Learn about multicollinearity of features
- Learn about how to improve a baseline model based on results
- Learn how to run a multiple regression using statsmodels

<!-- ### TOPICS:

#### Part 1 
- Multiple Linear Regression
- Dealing with Categorical Variables
- Multicollinearity of Features
- Multiple Linear Regression in Statsmodels

#### Part 2
- Feature Scaling & Normalization
- Model Fit and Validation/Cross Validation -->

## Questions?



- 

# Revisiting Our Simple Linear Regression Modeling with Movies

### PREVIOUSLY ON... Topic 18

- We discussed the assumptions for a linear regression:
    - Linear relationship between predictor and target variable.
    - Residuals (error terms) are normally distributed.
    - Homoskedasticity (variance of the residuals is constant).
    
- We learned how to run a simple regression in statsmodels.

## Imports & Loading Data

In [None]:
## Importing our study group functions
%load_ext autoreload
%autoreload 2
import sys
    
py_folder = "../../py_files/" # CHANGE TO REFLECT YOUR NOTEBOOK'S LOCATION RELATIVE TO THE PY_FILES FOLDER
sys.path.append(py_folder)
import functions_SG as sg

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf

from scipy import stats

In [None]:
## NOTE: on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-notebook'
plt.style.use('seaborn-notebook')
# plt.rcParams['figure.figsize'] = [10,6]
pd.set_option('display.float_format', lambda x: f"{x:,}")
pd.set_option('display.max_columns',0)

In [None]:
### NEW MOVIE DATASET

def load_movie_data(verbose=True,include_genre=False):
    ## Thanks to Johnny Dryma for letting us use his data
    movie_data_url = "https://raw.githubusercontent.com/Drymander/dsc-phase-1-project/master/data/2012-2019%20FULL.csv"
    dfm = pd.read_csv(movie_data_url,index_col=0,parse_dates=['release_date'])

    ## List of cols that need processing before use
    # cols_need_processing=['genres','production_companies',
    #                       'belongs_to_collection']

    ## Save only the columns of interest
    df = dfm[['id','imdb_id','original_title','title','genres','mpaa_rating',
         'release_date','runtime','budget','revenue',
         'vote_count','vote_average','popularity','adult','original_language']].copy()

    ## Keep only movies with financial data
    df=df[(df['budget']>0) & (df['revenue']>0)]

    if include_genre==True:
        df['genre_list'] = df['genres'].map(lambda x: eval(x))
        df['genre_list'] = df['genre_list'].map(lambda x: [g['name'] for g in x])
    else:
        ## Dropping genres instead
        df.drop(columns=['genres'],inplace=True)
        
    ## Feature Engineering
    # df['profit'] = df['revenue'] - df['budget']
    # df['ROI'] = df['profit']/df['budget']

    ## Removing Extreme values for class purposes
    # df=df[df['ROI']<1000]

    ## Drop nulls & reset index
    df.dropna(inplace=True)
    df.set_index('id',inplace=True)

    if verbose:
        df.info()  # prints directly and returns None, so it is not passed to display()
        display(df.head())
    return df

df = load_movie_data()

## Simple Linear Regression

In [None]:
## Scatter Plots for Linearity Check
def plot_data(X='budget',y='revenue',data=df,fit_reg=False):
    priceFmt = mpl.ticker.StrMethodFormatter("${x:,.0f}")
    ax = sns.regplot(x=X,y=y,data=data,fit_reg=fit_reg)
    ax.yaxis.set_major_formatter(priceFmt)
    fig=ax.get_figure()
    return fig,ax

>- Use one $X$ variable to predict $y$

 $$y=mx+b$$

 $$y = \beta_1 x_1 + \beta_0 $$

In [None]:
plot_data(X='budget',y='revenue',data=df,fit_reg=True);

### Our Baseline Simple Linear Regression

In [None]:
f = "revenue~budget"
model1 = smf.ols(f,df).fit()
display(model1.summary())

fig = sm.graphics.qqplot(model1.resid,dist=stats.norm,fit=True,line='45')
fig = sm.graphics.plot_regress_exog(model1, "budget", fig=plt.figure(figsize=(12,8)))

### Our Second Model After Removing Outliers

In [None]:
## Visualize Data WITH outliers
sns.jointplot(data=df,x='budget',y='revenue')

In [None]:
## Get X outliers
X_outliers_IQR = sg.find_outliers_IQR(df['budget'])
y_outliers_IQR = sg.find_outliers_IQR(df['revenue'])

## Combine outliers
idx_outliers_IQR = X_outliers_IQR  | y_outliers_IQR
idx_outliers_IQR.sum()


## Create df_clean
df_clean = df[~idx_outliers_IQR].copy()
sns.jointplot(data=df_clean,x='budget',y='revenue')

df_clean.head(2)
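
`sg.find_outliers_IQR` comes from the study-group `functions_SG` module, which is not shown in these notes. A minimal sketch of what such a helper might look like, assuming the usual 1.5 × IQR rule and a boolean Series as the return value (the name `find_outliers_iqr_sketch` and the toy data are hypothetical):

```python
import pandas as pd

def find_outliers_iqr_sketch(data: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series that is True where a value is an IQR outlier."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (data < lower) | (data > upper)

# Made-up values: five ordinary budgets and one extreme one
toy = pd.Series([10, 12, 11, 13, 12, 500], name="budget")
outlier_mask = find_outliers_iqr_sketch(toy)
print(outlier_mask.sum())  # number of flagged rows
```

Because the result is a boolean mask, masks for several columns can be combined with `|` and inverted with `~`, which is how the cell above builds `df_clean`.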


In [None]:
## Get the model params 
f = "revenue~budget"
model = smf.ols(f,df_clean).fit()
display(model.summary())
sm.graphics.qqplot(model.resid,dist=stats.norm,line='45',fit=True);
sm.graphics.plot_regress_exog(model,'budget',plt.figure(figsize=(12,8)));

# Multiple Linear Regression

### Today's Objectives

- Briefly discuss big-picture re: multiple linear regressions vs simple linear regressions.
- Discuss ways to handle categorical data.
- Discuss a fourth assumption with Multiple Regression - Assumption of No Multicollinearity
- Extend yesterday's task to use multiple features from the movie dataset. 

## Multiple Predictor (X) Variables

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\ldots + \beta_n x_n $$

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-multiple-linear-regression-online-ds-ft-100719/master/images/multiple_reg.png" width=400>

#### $\hat Y$ vs $Y$


- Y: Actual value corresponding to a specific X value

- "Y hat" ($\hat Y$): Value predicted by the model for a specific X value.


$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 

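where $n$ is the number of predictors, $\hat\beta_0$ is the estimated intercept, each $\hat\beta_i$ is the estimated coefficient for predictor $x_i$, and $\hat y$ is the predicted value of the dependent variable (the so-called "fitted line" when there is only one predictor).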
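
To make the equation concrete, here is a small sketch on synthetic data (not the movie dataset) where the true coefficients are known in advance. The column names `x1`, `x2`, `y` and the chosen coefficients are made up for illustration; the fitting call is the same `smf.ols` used earlier in these notes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_rows = 500
toy = pd.DataFrame({
    "x1": rng.normal(size=n_rows),
    "x2": rng.normal(size=n_rows),
})
# True relationship: y = 3 + 2*x1 - 1*x2 + noise
toy["y"] = 3 + 2 * toy["x1"] - 1 * toy["x2"] + rng.normal(scale=0.5, size=n_rows)

# Two predictors on the right-hand side of the formula
toy_model = smf.ols("y ~ x1 + x2", data=toy).fit()
print(toy_model.params)  # estimated intercept and one coefficient per predictor

y_hat = toy_model.fittedvalues    # y-hat for every row
residuals = toy["y"] - y_hat      # actual minus predicted
```

The estimated parameters should land close to 3, 2, and -1, and `residuals` matches `toy_model.resid`.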

## DEALING WITH CATEGORICAL VARIABLES

- What are categorical variables?
- Understand creating dummy variables for predictors.
- Use pandas and Scikit-Learn to create dummies
- Understand and avoid the "dummy variable trap"

### What are categorical variables?
- Variables whose values are group labels rather than continuous or ordered numeric quantities.

### Identifying categorical variables:
What to look for?
1. Column dtype is 'object'
2. Use `df.describe()`: check the min/max. Are the values a small set of integers?
3. Use scatterplots & histograms: look for datapoints stacked in vertical columns instead of spread along the axis

In [None]:
## Check dtypes
df.info()

In [None]:
## Can use select_dtypes
cat_cols = list(df.select_dtypes('O').columns)
cat_cols

In [None]:
## can do the same for numeric
num_cols =list(df.select_dtypes('number').columns)
num_cols

In [None]:
## Check describe
df.describe()

In [None]:
df[num_cols]

In [None]:
df.isna().sum()

In [None]:
## Inspect the Value Counts for Each Str Col
for col in cat_cols:
    display(df[col].value_counts(dropna=False).sort_index())
    print()

### Transforming Categorical Variables

To use categorical variables for regression, they must be transformed.
There are 2 methods for dealing with them:
1. ~~Label Encoding~~ (not intended for X data!)
    - Replace string categories with integer values (0 to n)
    - Can be done with:
        1. Pandas 
        2. Scikit Learn
        

2. One-hot / dummy encoding
    - Turn each category in a categorical variable into its own variable, that is either a 0 or 1. 0 for rows that do not belong to that sub-category. 1 for rows that belong to the sub-category
    - Can be done with:
        1. Pandas
        2. Scikit Learn
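
To see why label encoding is discouraged for predictors, consider a toy ratings column (made-up data, not the movie dataset). Integer codes imply an order and equal spacing between categories, which a regression would treat as real; dummy columns make no such claim.

```python
import pandas as pd

toy = pd.DataFrame({"rating": ["PG", "R", "G", "PG-13", "R"]})

# Label encoding: categories are sorted alphabetically, then numbered.
codes = toy["rating"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
dummies = pd.get_dummies(toy["rating"], dtype=int)
print(dummies)
```

With the codes, a single coefficient would have to describe the step from G to PG, PG to PG-13, and PG-13 to R as if they were the same size. With dummies, each rating gets its own coefficient.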


### Label Encoding

In [None]:
## Check the Value Counts for our test column - "mpaa_rating"
df['mpaa_rating'].value_counts(normalize=True)

#### Via pandas.cat.codes

In [None]:
## Label Encode with .cat.codes
df['mpaa_rating'] = df['mpaa_rating'].astype('category')
df['mpaa_rating'].cat.codes.value_counts(normalize=True)

#### Via Sklearn's LabelEncoder

In [None]:
## Using sklearn LabelEncoder
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
rating_enc = encoder.fit_transform(df['mpaa_rating'])
rating_enc

In [None]:
encoder.inverse_transform(rating_enc)

### Dummy Encoding / One-Hot Encoding

#### Via Pandas.get_dummies()

In [None]:
df['mpaa_rating'].unique()

In [None]:
# df_dummies = pd.get_dummies(df,columns=['mpaa_rating'])
df_dummies = pd.get_dummies(df['mpaa_rating'])

df_dummies

#### Via Scikit-Learn's OneHotEncoder

In [None]:
cat_cols = ['mpaa_rating']

In [None]:
from sklearn.preprocessing import OneHotEncoder
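## NOTE: scikit-learn >= 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions
encoder = OneHotEncoder(drop='first',sparse_output=False)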
ohe_vars = encoder.fit_transform(df[cat_cols])
ohe_vars
# pd.DataFrame(ohe_vars,columns=binarizer.classes_)

In [None]:
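## NOTE: get_feature_names_out needs scikit-learn >= 1.0 (older versions: get_feature_names)
## Reuse df's index so the encoded columns line up with df when combined later
cat_vars = pd.DataFrame(ohe_vars,columns=encoder.get_feature_names_out(cat_cols),index=df.index)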
cat_vars

#### The Dummy Variable Trap


In [None]:
pd.get_dummies(df['mpaa_rating'],drop_first=True)#,prefix="ohe")
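
The trap: when every category gets its own dummy column, the dummies in each row sum to 1, which exactly duplicates the intercept column. That is perfect multicollinearity, and dropping one category (which becomes the reference level) removes it. A small sketch on made-up ratings, using matrix rank to show the redundancy (the `with_intercept` helper is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rating": ["G", "PG", "PG-13", "R", "PG", "R"]})

all_dummies = pd.get_dummies(toy["rating"], dtype=float)
dropped = pd.get_dummies(toy["rating"], drop_first=True, dtype=float)

def with_intercept(dummy_df: pd.DataFrame) -> np.ndarray:
    """Prepend a column of ones, as a regression with an intercept would."""
    ones = np.ones((len(dummy_df), 1))
    return np.hstack([ones, dummy_df.to_numpy()])

X_trap = with_intercept(all_dummies)  # intercept + 4 dummies
X_ok = with_intercept(dropped)        # intercept + 3 dummies

# A design matrix is usable when its rank equals its number of columns.
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))
```

The full set of dummies has one more column than its rank, so one column carries no new information. After `drop_first=True` the two numbers agree.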

# Activity: Multiple Linear Regression with Movies

In [None]:
## Load in our data fresh
df = load_movie_data(verbose=False)
df.head(2)

In [None]:
## Check dtypes,etc


In [None]:
## Check nulls

In [None]:
## Inspect the numeric columns


## Encode Categorical Data

In [None]:
## Remake final cat cols


In [None]:
## Create encoded vars


In [None]:
## make encoded vars_df
df_ohe = None

In [None]:
## Create df model from original df and df_ohe
df_model = None

In [None]:
## Drop columns we don't want to use in the model
drop_cols = ['title','imdb_id','original_title','release_date']


## New Assumption: No Multicollinearity

### Multicollinearity
- An additional concern to check for.
- Rule of thumb: a correlation above 0.70 (in absolute value) between two predictors is too high.
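
One way to apply the 0.70 rule of thumb is to take the absolute correlation matrix of the predictors, hide the redundant upper triangle with a mask, and list the pairs that exceed the threshold. A sketch on synthetic data (the columns `a`, `b`, `c` are made up; `b` is deliberately built as a near copy of `a`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({"a": rng.normal(size=n_rows)})
toy["b"] = toy["a"] * 0.95 + rng.normal(scale=0.1, size=n_rows)  # nearly a copy of a
toy["c"] = rng.normal(size=n_rows)                                # unrelated to a and b

corr = toy.corr().abs()

# True marks the upper triangle (diagonal included), which only repeats
# the lower triangle and the trivial self-correlations.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Blank out the masked cells, then keep the pairs above the threshold.
pairs = corr.mask(mask).stack()
high_pairs = pairs[pairs > 0.70]
print(high_pairs)
```

The same `mask` array can be passed to `sns.heatmap(corr, mask=mask, annot=True)` to draw only the lower triangle.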


In [None]:
## Get the correlation matrix for our model_df (without the target)


In [None]:
## Plot this as a heatmap


In [None]:
## Create a mask to make the multiplot easier to look at 


In [None]:
## Fill in the upper right cells with True


In [None]:
## Plot again, with the mask


In [None]:
# Functionize

def multiplot():
    pass

In [None]:
## Drop any multicollinear features


In [None]:
## Create a string representing the right side of the ~ in our formula


In [None]:
## Create the final formula and create the model


> RUH ROH!

### Fixing Statsmodels Formulas

In [None]:
## Fix df column names so there are no spaces


In [None]:
## Create a dict with new ratings names from the mpaa_rating col


In [None]:
## Replace the original mpaa_rating col
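
Formula strings are parsed like Python expressions, so a column named `mpaa_rating_PG-13` would be read as `mpaa_rating_PG` minus `13`, and a name containing a space fails to parse. A sketch of one way to sanitize names and assemble the right-hand side, on a made-up frame (the column names and the `clean_name` helper are hypothetical, and this may differ from the approach used in the live session):

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0],
    "budget": [1.0, 2.0, 3.0],
    "mpaa_rating_PG-13": [1, 0, 0],
    "mpaa_rating_Not Rated": [0, 1, 0],
})

def clean_name(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"\W", "_", name)

toy = toy.rename(columns=clean_name)

# Everything except the target goes on the right side of the ~
target = "revenue"
features = " + ".join(toy.drop(columns=target).columns)
formula = f"{target} ~ {features}"
print(formula)
```

The resulting string can be handed to `smf.ols(formula, data=toy)`. An alternative that leaves the columns untouched is to wrap awkward names in the formula with `Q("...")`.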


### Prepare df_model again

In [None]:
## Create a string representing the right side of the ~ in our formula


In [None]:
## Create the final formula and create the model


In [None]:
## QQ PLOT


## TO DOs

- In today's study group, we did NOT demo the absolute best way to prepare and perform a regression model.

- Additional Topics to discuss tomorrow:
    - train-test-split/cross-validation
    - Using VIF to deal with multicollinearity
    - Using feature selection methods
    - Outlier removal