# Linear Regression with Statsmodels for Movie Revenue

- 07/18/23

## Activity: Create a Linear Regression Model with Statsmodels for Revenue

- Today we will be working with JUST the data from the TMDB API for years 2000-2021.
    - We will prepare the data for modeling
        - Some feature engineering
        - Our usual Preprocessing
        - New steps for statsmodels!
    - We will fit a statsmodels linear regression.
    - We will inspect the model summary.
    - We will create the visualizations to check assumptions about the residuals.



- Next class we will continue this activity.
    - We will check all 4 assumptions more thoroughly.
    - We will discuss tactics for dealing with violations of the assumptions. 
    - We will use our coefficients to make stakeholder recommendations.

### Concepts Demonstrated

- [ ] Using `glob` for loading in all final files. 
- [ ] Statsmodels OLS
- [ ] QQ-Plot
- [ ] Residual Plot

# Loading the Data

In [1]:
import json
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
## fixing the random seed for lesson generation
np.random.seed(321)

# Set global scikit-learn configuration 
from sklearn import set_config

# Display estimators as a diagram
set_config(display='diagram')

In [2]:
pd.set_option('display.max_columns',100)

### 📚 Finding & Loading Batches of Files with `glob`

In [3]:
## Checking what data we already have in our Data folder using os.listdir
import os


In [4]:
## Try loading in the first .csv.gz file from the list


> Why isn't it working?

In [5]:
## let's check the filepath 


In [6]:
## add the folder plus filename


In [7]:
## try read csv with folder plus filename


- Now we could do that in a loop, and we would only want to open the .csv.gz files.
- But there is a better way!
- Introducing `glob`
    - `glob` takes a filepath/query and finds every filename that matches the pattern provided.
    - We use asterisks as wildcards in our query.
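
A minimal sketch of the difference between `os.listdir` and `glob.glob`. The folder is a temporary one and the file names are hypothetical stand-ins, not the real contents of our Data folder.

```python
import glob
import os
import tempfile

# Build a throwaway folder so the example does not depend on the real Data folder.
# The file names below are hypothetical stand-ins for the TMDB exports.
folder = tempfile.mkdtemp()
for name in ["final_tmdb_data_2000.csv.gz", "final_tmdb_data_2001.csv.gz", "notes.txt"]:
    open(os.path.join(folder, name), "w").close()

# os.listdir returns bare file names, with no folder attached
print(sorted(os.listdir(folder)))

# glob.glob returns the complete path of every file that matches the pattern;
# the asterisk matches any run of characters within a file name
query = os.path.join(folder, "*.csv.gz")
file_list = sorted(glob.glob(query))
print(file_list)
```

Because the paths from `glob.glob` already include the folder, each one can be passed straight to `pd.read_csv`.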
    


In [8]:
import glob
## Make a filepath query


In [9]:
# Use glob.glob to get COMPLETE filepaths


In [10]:
# Use glob.glob to get COMPLETE filepaths and sort


> But where are the rest of the years?

In [11]:
## in a sub-folder


- Recursive Searching with glob.
    - Add a `**/` in the middle of your query and pass `recursive=True` to `glob.glob` to grab any matches from all subfolders.
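
A small sketch of the recursive search, again in a temporary folder. The sub-folder and file names are made up for illustration. Note that `**` only searches every level of subfolders when `recursive=True` is passed; without it, `**` behaves like a single `*`.

```python
import glob
import os
import tempfile

# Hypothetical layout: one file at the top level, one inside a sub-folder
root = tempfile.mkdtemp()
sub = os.path.join(root, "2010-2021")
os.makedirs(sub)
for path in [os.path.join(root, "tmdb_2000.csv.gz"),
             os.path.join(sub, "tmdb_2010.csv.gz")]:
    open(path, "w").close()

# A plain query only sees the top-level folder
top_only = glob.glob(os.path.join(root, "*.csv.gz"))

# With recursive=True, "**" matches zero or more nested folders
all_files = sorted(glob.glob(os.path.join(root, "**", "*.csv.gz"), recursive=True))

print(len(top_only), len(all_files))
```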

In [12]:
# Use glob.glob to get COMPLETE filepaths


In [13]:
# ## use a list comprehension to load in all files into 1 dataframe


- Dealing with ParserErrors from "possibly malformed files"

    - For a reason I do not fully understand yet, some of the files I downloaded raise an error when I try to read them.
        - `ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.`
    - After some googling, the fix was to add `lineterminator='\n'` to `pd.read_csv`.
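
A hedged sketch of the loading step with that fix applied. The two small files are hypothetical stand-ins written to a temporary folder, and the assumption that placeholder rows carry the id `"0"` in an `imdb_id` column is ours, not something these notes confirm.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the yearly TMDB files, written to a temp folder
folder = tempfile.mkdtemp()
pd.DataFrame({"imdb_id": ["0", "tt001"], "revenue": [0, 100]}).to_csv(
    os.path.join(folder, "tmdb_2000.csv.gz"), index=False
)
pd.DataFrame({"imdb_id": ["tt002"], "revenue": [250]}).to_csv(
    os.path.join(folder, "tmdb_2001.csv.gz"), index=False
)

files = sorted(glob.glob(os.path.join(folder, "*.csv.gz")))

# Read every file with an explicit line terminator and stack the results
df = pd.concat([pd.read_csv(f, lineterminator="\n") for f in files])

# Drop placeholder rows (assumed here to carry the id "0") and rebuild the index
df = df[df["imdb_id"] != "0"].reset_index(drop=True)
print(df)
```

Resetting the index matters after `pd.concat`, because each file contributes its own index starting at 0.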


In [14]:
# ## use a list comprehension to load in all files into 1 dataframe (with lineterminator='\n')

In [15]:
# remove ids that are 0


In [None]:
# reset index

In [16]:
## saving the combined csv to disk


# Preprocessing

## Feature Engineering


- Belongs to Collection: convert to boolean
- Genres: get just the name and manually OHE
- Cleaning Categories in Certification
- Converting release date to year, month, and day.
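
The four steps above can be sketched on a toy dataframe. This is only an illustration under assumptions: the column names follow the TMDB field names, the `genres` strings are valid JSON here (the real file may need extra cleanup before `json.loads` works), and `get_genre_names` is a hypothetical helper name.

```python
import json

import pandas as pd

# Toy rows standing in for the combined TMDB data
df = pd.DataFrame({
    "belongs_to_collection": ['{"id": 10, "name": "Toy Collection"}', None, None],
    "genres": ['[{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}]',
               '[{"id": 18, "name": "Drama"}]',
               '[]'],
    "certification": ["PG-13 ", "R", None],
    "release_date": ["2001-05-18", "2010-12-01", "2015-07-04"],
})

# 1) Belongs to collection: True wherever a value is present
df["belongs_to_collection"] = df["belongs_to_collection"].notna()

# 2) Genres: keep just the names, then one-hot encode by hand
def get_genre_names(raw):
    """Return the list of genre names from a JSON string of genre dicts."""
    return [genre["name"] for genre in json.loads(raw)]

df["genres_list"] = df["genres"].apply(get_genre_names)
unique_genres = sorted(df["genres_list"].explode().dropna().unique())
for genre in unique_genres:
    df[f"Genre_{genre}"] = df["genres_list"].apply(lambda names: genre in names)
df = df.drop(columns=["genres", "genres_list"])

# 3) Certification: strip stray whitespace so "PG-13 " and "PG-13" match
df["certification"] = df["certification"].str.strip()

# 4) Release date: split into numeric year, month, and day columns
dates = df["release_date"].str.split("-", expand=True).astype(float)
dates.columns = ["year", "month", "day"]
df = pd.concat([df.drop(columns=["release_date"]), dates], axis=1)
print(df.head())
```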

### belongs to collection

In [17]:
# there are 3,700+ movies that belong to collections


In [18]:
## Use .notna() to get True if it belongs to a collection


### genre

In [19]:
# View a test case for genres


In [20]:
## Function to get just the genre names as a list 


In [21]:
## Use our function on our test case


In [22]:
## Use our function and explode the new column


In [23]:
## save unique genres


In [24]:
## Manually One-Hot-Encode Genres


In [25]:
## Drop original genre cols


### certification

In [26]:
## Checking Certification value counts


In [27]:
# fix extra space certs


In [28]:
## fix certification col


### Converting release date to separate features

In [29]:
## view value_counts()


In [30]:
## split release date into 3 columns


In [None]:
## drop original feature

In [31]:
# View head of data


## Train Test Split

In [32]:
# View info


In [33]:
## Make x and y variables


In [34]:
# Sum up NAs


In [35]:
## make cat selector and use it to save list of column names


In [36]:
## select manually OHE cols for later


In [37]:
## make num selector and use it to save list of column names


In [38]:
## convert manual ohe to int


In [39]:
## make pipelines


In [40]:
## fit the col transformer


## Finding the categorical pipeline in our col transformer.


In [41]:
## B) Using list-slicing to find the encoder 


## Create the empty list


In [42]:
## checking shape matches len final features


In [43]:
## make X_train_tf 


In [44]:
## make X_test_tf 


### Adding a Constant for Statsmodels

In [45]:
##import statsmodels correctly
import statsmodels.api as sm

> Tip: make sure that add_constant actually added a new column! You may need to change the parameter `has_constant` to "add"

In [46]:
## Make final X_train_df and X_test_df with constants added



# Modeling

## Statsmodels OLS

In [47]:
## instantiate an OLS model WITH the training data.


## Fit the model and view the summary


In [48]:
## Get train data performance from sklearn to confirm it matches OLS


## Get test data performance


# The Assumptions of Linear Regression

- The 4 Assumptions of a Linear Regression are:
    - Linearity: That the input features have a linear relationship with the target.
    - Independence of features (AKA Little-to-No Multicollinearity): That the features are not strongly related to other features.
    - **Normality: The model's residuals are approximately normally distributed.**
    - **Homoscedasticity: The model residuals have equal variance across all predictions.**


### QQ-Plot for Checking for Normality

In [49]:
## Create a Q-Q Plot

# first calculate residuals 


## then use sm's qqplot


### Residual Plot for Checking Homoscedasticity

In [50]:
## Plot scatterplot with y_hat_test vs resids


### Putting it all together

In [51]:
# Function to plot qq plot and residual plot

### Next class: iterating on our model & interpreting coefficients