# Movie Feature Selection 

This notebook uses several of Yellowbrick's visualizers to identify the most important features in our movies dataset: Rank2D, ParallelCoordinates, RadViz, and FeatureImportances.  Lastly, we fit some preliminary models and plot validation curves to get a feel for the hyperparameters of the RandomForestClassifier.  The related GitHub repository for this project is here: https://github.com/georgetown-analytics/Box-Office.  

The raw data sources were cleaned, wrangled, and pre-processed in separate Python scripts stored here: https://github.com/georgetown-analytics/Box-Office/tree/master/codes.  

The final dataset is stored in a SQLite database here: https://github.com/georgetown-analytics/Box-Office/tree/master/database.

This notebook covers the feature selection stage of the project.

Author: Rebecca George.  Team Box Office: George Brooks, Rebecca George, Lance Liu

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import pandas.io.sql as pd_sql
import sqlite3 as sql
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns

from yellowbrick.features import Rank1D, Rank2D, ParallelCoordinates, RadViz, FeatureImportances
from yellowbrick.model_selection import ValidationCurve
from yellowbrick.classifier import ROCAUC, ClassificationReport, ConfusionMatrix

from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split as tts

from sklearn.naive_bayes import MultinomialNB, GaussianNB 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier, 
    RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier

import os

%matplotlib notebook

Get path to folder storing the SQLite database

In [3]:
two_up = os.path.abspath(os.path.join(os.getcwd(),"../.."))
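#Build the path with os.path.join so it works on any OS and avoids backslash escape sequences
path = os.path.join(two_up, 'database', 'movies.db')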

In [6]:
con = sql.connect(path) 

data = pd_sql.read_sql('select * from finalMovies_20180814', con, index_col='index')
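
For readers without the project database at hand, the load step above can be sketched against an in-memory SQLite database. The table name `toy_movies` and its columns are invented for illustration; only `sqlite3.connect`, `DataFrame.to_sql`, and `pandas.read_sql` are real calls.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for movies.db: an in-memory database with one small table.
toy_con = sqlite3.connect(":memory:")
toy = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Revenue_Real": [10.0, 0.0, 25.0]},
    index=pd.Index([101, 102, 103], name="index"),
)
toy.to_sql("toy_movies", toy_con)  # the DataFrame index is written as the "index" column

# Same call shape as the cell above: select everything, restore the saved index.
loaded = pd.read_sql("select * from toy_movies", toy_con, index_col="index")
toy_con.close()

print(loaded)
```

Passing `index_col='index'` matters because `to_sql` stores the DataFrame index as an ordinary column; without it, the loaded frame would get a fresh 0..n-1 index.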


### Limit dataset to movies that will be used in modeling.

We will create the new feature "Profit_Bucket_Binary" and convert any "nan" in the historical cast/crew revenue columns to 0. 

In [7]:
#Limit dataset to movies with the Revenue, Budget, Length, and imdbVotes columns filled in.  Based on what I've seen,
#this helps to get rid of remaining duplicates in the dataset.  It also narrows the dataset down for testing
#profitability, where we need both revenue and budget filled in.  We could experiment with movies with revenue
#greater than $500,000, if we choose to.
#.copy() makes movies an independent DataFrame, so the column assignments below do not act on a view of data.
movies = data[(data['Revenue_Real']>0)&(data['Budget_Real']>0)&(data['Length']>0)&(data['imdbVotes']>0)].copy()

#Make binary variable: 1 if the movie made at least 2x its budget, 0 otherwise (stored as integers)
movies['Profit_Bucket_Binary']=movies['Profit_Bucket'].isin(['[2-3x)', '[3-4x)', '[4-5x)', '>=5x']).astype(int)

#Put zero where null
movies['Revenue_Actor_Real']=movies['Revenue_Actor_Real'].fillna(0)
movies['Revenue_Director_Real']=movies['Revenue_Director_Real'].fillna(0)
movies['Revenue_Writer_Real']=movies['Revenue_Writer_Real'].fillna(0)

#Could experiment with logged values of revenue and budget
movies['Revenue_Real_Log']=np.log(movies['Revenue_Real'])
movies['Budget_Real_Log']=np.log(movies['Budget_Real'])
#Cast/crew revenue can be 0 (nulls were zero-filled above) and np.log(0) is -inf, so use log(1 + x) for these columns
movies['Revenue_Actor_Real_Log']=np.log1p(movies['Revenue_Actor_Real'])
movies['Revenue_Director_Real_Log']=np.log1p(movies['Revenue_Director_Real'])
movies['Revenue_Writer_Real_Log']=np.log1p(movies['Revenue_Writer_Real'])
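
The three steps in the cell above (binary target, zero-filling, log transform) can be tried out on a toy frame. The rows below are invented; `isin` and `np.log1p` are shown as one way to express these steps, chosen so the target stays numeric and zero-filled rows stay finite.

```python
import numpy as np
import pandas as pd

# Toy rows only; the bucket labels follow the ones used in the cell above.
toy = pd.DataFrame({
    "Profit_Bucket": ["<1x", "[1-2x)", "[2-3x)", ">=5x"],
    "Revenue_Actor_Real": [np.nan, 2.0e6, 0.0, 5.0e7],
})

# 1 when the movie made at least 2x its budget, 0 otherwise.
at_least_2x = ["[2-3x)", "[3-4x)", "[4-5x)", ">=5x"]
toy["Profit_Bucket_Binary"] = toy["Profit_Bucket"].isin(at_least_2x).astype(int)

# Missing historical revenue becomes 0, as in the notebook.
toy["Revenue_Actor_Real"] = toy["Revenue_Actor_Real"].fillna(0)

# log(0) is -inf, so log1p (log(1 + x)) is one way to keep zero-filled rows finite.
toy["Revenue_Actor_Real_Log"] = np.log1p(toy["Revenue_Actor_Real"])

print(toy)
```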

# Use Rank2D function to see correlations among features

We will look at the main ~40 features in our dataset to see the correlations. 
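
With `algorithm='spearman'`, Rank2D draws a heatmap of pairwise Spearman rank correlations. As a rough sketch of the numbers behind such a plot, pandas can compute the same kind of matrix directly. The columns below are invented and are not the movie data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=200)

# Invented columns: revenue rises monotonically (but not linearly) with budget,
# while length is unrelated noise.
toy = pd.DataFrame({
    "Budget": budget,
    "Revenue": budget ** 2,
    "Length": rng.uniform(80, 180, size=200),
})

# Spearman correlates the ranks, so any strictly increasing relationship scores 1.
spearman = toy.corr(method="spearman")
print(spearman.round(2))
```

Rank correlation is a reasonable default here because revenue and budget are heavily skewed, and Spearman is insensitive to monotonic rescaling such as a log transform.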

In [33]:
#Include Revenue variable to see what it correlates with.
features = movies[["Revenue_Real","Budget_Real", "Holiday", "Summer", "Spring", "Fall", "Winter",
'Rating_RT', 'Rating_IMDB', 'Rating_Metacritic','isCollection','Length', 'imdbVotes',
'Genre_Drama','Genre_Comedy','Genre_Action_Adventure','Genre_Thriller_Horror','Genre_Romance',
 'Genre_Crime_Mystery','Genre_Animation','Genre_Scifi','Genre_Documentary','Genre_Other',      
'Rated_G_PG','Rated_PG-13','Rated_R','Rated_Other','Comp_Disney','Comp_DreamWorks','Comp_Fox',
'Comp_Lionsgate','Comp_MGM','Comp_Miramax','Comp_Paramount','Comp_Sony','Comp_Universal',
'Comp_WarnerBros','Comp_Other', 'Revenue_Actor_Real','Revenue_Director_Real', 'Revenue_Writer_Real',
'Nominated_Major', 'Nominated_Minor', 'Won_Major', 'Won_Minor']]
labels = movies["Profit_Bucket_Binary"]

In [34]:
%matplotlib notebook
oz = Rank2D(features=list(features.columns), algorithm = 'spearman') #features expects a list of column names
oz.fit_transform(features, labels)
oz.poof()


In [10]:
#Include the Profit_Bucket_Binary target (instead of Revenue) to see what it correlates with.
features = movies[["Profit_Bucket_Binary","Budget_Real", "Holiday", "Summer", "Spring", "Fall", "Winter",
'Rating_RT', 'Rating_IMDB', 'Rating_Metacritic','isCollection','Length', 'imdbVotes',
'Genre_Drama','Genre_Comedy','Genre_Action_Adventure','Genre_Thriller_Horror','Genre_Romance',
 'Genre_Crime_Mystery','Genre_Animation','Genre_Scifi','Genre_Documentary','Genre_Other',      
'Rated_G_PG','Rated_PG-13','Rated_R','Rated_Other','Comp_Disney','Comp_DreamWorks','Comp_Fox',
'Comp_Lionsgate','Comp_MGM','Comp_Miramax','Comp_Paramount','Comp_Sony','Comp_Universal',
'Comp_WarnerBros','Comp_Other', 'Revenue_Actor_Real','Revenue_Director_Real', 'Revenue_Writer_Real',
'Nominated_Major', 'Nominated_Minor', 'Won_Major', 'Won_Minor']]
labels = movies["Profit_Bucket_Binary"]

In [15]:
%matplotlib notebook
oz = Rank2D(features=list(features.columns), algorithm = 'spearman') #features expects a list of column names
oz.fit_transform(features, labels)
oz.poof()


# Use Parallel Coordinates/Radviz functions 

Look at which features help separate the two profit buckets.

In [35]:
label_encoder = LabelEncoder()
yc = label_encoder.fit_transform(labels)
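
As a small illustration of what `LabelEncoder` does in the cell above: it maps each distinct label to an integer and remembers the original labels in `classes_`. The string labels here are invented.

```python
from sklearn.preprocessing import LabelEncoder

# Invented string labels standing in for a class column.
toy_labels = ["low", "high", "high", "low"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

print(encoder.classes_)  # classes are stored in sorted order
print(encoded)           # each label replaced by its index in classes_
```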

In [59]:
oz = ParallelCoordinates(normalize='minmax') #could also try normalize='standard'
oz.fit_transform(movies[['Revenue_Actor_Real', 'Rating_RT', 
                        'Revenue_Writer_Real', 'Genre_Drama', 'Budget_Real']], movies['Profit_Bucket_Binary']) 
oz.poof()
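
With `normalize='minmax'`, each feature is rescaled before plotting so that columns with very different units can share one vertical axis. A rough equivalent using scikit-learn's `MinMaxScaler` is sketched below on invented values; it is meant to show the effect of the scaling, not Yellowbrick's internals.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values on very different scales (dollars vs. a 0-100 rating).
toy = pd.DataFrame({
    "Budget_Real": [1.0e6, 5.0e7, 2.0e8],
    "Rating_RT": [35.0, 80.0, 95.0],
})

# Rescale every column to [0, 1]: (x - min) / (max - min).
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
print(scaled)
```

Without this step, a column like Budget_Real would dominate the axis and the rating columns would appear as flat lines near zero.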


In [60]:
yc = label_encoder.fit_transform(movies['Profit_Bucket_Binary'])
features = movies[['Budget_Real', 'Revenue_Actor_Real', 'Revenue_Writer_Real', 'Rating_RT']]
oz = RadViz(classes=label_encoder.classes_, features=list(features.columns)) #features expects a list of column names
oz.fit(features, yc)
oz.poof()


# Use FeatureImportances with some preliminary models

Rank the features by how strongly each preliminary model relies on them when predicting the profit bucket.

In [22]:
features = movies[["Budget_Real", "Holiday", "Summer", "Spring", "Fall", "Winter",
'Rating_RT', 'Rating_IMDB', 'Rating_Metacritic','isCollection','Length', 'imdbVotes',
'Genre_Drama','Genre_Comedy','Genre_Action_Adventure','Genre_Thriller_Horror','Genre_Romance',
 'Genre_Crime_Mystery','Genre_Animation','Genre_Scifi','Genre_Documentary','Genre_Other',      
'Rated_G_PG','Rated_PG-13','Rated_R','Rated_Other','Comp_Disney','Comp_DreamWorks','Comp_Fox',
'Comp_Lionsgate','Comp_MGM','Comp_Miramax','Comp_Paramount','Comp_Sony','Comp_Universal',
'Comp_WarnerBros','Comp_Other', 'Revenue_Actor_Real','Revenue_Director_Real', 'Revenue_Writer_Real',
'Nominated_Major', 'Nominated_Minor', 'Won_Major', 'Won_Minor'
]]
labels = movies["Profit_Bucket_Binary"]

In [23]:
%matplotlib notebook
model = GradientBoostingClassifier()
oz=FeatureImportances(model) #pass the model created above instead of building a second one
oz.fit(features, labels)
oz.poof()


In [24]:
%matplotlib notebook
oz=FeatureImportances(RandomForestClassifier())
oz.fit(features, labels)
oz.poof()


In [25]:
%matplotlib notebook
oz=FeatureImportances(ExtraTreesClassifier())
oz.fit(features, labels)
oz.poof()


In [26]:
%matplotlib notebook
oz=FeatureImportances(AdaBoostClassifier())
oz.fit(features, labels)
oz.poof()


In [27]:
%matplotlib notebook
oz=FeatureImportances(LinearSVC())
oz.fit(features, labels)
oz.poof()


In [28]:
##Different code to see most important features in random forest, log regression models
##Got this code to determine most important feature from here:
##https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e
#feature_importances = pd.DataFrame(rf.feature_importances_,
#   index = X_train.columns,columns=['importance']).sort_values('importance',
#    ascending=False)
#(feature_importances.head(10))
#feature_importances = pd.DataFrame(log_reg.coef_.transpose(),
#   index = X_train.columns,columns=['importance']).sort_values('importance',
#    ascending=False)
#feature_importances.abs().sort_values(by='importance', ascending=False)
#(feature_importances.head(10))
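
The commented-out approach above can be tried in isolation. This is a minimal sketch on synthetic data from `make_classification`; the column names are placeholders and the model settings are arbitrary, so the ranking says nothing about the movie features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the movie features; column names are placeholders.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per column, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

For linear models such as LogisticRegression or LinearSVC there is no `feature_importances_`; the coefficients in `coef_` play that role, and their magnitudes are only comparable when the features are on similar scales.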

# Use ValidationCurve with some preliminary models

Look at what hyperparameters can be tuned in RandomForestClassifier

In [66]:
X_train, X_test, y_train, y_test = tts(features, labels, test_size=0.2)
oz = ValidationCurve(RandomForestClassifier(), param_name='n_estimators', param_range=np.arange(1, 100, 10))
oz.fit(X_train, y_train)
oz.poof()
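
The scores behind a validation curve can also be inspected as plain numbers with scikit-learn's `validation_curve`, which the Yellowbrick visualizer builds on. The sketch below uses synthetic data and an arbitrary small parameter grid, so the values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic data in place of the movie features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_range = np.array([1, 5, 10])
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y,
    param_name="max_depth",
    param_range=param_range,
    cv=3,
)

# One row per parameter value, one column per cross-validation fold.
for depth, tr, val in zip(param_range, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"max_depth={depth}: train={tr:.3f}, cv={val:.3f}")
```

A training score that keeps rising while the cross-validation score flattens or drops is the usual sign that the parameter value is starting to overfit.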


In [67]:
oz = ValidationCurve(RandomForestClassifier(), param_name='max_features', param_range=np.arange(1, 43, 1))
oz.fit(X_train, y_train)
oz.poof()


In [68]:
oz = ValidationCurve(RandomForestClassifier(), param_name='max_depth', param_range=np.arange(1, 40, 5))
oz.fit(X_train, y_train)
oz.poof()


In [69]:
oz = ValidationCurve(RandomForestClassifier(), param_name='min_samples_split', param_range=np.arange(2, 10, 1))
oz.fit(X_train, y_train)
oz.poof()


In [70]:
oz = ValidationCurve(RandomForestClassifier(), param_name='min_samples_leaf', param_range=np.arange(1, 10, 1))
oz.fit(X_train, y_train)
oz.poof()
