# Engineering Features

Now that I know how poorly my model is scoring, I am going to engineer some additional features in the hope of giving the model better data to train on.

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from rake_nltk import Rake
import datetime, time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
with open('../0_Assets_&_Data/series_df_full.pickle', 'rb') as f:
    series_df = pickle.load(f)
    
with open('../0_Assets_&_Data/model_prelim.pickle', 'rb') as f:
    model_df = pickle.load(f)
    
with open('../0_Assets_&_Data/week_day.pickle', 'rb') as f:
    week_day = pickle.load(f)
    
with open('../0_Assets_&_Data/cleaned_series_df.pickle', 'rb') as f:
    clean_series_df = pickle.load(f)

# Categorizing Seasons

After loading the dataframes from my previous notebook, the first thing I want to do is reduce the number of dummied months by grouping them into seasons.

In [3]:
# Spring covers March through May; June belongs to summer only
week_day['spring'] = (week_day['month'] < 6) & (week_day['month'] > 2)

In [4]:
week_day['summer'] = (week_day['month'] < 9) & (week_day['month'] > 5)

In [5]:
week_day['fall'] = (week_day['month'] < 12) & (week_day['month'] > 8) 

In [6]:
week_day['winter'] = (week_day['month'] < 3) | (week_day['month'] == 12)
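
The season flags can also be derived from a single lookup, which makes it easy to check that every month lands in exactly one season. This is a minimal sketch on a made-up frame; `demo`, `season_by_month` and `season_flags` are illustrative names, and the month-to-season convention (northern hemisphere, June counted as summer) is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's week_day frame
demo = pd.DataFrame({'month': range(1, 13)})

# Assumed convention: three-month seasons, northern hemisphere
season_by_month = {12: 'winter', 1: 'winter', 2: 'winter',
                   3: 'spring', 4: 'spring', 5: 'spring',
                   6: 'summer', 7: 'summer', 8: 'summer',
                   9: 'fall', 10: 'fall', 11: 'fall'}

demo['season'] = demo['month'].map(season_by_month)

# One indicator column per season
season_flags = pd.get_dummies(demo['season']).astype(int)

# Every month should be flagged for exactly one season
print(season_flags.sum(axis=1).eq(1).all())
```

A lookup like this avoids overlapping boundary conditions, since each month can only map to one value.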

In [7]:
week_day_ = week_day.drop('released', axis=1)
week_day_dum1 = pd.get_dummies(week_day_['weekday'], prefix='day')
week_day_dum2 = pd.get_dummies(week_day_['month'], prefix='month')
week_day_dum3 = week_day_[['spring', 'summer', 'fall', 'winter']]

# Making an Actor Dataframe

I want to give the actors column some numeric weight so that I don't have to dummy out every actor. It seems reasonable that the starring actors have some effect on the success of a show, whether through popularity or quality of acting.

I will start by making a temporary dataframe to CountVectorize, then merge the result back into the main dataframe.

In [8]:
# Take an explicit copy so later column assignments do not act on a view of clean_series_df
nlp_df = clean_series_df[['writer', 'overview_x', 'number_of_episodes', 'number_of_seasons',
                          'overview_y', 'status_y', 'actors', 'awards', 'genre_y', 'imdb_rating',
                          'imdb_votes', 'plot', 'runtime_x', 'runtime_cat', 'network']].copy()

In [9]:
nlp_df[['runtime_x', 'awards']] = nlp_df[['runtime_x', 'awards']].astype(float)

I want to make sure I don't have any null values and verify the data type.

In [10]:
# Fill the text columns in one assignment rather than with chained in-place calls
text_cols = ['overview_x', 'overview_y', 'plot']
nlp_df[text_cols] = nlp_df[text_cols].fillna('N/A')

In [11]:
nlp_df['plot'] = nlp_df['plot'].astype(str)
nlp_df['overview_x'] = nlp_df['overview_x'].astype(str)
nlp_df['overview_y'] = nlp_df['overview_y'].astype(str)

In [12]:
# Join with a space so the last word of one field does not fuse with the first word of the next
nlp_df['bag_of_words'] = nlp_df[['overview_x', 'plot', 'overview_y']].apply(lambda x: ' '.join(x), axis=1)

In [13]:
nlp_df.drop(['overview_x', 'overview_y', 'plot'], axis=1, inplace=True)

In [14]:
nlp_df.columns

Index(['writer', 'number_of_episodes', 'number_of_seasons', 'status_y',
       'actors', 'awards', 'genre_y', 'imdb_rating', 'imdb_votes', 'runtime_x',
       'runtime_cat', 'network', 'bag_of_words'],
      dtype='object')

In [15]:
test_df = nlp_df[['actors']].copy()

In [16]:
test_df['imdb_votes'] = nlp_df['imdb_votes']
test_df['imdb_rating'] = nlp_df['imdb_rating']
test_df['number_of_episodes'] = nlp_df['number_of_episodes']
test_df['number_of_seasons'] = nlp_df['number_of_seasons']
test_df['awards'] = nlp_df['awards']

In [17]:
test_df['actors'] = test_df['actors'].astype(str)

Now that my data is cleaner, I can CountVectorize the actor names. I am using CountVectorizer over TF-IDF because this dataset/corpus is relatively small and I want every actor to be seen; if I were to TF-IDF these names, I may just receive a shorter list of the strongest features.

To avoid confusing actors who share a first or last name, I will join each actor's first and last name into a single token and strip the punctuation marks.

In [18]:
cv = CountVectorizer(stop_words=None, analyzer='word', 
                     ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# List of strings, one per show
name_list = test_df['actors'].tolist()

In [19]:
# Lowercase each entry and strip apostrophes, spaces, periods and hyphens
name_list_clean = [name.lower().replace("'", "").replace(' ', '').replace('.', '').replace('-', '')
                   for name in name_list]


# Count Vectorizing the actors

I have my list of cleaned names, so I can fit the CountVectorizer and transform to get my weights.

In [20]:
count_train = cv.fit(name_list_clean)
bag_of_words = cv.transform(name_list_clean)

In [21]:
match = cv.vocabulary_
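
It helps to look at what `vocabulary_` actually holds before using it. The sketch below fits a separate CountVectorizer on two made-up cast strings (cleaned the same way as name_list_clean); `demo_cv`, `docs`, `vocab` and `totals` are illustrative names, not part of the notebook. As far as I can tell, the dictionary maps each token to its column index in the count matrix, assigned in alphabetical order, while the occurrence counts live in the matrix itself.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up cast strings, already lowercased with spaces and punctuation removed
docs = ['peterdinklage,lenaheadey', 'briancox,peterdinklage']

demo_cv = CountVectorizer()
counts = demo_cv.fit_transform(docs)

# vocabulary_ maps token -> column index, not token -> frequency
vocab = demo_cv.vocabulary_
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Occurrence counts come from summing the columns of the count matrix
totals = np.asarray(counts.sum(axis=0)).ravel().tolist()
print(dict(zip(sorted(vocab, key=vocab.get), totals)))
```

Here the most frequent name ends up with the highest index only because it also sorts last alphabetically; in general the index says nothing about how often a name appears.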

I now have a dictionary, cv.vocabulary_, that maps each cleaned actor name to an integer (its column index in the count matrix). I will iterate through test_df, multiply the IMDb rating by this value, and then remove the actor names themselves.

The first step is to split the list of actors in the single column into their own columns.

In [22]:
actors_split = pd.concat([test_df['actors'].str.split(', ', expand=True)], axis=1)
test_df = pd.concat((test_df ,actors_split), axis=1)

In [23]:
test_df.head()

Unnamed: 0,actors,imdb_votes,imdb_rating,number_of_episodes,number_of_seasons,awards,0,1,2,3
Game of Thrones,"Peter Dinklage, Lena Headey, Emilia Clarke, Ki...",1361235.0,9.5,73.0,8,1.0,Peter Dinklage,Lena Headey,Emilia Clarke,Kit Harington
Westworld,"Evan Rachel Wood, Thandie Newton, Jeffrey Wrig...",307642.0,8.9,20.0,2,1.0,Evan Rachel Wood,Thandie Newton,Jeffrey Wright,James Marsden
Big Little Lies,"Reese Witherspoon, Nicole Kidman, Shailene Woo...",86060.0,8.6,7.0,2,1.0,Reese Witherspoon,Nicole Kidman,Shailene Woodley,Zoë Kravitz
The Deuce,"James Franco, Maggie Gyllenhaal, Dominique Fis...",14113.0,8.1,16.0,2,1.0,James Franco,Maggie Gyllenhaal,Dominique Fishback,Gary Carr
Succession,"Hiam Abbass, Nicholas Braun, Brian Cox, Kieran...",3927.0,7.6,10.0,1,0.0,Hiam Abbass,Nicholas Braun,Brian Cox,Kieran Culkin


In [24]:
test_df[1] = test_df[1].fillna("none")
test_df[2] = test_df[2].fillna("none")
test_df[3] = test_df[3].fillna("none")

After filling in the null spots, I can format the actor names the same way as name_list_clean, the list in which spaces and punctuation were removed so that each actor's occurrences are counted properly.

In [25]:
# Apply the name_list_clean cleanup to each split-out actor column in one vectorized pass:
# lowercase, then strip apostrophes, periods, spaces and hyphens
for i in range(4):
    test_df['actor_{}'.format(i + 1)] = (
        test_df[i].str.lower().str.replace(r"['. -]", '', regex=True)
    )

Once the actor names are replaced with their vectorized values, I can drop the original name columns and turn the "none" placeholders I imputed earlier into 0s.

In [27]:
test_df['actor_1'] = test_df['actor_1'].map(match).fillna(test_df['actor_1'])
test_df['actor_2'] = test_df['actor_2'].map(match).fillna(test_df['actor_2'])
test_df['actor_3'] = test_df['actor_3'].map(match).fillna(test_df['actor_3'])
test_df['actor_4'] = test_df['actor_4'].map(match).fillna(test_df['actor_4'])

# Names found in the vocabulary are replaced by their integer value;
# anything unmatched (such as the 'none' placeholder) is left as-is.

In [28]:
temp_df = test_df.drop(['actors', 0, 1, 2, 3], axis=1)

In [29]:
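def remove_nones(series):
    # Replace the 'none' placeholders with 0, modifying the passed Series in place
    for i in range(len(series)):
        if series.iloc[i] == 'none':
            series.iloc[i] = 0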

In [30]:
remove_nones(temp_df['actor_2'])
remove_nones(temp_df['actor_3'])
remove_nones(temp_df['actor_4'])
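
The placeholder-to-zero step can also be written without a loop. This is a minimal sketch on a small made-up frame (`demo`, `cols` and `clean` are illustrative names), not a drop-in replacement for the cells above.

```python
import pandas as pd

# Hypothetical frame mixing mapped integer values with leftover 'none' placeholders
demo = pd.DataFrame({'actor_2': [983, 'none', 2866],
                     'actor_3': [1697, 'none', 'none']},
                    index=['Show A', 'Show B', 'Show C'])

cols = ['actor_2', 'actor_3']

# Swap the placeholder for 0, then cast the object columns to integers
clean = demo[cols].replace('none', 0).astype(int)
print(clean)
```

Returning a new frame rather than mutating a column in place sidesteps the question of whether the change reaches the parent dataframe.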

In [31]:
temp_df.tail()

Unnamed: 0,imdb_votes,imdb_rating,number_of_episodes,number_of_seasons,awards,actor_1,actor_2,actor_3,actor_4
This Is Tom Jones,67.0,8.0,6.0,1,1.0,5080,4988,4868,4875
Trust Us with Your Life,110.0,7.2,8.0,1,0.0,2601,983,1697,5235
Turn-On,51.0,4.6,1.0,1,0.0,4959,569,1883,5024
Muppets Tonight,1369.0,7.9,22.0,2,1.0,1153,2918,2340,500
Ripley's Believe It or Not,1069.0,6.7,76.0,4,0.0,1252,2866,0,0


In [32]:
temp_df['actor_1'] = temp_df['actor_1'].astype(int)
temp_df['actor_2'] = temp_df['actor_2'].astype(int)
temp_df['actor_3'] = temp_df['actor_3'].astype(int)
temp_df['actor_4'] = temp_df['actor_4'].astype(int)
temp_df['imdb_rating'] = temp_df['imdb_rating'].astype(float)

In [33]:
temp_df['actor_1_weighted'] = temp_df['imdb_rating'] * temp_df['actor_1']
temp_df['actor_2_weighted'] = temp_df['imdb_rating'] * temp_df['actor_2']
temp_df['actor_3_weighted'] = temp_df['imdb_rating'] * temp_df['actor_3']
temp_df['actor_4_weighted'] = temp_df['imdb_rating'] * temp_df['actor_4']

In [34]:
temp_df.head()

Unnamed: 0,imdb_votes,imdb_rating,number_of_episodes,number_of_seasons,awards,actor_1,actor_2,actor_3,actor_4,actor_1_weighted,actor_2_weighted,actor_3_weighted,actor_4_weighted
Game of Thrones,1361235.0,9.5,73.0,8,1.0,4118,3111,1530,2978,39121.0,29554.5,14535.0,28291.0
Westworld,307642.0,8.9,20.0,2,1.0,1623,4984,2281,2121,14444.7,44357.6,20300.9,18876.9
Big Little Lies,86060.0,8.6,7.0,2,1.0,4271,3903,4683,5340,36730.6,33565.8,40273.8,45924.0
The Deuce,14113.0,8.1,16.0,2,1.0,2111,3300,1369,1724,17099.1,26730.0,11088.9,13964.4
Succession,3927.0,7.6,10.0,1,0.0,1953,3865,627,2957,14842.8,29374.0,4765.2,22473.2


To explore further, I want to see whether these weights work better as four individual features or as a single combined feature, so I'll add a column that sums all four weights.

In [35]:
temp_df['actors_cum_sum'] = temp_df['actor_1_weighted'] + temp_df['actor_2_weighted'] + temp_df['actor_3_weighted'] + temp_df['actor_4_weighted'] 

In [36]:
temp_df.head()

Unnamed: 0,imdb_votes,imdb_rating,number_of_episodes,number_of_seasons,awards,actor_1,actor_2,actor_3,actor_4,actor_1_weighted,actor_2_weighted,actor_3_weighted,actor_4_weighted,actors_cum_sum
Game of Thrones,1361235.0,9.5,73.0,8,1.0,4118,3111,1530,2978,39121.0,29554.5,14535.0,28291.0,111501.5
Westworld,307642.0,8.9,20.0,2,1.0,1623,4984,2281,2121,14444.7,44357.6,20300.9,18876.9,97980.1
Big Little Lies,86060.0,8.6,7.0,2,1.0,4271,3903,4683,5340,36730.6,33565.8,40273.8,45924.0,156494.2
The Deuce,14113.0,8.1,16.0,2,1.0,2111,3300,1369,1724,17099.1,26730.0,11088.9,13964.4,68882.4
Succession,3927.0,7.6,10.0,1,0.0,1953,3865,627,2957,14842.8,29374.0,4765.2,22473.2,71455.2


In [37]:
temp_df.isnull().sum()

imdb_votes            0
imdb_rating           0
number_of_episodes    1
number_of_seasons     0
awards                0
actor_1               0
actor_2               0
actor_3               0
actor_4               0
actor_1_weighted      0
actor_2_weighted      0
actor_3_weighted      0
actor_4_weighted      0
actors_cum_sum        0
dtype: int64

In [38]:
temp_df.dropna(inplace=True)

# Doing the same with Genre

I will repeat the same process used for the actors, this time with genre. You can skip ahead to the Writers section, as the steps in between are mostly the same.

In [39]:
nlp_df.columns

Index(['writer', 'number_of_episodes', 'number_of_seasons', 'status_y',
       'actors', 'awards', 'genre_y', 'imdb_rating', 'imdb_votes', 'runtime_x',
       'runtime_cat', 'network', 'bag_of_words'],
      dtype='object')

In [40]:
genre_df = nlp_df[['genre_y']].copy()

In [41]:
genre_df['imdb_rating'] = nlp_df['imdb_rating']

In [42]:
genre_df['genre_y'] = genre_df['genre_y'].astype(str)

In [43]:
cv = CountVectorizer(stop_words=None, analyzer='word', 
                     ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# List of strings, one per show
genre_list = genre_df['genre_y'].tolist()

In [44]:
# Lowercase each entry and strip apostrophes, spaces, periods and hyphens
genre_list_clean = [genre.lower().replace("'", "").replace(' ', '').replace('.', '').replace('-', '')
                    for genre in genre_list]


In [45]:
genre_count = cv.fit(genre_list_clean)
genre_bag_words = cv.transform(genre_list_clean)

In [46]:
genre_match = cv.vocabulary_

In [47]:
genre_split = pd.concat([genre_df['genre_y'].str.split(', ', expand=True)], axis=1)
genre_df = pd.concat((genre_df, genre_split), axis=1)

In [48]:
genre_df.head()

Unnamed: 0,genre_y,imdb_rating,0,1,2,3,4,5,6,7,8,9
Game of Thrones,"Action, Adventure, Drama, Fantasy, Romance",9.5,Action,Adventure,Drama,Fantasy,Romance,,,,,
Westworld,"Drama, Mystery, Sci-Fi",8.9,Drama,Mystery,Sci-Fi,,,,,,,
Big Little Lies,"Crime, Drama, Mystery",8.6,Crime,Drama,Mystery,,,,,,,
The Deuce,Drama,8.1,Drama,,,,,,,,,
Succession,Drama,7.6,Drama,,,,,,,,,


In [49]:
genre_df[1] = genre_df[1].fillna("none")
genre_df[2] = genre_df[2].fillna("none")
genre_df[3] = genre_df[3].fillna("none")
genre_df[4] = genre_df[4].fillna("none")
genre_df[5] = genre_df[5].fillna("none")

In [50]:
genre_df.head()

Unnamed: 0,genre_y,imdb_rating,0,1,2,3,4,5,6,7,8,9
Game of Thrones,"Action, Adventure, Drama, Fantasy, Romance",9.5,Action,Adventure,Drama,Fantasy,Romance,none,,,,
Westworld,"Drama, Mystery, Sci-Fi",8.9,Drama,Mystery,Sci-Fi,none,none,none,,,,
Big Little Lies,"Crime, Drama, Mystery",8.6,Crime,Drama,Mystery,none,none,none,,,,
The Deuce,Drama,8.1,Drama,none,none,none,none,none,,,,
Succession,Drama,7.6,Drama,none,none,none,none,none,,,,


In [51]:
# Apply the genre_list_clean cleanup to each of the first six split-out genre columns:
# lowercase, then strip apostrophes, periods, spaces and hyphens
for i in range(6):
    genre_df['genre_{}'.format(i + 1)] = (
        genre_df[i].str.lower().str.replace(r"['. -]", '', regex=True)
    )




In [53]:
genre_df['genre_1'] = genre_df['genre_1'].map(genre_match).fillna(genre_df['genre_1'])
genre_df['genre_2'] = genre_df['genre_2'].map(genre_match).fillna(genre_df['genre_2'])
genre_df['genre_3'] = genre_df['genre_3'].map(genre_match).fillna(genre_df['genre_3'])
genre_df['genre_4'] = genre_df['genre_4'].map(genre_match).fillna(genre_df['genre_4'])
genre_df['genre_5'] = genre_df['genre_5'].map(genre_match).fillna(genre_df['genre_5'])
genre_df['genre_6'] = genre_df['genre_6'].map(genre_match).fillna(genre_df['genre_6'])

In [54]:
genre_df = genre_df.drop(['genre_y', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], axis=1)

In [55]:
genre_df.head(3)

Unnamed: 0,imdb_rating,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6
Game of Thrones,9.5,0,1,7,9,18,none
Westworld,8.9,7,15,19,none,none,none
Big Little Lies,8.6,5,7,15,none,none,none


In [56]:
remove_nones(genre_df['genre_2'])
remove_nones(genre_df['genre_3'])
remove_nones(genre_df['genre_4'])
remove_nones(genre_df['genre_5'])
remove_nones(genre_df['genre_6'])

In [57]:
genre_df['genre_1'] = genre_df['genre_1'].astype(int)
genre_df['genre_2'] = genre_df['genre_2'].astype(int)
genre_df['genre_3'] = genre_df['genre_3'].astype(int)
genre_df['genre_4'] = genre_df['genre_4'].astype(int)
genre_df['genre_5'] = genre_df['genre_5'].astype(int)
genre_df['genre_6'] = genre_df['genre_6'].astype(int)
genre_df['imdb_rating'] = genre_df['imdb_rating'].astype(float)

In [58]:
genre_df['genre_1_weighted'] = genre_df['imdb_rating'] * genre_df['genre_1']
genre_df['genre_2_weighted'] = genre_df['imdb_rating'] * genre_df['genre_2']
genre_df['genre_3_weighted'] = genre_df['imdb_rating'] * genre_df['genre_3']
genre_df['genre_4_weighted'] = genre_df['imdb_rating'] * genre_df['genre_4']
genre_df['genre_5_weighted'] = genre_df['imdb_rating'] * genre_df['genre_5']
genre_df['genre_6_weighted'] = genre_df['imdb_rating'] * genre_df['genre_6']

genre_df.head()

Unnamed: 0,imdb_rating,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_1_weighted,genre_2_weighted,genre_3_weighted,genre_4_weighted,genre_5_weighted,genre_6_weighted
Game of Thrones,9.5,0,1,7,9,18,0,0.0,9.5,66.5,85.5,171.0,0.0
Westworld,8.9,7,15,19,0,0,0,62.3,133.5,169.1,0.0,0.0,0.0
Big Little Lies,8.6,5,7,15,0,0,0,43.0,60.2,129.0,0.0,0.0,0.0
The Deuce,8.1,7,0,0,0,0,0,56.7,0.0,0.0,0.0,0.0,0.0
Succession,7.6,7,0,0,0,0,0,53.2,0.0,0.0,0.0,0.0,0.0


In [59]:
genre_df_weighted = genre_df.drop(['genre_1', 'genre_2', 'genre_3', 'genre_4', 'genre_5', 'genre_6'], axis=1)

In [60]:
genre_df_weighted.head()

Unnamed: 0,imdb_rating,genre_1_weighted,genre_2_weighted,genre_3_weighted,genre_4_weighted,genre_5_weighted,genre_6_weighted
Game of Thrones,9.5,0.0,9.5,66.5,85.5,171.0,0.0
Westworld,8.9,62.3,133.5,169.1,0.0,0.0,0.0
Big Little Lies,8.6,43.0,60.2,129.0,0.0,0.0,0.0
The Deuce,8.1,56.7,0.0,0.0,0.0,0.0,0.0
Succession,7.6,53.2,0.0,0.0,0.0,0.0,0.0


# Writers

Lastly, I want to repeat the process once more, this time with writers.

Once again, skip ahead to the Testing out the weights section to avoid seeing the same cleanup.

In [61]:
nlp_df.columns

Index(['writer', 'number_of_episodes', 'number_of_seasons', 'status_y',
       'actors', 'awards', 'genre_y', 'imdb_rating', 'imdb_votes', 'runtime_x',
       'runtime_cat', 'network', 'bag_of_words'],
      dtype='object')

In [62]:
writer_df = nlp_df[['writer']].copy()

In [63]:
writer_df['imdb_rating'] = nlp_df['imdb_rating']

In [64]:
writer_df_clean = writer_df[writer_df['writer'] != 0]
writer_df_clean = writer_df_clean[writer_df_clean['writer'] != '0']

In [65]:
writer_df_clean.head()

Unnamed: 0,writer,imdb_rating
Game of Thrones,"David Benioff, D.B. Weiss",9.5
Big Little Lies,David E. Kelley,8.6
The Deuce,"George Pelecanos, David Simon",8.1
Succession,Jesse Armstrong,7.6
Curb Your Enthusiasm,Larry David,8.7


In [66]:
writer_df_clean.shape

(1236, 2)

In [67]:
cv = CountVectorizer(stop_words=None, analyzer='word', 
                     ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# List of strings, one per show
writer_list = writer_df_clean['writer'].tolist()

In [68]:
# Lowercase each entry and strip apostrophes, spaces, periods and hyphens
writer_list_clean = [writer.lower().replace("'", "").replace(' ', '').replace('.', '').replace('-', '')
                     for writer in writer_list]


In [69]:
writer_count = cv.fit(writer_list_clean)
writer_bag_words = cv.transform(writer_list_clean)

In [70]:
writer_match = cv.vocabulary_

In [71]:
writer_split = pd.concat([writer_df_clean['writer'].str.split(', ', expand=True)], axis=1)
writer_df_clean = pd.concat((writer_df_clean, writer_split), axis=1)

In [72]:
writer_df_clean.head()

Unnamed: 0,writer,imdb_rating,0,1,2,3,4,5,6,7,...,95,96,97,98,99,100,101,102,103,104
Game of Thrones,"David Benioff, D.B. Weiss",9.5,David Benioff,D.B. Weiss,,,,,,,...,,,,,,,,,,
Big Little Lies,David E. Kelley,8.6,David E. Kelley,,,,,,,,...,,,,,,,,,,
The Deuce,"George Pelecanos, David Simon",8.1,George Pelecanos,David Simon,,,,,,,...,,,,,,,,,,
Succession,Jesse Armstrong,7.6,Jesse Armstrong,,,,,,,,...,,,,,,,,,,
Curb Your Enthusiasm,Larry David,8.7,Larry David,,,,,,,,...,,,,,,,,,,


In [73]:
writer_df_clean.isnull().sum()

writer            0
imdb_rating       0
0                 0
1               605
2              1022
3              1171
4              1217
5              1225
6              1231
7              1231
8              1231
9              1232
10             1232
11             1232
12             1233
13             1234
14             1234
15             1234
16             1234
17             1235
18             1235
19             1235
20             1235
21             1235
22             1235
23             1235
24             1235
25             1235
26             1235
27             1235
               ... 
75             1235
76             1235
77             1235
78             1235
79             1235
80             1235
81             1235
82             1235
83             1235
84             1235
85             1235
86             1235
87             1235
88             1235
89             1235
90             1235
91             1235
92             1235
93             1235


In [74]:
# Keep only the first six writer columns (0-5); columns 6 through 104 are almost entirely null
writer_df_reduced = writer_df_clean.drop(list(range(6, 105)), axis=1)

In [75]:
writer_df_reduced.head()

Unnamed: 0,writer,imdb_rating,0,1,2,3,4,5
Game of Thrones,"David Benioff, D.B. Weiss",9.5,David Benioff,D.B. Weiss,,,,
Big Little Lies,David E. Kelley,8.6,David E. Kelley,,,,,
The Deuce,"George Pelecanos, David Simon",8.1,George Pelecanos,David Simon,,,,
Succession,Jesse Armstrong,7.6,Jesse Armstrong,,,,,
Curb Your Enthusiasm,Larry David,8.7,Larry David,,,,,


In [76]:
writer_df_reduced[1] = writer_df_reduced[1].fillna("none")
writer_df_reduced[2] = writer_df_reduced[2].fillna("none")
writer_df_reduced[3] = writer_df_reduced[3].fillna("none")
writer_df_reduced[4] = writer_df_reduced[4].fillna("none")
writer_df_reduced[5] = writer_df_reduced[5].fillna("none")

In [77]:
# Apply the writer_list_clean cleanup to each of the six remaining writer columns:
# lowercase, then strip apostrophes, periods, spaces and hyphens
for i in range(6):
    writer_df_reduced['writer_{}'.format(i + 1)] = (
        writer_df_reduced[i].str.lower().str.replace(r"['. -]", '', regex=True)
    )

In [79]:
writer_df_reduced.dtypes

writer         object
imdb_rating    object
0              object
1              object
2              object
3              object
4              object
5              object
writer_1       object
writer_2       object
writer_3       object
writer_4       object
writer_5       object
writer_6       object
dtype: object

In [80]:
writer_df_reduced['writer_1'] = writer_df_reduced['writer_1'].map(writer_match).fillna(writer_df_reduced['writer_1'])
writer_df_reduced['writer_2'] = writer_df_reduced['writer_2'].map(writer_match).fillna(writer_df_reduced['writer_2'])
writer_df_reduced['writer_3'] = writer_df_reduced['writer_3'].map(writer_match).fillna(writer_df_reduced['writer_3'])
writer_df_reduced['writer_4'] = writer_df_reduced['writer_4'].map(writer_match).fillna(writer_df_reduced['writer_4'])
writer_df_reduced['writer_5'] = writer_df_reduced['writer_5'].map(writer_match).fillna(writer_df_reduced['writer_5'])
writer_df_reduced['writer_6'] = writer_df_reduced['writer_6'].map(writer_match).fillna(writer_df_reduced['writer_6'])

In [81]:
writer_df_reduced = writer_df_reduced.drop(['writer', 0, 1, 2, 3, 4, 5], axis=1)

In [82]:
writer_df_reduced.head(3)

Unnamed: 0,imdb_rating,writer_1,writer_2,writer_3,writer_4,writer_5,writer_6
Game of Thrones,9.5,406,465,none,none,none,none
Big Little Lies,8.6,416,none,none,none,none,none
The Deuce,8.1,637,448,none,none,none,none


In [83]:
remove_nones(writer_df_reduced['writer_2'])
remove_nones(writer_df_reduced['writer_3'])
remove_nones(writer_df_reduced['writer_4'])
remove_nones(writer_df_reduced['writer_5'])
remove_nones(writer_df_reduced['writer_6'])

In [84]:
writer_df_reduced['writer_1'] = writer_df_reduced['writer_1'].astype(int)
writer_df_reduced['writer_2'] = writer_df_reduced['writer_2'].astype(int)
writer_df_reduced['writer_3'] = writer_df_reduced['writer_3'].astype(int)
writer_df_reduced['writer_4'] = writer_df_reduced['writer_4'].astype(int)
writer_df_reduced['writer_5'] = writer_df_reduced['writer_5'].astype(int)
writer_df_reduced['writer_6'] = writer_df_reduced['writer_6'].astype(int)
writer_df_reduced['imdb_rating'] = writer_df_reduced['imdb_rating'].astype(float)

In [85]:
writer_df_reduced['writer_1_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_1']
writer_df_reduced['writer_2_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_2']
writer_df_reduced['writer_3_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_3']
writer_df_reduced['writer_4_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_4']
writer_df_reduced['writer_5_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_5']
writer_df_reduced['writer_6_weighted'] = writer_df_reduced['imdb_rating'] * writer_df_reduced['writer_6']

writer_df_reduced.head()

Unnamed: 0,imdb_rating,writer_1,writer_2,writer_3,writer_4,writer_5,writer_6,writer_1_weighted,writer_2_weighted,writer_3_weighted,writer_4_weighted,writer_5_weighted,writer_6_weighted
Game of Thrones,9.5,406,465,0,0,0,0,3857.0,4417.5,0.0,0.0,0.0,0.0
Big Little Lies,8.6,416,0,0,0,0,0,3577.6,0.0,0.0,0.0,0.0,0.0
The Deuce,8.1,637,448,0,0,0,0,5159.7,3628.8,0.0,0.0,0.0,0.0
Succession,7.6,829,0,0,0,0,0,6300.4,0.0,0.0,0.0,0.0,0.0
Curb Your Enthusiasm,8.7,1039,0,0,0,0,0,9039.3,0.0,0.0,0.0,0.0,0.0


In [86]:
writer_df_weighted = writer_df_reduced.drop(['writer_1', 'writer_2', 'writer_3', 'writer_4', 'writer_5', 'writer_6'], axis=1)

# Testing out the weights

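To determine whether the engineered features are helping or hurting my model, I will see how well each set of features predicts my target value on its own. Keep in mind that every weighted feature is a product involving imdb_rating, which is also the target here, so some of the target leaks into these scores.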

## Simple model for actors

In [87]:
temp_df.columns

Index(['imdb_votes', 'imdb_rating', 'number_of_episodes', 'number_of_seasons',
       'awards', 'actor_1', 'actor_2', 'actor_3', 'actor_4',
       'actor_1_weighted', 'actor_2_weighted', 'actor_3_weighted',
       'actor_4_weighted', 'actors_cum_sum'],
      dtype='object')

In [88]:
temp_df = temp_df.dropna()

In [89]:
X = temp_df.drop(['imdb_rating', 'imdb_votes', 'number_of_episodes', 'number_of_seasons', 'awards', 'actor_1', 'actor_2', 'actor_3', 'actor_4'], axis=1)
X_1 = temp_df[['actors_cum_sum']]
y = temp_df['imdb_rating']

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [91]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [92]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.17473065409407174, 0.08932844980377641)
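
The split, scale, fit and score steps above recur for every feature set in this notebook, so they could be wrapped in one helper. This is a minimal sketch run on synthetic data; `score_linear`, `X_demo` and `y_demo` are names I made up for illustration, and the pipeline is one possible way to keep the scaler fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def score_linear(X, y, random_state=42):
    """Return (train R^2, test R^2) for a scaled linear regression."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    # The pipeline fits the scaler on the training split and reuses it for the test split
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = 2 * X_demo['a'] - X_demo['b'] + rng.normal(scale=0.5, size=200)

train_r2, test_r2 = score_linear(X_demo, y_demo)
print(train_r2, test_r2)
```

With a helper like this, comparing feature sets comes down to calling it once per candidate X.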

#### Modeling with only dummied variables

In [93]:
test_df.dropna(inplace=True)
test_df.shape

(1889, 14)

In [94]:
s = test_df['actors'].str.split(', ')

In [95]:
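# Series.sum(level=0) was removed in pandas 2.0; grouping on the index level is equivalent
dum_actors = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()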
dum_actors.head()

Unnamed: 0,'Big' LeRoy Mobley,'Weird Al' Yankovic,0,A.D. Miles,A.J. Cook,A.J. Johnson,A.J. Langer,AJ Gibson,Aaron Ashmore,Aaron Paul,...,Zalman King,Zeeko Zaki,Zevi Wolmark,Zienia Merton,Zoe Perry,Zoie Palmer,Zooey Deschanel,Zora Andrich,Zoë Kravitz,Zuleikha Robinson
Game of Thrones,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Westworld,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Big Little Lies,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
The Deuce,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Succession,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
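
For a comma-separated column like this one, pandas also offers `Series.str.get_dummies`, which builds the same kind of indicator matrix in one call. A minimal sketch on made-up shows (the `actors` and `dummies` names here are illustrative and separate from the notebook's variables):

```python
import pandas as pd

# Made-up shows, with casts in the same comma-separated format as the actors column
actors = pd.Series({'Show A': 'Brian Cox, Kieran Culkin',
                    'Show B': 'Brian Cox',
                    'Show C': 'Nicole Kidman, Kieran Culkin'})

# One indicator column per distinct name; rows stay aligned with the show index
dummies = actors.str.get_dummies(sep=', ')
print(dummies)
```

Either way, the result has one column per distinct actor, which is why the matrix gets so wide on the full dataset.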


In [96]:
X = dum_actors
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [97]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [98]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.9658608385114963, -2.332319466543675e+26)

The weighted actor features perform much better than the dummied actors alone. The dummied-actors model grossly overfits: it explains roughly 97% of the variance on the training data, yet its test R² is an enormous negative number (about -2.3e26). It goes without saying that I will be using the weighted features. With one column per unique actor and only 1,889 rows, the dummied actors give the model far more columns than it can estimate reliably, so they are not informative.
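One common way to reduce this kind of overfitting is to keep dummy columns only for actors who are credited in several shows. The block below is a minimal sketch of that idea, not something this notebook does: the toy series stands in for `test_df['actors']`, the show titles and actor names are made up, and the `min_shows` cutoff is a hypothetical parameter. It also uses `Series.str.get_dummies` as a shorter alternative to the split/stack approach above.

```python
import pandas as pd

# Toy stand-in for test_df['actors']; titles and names are invented.
actors = pd.Series(
    {
        "Show A": "Ann Lee, Bob Ray",
        "Show B": "Ann Lee, Cy Dunn",
        "Show C": "Bob Ray, Ann Lee",
        "Show D": "Dee Fox",
    }
)

# One 0/1 column per actor, one row per show.
dum = actors.str.get_dummies(sep=", ")

# Keep only actors credited in at least `min_shows` shows (hypothetical cutoff).
min_shows = 2
keep = dum.columns[dum.sum(axis=0) >= min_shows]
dum_frequent = dum[keep]

print(dum.shape, "->", dum_frequent.shape)
```

Whether any cutoff leaves enough signal in the real data is an open question; the weighted features remain the simpler choice.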

## Cumulative weight versus categorized weights by actor

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y, random_state=42)

In [100]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [101]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.1669388024549112, 0.08424217817979174)

Comparing this model with the weighted actors (split columns) model, I can see that performance is very similar: the test R² is about 0.084 here versus about 0.089 for the split columns. It makes little difference which one I choose, but I plan to go with the split columns/weights.
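Every comparison in this notebook repeats the same split, scale, fit, and score cells. A small helper can wrap those steps so each feature set is scored in one call. This is a sketch only: `score_features` is a hypothetical name, and the synthetic `X_demo` and `y_demo` stand in for frames such as `X_1` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def score_features(X, y, random_state=42):
    """Return (train R^2, test R^2) for a scaled linear regression."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state
    )
    # The pipeline fits the scaler on the training rows only.
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in data: one informative column plus noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 2.0 * X_demo["a"] + rng.normal(scale=0.5, size=200)

train_r2, test_r2 = score_features(X_demo, y_demo)
print(round(train_r2, 3), round(test_r2, 3))
```

With the real data, a call such as `score_features(X_1, y)` should reproduce the manual cells, assuming the same `random_state`.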

# Simple model for genres

In [102]:
X = genre_df_weighted.drop('imdb_rating', axis=1)
y = genre_df_weighted['imdb_rating']

In [103]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [104]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [105]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.054594296961762545, 0.0677784685078392)

#### Modeling with only dummied variables

In [106]:
model_df.columns

Index(['number_of_episodes', 'number_of_seasons', 'awards', 'imdb_rating',
       'imdb_votes', 'timeslot_00:00', 'timeslot_afternoon',
       'timeslot_evening', 'timeslot_latenight', 'timeslot_morning',
       'timeslot_unknown', 'day_0', 'day_1', 'day_2', 'day_3', 'day_4',
       'day_5', 'day_6', 'month_1', 'month_2', 'month_3', 'month_4', 'month_5',
       'month_6', 'month_7', 'month_8', 'month_9', 'month_10', 'month_11',
       'month_12', ' Action', ' Adventure', ' Animation', ' Comedy', ' Crime',
       ' Drama', ' Family', ' Fantasy', ' Game-Show', ' History', ' Horror',
       ' Music', ' Musical', ' Mystery', ' News', ' Reality-TV', ' Romance',
       ' Sci-Fi', ' Short', ' Sport', ' Talk-Show', ' Thriller', ' War',
       ' Western', 'Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Game-Show',
       'History', 'Horror', 'Music', 'Mystery', 'News', 'Reality-TV',
       'Romance', 'Sci-Fi', 'Sport', 'Ta

In [107]:
model_df.dropna(inplace=True)

In [108]:
X = model_df[[' Action', ' Adventure', ' Animation', ' Comedy', ' Crime',
       ' Drama', ' Family', ' Fantasy', ' Game-Show', ' History', ' Horror',
       ' Music', ' Musical', ' Mystery', ' News', ' Reality-TV', ' Romance',
       ' Sci-Fi', ' Short', ' Sport', ' Talk-Show', ' Thriller', ' War',
       ' Western', 'Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Game-Show',
       'History', 'Horror', 'Music', 'Mystery', 'News', 'Reality-TV',
       'Romance', 'Sci-Fi', 'Sport', 'Talk-Show', 'Western']]
y = model_df['imdb_rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [109]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [110]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.19677923147368, 0.09611441877593818)

In contrast to the models for the actors data, when genre is the only predictive feature the dummied genres score slightly better than the weighted categories (test R² of about 0.096 versus about 0.068). I will be retaining the dummied features for my final model.
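The `model_df.columns` listing above shows most genres twice, once with a leading space (`' Action'`) and once without (`'Action'`). My assumption is that the two columns represent the same genre and differ only because of how the genre string was split; the notebook does not confirm this. Under that assumption, a minimal sketch for combining each pair into a single column looks like the following. The toy frame and its values are invented.

```python
import pandas as pd

# Toy stand-in for the genre dummies in model_df; values are made up.
genres = pd.DataFrame(
    {
        " Drama": [1, 0, 0],
        "Drama": [0, 1, 0],
        " Comedy": [0, 0, 1],
        "Action": [1, 0, 0],
    },
    index=["Show A", "Show B", "Show C"],
)

# Strip whitespace from the column names so the pairs share one label.
stripped = genres.rename(columns=str.strip)

# Combine same-named columns: a show has the genre if either column was set.
merged = stripped.T.groupby(level=0).max().T

print(merged)
```

Combining the pairs keeps the information in the leading-space columns, whereas dropping them discards it. Which choice scores better on the real data would need to be checked.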

# Simple model for writers

In [111]:
X = writer_df_weighted.drop('imdb_rating', axis=1)
y = writer_df_weighted['imdb_rating']

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [113]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [114]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.056460690914488465, 0.06271659946369246)

#### Modeling with only dummied variables

In [115]:
s = writer_df_clean['writer'].str.split(', ')

In [116]:
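# One-hot encode every writer, then collapse back to one row per show.
# groupby(level=0).sum() replaces .sum(level=0), which was removed in pandas 2.0.
dum_writers = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()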
dum_writers.head()

Unnamed: 0,'Weird Al' Yankovic,A.I. Bezzerides,AJ Carothers,Aaron Chrenen,Aaron McGruder,Aaron Sorkin,Aaron Spelling,Aaron Zelman,Abbi Jacobson,Abby Kohn,...,Yoshiyuki Tomino,Yvette Lee Bowser,Zach Galifianakis,Zach Kanin,Zach Stafford,Zachary Crowe,Zack Bornstein,Zack Estrin,Zak Penn,Àlex Pastor
Game of Thrones,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Big Little Lies,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Deuce,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Succession,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Curb Your Enthusiasm,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [117]:
X = dum_writers
y = writer_df_weighted['imdb_rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [118]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [119]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)
lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.8992503541795358, -2.6872979896613328e+25)

Were I to proceed with using the writers as a feature (and thereby drop around 600 observations that are missing this information), I would want to use the weighted writers rather than dummying them. As with the actors simple regression model, the large number of unique writers (columns) relative to the small number of observations (rows) causes the model to grossly overfit.

Based on this rudimentary scoring, however, I believe that it will be better to retain the observations with missing writer data and to drop writers as a feature instead.
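The scores above each come from a single train/test split, so they can shift with the choice of `random_state`. Cross-validation gives a steadier estimate before committing to a decision like this one. The block below is a sketch on synthetic data, not a result from this project: `X_demo` and `y_demo` are stand-ins, and five folds is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one informative column plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=150)

# Scaling inside the pipeline is refit on each training fold.
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

print("mean R^2:", scores.mean(), "std:", scores.std())
```

The spread across folds is a rough guide to how far a single-split score can be trusted.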

# Building the final model

In [120]:
good_model = model_df.merge(temp_df, left_index=True, right_index=True)

In [129]:
good_model.drop([' Action',
 ' Adventure',
 ' Animation',
 ' Comedy',
 ' Crime',
 ' Drama',
 ' Family',
 ' Fantasy',
 ' Game-Show',
 ' History',
 ' Horror',
 ' Music',
 ' Musical',
 ' Mystery',
 ' News',
 ' Reality-TV',
 ' Romance',
 ' Sci-Fi',
 ' Short',
 ' Sport',
 ' Talk-Show',
 ' Thriller',
 ' War',
 ' Western'], axis=1, inplace=True)

In [130]:
good_model.shape

(1889, 80)

In [131]:
with open('../0_Assets_&_Data/final_model.pickle', 'wb') as f:
    pickle.dump(good_model, f)

I have my final features selected, so I can proceed to GridSearch the model in my next notebook.
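The column listing below carries `_x` and `_y` suffixes (for example `imdb_votes_x` and `imdb_votes_y`). pandas adds these when both sides of a merge share a column name, so the merged frame appears to hold some columns twice. A minimal sketch of one way to avoid that is to merge in only the columns the left frame does not already have. The two toy frames stand in for `model_df` and `temp_df`, and their values are invented.

```python
import pandas as pd

# Toy stand-ins for model_df (left) and temp_df (right).
left = pd.DataFrame(
    {"imdb_rating": [8.1, 7.4], "day_0": [1, 0]},
    index=["Show A", "Show B"],
)
right = pd.DataFrame(
    {"imdb_rating": [8.1, 7.4], "actor_1_weighted": [0.3, 0.1]},
    index=["Show A", "Show B"],
)

# Columns that exist only in the right frame.
new_cols = right.columns.difference(left.columns)

# Merging only the new columns means no name collides, so no suffixes appear.
merged = left.merge(right[new_cols], left_index=True, right_index=True)

print(list(merged.columns))
```

If a later notebook already expects the suffixed names, changing the merge here would also mean updating that notebook.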

In [132]:
list(good_model.columns)

['number_of_episodes_x',
 'number_of_seasons_x',
 'awards_x',
 'imdb_rating_x',
 'imdb_votes_x',
 'timeslot_00:00',
 'timeslot_afternoon',
 'timeslot_evening',
 'timeslot_latenight',
 'timeslot_morning',
 'timeslot_unknown',
 'day_0',
 'day_1',
 'day_2',
 'day_3',
 'day_4',
 'day_5',
 'day_6',
 'month_1',
 'month_2',
 'month_3',
 'month_4',
 'month_5',
 'month_6',
 'month_7',
 'month_8',
 'month_9',
 'month_10',
 'month_11',
 'month_12',
 'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Talk-Show',
 'Western',
 'status_Canceled',
 'status_Ended',
 'status_In Production',
 'status_Returning Series',
 'type_Documentary',
 'type_Miniseries',
 'type_News',
 'type_Reality',
 'type_Scripted',
 'type_Talk Show',
 'type_Video',
 'runtime_full',
 'runtime_half',
 'runtime_special',
 'imdb_votes_y',
 'imdb_r