# Feature Engineering & EDA

With the data centralized, I can take a closer look at how my features interact with my target variable, IMDB rating. This is a regression problem, so I will try Linear Regression (including Ridge and Lasso), Random Forest Regression, and Gradient Boosting as models, and I will prepare my data accordingly. I plan to categorize and dummy the non-numerical features, and also to categorize numerical features whose magnitude should not carry more or less weight (e.g. times or dates).

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

import datetime, time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import json

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
with open('./Assets_&_Data/series_df_full.pickle', 'rb') as handle:
    series_df = pickle.load(handle)

In [3]:
series_df.columns

Index(['airsDayOfWeek', 'airsTime', 'network', 'overview_x', 'rating',
       'runtime_x', 'number_of_episodes', 'number_of_seasons', 'overview_y',
       'status_y', 'type_x', 'actors', 'awards', 'genre_y', 'imdb_id',
       'imdb_rating', 'imdb_votes', 'plot', 'released', 'writer'],
      dtype='object')

In [4]:
series_df.shape

(1915, 20)

In [5]:
series_df.tail(3)

Unnamed: 0,airsDayOfWeek,airsTime,network,overview_x,rating,runtime_x,number_of_episodes,number_of_seasons,overview_y,status_y,type_x,actors,awards,genre_y,imdb_id,imdb_rating,imdb_votes,plot,released,writer
Turn-On,,,ABC (US),Turn-on was a fast paced comedy that featured ...,,30,1.0,1,Turn-On is an American sketch comedy series th...,Ended,Scripted,"Teresa Graves, Bonnie Boland, Hamilton Camp, T...",0,Comedy,tt0063960,4.6,51,"A multimedia presentation satirizing sex, poli...",05 Feb 1969,0
Muppets Tonight,,,ABC (US),,,30,22.0,2,Muppets Tonight is a live-action/puppet televi...,Ended,Scripted,"Dave Goelz, Kevin Clash, Jerry Nelson, Bill Ba...",1,"Comedy, Family, Music",tt0115279,7.9,1369,Kermit the Frog is still running a variety sho...,08 Mar 1996,0
Ripley's Believe It or Not,Wednesday,9:00 PM,TBS Superstation,A revival of the series based on the newspaper...,,60,76.0,4,,Ended,Scripted,"Dean Cain, Kelly Packard",0,Reality-TV,tt0218787,6.7,1069,Ripley's Believe It or Not! is a curious forma...,12 Jan 2000,0


In [6]:
len(series_df[series_df['airsTime'] == ''])

753

In [7]:
series_df['type_x'].value_counts()

Scripted       1728
Reality          87
Miniseries       48
Talk Show        30
Documentary      11
News             10
Video             1
Name: type_x, dtype: int64

In [8]:
series_df.dtypes

airsDayOfWeek          object
airsTime               object
network                object
overview_x             object
rating                 object
runtime_x              object
number_of_episodes    float64
number_of_seasons       int64
overview_y             object
status_y               object
type_x                 object
actors                 object
awards                 object
genre_y                object
imdb_id                object
imdb_rating            object
imdb_votes             object
plot                   object
released               object
writer                 object
dtype: object

Taking an initial look at the data that I've kept, it appears that I'll have to address several features:
- airsDayOfWeek will need to be dummied, and I'll have to fill in missing days using the "released" feature as a datetime type
- airsTime will have to be cleaned and standardized into categories, then either dummied or grouped
- network will have to be dummied
- the three text columns (overview_x, overview_y, plot) were left in for the purpose of NLP/topic modeling, not necessarily to be used directly in the model itself
- rating has a lot of missing values that I'll need to decide how to address (from the previous notebook)
- runtime has a lot of missing values that I'll need to decide how to address (from the previous notebook)
- status will have to be dummied
- type will have to be dummied
- awards, writer and genre will need to be given some sort of numerical weight in order to be evaluated by the model
- imdb_votes and imdb_rating have to be floats

# Categorizing the "airsTime" column 

Because so many values are missing or blank (753 of 1,915 rows), I will group the air times into categories, with a separate category for the unknowns, to see how the model treats them. The categories will be:
- morning
- afternoon
- evening
- latenight 
- unknown
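As a point of comparison for the cells below, here is a minimal sketch of the same bucketing done by parsing the hour out of each string instead of matching digit substrings. The function name `categorize_airtime` and the regular expression are my own assumptions, and the cutoffs simply mirror the 24-hour rules used in this notebook (5 to 11 morning, 12 to 17 afternoon, 18 to 21 evening, everything else latenight).

```python
import re

# Hypothetical helper (not part of the original notebook): parse the leading
# hour from strings such as "9:00 PM", "8:30am" or "21:00" and bucket it.
TIME_PATTERN = re.compile(r"^(\d{1,2})(?::(\d{2}))?\s*([AaPp][Mm])?")

def categorize_airtime(raw):
    """Map a raw airsTime string to one of the five time-slot labels."""
    text = (raw or "").strip()
    match = TIME_PATTERN.match(text)
    if not match:
        return "unknown"          # blank or unparseable values

    hour = int(match.group(1))
    meridiem = (match.group(3) or "").lower()
    if meridiem == "pm" and hour != 12:
        hour += 12                # 1 PM .. 11 PM -> 13 .. 23
    elif meridiem == "am" and hour == 12:
        hour = 0                  # 12 AM is midnight
    if hour > 23:
        return "unknown"

    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "latenight"
```

Applied with something like `series_df['airsTime'].map(categorize_airtime)`, this would treat "10:00 PM" as hour 22 rather than as a string containing "1", and a value such as "00:00" would fall into latenight instead of surviving as its own category.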

In [9]:
clean_time = series_df[['airsTime']]

In [10]:
pm_slot = clean_time[clean_time['airsTime'].str.contains('PM')]

pm_slot_ = clean_time[clean_time['airsTime'].str.contains('pm')]

am_slot = clean_time[clean_time['airsTime'].str.contains('AM')]

am_slot_ = clean_time[clean_time['airsTime'].str.contains('am')]

In [11]:
missing_time = clean_time[clean_time['airsTime'] == '']

In [12]:
remaining_time = clean_time[clean_time['airsTime'] != '']

remaining_time = remaining_time[~remaining_time['airsTime'].str.contains('PM')]

remaining_time = remaining_time[~remaining_time['airsTime'].str.contains('pm')]

remaining_time = remaining_time[~remaining_time['airsTime'].str.contains('AM')]

remaining_time = remaining_time[~remaining_time['airsTime'].str.contains('am')]

In [13]:
for i in reversed(range(1, 25)):
    if i >= 22:
        remaining_time[remaining_time['airsTime'].str.contains(str(i))] = 'latenight'
    elif i >= 18:
        remaining_time[remaining_time['airsTime'].str.contains(str(i))] = 'evening'
    elif i >= 12:
        remaining_time[remaining_time['airsTime'].str.contains(str(i))] = 'afternoon'
    elif i >=5:
        remaining_time[remaining_time['airsTime'].str.contains(str(i))] = 'morning'
    else: 
        remaining_time[remaining_time['airsTime'].str.contains(str(i))] = 'latenight'

In [14]:
remaining_time

Unnamed: 0,airsTime
Veep,latenight
Crashing,latenight
Animals.,latenight
Wyatt Cenac's Problem Areas,latenight
HBO,morning
Tales from the Crypt,latenight
Crashbox,afternoon
Any Given Wednesday with Bill Simmons,latenight
Vinyl,evening
Elizabeth I,evening


In [15]:
# Categorizing PM slots
for i in range(1, 13):
    if i < 6:
        pm_slot[pm_slot['airsTime'].str.contains(str(i))] = 'afternoon'
        pm_slot_[pm_slot_['airsTime'].str.contains(str(i))] = 'afternoon'
    elif 6 <= i < 10:
        pm_slot[pm_slot['airsTime'].str.contains(str(i))] = 'evening'
        pm_slot_[pm_slot_['airsTime'].str.contains(str(i))] = 'evening'
    else: 
        pm_slot[pm_slot['airsTime'].str.contains(str(i))] = 'latenight'
        pm_slot_[pm_slot_['airsTime'].str.contains(str(i))] = 'latenight'

#pm_slot[pm_slot['airsTime'].str.contains('8', '9')]

In [16]:
# Categorizing AM slots
for i in range(1, 13):
    if i < 5:
        am_slot[am_slot['airsTime'].str.contains(str(i))] = 'latenight'
        am_slot_[am_slot_['airsTime'].str.contains(str(i))] = 'latenight'
    else: 
        am_slot[am_slot['airsTime'].str.contains(str(i))] = 'morning'
        am_slot_[am_slot_['airsTime'].str.contains(str(i))] = 'morning'

#pm_slot[pm_slot['airsTime'].str.contains('8', '9')]

In [17]:
missing_time['airsTime'] = 'unknown'

In [18]:
timeslot = pd.concat((am_slot, am_slot_, pm_slot, pm_slot_, remaining_time, missing_time))

# Using "released" column to create month column and fill missing "day" values

In [19]:
series_df['released'] = pd.to_datetime(series_df['released'])

In [20]:
week_day = series_df[['released']]

In [21]:
week_day.head()

Unnamed: 0,released
Game of Thrones,2011-04-17
Westworld,2016-10-02
Big Little Lies,2017-02-19
The Deuce,2017-08-25
Succession,2018-06-03


In [22]:
week_day['weekday'] = week_day['released'].dt.dayofweek

In [23]:
week_day.head(10)

Unnamed: 0,released,weekday
Game of Thrones,2011-04-17,6
Westworld,2016-10-02,6
Big Little Lies,2017-02-19,6
The Deuce,2017-08-25,4
Succession,2018-06-03,6
Curb Your Enthusiasm,2000-10-15,6
Veep,2012-04-22,6
Silicon Valley,2014-04-06,6
Ballers,2015-06-21,6
High Maintenance,2016-09-16,4


In [24]:
week_day['month'] = week_day['released'].dt.month

In [25]:
week_day

Unnamed: 0,released,weekday,month
Game of Thrones,2011-04-17,6,4
Westworld,2016-10-02,6,10
Big Little Lies,2017-02-19,6,2
The Deuce,2017-08-25,4,8
Succession,2018-06-03,6,6
Curb Your Enthusiasm,2000-10-15,6,10
Veep,2012-04-22,6,4
Silicon Valley,2014-04-06,6,4
Ballers,2015-06-21,6,6
High Maintenance,2016-09-16,4,9
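
The integer weekday (Monday = 0, Sunday = 6) can also be turned back into a day name to fill blank airsDayOfWeek values, which is the "fill missing day values" step from the heading. This is only a sketch under the assumption that the release date is a reasonable stand-in for the regular air day; `fill_air_day` and `DAY_NAMES` are hypothetical names, not part of the notebook.

```python
from datetime import datetime

# Same convention as pandas' .dt.dayofweek: Monday is 0, Sunday is 6.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def fill_air_day(airs_day, released):
    """Keep an existing airsDayOfWeek value; otherwise derive it from `released`."""
    if airs_day:                      # non-empty string: nothing to fill
        return airs_day
    return DAY_NAMES[released.weekday()]

# Game of Thrones already has a value; Turn-On (released 05 Feb 1969) does not.
print(fill_air_day("Sunday", datetime(2011, 4, 17)))
print(fill_air_day("", datetime(1969, 2, 5)))
```

In pandas, `series_df['released'].dt.day_name()` should give the same names for a whole column at once.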


# Categorizing runtime

I will be categorizing this data as well. Although the runtimes I've gathered take many different values, in practice TV shows usually run in 30-minute or 1-hour slots (anything longer than an hour is generally a special or a movie). For this reason, I chose relatively arbitrary runtime cutoffs to group the shows by.
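
The grouping rule can be written as one small function. This is a sketch that mirrors the cutoffs used in the cells below (under 30 minutes is "half", over 70 is "special", everything else is "full"); the function and parameter names are my own, and the cutoffs are as arbitrary here as they are in the notebook.

```python
def runtime_category(minutes, half_below=30, special_above=70):
    """Bucket a runtime in minutes into 'half', 'full' or 'special'."""
    if minutes > special_above:
        return "special"
    if minutes < half_below:
        return "half"
    return "full"

# A few runtimes that appear in the data
for minutes in (10, 25, 30, 55, 60, 120):
    print(minutes, runtime_category(minutes))
```

Note that with a strict `< 30` cutoff, a 30-minute show counts as "full". If 30-minute shows are meant to land in "half", passing `half_below=31` would be one way to do it.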

In [26]:
series_df = series_df[(series_df[['runtime_x']] != ('')).all(axis=1)]

In [27]:
series_df[['runtime_x']] = series_df[['runtime_x']].astype(int)

In [28]:
series_df['runtime_x'].value_counts()

30     692
60     425
45     315
25     241
15      35
20      33
50      24
11      23
55      16
10      16
120     13
40      13
85      10
1        9
90       8
12       8
35       4
13       4
65       3
70       3
180      2
80       2
7        2
75       1
95       1
125      1
140      1
6        1
3        1
240      1
Name: runtime_x, dtype: int64

In [29]:
series_df['runtime_x'].head(10)

Game of Thrones         55
Westworld               60
Big Little Lies         50
The Deuce               60
Succession              60
Curb Your Enthusiasm    30
Veep                    30
Silicon Valley          30
Ballers                 30
High Maintenance        10
Name: runtime_x, dtype: int64

In [30]:
series_df['runtime_cat'] = np.where(series_df['runtime_x'] < 30, 'half', 'full')

In [31]:
# Assign through a single .loc call (row mask, column label) to avoid chained assignment
series_df.loc[series_df['runtime_x'] < 30, 'runtime_cat'] = 'half'
series_df.loc[series_df['runtime_x'] > 70, 'runtime_cat'] = 'special'

In [32]:
series_df['runtime_cat'].value_counts()

full       1495
half        373
special      40
Name: runtime_cat, dtype: int64

In [33]:
series_df['runtime_cat'].head(10)

Game of Thrones         full
Westworld               full
Big Little Lies         full
The Deuce               full
Succession              full
Curb Your Enthusiasm    full
Veep                    full
Silicon Valley          full
Ballers                 full
High Maintenance        half
Name: runtime_cat, dtype: object

In [34]:
series_df.head(3)

Unnamed: 0,airsDayOfWeek,airsTime,network,overview_x,rating,runtime_x,number_of_episodes,number_of_seasons,overview_y,status_y,...,actors,awards,genre_y,imdb_id,imdb_rating,imdb_votes,plot,released,writer,runtime_cat
Game of Thrones,Sunday,9:00 PM,HBO,Seven noble families fight for control of the ...,TV-MA,55,73.0,8,Seven noble families fight for control of the ...,Returning Series,...,"Peter Dinklage, Lena Headey, Emilia Clarke, Ki...",1,"Action, Adventure, Drama, Fantasy, Romance",tt0944947,9.5,1361235,"In the mythical continent of Westeros, several...",2011-04-17,"David Benioff, D.B. Weiss",full
Westworld,Sunday,9:00 PM,HBO,Westworld is a dark odyssey about the dawn of ...,TV-MA,60,20.0,2,A dark odyssey about the dawn of artificial co...,Returning Series,...,"Evan Rachel Wood, Thandie Newton, Jeffrey Wrig...",1,"Drama, Mystery, Sci-Fi",tt0475784,8.9,307642,Westworld isn't your typical amusement park. I...,2016-10-02,0,full
Big Little Lies,Sunday,9:00 PM,HBO,"Subversive, darkly comedic drama Big Little Li...",TV-MA,50,7.0,2,"Subversive, darkly comedic drama Big Little Li...",Returning Series,...,"Reese Witherspoon, Nicole Kidman, Shailene Woo...",1,"Crime, Drama, Mystery",tt3920596,8.6,86060,While Madeline and Celeste take new in town si...,2017-02-19,David E. Kelley,full


# Separate genre_y values into additional columns

In [35]:
df = pd.Series(series_df['genre_y'].str.split(','))

In [36]:
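# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
genre_dummies = pd.get_dummies(df.apply(pd.Series).stack()).groupby(level=0).sum()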
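
One thing to watch with splitting on a bare comma: every genre after the first keeps its leading space, so " Drama" and "Drama" become separate columns (this shows up in the column listing later in the notebook). Below is a minimal, pandas-free sketch of the same multi-hot encoding with the whitespace stripped; `genre_indicators` is a hypothetical helper name.

```python
def genre_indicators(genre_strings):
    """Return (sorted genre names, 0/1 indicator rows) for comma-separated genre strings."""
    # Strip whitespace so 'Drama' and ' Drama' are treated as the same genre
    split_rows = [
        [g.strip() for g in text.split(",") if g.strip()]
        for text in genre_strings
    ]
    columns = sorted({g for row in split_rows for g in row})
    matrix = [[1 if col in row else 0 for col in columns] for row in split_rows]
    return columns, matrix

columns, matrix = genre_indicators([
    "Action, Adventure, Drama, Fantasy, Romance",   # Game of Thrones
    "Drama, Mystery, Sci-Fi",                       # Westworld
    "Comedy",                                       # Turn-On
])
print(columns)
print(matrix[0])
```

In pandas, `series_df['genre_y'].str.get_dummies(sep=', ')` should produce the same kind of table in one call, assuming the separator is always a comma followed by a single space.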

# Dummies

In [37]:
status_dummies = pd.get_dummies(series_df['status_y'], prefix='status')

In [38]:
type_dummies = pd.get_dummies(series_df['type_x'], prefix='type')

In [39]:
timeslot_dum = pd.get_dummies(timeslot, prefix='timeslot')

In [40]:
week_day_ = week_day.drop('released', axis=1)
week_day_dum1 = pd.get_dummies(week_day_['weekday'], prefix='day')
week_day_dum2 = pd.get_dummies(week_day_['month'], prefix='month')

# Combining all the dataframes

### To drop:
- airsDayOfWeek
- airsTime
- network
- overview_x
- runtime_x
- overview_y
- actors (dummy)
- genre_y (dummy)
- imdb_id
- imdb_rating
- plot
- released
- writer (dummy)

In [41]:
series_df.head(1)

Unnamed: 0,airsDayOfWeek,airsTime,network,overview_x,rating,runtime_x,number_of_episodes,number_of_seasons,overview_y,status_y,...,actors,awards,genre_y,imdb_id,imdb_rating,imdb_votes,plot,released,writer,runtime_cat
Game of Thrones,Sunday,9:00 PM,HBO,Seven noble families fight for control of the ...,TV-MA,55,73.0,8,Seven noble families fight for control of the ...,Returning Series,...,"Peter Dinklage, Lena Headey, Emilia Clarke, Ki...",1,"Action, Adventure, Drama, Fantasy, Romance",tt0944947,9.5,1361235,"In the mythical continent of Westeros, several...",2011-04-17,"David Benioff, D.B. Weiss",full


In [42]:
timeslot_dum.head(1)

Unnamed: 0,timeslot_00:00,timeslot_afternoon,timeslot_evening,timeslot_latenight,timeslot_morning,timeslot_unknown
Random Acts of Flyness,0,0,0,1,0,0


In [43]:
week_day_dum1.head(1)

Unnamed: 0,day_0,day_1,day_2,day_3,day_4,day_5,day_6
Game of Thrones,0,0,0,0,0,0,1


In [44]:
week_day_dum2.head(1)

Unnamed: 0,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
Game of Thrones,0,0,0,1,0,0,0,0,0,0,0,0


In [45]:
series_df['runtime_cat'].head(1)

Game of Thrones    full
Name: runtime_cat, dtype: object

In [46]:
genre_dummies.head(1)

Unnamed: 0,Action,Adventure,Animation,Comedy,Crime,Drama,Family,Fantasy,Game-Show,History,...,Horror,Music,Mystery,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Western
Game of Thrones,0,1,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
series_df = series_df.merge(timeslot_dum, left_index=True, right_index=True)

In [48]:
series_df = series_df.merge(week_day_dum1, left_index=True, right_index=True)
series_df = series_df.merge(week_day_dum2, left_index=True, right_index=True)
series_df = series_df.merge(genre_dummies, left_index=True, right_index=True)

In [49]:
model_df = series_df.drop(['network', 
                           'imdb_id', 
                           'released', 
                           'airsDayOfWeek', 
                           'airsTime', 
                           'overview_x', 
                           'overview_y', 
                           'plot', 
                           'genre_y', 
                           'status_y', 
                           'type_x', 
                           'runtime_x', 
                           'runtime_cat'], 
                          axis=1)

In [50]:
model_df.drop(['actors', 'writer'], axis=1, inplace=True)

In [51]:
series_df['imdb_votes'] = series_df['imdb_votes'].str.replace(',', '').astype(float)

In [52]:
model_df[['awards']] = model_df[['awards']].astype(float)

In [53]:
model_df[['imdb_rating']] = model_df[['imdb_rating']].astype(float)

In [54]:
model_df.drop('rating', axis=1, inplace=True)

# Final cleanup steps

Now that the data is usable and in the proper format, I can see how well the chosen features predict IMDB rating without any additional feature engineering.

In [55]:
type(model_df['imdb_votes'].iloc[0])

str

In [56]:
model_df['imdb_votes'] = model_df['imdb_votes'].str.replace(',', '').astype(float)

In [57]:
series_df[['runtime_x', 'awards']] = series_df[['runtime_x', 'awards']].astype(float)

In [58]:
series_df.isnull().sum()

airsDayOfWeek          0
airsTime               0
network                0
overview_x            79
rating                 0
runtime_x              0
number_of_episodes     1
number_of_seasons      0
overview_y             0
status_y               0
type_x                 0
actors                 0
awards                 0
genre_y                0
imdb_id                0
imdb_rating            0
imdb_votes             0
plot                   0
released               0
writer                 0
runtime_cat            0
timeslot_00:00         0
timeslot_afternoon     0
timeslot_evening       0
timeslot_latenight     0
timeslot_morning       0
timeslot_unknown       0
day_0                  0
day_1                  0
day_2                  0
                      ..
 Romance               0
 Sci-Fi                0
 Short                 0
 Sport                 0
 Talk-Show             0
 Thriller              0
 War                   0
 Western               0
Action                 0


In [59]:
series_df['runtime_x'].value_counts()

30.0     685
60.0     416
45.0     315
25.0     241
15.0      35
20.0      32
50.0      24
11.0      23
55.0      16
10.0      16
40.0      13
120.0     13
85.0      10
1.0        9
12.0       8
90.0       8
35.0       4
13.0       4
70.0       3
65.0       2
180.0      2
7.0        2
80.0       2
140.0      1
240.0      1
3.0        1
6.0        1
95.0       1
125.0      1
75.0       1
Name: runtime_x, dtype: int64

In [60]:
model_df.dropna(inplace=True)

In [61]:
model_df.head()

Unnamed: 0,number_of_episodes,number_of_seasons,awards,imdb_rating,imdb_votes,timeslot_00:00,timeslot_afternoon,timeslot_evening,timeslot_latenight,timeslot_morning,...,Horror,Music,Mystery,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Western
Game of Thrones,73.0,8,1.0,9.5,1361235.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Westworld,20.0,2,1.0,8.9,307642.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Big Little Lies,7.0,2,1.0,8.6,86060.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
The Deuce,16.0,2,1.0,8.1,14113.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Succession,10.0,1,0.0,7.6,3927.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
model_df.shape

(1889, 76)

In [63]:
model_df.head(3)

Unnamed: 0,number_of_episodes,number_of_seasons,awards,imdb_rating,imdb_votes,timeslot_00:00,timeslot_afternoon,timeslot_evening,timeslot_latenight,timeslot_morning,...,Horror,Music,Mystery,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Western
Game of Thrones,73.0,8,1.0,9.5,1361235.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Westworld,20.0,2,1.0,8.9,307642.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Big Little Lies,7.0,2,1.0,8.6,86060.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


# Initial Modeling

I want to test a few models to see how they score (R²) on the current dataset, to get a baseline for where I am.
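
For reference, the `.score` method of scikit-learn regressors returns R², which compares the model's squared error against always predicting the mean of the target. The sketch below is a hand-rolled version for illustration only (`r2_manual` is my own name, and the ratings are toy values, not results from this notebook).

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

ratings = [7.0, 8.0, 9.0]                    # toy ratings
print(r2_manual(ratings, [7.0, 8.0, 9.0]))   # perfect predictions
print(r2_manual(ratings, [8.0, 8.0, 8.0]))   # always predicting the mean
print(r2_manual(ratings, [9.0, 8.0, 7.0]))   # worse than predicting the mean
```

A score of 0 means the model does no better than the mean, and a negative score on the test set means it does worse.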

In [64]:
X = model_df.drop(['imdb_rating', 
                  ], 
                  axis=1)
y = model_df['imdb_rating']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [65]:
X_train.head(3)

Unnamed: 0,number_of_episodes,number_of_seasons,awards,imdb_votes,timeslot_00:00,timeslot_afternoon,timeslot_evening,timeslot_latenight,timeslot_morning,timeslot_unknown,...,Horror,Music,Mystery,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Western
Bourbon Street Beat,39.0,1,0.0,105.0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
The River,8.0,1,1.0,15530.0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
NCIS: Los Angeles,219.0,10,1.0,42632.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
ss = StandardScaler()

X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [67]:
lr = LinearRegression()

lr.fit(X_train_sc, y_train)

lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.2612222815804618, 0.1795306332322919)

In [68]:
lasso = LassoCV()

lasso.fit(X_train_sc, y_train)

lasso.score(X_train_sc, y_train), lasso.score(X_test_sc, y_test)

(0.23754133457029591, 0.21040327459643204)

In [69]:
ridge = RidgeCV()

ridge.fit(X_train_sc, y_train)

ridge.score(X_train_sc, y_train), ridge.score(X_test_sc, y_test)

(0.2620356190198584, 0.17005536165445556)

In [70]:
rf = RandomForestRegressor()

rf.fit(X_train_sc, y_train)

rf.score(X_train_sc, y_train), rf.score(X_test_sc, y_test)

(0.8569101081222248, 0.07779220094937056)

In [71]:
gb = GradientBoostingRegressor()
gb.fit(X_train_sc, y_train)

gb.score(X_train_sc, y_train), gb.score(X_test_sc, y_test)

(0.46422411350696746, 0.23002859498671335)

With no hyperparameter tuning and only some basic feature selection, all of my models score very poorly on R². They are also all overfitting, with RandomForest being the worst offender (0.86 train vs. 0.08 test) and GradientBoost next (0.46 vs. 0.23). However, GradientBoost also scores highest on the test set, so it seems likely to be the best-scoring model. Lasso Regression scores low as well, but overfits the least (0.24 vs. 0.21).

I will look more into the parameters and outcomes once I have a GridSearch set up.
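
To make the comparison easier to read, the train/test gap can be tabulated directly. The sketch below just re-enters the scores printed in the cells above, rounded to four decimals; since `train_test_split` was called without a `random_state`, a rerun would give somewhat different numbers.

```python
# (train R^2, test R^2) as reported above, rounded to four decimals
scores = {
    "LinearRegression": (0.2612, 0.1795),
    "LassoCV": (0.2375, 0.2104),
    "RidgeCV": (0.2620, 0.1701),
    "RandomForestRegressor": (0.8569, 0.0778),
    "GradientBoostingRegressor": (0.4642, 0.2300),
}

# Largest train/test gap first: a rough ranking of how much each model overfits
gaps = sorted(
    ((round(train - test, 4), name) for name, (train, test) in scores.items()),
    reverse=True,
)
for gap, name in gaps:
    print(f"{name:<28} gap={gap:.4f}")

best_on_test = max(scores, key=lambda name: scores[name][1])
print("best test score:", best_on_test)
```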

# Checking against 'imdb_votes' as target

In [73]:
model_df.columns

Index(['number_of_episodes', 'number_of_seasons', 'awards', 'imdb_rating',
       'imdb_votes', 'timeslot_00:00', 'timeslot_afternoon',
       'timeslot_evening', 'timeslot_latenight', 'timeslot_morning',
       'timeslot_unknown', 'day_0', 'day_1', 'day_2', 'day_3', 'day_4',
       'day_5', 'day_6', 'month_1', 'month_2', 'month_3', 'month_4', 'month_5',
       'month_6', 'month_7', 'month_8', 'month_9', 'month_10', 'month_11',
       'month_12', ' Action', ' Adventure', ' Animation', ' Comedy', ' Crime',
       ' Drama', ' Family', ' Fantasy', ' Game-Show', ' History', ' Horror',
       ' Music', ' Musical', ' Mystery', ' News', ' Reality-TV', ' Romance',
       ' Sci-Fi', ' Short', ' Sport', ' Talk-Show', ' Thriller', ' War',
       ' Western', 'Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Game-Show',
       'History', 'Horror', 'Music', 'Mystery', 'News', 'Reality-TV',
       'Romance', 'Sci-Fi', 'Sport', 'Ta

In [74]:
X = model_df.drop(['number_of_episodes', 'number_of_seasons', 'awards', 'imdb_rating',
       'imdb_votes', 'timeslot_afternoon', 'timeslot_evening',
       'timeslot_latenight', 'timeslot_morning', 'timeslot_unknown', 'day_0',
       'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6', 'month_1',
       'month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7',
       'month_8', 'month_9', 'month_10', 'month_11', 'month_12'], axis=1)
y = model_df['imdb_votes']

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [75]:
ss = StandardScaler()

X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [76]:
lr = LinearRegression()
lr.fit(X_train_sc, y_train)

lr.score(X_train_sc, y_train), lr.score(X_test_sc, y_test)

(0.08756283604016213, 0.005022077603742625)

In [77]:
model_1_features = [' Action',
       ' Adventure', ' Animation', ' Comedy', ' Crime', ' Drama', ' Family',
       ' Fantasy', ' Game-Show', ' History', ' Horror', ' Music', ' Musical',
       ' Mystery', ' News', ' Reality-TV', ' Romance', ' Sci-Fi', ' Short',
       ' Sport', ' Talk-Show', ' Thriller', ' War', ' Western', 'Action',
       'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'Game-Show', 'History', 'Horror', 'Music',
       'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Sport',
       'Talk-Show', 'Western']

In [78]:
lr.coef_

array([-1.13165723e+03, -1.66915822e+03,  3.62065816e+03, -7.02378274e+02,
        1.77168383e+03,  1.55030766e+03,  5.99627929e+03, -1.68035466e+03,
        5.22527199e+03,  2.13975130e+02,  2.53748792e+03,  4.80003746e+03,
        8.04473633e+02,  1.49764866e+02,  8.50865173e+01,  1.22612860e+03,
       -9.76088699e-01,  8.59974378e+03,  3.25532914e+03,  8.82065547e+00,
        3.06375526e+02,  1.22415949e+01,  4.91602006e+03, -1.08880927e+03,
       -5.72840286e+02, -6.49496078e+15, -4.67120117e+15, -9.40205857e+15,
       -1.55686260e+15, -1.13104459e+16, -5.36797054e+15, -4.01891093e+15,
       -7.96482923e+15, -3.15644824e+15, -1.10203861e+15, -3.87330334e+15,
       -9.00129108e+14, -2.00704802e+15, -1.42171849e+15, -1.10203861e+15,
       -2.61034807e+15, -4.90422380e+15, -9.00129108e+14, -2.00704802e+15,
       -1.79643465e+15, -2.00704802e+15, -1.79643465e+15])

In [79]:
# Pair each coefficient with X's own columns. X still contains 'timeslot_00:00'
# (47 columns), so zipping against the 46-name model_1_features list shifts every label by one.
list(zip(lr.coef_, X.columns))

[(-1131.6572310912698, 'timeslot_00:00'),
 (-1669.1582211022242, ' Action'),
 (3620.658156967591, ' Adventure'),
 (-702.3782736617112, ' Animation'),
 (1771.6838279468845, ' Comedy'),
 (1550.3076603646725, ' Crime'),
 (5996.279291207598, ' Drama'),
 (-1680.3546568860518, ' Family'),
 (5225.271986190613, ' Fantasy'),
 (213.97512999938863, ' Game-Show'),
 (2537.487923159552, ' History'),
 (4800.037457517499, ' Horror'),
 (804.4736334033506, ' Music'),
 (149.7648658429002, ' Musical'),
 (85.08651728095984, ' Mystery'),
 (1226.128597949699, ' News'),
 (-0.9760886994071598, ' Reality-TV'),
 (8599.743775433037, ' Romance'),
 (3255.329135145225, ' Sci-Fi'),
 (8.820655473122239, ' Short'),
 (306.37552554623113, ' Sport'),
 (12.241594901445046, ' Talk-Show'),
 (4916.020055806766, ' Thriller'),
 (-1088.8092710841945, ' War'),
 (-572.8402861823275, ' Western'),
 (-6494960780417198.0, 'Action'),
 (-4671201170159952.0, 'Adventure'),
 (-9402058570159562.0, 'Animation'),
 (-1556862596467208.5, 'Biography'),


In [80]:
lasso = LassoCV()

lasso.fit(X_train_sc, y_train)

lasso.score(X_train_sc, y_train), lasso.score(X_test_sc, y_test)

(0.07382737969725961, 0.030505805284247397)

In [81]:
ridge = RidgeCV()

ridge.fit(X_train_sc, y_train)

ridge.score(X_train_sc, y_train), ridge.score(X_test_sc, y_test)

(0.0875591663861428, 0.006044379551956913)

In [82]:
rf = RandomForestRegressor()

rf.fit(X_train_sc, y_train)

rf.score(X_train_sc, y_train), rf.score(X_test_sc, y_test)

(0.4215042031738104, -0.07217910165732122)

In [83]:
gb = GradientBoostingRegressor()
gb.fit(X_train_sc, y_train)

gb.score(X_train_sc, y_train), gb.score(X_test_sc, y_test)

(0.45252943845447635, -0.30054178681504573)

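The models score significantly worse with imdb_votes as the target and the genre indicators (plus the leftover timeslot_00:00 column) as features. Test R² is near zero for the linear models and negative for RandomForest and GradientBoost, meaning those two do worse than simply predicting the mean. The train/test gap has narrowed for RandomForest, although not for GradientBoost. I believe there may be some merit in this approach, but it will have to score much better to not just be dismissed as noise.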

In [84]:
with open('./Assets_&_Data/model_prelim.pickle', 'wb') as f:
    pickle.dump(model_df, f)
    
with open('./Assets_&_Data/week_day.pickle', 'wb') as f:
    pickle.dump(week_day, f)
    
with open('./Assets_&_Data/cleaned_series_df.pickle', 'wb') as f:
    pickle.dump(series_df, f)

# Next Steps

Now that I know how badly my model is performing, I'd like to take a second look at my features to see if they can be modified to be better predictors.