# Good Film // Bad Film
## Film Plot Synopses and Other Features as Predictors of Critical Reception

This notebook builds and evaluates a handful of classification models that predict whether a film's critical reception score falls above or below the mean. Independent variables include plot synopsis free text, social media metrics on the leading actors, and other categorical variables such as film genre.

In [8]:
import pickle
import os
import re
import time

import pandas as pd
import numpy as np

import requests

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='ticks', color_codes=True)

import nltk
from nltk import FreqDist, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords

import enchant
english_d = enchant.Dict("en_US")  # English dictionary used to filter tokens
import gensim
from gensim import corpora, models, similarities

import pyLDAvis
import pyLDAvis.gensim

from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# statsmodels.api provides sm.Logit; importing statsmodels.formula.api under
# the same alias would overwrite it, so only the first alias is kept.
import statsmodels.api as sm
import scipy.stats as stats
from statsmodels.formula.api import ols

import xgboost as xgb
from xgboost import XGBClassifier
from patsy import dmatrices

## Load the data

From https://www.kaggle.com/tmdb/tmdb-movie-metadata:

In [2]:
imdb = pd.read_csv('../data/imdb_5000_movies.csv') # Just a big Kaggle dataset full of movies.

One really nice thing about this dataset is that it provides the unique IMDb movie IDs, which we can pass to a third-party API in order to supplement our data with more features. A key to the OMDb API costs a minimum of one dollar. Here we parse out those IMDb IDs:

In [3]:
imdb.shape

(5043, 28)

In [33]:
imdb_ids = [imdb.iloc[i]['movie_imdb_link'].split('title/')[1].split('/?')[0] for i in range(len(imdb))]
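The split-based parsing above assumes every link contains `title/` followed by `/?`. As a slightly more defensive sketch, a regular expression can pull out the `tt...` identifier and return `None` for anything unexpected. The helper name `extract_imdb_id`, the assumed ID format, and the sample URL are illustrative and not taken from the dataset's documentation.

```python
import re

# Assumed format: IMDb title IDs are "tt" followed by digits.
IMDB_ID_PATTERN = re.compile(r"/title/(tt\d+)")

def extract_imdb_id(link):
    """Return the IMDb title ID embedded in a link, or None if absent."""
    match = IMDB_ID_PATTERN.search(link)
    return match.group(1) if match else None

# Illustrative link in the style of the 'movie_imdb_link' column
sample_link = "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1"
print(extract_imdb_id(sample_link))
```

Applied to the whole column, this could look like `imdb['movie_imdb_link'].map(extract_imdb_id)`, which also makes malformed links easy to spot as nulls.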

Here we make the API calls and persist each response to a small JSON file:

In [34]:
# You will need to enter your own API key in the API_KEY.py file (remove the .template suffix)
from API_KEY import API_KEY

try:
    os.mkdir('../data/movie_metadata')
except FileExistsError:
    pass

In [None]:
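i = 0; j = 0; total = len(imdb_ids)

for movie_id in imdb_ids:  # avoid shadowing the built-in id()
    print(f"Downloading movie {i} of {total}...")
    i += 1
    query_string = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={API_KEY}'
    response_text = requests.get(query_string).text
    if "could not be found" in response_text:
        j += 1
        # Note: j/i is a fraction (0.2 means 20%), despite the "%" in the message
        print(f"{round(j/i, 2)}% of movies not found")
        continue
    # Persist the raw response; the context manager closes the file for us
    with open(f'../data/movie_metadata/movie_{movie_id}.json', 'w+') as f:
        f.write(response_text)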

Downloading movie 0 of 5043...
Downloading movie 1 of 5043...
Downloading movie 2 of 5043...
Downloading movie 3 of 5043...
Downloading movie 4 of 5043...
0.2% of movies not found
Downloading movie 5 of 5043...
...
Downloading movie 260 of 5043...
0.02% of movies not found
Downloading movie 261 of 5043...
...
Downloading movie 757 of 5043...
0.02% of movies not found
Downloading movie 758 of 5043...
...
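Because a run like the one above takes a long time and can be interrupted, it may help to make the download resumable. The following is a minimal sketch, not the notebook's actual method: `download_missing` is a hypothetical helper, and the HTTP call is passed in as a `fetch` function so the example runs without network access. The stand-in response text is an assumption about what a "not found" reply looks like.

```python
import tempfile
from pathlib import Path

def download_missing(movie_ids, fetch, out_dir):
    """Fetch each movie once, skipping IDs that already have a saved file.

    Returns a (saved, not_found) pair of ID lists.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, not_found = [], []
    for movie_id in movie_ids:
        target = out_dir / f"movie_{movie_id}.json"
        if target.exists():          # already downloaded on an earlier run
            continue
        body = fetch(movie_id)
        if "could not be found" in body:
            not_found.append(movie_id)
            continue
        target.write_text(body, encoding="utf-8")
        saved.append(movie_id)
    return saved, not_found

# Stand-in for the real HTTP request (hypothetical responses)
def fake_fetch(movie_id):
    if movie_id == "tt0000002":
        return '{"status_message": "The resource you requested could not be found."}'
    return '{"Title": "Example"}'

with tempfile.TemporaryDirectory() as tmp:
    first_run = download_missing(["tt0000001", "tt0000002"], fake_fetch, tmp)
    second_run = download_missing(["tt0000001", "tt0000002"], fake_fetch, tmp)
print(first_run, second_run)
```

In the notebook itself, `fetch` could wrap the existing `requests.get(query_string).text` call; the second run then only retries IDs that were not saved the first time.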

This is what we want our eventual pandas DataFrame to look like:

In [None]:
df = pd.DataFrame(columns=['Title', 'Year', 'ID', 'Plot', 'Genre', 'Production', 
                           'Director', 'Actor_1_name', 'Actor_1_fb_likes', 'Actor_2_name', 
                           'Actor_2_fb_likes', 'Actor_3_name', 'Actor_3_fb_likes', 'Budget', 
                           'Rated', 'Language', 'imdbRating'])

This is where we frankenstein together the Kaggle dataset with what we got from OMDb:

In [None]:
import json

rows = []
for i in range(len(imdb_ids)):
    movie_id = imdb.iloc[i]['movie_imdb_link'].split('title/')[1].split('/?')[0]
    # Read from the same folder and file name the download loop wrote to
    path = os.path.join('..', 'data', 'movie_metadata', f"movie_{movie_id}.json")
    if not os.path.exists(path): # Movies the API could not find were never saved
        continue
    with open(path, "r") as x_file: # Open up each movie's saved response
        movie = json.loads(x_file.read()) # Parse the JSON text rather than eval-ing it
    movie['Plot'] = movie['Plot'].replace("\\'", "'") # Cleanup: unescape apostrophes
    kaggle_row = imdb.iloc[i]
    rows.append({'Title': movie['Title'], 'Year': movie['Year'], 'ID': movie_id,
                 'Plot': movie['Plot'], 'Genre': movie['Genre'],
                 'imdbRating': movie['imdbRating'],
                 'Director': kaggle_row['director_name'],
                 'Actor_1_name': kaggle_row['actor_1_name'],
                 'Actor_1_fb_likes': kaggle_row['actor_1_facebook_likes'],
                 'Actor_2_name': kaggle_row['actor_2_name'],
                 'Actor_2_fb_likes': kaggle_row['actor_2_facebook_likes'],
                 'Actor_3_name': kaggle_row['actor_3_name'],
                 'Actor_3_fb_likes': kaggle_row['actor_3_facebook_likes'],
                 'Budget': kaggle_row['budget'], 'Language': movie['Language'],
                 'Rated': movie['Rated']})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(rows, columns=df.columns)

Here's our beautiful new DataFrame:

In [None]:
df.iloc[:3,:]

## Munging and EDA

We drop datapoints with null 'Plot' or 'imdbRating' values; the rating is our dependent (target) variable, so rows without it are unusable. We also convert the rating into a binary label: we are only concerned with whether a film's 'imdbRating' is at or above the mean (1) or below the mean (0). This binary value is saved into a new column called 'binary_target'.

Null values in the three 'actor facebook likes' columns are imputed with the column mean.

In [None]:
df = df[~((df['Plot'] == 'N/A')|(df['imdbRating'] == 'N/A'))].copy() # Drops movies with null plots or ratings
df['imdbRating'] = df['imdbRating'].astype(float)
df['binary_target'] = df['imdbRating'] >= df['imdbRating'].mean()   #binary target column. True = at or above mean ; False = below mean
df['binary_target'] = df['binary_target'].astype(int)
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not modify df in newer pandas versions
for col in ['Actor_1_fb_likes', 'Actor_2_fb_likes', 'Actor_3_fb_likes']:
    df[col] = df[col].astype(float)
    df[col] = df[col].fillna(df[col].mean())
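The two steps above (thresholding at the mean and mean imputation) can be checked on toy values. This is a plain-Python sketch of the same logic, independent of pandas; the helper names and the numbers are made up for illustration.

```python
from statistics import mean

def binarize_at_mean(ratings):
    """Label each rating 1 if it is at or above the mean, else 0."""
    cutoff = mean(ratings)
    return [int(r >= cutoff) for r in ratings]

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ratings = [5.0, 6.0, 7.0, 8.0]   # made-up ratings; their mean is 6.5
likes = [100, None, 300]         # made-up like counts with one missing value

print(binarize_at_mean(ratings))
print(impute_mean(likes))
```

One consequence worth keeping in mind: because the cutoff is the mean rather than the median, the two classes are not guaranteed to be the same size.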

In [None]:
df = df.reset_index(drop=True)
df[:3]

### One-hot encoding Genres

In [None]:
final_genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci_Fi', 'Sport', 'Thriller', 'War', 'Western']    

#create a series of lists, where each element is a list of a movie's classified genres
genre_lists = df['Genre'].str.split(', ')

#adding columns to df for each genre, 1 represents the movie is classified under that genre, 0 is that it is not
for genre in final_genres:
    # The column is named 'Sci_Fi' so it can be used in a formula later, but the
    # raw genre strings are assumed to spell it 'Sci-Fi'; match on that spelling
    label = genre.replace('_', '-')
    df[genre] = [int(label in movie_genres) for movie_genres in genre_lists]
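To make the encoding concrete, here is a small stand-alone sketch for a single movie. `one_hot_genres` is a hypothetical helper, and it assumes the raw genre string spells the genre `Sci-Fi` while the column is named `Sci_Fi`; that mismatch is worth checking, because a plain membership test on `Sci_Fi` would never match.

```python
def one_hot_genres(genre_string, known_genres):
    """Map a comma-separated genre string to {column_name: 0 or 1}."""
    present = set(genre_string.split(', '))
    # Column names use underscores; raw strings are assumed to use hyphens.
    return {name: int(name.replace('_', '-') in present) for name in known_genres}

# Made-up genre string in the style of the 'Genre' column
example = one_hot_genres("Action, Sci-Fi", ["Action", "Drama", "Sci_Fi"])
print(example)
```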

### Creating a fresh DataFrame with everything but Plot Text Features

In [None]:
main_df = pd.DataFrame()

#join the features we want to narrow in on with our target variable
genres = df.iloc[:,-21:]
main_df['Year'] = [int(year.split('–')[0]) for year in df['Year'].values]
main_df = main_df.join(genres)

main_df[:3]

In [None]:
main_df.columns

In [None]:
main_df.Year.hist()

In [None]:
main_df.binary_target.hist()

### Log Transforming "Actor Facebook Likes"

'Actor_1_fb_likes' had some striking outliers and needed a log transform:

In [None]:
df.Actor_1_fb_likes.hist() # Check out those outliers. Can't even see them...

In [None]:
def log_transform_col(feature, dataframe):
    # log1p computes log(1 + x), so a count of zero maps to zero
    values = dataframe[feature].astype(float).values
    logged = pd.Series(np.log1p(values), name=feature+'_logged')
    return logged

actor_features = ['Actor_1_fb_likes', 'Actor_2_fb_likes','Actor_3_fb_likes']

actor_likes = [log_transform_col(feature, df) for feature in actor_features]

In [None]:
pd.concat(actor_likes, axis=1).hist() # Much better

Looks way more Gaussian after a log transform. We add these logged features to our dataframe:

In [None]:
main_df = main_df.join(actor_likes)
main_df[:3]
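To see why the log transform above tames the outliers, it helps to compare how far an extreme value sits from a typical one before and after. The numbers below are made up for illustration; `math.log1p` computes log(1 + x), the same quantity as the transform used in this section.

```python
import math

likes = [0, 10, 100, 1000, 640000]        # made-up counts with one extreme value
logged = [math.log1p(x) for x in likes]   # log(1 + x), defined at x = 0

# How many times larger is the extreme value than a "large but ordinary" one?
raw_ratio = max(likes) / 1000
logged_ratio = max(logged) / math.log1p(1000)
print(raw_ratio, round(logged_ratio, 2))
```

On the raw scale the outlier is hundreds of times larger than the next value, while on the log scale it is roughly twice as large, which is why the histogram becomes readable.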

### Baseline Logistic Regression without Plot-Text Features

Now that we've cleaned up a bit, we'll start throwing our features at a statsmodels logistic regression estimator to evaluate the model's pseudo R-squared and the p-values of the various features.

In [None]:
# check LogReg with all initial variables from main_df (note: no plot)
s = ("binary_target ~ Year + C(Action) + C(Adventure) + C(Animation) + C(Biography) + C(Comedy)"
                 "+ C(Crime) + C(Documentary) + C(Drama) + C(Family) + C(Fantasy) + C(History)"
                 "+ C(Horror) + C(Musical) + C(Mystery) + C(Romance) + C(Sci_Fi) + C(Sport)"
                 "+ C(Thriller) + C(War)+ C(Western)"
                 "+ Actor_1_fb_likes_logged + Actor_2_fb_likes_logged"
                 "+ Actor_3_fb_likes_logged")

y, X = dmatrices(s, main_df, return_type="dataframe")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =  0.2)

logit_model = sm.Logit(y_train.iloc[:,0], X_train)
result = logit_model.fit()

print(result.summary())

y_preds = result.predict(X_test)

accuracy_score(y_test, y_preds >=.5)

This gives us an idea of which features might be less important in the determination of what makes a movie "good". We'll drop the less pertinent features and try again:

In [None]:
s = ("binary_target ~ Year + C(Action) + C(Animation) + C(Biography) + C(Comedy)"
                 "+ C(Documentary) + C(Drama) + C(Family)"
                 "+ C(Horror) + C(Mystery) + C(Romance) + C(Sci_Fi)"
                 "+ C(Thriller)"
                 "+ Actor_1_fb_likes_logged + Actor_2_fb_likes_logged"
                 "+ Actor_3_fb_likes_logged")

main_df = main_df.drop(['Adventure','Crime', 'Fantasy', 'History', 'Musical', 'Sport', 'War', 'Western'], axis=1)

y, X = dmatrices(s, main_df, return_type = "dataframe")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =  0.2)

logit_model = sm.Logit(y_train.iloc[:,0], X_train)
result = logit_model.fit()

print(result.summary())

y_preds = result.predict(X_test)

accuracy_score(y_test, y_preds >=.5)
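The `y_preds >= .5` comparison in the cells above turns predicted probabilities into class labels. As a minimal sketch of what a fitted logistic regression does for one row, assuming made-up coefficients (the names `sigmoid` and `predict_label` are illustrative, not statsmodels API):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, coefficients, intercept, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return int(sigmoid(z) >= threshold)

# Hypothetical one-feature model: intercept -1.0, coefficient 2.0
print(predict_label([1.0], [2.0], -1.0))   # linear score is +1.0
print(predict_label([0.0], [2.0], -1.0))   # linear score is -1.0
```

A probability of exactly 0.5 corresponds to a linear score of zero, so the 0.5 threshold is equivalent to checking the sign of the linear predictor.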

In [None]:
main_df.columns

These are the features we're sticking with for now.

### Correlation Matrices

We'd be remiss not to check for overly correlated features:

In [None]:
# Creating a multi-scatter plot
main_corr= main_df.drop(['binary_target'], axis=1).iloc[:,:]
pd.plotting.scatter_matrix(main_corr, figsize=[15,15]);

In [None]:
sns.heatmap(main_corr.corr(), center=0);

It's fairly intuitive that the social media popularity of the leading actors would be positively correlated, but we'll leave all three features in, in case there are deviations from that norm. It's interesting to note that films in the "animation" genre are so commonly also in the "family" genre, which makes sense too.

## Using NLP to get features from the Plot Synopses

### Setting up Lemmatization / Normalization Functions:

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def lemmatize(plot_list):
    lemmatized_plots = []
    
    for plot in plot_list:
        tokenized_lower = word_tokenize(plot.lower()) # Make plot summary lowercase and tokenize
        tokenized_lower =[word for word in tokenized_lower if english_d.check(word)] # Make sure it's an english word
        dirty_lemma = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokenized_lower] # Lemmatize
        dirty_lemma_string = ' '.join(dirty_lemma)
        
        # Filter out words that don't match this regex pattern:
        reg = re.compile((r"([a-zA-Z]+(?:'[a-z]+)?)"))
        lemmatized_regex = [word_lem for word_lem in dirty_lemma if word_lem in reg.findall(dirty_lemma_string)]
        
        # Remove stop words
        lemmatized = [word_lem for word_lem in lemmatized_regex if not word_lem in stop_words]
        lemmatized_string = ' '.join(lemmatized)
        lemmatized_plots.append(lemmatized_string)
        
    return lemmatized_plots

In [None]:
def get_wordnet_pos(word):
    """Map POS tag to the first character that lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

### Run the Lemmatizer

In [None]:
all_plots = [plot for plot in df.loc[:,'Plot'].values] # Get all movie plots.
plots = lemmatize(all_plots) # Lemmatize.

### Vectorizing Plots

In [None]:
# Term frequency = Number of times a word appears in a document / number of words in document
# Inverse document frequency = log base e(number of documents / number of documents with word in it)
# tf:idf = tf * idf

tfidf = TfidfVectorizer()
response = tfidf.fit_transform(plots)
print(response.shape)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())

Now we have all plots lemmatized as "plots" and vectorized / weighted as "tfidf_df".
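The formulas in the comments of the vectorizing cell can be worked through by hand on a toy corpus. The sketch below implements that textbook definition directly; it is not what `TfidfVectorizer` computes by default, since scikit-learn smooths the idf term and L2-normalizes each row, so the numbers will differ. The function name and the two toy "plots" are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook tf-idf: tf = count / doc length, idf = ln(N / docs containing term)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in (counted once per document)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["spy chase spy", "spy romance"]   # toy lemmatized plots
scores = tf_idf(docs)
print(scores)
```

Note how "spy" gets a weight of zero even though it is the most frequent word: it appears in every document, so it carries no information for telling the plots apart.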

### Incorporating some LDA; Clustering Documents by Topic

Gensim's LDA (latent Dirichlet allocation) model groups words that frequently appear together into topics. The topics can be interpreted as general themes, and each movie gets weights indicating the degree to which it belongs to each topic. These weights are then re-incorporated as features in our dataset.

In [None]:
all_words = [plot.split(' ') for plot in plots] # Just formatting our corpus how Gensim wants it

In [None]:
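os.makedirs('pickles', exist_ok=True) # Make sure the output folder exists
dictionary = corpora.Dictionary(all_words)
corpus = [dictionary.doc2bow(text) for text in all_words]
with open('pickles/corpus.pkl', 'wb') as f: # Close the file once the corpus is written
    pickle.dump(corpus, f)
dictionary.save('pickles/dictionary.gensim')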

In [None]:
NUM_TOPICS = 50 # This value was arbitrarily chosen.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=10) # passes=10 is also arbitrary
ldamodel.save('pickles/model5.gensim')

In [None]:
topics = ldamodel.print_topics(num_words=4)
topics # These are examples of some of the clusters created by Gensim.

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
vis

In [None]:
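topic_weights = np.zeros((len(corpus), NUM_TOPICS))
for i, bow in enumerate(corpus):
    for topic_id, weight in ldamodel.get_document_topics(bow): # (topic index, weight) pairs for this plot
        topic_weights[i, topic_id] = weight                    # topics not returned keep a weight of zero
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
# String column names avoid mixing integer and string labels after the join below.
tm = pd.DataFrame(topic_weights, columns=[f'topic_{k}' for k in range(NUM_TOPICS)])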

In [None]:
tm.head() # This is a DataFrame with the weights from the Gensim topic model.
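The loop that fills the topic-weight table relies on the model returning only some topics for each document, as sparse (topic index, weight) pairs. A minimal stand-alone sketch of that sparse-to-dense step, with a made-up input and a hypothetical helper name:

```python
def dense_topic_row(sparse_topics, num_topics):
    """Expand sparse (topic_id, weight) pairs into a fixed-length row of weights."""
    row = [0.0] * num_topics
    for topic_id, weight in sparse_topics:
        row[topic_id] = weight
    return row

# Made-up output for one document: only topics 1 and 3 were reported
sparse = [(1, 0.7), (3, 0.3)]
print(dense_topic_row(sparse, 5))
```

Every movie ends up with the same number of columns this way, which is what allows the topic weights to be joined onto the other features.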

## Joining Topic-Modeled Synopses with Standard Features

In [None]:
df = main_df.join(tm)
len(df.columns)

## Model Building

In [None]:
X = df.drop(['binary_target'], axis=1)
y = df.binary_target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

train_test_split on X_scaled:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size =  0.2)

### Baselining:

In [None]:
dc = DummyClassifier().fit(X_train, y_train)

In [None]:
accuracy_score(y_test,dc.predict(X_test))
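A useful mental model for this baseline is "always predict the most common training label". The sketch below assumes that majority-class strategy; the default strategy of `DummyClassifier` has differed between scikit-learn versions, so the number from the cell above may not match this rule exactly. The helper name and labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)

# Made-up labels: class 1 is the majority in training
baseline = majority_baseline_accuracy([1, 1, 0], [1, 0, 1, 1])
print(baseline)
```

Any real model should beat this figure; with a target split at the mean, the baseline is expected to land somewhere near one half.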

Dimensionality Reduction with SVD - For the PlotText Data:

In [None]:
svd = TruncatedSVD(n_components=50, n_iter=3, random_state=42)
X_train_pca = pd.DataFrame(svd.fit_transform(X_train))
print(svd.explained_variance_ratio_.sum())

Next we'll try a simple Gaussian Naive Bayes Model:

In [None]:
clf = GaussianNB()
clf.fit(X_train_pca, y_train)
y_preds = clf.predict(pd.DataFrame(svd.transform(X_test)))
print(metrics.classification_report(y_test, y_preds))
test_accuracy = accuracy_score(y_test,y_preds)
print("Test accuracy: {:.4}%".format(test_accuracy * 100))

__________

Now we'll go nuts and try an XGBClassifier model. A boosted model seems to work better without the SVD dimensionality reduction, so we'll drop it for this part.

In [None]:
clf = xgb.XGBClassifier(n_jobs=-1)
clf.fit(X_train, y_train)

training_preds = clf.predict(X_train)
test_preds = clf.predict(X_test)

training_accuracy = accuracy_score(y_train, training_preds)
test_accuracy = accuracy_score(y_test, test_preds)

print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Test accuracy: {:.4}%".format(test_accuracy * 100))

In [None]:
y_score = clf.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score[:,1])
print('AUC: {}'.format(auc(fpr, tpr)))
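The AUC printed above has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up scores; it is a brute-force illustration under that definition, not how scikit-learn computes it internally.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities
toy_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

An AUC of 0.5 therefore means the scores rank positives above negatives no better than chance, which is the dashed diagonal in the ROC plot.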

In [None]:
def draw_roc_curve(fpr,tpr):
    sns.set_style("darkgrid", {"axes.facecolor": ".9"})
    print('AUC: {}'.format(auc(fpr, tpr)))
    plt.figure(figsize=(10,8))
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve')
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.yticks([i/20.0 for i in range(21)])
    plt.xticks([i/20.0 for i in range(21)])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
draw_roc_curve(fpr,tpr)

Looks like we're off to an okay start!

## Grid Searching for better parameters

In [None]:
# svd_components = [100,500,1000]
# itera = [3, 5]

n_est = [50,80,100]
max_depth = [2,3,4]
learning_rates = [.08, .1, .15]

pipe = Pipeline([
#     ('reduce_dim', TruncatedSVD()),
    ('classify', XGBClassifier())
])

param_grid = [
    {
#         'reduce_dim__n_components': svd_components,
#         'reduce_dim__n_iter': itera,
        'classify__n_estimators': n_est,
        'classify__max_depth': max_depth,
        'classify__n_jobs': [-1],
        'classify__learning_rate': learning_rates
    }]

score = {'f1': 'f1', 
         'accuracy': 'accuracy'}

grid_adc = GridSearchCV(pipe, 
                        n_jobs=-1, 
                        param_grid=param_grid, 
                        scoring=score, 
                        refit='accuracy',
                        verbose=10)

grid_adc.fit(X_train, y_train)
grid_adc.best_params_
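It can be worth estimating the cost of a grid search before launching it. The sketch below enumerates the same three parameter lists and counts the model fits, assuming 5-fold cross-validation (the `GridSearchCV` default in recent scikit-learn releases when `cv` is not set).

```python
from itertools import product

n_est = [50, 80, 100]
max_depth = [2, 3, 4]
learning_rates = [.08, .1, .15]

# Every combination of the three lists is one candidate model
candidates = list(product(n_est, max_depth, learning_rates))
cv_folds = 5   # assumption: default number of folds
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

One further fit on the full training set follows at the end, because `refit` is set. Adding a fourth parameter with three values would triple the total, which is a good reason to keep grids small.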

In [None]:
# Check the gridsearch results:
y_preds = grid_adc.predict(X_test)
accuracy_score(y_test, y_preds)

In [None]:
y_score = grid_adc.predict_proba(X_test)
fpr_gs, tpr_gs, thresholds = roc_curve(y_test, y_score[:,1])
print('AUC: {}'.format(auc(fpr_gs, tpr_gs)))

In [None]:
plt.figure(0).clf()
plt.figure(figsize=(10,8))
lw = 2

plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve XGBClassifier')
plt.plot(fpr_gs,tpr_gs,
         label='Grid Searched XGBClassifier')

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc=0)  # a single legend call; loc=0 picks the 'best' placement

Negligible improvement.