In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import string
import omdb
from omdb import OMDBClient
from bs4 import BeautifulSoup
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.metrics import r2_score,mean_squared_error

In [2]:
client = omdb.OMDBClient(apikey='9b6f6a00')

First, I import the IMDb dataset, which I downloaded here: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+movies.csv

Thank you stefanoleone992!

In [3]:
movies = pd.read_csv("IMDB movies.csv")

As the shape below shows, the dataset consists of over 85,000 movies and has a variety of data on each one, including language, genres, directors, actors, and more. You may notice, however, that a rather important piece of data is missing: whether or not each film won an Oscar. This gives me the opportunity to collect that data myself.

In [4]:
movies.shape

(85855, 22)

To append the award data to this dataset, I am going to use the OMDb API. Essentially, it's an unofficial IMDb API, but it works well. Since I have the IMDb ID for each film, I will simply query the API with the ID and grab the "awards" field of the response. I will clean it up later with a regular expression, after which it will tell me how many Oscars each film has won.

In [5]:
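def getAward(imdbId):
    # Look up one film by IMDb ID and return its raw awards summary string
    try:
        return client.imdbid(imdbId)["awards"]
    except Exception:
        # Any failed lookup (network error, unknown ID, missing field) gets a placeholder
        return "Not found"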

In [6]:
movies["awards"] = movies["imdb_title_id"].apply(getAward)
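
Since the step above makes one API call per film, it can help to try the logic offline first. Below is a minimal sketch, assuming the lookup returns a dict-like object with an "awards" key. The names `fetch_awards` and `fake_lookup` and the IDs are hypothetical, and the lookup is passed in as an argument so that nothing here touches the network.

```python
def fetch_awards(imdb_ids, lookup):
    """Map each ID to its awards text; `lookup` stands in for client.imdbid."""
    results = {}
    for imdb_id in imdb_ids:
        try:
            results[imdb_id] = lookup(imdb_id)["awards"]
        except Exception:
            # Same placeholder as getAward for anything that fails
            results[imdb_id] = "Not found"
    return results


def fake_lookup(imdb_id):
    # Hypothetical canned response used instead of a real API call
    canned = {"tt0000001": {"awards": "Won 2 Oscars. Another 112 wins & 103 nominations."}}
    return canned[imdb_id]


awards = fetch_awards(["tt0000001", "tt0000002"], fake_lookup)
print(awards)
```

Passing `client.imdbid` as `lookup` instead of the fake would run the same loop against the real API.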

I would like to build a random forest regressor to predict how many Oscars a given film will win, based on a variety of factors that I think could be good indicators of an award-winning movie. Genre and language are both aspects that I think would be important in this prediction. In fact, Parasite was the very first non-English-language film to win the Oscar for Best Picture.

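However, the data, as it stands, is not properly formatted. Genre and language each currently exist as a single comma-separated string per film (for example, "Biography, Crime, Drama"). I am going to use binary (one-hot) encoding, with one 0/1 column per genre and per language, to give the regression model a way to quantify this categorical data.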

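First, I split each string on commas and stack the pieces so that every genre or language gets its own row, keyed by the film's ID. Then I use the strip function to remove the white space on either side of each value.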

In [7]:
cleaned_genre = movies.set_index(['imdb_title_id'])["genre"].str.split(',',expand=True).stack()

In [8]:
cleaned_lang = movies.set_index(['imdb_title_id'])["language"].str.split(',',expand=True).stack()

In [9]:
cleaned_lang = cleaned_lang.apply(lambda x : x.strip())

In [10]:
cleaned_genre = cleaned_genre.apply(lambda x : x.strip())
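
To make the reshaping above concrete, here is a small sketch on a made-up three-film frame (the IDs are hypothetical). It follows the same split, stack, and strip steps. The explicit `dropna()` is my addition, so the result does not depend on whether `stack` drops the padding values itself.

```python
import pandas as pd

# Toy stand-in for the movies dataframe
toy = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2", "tt3"],
    "genre": ["Biography, Crime, Drama", "Drama, History", "Romance"],
})

long_genre = (
    toy.set_index("imdb_title_id")["genre"]
       .str.split(",", expand=True)   # one column per position in the list
       .stack()                       # one row per (film, position)
       .dropna()                      # discard padding for shorter lists
       .str.strip()                   # remove the space left after each comma
)
print(long_genre)
```

Each film now appears once per genre, which is the shape get_dummies needs in the next step.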

Now I will actually begin the binary encoding process, using the get_dummies function to turn these values into indicator columns. Grouping by film ID and summing collapses the rows back to one per film. This creates a new dataframe containing just each film's ID and binary indications of whether or not it is in a certain language.

In [11]:
lang_encoded = pd.get_dummies(cleaned_lang, prefix='l').groupby(level=0).sum().reset_index()
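
As an illustration of what this line produces, here is a sketch on a made-up long-format series (two films, hypothetical IDs) with the same two-level index that the stacking step creates.

```python
import pandas as pd

# One row per (film, position), as produced by split + stack
long_lang = pd.Series(
    ["German", "Dutch", "French"],
    index=pd.MultiIndex.from_tuples(
        [("tt1", 0), ("tt1", 1), ("tt2", 0)], names=["imdb_title_id", None]
    ),
)

encoded = (
    pd.get_dummies(long_lang, prefix="l")  # one indicator column per language
      .groupby(level=0)                    # group rows belonging to the same film
      .sum()                               # collapse to one row of 0/1 flags per film
      .reset_index()
)
print(encoded)
```

Film tt1 ends up with a 1 in both `l_German` and `l_Dutch`, which is how a multilingual film is represented.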

Now I will merge the encoded languages for each movie onto the original movie dataframe.

In [12]:
movies = movies.merge(lang_encoded, on="imdb_title_id")
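
One thing to keep in mind is that merge performs an inner join by default, so any film without a row in the encoded frame is dropped from the result. The sketch below uses made-up frames to show the difference. The `how="left"` variant is an alternative, not what this notebook does.

```python
import pandas as pd

films = pd.DataFrame({"imdb_title_id": ["tt1", "tt2", "tt3"], "title": ["A", "B", "C"]})
lang = pd.DataFrame({"imdb_title_id": ["tt1", "tt3"], "l_English": [1, 0], "l_French": [0, 1]})

# Default inner join: tt2 has no encoded row, so it disappears
inner = films.merge(lang, on="imdb_title_id")

# Left join keeps every film; missing flags are filled with 0
left = films.merge(lang, on="imdb_title_id", how="left")
flag_cols = ["l_English", "l_French"]
left[flag_cols] = left[flag_cols].fillna(0).astype(int)

print(len(inner), len(left))
```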

I will repeat this same process in order to binary encode the genre values.

In [13]:
genre_encoded = pd.get_dummies(cleaned_genre, prefix='g').groupby(level=0).sum().reset_index()

In [14]:
movies = movies.merge(genre_encoded, on="imdb_title_id")

Now I will take a look at the awards column. The API calls I ran earlier simply pulled the full awards summary for each film, usually as a sentence (e.g. "Won 2 Oscars. Another 112 wins & 103 nominations."). Rather than slow down the data collection by extracting the number of Oscars as it went, I opted to wait until now to use a regular expression to extract the number of Oscars the film won. I put this in a separate column, "oscars_won".

Obviously this will yield NaN values for any films whose awards text does not match the regular expression (and thus did not win Oscars), so I will go ahead and fill those NaN values with 0.

In [15]:
# Capture the number between "Won" and "Oscar"; rows that do not match become NaN
movies["oscars_won"] = movies["awards"].str.extract(r'Won (\d+) Oscar', expand=False)

In [16]:
movies["oscars_won"] = movies["oscars_won"].fillna(0)

Since I was working with strings, the numbers from the regular expression extraction would also be strings, so I convert those to integers here.

In [17]:
movies["oscars_won"] = movies["oscars_won"].astype(int)
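
To check the extraction logic in isolation, here is a sketch on a handful of made-up award strings (not rows from the dataset). It assumes the "Won N Oscar(s)" wording shown above. Converting with `pd.to_numeric` before filling is my choice here so that the fill value and the column share a numeric type.

```python
import pandas as pd

# Hypothetical award summaries covering the main cases
awards = pd.Series([
    "Won 2 Oscars. Another 112 wins & 103 nominations.",
    "Won 10 Oscars. Another 18 wins & 11 nominations.",
    "Won 1 Oscar. Another 5 wins.",
    "Nominated for 1 Oscar. Another 3 wins.",
    "Not found",
])

# Capture only the digits that sit between "Won" and "Oscar"
extracted = awards.str.extract(r"Won (\d+) Oscars?", expand=False)

# Non-matching rows are NaN; treat them as zero wins
oscars_won = pd.to_numeric(extracted).fillna(0).astype(int)
print(oscars_won.tolist())
```

The nomination-only string and the placeholder both come out as 0, and the two-digit count is captured whole.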

And then I go ahead and drop the original "awards" column because I have extracted the data I need from it and it will serve no purpose.

In [18]:
movies = movies.drop("awards", axis=1)

There is one movie that is oddly labeled with "TV Movie 2019" rather than an actual date published. Not only will this interfere with the conversion of dates to datetime objects, I'm also not particularly interested in having a TV movie mixed in with this dataset. So I will go ahead and just drop that row.

In [19]:
tvMovieIndex = movies[movies["date_published"]=="TV Movie 2019"].index

In [23]:
movies = movies.drop(tvMovieIndex)

Now I will convert the date published column, currently a string, to datetime objects. I would also like to use the month a movie came out rather than the full date in order to categorize the films more neatly, so I will create a "month_published" column from the converted "date_published" column.

In [24]:
movies["date_published"] = pd.to_datetime(movies["date_published"])

In [25]:
movies["month_published"] = movies["date_published"].dt.month
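
An alternative to removing the malformed row by hand (not what this notebook does) is to let the parser flag it. The sketch below uses made-up values and assumes the valid dates follow the YYYY-MM-DD pattern. With `errors="coerce"`, anything that does not parse becomes NaT and can be filtered out afterwards.

```python
import pandas as pd

dates = pd.Series(["1894-10-09", "TV Movie 2019", "2019-03-15"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Month number (1-12); NaN where the date could not be parsed
months = parsed.dt.month
print(parsed.isna().tolist(), months.tolist())
```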

In [26]:
movies["country"] = movies["country"].fillna('None')

In [27]:
movies["director"] = movies["director"].astype('category')

In [28]:
movies["director_cat"] = movies["director"].cat.codes

In [29]:
movies["American"] = movies["country"].str.contains("USA")

In [30]:
movies["American"] = movies["American"].fillna(0)

In [31]:
factors = ["duration", 'g_Action', 'g_Adult',
       'g_Adventure', 'g_Animation', 'g_Biography', 'g_Comedy', 'g_Crime',
       'g_Documentary', 'g_Drama', 'g_Family', 'g_Fantasy', 'g_Film-Noir',
       'g_History', 'g_Horror', 'g_Music', 'g_Musical', 'g_Mystery', 'g_News',
       'g_Reality-TV', 'g_Romance', 'g_Sci-Fi', 'g_Sport', 'g_Thriller',
       'g_War', 'g_Western', 'month_published'] + movies.columns[22:].tolist()
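
One caution about the list above: slicing by position with `movies.columns[22:]` takes every column added after the original 22. At this point that appears to include `oscars_won` itself, along with the `g_` columns and `month_published` that are already listed. A more defensive way to build the list is to select columns by prefix and explicitly exclude the target. The sketch below shows this on a toy frame. The helper name `feature_columns` is hypothetical.

```python
import pandas as pd


def feature_columns(df, target, base):
    """Return base columns plus g_/l_ indicator columns, without the target or repeats."""
    encoded = [c for c in df.columns if c.startswith(("g_", "l_"))]
    selected = []
    for col in list(base) + encoded:
        if col in df.columns and col != target and col not in selected:
            selected.append(col)
    return selected


# Toy frame with the same kinds of columns as the real one
toy = pd.DataFrame({
    "duration": [45, 70], "l_English": [1, 0], "g_Drama": [0, 1],
    "oscars_won": [0, 0], "month_published": [10, 12],
})
features = feature_columns(toy, "oscars_won", ["duration", "month_published"])
print(features)
```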

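Below I reset the index so that the row labels are consecutive again after the dropped row. The result has to be assigned back, since reset_index returns a new dataframe rather than modifying the existing one.

In [43]:
movies = movies.reset_index(drop=True)
movies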

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,g_Romance,g_Sci-Fi,g_Sport,g_Thriller,g_War,g_Western,oscars_won,month_published,director_cat,American
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,1,0,0,0,0,0,0,10,1129,True
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,0,0,0,0,0,0,0,12,5115,False
2,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,0,0,0,0,0,0,0,11,5068,True
3,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,0,0,0,0,0,0,0,3,9814,False
4,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913-01-01,"Biography, Drama",60,USA,English,Sidney Olcott,...,0,0,0,0,0,0,0,1,29464,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85016,tt9908390,Le lion,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,Ludovic Colbeau-Justin,...,0,0,0,0,0,0,0,1,19400,False
85017,tt9911196,De Beentjes van Sint-Hildegard,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",Johan Nijenhuis,...,0,0,0,0,0,0,0,2,15372,False
85018,tt9911774,Padmavyuhathile Abhimanyu,Padmavyuhathile Abhimanyu,2019,2019-03-08,Drama,130,India,Malayalam,Vineesh Aaradya,...,0,0,0,0,0,0,0,3,32817,False
85019,tt9914286,Sokagin Çocuklari,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,Ahmet Faik Akinci,...,0,0,0,0,0,0,0,3,493,False


I'll go ahead and create the train and the test subsets.

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    movies[factors], movies["oscars_won"], train_size=0.7, test_size=0.3, random_state=0
)

Next, I create the random forest regressor. It may seem strange to set n_estimators to just 1, but it provides similarly accurate results to a higher number and avoids fractional predictions. A forest averages the predictions of its trees, so with many trees a given film could receive a prediction of 8.564554 or so Oscars, which is impossible. A single fully grown tree instead returns the mean of the training targets in one leaf, which is typically a whole number here. By keeping it to just one, we avoid fractions without becoming much less accurate.

In [45]:
regressor = RandomForestRegressor(n_estimators=1, random_state=0)
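
If more trees are wanted, one option (not what this notebook does) is to round the averaged predictions to the nearest whole number and clip them at zero. The sketch below uses synthetic data, so the features and targets are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic binary features and integer "award counts"
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] * 2 + X[:, 1]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

raw = forest.predict(X)                               # averages across trees, may be fractional
rounded = np.clip(np.rint(raw), 0, None).astype(int)  # whole, non-negative counts
print(rounded[:10])
```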

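Here I fit the regressor to the training set, generate predictions for the test set, and check them with the root mean squared error (RMSE). This turned out to be a low ~0.02. That figure should be read with caution, though. Most films in the dataset win no Oscars, so even a model that always predicts zero would likely score a low RMSE. In addition, the `factors` list built from `movies.columns[22:]` includes the `oscars_won` column itself, which lets the tree read the answer directly from its inputs.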

In [49]:
regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=1, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

In [39]:
y_pred = regressor.predict(X_test)

In [40]:
mse = mean_squared_error(y_test, y_pred)

In [41]:
rmse = np.sqrt(mse)

In [42]:
rmse

0.02076665995106555
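
An RMSE is easier to interpret next to a baseline. The sketch below computes the RMSE of always predicting zero. The target is synthetic (1,000 made-up films, 10 of which win one Oscar), so the number says nothing about this dataset. The same helper could be applied to `y_test` to see how much of the score above comes from the class imbalance alone.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic target: almost every film wins nothing
y_true = np.zeros(1000)
y_true[:10] = 1

# Baseline that predicts zero Oscars for every film
baseline = rmse(y_true, np.zeros_like(y_true))
print(baseline)
```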