# Data Cleaning

Cleaning and parsing of data from the [NBCUniversal Analytics Challenge](http://sc.aisnet.org/conference2018/student-competitions/nbcuniversal-challenge/).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Loading Data

In [2]:
data = pd.read_csv('Data/NBCU-dataLaurel.csv')
data.head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
0,tt0010323,The Cabinet of Dr. Caligari,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",UNRATED,8.1,,15-Oct-97,Rialto Pictures,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...",42583,https://images-na.ssl-images-amazon.com/images...,Robert Wiene,19-Mar-21,67 min,"Fantasy, Horror, Mystery",1 nomination.,expressionism|somnambulist|avant-garde|hypnosi...,18000,0
1,tt0052893,Hiroshima Mon Amour,A French actress filming an anti-war film in H...,NOT RATED,8.0,,24-Jun-03,Rialto Pictures,"Emmanuelle Riva, Eiji Okada, Stella Dassas, Pi...",21154,https://images-na.ssl-images-amazon.com/images...,Alain Resnais,16-May-60,90 min,"Drama, Romance",Nominated for 1 Oscar. Another 6 wins & 5 nomi...,memory|atomic-bomb|lovers-separation|impossibl...,88300,0
2,tt0058898,Alphaville,A U.S. secret agent is sent to the distant spa...,NOT RATED,7.2,,20-Oct-98,Rialto Pictures,"Eddie Constantine, Anna Karina, Akim Tamiroff",17801,https://images-na.ssl-images-amazon.com/images...,Jean-Luc Godard,5-May-65,99 min,"Drama, Mystery, Sci-Fi",1 win.,dystopia|french-new-wave|satire|comic-violence...,220000,46585
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0


In [3]:
data.shape

(8468, 19)

These are the variables we have to work with:

imdbid: Unique Id used by IMDB to refer to the movie.

title: Title of the movie

plot: Movie plot summary

rating: MPAA Appropriate audience rating

imdb_rating: IMDB's voters' scoring of a movie on a scale from 1-10 (10 being best)

metacritic: Metacritic movie score on a scale of 0-100 (100 being best)

dvd_release: Movie release date on DVD

production: Principal production company

actors: Lead Actors

imdb_votes: Total votes from IMDB members

poster: Movie Poster artwork

director: Movie director

release_date: Theatrical Release Date

runtime: Runtime length of movie in minutes

genre: Genre Classification

awards: Academy awards & nominations

keywords: Keywords associated with the movie

Budget: Budget spent on movie production, marketing, and distribution

Box Office Gross: Box Office Gross Returns as of 9/21/2017

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null object
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: float64(2), object(17)
memory usage: 1.2+ MB


Notice how many of the variables are stored as generic objects. We'll have to convert several of them into more useful types.

### Parsing imdbid

The format we want for IMDb ids is just the 7 digits, without the leading tt.
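As a small illustration of that format, the helper below strips only a leading `tt` and checks that what remains is numeric. The name `normalize_imdb_id` is hypothetical and not part of the notebook; treat it as a sketch of a slightly stricter alternative to a blanket string replacement.

```python
def normalize_imdb_id(raw_id):
    """Return the numeric part of an IMDb title id such as 'tt0010323'."""
    # Remove the prefix only when it is actually at the start of the string.
    digits = raw_id[2:] if raw_id.startswith('tt') else raw_id
    # Fail loudly on anything that is not purely numeric.
    if not digits.isdigit():
        raise ValueError(f'unexpected IMDb id: {raw_id!r}')
    return digits
```

If this were used on the column, something like `data['imdbid'].map(normalize_imdb_id)` would apply it row by row.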

In [5]:
data['imdbid'] = data['imdbid'].str.replace('tt','')

### Parsing release_date and dvd_release

Next, we'll change release_date to a datetime-like type.

In [6]:
data['release_date'].head()

0    19-Mar-21
1    16-May-60
2     5-May-65
3    23-Sep-76
4     1-Jun-82
Name: release_date, dtype: object

In [7]:
pd.to_datetime(data['release_date'])[1],data['release_date'][1]

(Timestamp('2060-05-16 00:00:00'), '16-May-60')

This exposes an issue with pandas' to_datetime function: two-digit years are ambiguous, and it places old dates in the 21st century (here 1960 becomes 2060). Perhaps we need to use the datetime package per [this](https://stackoverflow.com/questions/16600548/how-to-parse-string-dates-with-2-digit-year).
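For reference, here is a minimal sketch of the same idea using only the standard datetime module. The function name `parse_release_date` and the `latest_year` parameter are made up for this example; the cutoff of 2019 is an assumption based on no movie in the dataset being released after that year.

```python
from datetime import datetime

def parse_release_date(text, latest_year=2019):
    """Parse a date such as '16-May-60', assuming no date is later than latest_year."""
    # %y maps two-digit years onto a fixed window, so old years can land in the future.
    parsed = datetime.strptime(text, '%d-%b-%y')
    # Anything past the cutoff must belong to the previous century.
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed
```

For instance, `parse_release_date('16-May-60')` gives a date in 1960, while `parse_release_date('23-Sep-76')` is left in 1976.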

In [8]:
dates = data['release_date']
dates = pd.to_datetime(dates)
opivot_index = dates[dates.apply(lambda x: x.year>2019)].index
for index in opivot_index:
    dates[index] = dates[index].replace(year = dates[index].year-100)
sum(dates>'2019-12-31')

0

In [9]:
data['release_date'] = dates
data = data.drop(data.index[data['release_date'].isna()])

In [10]:
dates = dates.dropna()

With that, we have a workaround for the default pivot year: any parsed date after 2019 is shifted back by 100 years.

It seems natural to also do the same for dvd_release.

In [11]:
dates = data['dvd_release']
dates = pd.to_datetime(dates)
opivot_index = dates[dates.apply(lambda x: x.year>2019)].index
for index in opivot_index:
    dates[index] = dates[index].replace(year = dates[index].year-100)
data['dvd_release'] = dates
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8283 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8283 non-null object
title               8283 non-null object
plot                8049 non-null object
rating              5223 non-null object
imdb_rating         7677 non-null float64
metacritic          5058 non-null float64
dvd_release         5313 non-null datetime64[ns]
production          6710 non-null object
actors              8011 non-null object
imdb_votes          7677 non-null object
poster              7874 non-null object
director            8222 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7754 non-null object
genre               8253 non-null object
awards              5216 non-null object
keywords            6302 non-null object
Budget              8283 non-null object
Box Office Gross    8283 non-null object
dtypes: datetime64[ns](2), float64(2), object(15)
memory usage: 1.3+ MB


### Parsing imdb_votes

Next, we have several important numerical variables that are currently stored as objects. First, we'll work with imdb_votes.

In [12]:
data['imdb_votes'].head()

0    42,583
1    21,154
2    17,801
3     5,705
4       132
Name: imdb_votes, dtype: object

So we need to convert imdb_votes to integers. Since each number contains commas, we cannot simply tell pandas to treat the entries as integers via the .astype() function. We first have to remove the commas and only then convert. We also have to take care to leave the missing values in imdb_votes alone, as we will deal with those later.

In [13]:
votes = data['imdb_votes']
votes_parse = votes.str.replace(',','')
votes_parse.head()

0    42583
1    21154
2    17801
3     5705
4      132
Name: imdb_votes, dtype: object

Then, we've successfully removed all of the commas. Let's confirm that we didn't lose any datapoints along the way.

In [14]:
[len(votes_parse), len(data['imdb_votes'])]

[8283, 8283]

In order to convert to int, we have to work around the missing values. Let's replace all the missing values with -1, convert, and then turn the -1 entries back into NaN. Since NaN is a float, the column ends up as float64 rather than int.
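An alternative sketch, assuming a pandas version that provides the nullable Int64 dtype (0.24 or later): it keeps missing values as missing without the -1 placeholder. The sample series is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up sample with one missing entry.
sample_votes = pd.Series(['42,583', '21,154', None, '132'])

# Remove the thousands separators as plain text (not as a regex).
stripped = sample_votes.str.replace(',', '', regex=False)

# to_numeric yields floats because of the missing value; 'Int64' (capital I)
# is the nullable integer dtype, so the gap is kept as a missing value.
votes_nullable = pd.to_numeric(stripped).astype('Int64')
```

The trade-off is that some downstream libraries may not expect the nullable dtype, in which case the float64 column is the safer choice.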

In [15]:
import numpy as np

votes_int = votes_parse.fillna(-1).astype('int')
votes_int[votes_int==-1] = np.nan
votes_int.isna().sum() == votes_parse.isna().sum()

True

In [16]:
data['imdb_votes'] = votes_int
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8283 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8283 non-null object
title               8283 non-null object
plot                8049 non-null object
rating              5223 non-null object
imdb_rating         7677 non-null float64
metacritic          5058 non-null float64
dvd_release         5313 non-null datetime64[ns]
production          6710 non-null object
actors              8011 non-null object
imdb_votes          7677 non-null float64
poster              7874 non-null object
director            8222 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7754 non-null object
genre               8253 non-null object
awards              5216 non-null object
keywords            6302 non-null object
Budget              8283 non-null object
Box Office Gross    8283 non-null object
dtypes: datetime64[ns](2), float64(3), object(14)
memory usage: 1.3+ MB


### Parsing Budget and Box Office Gross

Now we have to deal with 'Budget' and 'Box Office Gross' in a similar manner.

In [17]:
data[['Budget', 'Box Office Gross']].head()

Unnamed: 0,Budget,Box Office Gross
0,18000,0
1,88300,0
2,220000,46585
3,6590,0
4,0,0


Looks like we don't have to worry about any commas in 'Budget' or in 'Box Office Gross', so the conversions will be much simpler. In the previous revision of Data Cleaning, we found that entries with Box Office Gross values in foreign currencies were not worth saving, so we remove them in this revision.

In [18]:
data = data.drop(data[data['Box Office Gross'].str.contains('GBP')].index)
data = data.drop(data[data['Box Office Gross'].str.contains('EU')].index)
data = data.drop(data[data['Budget'].str.contains('EU')].index)
data = data.drop(data[data['Budget'].str.contains('CAD')].index)

Now that we've removed all datapoints with foreign currencies, we can start parsing Budget and Box Office Gross into numerical types.

In [19]:
gross = data['Box Office Gross']
gross = gross.str.replace(',','')

error_index = []

for i, item in enumerate(gross):
    try:
        int(item)
    except ValueError:
        error_index.append(i)

In [20]:
gross[error_index]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]


1384        0
1854        0
2880        0
3795        0
5014      NaN
5601    41709
5640        0
5744        0
5894        0
6187        0
6804      NaN
6975        0
7037        0
Name: Box Office Gross, dtype: object

This is peculiar: the value of Box Office Gross seems to change once we store that variable as gross (from these string entries to the number 0). On top of that, the original value in the dataset seems to be random text, perhaps a mis-scraped comment?

**Edit**: On review, enumerate yields positions (0, 1, 2, ...), not the DataFrame's index labels, so gross[error_index] above looked up the wrong rows. To find the erroneous entries, we need to use df.iloc.

When I review the data entries for 1394, it's hard to see what relevance '<strongitemprop="name">Grossbuttotallyworthwhile</strong>' has. So we'll remove these samples from the dataset.
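To make the position-versus-label distinction concrete, here is a small sketch on a made-up series whose labels have gaps, as ours do after dropping rows. It also shows `pd.to_numeric` with `errors='coerce'` as one possible way to read off the offending labels directly; that is an alternative, not what this notebook does.

```python
import pandas as pd

# Made-up data: labels 0, 3, 7, 9 mimic an index with dropped rows.
sample = pd.Series(['100', 'oops', '250', 'n/a'], index=[0, 3, 7, 9])

# enumerate() counts positions 0, 1, 2, ... and ignores the labels.
bad_positions = []
for position, item in enumerate(sample):
    try:
        int(item)
    except ValueError:
        bad_positions.append(position)

# Translate positions into labels before using .loc or .drop.
bad_labels = list(sample.index[bad_positions])

# Alternative: coerce unparseable strings to NaN and read off their labels.
coerced = pd.to_numeric(sample, errors='coerce')
bad_labels_alt = list(coerced[coerced.isna()].index)
```

Here the failing positions are 1 and 3, but the corresponding labels are 3 and 9, which is exactly the mismatch described above.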

In [21]:
error_index = data.iloc[error_index].index
data = data.drop(error_index)

Then, we can finally change box office gross and budget into integer formats.

In [22]:
data['Box Office Gross'] = data['Box Office Gross'].astype('int')
data['Budget'] = data['Budget'].astype('int')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8028 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8028 non-null object
title               8028 non-null object
plot                7808 non-null object
rating              5063 non-null object
imdb_rating         7454 non-null float64
metacritic          4885 non-null float64
dvd_release         5151 non-null datetime64[ns]
production          6521 non-null object
actors              7767 non-null object
imdb_votes          7454 non-null float64
poster              7639 non-null object
director            7968 non-null object
release_date        8028 non-null datetime64[ns]
runtime             7518 non-null object
genre               7998 non-null object
awards              5040 non-null object
keywords            6100 non-null object
Budget              8028 non-null int64
Box Office Gross    8028 non-null int64
dtypes: datetime64[ns](2), float64(3), int64(2), object(12)
memory usage: 1.2+ MB


### Parsing imdb_rating

Since imdb_rating has many unique values, we're going to round all the ratings down to the nearest whole number. We can train on the ranks themselves in decision trees, but we'll have to resort to one-hot encoding for regression, neural nets, etc.
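A rough sketch of what that could look like, using a made-up series of ratings. `pd.get_dummies` is one common way to one-hot encode in pandas; the `rating` prefix is an arbitrary choice for this example.

```python
import math
import pandas as pd

# Made-up ratings for illustration.
sample_ratings = pd.Series([8.1, 8.0, 7.2, 6.3])

# Round each rating down to its whole-number bucket.
rating_bucket = sample_ratings.apply(math.floor)

# One indicator column per bucket, for models that need categorical inputs.
rating_dummies = pd.get_dummies(rating_bucket, prefix='rating')
```

Tree-based models could use `rating_bucket` directly, while `rating_dummies` is the form regression or neural nets would take.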

In [23]:
data['imdb_rating'].isna().sum()

574

There are 574 data points missing imdb ratings. We'll try using imdb packages like [imdbpy](https://imdbpy.readthedocs.io/en/latest/) to fill in these values.

**Edit**: Turns out, imdbpy is only useful for getting the release *year* of a movie, not the release *date*. For now, it seems like all we can do is remove these data points.

In [24]:
na_index = data[data['imdb_rating'].isna()].index
data = data.drop(na_index)

In [25]:
import math

data['imdb_rating'] = data['imdb_rating'].apply(lambda x: math.floor(x))

### Parsing runtime

For runtime, we would like to remove the 'min' part of each entry. It's much better for analysis to just assume that runtime is measured in minutes.
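One way this could be done is sketched below with a hypothetical helper, `parse_runtime_minutes`. It assumes every non-missing entry looks like '67 min' and raises an error otherwise, so that any unexpected format is noticed rather than silently mangled.

```python
import re

def parse_runtime_minutes(text):
    """Turn a string such as '67 min' into the integer 67."""
    # Accept digits followed by 'min', with optional surrounding whitespace.
    match = re.fullmatch(r'\s*(\d+)\s*min\s*', text)
    if match is None:
        raise ValueError(f'unexpected runtime format: {text!r}')
    return int(match.group(1))
```

Missing values would need to be filled or dropped first, since the helper expects a string.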

In [26]:
data['runtime'].isna().sum()

139

In [27]:
from imdb import IMDb
ia = IMDb()

Unlike before, imdbpy actually has a feature to get movie runtimes from IMDb, so we should be able to fill in the missing values.

In [28]:
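na_index = data[data['runtime'].isna()].index
for index in na_index:
    # Look the movie up by its numeric id (the 'tt' prefix was stripped earlier)
    movie = ia.get_movie(data.loc[index, 'imdbid'])
    try:
        # Write to the 'runtime' column with a single .loc call so the
        # assignment reaches data itself rather than a temporary copy
        data.loc[index, 'runtime'] = movie['runtime']
    except KeyError:
        # No runtime available for this title: drop the row
        data = data.drop(index)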

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
2019-04-21 12:40:10,302 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1736647/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, e

  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,546 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2191641/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunke

2019-04-21 12:40:10,556 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2191641" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,569 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2196224/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chun

2019-04-21 12:40:10,580 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "main" info set for mopID "2220560" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,600 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2244697/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunke

2019-04-21 12:40:10,608 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2244697" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,625 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2251744/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chun

2019-04-21 12:40:10,641 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "main" info set for mopID "2255008" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,653 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2262212/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunke

2019-04-21 12:40:10,660 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2262212" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,672 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2304835/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chun

2019-04-21 12:40:10,682 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "main" info set for mopID "2316756" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,698 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2387820/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunke

2019-04-21 12:40:10,706 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2387820" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,717 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2398173/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chun

2019-04-21 12:40:10,728 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "main" info set for mopID "2401225" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

2019-04-21 12:40:10,745 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2404572/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(8, 'nodename nor servname provided, or not known'))},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunke

2019-04-21 12:40:10,755 CRITICAL [imdbpy] /usr/local/lib/python3.7/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2404572" (accessSystem: http)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_b

In [29]:
data[data['runtime'].isna()]

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
832,1935277,Road to Juarez,An American ex-con with Mexican underworld tie...,,5,,NaT,Mousetrap Films,"William Forsythe, Jacqueline Pinol, Pepe Serna...",43.0,http://ia.media-imdb.com/images/M/MV5BNjM3MjUw...,David A. Ponce de Leon,2015-04-24,,"Action, Thriller",,,0,0
1989,3305388,Mountain Top,,,8,,NaT,,"Barry Corbin, Coby Ryan McLaughlin, Valerie Az...",46.0,,Gary Wheeler,2014-05-05,,"Drama, Family, Mystery",,,750000,0
2267,3640942,Koyelaanchal,Koyelaanchal (coal belt of India) brings to li...,,5,,NaT,,"Kannan Arunachalam, Biswanath Basu, Vinod Khan...",171.0,https://images-na.ssl-images-amazon.com/images...,Ashu Trikha,2014-05-09,,"Action, Drama",1 nomination.,,0,0
2428,3822606,Ra Ra Krishnayya,Kittu alias Krishna Sundeep Kishan is a cab dr...,,5,,NaT,,"Jagapathi Babu, Ravi Babu, Tanikella Bharani, ...",82.0,https://images-na.ssl-images-amazon.com/images...,Mahesh P.,2014-07-04,,Romance,,,0,0
2530,3969208,Trust Fund,"Reese Donahue leads a seemingly ideal life, wi...",PG,7,,NaT,Transatlantic Films,"Matthew Alan, Jessica Rothe, Willie Garson, An...",22.0,https://images-na.ssl-images-amazon.com/images...,Sandra L. Martin,2016-01-08,,Drama,,,0,0
2588,4074296,From This Day Forward,When director Sharon Shattuck's father came ou...,,7,,NaT,Argot Pictures,,12.0,http://ia.media-imdb.com/images/M/MV5BMjAxNzkx...,Sharon Shattuck,2015-04-11,,"Documentary, Biography, Family",,f-rated,0,0
2840,4489160,You're My Boss,"A woman who is looking for acceptance, who's l...",,6,,NaT,,"Toni Gonzaga, Coco Martin, Freddie Webb, JM de...",102.0,https://images-na.ssl-images-amazon.com/images...,Antoinette Jadaone,2015-04-04,,"Comedy, Romance",2 wins & 11 nominations.,personal-assistant|acceptance|runner|boss|company,0,0
2900,4621100,Nanak Shah Fakir,Nanak Shah Fakir is a biographical film on the...,,9,,NaT,B4U US,"Arif Zakaria, Puneet Sikka, Adil Hussain, Anur...",107.0,https://images-na.ssl-images-amazon.com/images...,,2015-04-17,,Drama,3 wins.,,0,0
2914,4641602,365 Days,,,7,,NaT,,"Anand, Anaika Soti",17.0,http://ia.media-imdb.com/images/M/MV5BYzlkZTg3...,Ram Gopal Varma,2015-05-22,,Drama,,written-by-director|number-in-title,0,0
3069,4944460,Everyday I Love You,Two people bound together in the same journey ...,,7,,NaT,Star Cinema,"Gerald Anderson, Enrique Gil, Liza Soberano",195.0,https://images-na.ssl-images-amazon.com/images...,Mae Czarina Cruz,2015-11-06,,"Drama, Romance",2 nominations.,,0,0


In [30]:
na_index = data[data['runtime'].isna()].index
runtimes = []
for index in na_index:
    movie = ia.get_movie(data.loc[index]['imdbid'])
    try:
        runtimes.append(movie['runtime'][0])
    except KeyError:
        data = data.drop(index)

In [31]:
data['runtime'].isna().sum()

58

In [32]:
# Take an explicit copy so later assignments modify na_data itself rather than a slice of data
na_data = data[data['runtime'].isna()].copy()

In [33]:
for index in na_data.head().index:
    movie = ia.get_movie(na_data.loc[index, 'imdbid'])
    # movie['runtime'] is a list, so keep only its first entry
    na_data.loc[index, 'runtime'] = movie['runtime'][0]

In [34]:
na_data.head().index

Int64Index([832, 1989, 2267, 2428, 2530], dtype='int64')

In [35]:
movie = ia.get_movie('1935277')
na_data.loc[832, 'runtime'] = movie['runtime'][0]

In [36]:
runtimes1 = na_data['imdbid'].apply(lambda x: ia.get_movie(x)['runtime'][0])

In [37]:
na_data['runtime'] = na_data['runtime'].fillna(runtimes1)


In [38]:
data['runtime'] = data['runtime'].fillna(runtimes1)

In [39]:
data['runtime'].isna().sum()

0

Now that we've filled in every missing runtime we could, we can strip the ' min' suffix and convert the column to an integer type.

In [40]:
data['runtime'] = data['runtime'].str.replace(' min', '')
data['runtime'] = data['runtime'].astype('int')
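
The conversion above assumes every value is either digits or digits followed by ' min'. As a hedged alternative sketch, the digits can be extracted with a regular expression and converted with `pd.to_numeric`, which tolerates missing or malformed entries. The toy series below is illustrative only and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the 'runtime' column; these values are illustrative only.
runtime = pd.Series(['67 min', '90', None, '115 min'])

# Extract the first run of digits, then convert; anything unparseable becomes NaN.
minutes = pd.to_numeric(runtime.str.extract(r'(\d+)', expand=False), errors='coerce')

# The nullable 'Int64' dtype keeps whole numbers even when some values are missing.
minutes = minutes.astype('Int64')
print(minutes.tolist())
```

With this approach a row whose runtime is still missing stays missing instead of raising an error during the cast.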

### Parsing genre

Now let's parse through genre in order to use it as a categorical variable.

In [41]:
genres = data['genre']

Thankfully, imdbpy can retrieve genres for movies when they are available. We'll attempt to fill in the missing values and drop the rows we can't fill.

In [42]:
na_genres = genres[genres.isna()].index

In [43]:
genres = []
for index in na_genres:
    movie = ia.get_movie(data.loc[index]['imdbid'])
    try:
        genres.append(movie['genres'][0])
    except KeyError:
        data = data.drop(index)

In [44]:
na_data = data[data['genre'].isna()]
# Use the same 'genres' key as the loop above
genres1 = na_data['imdbid'].apply(lambda x: ia.get_movie(x)['genres'][0])

In [45]:
data['genre'] = data['genre'].fillna(genres1)

In [46]:
genre_index = data['genre'].index

In [47]:
for i in genre_index:
    parsed_genre = [x.strip() for x in data.loc[i]['genre'].split(',')]

In [48]:
genres = data['genre']

In [49]:
parsed_genres = genres.apply(lambda x: [y.strip() for y in x.split(',')])

In [50]:
genres = pd.Series(parsed_genres.sum()).unique()

Now that we've separated all of the distinct genres out, we have to create categorical variables for each.

In [51]:
for genre in genres:
    print(genre)

Fantasy
Horror
Mystery
Drama
Romance
Sci-Fi
Comedy
Crime
Action
Adventure
Animation
Family
History
Thriller
Biography
Documentary
Music
Sport
Western
Musical
War
News
Short


In [52]:
for i in range(len(parsed_genres[0])):
    print(parsed_genres[0][i])

Fantasy
Horror
Mystery


In [53]:
test = pd.DataFrame([genres])

In [54]:
data.shape[0]

7372

With every distinct genre in 'genre' identified, we can one-hot encode them: each genre gets its own indicator column, set to 1 for movies tagged with that genre and 0 otherwise.

In [55]:
genre_classes = pd.DataFrame(0, index = data.index, columns = genres)

In [56]:
parsed_genres[0][i]

'Mystery'

In [57]:
index = 8464

In [58]:
# Single .loc[row, columns] call so the assignment writes to genre_classes itself
genre_classes.loc[index, parsed_genres[index]] = 1

In [59]:
for index in data.index:
    # Flag every genre listed for this movie
    genre_classes.loc[index, parsed_genres[index]] = 1

In [60]:
data = pd.merge(data, genre_classes, left_index=True, right_index=True)

In [61]:
data[genres] = data[genres].astype('category')

In [62]:
data = data.drop('genre', axis = 1)
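
As a compact alternative to the explicit loop, pandas can build the same kind of indicator columns directly from a delimited string column with `Series.str.get_dummies`. The sketch below runs on a small made-up frame rather than the real data, and the names `toy`, `genre_dummies`, and `encoded` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the comma-separated 'genre' column (illustrative values).
toy = pd.DataFrame(
    {'genre': ['Fantasy, Horror, Mystery', 'Drama, Romance', 'Drama']},
    index=[0, 1, 4],
)

# Normalize the separator to '|' so stray spaces around commas don't matter,
# then let pandas create one 0/1 column per distinct genre.
normalized = toy['genre'].str.replace(r'\s*,\s*', '|', regex=True)
genre_dummies = normalized.str.get_dummies(sep='|')

# Attach the indicator columns and drop the original text column.
encoded = toy.drop('genre', axis=1).join(genre_dummies)
print(encoded)
```

Because the indicator frame shares the original index, joining on the index lines the rows up the same way the merge above does.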

### Parsing production

It is reasonable to believe that the production studio behind a movie is an important predictor of movie success. First, we'll see how many unique studios are in our dataset, then create categorical variables for each.

In [63]:
len(data['production'].unique())

1715

That is far too many distinct values to one-hot encode. Label encoding is not a good idea either, because integer labels would impose an ordering on the studios, and it's unclear whether any such ordering has a linear relationship with the target variables. For now we'll leave it alone.

### Adjusting budget and box office gross for inflation

In order to compare budget and box office gross values, we need to make sure we're using the same unit of measure. 2019 dollars are different from 2009 dollars, so we'll have to adjust for inflation where possible using the [cpi](https://github.com/datadesk/cpi) package.

In [64]:
import cpi

In [65]:
cpi.update()

In [66]:
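# Float columns, since inflation-adjusted values are generally not whole numbers
inflated = pd.DataFrame(0.0, index = data.index, columns = ['inflated_budget', 'inflated_gross'])
for index in data.index:
    year = data.loc[index, 'release_date'].year
    # .loc[row, column] writes to `inflated` itself instead of relying on chained indexing
    inflated.loc[index, 'inflated_budget'] = cpi.inflate(data.loc[index, 'Budget'], year)
    inflated.loc[index, 'inflated_gross'] = cpi.inflate(data.loc[index, 'Box Office Gross'], year)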

In [67]:
data = pd.merge(data, inflated, left_index = True, right_index = True)

We now have two new columns, "inflated_gross" and "inflated_budget", that represent "Box Office Gross" and "Budget" adjusted for inflation.
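
For intuition, the usual arithmetic behind this kind of adjustment is a ratio of price-index values: an amount is scaled by the index in the target year divided by the index in the original year. The sketch below illustrates only that arithmetic. The function name and the index values are hypothetical, chosen for easy numbers, and are not real CPI figures.

```python
def adjust_for_inflation(amount, cpi_then, cpi_now):
    """Scale a dollar amount by the ratio of two price-index values."""
    return amount * cpi_now / cpi_then

# Hypothetical index values chosen for easy arithmetic, NOT real CPI data.
cpi_by_year = {2009: 200.0, 2019: 250.0}

adjusted = adjust_for_inflation(1_000_000, cpi_by_year[2009], cpi_by_year[2019])
print(adjusted)  # 1250000.0
```

A budget of 0 stays 0 under this scaling, so the placeholder zeros in "Budget" and "Box Office Gross" are unaffected by the adjustment.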

### Parsing rating

In [68]:
data['rating'].head()

0      UNRATED
1    NOT RATED
2    NOT RATED
3          NaN
4          NaN
Name: rating, dtype: object

Right away, we need to understand the difference between 'UNRATED' and 'NOT RATED'.

According to [Wikipedia](https://en.wikipedia.org/wiki/Motion_Picture_Association_of_America_film_rating_system#MPAA_film_ratings), 'Not Rated' means that a movie was never submitted to receive a rating, and 'Unrated' means that a release may be different from a previously rated version. Based on that, we can reasonably assume that most of the movies without a rating were never submitted for one and should be marked 'Not Rated'. imdbpy doesn't have a function to retrieve MPAA ratings anymore, so for now all we can do is fill in the NaNs with 'NOT RATED'.

In [69]:
data['rating'].unique()

array(['UNRATED', 'NOT RATED', nan, 'PG', 'G', 'TV-PG', 'PG-13', 'R',
       'TV-14', 'TV-MA', 'M', 'NC-17', 'APPROVED', 'X', 'NR', 'TV-G',
       'TV-Y'], dtype=object)

In [70]:
data['rating'] = data['rating'].fillna('NOT RATED')

Now that we've taken care of the missing entries, we can encode each rating as its own category, since this is nominal data, not ordinal.

In [71]:
ratings = data['rating'].unique()

In [72]:
ratings_classes = pd.DataFrame(0, index = data.index, columns = ratings)
for index in data.index:
    # Single .loc[row, column] call so the assignment writes to ratings_classes itself
    ratings_classes.loc[index, data.loc[index, 'rating']] = 1

In [73]:
data = pd.merge(data, ratings_classes, left_index=True, right_index=True)

In [74]:
data[ratings] = data[ratings].astype('category')

In [75]:
data = data.drop('rating', axis=1)
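
Since each movie has exactly one rating, the same indicator columns can also be produced in one step with `pd.get_dummies`. This is a minimal sketch on a made-up frame, with hypothetical names `toy`, `rating_dummies`, and `encoded`, and is not the code used above.

```python
import pandas as pd

# Toy frame with a missing rating (illustrative values).
toy = pd.DataFrame({'rating': ['UNRATED', 'NOT RATED', None, 'PG']})

# Fill missing ratings first, mirroring the fillna step above.
toy['rating'] = toy['rating'].fillna('NOT RATED')

# One 0/1 column per distinct rating; dtype=int gives integers instead of booleans.
rating_dummies = pd.get_dummies(toy['rating'], dtype=int)

encoded = toy.drop('rating', axis=1).join(rating_dummies)
print(encoded)
```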

### Parsing awards

The final feature that we will be parsing is 'awards'. We're going to try to parse out how many awards a movie has won and how many awards a movie has been nominated for.

In [76]:
data['awards'].head()

0                                        1 nomination.
1    Nominated for 1 Oscar. Another 6 wins & 5 nomi...
2                                               1 win.
3                               1 win & 2 nominations.
4                                                  NaN
Name: awards, dtype: object

In [77]:
data['awards'] = data['awards'].str.lower()

# later research found that it would be easier to replace plurals with singulars for easier parsing.
data['awards'] = data['awards'].str.replace('oscars', 'oscar')
data['awards'] = data['awards'].str.replace('nominations', 'nomination')
data['awards'] = data['awards'].str.replace('wins', 'win')

In [78]:
data['awards'].isna().sum()

2359

There are 2359 movies that don't have an entry under awards. It's reasonable to assume that these movies were neither nominated for nor won any awards. For simplicity, we'll fill these in with 'none'.

In [79]:
data['awards'] = data['awards'].fillna('none')

In [80]:
(data['awards'].str.contains('win') | data['awards'].str.contains('won')).sum()

3621

Can we treat 'win' and 'won' as the same category? No, we need to differentiate by which awards a movie has won. Winning an Oscar or a Golden Globe is far more impressive than winning an arbitrary award.

In [81]:
data['awards'].str.contains('oscar').sum()

455

We also have to distinguish among the movies that mention 'oscar', because an Oscar nomination is different from an Oscar win.

In [82]:
data['awards'].str.contains('golden').sum()

129

The same logic applies to golden globe movies.

In [83]:
data['awards'].str.contains('nominat').sum()

4514

Nominations will also follow the same logic as won/win.

In [84]:
data['awards'][data['awards'].str.contains('won') & data['awards'].str.contains('golden')]

212     won 1 golden globe. another 2 win & 17 nominat...
213     won 1 golden globe. another 5 win & 17 nominat...
3684    won 1 golden globe. another 1 win & 9 nomination.
4106    won 1 golden globe. another 4 win & 22 nominat...
5243    won 1 golden globe. another 12 win & 25 nomina...
Name: awards, dtype: object
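
The filter above checks one won/award combination at a time. A hedged way to survey every award name that follows 'won' in one pass is to capture it with `Series.str.extract`. The strings below are made up for illustration in the normalized (lowercase, singular) form used here; they are not rows from the dataset.

```python
import pandas as pd

# Toy strings in the normalized form; invented for illustration.
toy_awards = pd.Series([
    'won 2 oscar. another 23 win & 22 nomination.',
    'won 1 golden globe. another 5 win & 17 nomination.',
    'nominated for 1 bafta film award. another 3 win.',
    '1 win & 2 nomination.',
    'none',
])

# Capture the count and the word that follows "won <number>" at the start of the string.
# Column 1 of the result holds the award word; non-matching rows become NaN.
won_award = toy_awards.str.extract(r'^won (\d+) (\w+)')[1]

# Tally the award names that appear after "won"; NaN rows are ignored by value_counts.
print(won_award.value_counts())
```

Run against the real column, a tally like this would show directly which award names ever appear after 'won'.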

Swapping various combinations of won/nominat with different award names shows that 'won' always pairs with a 'special' award (i.e. golden globe or oscar). Now we have to decide whether to encode these awards on a nominal or an ordinal scale.

For example, an ordinal encoding of 'oscar':
* 0: no nomination
* 1: nominated
* 2: won

The question we have to answer is: do we believe that this data is truly hierarchical in nature?

Yes: not being nominated for an Oscar is worse than being nominated for one, and only being nominated is worse than winning one, so an ordinal scale fits.
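
The 0/1/2 scale described above can be sketched as a small helper. `ordinal_award` is a hypothetical name introduced here for illustration, and the sketch assumes each awards string names at most one 'special' award in its leading clause, which is what the samples in this notebook suggest.

```python
import re

def ordinal_award(text, award):
    """Return 2 if `award` was won, 1 if it was only nominated, 0 otherwise."""
    # Lowercase, strip punctuation, and split into tokens.
    tokens = re.sub(r'[^\w\s]', '', text.lower()).split()
    if award not in tokens:
        return 0
    if 'won' in tokens:
        return 2
    if 'nominated' in tokens:
        return 1
    return 0

# Illustrative strings, not rows from the dataset.
examples = [
    'won 2 oscar. another 23 win & 22 nomination.',
    'nominated for 1 oscar. another 6 win & 5 nomination.',
    '1 win & 2 nomination.',
]
for text in examples:
    print(ordinal_award(text, 'oscar'), text)
```

Note that the plain 'win' and 'nomination' tokens do not affect the result; only 'won' and 'nominated' mark a special award.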

In [85]:
data['awards'][data['awards'].str.contains('won')]

8            won 2 oscar. another 23 win & 22 nomination.
11           won 3 oscar. another 28 win & 17 nomination.
12           won 6 oscar. another 39 win & 66 nomination.
14         won 11 oscar. another 110 win & 74 nomination.
41           won 2 oscar. another 59 win & 91 nomination.
43         won 2 oscar. another 108 win & 242 nomination.
50          won 4 oscar. another 78 win & 129 nomination.
59          won 1 oscar. another 65 win & 152 nomination.
89             won 1 oscar. another 3 win & 9 nomination.
102          won 1 oscar. another 28 win & 72 nomination.
108         won 1 oscar. another 41 win & 142 nomination.
144         won 1 oscar. another 91 win & 257 nomination.
149         won 2 oscar. another 72 win & 116 nomination.
162          won 2 oscar. another 23 win & 47 nomination.
164          won 2 oscar. another 32 win & 63 nomination.
168          won 1 oscar. another 57 win & 88 nomination.
170         won 3 oscar. another 93 win & 151 nomination.
175         wo

In [86]:
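# Look up the number that precedes the nomination token for one example row.
# Plurals were replaced with singulars above, so the token is 'nomination.' here.
words = data['awards'].loc[6].split()
if 'nomination.' in words:
    print(words[words.index('nomination.')-1])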

In [87]:
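import string

# Strip punctuation first so the token no longer carries a trailing period.
translator = str.maketrans('', '', string.punctuation)
words = data['awards'].loc[6].translate(translator).split()
# Plurals were replaced with singulars above, so look for 'nomination'.
if 'nomination' in words:
    print(words[words.index('nomination')-1])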

In [88]:
translator = str.maketrans('', '', string.punctuation)
data['awards'].loc[6].translate(translator).split()

['nominated', 'for', '2', 'oscar', 'another', '6', 'win', '6', 'nomination']

In [89]:
import re

# Same tokenisation with a regular expression: drop every character that is not a word character or whitespace.
re.sub(r'[^\w\s]','',data['awards'].loc[6]).split()

['nominated', 'for', '2', 'oscar', 'another', '6', 'win', '6', 'nomination']

First, we'll split each of the awards sentences into a list of words and numbers with no punctuation. This format is much easier to parse through.

In [90]:
import re

data['awards'] = data['awards'].apply(lambda x: re.sub(r'[^\w\s]','',x).lower().split())

Next, we want to find out how many unique awards we have to account for.

In [91]:
data['awards'][data['awards'].apply(lambda x: 'nominated' in x)].apply(lambda x: x[3]).unique()

array(['oscar', 'golden', 'bafta', 'primetime'], dtype=object)

Because plurals were replaced with singulars earlier, 'oscar' and 'oscars' collapse into a single entry here. We would also like to parse out how many other miscellaneous wins and nominations each movie received. At this point we want to preserve as much information as we can, so we encode those counts as well.

In [92]:
awards = ['oscar', 'golden', 'bafta', 'primetime', 'win', 'nomination']
awards = pd.DataFrame(0, index = data.index, columns = awards)
special_awards = ['oscar', 'golden', 'bafta', 'primetime']
results = ['won', 'nominated']

In [93]:
#data['awards'][data['awards'].apply(lambda x: 'oscar' in x or 'oscars' in x)]
#awards['oscar'].loc[data['awards'][data['awards'].apply(lambda x: ('oscar' in x or 'oscars' in x) and 'won' in x)].index]

In [94]:
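for award in special_awards:
    for result in results:
        # Rows whose token list contains both the award name and the result word.
        indices = data[data['awards'].apply(lambda x: award in x and result in x)].index
        # Ordinal encoding: 2 = won, 1 = nominated (0 = neither, the default).
        # Assign with a single .loc[rows, column] call to avoid chained assignment.
        if result == 'won':
            awards.loc[indices, award] = 2
        elif result == 'nominated':
            awards.loc[indices, award] = 1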

In [95]:
data['awards'][data['awards'].apply(lambda x: 'nomination' in x)].index

Int64Index([   0,    1,    3,    5,    6,    7,    8,    9,   10,   11,
            ...
            8446, 8449, 8450, 8451, 8453, 8456, 8460, 8462, 8464, 8467],
           dtype='int64', length=4510)

Now we parse out how many miscellaneous wins and nominations each movie received.

In [96]:
indices = data['awards'][data['awards'].apply(lambda x: 'nomination' in x)].index
for index in indices:
    # The count is the token immediately before 'nomination'.
    num_index = data['awards'].loc[index].index('nomination')-1
    awards.loc[index, 'nomination'] = int(data['awards'].loc[index][num_index])

In [97]:
indices = data['awards'][data['awards'].apply(lambda x: 'win' in x)].index
for index in indices:
    # The count is the token immediately before 'win'.
    num_index = data['awards'].loc[index].index('win')-1
    awards.loc[index, 'win'] = int(data['awards'].loc[index][num_index])
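
The two loops above read the number that sits directly before 'win' or 'nomination'. As a hedged alternative, the same counts could be pulled from the raw (untokenized) strings with `Series.str.extract`. This sketch works on made-up strings in the normalized form and is not the notebook's method.

```python
import pandas as pd

# Toy strings in the normalized (lowercase, singular) form; invented for illustration.
toy_awards = pd.Series([
    'won 2 oscar. another 23 win & 22 nomination.',
    '1 win & 2 nomination.',
    '1 nomination.',
    'none',
])

# Capture the digits immediately before each keyword; rows without a match become NaN.
counts = pd.DataFrame({
    'win': toy_awards.str.extract(r'(\d+) win', expand=False),
    'nomination': toy_awards.str.extract(r'(\d+) nomination', expand=False),
})

# Missing counts mean zero; convert the captured strings to integers.
counts = counts.fillna('0').astype(int)
print(counts)
```

The pattern `(\d+) nomination` does not match 'nominated for 2 oscar', because no digits come directly before 'nominated', so special-award counts stay out of the miscellaneous totals.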

In [98]:
data = pd.merge(data, awards, left_index = True, right_index = True)
data = data.drop('awards', axis = 1)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7372 entries, 0 to 8467
Data columns (total 63 columns):
imdbid              7372 non-null object
title               7372 non-null object
plot                7316 non-null object
imdb_rating         7372 non-null int64
metacritic          4880 non-null float64
dvd_release         5123 non-null datetime64[ns]
production          6304 non-null object
actors              7212 non-null object
imdb_votes          7372 non-null float64
poster              7299 non-null object
director            7362 non-null object
release_date        7372 non-null datetime64[ns]
runtime             7372 non-null int64
keywords            5870 non-null object
Budget              7372 non-null int64
Box Office Gross    7372 non-null int64
Fantasy             7372 non-null category
Horror              7372 non-null category
Mystery             7372 non-null category
Drama               7372 non-null category
Romance             7372 non-null category
Sci-Fi  

In [99]:
data.to_csv('Data/parsed_data.csv')
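
When the saved file is read back later, note that `to_csv` writes the index as an unnamed first column, and CSV does not store dtypes such as `category` or `datetime64`. A minimal sketch of a round trip on a toy frame follows; the frame and its values are assumptions for illustration, and an in-memory buffer stands in for the file on disk.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, similar in spirit to the filtered data above.
toy = pd.DataFrame(
    {'title': ['A', 'B'],
     'release_date': pd.to_datetime(['1982-06-01', '1965-05-05'])},
    index=[0, 4],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index; parse_dates restores the datetime column.
restored = pd.read_csv(buffer, index_col=0, parse_dates=['release_date'])
print(restored.dtypes)
```

Without `index_col=0`, the saved index comes back as an extra 'Unnamed: 0' column, and any `category` columns would need to be cast again after loading.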