# Cleaning Up The Data

The first major step of our data exploration is making sure that the selected data is clean and tidy for later analysis. The functions we need are stored in the custom Python file `data_cleaning.py`, and we will import them for use in the cleaning process. We will also need pandas, as usual.

In [1]:
import pandas as pd

Let's start by loading the basic title information from IMDB ([data/imdb.title.basics.csv](./data/imdb.title.basics.csv)) and looking at the first few entries.

In [2]:
imdb_title_basics = pd.read_csv('data/imdb.title.basics.csv')

In [3]:
imdb_title_basics.head(10)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


From here on out, any column holding a `tconst` id for IMDB will be renamed `imdb_id`.

In [4]:
imdb_title_basics.rename(columns={'tconst': 'imdb_id'}, inplace=True)

In [5]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


Looking at the above entries reveals some tasks that must be completed before the data is suitable for analysis. A few fields contain sporadic `NaN` values, and these must be handled appropriately, either by interpolating from other available data or by dropping the entries (which of the two is needed is the subject of further investigation). The genres field must also be split into lists of genres, instead of the comma-separated string that is given currently. First, though, we start by removing duplicates: some of the entries share a `primary_title`, and hence carry conflicting information about any one movie.

In [6]:
imdb_title_basics.drop_duplicates(subset='primary_title', inplace=True)

In [7]:
imdb_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136071 entries, 0 to 146143
Data columns (total 6 columns):
imdb_id            136071 non-null object
primary_title      136071 non-null object
original_title     136055 non-null object
start_year         136071 non-null int64
runtime_minutes    106598 non-null float64
genres             131180 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 7.3+ MB
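
Before relying on title-based deduplication, it can help to look at which rows actually collide. The following is a minimal sketch on an invented three-row frame (the ids, titles and years are made up for illustration and are not taken from the dataset); it only shows how `duplicated` and `drop_duplicates` behave.

```python
import pandas as pd

# Toy stand-in for the IMDB table; every value here is invented.
toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'primary_title': ['Bigfoot', 'Bigfoot', 'Minions'],
    'start_year': [2017, 2012, 2015],
})

# keep=False flags every row that shares a title, not just the later copies.
collisions = toy[toy.duplicated(subset='primary_title', keep=False)]

# drop_duplicates keeps the first row per title by default (keep='first').
deduped = toy.drop_duplicates(subset='primary_title')

print(collisions)
print(deduped)
```

In the toy frame the two colliding rows have different ids and years, so they could well be different films; dropping on title alone keeps whichever row happens to come first.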


Next we want to find invalid genre descriptors, i.e. anything that is not a string.

In [8]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].shape

(4891, 6)

In [9]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].genres.unique()

array([nan], dtype=object)

From this, we can see that there are 4891 movies in the IMDB dataset that are not supplied with genre descriptions, and moreover that each of these invalid descriptions is `NaN`. Given that genre is a major component of our analysis here, we should drop these entries from consideration.
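
Since the only non-string value turns out to be `NaN`, the same count can be obtained with the built-in missing-value test. This is a small sketch on an invented frame, shown only to illustrate that the two checks agree when `NaN` is the sole invalid value.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing genres entry; the values are invented.
toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'genres': ['Comedy,Drama', np.nan, 'Horror'],
})

# Type-based test: anything that is not a str.
not_str = toy.genres.apply(lambda g: not isinstance(g, str))

# Built-in missing-value test.
missing = toy.genres.isna()

n_missing = int(missing.sum())
same_rows = bool((not_str == missing).all())
print(n_missing, same_rows)
```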

In [10]:
imdb_title_basics.dropna(subset=['genres'], inplace=True)

In [11]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].shape

(0, 6)

Fantastic, we have removed all of the rows where the genres description was an invalid value. We might also want to check that the titles are all appropriately filled in.

In [12]:
imdb_title_basics[imdb_title_basics.primary_title.apply(lambda t: type(t) != str)].shape

(0, 6)

We have thus found that all of the titles in the dataframe are given as appropriate strings, so we shouldn't need to drop any rows with missing values. We may drop rows when we get to merging, but for now, this field of the dataset isn't producing any issues.

In [13]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


Now, let's drop the `runtime_minutes` column, as it isn't needed to answer our analytical questions.

In [14]:
imdb_title_basics.drop(columns='runtime_minutes', inplace=True)

In [15]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,Biography


Let's next split up the genres field into a list of separate genres, as opposed to the single comma-separated string that is given currently.
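
One detail worth keeping in mind for this step is that Python's `str.split` only splits on commas if the comma is passed as the delimiter. A plain-Python illustration on one of the genre strings shown above:

```python
raw = "Action,Crime,Drama"

# With no argument, split() breaks on whitespace only, so the commas survive
# and the result is a one-element list.
on_whitespace = raw.split()

# Passing the delimiter gives one list entry per genre.
on_comma = raw.split(',')

print(on_whitespace)  # ['Action,Crime,Drama']
print(on_comma)       # ['Action', 'Crime', 'Drama']
```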

In [16]:
imdb_title_basics.genres = imdb_title_basics.genres.apply(lambda g: g.split(','))

In [17]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"[Action, Crime, Drama]"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"[Biography, Drama]"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,[Drama]
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"[Comedy, Drama]"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"[Comedy, Drama, Fantasy]"
5,tt0111414,A Thin Life,A Thin Life,2018,[Comedy]
6,tt0112502,Bigfoot,Bigfoot,2017,"[Horror, Thriller]"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"[Adventure, Animation, Comedy]"
8,tt0139613,O Silêncio,O Silêncio,2012,"[Documentary, History]"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,[Biography]


Alright, this dataset is looking pretty good. Genres are now correctly represented inside of lists, as will be needed for later analysis.

In [18]:
imdb_title_basics.shape

(131180, 5)

As we can see, the IMDB dataset contains a large number of entries. Because of its size, we will use it as a sort of "master" dataset, to which our additionally gathered revenue information will be attached. Let's continue to pare down the master dataset by removing any entries whose `start_year` is later than 2019, i.e. movies that have not yet been started.

In [19]:
imdb_title_basics = imdb_title_basics[imdb_title_basics.start_year <= 2019]

In [20]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"[Action, Crime, Drama]"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"[Biography, Drama]"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,[Drama]
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"[Comedy, Drama]"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"[Comedy, Drama, Fantasy]"
5,tt0111414,A Thin Life,A Thin Life,2018,[Comedy]
6,tt0112502,Bigfoot,Bigfoot,2017,"[Horror, Thriller]"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"[Adventure, Animation, Comedy]"
8,tt0139613,O Silêncio,O Silêncio,2012,"[Documentary, History]"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,[Biography]


----
Let's take a look at a dataset that contains some budget information, from TMDB, another online movie database. The dataset is stored in [data/tmdb.movies.csv](./data/tmdb.movies.csv).

In [21]:
tmdb_movies = pd.read_csv('data/tmdb.movies.csv')

In [22]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

We can easily drop a number of these columns for ease of readability.

In [23]:
tmdb_movies.head(10)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799900.0,1385749000.0
5,281957,tt1663202,9.1107,135000000,532950503,The Revenant,Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn...,http://www.foxmovies.com/movies/the-revenant,Alejandro González Iñárritu,"(n. One who has returned, as if from the dead.)",...,"In the 1820s, a frontiersman, Hugh Glass, sets...",156,Western|Drama|Adventure|Thriller,Regency Enterprises|Appian Way|CatchPlay|Anony...,12/25/2015,3929,7.2,2015,124199900.0,490314200.0
6,87101,tt1340138,8.654359,155000000,440603537,Terminator Genisys,Arnold Schwarzenegger|Jason Clarke|Emilia Clar...,http://www.terminatormovie.com/,Alan Taylor,Reset the future,...,"The year is 2029. John Connor, leader of the r...",125,Science Fiction|Action|Thriller|Adventure,Paramount Pictures|Skydance Productions,6/23/2015,2598,5.8,2015,142599900.0,405355100.0
7,286217,tt3659388,7.6674,108000000,595380321,The Martian,Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ...,http://www.foxmovies.com/movies/the-martian,Ridley Scott,Bring Him Home,...,"During a manned mission to Mars, Astronaut Mar...",141,Drama|Adventure|Science Fiction,Twentieth Century Fox Film Corporation|Scott F...,9/30/2015,4572,7.6,2015,99359960.0,547749700.0
8,211672,tt2293640,7.404165,74000000,1156730962,Minions,Sandra Bullock|Jon Hamm|Michael Keaton|Allison...,http://www.minionsmovie.com/,Kyle Balda|Pierre Coffin,"Before Gru, they had a history of bad bosses",...,"Minions Stuart, Kevin and Bob are recruited by...",91,Family|Animation|Adventure|Comedy,Universal Pictures|Illumination Entertainment,6/17/2015,2893,6.5,2015,68079970.0,1064192000.0
9,150540,tt2096673,6.326804,175000000,853708609,Inside Out,Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha...,http://movies.disney.com/inside-out,Pete Docter,Meet the little voices inside your head.,...,"Growing up can be a bumpy road, and it's no ex...",94,Comedy|Animation|Family,Walt Disney Pictures|Pixar Animation Studios|W...,6/9/2015,3935,8.0,2015,160999900.0,785411600.0


In [24]:
tmdb_movies.drop(columns=[
    'id',
    'cast', 
    'homepage', 
    'director', 
    'tagline', 
    'overview', 
    'keywords', 
    'production_companies',
    'runtime',
    'release_year'], inplace=True)

In [25]:
tmdb_movies.head(10)

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,genres,release_date,vote_count,vote_average,budget_adj,revenue_adj
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Action|Adventure|Science Fiction|Thriller,6/9/2015,5562,6.5,137999900.0,1392446000.0
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Action|Adventure|Science Fiction|Thriller,5/13/2015,6185,7.1,137999900.0,348161300.0
2,tt2908446,13.112507,110000000,295238201,Insurgent,Adventure|Science Fiction|Thriller,3/18/2015,2480,6.3,101200000.0,271619000.0
3,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Action|Adventure|Science Fiction|Fantasy,12/15/2015,5292,7.5,183999900.0,1902723000.0
4,tt2820852,9.335014,190000000,1506249360,Furious 7,Action|Crime|Thriller,4/1/2015,2947,7.3,174799900.0,1385749000.0
5,tt1663202,9.1107,135000000,532950503,The Revenant,Western|Drama|Adventure|Thriller,12/25/2015,3929,7.2,124199900.0,490314200.0
6,tt1340138,8.654359,155000000,440603537,Terminator Genisys,Science Fiction|Action|Thriller|Adventure,6/23/2015,2598,5.8,142599900.0,405355100.0
7,tt3659388,7.6674,108000000,595380321,The Martian,Drama|Adventure|Science Fiction,9/30/2015,4572,7.6,99359960.0,547749700.0
8,tt2293640,7.404165,74000000,1156730962,Minions,Family|Animation|Adventure|Comedy,6/17/2015,2893,6.5,68079970.0,1064192000.0
9,tt2096673,6.326804,175000000,853708609,Inside Out,Comedy|Animation|Family,6/9/2015,3935,8.0,160999900.0,785411600.0


This looks much more compact, and so we can turn our attention to fixing the field formatting and removing any invalid entries due to null values. To start, let's drop any duplicate movie entries by way of the `original_title` field.

In [26]:
tmdb_movies.drop_duplicates(subset='original_title', inplace=True)

In [27]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10571 entries, 0 to 10865
Data columns (total 11 columns):
imdb_id           10561 non-null object
popularity        10571 non-null float64
budget            10571 non-null int64
revenue           10571 non-null int64
original_title    10571 non-null object
genres            10548 non-null object
release_date      10571 non-null object
vote_count        10571 non-null int64
vote_average      10571 non-null float64
budget_adj        10571 non-null float64
revenue_adj       10571 non-null float64
dtypes: float64(4), int64(3), object(4)
memory usage: 991.0+ KB


It looks like we have a few null genres descriptions and IMDB ids. Let's drop these rows, to better facilitate eventual merging into our IMDB dataset.
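
To see why rows without an IMDB id are of little use for the eventual merge, here is a rough sketch on two invented frames. The choice of a left merge on `imdb_id` is an assumption made for illustration; the actual merge strategy is not fixed at this point.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for the IMDB master table and the TMDB table.
imdb_toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'primary_title': ['A', 'B', 'C'],
})
tmdb_toy = pd.DataFrame({
    'imdb_id': ['tt001', np.nan, 'tt003'],
    'revenue': [100, 200, 300],
})

# Rows without an id have nothing to join on, so drop them first.
tmdb_keyed = tmdb_toy.dropna(subset=['imdb_id'])

# A left merge (assumed here) keeps every IMDB row and attaches revenue
# wherever an id matches; unmatched rows get NaN.
merged = imdb_toy.merge(tmdb_keyed, on='imdb_id', how='left')
print(merged)
```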

In [28]:
tmdb_movies.dropna(subset=['genres', 'imdb_id'], inplace=True)

In [29]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10540 entries, 0 to 10865
Data columns (total 11 columns):
imdb_id           10540 non-null object
popularity        10540 non-null float64
budget            10540 non-null int64
revenue           10540 non-null int64
original_title    10540 non-null object
genres            10540 non-null object
release_date      10540 non-null object
vote_count        10540 non-null int64
vote_average      10540 non-null float64
budget_adj        10540 non-null float64
revenue_adj       10540 non-null float64
dtypes: float64(4), int64(3), object(4)
memory usage: 988.1+ KB


This looks great. Now we just need to do some formatting on our fields.

In [30]:
tmdb_movies.head(10)

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,genres,release_date,vote_count,vote_average,budget_adj,revenue_adj
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,Action|Adventure|Science Fiction|Thriller,6/9/2015,5562,6.5,137999900.0,1392446000.0
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Action|Adventure|Science Fiction|Thriller,5/13/2015,6185,7.1,137999900.0,348161300.0
2,tt2908446,13.112507,110000000,295238201,Insurgent,Adventure|Science Fiction|Thriller,3/18/2015,2480,6.3,101200000.0,271619000.0
3,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Action|Adventure|Science Fiction|Fantasy,12/15/2015,5292,7.5,183999900.0,1902723000.0
4,tt2820852,9.335014,190000000,1506249360,Furious 7,Action|Crime|Thriller,4/1/2015,2947,7.3,174799900.0,1385749000.0
5,tt1663202,9.1107,135000000,532950503,The Revenant,Western|Drama|Adventure|Thriller,12/25/2015,3929,7.2,124199900.0,490314200.0
6,tt1340138,8.654359,155000000,440603537,Terminator Genisys,Science Fiction|Action|Thriller|Adventure,6/23/2015,2598,5.8,142599900.0,405355100.0
7,tt3659388,7.6674,108000000,595380321,The Martian,Drama|Adventure|Science Fiction,9/30/2015,4572,7.6,99359960.0,547749700.0
8,tt2293640,7.404165,74000000,1156730962,Minions,Family|Animation|Adventure|Comedy,6/17/2015,2893,6.5,68079970.0,1064192000.0
9,tt2096673,6.326804,175000000,853708609,Inside Out,Comedy|Animation|Family,6/9/2015,3935,8.0,160999900.0,785411600.0


Let's start by splitting up the genres descriptions into an appropriate list (to match the style of the new IMDB dataset).

In [31]:
tmdb_movies.genres = tmdb_movies.genres.apply(lambda g: g.split('|'))

In [32]:
tmdb_movies.head(10)

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,genres,release_date,vote_count,vote_average,budget_adj,revenue_adj
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,"[Action, Adventure, Science Fiction, Thriller]",6/9/2015,5562,6.5,137999900.0,1392446000.0
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,"[Action, Adventure, Science Fiction, Thriller]",5/13/2015,6185,7.1,137999900.0,348161300.0
2,tt2908446,13.112507,110000000,295238201,Insurgent,"[Adventure, Science Fiction, Thriller]",3/18/2015,2480,6.3,101200000.0,271619000.0
3,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,"[Action, Adventure, Science Fiction, Fantasy]",12/15/2015,5292,7.5,183999900.0,1902723000.0
4,tt2820852,9.335014,190000000,1506249360,Furious 7,"[Action, Crime, Thriller]",4/1/2015,2947,7.3,174799900.0,1385749000.0
5,tt1663202,9.1107,135000000,532950503,The Revenant,"[Western, Drama, Adventure, Thriller]",12/25/2015,3929,7.2,124199900.0,490314200.0
6,tt1340138,8.654359,155000000,440603537,Terminator Genisys,"[Science Fiction, Action, Thriller, Adventure]",6/23/2015,2598,5.8,142599900.0,405355100.0
7,tt3659388,7.6674,108000000,595380321,The Martian,"[Drama, Adventure, Science Fiction]",9/30/2015,4572,7.6,99359960.0,547749700.0
8,tt2293640,7.404165,74000000,1156730962,Minions,"[Family, Animation, Adventure, Comedy]",6/17/2015,2893,6.5,68079970.0,1064192000.0
9,tt2096673,6.326804,175000000,853708609,Inside Out,"[Comedy, Animation, Family]",6/9/2015,3935,8.0,160999900.0,785411600.0


And finally let's turn our release date into a datetime object.

In [33]:
tmdb_movies.release_date = pd.to_datetime(tmdb_movies.release_date, format='%m/%d/%Y')

In [34]:
tmdb_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10540 entries, 0 to 10865
Data columns (total 11 columns):
imdb_id           10540 non-null object
popularity        10540 non-null float64
budget            10540 non-null int64
revenue           10540 non-null int64
original_title    10540 non-null object
genres            10540 non-null object
release_date      10540 non-null datetime64[ns]
vote_count        10540 non-null int64
vote_average      10540 non-null float64
budget_adj        10540 non-null float64
revenue_adj       10540 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(3)
memory usage: 988.1+ KB
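
The conversion above succeeds only if every string matches the given format. As a hedged aside, the sketch below uses an invented series with one deliberately bad value to show how `errors='coerce'` can surface unparseable dates as `NaT` instead of raising.

```python
import pandas as pd

# Two dates in the month/day/year style used above, plus one invalid value.
dates = pd.Series(['6/9/2015', '12/25/2015', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')

# Count how many values failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed)
print(n_bad)
```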


In [35]:
tmdb_movies.head(10)

Unnamed: 0,imdb_id,popularity,budget,revenue,original_title,genres,release_date,vote_count,vote_average,budget_adj,revenue_adj
0,tt0369610,32.985763,150000000,1513528810,Jurassic World,"[Action, Adventure, Science Fiction, Thriller]",2015-06-09,5562,6.5,137999900.0,1392446000.0
1,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,"[Action, Adventure, Science Fiction, Thriller]",2015-05-13,6185,7.1,137999900.0,348161300.0
2,tt2908446,13.112507,110000000,295238201,Insurgent,"[Adventure, Science Fiction, Thriller]",2015-03-18,2480,6.3,101200000.0,271619000.0
3,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,"[Action, Adventure, Science Fiction, Fantasy]",2015-12-15,5292,7.5,183999900.0,1902723000.0
4,tt2820852,9.335014,190000000,1506249360,Furious 7,"[Action, Crime, Thriller]",2015-04-01,2947,7.3,174799900.0,1385749000.0
5,tt1663202,9.1107,135000000,532950503,The Revenant,"[Western, Drama, Adventure, Thriller]",2015-12-25,3929,7.2,124199900.0,490314200.0
6,tt1340138,8.654359,155000000,440603537,Terminator Genisys,"[Science Fiction, Action, Thriller, Adventure]",2015-06-23,2598,5.8,142599900.0,405355100.0
7,tt3659388,7.6674,108000000,595380321,The Martian,"[Drama, Adventure, Science Fiction]",2015-09-30,4572,7.6,99359960.0,547749700.0
8,tt2293640,7.404165,74000000,1156730962,Minions,"[Family, Animation, Adventure, Comedy]",2015-06-17,2893,6.5,68079970.0,1064192000.0
9,tt2096673,6.326804,175000000,853708609,Inside Out,"[Comedy, Animation, Family]",2015-06-09,3935,8.0,160999900.0,785411600.0


----
Let's consider some more of the information that is provided by IMDB. In particular, we might be interested in the crews that work on particular films. Let's load up the respective dataset, and begin exploring.

In [36]:
imdb_title_crew = pd.read_csv('data/imdb.title.crew.csv')

In [37]:
imdb_title_crew.rename(columns={'tconst': 'imdb_id'}, inplace=True)
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
5,tt0879859,nm2416460,
6,tt0996958,nm2286991,"nm2286991,nm2651190"
7,tt0999913,nm0527109,"nm0527109,nm0329051,nm0001603,nm0930684"
8,tt10003792,nm10539228,nm10539228
9,tt10005130,nm10540239,"nm5482263,nm10540239"


In [38]:
imdb_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
imdb_id      146144 non-null object
directors    140417 non-null object
writers      110261 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


Let's begin to pare down the null values in our directors and writers columns. One consideration is whether to keep only those films that have both writers and directors. For the purposes of the crew analysis, which is secondary to the main analysis of genre trends, we will do exactly that. We can get the number of rows that have either null directors or null writers below.

In [39]:
imdb_title_crew[imdb_title_crew.directors.isna() | imdb_title_crew.writers.isna()].shape

(37136, 3)
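
The rows counted by this mask are the same rows that `dropna` removes when given both columns as a subset, because `dropna` defaults to `how='any'`. A minimal sketch on an invented crew table:

```python
import numpy as np
import pandas as pd

# Invented crew rows covering all four combinations of missing values.
crew_toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003', 'tt004'],
    'directors': ['nm1', np.nan, 'nm3', np.nan],
    'writers': ['nm1,nm2', 'nm5', np.nan, np.nan],
})

# Rows missing a director, a writer, or both.
incomplete = crew_toy.directors.isna() | crew_toy.writers.isna()

# dropna with a subset defaults to how='any', so it removes the same rows.
complete = crew_toy.dropna(subset=['directors', 'writers'])

print(int(incomplete.sum()), len(complete))
```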

Let's drop these rows from our dataset.

In [40]:
imdb_title_crew.dropna(subset = ['directors', 'writers'], inplace=True)

In [41]:
imdb_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109008 entries, 0 to 146142
Data columns (total 3 columns):
imdb_id      109008 non-null object
directors    109008 non-null object
writers      109008 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


In [42]:
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,nm0899854,nm0899854
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
6,tt0996958,nm2286991,"nm2286991,nm2651190"
7,tt0999913,nm0527109,"nm0527109,nm0329051,nm0001603,nm0930684"
8,tt10003792,nm10539228,nm10539228
9,tt10005130,nm10540239,"nm5482263,nm10540239"
10,tt10005378,nm9232888,nm9232888
11,tt10011102,nm4853354,"nm2215938,nm0219964"


Finally, let's separate our string lists into Python lists, as we have done previously.

In [43]:
imdb_title_crew.directors = imdb_title_crew.directors.apply(lambda d: d.split(','))
imdb_title_crew.writers = imdb_title_crew.writers.apply(lambda w: w.split(','))

In [44]:
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,[nm0899854],[nm0899854]
2,tt0462036,[nm1940585],[nm1940585]
3,tt0835418,[nm0151540],"[nm0310087, nm0841532]"
4,tt0878654,"[nm0089502, nm2291498, nm2292011]",[nm0284943]
6,tt0996958,[nm2286991],"[nm2286991, nm2651190]"
7,tt0999913,[nm0527109],"[nm0527109, nm0329051, nm0001603, nm0930684]"
8,tt10003792,[nm10539228],[nm10539228]
9,tt10005130,[nm10540239],"[nm5482263, nm10540239]"
10,tt10005378,[nm9232888],[nm9232888]
11,tt10011102,[nm4853354],"[nm2215938, nm0219964]"


This looks good! All of the entries in this table are non-null and have valid writers and directors in list format, whose name ids can be cross-referenced in [data/imdb.name.basics.csv](./data/imdb.name.basics.csv).
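
The cross-referencing mentioned above could look roughly like the following. This is only a sketch: the names table is invented, and its column names (`name_id`, `name`) are assumptions rather than the actual columns of the names file.

```python
import pandas as pd

# Invented crew table with directors already split into lists.
crew_toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002'],
    'directors': [['nm1'], ['nm2', 'nm3']],
})

# Hypothetical names table; the column names are assumed for illustration.
names_toy = pd.DataFrame({
    'name_id': ['nm1', 'nm2', 'nm3'],
    'name': ['Director One', 'Director Two', 'Director Three'],
})

# explode gives one row per (film, director) pair.
per_director = crew_toy.explode('directors')

# Attach a readable name to each director id.
named = per_director.merge(
    names_toy, left_on='directors', right_on='name_id', how='left'
)
print(named[['imdb_id', 'directors', 'name']])
```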

----
Let's also look at some ratings information from IMDB.

In [45]:
imdb_title_ratings = pd.read_csv('data/imdb.title.ratings.csv')

In [46]:
imdb_title_ratings.rename(columns={'tconst': 'imdb_id'}, inplace=True)
imdb_title_ratings.head(10)

Unnamed: 0,imdb_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571
8,tt1156528,7.2,265
9,tt1161457,4.2,148


In [47]:
imdb_title_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
imdb_id          73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


The info view reveals that every column contains only non-null values, so let's instead check whether any of the average ratings or vote counts are less than or equal to 0.

In [48]:
imdb_title_ratings[imdb_title_ratings.averagerating <= 0].shape

(0, 3)

In [49]:
imdb_title_ratings[imdb_title_ratings.numvotes <= 0].shape

(0, 3)

Alright, it looks like the records for the ratings dataset are all in good shape, and don't require any additional cleaning to be suitable for further analysis!
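
The two checks above can also be combined into a single mask. The sketch below uses an invented ratings frame with two deliberately bad rows, just to show the shape of such a check.

```python
import pandas as pd

# Invented ratings with one zero rating and one zero vote count.
ratings_toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'averagerating': [8.3, 0.0, 6.4],
    'numvotes': [31, 559, 0],
})

# A row is suspicious if either numeric field is not strictly positive.
suspicious = ratings_toy[
    (ratings_toy.averagerating <= 0) | (ratings_toy.numvotes <= 0)
]
print(suspicious.shape)
```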

----
The next step is to save the cleaned dataframes in pickle format (so that our lists are preserved) in the `cleaned_data` folder.

In [None]:
imdb_title_basics.to_pickle('cleaned_data/imdb_title_basics.pkl')
imdb_title_crew.to_pickle('cleaned_data/imdb_title_crew.pkl')
imdb_title_ratings.to_pickle('cleaned_data/imdb_title_ratings.pkl')
tmdb_movies.to_pickle('cleaned_data/tmdb_movies.pkl')
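
To illustrate why pickle is used here rather than CSV, the following sketch round-trips an invented one-row frame through both formats in a temporary directory and compares what comes back for the list-valued column.

```python
import os
import tempfile

import pandas as pd

# Invented frame with a list-valued column, like the cleaned genres field.
toy = pd.DataFrame({
    'imdb_id': ['tt001'],
    'genres': [['Comedy', 'Drama']],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy.pkl')
    csv_path = os.path.join(tmp, 'toy.csv')

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the list; CSV hands back its string representation.
pickle_value = from_pickle.genres.iloc[0]
csv_value = from_csv.genres.iloc[0]
print(type(pickle_value).__name__, type(csv_value).__name__)
```

Note that `to_pickle` writes to the path it is given and does not create missing directories, so the `cleaned_data` folder needs to exist before the cell above is run.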