# Cleaning Up The Data

The first major step of our data exploration involves making sure that our selected data is clean and tidy for our later analysis.

----

Let's start by importing the needed libraries.

In [282]:
import pandas as pd

Next, let's load the basic title information stored on IMDB ([data/imdb.title.basics.csv](./data/imdb.title.basics.csv)) and take a look at the first few entries.

In [283]:
imdb_title_basics = pd.read_csv('data/imdb.title.basics.csv')

In [284]:
imdb_title_basics.head(10)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


From here on out, any column holding a `tconst` id for IMDB will be renamed `imdb_id`.

In [285]:
imdb_title_basics.rename(columns={'tconst': 'imdb_id'}, inplace=True)

In [286]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


Looking at the above entries reveals some tasks that must be completed before the data is suitable for analysis. In particular, a few fields contain sporadic `NaN` values, which must be handled appropriately, either by interpolating from other available data or by dropping the affected entries (which of the two is needed is the subject of further investigation). Moreover, the `genres` field must be split into lists of genres, instead of the comma-separated string that is given currently.

First, though, we start by removing duplicates. Some of the entries share the same `primary_title`, and hence carry conflicting information about any one movie.

In [287]:
imdb_title_basics.drop_duplicates(subset='primary_title', inplace=True)

In [288]:
imdb_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136071 entries, 0 to 146143
Data columns (total 6 columns):
imdb_id            136071 non-null object
primary_title      136071 non-null object
original_title     136055 non-null object
start_year         136071 non-null int64
runtime_minutes    106598 non-null float64
genres             131180 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 7.3+ MB
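Before dropping duplicates, it can help to look at which rows actually collide. The following is a minimal sketch on a made-up toy frame (the ids and years are illustrative, not taken from the dataset). It shows how `duplicated(..., keep=False)` exposes every member of a duplicate group, and that `drop_duplicates` keeps the first row of each group by default.

```python
import pandas as pd

# Toy stand-in for imdb_title_basics; these rows are invented for illustration.
titles = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000002', 'tt0000003', 'tt0000004'],
    'primary_title': ['Bigfoot', 'Bigfoot', 'A Thin Life', 'Spectre'],
    'start_year': [2017, 2012, 2018, 2015],
})

# keep=False flags every row in a duplicate group, not only the repeats.
dupes = titles[titles.duplicated(subset='primary_title', keep=False)]

# By default the first occurrence of each title is the one that survives.
deduped = titles.drop_duplicates(subset='primary_title')
```

Note that two different films can legitimately share a title, so inspecting `dupes` first shows what information is lost by keeping only the first occurrence.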


Next we want to get invalid genre descriptors, i.e. anything that is not a string.

In [289]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].shape

(4891, 6)

In [290]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].genres.unique()

array([nan], dtype=object)

From this, we can see that there are 4891 movies in the IMDB database which are not supplied with appropriate genre descriptions, and moreover that each of these invalid descriptions is `NaN`. Given that genre is a major component of our analysis here, we drop these entries from consideration.
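As an aside, since every invalid value turned out to be `NaN`, the same count can be obtained more directly with `isna()`. A small sketch on an invented series, assuming (as found above) that missing values are the only non-string entries:

```python
import numpy as np
import pandas as pd

# Invented example values; NaN marks a missing genre description.
genres = pd.Series(['Drama', np.nan, 'Comedy,Drama', np.nan], name='genres')

# The type-based check used in this notebook.
n_missing_by_type = genres.apply(lambda g: type(g) != str).sum()

# The equivalent built-in check for missing values.
n_missing_by_isna = genres.isna().sum()
```

The type-based check is the more general of the two, because it would also catch stray non-string values such as numbers.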

In [291]:
imdb_title_basics.dropna(subset=['genres'], inplace=True)

In [292]:
imdb_title_basics[imdb_title_basics.genres.apply(lambda g: type(g) != str)].shape

(0, 6)

Fantastic, we have removed all of the rows where the genres description was an invalid value. We might also want to check that the titles are all appropriately filled in.

In [293]:
imdb_title_basics[imdb_title_basics.primary_title.apply(lambda t: type(t) != str)].shape

(0, 6)

We have thus found that all of the titles in the dataframe are given as appropriate strings, so we shouldn't need to drop any rows with missing values. We may drop rows when we get to merging, but for now, this field of the dataset isn't producing any issues.

In [294]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,83.0,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,82.0,Biography


Now, let's drop the `runtime_minutes` column, as it isn't needed to answer our analytical questions.

In [295]:
imdb_title_basics.drop(columns='runtime_minutes', inplace=True)

In [296]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"Comedy,Drama,Fantasy"
5,tt0111414,A Thin Life,A Thin Life,2018,Comedy
6,tt0112502,Bigfoot,Bigfoot,2017,"Horror,Thriller"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"Adventure,Animation,Comedy"
8,tt0139613,O Silêncio,O Silêncio,2012,"Documentary,History"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,Biography


Next, let's split the `genres` field into a list of separate genres, as opposed to the single comma-separated string that is given currently.

In [297]:
imdb_title_basics.genres = imdb_title_basics.genres.apply(lambda g: g.split(','))

In [298]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"[Action, Crime, Drama]"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"[Biography, Drama]"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,[Drama]
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"[Comedy, Drama]"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"[Comedy, Drama, Fantasy]"
5,tt0111414,A Thin Life,A Thin Life,2018,[Comedy]
6,tt0112502,Bigfoot,Bigfoot,2017,"[Horror, Thriller]"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"[Adventure, Animation, Comedy]"
8,tt0139613,O Silêncio,O Silêncio,2012,"[Documentary, History]"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,[Biography]


Alright, this dataset is looking pretty good. Genres are now correctly represented inside of lists, as will be needed for later analysis.
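To illustrate why list-valued genres are convenient, here is a hedged sketch of one possible later use: counting titles per genre with `explode`. The toy frame reuses three titles from the output above. The per-genre count is only an illustration, not necessarily the analysis performed later.

```python
import pandas as pd

movies = pd.DataFrame({
    'primary_title': ['Sunghursh', 'A Thin Life', 'Bigfoot'],
    'genres': ['Action,Crime,Drama', 'Comedy', 'Horror,Thriller'],
})

# Vectorized equivalent of apply(lambda g: g.split(',')).
movies['genres'] = movies['genres'].str.split(',')

# explode() emits one row per (title, genre) pair.
per_genre = movies.explode('genres')
genre_counts = per_genre['genres'].value_counts()
```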

In [299]:
imdb_title_basics.shape

(131180, 5)

As we can see, the IMDB dataset contains a multitude of entries. Because of its large size, we will use this dataset as a sort of "master" dataset, to which our additionally gathered revenue information will be attached. Let's continue to pare down our master dataset by removing any entries whose `start_year` is later than 2019, i.e. movies which have not yet been started.

In [300]:
imdb_title_basics = imdb_title_basics[imdb_title_basics.start_year <= 2019]

In [301]:
imdb_title_basics.head(10)

Unnamed: 0,imdb_id,primary_title,original_title,start_year,genres
0,tt0063540,Sunghursh,Sunghursh,2013,"[Action, Crime, Drama]"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,"[Biography, Drama]"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,[Drama]
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,"[Comedy, Drama]"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,"[Comedy, Drama, Fantasy]"
5,tt0111414,A Thin Life,A Thin Life,2018,[Comedy]
6,tt0112502,Bigfoot,Bigfoot,2017,"[Horror, Thriller]"
7,tt0137204,Joe Finds Grace,Joe Finds Grace,2017,"[Adventure, Animation, Comedy]"
8,tt0139613,O Silêncio,O Silêncio,2012,"[Documentary, History]"
9,tt0144449,Nema aviona za Zagreb,Nema aviona za Zagreb,2012,[Biography]


----
Let's consider some more of the information that is provided by IMDB. In particular, we might be interested in the crews that work on particular films. Let's load up the respective dataset, and begin exploring.

In [302]:
imdb_title_crew = pd.read_csv('data/imdb.title.crew.csv')

In [303]:
imdb_title_crew.rename(columns={'tconst': 'imdb_id'}, inplace=True)
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
5,tt0879859,nm2416460,
6,tt0996958,nm2286991,"nm2286991,nm2651190"
7,tt0999913,nm0527109,"nm0527109,nm0329051,nm0001603,nm0930684"
8,tt10003792,nm10539228,nm10539228
9,tt10005130,nm10540239,"nm5482263,nm10540239"


In [304]:
imdb_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
imdb_id      146144 non-null object
directors    140417 non-null object
writers      110261 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


Let's begin to pare down the null values in our directors and writers columns. One decision to make is whether to keep films that are missing one of the two. Since the crew analysis is secondary to the main analysis of genre trends, we will only consider those films that have both writers and directors. We can get the number of rows which have either null directors or null writers below.

In [305]:
imdb_title_crew[imdb_title_crew.directors.isna() | imdb_title_crew.writers.isna()].shape

(37136, 3)
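The "either column is null" mask can also be written over a list of columns, which scales better if more crew columns are ever added. A minimal sketch on invented rows, showing that the count of incomplete rows and the rows surviving `dropna` add up to the full frame:

```python
import numpy as np
import pandas as pd

# Invented rows covering all four null/non-null combinations.
crew = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', 'tt3', 'tt4'],
    'directors': ['nm1', np.nan, 'nm3', np.nan],
    'writers': ['nm1', 'nm2,nm5', np.nan, np.nan],
})

# True wherever at least one of the listed columns is missing.
either_null = crew[['directors', 'writers']].isna().any(axis=1)
n_incomplete = int(either_null.sum())

# dropna with subset= removes exactly those rows.
complete = crew.dropna(subset=['directors', 'writers'])
```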

Let's drop these rows from our dataset.

In [306]:
imdb_title_crew.dropna(subset = ['directors', 'writers'], inplace=True)

In [307]:
imdb_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109008 entries, 0 to 146142
Data columns (total 3 columns):
imdb_id      109008 non-null object
directors    109008 non-null object
writers      109008 non-null object
dtypes: object(3)
memory usage: 3.3+ MB


In [308]:
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,nm0899854,nm0899854
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943
6,tt0996958,nm2286991,"nm2286991,nm2651190"
7,tt0999913,nm0527109,"nm0527109,nm0329051,nm0001603,nm0930684"
8,tt10003792,nm10539228,nm10539228
9,tt10005130,nm10540239,"nm5482263,nm10540239"
10,tt10005378,nm9232888,nm9232888
11,tt10011102,nm4853354,"nm2215938,nm0219964"


Finally, let's separate our string lists into Python lists, as we have done previously.

In [309]:
imdb_title_crew.directors = imdb_title_crew.directors.apply(lambda d: d.split(','))
imdb_title_crew.writers = imdb_title_crew.writers.apply(lambda w: w.split(','))

In [310]:
imdb_title_crew.head(10)

Unnamed: 0,imdb_id,directors,writers
0,tt0285252,[nm0899854],[nm0899854]
2,tt0462036,[nm1940585],[nm1940585]
3,tt0835418,[nm0151540],"[nm0310087, nm0841532]"
4,tt0878654,"[nm0089502, nm2291498, nm2292011]",[nm0284943]
6,tt0996958,[nm2286991],"[nm2286991, nm2651190]"
7,tt0999913,[nm0527109],"[nm0527109, nm0329051, nm0001603, nm0930684]"
8,tt10003792,[nm10539228],[nm10539228]
9,tt10005130,[nm10540239],"[nm5482263, nm10540239]"
10,tt10005378,[nm9232888],[nm9232888]
11,tt10011102,[nm4853354],"[nm2215938, nm0219964]"


This looks good! All of the entries in this table are non-null and have valid writers and directors in list format, whose name ids can be cross-referenced in [data/imdb.name.basics.csv](./data/imdb.name.basics.csv).

----
Let's also look at some ratings information from IMDB.

In [311]:
imdb_title_ratings = pd.read_csv('data/imdb.title.ratings.csv')

In [312]:
imdb_title_ratings.rename(columns={'tconst': 'imdb_id'}, inplace=True)
imdb_title_ratings.head(10)

Unnamed: 0,imdb_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571
8,tt1156528,7.2,265
9,tt1161457,4.2,148


In [313]:
imdb_title_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
imdb_id          73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


The info view reveals that every column contains only non-null values. Let's now check whether any of the average ratings or vote counts are less than or equal to 0.

In [314]:
imdb_title_ratings[imdb_title_ratings.averagerating <= 0].shape

(0, 3)

In [315]:
imdb_title_ratings[imdb_title_ratings.numvotes <= 0].shape

(0, 3)

Alright, it looks like the records for the ratings dataset are all in good shape, and don't require any additional cleaning to be suitable for further analysis!
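The two checks above could be bundled into a small reusable validator. The sketch below is one way to do it. The helper name `validate_ratings` is hypothetical, and the default bounds of 1 to 10 are an assumption about the rating scale rather than something established in this notebook.

```python
import pandas as pd

ratings = pd.DataFrame({
    'imdb_id': ['tt10356526', 'tt10384606', 'tt1042974'],
    'averagerating': [8.3, 8.9, 6.4],
    'numvotes': [31, 559, 20],
})

def validate_ratings(df, low=1.0, high=10.0):
    '''
    Returns the rows whose rating falls outside [low, high]
    or whose vote count is not positive (hypothetical helper)
    '''
    in_range = df['averagerating'].between(low, high)
    has_votes = df['numvotes'] > 0
    return df[~(in_range & has_votes)]

bad_rows = validate_ratings(ratings)
```

An empty `bad_rows` frame corresponds to the `(0, 3)` shapes seen above.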

----
Finally, we will look at The Numbers movie budget dataset.

In [316]:
tn_movie_budgets = pd.read_csv('data/tn.movie_budgets.csv')

In [317]:
tn_movie_budgets.head(10)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,6,"Dec 18, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"
6,7,"Apr 27, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200"
7,8,"May 24, 2007",Pirates of the Caribbean: At World's End,"$300,000,000","$309,420,425","$963,420,425"
8,9,"Nov 17, 2017",Justice League,"$300,000,000","$229,024,295","$655,945,209"
9,10,"Nov 6, 2015",Spectre,"$300,000,000","$200,074,175","$879,620,923"


We have a substantial amount of formatting that is required for this dataset. In particular, we must change the financial columns to numeric values, fix the release date fields, and drop/rename columns.

In [318]:
tn_movie_budgets.drop(columns='id', inplace=True)

In [319]:
tn_movie_budgets.rename(columns={'movie': 'primary_title'}, inplace=True)

In [320]:
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 5 columns):
release_date         5782 non-null object
primary_title        5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: object(5)
memory usage: 226.0+ KB


In [321]:
tn_movie_budgets.release_date = pd.to_datetime(tn_movie_budgets.release_date, infer_datetime_format=True)

In [322]:
tn_movie_budgets.head(10)

Unnamed: 0,release_date,primary_title,production_budget,domestic_gross,worldwide_gross
0,2009-12-18,Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,2015-05-01,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,2017-12-15,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,2015-12-18,Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"
6,2018-04-27,Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200"
7,2007-05-24,Pirates of the Caribbean: At World's End,"$300,000,000","$309,420,425","$963,420,425"
8,2017-11-17,Justice League,"$300,000,000","$229,024,295","$655,945,209"
9,2015-11-06,Spectre,"$300,000,000","$200,074,175","$879,620,923"
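The conversion above lets pandas infer the date format. An alternative is to state the format explicitly, which fails loudly on an unexpected string instead of guessing. This sketch assumes every date follows the `Mon D, YYYY` pattern seen in the sample rows and that month abbreviations are parsed in an English locale.

```python
import pandas as pd

# Sample strings copied from the release_date column shown earlier.
dates = pd.Series(['Dec 18, 2009', 'May 20, 2011', 'Jun 7, 2019'])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format='%b %d, %Y')
years = parsed.dt.year.tolist()
```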


We should also filter the dataset by `release_date`, keeping only movies released in 2010 or later.

In [323]:
# .copy() makes the filtered frame independent, so later column assignments don't warn
tn_movie_budgets = tn_movie_budgets[tn_movie_budgets.release_date.dt.year >= 2010].copy()

In [324]:
tn_movie_budgets.head(10)

Unnamed: 0,release_date,primary_title,production_budget,domestic_gross,worldwide_gross
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,2019-06-07,Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,2015-05-01,Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,2017-12-15,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
5,2015-12-18,Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,053,311,220"
6,2018-04-27,Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,134,200"
8,2017-11-17,Justice League,"$300,000,000","$229,024,295","$655,945,209"
9,2015-11-06,Spectre,"$300,000,000","$200,074,175","$879,620,923"
10,2012-07-20,The Dark Knight Rises,"$275,000,000","$448,139,099","$1,084,439,099"
11,2018-05-25,Solo: A Star Wars Story,"$275,000,000","$213,767,512","$393,151,347"


Let's write a small function which we will use to convert our financial columns from dollar strings into integers.

In [325]:
def dollars_to_integer(s):
    '''
    Takes a string representing a dollar amount (e.g. '$425,000,000')
    and parses it into an integer
    '''
    # Drop the leading dollar sign, then the thousands separators
    s_no_sign = s[1:]
    return int(s_no_sign.replace(',', ''))

In [326]:
tn_movie_budgets.production_budget = tn_movie_budgets.production_budget.apply(dollars_to_integer)
tn_movie_budgets.domestic_gross = tn_movie_budgets.domestic_gross.apply(dollars_to_integer)
tn_movie_budgets.worldwide_gross = tn_movie_budgets.worldwide_gross.apply(dollars_to_integer)
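For comparison, the same conversion can be done without a per-row Python function by using pandas string methods. This is a sketch of an alternative, not the approach used in this notebook. It assumes every value consists only of a dollar sign, digits, and commas.

```python
import pandas as pd

# Values copied from the first two rows of the budget table.
budgets = pd.DataFrame({
    'production_budget': ['$425,000,000', '$410,600,000'],
    'worldwide_gross': ['$2,776,345,279', '$1,045,663,875'],
})

money_columns = ['production_budget', 'worldwide_gross']
for col in money_columns:
    budgets[col] = (
        budgets[col]
        .str.replace(r'[$,]', '', regex=True)  # strip '$' and ','
        .astype('int64')                       # grosses exceed the int32 range
    )
```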

In [327]:
tn_movie_budgets.head(10)

Unnamed: 0,release_date,primary_title,production_budget,domestic_gross,worldwide_gross
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,2019-06-07,Dark Phoenix,350000000,42762350,149762350
3,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
4,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747
5,2015-12-18,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2053311220
6,2018-04-27,Avengers: Infinity War,300000000,678815482,2048134200
8,2017-11-17,Justice League,300000000,229024295,655945209
9,2015-11-06,Spectre,300000000,200074175,879620923
10,2012-07-20,The Dark Knight Rises,275000000,448139099,1084439099
11,2018-05-25,Solo: A Star Wars Story,275000000,213767512,393151347


In [328]:
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2194 entries, 1 to 5780
Data columns (total 5 columns):
release_date         2194 non-null datetime64[ns]
primary_title        2194 non-null object
production_budget    2194 non-null int64
domestic_gross       2194 non-null int64
worldwide_gross      2194 non-null int64
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 102.8+ KB


Everything here looks good as well. All of our entries contain non-null values, and we have appropriately formatted the fields.

----
Let's quickly parse the IMDB name dataset so that we can get access to name information from the associated name IDs that are present throughout the IMDB database. In particular, we aim to look at successful crews for films within a certain genre, so as to recommend to Microsoft a strong set of talent from which to create a movie.

In [329]:
imdb_name_basics = pd.read_csv('data/imdb.name.basics.csv')

In [330]:
imdb_name_basics.head()

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer","tt0837562,tt2398241,tt0844471,tt0118553"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department","tt0896534,tt6791238,tt0287072,tt1682940"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer","tt1470654,tt0363631,tt0104030,tt0102898"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department","tt0114371,tt2004304,tt1618448,tt1224387"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator","tt0452644,tt0452692,tt3458030,tt2178256"


We aren't particularly worried about when our actors, directors, and writers were born, nor about when they died (as morbid as it is), and since this dataset will only be used to look up names, we don't need the `primary_profession` or `known_for_titles` columns either. We drop all four columns to facilitate a more compact footprint. While we are at it, let's also rename our `nconst` field to something that is slightly more descriptive.

In [331]:
imdb_name_basics.drop(columns=['birth_year', 'death_year', 'primary_profession', 'known_for_titles'], inplace=True)

In [332]:
imdb_name_basics.rename(columns={'nconst': 'imdb_nameid'}, inplace=True)

In [333]:
imdb_name_basics.head()

Unnamed: 0,imdb_nameid,primary_name
0,nm0061671,Mary Ellen Bauder
1,nm0061865,Joseph Bauer
2,nm0062070,Bruce Baum
3,nm0062195,Axel Baumann
4,nm0062798,Pete Baxter


In [334]:
imdb_name_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 2 columns):
imdb_nameid     606648 non-null object
primary_name    606648 non-null object
dtypes: object(2)
memory usage: 9.3+ MB


The info view shows that neither of the two remaining columns contains null values. We still call `dropna` as a safeguard; as the unchanged row count below confirms, it removes nothing.

In [335]:
imdb_name_basics.dropna(inplace=True)

In [336]:
imdb_name_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 606648 entries, 0 to 606647
Data columns (total 2 columns):
imdb_nameid     606648 non-null object
primary_name    606648 non-null object
dtypes: object(2)
memory usage: 13.9+ MB


Importantly, we won't be merging this dataset with our other larger movie information dataset that will be created in the [next notebook](data-merging.ipynb). This dataset instead will be used simply for name lookups using the IMDB name IDs that are given, and to identify actors in particular movies.
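A name lookup of this kind might look like the following sketch, which indexes the names by ID and resolves a list of IDs such as those in the crew table. The helper `lookup_names` and the ID `nm9999999` are hypothetical. Falling back to the raw ID when no name is found is a design choice made here for illustration.

```python
import pandas as pd

# First three rows of the cleaned name table shown above.
names = pd.DataFrame({
    'imdb_nameid': ['nm0061671', 'nm0061865', 'nm0062070'],
    'primary_name': ['Mary Ellen Bauder', 'Joseph Bauer', 'Bruce Baum'],
})

# A Series indexed by name ID behaves like a lookup table.
name_by_id = names.set_index('imdb_nameid')['primary_name']

def lookup_names(name_ids):
    '''
    Maps a list of IMDB name IDs to names, leaving unknown IDs unchanged
    (hypothetical helper)
    '''
    return [name_by_id.get(name_id, name_id) for name_id in name_ids]

resolved = lookup_names(['nm0062070', 'nm9999999'])
```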

----
The next step is to save the cleaned dataframes in the `cleaned_data` folder. We use Pickle format so that our list-valued columns are kept as lists, whereas CSV would write them out as strings.

In [337]:
imdb_title_basics.to_pickle('cleaned_data/imdb_title_basics.pkl')
imdb_title_crew.to_pickle('cleaned_data/imdb_title_crew.pkl')
imdb_title_ratings.to_pickle('cleaned_data/imdb_title_ratings.pkl')
imdb_name_basics.to_pickle('cleaned_data/imdb_name_basics.pkl')
tn_movie_budgets.to_pickle('cleaned_data/tn_movie_budgets.pkl')
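The following sketch illustrates why Pickle is used here rather than CSV: a list-valued column survives a Pickle round trip as a list, while a CSV round trip turns it into a string. It writes to a temporary directory, so the file names are illustrative only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'imdb_id': ['tt0063540', 'tt0111414'],
    'genres': [['Action', 'Crime', 'Drama'], ['Comedy']],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'example.pkl')
    csv_path = os.path.join(tmp, 'example.csv')
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle preserves the Python list; CSV yields its string representation.
pickle_type = type(from_pickle['genres'].iloc[0])
csv_type = type(from_csv['genres'].iloc[0])
```

Keep in mind that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.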