# Data Preparation - IMDb Dataset

This notebook prepares the selected IMDb files so they can be merged with data from other sources. The result is a single DataFrame saved to disk as a CSV file.

## Data Selection

The IMDb title basics and title ratings files have been chosen for further analysis. The basics file contains two relevant features, genres and runtime, along with each movie's title and year. The ratings file contains two candidate targets, average rating and number of votes. These metrics are worth investigating for their relevance to the question of what movies Microsoft's new studio should make.

In [40]:
# import the relevant packages and load the files
import pandas as pd

basics = pd.read_csv('./zippedData/imdb.title.basics.csv.gz')
ratings = pd.read_csv('./zippedData/imdb.title.ratings.csv.gz')

In [41]:
# check to make sure the data loaded as expected
basics.head()

,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [42]:
# check to make sure the data loaded as expected
ratings.head()

,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


## Join the DataFrames

The two DataFrames will be joined on the `tconst` field, a unique identifier for each movie title.

In [43]:
# check the number of rows and the non-null count of tconst
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [44]:
# check the number of rows and the non-null count of tconst
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [45]:
# check if each tconst is distinct
len(basics['tconst'].unique())

146144

In [46]:
# check if each tconst is distinct
len(ratings['tconst'].unique())

73856

The ratings table has fewer records than the basics table (73,856 versus 146,144). Within each table, every `tconst` value is distinct, so no record is duplicated. The tables should be **outer joined** to preserve information about movies that don't have an IMDb rating. *This information could still be useful after the merge with data from other sources.*
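The uniqueness and coverage checks behind this decision can also be expressed directly. The following is a minimal sketch on toy data (the two small frames and their values are made up for illustration, not taken from the IMDb files). It uses `Series.is_unique` and the `validate` and `indicator` options of `pd.merge` to confirm a one-to-one key and to count rows that appear in only one table.

```python
import pandas as pd

# Toy stand-ins for the basics and ratings tables (hypothetical values)
basics_demo = pd.DataFrame({
    'tconst': ['tt001', 'tt002', 'tt003'],
    'primary_title': ['A', 'B', 'C'],
})
ratings_demo = pd.DataFrame({
    'tconst': ['tt001', 'tt003'],
    'averagerating': [7.0, 6.5],
})

# True only when no key value repeats within either table
keys_unique = basics_demo['tconst'].is_unique and ratings_demo['tconst'].is_unique

# validate raises a MergeError if the key is not one-to-one;
# indicator adds a _merge column recording where each row came from
merged_demo = pd.merge(basics_demo, ratings_demo, on='tconst', how='outer',
                       validate='one_to_one', indicator=True)

# Tally of rows found in both tables versus only one of them
merge_counts = merged_demo['_merge'].value_counts()
```

In this toy case, rows tagged `left_only` correspond to movies without a rating, which are the rows an inner join would have discarded.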

In [47]:
# set the index to the tconst column to join them
imdb = basics.set_index('tconst').join(ratings.set_index('tconst'), how='outer')

# check to see if join went as expected
imdb.head()

tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0
tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0
tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0
tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0
tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0


In [48]:
# check the number of records and if there are missing values for averagerating and numvotes
imdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146144 entries, tt0063540 to tt9916754
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   primary_title    146144 non-null  object 
 1   original_title   146123 non-null  object 
 2   start_year       146144 non-null  int64  
 3   runtime_minutes  114405 non-null  float64
 4   genres           140736 non-null  object 
 5   averagerating    73856 non-null   float64
 6   numvotes         73856 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 8.9+ MB
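Note that `numvotes` is now `float64` rather than `int64`, because a NumPy integer column cannot hold the missing values introduced by the outer join. If whole-number vote counts matter downstream, one option is pandas' nullable `Int64` dtype. A minimal sketch on a made-up series (the values are illustrative only):

```python
import pandas as pd

# Made-up vote counts with a gap, as an outer join would produce
votes = pd.Series([77.0, None, 4517.0], name='numvotes')

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
votes_int = votes.astype('Int64')

# Missing entries are still detectable after the conversion
n_missing = votes_int.isna().sum()
```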


## Create new ID

The tables have been merged. All 146,144 records remain intact, and, as expected, the averagerating and numvotes columns have missing values for movies without a rating. Currently, the index is the `tconst` identifier. Because `tconst` is unique to the IMDb data, it cannot be used to join with other sources, so a new ID is needed. The new ID will be a string containing the lowercase name of the movie followed by its release year.

In [49]:
# take a look at the current format
imdb.head()

tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0
tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0
tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0
tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0
tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0


In [50]:
# create new column and remove any leading or trailing whitespace in the title
imdb['movie_id'] = imdb['primary_title'].str.strip()

# convert titles to lowercase
imdb['movie_id'] = imdb['movie_id'].str.lower()

# check result
imdb.head()

tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,movie_id
tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0,sunghursh
tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0,one day before the rainy season
tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0,the other side of the wind
tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0,sabse bada sukh
tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0,the wandering soap opera


In [51]:
# append the year to the movie_id
imdb['movie_id'] = imdb['movie_id'] + imdb['start_year'].astype(str)

# check result
imdb.head()

tconst,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,movie_id
tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0,sunghursh2013
tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0,one day before the rainy season2019
tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0,the other side of the wind2018
tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0,sabse bada sukh2018
tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0,the wandering soap opera2017


In [52]:
# set the index to the new movie_id
imdb = imdb.set_index('movie_id')

# check result
imdb.head()

movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
sunghursh2013,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77.0
one day before the rainy season2019,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43.0
the other side of the wind2018,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517.0
sabse bada sukh2018,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13.0
the wandering soap opera2017,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119.0
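Because a title and year together are not guaranteed to identify a single film, it may be worth checking the new index for repeats before relying on it as a join key. A minimal sketch on made-up titles follows; the helper name `make_movie_id` is hypothetical and simply mirrors the strip, lowercase, and append-year steps used in this section.

```python
import pandas as pd

def make_movie_id(titles: pd.Series, years: pd.Series) -> pd.Series:
    """Build a title+year key: strip whitespace, lowercase, append the year."""
    return titles.str.strip().str.lower() + years.astype(str)

# Made-up rows; the first two collapse to the same key
demo = pd.DataFrame({
    'primary_title': ['  Alpha', 'alpha ', 'Beta'],
    'start_year': [2015, 2015, 2016],
})
demo['movie_id'] = make_movie_id(demo['primary_title'], demo['start_year'])

# keep=False flags every member of a duplicated group, not just the later ones
dupes = demo[demo['movie_id'].duplicated(keep=False)]
n_dupes = len(dupes)
```

If a check like this reports duplicates on the real data, they would need to be resolved (or at least counted) before a join on `movie_id`, since repeated keys multiply rows in the result.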


## Drop unnecessary columns

The new index has been created, consisting of the lowercase name of the movie with the year appended. The IMDb data is almost ready to be joined with other data sources. Now that the year is part of the index, the start_year column is no longer needed for analysis and will be dropped. The original_title column, which holds the title in its original language, will also be dropped, since only the primary_title is needed for further analysis.

In [53]:
# drop the original_title and start_year columns, but keep all rows
imdb = imdb.drop(['original_title', 'start_year'], axis=1)

# check the result
imdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146144 entries, sunghursh2013 to chico albuquerque - revelações2013
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   primary_title    146144 non-null  object 
 1   runtime_minutes  114405 non-null  float64
 2   genres           140736 non-null  object 
 3   averagerating    73856 non-null   float64
 4   numvotes         73856 non-null   float64
dtypes: float64(3), object(2)
memory usage: 6.7+ MB


In [54]:
# take a look
imdb.head()

movie_id,primary_title,runtime_minutes,genres,averagerating,numvotes
sunghursh2013,Sunghursh,175.0,"Action,Crime,Drama",7.0,77.0
one day before the rainy season2019,One Day Before the Rainy Season,114.0,"Biography,Drama",7.2,43.0
the other side of the wind2018,The Other Side of the Wind,122.0,Drama,6.9,4517.0
sabse bada sukh2018,Sabse Bada Sukh,,"Comedy,Drama",6.1,13.0
the wandering soap opera2017,The Wandering Soap Opera,80.0,"Comedy,Drama,Fantasy",6.5,119.0


## Save as a CSV file

In [55]:
# save as csv file
imdb.to_csv('imdb_data')
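
When the file is read back in a later notebook, the index has to be restored explicitly, since `to_csv` writes it as an ordinary first column. A minimal sketch with a made-up frame and a temporary path (neither comes from the original notebook):

```python
import os
import tempfile

import pandas as pd

# Made-up frame indexed by movie_id, standing in for the imdb DataFrame
demo = pd.DataFrame(
    {'primary_title': ['Alpha', 'Beta'], 'averagerating': [7.0, None]},
    index=pd.Index(['alpha2015', 'beta2016'], name='movie_id'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'imdb_demo.csv')  # hypothetical file name
    demo.to_csv(path)  # the index is written as the first column
    # index_col restores movie_id as the index on load
    restored = pd.read_csv(path, index_col='movie_id')
```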

## Summary

The IMDb data is now ready to be joined with data from other sources. The two original tables have been joined, a new movie ID index has been created, and columns unnecessary for further analysis have been dropped. With the new CSV file, the next step is to join other data on the new movie ID index.