# Analysis of Successful Movies (Notebook 1)
* Benjamin Grossmann

This notebook retrieves data from IMDB (Internet Movie Database).
Then it performs some preprocessing and filtering to keep only the movies that meet the desired criteria.
The final step is to save the reduced data set.

After the reduced data set has been saved, further work on this project should proceed in Notebook 2. Working from the saved files avoids repeating the download and filtering, which reduces the time needed to bring the data into a project-ready state.

If the reduced data set should need to be reset to its initial condition, then re-run Notebook 1.

The IMDB files being used:
* https://datasets.imdbws.com/title.basics.tsv.gz
* https://datasets.imdbws.com/title.akas.tsv.gz
* https://datasets.imdbws.com/title.ratings.tsv.gz

The data dictionary:
* https://www.imdb.com/interfaces/

The data kept from these files needs to follow these specifications:
* Exclude any movie with missing values for genre or runtime.
* Include only full-length movies (titleType = "movie").
* Include only fictional movies (exclude the 'Documentary' genre).
* Include only movies that were released from 2000 through 2021, inclusive.
* Include only movies that were released in the United States.
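
The specifications above can be read as one combined boolean mask. The following is only a minimal sketch on invented toy rows, not the code used in this notebook; the names `toy_basics` and `us_title_ids` are hypothetical stand-ins for the basics dataframe and for the set of title IDs with a US entry in the akas dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the basics dataframe; every row is invented for illustration.
toy_basics = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5', 'tt6', 'tt7', 'tt8'],
    'titleType': ['movie', 'short', 'movie', 'movie', 'movie', 'movie', 'movie', 'movie'],
    'startYear': [2001, 2005, 1999, 2010, 2015, 2021, 2012, 2012],
    'runtimeMinutes': [118, 5, 90, np.nan, 95, 100, 100, 90],
    'genres': ['Comedy', 'Short', 'Drama', 'Drama',
               'Documentary,History', 'Horror', 'Drama', np.nan],
})

# Hypothetical set of title IDs that have a US row in the akas dataframe.
us_title_ids = {'tt1', 'tt2', 'tt3', 'tt4', 'tt5', 'tt6', 'tt8'}

keep = (
    toy_basics['genres'].notna()                                    # genre present
    & toy_basics['runtimeMinutes'].notna()                          # runtime present
    & (toy_basics['titleType'] == 'movie')                          # full-length movies only
    & ~toy_basics['genres'].str.contains('Documentary', na=False)   # fictional only
    & toy_basics['startYear'].between(2000, 2021)                   # 2000 through 2021, inclusive
    & toy_basics['tconst'].isin(us_title_ids)                       # released in the US
)

selected = toy_basics.loc[keep]
print(selected['tconst'].tolist())
```

The notebook applies these conditions one at a time instead, which makes it possible to report how many rows each step removes.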

# Retrieve Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
%%time
# This code block can take between 2 and 10 minutes

# Store the data locations as strings
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

# Retrieve the data from online
basics = pd.read_csv(url_basics, sep='\t', low_memory=False)
akas = pd.read_csv(url_akas, sep='\t', low_memory=False)
ratings = pd.read_csv(url_ratings, sep='\t', low_memory=False)

Wall time: 3min 38s
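
Because the akas file is by far the largest of the three, one possible way to lower peak memory use is to read it in chunks and keep only the US rows from each chunk. This is a sketch under stated assumptions rather than what the cell above does: `read_us_rows` is a hypothetical helper, and the in-memory `sample_tsv` text stands in for the real download.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for title.akas.tsv; the rows are invented.
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t1\tCarmencita\tUS\n"
    "tt0000001\t2\tCarmencita\tDE\n"
    "tt0000002\t1\tLe clown et ses chiens\tFR\n"
    "tt0000005\t1\tBlacksmith Scene\tUS\n"
)

def read_us_rows(source, chunksize=2):
    """Read an akas-style TSV in chunks and keep only rows with region 'US'."""
    kept = []
    for chunk in pd.read_csv(source, sep='\t', na_values='\\N', chunksize=chunksize):
        kept.append(chunk.loc[chunk['region'] == 'US'])
    return pd.concat(kept, ignore_index=True)

akas_us = read_us_rows(io.StringIO(sample_tsv))
print(akas_us)
```

With the real file, `source` would be the URL string and `chunksize` would be much larger; the value 2 is only chosen so that the toy data spans more than one chunk.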


In [3]:
# Display shape information about the data
print(f"The basics dataframe shape: {basics.shape}")
print(f"The akas dataframe shape: {akas.shape}")
print(f"The ratings dataframe shape: {ratings.shape}")

The basics dataframe shape: (8952170, 9)
The akas dataframe shape: (32123546, 8)
The ratings dataframe shape: (1247906, 3)


# Standardizing Missing Values

In [4]:
%%time
# This code block can take around 1 minute

# Replace the \N placeholders with np.nan
basics = basics.replace({'\\N':np.nan})
akas = akas.replace({'\\N':np.nan})
ratings = ratings.replace({'\\N':np.nan})

Wall time: 1min
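
As an alternative to replacing the placeholders after loading, `pd.read_csv` accepts an `na_values` argument that marks them as missing while the file is parsed. The sketch below compares both routes on a small invented sample; it is an illustration, not a measurement of the time saved on the full files.

```python
import io
import numpy as np
import pandas as pd

# Invented two-row sample using the same \N placeholder as the IMDB files.
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\t1894\t\\N\t1\n"
    "tt0000002\t\\N\t\\N\t\\N\n"
)

# Route 1: mark \N as missing while parsing.
at_read = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

# Route 2: load first, then replace (the approach used in this notebook).
after_read = pd.read_csv(io.StringIO(sample_tsv), sep='\t').replace({'\\N': np.nan})

# Both routes flag the same cells as missing.
print(at_read.isna().equals(after_read.isna()))
print(at_read.dtypes)
```

One difference worth noting: with `na_values`, numeric columns are parsed as numbers straight away, whereas the replace route leaves them as strings that need a later conversion.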


# Filter/Clean Akas DataFrame

Examining some details about the akas dataframe.

In [5]:
print(f'The akas dataframe shape: {akas.shape}\n')
display(akas.info())
display(akas.head())

The akas dataframe shape: (32123546, 8)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32123546 entries, 0 to 32123545
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 1.9+ GB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0
1,tt0000001,2,Carmencita,DE,,,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,,imdbDisplay,,0
3,tt0000001,4,Καρμενσίτα,GR,,imdbDisplay,,0
4,tt0000001,5,Карменсита,RU,,imdbDisplay,,0


## a) Keep Release Region US

The 'region' column will be filtered to contain only the value 'US'.

In [6]:
print(f'The akas dataframe shape: {akas.shape}')

print(f'Eliminating entries that are not released in the US.')
is_US = (akas['region']=='US')
akas = akas.loc[ is_US , :]

print(f'The akas dataframe shape: {akas.shape}')
akas.head()

The akas dataframe shape: (32123546, 8)
Eliminating entries that are not released in the US.
The akas dataframe shape: (1324438, 8)


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


## b) Convert Original Title Data Type

The 'isOriginalTitle' column is object data type, but it should be boolean data type. I will check to see what data values and types are present.

In [7]:
print(akas['isOriginalTitle'].unique())
print(akas['isOriginalTitle'].value_counts(dropna=False))
print(akas['isOriginalTitle'].apply(type).value_counts())

['0' nan '1']
0      1323054
NaN       1375
1            9
Name: isOriginalTitle, dtype: int64
<class 'str'>      1323063
<class 'float'>       1375
Name: isOriginalTitle, dtype: int64


So, 'isOriginalTitle' does contain string numerical values that can be converted, but nan values are present. Since nan is considered a float, I will convert the string numbers to also be floats.

In [8]:
akas['isOriginalTitle'] = akas['isOriginalTitle'].apply(float)
print(akas['isOriginalTitle'].unique())
print(akas['isOriginalTitle'].value_counts(dropna=False))
print(akas['isOriginalTitle'].apply(type).value_counts())

[ 0. nan  1.]
0.0    1323054
NaN       1375
1.0          9
Name: isOriginalTitle, dtype: int64
<class 'float'>    1324438
Name: isOriginalTitle, dtype: int64
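
If keeping integer values alongside missing entries is preferred over floats, pandas also offers the nullable `Int64` dtype. The following sketch on an invented series shows that option; it is an alternative I am noting here, not the conversion applied above.

```python
import numpy as np
import pandas as pd

# Invented sample that mimics the mix of string digits and nan seen above.
flags = pd.Series(['0', np.nan, '1', '0'], dtype=object, name='isOriginalTitle')

as_float = pd.to_numeric(flags)            # float64, nan preserved
as_nullable_int = as_float.astype('Int64') # integers, nan becomes <NA>

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int.tolist())
```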


In [9]:
print(f'The akas dataframe shape: {akas.shape}\n')
display(akas.info())
display(akas.head())

The akas dataframe shape: (1324438, 8)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1324438 entries, 5 to 32123290
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1324438 non-null  object 
 1   ordering         1324438 non-null  int64  
 2   title            1324438 non-null  object 
 3   region           1324438 non-null  object 
 4   language         3545 non-null     object 
 5   types            1025660 non-null  object 
 6   attributes       44063 non-null    object 
 7   isOriginalTitle  1323063 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 90.9+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


## c) Propagate Filtering to Other Dataframes

Only the rows that passed the filtering of the akas dataframe should have corresponding entries in the other dataframes. Those that were dropped should also have their corresponding entries in the other dataframes dropped.

In [10]:
print(f'The basics dataframe shape: {basics.shape}')

# Filter out entries from the basics dataframe that are not released in the US
print(f'Eliminating entries that are not in the akas dataframe.')
basics_matches_akas =  basics['tconst'].isin(akas['titleId'])
basics = basics.loc[ basics_matches_akas , :]

print(f'The basics dataframe shape: {basics.shape}')
basics.head()

The basics dataframe shape: (8952170, 9)
Eliminating entries that are not in the akas dataframe.
The basics dataframe shape: (1242916, 9)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"


In [11]:
print(f'The ratings dataframe shape: {ratings.shape}')

# Filter out entries from the ratings dataframe that are not released in the US
print(f'Eliminating entries that are not in the akas dataframe.')
ratings_matches_akas =  ratings['tconst'].isin(akas['titleId'])
ratings = ratings.loc[ ratings_matches_akas , :]

print(f'The ratings dataframe shape: {ratings.shape}')
ratings.head()

The ratings dataframe shape: (1247906, 3)
Eliminating entries that are not in the akas dataframe.
The ratings dataframe shape: (471592, 3)


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1882
1,tt0000002,5.9,250
4,tt0000005,6.2,2488
5,tt0000006,5.2,166
6,tt0000007,5.4,773
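
The two cells above follow the same pattern: keep the rows whose ID appears in the filtered akas dataframe. A minimal sketch of that pattern on invented data, with a follow-up check that no unmatched IDs remain, could look like this. The helper name `keep_matching` is hypothetical.

```python
import pandas as pd

# Invented toy frames; tt2 and tt4 have no US entry in toy_akas.
toy_akas = pd.DataFrame({'titleId': ['tt1', 'tt1', 'tt3'],
                         'region': ['US', 'US', 'US']})
toy_ratings = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3', 'tt4'],
                            'averageRating': [5.7, 5.9, 6.2, 5.2]})

def keep_matching(df, key, valid_ids):
    """Return the rows of df whose value in column `key` appears in valid_ids."""
    return df.loc[df[key].isin(valid_ids)]

toy_ratings = keep_matching(toy_ratings, 'tconst', toy_akas['titleId'])

# Sanity check: every remaining ID should exist in the akas frame.
orphans = set(toy_ratings['tconst']) - set(toy_akas['titleId'])
print(toy_ratings['tconst'].tolist(), orphans)
```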


# Filter/Clean the Basics Dataframe

Examining some details about the basics dataframe.

In [12]:
print(f'The basics dataframe shape: {basics.shape}\n')
display(basics.info())
display(basics.head())

The basics dataframe shape: (1242916, 9)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1242916 entries, 0 to 8952125
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1242916 non-null  object
 1   titleType       1242916 non-null  object
 2   primaryTitle    1242916 non-null  object
 3   originalTitle   1242916 non-null  object
 4   isAdult         1242916 non-null  object
 5   startYear       1160598 non-null  object
 6   endYear         31136 non-null    object
 7   runtimeMinutes  787379 non-null   object
 8   genres          1215844 non-null  object
dtypes: object(9)
memory usage: 94.8+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"


## a) Drop Genre Missing Values

In [13]:
# Removing movies with missing values for genre
print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['genres'].isna().sum()} missing values in 'genres'")

has_genres = ~basics['genres'].isna()

print(f"Eliminating entries that are missing genre.")
basics = basics.loc[ (has_genres) , : ]

print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['genres'].isna().sum()} missing values in 'genres'")

The basics dataframe shape: (1242916, 9)
There are 27072 missing values in 'genres'
Eliminating entries that are missing genre.
The basics dataframe shape: (1215844, 9)
There are 0 missing values in 'genres'


## b) Drop Genre 'Documentary'

In [14]:
# Because each entry can contain multiple genres,
# some extra steps are needed to split out each unique genre

def splitter(n):
    return n.split(",")

In [15]:
# Find all the genres present in the basics dataframe
genre_list = list(set(list(np.concatenate(list(map( splitter , basics['genres'].unique().tolist() ))).flat)))

print(f"There are {len(genre_list)} genres represented in 'genres'")
genre_list

There are 28 genres represented in 'genres'


['Reality-TV',
 'History',
 'Fantasy',
 'Game-Show',
 'Crime',
 'Adult',
 'Musical',
 'Horror',
 'Adventure',
 'Talk-Show',
 'Comedy',
 'Drama',
 'Family',
 'Thriller',
 'Sport',
 'War',
 'Documentary',
 'Action',
 'News',
 'Film-Noir',
 'Romance',
 'Short',
 'Western',
 'Mystery',
 'Music',
 'Sci-Fi',
 'Animation',
 'Biography']
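
The nested call that builds `genre_list` can also be written with the pandas string methods. This is a sketch of an equivalent approach on an invented sample; unlike the set-based version above, it sorts the result so the order is reproducible.

```python
import pandas as pd

# Invented sample of comma-separated genre strings.
genres = pd.Series(['Documentary,Short', 'Animation,Short', 'Comedy,Short', 'Short'])

# Split each entry into a list, expand to one genre per row, then deduplicate.
genre_list = sorted(genres.str.split(',').explode().unique())

print(f"There are {len(genre_list)} genres represented in 'genres'")
print(genre_list)
```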

The 'Documentary' genre is present. All releases with a 'Documentary' designation will be dropped from the dataframe.

In [16]:
# Removing rows with a sub-string value 'Documentary'
print(f'The basics dataframe shape: {basics.shape}')
print(f'Eliminating entries that are documentaries')

is_documentary = basics['genres'].str.contains('Documentary', case=False)
# note: case=False makes the match case-insensitive

basics = basics.loc[ ~is_documentary , :]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (1215844, 9)
Eliminating entries that are documentaries
The basics dataframe shape: (1056414, 9)
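
A substring match is safe for 'Documentary' because no other genre in the list contains that text. It would not be safe for every genre: 'Music' is a substring of 'Musical', and both appear in the genre list. The sketch below, on invented rows, contrasts a substring match with an exact match on the split genre names.

```python
import pandas as pd

# Invented sample rows.
genres = pd.Series(['Documentary,Short', 'Comedy,Musical', 'Drama,Music'])

# Substring match: also flags 'Musical' when looking for 'Music'.
by_substring = genres.str.contains('Music', regex=False)

# Exact match: split into genre names, then test membership.
by_token = genres.str.split(',').apply(lambda parts: 'Music' in parts)
is_documentary = genres.str.split(',').apply(lambda parts: 'Documentary' in parts)

print(by_substring.tolist())
print(by_token.tolist())
print(is_documentary.tolist())
```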


In [17]:
# Re-evaluate the genre_list

genre_list = list(set(list(np.concatenate(list(map( splitter , basics['genres'].unique().tolist() ))).flat)))

print(f"There are {len(genre_list)} genres represented in 'genres'")
genre_list

There are 27 genres represented in 'genres'


['Reality-TV',
 'History',
 'Fantasy',
 'Game-Show',
 'Crime',
 'Adult',
 'Musical',
 'Horror',
 'Adventure',
 'Talk-Show',
 'Comedy',
 'Drama',
 'Family',
 'Thriller',
 'Sport',
 'War',
 'Action',
 'News',
 'Film-Noir',
 'Romance',
 'Short',
 'Western',
 'Mystery',
 'Music',
 'Sci-Fi',
 'Animation',
 'Biography']

The value 'Documentary' is no longer present.

## c) Drop Runtime Missing Values

In [18]:
# Removing movies with missing values for runtime
print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['runtimeMinutes'].isna().sum()} missing values in 'runtimeMinutes'")

has_runtimeMinutes = ~basics['runtimeMinutes'].isna()

print(f"Eliminating entries that are missing runtime.")
basics = basics.loc[ (has_runtimeMinutes) , : ]

print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['runtimeMinutes'].isna().sum()} missing values in 'runtimeMinutes'")

The basics dataframe shape: (1056414, 9)
There are 393987 missing values in 'runtimeMinutes'
Eliminating entries that are missing runtime.
The basics dataframe shape: (662427, 9)
There are 0 missing values in 'runtimeMinutes'


## d) Convert Runtime Data Type

The 'runtimeMinutes' values are supposed to be integers. The column was object type before dropping the nan values. Now that the nan values (floats) have been eliminated, I will inspect it again.

In [19]:
print(basics['runtimeMinutes'].apply(type).value_counts())

<class 'str'>    662427
Name: runtimeMinutes, dtype: int64


In [20]:
basics['runtimeMinutes'] = basics['runtimeMinutes'].apply(int)
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())

The basics dataframe shape: (662427, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 662427 entries, 1 to 8952106
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          662427 non-null  object
 1   titleType       662427 non-null  object
 2   primaryTitle    662427 non-null  object
 3   originalTitle   662427 non-null  object
 4   isAdult         662427 non-null  object
 5   startYear       656868 non-null  object
 6   endYear         15646 non-null   object
 7   runtimeMinutes  662427 non-null  int64 
 8   genres          662427 non-null  object
dtypes: int64(1), object(8)
memory usage: 50.5+ MB


None
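
The conversion with `apply(int)` succeeds here because every remaining value is a string of digits. If a later download contained a value that is not numeric, `apply(int)` would raise an error. A more defensive sketch, assuming such a malformed value could appear (the 'Reality-TV' entry below is invented for illustration), uses `pd.to_numeric` with `errors='coerce'`:

```python
import pandas as pd

# Invented sample; the third entry is a deliberately malformed runtime.
runtime = pd.Series(['118', '70', 'Reality-TV', '95'], dtype=object)

# Non-numeric strings become NaN instead of raising an error.
converted = pd.to_numeric(runtime, errors='coerce')

bad_rows = runtime[converted.isna()]        # inspect what failed to convert
clean = converted.dropna().astype('int64')  # integers once the bad rows are gone

print(bad_rows.tolist())
print(clean.tolist())
```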

## e) Keep Full-Length Movies

In [21]:
print(f"There are {basics['titleType'].nunique()} unique values in 'titleType'")
basics['titleType'].unique().tolist()

There are 10 unique values in 'titleType'


['short',
 'movie',
 'tvEpisode',
 'tvSeries',
 'tvMovie',
 'tvShort',
 'tvMiniSeries',
 'tvSpecial',
 'video',
 'videoGame']

In [22]:
# Removing releases without type 'movie'
print(f'The basics dataframe shape: {basics.shape}')

print(f'Eliminating entries that are not full-length movies.')
is_movie = (basics['titleType'] == 'movie')
basics = basics.loc[ is_movie , : ]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (662427, 9)
Eliminating entries that are not full-length movies.
The basics dataframe shape: (150742, 9)


In [23]:
print(f"There are {basics['titleType'].nunique()} unique values in 'titleType'")
basics['titleType'].unique().tolist()

There are 1 unique values in 'titleType'


['movie']

## f) Keep Release Years 2000-2021

The 'startYear' column is of the object data type, so the values should be checked. Ideally, the years are integers. Because np.nan values are floats, I will convert any string numbers to floats to avoid an error in trying to convert np.nan values to integers.

In [24]:
startYear_null_count = basics['startYear'].isna().sum()
startYear_null_percent = 100*startYear_null_count/basics.shape[0]
startYear_min = basics['startYear'].apply(float).min()
startYear_max = basics['startYear'].apply(float).max()

print(f"The basics dataframe has {startYear_null_count} null startYear values.")
print(f"\nThe basics dataframe:")
print(basics['startYear'].apply(type).value_counts())
print(f"\nNull values constitute {startYear_null_percent:.2f}% of the basics dataframe")
print(f"The range of years is {startYear_min} through {startYear_max}")

The basics dataframe has 2088 null startYear values.

The basics dataframe:
<class 'str'>      148654
<class 'float'>      2088
Name: startYear, dtype: int64

Null values constitute 1.39% of the basics dataframe
The range of years is 1903.0 through 2027.0


I will drop the null values. They constitute less than 2% of the movies. Imputing any values could potentially be misleading, as characteristics of successful movies have changed in the past century.

By filtering the 'startYear' column with inequalities, the np.nan values will also be filtered out, because np.nan is neither less than nor greater than any numerical value.
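
The behavior of np.nan in comparisons can be confirmed on a small invented series. The sketch also uses `Series.between`, which is inclusive on both ends by default, as a compact way to express the same range check.

```python
import numpy as np
import pandas as pd

# Any comparison with np.nan evaluates to False.
print(np.nan >= 2000, np.nan <= 2021)

# Invented sample that mixes string years with a missing value.
start_year = pd.Series(['1999', '2000', np.nan, '2021', '2027'], dtype=object)
years = pd.to_numeric(start_year)

# True only for years from 2000 through 2021; the nan row is False.
in_range = years.between(2000, 2021)
print(in_range.tolist())
```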

In [25]:
print(f'The basics dataframe shape: {basics.shape}')
print(f'Eliminating entries that are before the year 2000.')
print(f'Eliminating entries that are after the year 2021.')

is_year_before_now = (basics['startYear'].apply(float) <= 2021)
is_year_since_2000 = (basics['startYear'].apply(float) >= 2000)

basics = basics.loc[ is_year_before_now & is_year_since_2000, : ]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (150742, 9)
Eliminating entries that are before the year 2000.
Eliminating entries that are after the year 2021.
The basics dataframe shape: (79294, 9)


Since all the remaining values should be integers, I can convert their datatype to match.

In [26]:
basics['startYear'] = basics['startYear'].apply(int)

print(f"There are {basics['startYear'].nunique()} years in 'startYear'")
basics['startYear'].sort_values().unique().tolist()

There are 22 years in 'startYear'


[2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021]

In [27]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())

The basics dataframe shape: (79294, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79294 entries, 34805 to 8951935
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79294 non-null  object
 1   titleType       79294 non-null  object
 2   primaryTitle    79294 non-null  object
 3   originalTitle   79294 non-null  object
 4   isAdult         79294 non-null  object
 5   startYear       79294 non-null  int64 
 6   endYear         0 non-null      object
 7   runtimeMinutes  79294 non-null  int64 
 8   genres          79294 non-null  object
dtypes: int64(2), object(7)
memory usage: 6.0+ MB


None

## g) Convert Adult Data Type

The 'isAdult' column is object type, but it is supposed to contain boolean values.

In [28]:
basics['isAdult'].unique()

array(['0', '1'], dtype=object)

Since there are no nan values, I can convert the values easily to integers.

In [29]:
basics['isAdult'] = basics['isAdult'].apply(int)
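
If true boolean values are wanted later, the strings need to pass through integers first, because any non-empty string, including '0', is truthy in Python. A short sketch on an invented series, using `astype` as a vectorized alternative to `apply(int)`:

```python
import pandas as pd

# Invented sample of the string flags seen in 'isAdult'.
is_adult = pd.Series(['0', '1', '0'], dtype=object)

# Pitfall: the string '0' is truthy, so a direct bool() call gives True.
print(bool('0'))

as_int = is_adult.astype('int64')   # '0' -> 0, '1' -> 1
as_bool = as_int.astype(bool)       # 0 -> False, 1 -> True

print(as_int.tolist())
print(as_bool.tolist())
```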

In [30]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())

The basics dataframe shape: (79294, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79294 entries, 34805 to 8951935
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79294 non-null  object
 1   titleType       79294 non-null  object
 2   primaryTitle    79294 non-null  object
 3   originalTitle   79294 non-null  object
 4   isAdult         79294 non-null  int64 
 5   startYear       79294 non-null  int64 
 6   endYear         0 non-null      object
 7   runtimeMinutes  79294 non-null  int64 
 8   genres          79294 non-null  object
dtypes: int64(3), object(6)
memory usage: 6.0+ MB


None

## h) Drop End Year Column

The 'endYear' column is supposed to contain years.

In [31]:
basics['endYear'].unique()

array([nan], dtype=object)

However, it only contains nan values; it is completely empty. So I will drop this column.

In [32]:
basics = basics.drop(columns=['endYear'])
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())

The basics dataframe shape: (79294, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79294 entries, 34805 to 8951935
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79294 non-null  object
 1   titleType       79294 non-null  object
 2   primaryTitle    79294 non-null  object
 3   originalTitle   79294 non-null  object
 4   isAdult         79294 non-null  int64 
 5   startYear       79294 non-null  int64 
 6   runtimeMinutes  79294 non-null  int64 
 7   genres          79294 non-null  object
dtypes: int64(3), object(5)
memory usage: 5.4+ MB


None
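
The same check can be made for every column at once instead of inspecting 'endYear' by name. The sketch below, on an invented frame, lists the columns that are entirely missing and drops them with `dropna`.

```python
import numpy as np
import pandas as pd

# Invented toy frame with one completely empty column.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2'],
    'startYear': [2001, 2020],
    'endYear': [np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_columns = toy.columns[toy.isna().all()].tolist()

# Drop every column that contains only missing values.
trimmed = toy.dropna(axis=1, how='all')

print(empty_columns)
print(trimmed.columns.tolist())
```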

## i) Propagate Filtering to Other Dataframes

Only the rows that passed the filtering of the basics dataframe should have corresponding entries in the other dataframes. Those that were dropped should also have their corresponding entries in the other dataframes dropped.

In [33]:
print(f'The akas dataframe shape: {akas.shape}')

# Filter out entries from the akas dataframe that are not in the basics dataframe
print(f'Eliminating entries that are not in the basics dataframe.')
akas_matches_basics = akas['titleId'].isin(basics['tconst'])
akas = akas.loc[ akas_matches_basics , :]

print(f'The akas dataframe shape: {akas.shape}')
akas.head()

The akas dataframe shape: (1324438, 8)
Eliminating entries that are not in the basics dataframe.
The akas dataframe shape: (88441, 8)


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
193605,tt0035423,35,Kate and Leopold,US,,,alternative spelling,0.0
193607,tt0035423,37,Kate & Leopold,US,,imdbDisplay,,0.0
454297,tt0062336,5,The Tango of the Widower and Its Distorting Mi...,US,,imdbDisplay,,0.0
521612,tt0069049,3,The Other Side of the Wind,US,,imdbDisplay,,0.0
698869,tt0088751,1,Attack of the B-Movie Monster,US,,working,,0.0


In [34]:
print(f'The ratings dataframe shape: {ratings.shape}')

# Filter out entries from the ratings dataframe that are not in the basics dataframe
print(f'Eliminating entries that are not in the basics dataframe.')
ratings_matches_basics =  ratings['tconst'].isin(basics['tconst'])
ratings = ratings.loc[ ratings_matches_basics , :]

print(f'The ratings dataframe shape: {ratings.shape}')
ratings.head()

The ratings dataframe shape: (471592, 3)
Eliminating entries that are not in the basics dataframe.
The ratings dataframe shape: (65730, 3)


Unnamed: 0,tconst,averageRating,numVotes
17916,tt0035423,6.4,83864
40761,tt0062336,6.4,161
46614,tt0069049,6.7,7238
63604,tt0088751,5.2,320
67385,tt0093119,2.9,1191
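
The ratings dataframe now has fewer rows (65730) than the basics dataframe (79294), which suggests that some of the remaining movies have no ratings entry, assuming the ratings file holds at most one row per title. One way to list such movies is a left merge with `indicator=True`. This is a sketch on invented toy frames and is not run against the real data here.

```python
import pandas as pd

# Invented toy frames; tt2 has no ratings entry.
toy_basics = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                           'primaryTitle': ['A', 'B', 'C']})
toy_ratings = pd.DataFrame({'tconst': ['tt1', 'tt3'],
                            'averageRating': [6.4, 6.7],
                            'numVotes': [83864, 7238]})

# The _merge column records whether each row was found in both frames.
merged = toy_basics.merge(toy_ratings, on='tconst', how='left', indicator=True)
unrated = merged.loc[merged['_merge'] == 'left_only', 'tconst'].tolist()

print(unrated)
```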


# Filter/Clean Ratings Dataframe

There are no filtering conditions that need to be initially applied to the ratings dataframe.

In [35]:
print(f'The ratings dataframe shape: {ratings.shape}\n')
display(ratings.info())
display(ratings.head())

The ratings dataframe shape: (65730, 3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65730 entries, 17916 to 1247884
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         65730 non-null  object 
 1   averageRating  65730 non-null  float64
 2   numVotes       65730 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 2.0+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
17916,tt0035423,6.4,83864
40761,tt0062336,6.4,161
46614,tt0069049,6.7,7238
63604,tt0088751,5.2,320
67385,tt0093119,2.9,1191


All the columns of the ratings dataframe agree with the expected data types.

# Review Dataframes

## a) Basics Dataframe

In [36]:
print(f'The basics dataframe shape: {basics.shape}\n')
display(basics.info())
display(basics.head())

The basics dataframe shape: (79294, 8)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79294 entries, 34805 to 8951935
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79294 non-null  object
 1   titleType       79294 non-null  object
 2   primaryTitle    79294 non-null  object
 3   originalTitle   79294 non-null  object
 4   isAdult         79294 non-null  int64 
 5   startYear       79294 non-null  int64 
 6   runtimeMinutes  79294 non-null  int64 
 7   genres          79294 non-null  object
dtypes: int64(3), object(5)
memory usage: 5.4+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,122,Drama
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,100,"Comedy,Horror,Sci-Fi"
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,74,"Horror,Music,Thriller"


All the basics column data types appear consistent with the value data types.

## b) Akas Dataframe

In [37]:
print(f'The akas dataframe shape: {akas.shape}\n')
display(akas.info())
display(akas.head())

The akas dataframe shape: (88441, 8)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88441 entries, 193605 to 32122788
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   titleId          88441 non-null  object 
 1   ordering         88441 non-null  int64  
 2   title            88441 non-null  object 
 3   region           88441 non-null  object 
 4   language         875 non-null    object 
 5   types            82276 non-null  object 
 6   attributes       4107 non-null   object 
 7   isOriginalTitle  88441 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 6.1+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
193605,tt0035423,35,Kate and Leopold,US,,,alternative spelling,0.0
193607,tt0035423,37,Kate & Leopold,US,,imdbDisplay,,0.0
454297,tt0062336,5,The Tango of the Widower and Its Distorting Mi...,US,,imdbDisplay,,0.0
521612,tt0069049,3,The Other Side of the Wind,US,,imdbDisplay,,0.0
698869,tt0088751,1,Attack of the B-Movie Monster,US,,working,,0.0


I will check the 'isOriginalTitle' column to see if any nan values survived all the filter propagation from other dataframes.

In [38]:
akas['isOriginalTitle'].unique()

array([0., 1.])

Since there are no nan values remaining, I can change the floats to ints.

In [39]:
akas['isOriginalTitle'] = akas['isOriginalTitle'].apply(int)
display(akas.info())
display(akas.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88441 entries, 193605 to 32122788
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   titleId          88441 non-null  object
 1   ordering         88441 non-null  int64 
 2   title            88441 non-null  object
 3   region           88441 non-null  object
 4   language         875 non-null    object
 5   types            82276 non-null  object
 6   attributes       4107 non-null   object
 7   isOriginalTitle  88441 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 6.1+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
193605,tt0035423,35,Kate and Leopold,US,,,alternative spelling,0
193607,tt0035423,37,Kate & Leopold,US,,imdbDisplay,,0
454297,tt0062336,5,The Tango of the Widower and Its Distorting Mi...,US,,imdbDisplay,,0
521612,tt0069049,3,The Other Side of the Wind,US,,imdbDisplay,,0
698869,tt0088751,1,Attack of the B-Movie Monster,US,,working,,0


All the akas expected value data types are consistent with the data dictionary, and the column data types match the value data types.

## c) Ratings Dataframe

In [40]:
print(f'The ratings dataframe shape: {ratings.shape}\n')
display(ratings.info())
display(ratings.head())

The ratings dataframe shape: (65730, 3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65730 entries, 17916 to 1247884
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         65730 non-null  object 
 1   averageRating  65730 non-null  float64
 2   numVotes       65730 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 2.0+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
17916,tt0035423,6.4,83864
40761,tt0062336,6.4,161
46614,tt0069049,6.7,7238
63604,tt0088751,5.2,320
67385,tt0093119,2.9,1191


All the ratings column data types appear consistent with the value data types.

# Saving the Dataframes

In [41]:
# This code is used to create a Data folder for saving downloaded data

# make new folder with os if it does not exist
import os
os.makedirs('Data', exist_ok = True)

# verify folder was created and list all files inside
os.listdir('Data/')

['title_akas.csv.gz', 'title_basics.csv.gz', 'title_ratings.csv.gz']

In [42]:
# save dataframes to file
basics.to_csv('Data/title_basics.csv.gz', compression='gzip', index=False)
akas.to_csv('Data/title_akas.csv.gz', compression='gzip', index=False)
ratings.to_csv('Data/title_ratings.csv.gz', compression='gzip', index=False)

The following code is only used to verify the dataframes are saved and can be retrieved as expected.

In [43]:
# open saved files for review
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)
akas = pd.read_csv('Data/title_akas.csv.gz', low_memory = False)
ratings = pd.read_csv('Data/title_ratings.csv.gz', low_memory = False)
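
Beyond reviewing the output by eye, the round trip can be checked programmatically. The sketch below saves an invented toy frame to a temporary folder, reloads it, and compares the two with `pd.testing.assert_frame_equal`. The index is reset before comparing because the files are written with `index=False`.

```python
import os
import tempfile
import pandas as pd

# Invented toy frame with a non-default index, as left behind by filtering.
toy = pd.DataFrame(
    {'tconst': ['tt0035423', 'tt0062336'],
     'averageRating': [6.4, 6.4],
     'numVotes': [83864, 161]},
    index=[17916, 40761],
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_ratings.csv.gz')
    toy.to_csv(path, compression='gzip', index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if values, dtypes, or column names differ.
pd.testing.assert_frame_equal(toy.reset_index(drop=True), reloaded)
print(reloaded.shape)
```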

In [44]:
# Review data loaded from files
print(f'The basics dataframe shape: {basics.shape}\n')
display(basics.info())
display(basics.head())
print(f'The akas dataframe shape: {akas.shape}\n')
display(akas.info())
display(akas.head())
print(f'The ratings dataframe shape: {ratings.shape}\n')
display(ratings.info())
display(ratings.head())

The basics dataframe shape: (79294, 8)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79294 entries, 0 to 79293
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79294 non-null  object
 1   titleType       79294 non-null  object
 2   primaryTitle    79294 non-null  object
 3   originalTitle   79294 non-null  object
 4   isAdult         79294 non-null  int64 
 5   startYear       79294 non-null  int64 
 6   runtimeMinutes  79294 non-null  int64 
 7   genres          79294 non-null  object
dtypes: int64(3), object(5)
memory usage: 4.8+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,100,"Comedy,Horror,Sci-Fi"
4,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,74,"Horror,Music,Thriller"


The akas dataframe shape: (88441, 8)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88441 entries, 0 to 88440
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   titleId          88441 non-null  object
 1   ordering         88441 non-null  int64 
 2   title            88441 non-null  object
 3   region           88441 non-null  object
 4   language         875 non-null    object
 5   types            82276 non-null  object
 6   attributes       4107 non-null   object
 7   isOriginalTitle  88441 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 5.4+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0035423,35,Kate and Leopold,US,,,alternative spelling,0
1,tt0035423,37,Kate & Leopold,US,,imdbDisplay,,0
2,tt0062336,5,The Tango of the Widower and Its Distorting Mi...,US,,imdbDisplay,,0
3,tt0069049,3,The Other Side of the Wind,US,,imdbDisplay,,0
4,tt0088751,1,Attack of the B-Movie Monster,US,,working,,0


The ratings dataframe shape: (65730, 3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65730 entries, 0 to 65729
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         65730 non-null  object 
 1   averageRating  65730 non-null  float64
 2   numVotes       65730 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.5+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,83864
1,tt0062336,6.4,161
2,tt0069049,6.7,7238
3,tt0088751,5.2,320
4,tt0093119,2.9,1191
