# Analysis of Successful Movies (Notebook 1)
* Benjamin Grossmann

This notebook downloads the IMDb datasets.
It then preprocesses and filters them to keep only the movies that meet the desired criteria.
Finally, it saves the reduced data set.

After the reduced data set has been saved, further work on this project should use Notebook 2. Starting from the saved files avoids repeating the download and filtering steps, so the data reaches a project-ready state much faster.

If the reduced data set ever needs to be reset to its initial state, re-run Notebook 1.

The IMDb files used:
* https://datasets.imdbws.com/title.basics.tsv.gz
* https://datasets.imdbws.com/title.akas.tsv.gz
* https://datasets.imdbws.com/title.ratings.tsv.gz

The data dictionary:
* https://www.imdb.com/interfaces/

# Load Data

In [1]:
import numpy as np
import pandas as pd
# from sqlalchemy import create_engine

In [2]:
%%time
# This code block can take around 5 minutes

# Store the data locations as strings
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

# Retrieve the data from online
basics = pd.read_csv(url_basics, sep='\t', low_memory=False)
akas = pd.read_csv(url_akas, sep='\t', low_memory=False)
ratings = pd.read_csv(url_ratings, sep='\t', low_memory=False)

Wall time: 6min 52s


In [3]:
# Display basic information about the data
print(f"The basics dataframe shape: {basics.shape}")
print(f"The akas dataframe shape: {akas.shape}")
print(f"The ratings dataframe shape: {ratings.shape}")

The basics dataframe shape: (8928653, 9)
The akas dataframe shape: (32014153, 8)
The ratings dataframe shape: (1245178, 3)


# Standardizing Missing Values

In [4]:
%%time
# This code block can take around 1 minute

# Replace the \N placeholders with np.nan
basics = basics.replace({'\\N':np.nan})
akas = akas.replace({'\\N':np.nan})
ratings = ratings.replace({'\\N':np.nan})

Wall time: 50.6 s
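
As an alternative to replacing the placeholders after loading, `pd.read_csv` accepts an `na_values` argument that treats chosen strings as missing while parsing. The sketch below is only an illustration: it uses a small in-memory TSV with made-up rows (`sample_tsv` and `sample` are hypothetical names) rather than the real IMDb files.

```python
import io
import pandas as pd

# A tiny stand-in for an IMDb TSV file; the rows are illustrative only
sample_tsv = "tconst\tstartYear\tendYear\ntt0000001\t1894\t\\N\ntt0000002\t\\N\t\\N\n"

# na_values tells the parser to read the \N placeholder as NaN
sample = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

# Count the missing values per column
print(sample.isna().sum())
```

Whether this is faster than a separate `replace` step on the full files has not been measured here.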


# Filter the Akas DataFrame

In [5]:
print(f'The akas dataframe shape: {akas.shape}')
display(akas.info())
display(akas.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32014153 entries, 0 to 32014152
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 1.9+ GB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,,imdbDisplay,,0
1,tt0000001,2,Carmencita,DE,,,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,,imdbDisplay,,0
3,tt0000001,4,Καρμενσίτα,GR,,imdbDisplay,,0
4,tt0000001,5,Карменсита,RU,,imdbDisplay,,0


## Keep US releases

First, the akas dataframe is filtered to keep only the rows whose 'region' value is 'US'.

In [6]:
print(f'The akas dataframe shape: {akas.shape}')

print(f'Eliminate entries that are not released in the US')
is_US = (akas['region']=='US')
akas = akas.loc[ is_US , :]

print(f'The akas dataframe shape: {akas.shape}')
akas.head()

The akas dataframe shape: (32014153, 8)
Eliminate entries that are not released in the US
The akas dataframe shape: (1322196, 8)


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


## Convert Data Types

The 'isOriginalTitle' column has the object data type, but it should hold boolean values.

In [7]:
akas['isOriginalTitle'].unique()

array(['0', nan, '1'], dtype=object)

The 'isOriginalTitle' column contains the numeric strings '0' and '1', which can be converted, but nan values are also present. Because nan is a float, I will convert the numeric strings to floats rather than to integers or booleans.

In [8]:
akas['isOriginalTitle'] = akas['isOriginalTitle'].apply(float)
print(f'The akas dataframe shape: {akas.shape}')
display(akas.info())
display(akas.head())

The akas dataframe shape: (1322196, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1322196 entries, 5 to 32013897
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1322196 non-null  object 
 1   ordering         1322196 non-null  int64  
 2   title            1322196 non-null  object 
 3   region           1322196 non-null  object 
 4   language         3529 non-null     object 
 5   types            1025559 non-null  object 
 6   attributes       43993 non-null    object 
 7   isOriginalTitle  1320821 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 90.8+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
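
If a true boolean column is preferred over floats, pandas also offers a nullable `'boolean'` dtype that can hold missing values. This is a sketch of that alternative, not the conversion used in this notebook; `flags` and `flags_bool` are hypothetical names, and the values mirror the unique values shown above.

```python
import numpy as np
import pandas as pd

# Illustrative values matching the unique values of 'isOriginalTitle'
flags = pd.Series(['0', np.nan, '1'], dtype=object)

# Map the numeric strings to booleans; nan has no mapping and stays missing
flags_bool = flags.map({'0': False, '1': True}).astype('boolean')

print(flags_bool)
```

Downstream code would then need to handle `pd.NA` instead of `np.nan` for the missing entries.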


# Filter/Clean the Basics Dataframe

In [9]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8928653 entries, 0 to 8928652
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 613.1+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


## Drop Missing Genres Or Runtimes

In [12]:
# Removing movies with missing values for genre or runtime
print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['runtimeMinutes'].isna().sum()} missing values in 'runtimeMinutes'")
print(f"There are {basics['genres'].isna().sum()} missing values in 'genres'")

has_runtimeMinutes = ~basics['runtimeMinutes'].isna()
has_genres = ~basics['genres'].isna()

print(f"Eliminate entries that are missing runtime or genre")
basics = basics.loc[ (has_runtimeMinutes) & (has_genres) , : ]

print(f"The basics dataframe shape: {basics.shape}")
print(f"There are {basics['runtimeMinutes'].isna().sum()} missing values in 'runtimeMinutes'")
print(f"There are {basics['genres'].isna().sum()} missing values in 'genres'")

The basics dataframe shape: (8928653, 9)
Eliminate entries that are missing runtime or genre
The basics dataframe shape: (2336224, 9)
There are 0 missing values in 'runtimeMinutes'
There are 0 missing values in 'genres'
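
The same filter can be written with `DataFrame.dropna` and its `subset` argument, which drops a row when any of the listed columns is missing. A minimal sketch on a toy dataframe (the rows and the names `toy` and `toy_complete` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the basics dataframe
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4'],
    'runtimeMinutes': ['90', np.nan, '45', np.nan],
    'genres': ['Drama', 'Comedy', np.nan, np.nan],
})

# Keep only rows that have both a runtime and a genre
toy_complete = toy.dropna(subset=['runtimeMinutes', 'genres'])

print(toy_complete)
```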


In [13]:
basics['runtimeMinutes'] = basics['runtimeMinutes'].apply(int)
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())

The basics dataframe shape: (2336224, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2336224 entries, 0 to 8928652
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  int64 
 8   genres          object
dtypes: int64(1), object(8)
memory usage: 178.2+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


## Keep Full-Length Movies

In [14]:
print(f"There are {basics['titleType'].nunique()} unique values in 'titleType'")
basics['titleType'].unique().tolist()

There are 10 unique values in 'titleType'


['short',
 'movie',
 'tvEpisode',
 'tvSeries',
 'tvShort',
 'tvMovie',
 'tvMiniSeries',
 'video',
 'tvSpecial',
 'videoGame']

In [15]:
# Keep only the rows whose 'titleType' is 'movie'
print(f'The basics dataframe shape: {basics.shape}')

print(f'Eliminate entries that are not full-length movies')
is_movie = (basics['titleType'] == 'movie')
basics = basics.loc[ is_movie , : ]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (2336224, 9)
Eliminate entries that are not full-length movies
The basics dataframe shape: (360501, 9)


In [16]:
print(f"There are {basics['titleType'].nunique()} unique values in 'titleType'")
basics['titleType'].unique().tolist()

There are 1 unique values in 'titleType'


['movie']

## Drop Documentaries

In [17]:
# Find all the genres present in the basics dataframe
# Because each entry can belong to multiple genres,
# some extra steps are needed to show each unique genre.

def splitter(n):
    return n.split(",")

genre_list = list(set(list(np.concatenate(list(map( splitter , basics['genres'].unique().tolist() ))).flat)))

print(f"There are {len(genre_list)} genres represented in 'genres'")
genre_list

There are 28 genres represented in 'genres'


['Documentary',
 'History',
 'Fantasy',
 'Sport',
 'Biography',
 'Romance',
 'Crime',
 'Game-Show',
 'Adult',
 'Western',
 'War',
 'Drama',
 'Reality-TV',
 'Film-Noir',
 'Horror',
 'Musical',
 'Music',
 'Adventure',
 'News',
 'Comedy',
 'Family',
 'Animation',
 'Action',
 'Short',
 'Talk-Show',
 'Sci-Fi',
 'Thriller',
 'Mystery']
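
The unique genres can also be collected with pandas string methods alone, which may be easier to read than the nested `map`/`np.concatenate` call. This is a sketch on made-up values (`toy_genres` and `unique_genres` are hypothetical names); unlike the cell above, it sorts the result.

```python
import pandas as pd

# Toy stand-in for basics['genres']
toy_genres = pd.Series(['Comedy,Drama', 'Drama', 'Action,Comedy,Sci-Fi'])

# Split each entry into a list, give each genre its own row, then de-duplicate
unique_genres = sorted(toy_genres.str.split(',').explode().unique())

print(f"There are {len(unique_genres)} genres represented")
print(unique_genres)
```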

In [18]:
# Remove rows whose 'genres' value contains the substring 'Documentary'
print(f'The basics dataframe shape: {basics.shape}')
print(f'Eliminate entries that are documentaries')

is_documentary = basics['genres'].str.contains('Documentary', case=False)
# note: case=False makes the match case-insensitive

basics = basics.loc[ ~is_documentary , :]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (360501, 9)
Eliminate entries that are documentaries
The basics dataframe shape: (274436, 9)
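
To make the effect of the `case` parameter concrete, here is a small sketch on made-up values (the names are hypothetical). With the default `case=True` only the exact capitalization matches; with `case=False` any capitalization matches.

```python
import pandas as pd

# Toy genre strings with mixed capitalization
toy_genres = pd.Series(['Documentary,Short', 'documentary', 'Drama'])

# Default behavior: the match is case-sensitive
case_sensitive = toy_genres.str.contains('Documentary')

# case=False: the match ignores capitalization
case_insensitive = toy_genres.str.contains('Documentary', case=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```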


In [19]:
# Re-evaluate the genre_list

genre_list = list(set(list(np.concatenate(list(map( splitter , basics['genres'].unique().tolist() ))).flat)))

print(f"There are {len(genre_list)} genres represented in 'genres'")
genre_list

There are 27 genres represented in 'genres'


['History',
 'Fantasy',
 'Sport',
 'Biography',
 'Romance',
 'Crime',
 'Game-Show',
 'Adult',
 'Western',
 'War',
 'Drama',
 'Reality-TV',
 'Film-Noir',
 'Horror',
 'Musical',
 'Music',
 'Adventure',
 'News',
 'Comedy',
 'Family',
 'Animation',
 'Action',
 'Short',
 'Talk-Show',
 'Sci-Fi',
 'Thriller',
 'Mystery']

The value 'Documentary' is no longer present.

## Keep Released 21st Century Movies

The 'startYear' column is of the object data type, so the values should be checked. Ideally, the years are integers. Because np.nan values are floats, I will convert any string numbers to floats to avoid an error in trying to convert np.nan values to integers.

In [20]:
startYear_null_count = basics['startYear'].isna().sum()
startYear_null_percent = 100*startYear_null_count/basics.shape[0]
startYear_min = basics['startYear'].apply(float).min()
startYear_max = basics['startYear'].apply(float).max()

print(f"The basics dataframe has {startYear_null_count} null startYear values.")
print(f"\nThe basics dataframe:")
print(basics['startYear'].apply(type).value_counts())
print(f"\nNull values constitute {startYear_null_percent:.2f}% of the basics dataframe")
print(f"The range of years is {startYear_min} through {startYear_max}")

The basics dataframe has 4155 null startYear values.

The basics dataframe:
<class 'str'>      270281
<class 'float'>      4155
Name: startYear, dtype: int64

Null values constitute 1.51% of the basics dataframe
The range of years is 1903.0 through 2027.0


I will drop the null values. They constitute only 1.51% of the movies. Imputing any values could potentially be extremely misleading, as characteristics of successful movies have changed in the past century.

Filtering 'startYear' with inequalities also removes the np.nan values, because any comparison with np.nan evaluates to False.

In [21]:
print(f'The basics dataframe shape: {basics.shape}')
print(f'Eliminate entries that are before the year 2000')
print(f'Eliminate entries that are after the year 2022')

is_released = (basics['startYear'].apply(float) <= 2022)
is_year_20XX = (basics['startYear'].apply(float) >= 2000)

basics = basics.loc[ is_released & is_year_20XX, : ]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe shape: (274436, 9)
Eliminate entries that are before the year 2000
Eliminate entries that are after the year 2022
The basics dataframe shape: (139319, 9)
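
The year filter relies on comparisons with np.nan being False. A compact way to see this, and to avoid converting the column twice, is `pd.to_numeric` followed by `Series.between`. The sketch below uses made-up years and hypothetical names; it is an alternative formulation, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for basics['startYear']: numeric strings plus a missing value
toy_years = pd.Series(['1999', '2005', np.nan, '2023'])

# Convert once; missing or unparseable entries become NaN
years = pd.to_numeric(toy_years, errors='coerce')

# between() is inclusive on both ends by default; NaN never satisfies it
in_range = years.between(2000, 2022)

print(in_range.tolist())
```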


In [31]:
basics['startYear'] = basics['startYear'].apply(int)

print(f"There are {basics['startYear'].nunique()} years in 'startYear'")
basics['startYear'].sort_values().unique().tolist()

There are 23 years in 'startYear'


[2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021,
 2022]

In [32]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())

The basics dataframe shape: (139319, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 139319 entries, 34805 to 8928502
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          139319 non-null  object
 1   titleType       139319 non-null  object
 2   primaryTitle    139319 non-null  object
 3   originalTitle   139319 non-null  object
 4   isAdult         139319 non-null  object
 5   startYear       139319 non-null  int64 
 6   endYear         0 non-null       object
 7   runtimeMinutes  139319 non-null  int64 
 8   genres          139319 non-null  object
dtypes: int64(2), object(7)
memory usage: 10.6+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
77968,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


## Keep US releases

In [34]:
print(f'The basics dataframe shape: {basics.shape}')

# Filter out entries from the basics dataframe that are not released in the US
print(f'Eliminate entries that were not released in the US')
is_basic_in_US =  basics['tconst'].isin(akas['titleId'])
basics = basics.loc[ is_basic_in_US , :]

print(f'The basics dataframe shape: {basics.shape}')
basics.head()

The basics dataframe shape: (139319, 9)
The basics dataframe shape: (81040, 9)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,,74,"Horror,Music,Thriller"
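
The `isin` filter acts like a semi-join: a row of basics is kept if its 'tconst' appears at least once in the akas 'titleId' column, and it is never duplicated even when akas lists several US titles for the same movie. A toy sketch with made-up identifiers (all names here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the basics and (US-only) akas dataframes
toy_basics = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                           'primaryTitle': ['A', 'B', 'C']})
toy_akas = pd.DataFrame({'titleId': ['tt1', 'tt1', 'tt3'],
                         'region': ['US', 'US', 'US']})

# True where the movie has at least one US entry in akas
in_us = toy_basics['tconst'].isin(toy_akas['titleId'])
toy_us = toy_basics.loc[in_us, :]

print(toy_us)
```

An inner merge on the same keys would instead return 'tt1' twice, once per matching akas row.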


## Convert 'isAdult' Data Type

The 'isAdult' column has the object data type, but it is supposed to contain boolean values.

In [35]:
basics['isAdult'].unique()

array(['0', '1'], dtype=object)

Since there are no nan values, I can convert the values easily to integers.

In [36]:
basics['isAdult'] = basics['isAdult'].apply(int)

In [37]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())

The basics dataframe shape: (81040, 9)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 81040 entries, 34805 to 8928418
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          81040 non-null  object
 1   titleType       81040 non-null  object
 2   primaryTitle    81040 non-null  object
 3   originalTitle   81040 non-null  object
 4   isAdult         81040 non-null  int64 
 5   startYear       81040 non-null  int64 
 6   endYear         0 non-null      object
 7   runtimeMinutes  81040 non-null  int64 
 8   genres          81040 non-null  object
dtypes: int64(3), object(6)
memory usage: 6.2+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,,74,"Horror,Music,Thriller"


## Drop 'endYear' Column

The 'endYear' column is supposed to contain years.

In [38]:
basics['endYear'].unique()

array([nan], dtype=object)

However, it contains only nan values, so I will drop this column.

In [39]:
basics = basics.drop(columns=['endYear'])
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())

The basics dataframe shape: (81040, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 81040 entries, 34805 to 8928418
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          81040 non-null  object
 1   titleType       81040 non-null  object
 2   primaryTitle    81040 non-null  object
 3   originalTitle   81040 non-null  object
 4   isAdult         81040 non-null  int64 
 5   startYear       81040 non-null  int64 
 6   runtimeMinutes  81040 non-null  int64 
 7   genres          81040 non-null  object
dtypes: int64(3), object(5)
memory usage: 5.6+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,122,Drama
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,100,"Comedy,Horror,Sci-Fi"
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,74,"Horror,Music,Thriller"
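
When it is not known in advance which columns are entirely empty, `DataFrame.dropna` with `axis=1` and `how='all'` removes every column that holds only missing values. A sketch on a toy dataframe (made-up rows, hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy dataframe: 'endYear' is entirely missing, 'genres' only partly
toy = pd.DataFrame({
    'startYear': [2001, 2020],
    'endYear': [np.nan, np.nan],
    'genres': ['Drama', np.nan],
})

# Drop only the columns in which every value is missing
toy_trimmed = toy.dropna(axis=1, how='all')

print(toy_trimmed.columns.tolist())
```

Naming the column explicitly, as done above with `drop(columns=['endYear'])`, remains the safer choice when the schema is known.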


# Filter/Clean Ratings Dataframe

In [40]:
print(f'The ratings dataframe shape: {ratings.shape}')
display(ratings.info())
display(ratings.head())

The ratings dataframe shape: (1245178, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1245178 entries, 0 to 1245177
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1245178 non-null  object 
 1   averageRating  1245178 non-null  float64
 2   numVotes       1245178 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 28.5+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1879
1,tt0000002,5.9,249
2,tt0000003,6.5,1655
3,tt0000004,5.8,162
4,tt0000005,6.2,2479


All the columns of the ratings dataframe agree with the expected data types.

## Keep US releases

In [43]:
print(f'The ratings dataframe shape: {ratings.shape}')

# Filter out entries from the ratings dataframe that are not released in the US
print(f'Eliminate entries that were not released in the US')
is_ratings_in_US =  ratings['tconst'].isin(akas['titleId'])
ratings = ratings.loc[ is_ratings_in_US , :]

print(f'The ratings dataframe shape: {ratings.shape}')
ratings.head()

The ratings dataframe shape: (1245178, 3)
Eliminate entries that were not released in the US
The ratings dataframe shape: (470464, 3)


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1879
1,tt0000002,5.9,249
4,tt0000005,6.2,2479
5,tt0000006,5.2,165
6,tt0000007,5.4,771


# Review Dataframes

In [44]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())
print(f'The akas dataframe shape: {akas.shape}')
display(akas.info())
display(akas.head())
print(f'The ratings dataframe shape: {ratings.shape}')
display(ratings.info())
display(ratings.head())

The basics dataframe shape: (81040, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 81040 entries, 34805 to 8928418
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          81040 non-null  object
 1   titleType       81040 non-null  object
 2   primaryTitle    81040 non-null  object
 3   originalTitle   81040 non-null  object
 4   isAdult         81040 non-null  int64 
 5   startYear       81040 non-null  int64 
 6   runtimeMinutes  81040 non-null  int64 
 7   genres          81040 non-null  object
dtypes: int64(3), object(5)
memory usage: 5.6+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,122,Drama
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,100,"Comedy,Horror,Sci-Fi"
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,74,"Horror,Music,Thriller"


The akas dataframe shape: (1322196, 8)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1322196 entries, 5 to 32013897
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1322196 non-null  object 
 1   ordering         1322196 non-null  int64  
 2   title            1322196 non-null  object 
 3   region           1322196 non-null  object 
 4   language         3529 non-null     object 
 5   types            1025559 non-null  object 
 6   attributes       43993 non-null    object 
 7   isOriginalTitle  1320821 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 90.8+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


The ratings dataframe shape: (470464, 3)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 470464 entries, 0 to 1245174
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         470464 non-null  object 
 1   averageRating  470464 non-null  float64
 2   numVotes       470464 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 14.4+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1879
1,tt0000002,5.9,249
4,tt0000005,6.2,2479
5,tt0000006,5.2,165
6,tt0000007,5.4,771


All the column data types appear consistent with the value data types.

# Saving the Dataframes

In [None]:
# # This code was used to create a Data folder for saving the downloaded data
# # Use this only if the Data folder does not exist already

# # make new folder with os
# import os
# os.makedirs('Data', exist_ok = True)
# # verify folder was created
# os.listdir('Data/')

In [45]:
# save dataframes to file
basics.to_csv('Data/title_basics.csv.gz', compression='gzip', index=False)
akas.to_csv('Data/title_akas.csv.gz', compression='gzip', index=False)
ratings.to_csv('Data/title_ratings.csv.gz', compression='gzip', index=False)

In [46]:
# open saved files and preview
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)
akas = pd.read_csv('Data/title_akas.csv.gz', low_memory = False)
ratings = pd.read_csv('Data/title_ratings.csv.gz', low_memory = False)
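
The save-and-reload round trip can be checked on a small example. The sketch below writes a toy dataframe to a gzip-compressed CSV inside a temporary folder and reads it back; the folder, file name, and rows are made up for illustration and do not touch the project's `Data/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe standing in for one of the reduced data sets
toy = pd.DataFrame({'tconst': ['tt1', 'tt2'], 'startYear': [2001, 2020]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder if it does not exist yet
    data_dir = Path(tmp_dir) / 'Data'
    data_dir.mkdir(parents=True, exist_ok=True)

    # Write without the index so no 'Unnamed: 0' column appears on reload
    out_path = data_dir / 'toy.csv.gz'
    toy.to_csv(out_path, compression='gzip', index=False)

    # pandas infers gzip compression from the .gz extension when reading
    restored = pd.read_csv(out_path)

print(restored)
```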

In [47]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())
print(f'The akas dataframe shape: {akas.shape}')
display(akas.info())
display(akas.head())
print(f'The ratings dataframe shape: {ratings.shape}')
display(ratings.info())
display(ratings.head())

The basics dataframe shape: (81040, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81040 entries, 0 to 81039
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          81040 non-null  object
 1   titleType       81040 non-null  object
 2   primaryTitle    81040 non-null  object
 3   originalTitle   81040 non-null  object
 4   isAdult         81040 non-null  int64 
 5   startYear       81040 non-null  int64 
 6   runtimeMinutes  81040 non-null  int64 
 7   genres          81040 non-null  object
dtypes: int64(3), object(5)
memory usage: 4.9+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,100,"Comedy,Horror,Sci-Fi"
4,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,74,"Horror,Music,Thriller"


The akas dataframe shape: (1322196, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1322196 entries, 0 to 1322195
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   titleId          1322196 non-null  object 
 1   ordering         1322196 non-null  int64  
 2   title            1322196 non-null  object 
 3   region           1322196 non-null  object 
 4   language         3529 non-null     object 
 5   types            1025559 non-null  object 
 6   attributes       43993 non-null    object 
 7   isOriginalTitle  1320821 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 80.7+ MB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


The ratings dataframe shape: (470464, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470464 entries, 0 to 470463
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         470464 non-null  object 
 1   averageRating  470464 non-null  float64
 2   numVotes       470464 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 10.8+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1879
1,tt0000002,5.9,249
2,tt0000005,6.2,2479
3,tt0000006,5.2,165
4,tt0000007,5.4,771
