# Analysis of Successful Movies (Notebook 1)
* Benjamin Grossmann

This notebook downloads the IMDb data sets.
It then preprocesses and filters the data to keep only the movies that meet the desired criteria.
The final step saves the reduced data set.

After the reduced data set has been saved, further work on this project should use Notebook 2. Starting from the saved files avoids repeating the download and filtering, which reduces the time needed to bring the data into a project-ready state.

If the reduced data set should need to be reset to its initial condition, then re-run Notebook 1.

The IMDb files being used:
* https://datasets.imdbws.com/title.basics.tsv.gz
* https://datasets.imdbws.com/title.akas.tsv.gz
* https://datasets.imdbws.com/title.ratings.tsv.gz

In [1]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

In [2]:
%%time
# This code block can take around 6 minutes
# (the %%time cell magic must be the first line of the cell)

# Store the data locations as strings
url_basics = 'https://datasets.imdbws.com/title.basics.tsv.gz'
url_akas = 'https://datasets.imdbws.com/title.akas.tsv.gz'
url_ratings = 'https://datasets.imdbws.com/title.ratings.tsv.gz'

# Retrieve the data from online
basics = pd.read_csv(url_basics, sep='\t', low_memory=False)
akas = pd.read_csv(url_akas, sep='\t', low_memory=False)
ratings = pd.read_csv(url_ratings, sep='\t', low_memory=False)

# Display basic information about the data
print('The basics dataframe:')
display(basics.info())
display(basics.head())
print('The akas dataframe:')
display(akas.info())
display(akas.head())
print('The ratings dataframe:')
display(ratings.info())
display(ratings.head())

basics:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8928653 entries, 0 to 8928652
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 613.1+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


akas:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32014153 entries, 0 to 32014152
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 1.9+ GB


None

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1245178 entries, 0 to 1245177
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1245178 non-null  object 
 1   averageRating  1245178 non-null  float64
 2   numVotes       1245178 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 28.5+ MB


None

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1879
1,tt0000002,5.9,249
2,tt0000003,6.5,1655
3,tt0000004,5.8,162
4,tt0000005,6.2,2479


Wall time: 5min 56s
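
The load above holds every row of each file in memory (about 1.9 GB for akas alone). As a rough sketch of one way to reduce that, `pd.read_csv` accepts a `chunksize` argument so each piece can be filtered before it is kept. The tiny `sample_tsv` string and the names `chunks` and `movies` are made up for illustration; this is not how the notebook loaded the data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few rows of title.basics.tsv (\N marks missing values)
sample_tsv = (
    "tconst\ttitleType\tgenres\n"
    "tt1\tshort\tComedy\n"
    "tt2\tmovie\tDrama\n"
    "tt3\tmovie\t\\N\n"
    "tt4\ttvEpisode\tDrama\n"
)

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
chunks = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N", chunksize=2)

# Keep only the 'movie' rows of each chunk, then stitch the pieces together
movies = pd.concat(
    (chunk[chunk["titleType"] == "movie"] for chunk in chunks),
    ignore_index=True,
)

print(movies)
```

With the real files, the `io.StringIO(...)` argument would be replaced by the URL or a local path, and the chunk size would be much larger.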


In [3]:
%%time
# Replace the \N placeholders with np.nan
basics = basics.replace({'\\N':np.nan})
akas = akas.replace({'\\N':np.nan})
ratings = ratings.replace({'\\N':np.nan})
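
The replacement above scans each full dataframe after loading. A possible alternative, sketched here on a made-up two-row sample, is to pass `na_values` to `pd.read_csv` so the `\N` placeholder is read as missing from the start. The `sample_tsv` string and the `sample` name are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few rows of title.basics.tsv
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\t1894\t\\N\t1\n"
    "tt0000002\t\\N\t\\N\t5\n"
)

# na_values adds the literal string \N to the values that read_csv treats as missing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the missing values per column
print(sample.isna().sum())
```

A side effect is that columns without any missing values, such as `runtimeMinutes` in this sample, are parsed as numbers rather than as strings.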

In [14]:
print(f'The basics dataframe shape: {basics.shape}')
display(basics.info())
display(basics.head())
# print('The akas dataframe:')
# display(akas.info())
# display(akas.head())
# print('The ratings dataframe:')
# display(ratings.info())
# display(ratings.head())

The basics dataframe: (8928653, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8928653 entries, 0 to 8928652
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 613.1+ MB


None

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [18]:
# Removing movies with missing values for genre or runtime

print(f'The basics dataframe shape: {basics.shape}')

has_runtimeMinutes = ~basics['runtimeMinutes'].isna()
has_genres = ~basics['genres'].isna()

print(f'Eliminate entries that are missing runtime or genre')
basics = basics.loc[ (has_runtimeMinutes) & (has_genres) , : ]

print(f'The basics dataframe shape: {basics.shape}')

The basics dataframe: (2336224, 9)
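
The boolean-mask filter above can also be written with `DataFrame.dropna` and its `subset` parameter. The sketch below compares the two on a small made-up dataframe; the names `toy`, `kept_mask`, and `kept_dropna` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: only tt1 has both a runtime and a genre
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "runtimeMinutes": ["90", np.nan, "45", np.nan],
    "genres": ["Drama", "Comedy", np.nan, np.nan],
})

# Boolean-mask version, mirroring the cell above
mask = toy["runtimeMinutes"].notna() & toy["genres"].notna()
kept_mask = toy.loc[mask, :]

# dropna version: drop a row if either listed column is missing
kept_dropna = toy.dropna(subset=["runtimeMinutes", "genres"])

print(kept_mask.equals(kept_dropna))
```

Both forms keep the original index labels, so the two results can be compared directly.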


In [65]:
basics['titleType'].unique().tolist()

['movie']

In [20]:
# Keep only the rows whose titleType is exactly the string 'movie'
print(f'The basics dataframe shape: {basics.shape}')

print(f'Eliminate entries that are not full-length movies')
is_movie = (basics['titleType'] == 'movie')
basics = basics.loc[ is_movie , : ]

print(f'The basics dataframe shape: {basics.shape}')

Eliminate movies that are not full-length
The basics dataframe: (360501, 9)


In [81]:
# Find all the genres present in the basics dataframe
# Because each entry can belong to multiple genres,
# some extra steps are needed to show each unique genre.

# Split each comma-separated genre string, flatten the lists, and collect the unique genres
set(basics['genres'].str.split(',').explode())

{'Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western'}

In [30]:
# Removing rows with a sub-string value
print(f'The basics dataframe shape: {basics.shape}')
print(f'Eliminate entries that are documentaries')

is_documentary = basics['genres'].str.contains('Documentary', case=False)
# note: case=False makes the match case-insensitive (the default, case=True, is case-sensitive)

basics = basics.loc[ ~is_documentary , :]

print(f'The basics dataframe shape: {basics.shape}')

Eliminate movies that are Documentaries
The basics dataframe: (274436, 9)
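
To make the effect of the `case` parameter in `Series.str.contains` concrete, here is a small sketch on made-up genre strings. The `genres` series and its lowercase 'documentary' entry are invented for illustration; the notebook does not show whether such spellings occur in the IMDb data.

```python
import pandas as pd

# Made-up genre strings, including one lowercase spelling
genres = pd.Series(["Documentary,Short", "Comedy,Drama", "documentary", "Action"])

# case=True (the default) matches the exact capitalization only
case_sensitive = genres.str.contains("Documentary", case=True)

# case=False ignores capitalization, as in the cell above
case_insensitive = genres.str.contains("Documentary", case=False)

# Rows that survive the case-insensitive documentary filter
print(genres[~case_insensitive].tolist())
```

If the column could still contain missing values at this point, the `na` parameter of `str.contains` would control how those rows are treated.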


The 'startYear' column is of the object data type, so the values should be checked. Ideally, the years are integers.

In [84]:
print(f"The basics dataframe has {basics['startYear'].isna().sum()} null values.")
print(f"\nThe basics dataframe:")
print(basics['startYear'].apply(type).value_counts())
percent_null = 100*basics['startYear'].isna().sum()/basics.shape[0]
print(f"\nNull values constitute {percent_null:.2f}% of the current basics dataframe")

The basics dataframe has 4155 null values.

The basics dataframe:
<class 'str'>      270281
<class 'float'>      4155
Name: startYear, dtype: int64

Null values constitute 1.51% of the current basics dataframe


I will drop the rows with a null 'startYear'.

In [None]:
# Preview the startYear values of entries from 2000 onward
basics.loc[basics['startYear'].astype(float) >= 2000, 'startYear']
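
One way the null-year rows might be dropped and the remaining years stored as integers is sketched below on a made-up dataframe. The names `toy` and `recent` are hypothetical, and the 2000 threshold is simply taken from the preview cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows: years stored as strings, one of them missing
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "startYear": ["1999", "2000", np.nan, "2015"],
})

# Drop rows whose startYear is missing, then store the years as integers
toy = toy.dropna(subset=["startYear"]).copy()
toy["startYear"] = toy["startYear"].astype(int)

# Keep titles from 2000 onward
recent = toy.loc[toy["startYear"] >= 2000, :]

print(recent)
```

Converting after the nulls are removed avoids the float representation that NaN would otherwise force on the column.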

In [None]:
# This code was used to create a Data folder for saving downloaded data


# # make new folder with os
# import os
# os.makedirs('Data', exist_ok = True)
# # verify folder was created
# os.listdir('Data/')

In [None]:
# save dataframes to file
basics.to_csv('Data/title_basics.csv.gz', compression='gzip', index=False)
akas.to_csv('Data/title_akas.csv.gz', compression='gzip', index=False)
ratings.to_csv('Data/title_ratings.csv.gz', compression='gzip', index=False)

In [None]:
# open saved files and preview
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)
akas = pd.read_csv('Data/title_akas.csv.gz', low_memory = False)
ratings = pd.read_csv('Data/title_ratings.csv.gz', low_memory = False)
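
The save-and-reload round trip can be tried on a small scale before running it on the full dataframes. This sketch writes a made-up dataframe to a gzip-compressed CSV in a temporary folder, which stands in for the `Data/` folder, and reads it back. The names `toy`, `path`, and `reloaded` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for a slice of the ratings dataframe
toy = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": [5.7, 6.5]})

# A temporary folder stands in for the notebook's Data/ folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "title_ratings.csv.gz")
    toy.to_csv(path, compression="gzip", index=False)
    # read_csv infers gzip compression from the .gz extension
    reloaded = pd.read_csv(path)

print(reloaded)
```

Note that `to_csv` does not create missing folders, so the `Data/` folder has to exist before the save cell runs.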

In [None]:
display(basics.head())
display(akas.head())
display(ratings.head())