# **Movie Production Business Analysis - IMDB ETL**

- Yvon Bilodeau
- May 2022

## **Business Problem**

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.


## **The Data**

IMDB provides several files with varied information on movies, TV shows, made-for-TV movies, and other titles.

Overview/Data Dictionary: [https://www.imdb.com/interfaces/](https://www.imdb.com/interfaces/)

Downloads page: [https://datasets.imdbws.com/](https://datasets.imdbws.com/)
        
Based on their previous research, the stakeholder wants to focus on the following files:

- title.basics.tsv.gz
- title.ratings.tsv.gz
- title.akas.tsv.gz

### **Specifications**

Your stakeholder only wants you to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (exclude the Documentary genre).
- Include only movies released from 2000 through 2021, inclusive.
- Include only movies released in the United States.

### **Download**

In [1]:
# Create url variables
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
ratings_url ="https://datasets.imdbws.com/title.ratings.tsv.gz"
akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"
#akas_url = "C:\Users\DELL\Downloads\title.akas.tsv.gz"

### **Preprocessing**

In [2]:
import numpy as np
import pandas as pd
import os

#### **Title Basics**

In [10]:
# Create dataframe
basics_df = pd.read_csv(basics_url,sep='\t', low_memory=False)

- Replace "\N" with np.nan
- Eliminate movies that are null for runtimeMinutes
- Eliminate movies that are null for genres
- Keep only titleType == "movie"
- Keep startYear 2000-2021 (inclusive)
- Eliminate movies that include "Documentary" in genres
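As a rough sketch, the steps listed above can be collected into a single function and checked against a tiny hand-made table. The function name `filter_basics` and every row in `sample` are made up for illustration; only the column names come from the IMDB file.

```python
import numpy as np
import pandas as pd


def filter_basics(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed filters to a title.basics-style dataframe (illustrative helper)."""
    # IMDB marks missing values with the literal string "\N"
    df = df.replace({'\\N': np.nan})
    # Drop rows missing runtime, genres, or year (the year is cast to int below)
    df = df.dropna(subset=['runtimeMinutes', 'genres', 'startYear'])
    # Keep full-length movies only
    df = df.loc[df['titleType'] == 'movie'].copy()
    # Keep 2000 through 2021, both ends included
    df['startYear'] = df['startYear'].astype(int)
    df = df.loc[df['startYear'].between(2000, 2021)]
    # Drop anything whose genres string mentions Documentary
    is_documentary = df['genres'].str.contains('documentary', case=False)
    return df.loc[~is_documentary]


# Hypothetical rows, one per filter rule
sample = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5', 'tt6'],
    'titleType': ['movie', 'short', 'movie', 'movie', 'movie', 'movie'],
    'startYear': ['2001', '2005', '1999', '2010', '2015', '\\N'],
    'runtimeMinutes': ['118', '5', '90', '\\N', '80', '100'],
    'genres': ['Comedy', 'Short', 'Drama', 'Drama', 'Documentary', 'Drama'],
})
filtered = filter_basics(sample)
print(filtered['tconst'].tolist())
```

Only `tt1` passes every rule in this sample, which makes it easy to confirm each filter before running the same steps on the full file.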

In [12]:
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [13]:
# Replace "\N" with np.nan
basics_df.replace({'\\N':np.nan},inplace=True)
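An alternative, shown here only as a sketch, is to let `pd.read_csv` treat `\N` as missing while the file is being parsed, through its `na_values` parameter. The two rows in `sample_tsv` are invented for the example and use just a subset of the real columns.

```python
import io

import pandas as pd

# Made-up, tab-separated sample in the style of title.basics
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\truntimeMinutes\n"
    "tt1\tshort\t1894\t\\N\t1\n"
    "tt2\tshort\t1892\t\\N\t\\N\n"
)

# na_values tells the parser which strings to read as NaN
df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')
print(df.isna().sum())
```

A side effect of this approach is that numeric columns such as `runtimeMinutes` are parsed as numbers straight away instead of as strings.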

In [14]:
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [15]:
# Eliminate movies that are null for runtimeMinutes, genres, or startYear
basics_df = basics_df.dropna(subset = ['runtimeMinutes', 'genres', 'startYear'])

In [16]:
# Keep only titleType == 'movie'
basics_df = basics_df.loc[basics_df['titleType'] == 'movie']

In [17]:
basics_df.head(20)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"
1172,tt0001184,movie,Don Juan de Serrallonga,Don Juan de Serrallonga,0,1910,,58,"Adventure,Drama"
1273,tt0001285,movie,The Life of Moses,The Life of Moses,0,1909,,50,"Biography,Drama,Family"
1485,tt0001498,movie,The Battle of Trafalgar,The Battle of Trafalgar,0,1911,,51,War
1578,tt0001592,movie,In the Prime of Life,Ekspeditricen,0,1911,,52,Drama
1773,tt0001790,movie,"Les Misérables, Part 1: Jean Valjean",Les misérables - Époque 1: Jean Valjean,0,1913,,60,Drama
1794,tt0001812,movie,Oedipus Rex,Oedipus Rex,0,1911,,56,Drama
1872,tt0001892,movie,Den sorte drøm,Den sorte drøm,0,1911,,53,Drama


In [18]:
# Keep startYear 2000-2021 (inclusive)
basics_df['startYear'] = basics_df['startYear'].astype(int)
basics_df = basics_df.loc[(basics_df['startYear'] >= 2000) & (basics_df['startYear'] <=2021)]

In [19]:
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
77968,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [20]:
# Eliminate movies that include "Documentary" in genre
documentary_filter = basics_df['genres'].str.contains('documentary', case=False)
basics_df = basics_df[~documentary_filter]

In [21]:
basics_df.head(20)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34805,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
61119,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
67672,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
77968,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
86806,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
87119,tt0089067,movie,El día de los albañiles 2,El día de los albañiles 2,0,2001,,90,Comedy
90923,tt0092960,movie,En tres y dos,En tres y dos,0,2004,,102,Drama
91077,tt0093119,movie,Grizzly II: Revenge,Grizzly II: The Predator,0,2020,,74,"Horror,Music,Thriller"
92773,tt0094859,movie,Chief Zabu,Chief Zabu,0,2016,,74,Comedy
93944,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama


In [22]:
# Make directory
os.makedirs('Data/',exist_ok=True) 

In [23]:
# Confirm directory created
os.listdir("Data/")

['title_basics.csv.gz', 'title_ratings.csv.gz']

In [24]:
## Save dataframe to file
basics_df.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)

In [25]:
# Open saved file and preview again
basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0079644,movie,November 1828,November 1828,0,2001,,140,"Drama,War"
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [26]:
del basics_df

#### **Title Ratings**

- Keep only US entries (matched through the titleId values in title.akas, since this file has no region column)
- Replace "\N" with np.nan

In [27]:
# Create dataframe
ratings_df = pd.read_csv(ratings_url,sep='\t', low_memory=False)

In [28]:
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1874
1,tt0000002,5.9,248
2,tt0000003,6.5,1647
3,tt0000004,5.8,160
4,tt0000005,6.2,2475


In [29]:
# Replace "\N" with np.nan
ratings_df.replace({'\\N':np.nan},inplace=True)

In [30]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1237911 entries, 0 to 1237910
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1237911 non-null  object 
 1   averageRating  1237911 non-null  float64
 2   numVotes       1237911 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 28.3+ MB


In [31]:
## Save dataframe to file
ratings_df.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)

In [32]:
# Open saved file and preview again
ratings_df = pd.read_csv("Data/title_ratings.csv.gz", low_memory=False)
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1874
1,tt0000002,5.9,248
2,tt0000003,6.5,1647
3,tt0000004,5.8,160
4,tt0000005,6.2,2475


In [33]:
del ratings_df

#### **Title Akas**

- Keep only US entries (region == "US")
- Replace "\N" with np.nan (if any)

In [4]:
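# Create dataframe: read the large file in chunks and keep only US rows
dl_file = "C:/Users/DELL/Downloads/title.akas.tsv.gz"
akas_chunks = pd.read_csv(dl_file, sep='\t', low_memory=False, chunksize=100000)
akas_df = pd.concat([chunk[chunk['region'] == 'US'] for chunk in akas_chunks])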
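Because title.akas is much larger than the other two files, one option is to stream it in chunks and keep only the US rows as they are read. When `chunksize` is passed, `pd.read_csv` returns an iterator of dataframes rather than a single dataframe, so the chunks have to be filtered and concatenated before calls like `.head()` will work. The sketch below runs on a small in-memory sample; the rows are invented, and the `region` column name is taken from IMDB's data dictionary rather than from this notebook.

```python
import io

import pandas as pd

# Made-up sample in the style of title.akas (subset of columns)
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt1\t1\tExample One\tUS\n"
    "tt1\t2\tBeispiel Eins\tDE\n"
    "tt2\t1\tExemple Deux\tFR\n"
    "tt3\t1\tExample Three\tUS\n"
    "tt3\t2\tExample Three\t\\N\n"
)

# With chunksize set, read_csv yields dataframes of at most 2 rows each
reader = pd.read_csv(io.StringIO(sample_tsv), sep='\t', chunksize=2)

# Filter each chunk, then stitch the surviving rows together
us_chunks = [chunk[chunk['region'] == 'US'] for chunk in reader]
akas_us = pd.concat(us_chunks, ignore_index=True)
print(akas_us)
```

Filtering inside the loop means only the US rows are ever held in memory at once, which is the main reason to prefer this over loading the whole file first.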

In [None]:
akas_df.head()

In [None]:
akas_df.info()

In [None]:
# Replace "\N" with np.nan
akas_df.replace({'\\N':np.nan},inplace=True)

In [None]:
akas_df.info()

In [None]:
# Save dataframe to file
akas_df.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)

In [None]:
# Open saved file and preview again
akas_df = pd.read_csv("Data/title_akas.csv.gz", low_memory=False)
akas_df.head()

In [None]:
# Open saved file and preview again
# basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
# basics_df.head()

In [None]:
# Filter the basics table down to US releases only, using the filtered akas dataframe
#keepers = basics_df['tconst'].isin(akas_df['titleId'])
#keepers

In [None]:
#basics_df = basics_df[keepers]
#basics_df
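The final US-only filter can be tried out on toy data before it is run on the saved files. This is a minimal sketch: all three dataframes are hypothetical stand-ins, and `akas_us` is assumed to already contain only US rows.

```python
import pandas as pd

# Hypothetical stand-ins for the saved basics, ratings, and US-only akas tables
basics = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'primaryTitle': ['Example One', 'Exemple Deux', 'Example Three'],
})
ratings = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt4'],
    'averageRating': [6.1, 7.2, 5.5],
    'numVotes': [100, 250, 40],
})
akas_us = pd.DataFrame({'titleId': ['tt1', 'tt3', 'tt3']})

# isin builds a boolean mask; duplicate titleIds in akas do not duplicate rows
basics_us = basics[basics['tconst'].isin(akas_us['titleId'])]
ratings_us = ratings[ratings['tconst'].isin(akas_us['titleId'])]

print(basics_us['tconst'].tolist())
print(ratings_us['tconst'].tolist())
```

Using `isin` rather than a merge keeps one row per title even when akas lists several US titles for the same `titleId`.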