![TMDB_logo.svg](attachment:TMDB_logo.svg)

# <center> TMDB Project </center>

### Objective:
The objective of this project is to build a MySQL database of movies from a subset of TMDB's publicly available dataset. We will explore and analyze this database to understand which qualities and characteristics make a movie successful, then use the key findings to give stakeholders data-driven recommendations on how to make a successful movie.

- Part 1: Download several files from IMDb's movie dataset and filter them down to the subset of movies requested by the stakeholder.
- Part 2: Use an API to extract box office revenue and profit data to add to our IMDb data and perform exploratory data analysis.
- Part 3: Construct and export a MySQL database using our data.
- Part 4: Apply hypothesis testing to explore what makes a movie successful.
- Part 5: Produce a linear regression model to predict a movie's performance.


### Data Contributions

Subsets of the IMDb data are refreshed daily and made available to customers for personal and non-commercial use.

Link to the IMDb's Non-Commercial Datasets: [Click Here](https://datasets.imdbws.com/)

Link to IMDb's Dataset Overview/Dictionaries: [Click Here](https://developer.imdb.com/non-commercial-datasets/)

Link to IMDb's Non-Commercial Licensing: [Click Here](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1ed1aea6-d2ad-4705-95fd-ba13f1b5014f&pf_rd_r=XRE3QWF2G5YWTD2SGT0V&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1)




## Import Python Libraries

In [1]:
import pandas as pd
import numpy as np

### Uploading IMDb Data

Our stakeholders are interested only in the following files:

- title.basics.tsv.gz
- title.ratings.tsv.gz
- title.akas.tsv.gz

In [2]:
# creating variables for the datasets so that we can load the data directly into Pandas using the URLs
basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
akas_url="https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url="https://datasets.imdbws.com/title.ratings.tsv.gz"

### Title Basics Dictionary

**Column** | **Description**
--- | ---
tconst (string) | alphanumeric unique identifier of the title
titleType (string) | the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string) | the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string) | original title, in the original language
isAdult (boolean) | 0: non-adult title; 1: adult title
startYear (YYYY) | represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY) | TV Series end year. ‘\\N’ for all other title types
runtimeMinutes | primary runtime of the title, in minutes
genres (string array) | includes up to three genres associated with the title

In [4]:
# Loading the basics data
# using sep='\t' to indicate it's a .tsv file and not .csv
# using low_memory=False to avoid warnings about mixed datatypes since it's a large file
basics_ = pd.read_csv(basics_url, sep='\t', low_memory=False)

In [5]:
# creating a copy
# displaying first 2 rows
basics_df = basics_.copy()
basics_df.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
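
The files use `\N` as a placeholder for missing values (see the `endYear` column above). As a possible alternative to cleaning this up after loading, pandas can treat the placeholder as missing while parsing. The sketch below is not part of the original workflow: `sample_tsv` is a small hypothetical in-memory sample (the `tt9999999` row is made up) standing in for the real download, and it only illustrates how `sep='\t'` and `na_values` behave together.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as title.basics
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\t1894\t\\N\t1\n"
    "tt0035423\tmovie\t2001\t\\N\t118\n"
    "tt9999999\tmovie\t\\N\t\\N\t\\N\n"
)

# na_values tells pandas to read the literal string \N as a missing value
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Columns that only hold numbers and missing values are parsed as float64
print(sample_df.dtypes)
print(sample_df.isna().sum())
```

With this option the numeric columns load as numbers rather than `object`, which may also reduce the need for later type conversions.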


### Title Akas Dictionary

**Column** | **Description**
--- | ---
titleId (string) | a tconst, an alphanumeric unique identifier of the title
ordering (integer) | a number to uniquely identify rows for a given titleId
title (string) | the localized title
region (string) | the region for this version of the title
language (string) | the language of the title
types (array) | Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
attributes (array) | Additional terms to describe this alternative title, not enumerated
isOriginalTitle (boolean) | 0: not original title; 1: original title

In [6]:
# Loading the akas data
# using sep='\t' to indicate it's a .tsv file and not .csv
# using low_memory=False to avoid warnings about mixed datatypes since it's a large file
akas_ = pd.read_csv(akas_url, sep='\t', low_memory=False)

In [7]:
# creating a copy
# displaying first 2 rows
akas_df = akas_.copy()
akas_df.head(2)

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0


### Title Ratings Dictionary

**Column** | **Description**
--- | ---
tconst (string) | alphanumeric unique identifier of the title
averageRating | weighted average of all the individual user ratings
numVotes | number of votes the title has received

In [8]:
# Loading the ratings data
# using sep='\t' to indicate it's a .tsv file and not .csv
# using low_memory=False to avoid warnings about mixed datatypes since it's a large file
ratings_ = pd.read_csv(ratings_url, sep='\t', low_memory=False)

In [9]:
# creating a copy
# displaying first 2 rows
ratings_df = ratings_.copy()
ratings_df.head(2)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,264


### Inspecting and Cleaning Data

- Title Akas Dataset

In [11]:
# verifying how many rows/columns, datatypes, missing items and duplicate rows
print(akas_df.info())
print(('-'*30))
print(f'There are {akas_df.duplicated().sum()} duplicate rows.')
print(('-'*30))
print(f'There are {akas_df.isna().sum().sum()} missing values.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36993302 entries, 0 to 36993301
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.2+ GB
None
------------------------------
There are 0 duplicate rows.
------------------------------
There are 121 missing values.
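
One caveat about the count above: `isna()` only sees true nulls, so any `\N` placeholder strings still in the data are not counted as missing at this stage. The toy frame below (hypothetical values, not taken from the dataset) is a minimal sketch of how the count changes once the placeholders are replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three "\N" placeholders and one real null (None)
toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2"],
    "language": ["\\N", "en", "\\N"],
    "attributes": ["\\N", None, "literal title"],
})

# Before replacement, only the real null is counted
missing_before = int(toy.isna().sum().sum())

# After replacement, the placeholders are counted as well
missing_after = int(toy.replace({"\\N": np.nan}).isna().sum().sum())

print(f"missing before: {missing_before}, missing after: {missing_after}")
```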


In [12]:
# replacing all \N values with np.nan
akas_df.replace({'\\N':np.nan}, inplace=True)

In [13]:
# filtering for only US region
akas_df = akas_df[akas_df['region'] == 'US']

In [14]:
# verifying to see if filtered correctly
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0


In [15]:
# verifying final akas data
akas_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1462273 entries, 5 to 36993046
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1462273 non-null  object
 1   ordering         1462273 non-null  int64 
 2   title            1462273 non-null  object
 3   region           1462273 non-null  object
 4   language         4115 non-null     object
 5   types            983139 non-null   object
 6   attributes       47379 non-null    object
 7   isOriginalTitle  1460931 non-null  object
dtypes: int64(1), object(7)
memory usage: 100.4+ MB


- Title Basics Dataset

In [16]:
# verifying how many rows/columns, datatypes, missing items and duplicate rows
print(basics_df.info())
print(('-'*30))
print(f'There are {basics_df.duplicated().sum()} duplicate rows.')
print(('-'*30))
print(f'There are {basics_df.isna().sum().sum()} missing values.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10115816 entries, 0 to 10115815
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 694.6+ MB
None
------------------------------
There are 0 duplicate rows.
------------------------------
There are 39 missing values.


In [17]:
# replacing all \N values with np.nan
basics_df.replace({'\\N':np.nan}, inplace=True)

In [18]:
# dropping any rows that have a null value in the runtimeMinutes or genres column (nulls in other columns are kept)
basics_df.dropna(subset=['runtimeMinutes','genres'], inplace=True)

In [19]:
# convert datatype from object to float
basics_df['startYear'] = basics_df['startYear'].astype(float)
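
The two cleaning steps above can be sketched on a toy frame (hypothetical values). It illustrates that `dropna(subset=...)` drops a row when either listed column is null, and suggests why `float` is a convenient target type here: rows with a missing `startYear` survive the drop, and `NaN` cannot be stored in a plain integer column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the relevant title.basics columns
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": ["2001", "2010", "2019", "1999", np.nan],
    "runtimeMinutes": ["118", "90", np.nan, "75", "60"],
    "genres": ["Comedy", np.nan, "Drama", "Drama", "Horror"],
})

# tt2 (null genres) and tt3 (null runtimeMinutes) are dropped;
# tt5 is kept because startYear is not part of the subset
kept = toy.dropna(subset=["runtimeMinutes", "genres"]).copy()

# Year strings become floats; the missing year stays NaN
kept["startYear"] = kept["startYear"].astype(float)

print(kept)
```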

In [20]:
# filtering for only titleType "movie"
basics_df = basics_df.loc[basics_df['titleType'] == 'movie']
# printing info to make sure startYear was changed to float
basics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 387995 entries, 8 to 10115766
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          387995 non-null  object 
 1   titleType       387995 non-null  object 
 2   primaryTitle    387995 non-null  object 
 3   originalTitle   387995 non-null  object 
 4   isAdult         387995 non-null  object 
 5   startYear       381419 non-null  float64
 6   endYear         0 non-null       object 
 7   runtimeMinutes  387995 non-null  object 
 8   genres          387995 non-null  object 
dtypes: float64(1), object(8)
memory usage: 29.6+ MB


In [21]:
# filtering for the years 2000-2021 only
basics_df = basics_df[(basics_df['startYear']>=2000)&(basics_df['startYear']<2022)]
# verifying that titleTypes are 'movie' and that the years were filtered correctly
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
13081,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021.0,,94,Documentary
34800,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61112,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67486,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
67664,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
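
As a small illustration of the year filter, the sketch below uses a hypothetical toy frame to show that the two-sided comparison selects the same rows as `Series.between(2000, 2021)`, which is inclusive on both ends by default. Rows with a missing `startYear` evaluate to `False` in both forms, so they are dropped by the filter.

```python
import pandas as pd

# Hypothetical toy data with years on both sides of the boundaries
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": [1999.0, 2000.0, 2021.0, 2022.0, float("nan")],
})

# Same form as the filter used in the notebook
mask = (toy["startYear"] >= 2000) & (toy["startYear"] < 2022)

# Equivalent form using between (both bounds inclusive by default)
alt_mask = toy["startYear"].between(2000, 2021)

in_range = toy[mask]
print(in_range)
print("masks agree:", mask.equals(alt_mask))
```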


In [22]:
# Excluding movies whose genres include 'Documentary'
# finding all documentaries
is_documentary = basics_df['genres'].str.contains('documentary',case=False)
# removing all the documentary using its inverse
basics_df = basics_df[~is_documentary]
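
A minimal sketch of the same mask-and-invert pattern on hypothetical toy data. Because `genres` is a comma-separated string, a substring match also catches titles where Documentary is only one of several genres. The `regex=False` argument is an addition here (the search term is a plain word, so no pattern matching is needed).

```python
import pandas as pd

# Hypothetical toy data: genres is a comma-separated string, as in title.basics
toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "genres": ["Documentary", "Comedy,Drama", "Documentary,Music", "Drama"],
})

# True wherever the genres string contains the word, ignoring case
is_documentary = toy["genres"].str.contains("documentary", case=False, regex=False)

# ~ inverts the boolean mask, keeping only non-documentaries
non_documentary = toy[~is_documentary]

print(non_documentary)
```

Note that inverting the mask this way assumes `genres` has no nulls, which holds in the notebook because rows with null genres were dropped earlier.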

In [23]:
# basics_df has no region column, so to keep US titles only we use the already filtered akas dataframe
# flagging every movie in basics_df whose tconst appears in akas_df
keepers = basics_df['tconst'].isin(akas_df['titleId'])
keepers

34800        True
61112        True
67486        True
67664        True
80549       False
            ...  
10115498     True
10115537    False
10115582     True
10115666    False
10115756    False
Name: tconst, Length: 139028, dtype: bool

In [24]:
# keeping only the rows flagged True
basics_df = basics_df[keepers]
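
The `isin` filter above can be sketched on two hypothetical toy frames. One property worth noting: a title that appears several times in the akas data (one row per alternative title) is still kept only once, whereas an inner merge on the same keys would repeat it. The merge is shown purely for comparison and is not used in the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for basics_df and the US-only akas_df
basics_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["First", "Second", "Third"],
})
akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt3"],  # tt1 has two US titles, tt2 has none
    "region": ["US", "US", "US"],
})

# Membership test: one boolean per basics row, no row duplication
keepers = basics_toy["tconst"].isin(akas_toy["titleId"])
filtered = basics_toy[keepers]

# For comparison only: an inner merge repeats tt1 once per matching akas row
merged = basics_toy.merge(akas_toy, left_on="tconst", right_on="titleId")

print(len(filtered), len(merged))
```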

In [25]:
# displaying the first five rows
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34800,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61112,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67486,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
67664,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
86791,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [26]:
#displaying final info
basics_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82027 entries, 34800 to 10115582
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          82027 non-null  object 
 1   titleType       82027 non-null  object 
 2   primaryTitle    82027 non-null  object 
 3   originalTitle   82027 non-null  object 
 4   isAdult         82027 non-null  object 
 5   startYear       82027 non-null  float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  82027 non-null  object 
 8   genres          82027 non-null  object 
dtypes: float64(1), object(8)
memory usage: 6.3+ MB


- Title Ratings Dataset

In [27]:
# verifying how many rows/columns, datatypes, missing items and duplicate rows
print(ratings_df.info())
print(('-'*30))
print(f'There are {ratings_df.duplicated().sum()} duplicate rows.')
print(('-'*30))
print(f'There are {ratings_df.isna().sum().sum()} missing values.')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1342918 entries, 0 to 1342917
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1342918 non-null  object 
 1   averageRating  1342918 non-null  float64
 2   numVotes       1342918 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 30.7+ MB
None
------------------------------
There are 0 duplicate rows.
------------------------------
There are 0 missing values.


In [28]:
#replacing all \N values with np.nan
ratings_df.replace({'\\N':np.nan}, inplace=True)

In [29]:
# Filtering the ratings table down to US titles only, using the filtered akas dataframe as we did for basics_df
keepers2 = ratings_df['tconst'].isin(akas_df['titleId'])
keepers2

0           True
1           True
2          False
3          False
4           True
           ...  
1342913    False
1342914    False
1342915    False
1342916    False
1342917    False
Name: tconst, Length: 1342918, dtype: bool

In [30]:
# keeping only the rows flagged True
ratings_df = ratings_df[keepers2]

In [31]:
# displaying the first five rows
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,264
4,tt0000005,6.2,2651
5,tt0000006,5.0,182
6,tt0000007,5.4,829


In [32]:
#displaying final info
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506911 entries, 0 to 1342894
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         506911 non-null  object 
 1   averageRating  506911 non-null  float64
 2   numVotes       506911 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.5+ MB


### Saving Datasets as Compressed .csv.gz Files

In [34]:
# Using os to create a folder called 'Data'; exist_ok=True avoids an error if it already exists
import os
os.makedirs('Data/', exist_ok=True)
# Confirm the folder exists by listing its contents (an empty list [] means it was just created)
os.listdir("Data/")

['.ipynb_checkpoints',
 'FINAL_COMBINED_MOVIES.csv.gz',
 'final_movies_2000.csv.gz',
 'final_movies_2001.csv.gz',
 'final_movies_2002.csv.gz',
 'final_movies_2003.csv.gz',
 'final_movies_2004.csv.gz',
 'final_movies_2005.csv.gz',
 'final_movies_2006.csv.gz',
 'final_movies_2007.csv.gz',
 'final_movies_2008.csv.gz',
 'final_movies_2009.csv.gz',
 'final_movies_2010.csv.gz',
 'final_movies_2011.csv.gz',
 'final_movies_2012.csv.gz',
 'final_movies_2013.csv.gz',
 'final_movies_2014.csv.gz',
 'final_movies_2015.csv.gz',
 'final_movies_2016.csv.gz',
 'final_movies_2017.csv.gz',
 'final_movies_2018.csv.gz',
 'final_movies_2019.csv.gz',
 'final_movies_2020.csv.gz',
 'final_movies_2021.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_api_results_2002.json',
 'tmdb_api_results_2003.json',
 'tmdb_api_results_2004.json',
 'tmdb_api_results_2005.json',
 'tmdb_api_results_2006.json',
 'tmdb_api_results_2007.json',
 'tmdb_api_results_2008.json',
 'tmdb_api_results_2009.jso

In [35]:
# Save current dataframes to Data folder and providing name
# adding index=False to avoid creating an "Unnamed: 0" column
basics_df.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

In [36]:
# overwriting the original dataframe variable with the newly saved file to confirm that it was written correctly
basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
basics_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [37]:
#repeating above same steps for the akas dataset
akas_df.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)
akas_df = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
akas_df.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0
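
The `0.0` values in `isOriginalTitle` above are a side effect of the round trip: CSV stores no dtype information, so `read_csv` re-infers each column's type on load, and a column of `0`/`1` strings plus missing values comes back as float. The sketch below reproduces this on a hypothetical toy frame, writing to a temporary directory (an assumption made so that the example does not touch the `Data/` folder).

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy data: isOriginalTitle holds strings and one missing value
toy = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000002", "tt0000005"],
    "isOriginalTitle": ["0", "0", np.nan],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_akas.csv.gz")
    # index=False keeps the row index out of the file
    toy.to_csv(path, compression="gzip", index=False)
    # compression is inferred from the .gz extension on read
    reloaded = pd.read_csv(path)

print(toy.dtypes)
print(reloaded.dtypes)
print(reloaded)
```

If the original types matter downstream, one option is to pass an explicit `dtype` mapping to `read_csv` when reloading.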


In [39]:
#repeating above same steps for the ratings dataset
ratings_df.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)
ratings_df = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
ratings_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1989
1,tt0000002,5.8,264
2,tt0000005,6.2,2651
3,tt0000006,5.0,182
4,tt0000007,5.4,829
