# Task

## Business Problem

For this project, you have been hired to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

* Part 1: Create your project repository, download IMDB’s movie data, and filter out the subset of movies requested by the stakeholder.
* Part 3: Design a MySQL database for your data and insert the data.
* Part 3: Use an API to extract box office financial data and transform and load it into your database.
* Part 4: Apply hypothesis testing to explore what makes a movie "successful."

**Part 1**

For Part 1 of the project, you will be creating your project repository, downloading the IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as csv files (".csv.gz") in your repository.

Instructions + Getting Started:

* Please read through all of the instructions before you start working on your project.

The Data

IMDB Provides a large dataset with varied information for Movies, TV Shows, Made for TV Movies, etc. for free (for Non-commercial use). The Data Dictionary is located here: [https://www.imdb.com/interfaces/](https://www.imdb.com/interfaces/).

* We have provided partially-processed files for you in this [Google Drive folder](https://drive.google.com/drive/folders/1I8FKN3S9acXMNzyXq3lo8n9PjplSPB97?usp=drive_link).

Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

* Include only movies that were released in the United States.
* Include only movies that were released 2000 - 2022 (startYear >=2000 and startYear<=2022)
* Include only full-length movies (titleType = "movie").
* Exclude movies that are missing genre or runtime.
* Include only fictional genres (where Genres does not include "Documentary".)

Provided Files

* From their previous research, they realized the data they want is in the following files:
    * title.basics.tsv.gz
    * title.ratings.tsv.gz
* However, to filter for movies released in the United States, you will also need "title-akas-us-only.csv"

Note: this is a pre-filtered version of the title.akas.tsv.gz file. The full file is large and can cause problems for computers with less RAM/memory. We have included information on the preprocessing steps that have already been performed in the included Google Doc "IMDB Movie Dataset Info" in the folder linked above.

Deliverables

After filtering out movies that do not meet the stakeholder's specifications:

* Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
* Save each file to a csv file in the "Data/" folder in your repo.
* Commit your changes to your repository in GitHub desktop and Publish repository / Push Changes.
* Submit the link to your repository.

# Solution

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Load TSVs with Pandas
akas = pd.read_csv("Data/title-akas-us-only.csv", low_memory=False)

In [3]:
# Load TSVs with Pandas
basics = pd.read_csv("Data/title.basics.tsv.gz", sep='\t', low_memory=False)

In [4]:
# Filter the basics table down to only include the US by using the filter akas dataframe
filter_us_titles = basics['tconst'].isin(akas['titleId'])
basics = basics[filter_us_titles]
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
...,...,...,...,...,...,...,...,...,...
10016872,tt9916560,tvMovie,March of Dimes Presents: Once Upon a Dime,March of Dimes Presents: Once Upon a Dime,0,1963,\N,58,Family
10016901,tt9916620,movie,The Copeland Case,The Copeland Case,0,\N,\N,\N,Drama
10016939,tt9916702,short,Loving London: The Playground,Loving London: The Playground,0,\N,\N,\N,"Drama,Short"
10016962,tt9916756,short,Pretty Pretty Black Girl,Pretty Pretty Black Girl,0,2019,\N,\N,Short


In [5]:
# Convert to dataframes
df_basics = pd.DataFrame(basics)
df_akas = pd.DataFrame(akas)

**Let's start with basics table**

In [6]:
# Make basics a dataframe
df_basics = pd.DataFrame(basics)
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1365643 entries, 0 to 10016966
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1365643 non-null  object
 1   titleType       1365643 non-null  object
 2   primaryTitle    1365643 non-null  object
 3   originalTitle   1365643 non-null  object
 4   isAdult         1365643 non-null  object
 5   startYear       1365643 non-null  object
 6   endYear         1365643 non-null  object
 7   runtimeMinutes  1365643 non-null  object
 8   genres          1365643 non-null  object
dtypes: object(9)
memory usage: 104.2+ MB


In [7]:
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"


In [8]:
# Replace placeholder "\N" values with null values (np.nan)
df_basics = df_basics.replace({'\\N':np.nan})

In [9]:
# Drop rows with null values in the runtimeMinutes or genres columns ONLY. Do not drop null values from other columns. 
# Hint: dropna has a subset argument that accepts a list of columns to check.
df_basics = df_basics.dropna(axis=0, subset=['runtimeMinutes','genres'])

In [10]:
# Filter to keep only full-length movies (titleType==Movie)
df_basics = df_basics.loc[df_basics['titleType']=='movie']

In [11]:
df_basics['titleType'].value_counts()

movie    203476
Name: titleType, dtype: int64

In [12]:
df_basics.describe

<bound method NDFrame.describe of              tconst titleType                    primaryTitle  \
8         tt0000009     movie                      Miss Jerry   
144       tt0000147     movie   The Corbett-Fitzsimmons Fight   
570       tt0000574     movie     The Story of the Kelly Gang   
587       tt0000591     movie                The Prodigal Son   
672       tt0000679     movie  The Fairylogue and Radio-Plays   
...             ...       ...                             ...   
10016366  tt9915436     movie               Vida em Movimento   
10016544  tt9915872     movie            The Last White Witch   
10016684  tt9916170     movie                   The Rehearsal   
10016693  tt9916190     movie                       Safeguard   
10016777  tt9916362     movie                           Coven   

                           originalTitle isAdult startYear endYear  \
8                             Miss Jerry       0      1894     NaN   
144        The Corbett-Fitzsimmons Fight     

In [13]:
# Convert startYear to a float dtype
df_basics['startYear'] = df_basics['startYear'].astype('float')

In [14]:
# Filter to keep movies with startYears that are >=2000 and <=2022
df_basics = df_basics.loc[(df_basics['startYear']>=2000)&(df_basics['startYear']<=2022)]

In [15]:
df_basics['startYear'].value_counts()

2019.0    8102
2018.0    7866
2017.0    7816
2016.0    7415
2015.0    7228
2014.0    7170
2020.0    7031
2013.0    6945
2021.0    6929
2022.0    6684
2012.0    6597
2011.0    6124
2010.0    5580
2009.0    5087
2008.0    4232
2007.0    3610
2006.0    3342
2005.0    2923
2004.0    2535
2003.0    2188
2002.0    2003
2001.0    1931
2000.0    1789
Name: startYear, dtype: int64

In [16]:
df_basics['genres'].value_counts()

Documentary                      21375
Drama                            17085
Comedy                            7148
Horror                            4071
Comedy,Drama                      4000
                                 ...  
Adult,Crime,Mystery                  1
Comedy,Documentary,Reality-TV        1
Biography,Music,Mystery              1
Comedy,Reality-TV,Romance            1
Biography,Fantasy,Musical            1
Name: genres, Length: 1054, dtype: int64

In [17]:
# Eliminate movies that include "Documentary" in genre:
df_basics = df_basics.loc[df_basics['genres']!='Documentary']

In [18]:
df_basics['genres'].value_counts()

Drama                            17085
Comedy                            7148
Horror                            4071
Comedy,Drama                      4000
Drama,Romance                     2623
                                 ...  
Adult,Crime,Mystery                  1
Comedy,Documentary,Reality-TV        1
Biography,Music,Mystery              1
Comedy,Reality-TV,Romance            1
Biography,Fantasy,Musical            1
Name: genres, Length: 1053, dtype: int64

In [19]:
# Eliminate movies that include "Documentary" in genre:
df_basics = df_basics[df_basics['genres'].str.contains('Documentary')==False]

In [20]:
# Display a final preview of your filtered title basics and save to a csv
# Tip: You should have ~80,000 rows left. If you have significantly more or less, double-check your filtering steps.
df_basics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 86979 entries, 34802 to 10016777
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          86979 non-null  object 
 1   titleType       86979 non-null  object 
 2   primaryTitle    86979 non-null  object 
 3   originalTitle   86979 non-null  object 
 4   isAdult         86979 non-null  object 
 5   startYear       86979 non-null  float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  86979 non-null  object 
 8   genres          86979 non-null  object 
dtypes: float64(1), object(8)
memory usage: 6.6+ MB


In [21]:
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34802,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61114,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67666,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
86793,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"
93930,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,,126,Drama


In [23]:
# Save the data to a CSV file in your Data folder.
df_basics.to_csv("Data/title-basics-us-only.csv")

In [24]:
#  Load and filter the title ratings file
ratings = pd.read_csv("Data/title.ratings.tsv.gz", sep='\t', low_memory=False)

In [28]:
# Keep only movies that are included in your final title basics dataframe.
# Hint: Filter ratings using title basics similarly to how you filtered basics with AKAs:
filter_basics = ratings['tconst'].isin(basics['tconst'])
ratings = ratings[filter_basics]
ratings

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1988
1,tt0000002,5.8,265
4,tt0000005,6.2,2632
5,tt0000006,5.1,182
6,tt0000007,5.4,825
...,...,...,...
1331453,tt9916200,8.1,231
1331454,tt9916204,8.2,264
1331461,tt9916348,8.3,18
1331462,tt9916362,6.4,5422


In [29]:
# Load the title ratings data into a dataframe.
df_ratings = pd.DataFrame(ratings)

In [30]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 504000 entries, 0 to 1331467
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         504000 non-null  object 
 1   averageRating  504000 non-null  float64
 2   numVotes       504000 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.4+ MB


In [31]:
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1988
1,tt0000002,5.8,265
4,tt0000005,6.2,2632
5,tt0000006,5.1,182
6,tt0000007,5.4,825


In [32]:
# Replace placeholder "\N" values with null values (np.nan)
df_ratings = df_ratings.replace({'\\N':np.nan})

In [33]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 504000 entries, 0 to 1331467
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         504000 non-null  object 
 1   averageRating  504000 non-null  float64
 2   numVotes       504000 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.4+ MB


In [34]:
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1988
1,tt0000002,5.8,265
4,tt0000005,6.2,2632
5,tt0000006,5.1,182
6,tt0000007,5.4,825


In [35]:
df_ratings.to_csv("Data/title-ratings-us-only.csv")