# Movies
- Andrea Cohen
- 02.21.23

## Business Problem:
- to produce a MySQL database on Movies from a subset of IMDB's publicly available dataset
- to use this database to analyze what makes a movie successful
- to provide recommendations to the stakeholder on how to make a successful movie

## Tasks:
1.  Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.
2.  Use an API to extract box office revenue and profit data to add to the IMDB data and perform exploratory data analysis.
3.  Construct and export a MySQL database using the data.
4.  Apply hypothesis testing to explore what makes a movie successful.
5.  Produce a Linear Regression model to predict movie performance.

## Data:

Data Location - The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

Data Source - TMDB

![png](TMDB1024_1.png)

IMDb Dataset Details -

- title.akas.tsv.gz -  
Contains the following information for titles:

 - titleId (string) - a tconst, an alphanumeric unique identifier of the title
 - ordering (integer) – a number to uniquely identify rows for a given titleId
 - title (string) – the localized title
 - region (string) - the region for this version of the title
 - language (string) - the language of the title
 - types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
 - attributes (array) - Additional terms to describe this alternative title, not enumerated
 - isOriginalTitle (boolean) – 0: not original title; 1: original title  
 
 
- title.basics.tsv.gz -   
Contains the following information for titles:
 - tconst (string) - alphanumeric unique identifier of the title
 - titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
 - primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
 - originalTitle (string) - original title, in the original language
 - isAdult (boolean) - 0: non-adult title; 1: adult title
 - startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
 - endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
 - runtimeMinutes – primary runtime of the title, in minutes
 - genres (string array) – includes up to three genres associated with the title  
 
- title.ratings.tsv.gz –   
Contains the IMDb rating and votes information for titles
 - tconst (string) - alphanumeric unique identifier of the title
 - averageRating – weighted average of all the individual user ratings
 - numVotes - number of votes the title has received
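
The three files share one key: `tconst` in title.basics and title.ratings, and `titleId` in title.akas. As a minimal sketch of how that key ties the files together, the toy frames below (`basics_demo` and `ratings_demo` are hypothetical names, and the missing rating for the third title is invented purely for illustration) are joined on `tconst`:

```python
import pandas as pd

# Toy stand-ins for title.basics and title.ratings (values taken from the previews,
# except that tt0000009 is left without a rating here purely for illustration).
basics_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000009"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Miss Jerry"],
})
ratings_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [5.7, 5.8],
    "numVotes": [1953, 263],
})

# A left join keeps every title from basics; titles without a rating get NaN.
combined = basics_demo.merge(ratings_demo, on="tconst", how="left")
print(combined)
```

A left join is used here so that unrated titles are not silently dropped; an inner join would be the choice if only rated titles were wanted.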

## Preliminary Steps:

### Import libraries

In [1]:
# imports
import pandas as pd
import numpy as np

### Load the data

In [2]:
basics = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", sep='\t', low_memory=False)
akas = pd.read_csv("https://datasets.imdbws.com/title.akas.tsv.gz", sep='\t', low_memory=False)
ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", sep='\t', low_memory=False)
display(basics.head())
display(akas.head())
display(ratings.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000003,6.5,1787
3,tt0000004,5.6,179
4,tt0000005,6.2,2589
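
The title.akas file is large (the index in the summary further down runs past 35 million rows), so loading it whole may strain memory. One possible alternative, sketched below under the assumption that only US rows are needed, is to read the file in chunks and filter each chunk as it arrives. The helper `read_us_rows` and the in-memory sample are hypothetical; for the real data the URL would be passed instead of the `StringIO` object.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for title.akas.tsv.gz (hypothetical sample).
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt0000001\t2\tCarmencita\tDE\n"
    "tt0000001\t6\tCarmencita\tUS\n"
    "tt0000002\t7\tThe Clown and His Dogs\tUS\n"
    "tt0000003\t1\tPauvre Pierrot\tFR\n"
)

def read_us_rows(source, chunksize=2):
    """Read a TSV in chunks and keep only rows whose region is 'US'."""
    kept = []
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        kept.append(chunk[chunk["region"] == "US"])
    return pd.concat(kept, ignore_index=True)

akas_us = read_us_rows(io.StringIO(sample_tsv))
print(akas_us)
```

With the real file a much larger `chunksize` (for example several hundred thousand rows) would be more practical; the value of 2 is only there to force more than one chunk in the sample.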


### Preprocessing

#### Filtering and cleaning title AKAs

In [3]:
# Include only movies that were released in the United States
# (.copy() avoids a SettingWithCopyWarning when akas is modified in place later)
usfilter = akas['region']=='US'
akas = akas[usfilter].copy()

In [4]:
display(akas.head())

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,\N,imdbDisplay,\N,0
14,tt0000002,7,The Clown and His Dogs,US,\N,\N,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,\N,imdbDisplay,\N,0
36,tt0000005,1,Blacksmithing Scene,US,\N,alternative,\N,0
41,tt0000005,6,Blacksmith Scene #1,US,\N,alternative,\N,0


In [5]:
akas['region'].value_counts()

US    1416568
Name: region, dtype: int64

    - Only rows for the US region remain. At this point the dataframe still covers every title type, not only movies.
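
Because title.akas stores one row per alternative title, the same titleId can appear several times even after the US filter (tt0000005 appears three times in the preview above). A minimal sketch of comparing the row count with the number of distinct titles, using a hand-built frame (`akas_us` is a hypothetical name) copied from that preview:

```python
import pandas as pd

# Rows copied from the US-only preview above.
akas_us = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000002", "tt0000005", "tt0000005", "tt0000005"],
    "title": [
        "Carmencita",
        "The Clown and His Dogs",
        "Blacksmith Scene",
        "Blacksmithing Scene",
        "Blacksmith Scene #1",
    ],
    "region": ["US"] * 5,
})

n_rows = len(akas_us)                      # number of US title entries
n_titles = akas_us["titleId"].nunique()    # number of distinct titles
print(n_rows, n_titles)
```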

In [6]:
# Replace "\N" with np.nan
akas.replace({'\\N':np.nan}, inplace=True)

In [7]:
display(akas.head())

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
5,tt0000001,6,Carmencita,US,,imdbDisplay,,0
14,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0
33,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0
36,tt0000005,1,Blacksmithing Scene,US,,alternative,,0
41,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0
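
As an alternative to replacing `\N` after loading, the marker can be declared as missing at read time. The sketch below is one way to do that, not the approach used in this notebook. It also sets `keep_default_na=False`, because pandas by default treats the string "NA" as missing, which would otherwise blank out a region code of NA. The sample text is hypothetical.

```python
import io
import pandas as pd

# Two hypothetical rows: one with the \N marker, one with the region code "NA".
sample_tsv = (
    "titleId\tregion\tlanguage\n"
    "tt0000001\tUS\t\\N\n"
    "tt0000001\tNA\ten\n"
)

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values=["\\N"],       # treat only the \N marker as missing
    keep_default_na=False,   # keep literal strings such as "NA" as data
)
print(df)
```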


#### Filtering and cleaning title basics

In [8]:
# Replace "\N" with np.nan
basics.replace({'\\N':np.nan}, inplace=True)

In [9]:
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [10]:
# Exclude any movie with missing values for genre or runtime
basics = basics.dropna(subset=['genres', 'runtimeMinutes'])
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


In [11]:
basics['genres'].isna().sum()

0

    - There are 0 missing values in genre.

In [12]:
basics['runtimeMinutes'].isna().sum()

0

    - There are 0 missing values in runtime.

In [13]:
# Include only full-length movies (titleType = "movie")
typefilter = basics['titleType']=='movie'
basics = basics[typefilter]
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,,100,"Documentary,News,Sport"
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama
672,tt0000679,movie,The Fairylogue and Radio-Plays,The Fairylogue and Radio-Plays,0,1908,,120,"Adventure,Fantasy"


In [14]:
basics['titleType'].value_counts()

movie    377695
Name: titleType, dtype: int64

    - Only full-length movies are included in titleType.

In [15]:
# Include only movies that were released 2000 - 2022 (include 2000 and 2022)
basics['startYear'] = basics['startYear'].astype(float)
startyearfilter1 = basics['startYear']>=2000
startyearfilter2 = basics['startYear']<=2022
basics = basics[startyearfilter1 & startyearfilter2]
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
13082,tt0013274,movie,Istoriya grazhdanskoy voyny,Istoriya grazhdanskoy voyny,0,2021.0,,133,Documentary
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
76059,tt0077684,movie,Histórias de Combóios em Portugal,Histórias de Combóios em Portugal,0,2022.0,,46,Documentary


In [16]:
basics['startYear'].value_counts()

2017.0    14313
2018.0    14266
2019.0    13983
2016.0    13913
2015.0    13429
2014.0    13051
2013.0    12350
2022.0    12269
2021.0    12167
2012.0    11605
2020.0    11454
2011.0    10747
2010.0    10181
2009.0     9325
2008.0     8128
2007.0     6940
2006.0     6486
2005.0     5801
2004.0     5181
2003.0     4567
2002.0     4116
2001.0     3846
2000.0     3630
Name: startYear, dtype: int64

    - Only movies released between 2000 and 2022 are included.
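
Casting startYear to float is what produces values such as 2021.0 in the output above. A possible variation, sketched here on a toy frame, uses the nullable `Int64` dtype so the years stay whole numbers, and `Series.between` (inclusive on both ends by default) for the range check. The frame `basics_demo` and the id tt9999999 with a missing year are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; tt9999999 is a made-up id with a missing year.
basics_demo = pd.DataFrame({
    "tconst": ["tt0000009", "tt0035423", "tt0069049", "tt9999999"],
    "startYear": ["1894", "2001", "2018", np.nan],
})

# Convert to nullable integers so missing years do not force a float column.
years = pd.to_numeric(basics_demo["startYear"], errors="coerce").astype("Int64")

# between() includes both endpoints; missing years are treated as out of range.
in_range = years.between(2000, 2022).fillna(False).astype(bool)

filtered = basics_demo.assign(startYear=years)[in_range]
print(filtered)
```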

In [17]:
# Include only fictional movies (not from documentary genre)
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
77964,tt0079644,movie,November 1828,November 1828,0,2001.0,,140,"Drama,War"
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


In [18]:
basics['genres'].value_counts()

Drama                         35838
Comedy                        13414
Comedy,Drama                   6440
Horror                         5762
Drama,Romance                  4293
                              ...  
Action,Animation,Game-Show        1
Adult,Crime,Mystery               1
Family,Musical,Sport              1
Horror,Music,Mystery              1
Crime,Fantasy,Sci-Fi              1
Name: genres, Length: 968, dtype: int64

    - Only fictional movies are included.

In [19]:
# Include only movies that were released in the United States
keepers = basics['tconst'].isin(akas['titleId'])
basics = basics[keepers]
display(basics.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
34803,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
61116,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
67669,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
86801,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"
93938,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,,126,Drama


    - Only movies that were released in the US are included.
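
The `isin` filter acts as a semi-join: it keeps or drops rows of basics without adding columns or duplicating rows, even though a titleId can occur several times in akas. The sketch below contrasts it with a plain merge on toy frames. `basics_demo` and `akas_demo` are hypothetical names, and the duplicated akas row is an assumption made for illustration.

```python
import pandas as pd

basics_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0079644", "tt0088751"],
    "primaryTitle": ["Kate & Leopold", "November 1828", "The Naked Monster"],
})
# tt0035423 is listed twice to mimic several US entries for one title (assumed).
akas_demo = pd.DataFrame({
    "titleId": ["tt0035423", "tt0035423", "tt0088751"],
    "region": ["US", "US", "US"],
})

# Semi-join: one row per matching title, no extra columns.
kept = basics_demo[basics_demo["tconst"].isin(akas_demo["titleId"])]

# A merge would instead repeat the title once per matching akas row.
merged = basics_demo.merge(akas_demo, left_on="tconst", right_on="titleId")

print(len(kept), len(merged))
```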

#### Filtering and cleaning title ratings

In [20]:
# Replace "\N" with np.nan
ratings.replace({'\\N':np.nan}, inplace=True)

In [21]:
display(ratings.head())

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000003,6.5,1787
3,tt0000004,5.6,179
4,tt0000005,6.2,2589


In [22]:
# Include only titles that were released in the United States
keepers2 = ratings['tconst'].isin(akas['titleId'])
ratings = ratings[keepers2]
display(ratings.head())

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
4,tt0000005,6.2,2589
5,tt0000006,5.1,177
6,tt0000007,5.4,812


    - Only titles that were released in the US are included. The ratings are not yet restricted to the movies kept in title basics.

### Summary of how many movies remain and the datatypes of each feature

In [23]:
print('Summary for title basics:')
# info() prints its report directly and returns None, so it is not wrapped in display()
basics.info()

Summary for title basics:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85588 entries, 34803 to 9637986
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          85588 non-null  object 
 1   titleType       85588 non-null  object 
 2   primaryTitle    85588 non-null  object 
 3   originalTitle   85588 non-null  object 
 4   isAdult         85588 non-null  object 
 5   startYear       85588 non-null  float64
 6   endYear         0 non-null      object 
 7   runtimeMinutes  85588 non-null  object 
 8   genres          85588 non-null  object 
dtypes: float64(1), object(8)
memory usage: 6.5+ MB

    - There are 85588 movies remaining in the title basics dataframe.
    - All of the features are datatype object, except startYear, which is datatype float64.
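
Columns such as runtimeMinutes and isAdult hold numbers but are stored as object (strings), which would get in the way of numeric analysis later. A minimal sketch of one way to convert them, assuming any non-numeric value should become missing; `basics_demo` is a hypothetical toy frame built from the preview rows:

```python
import pandas as pd

# Toy frame with numbers stored as strings, as in the summary above.
basics_demo = pd.DataFrame({
    "tconst": ["tt0035423", "tt0062336"],
    "isAdult": ["0", "0"],
    "runtimeMinutes": ["118", "70"],
})

# errors="coerce" turns anything that is not a number into NaN instead of raising.
basics_demo["runtimeMinutes"] = pd.to_numeric(basics_demo["runtimeMinutes"], errors="coerce")
basics_demo["isAdult"] = pd.to_numeric(basics_demo["isAdult"], errors="coerce").astype("Int64")

print(basics_demo.dtypes)
```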

In [24]:
print('Summary for title AKAs:')
# info() prints its report directly and returns None, so it is not wrapped in display()
akas.info()

Summary for title AKAs:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1416568 entries, 5 to 35026261
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   titleId          1416568 non-null  object
 1   ordering         1416568 non-null  int64 
 2   title            1416568 non-null  object
 3   region           1416568 non-null  object
 4   language         3833 non-null     object
 5   types            974118 non-null   object
 6   attributes       46043 non-null    object
 7   isOriginalTitle  1415223 non-null  object
dtypes: int64(1), object(7)
memory usage: 97.3+ MB

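    - There are 1416568 rows remaining in the title AKAs dataframe. Each row is one US title entry, so a single title can account for several rows (tt0000005 appears three times in the preview above).
    - ordering is datatype int64.  All of the other features are datatype object.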

In [25]:
print('Summary for title ratings:')
# info() prints its report directly and returns None, so it is not wrapped in display()
ratings.info()

Summary for title ratings:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 489933 entries, 0 to 1282602
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   tconst         489933 non-null  object 
 1   averageRating  489933 non-null  float64
 2   numVotes       489933 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 15.0+ MB

    - There are 489933 titles remaining in the title ratings dataframe. The ratings were filtered by US release only, so titles of every type are still included.
    - tconst is datatype object, averageRating is datatype float64, and numVotes is datatype int64.

### Save each dataframe to the data folder

In [26]:
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)
akas.to_csv("Data/title_akas.csv.gz",compression='gzip',index=False)
ratings.to_csv("Data/title_ratings.csv.gz",compression='gzip',index=False)

In [27]:
# Reload the saved files and preview them again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory = False)
akas = pd.read_csv("Data/title_akas.csv.gz", low_memory = False)
ratings = pd.read_csv("Data/title_ratings.csv.gz", low_memory = False)
display(basics.head())
display(akas.head())
display(ratings.head())

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,,126,Drama


Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,6,Carmencita,US,,imdbDisplay,,0.0
1,tt0000002,7,The Clown and His Dogs,US,,,literal English title,0.0
2,tt0000005,10,Blacksmith Scene,US,,imdbDisplay,,0.0
3,tt0000005,1,Blacksmithing Scene,US,,alternative,,0.0
4,tt0000005,6,Blacksmith Scene #1,US,,alternative,,0.0


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000005,6.2,2589
3,tt0000006,5.1,177
4,tt0000007,5.4,812


    - The 3 dataframes are saved as gzip-compressed CSV files in the local Data folder.
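
Two details of the save and reload step may be worth guarding against. First, `to_csv` does not create the target folder, so `Data/` must exist beforehand. Second, CSV does not store dtypes: isOriginalTitle comes back as 0.0 in the preview above because the column contains missing values and is therefore read as float. The sketch below shows one way to handle both, using a temporary directory and a toy frame (`akas_demo` is a hypothetical name) so that it does not touch the real files.

```python
import os
import tempfile
import pandas as pd

# Toy frame with one missing value, mirroring isOriginalTitle in title AKAs.
akas_demo = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000002"],
    "isOriginalTitle": pd.array([0, None], dtype="Int64"),
})

with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "Data")
    os.makedirs(data_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(data_dir, "title_akas.csv.gz")

    akas_demo.to_csv(path, compression="gzip", index=False)

    # Default reload: the missing value forces the column to float64.
    default_load = pd.read_csv(path)
    # Passing dtype keeps whole numbers by using the nullable Int64 type.
    typed_load = pd.read_csv(path, dtype={"isOriginalTitle": "Int64"})

print(default_load["isOriginalTitle"].dtype, typed_load["isOriginalTitle"].dtype)
```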