# Movies - Exploratory Data Analysis
- Andrea Cohen
- 03.02.23

## Business Problem:
- to build a MySQL database of movies from a subset of IMDb's publicly available dataset
- to use this database to analyze what makes a movie successful
- to provide recommendations to the stakeholder on how to make a successful movie

## Tasks:
- Download several files from IMDb's movie dataset and filter them down to the subset of movies requested by the stakeholder.
- Use an API to extract box office revenue and profit data, add it to the IMDb data, and perform exploratory data analysis.
- Construct and export a MySQL database using the data.
- Apply hypothesis testing to explore what makes a movie successful.
- Produce a Linear Regression model to predict movie performance.

## Data:

Data Location - The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.

Data Source - IMDb (title files) and TMDB (financial data)

![png](TMDB1024_1.png)

IMDb Dataset Details -

- title.akas.tsv.gz -  
Contains the following information for titles:

 - titleId (string) - a tconst, an alphanumeric unique identifier of the title
 - ordering (integer) - a number to uniquely identify rows for a given titleId
 - title (string) - the localized title
 - region (string) - the region for this version of the title
 - language (string) - the language of the title
 - types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
 - attributes (array) - additional terms to describe this alternative title, not enumerated
 - isOriginalTitle (boolean) - 0: not original title; 1: original title  
 
 
- title.basics.tsv.gz -   
Contains the following information for titles:
 - tconst (string) - alphanumeric unique identifier of the title
 - titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
 - primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
 - originalTitle (string) - original title, in the original language
 - isAdult (boolean) - 0: non-adult title; 1: adult title
 - startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
 - endYear (YYYY) - TV Series end year. ‘\N’ for all other title types
 - runtimeMinutes - primary runtime of the title, in minutes
 - genres (string array) - includes up to three genres associated with the title   
 
 
- title.ratings.tsv.gz -   
Contains the IMDb rating and votes information for titles:
 - tconst (string) - alphanumeric unique identifier of the title
 - averageRating - weighted average of all the individual user ratings
 - numVotes - number of votes the title has received
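
The file layouts above can be read with pandas. The following is a minimal sketch, not necessarily how the files were loaded for this project: it assumes the files are tab-separated and that missing values are written as `\N`, as the `endYear` description indicates. The helper name `load_imdb_tsv` and the in-memory sample rows are hypothetical, used only so the example runs without downloading anything.

```python
import io

import pandas as pd


def load_imdb_tsv(source):
    """Read an IMDb TSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",          # the IMDb files are tab-separated
        na_values="\\N",   # treat the literal \N marker as missing
        low_memory=False,  # read in one pass to avoid mixed-dtype chunks
    )


# Hypothetical sample rows standing in for title.basics.tsv.gz
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample A\t2000\t\\N\t95\tDrama\n"
    "tt0000002\tshort\tExample B\t2001\t\\N\t\\N\tComedy,Romance\n"
)

basics = load_imdb_tsv(sample)

# Keep only full-length movies, one of the filters a stakeholder might request
movies_only = basics[basics["titleType"] == "movie"]
```

Passing a path that ends in `.gz` instead of the sample buffer should also work, since `pd.read_csv` infers compression from the file extension by default.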

## Preliminary steps

### Imports

In [None]:
import pandas as pd
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy.engine import create_engine
from urllib.parse import quote_plus

### Create connection string

In [None]:
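import os

username = "root"
# Read the password from an environment variable instead of hardcoding it
# (MYSQL_PASSWORD is an assumed name); quote_plus escapes special characters.
password = quote_plus(os.environ["MYSQL_PASSWORD"])
db_name = "books"
connection_str = f"mysql+pymysql://{username}:{password}@localhost/{db_name}"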
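
To illustrate why the password is passed through `quote_plus`, here is a small sketch using a hypothetical password, user, and database name. Characters such as `@` and `/` have special meaning inside a connection URL, so they need to be percent-encoded before being embedded in it.

```python
from urllib.parse import quote_plus

raw_password = "p@ss/word!"  # hypothetical password containing URL-special characters
encoded = quote_plus(raw_password)
print(encoded)  # p%40ss%2Fword%21

# Only the separator between credentials and host remains as a literal "@"
example_str = f"mysql+pymysql://user:{encoded}@localhost/example_db"
print(example_str)
```

Without the encoding, the `@` inside the password would be read as the start of the host name and the connection would likely fail.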

### Create the engine

In [None]:
engine = create_engine(connection_str)

### Load the data
Load the csv.gz result files for each extracted year, then concatenate them into a single DataFrame for the remainder of the analysis.

In [None]:
# load the csv.gz's of results for each year extracted
df_2000 = pd.read_csv('Data/final_tmdb_data_2000.csv.gz')
df_2001 = pd.read_csv('Data/final_tmdb_data_2001.csv.gz')

In [None]:
# concatenate the data into 1 dataframe
movies_eda = pd.concat([df_2000, df_2001], ignore_index=True)
display(movies_eda.head())
display(movies_eda.tail())
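
If more years are extracted later, the load-and-concatenate step could be generalized with a loop. This is a sketch only: the helper name `load_yearly_results` is hypothetical, and it assumes every file follows the `final_tmdb_data_<year>.csv.gz` naming pattern used above.

```python
from pathlib import Path

import pandas as pd


def load_yearly_results(data_dir, years):
    """Read one csv.gz per year and stack them into a single DataFrame."""
    frames = []
    for year in years:
        path = Path(data_dir) / f"final_tmdb_data_{year}.csv.gz"
        frames.append(pd.read_csv(path))  # compression is inferred from .gz
    # ignore_index=True rebuilds a continuous 0..n-1 index across all years
    return pd.concat(frames, ignore_index=True)


# Example call, assuming the files exist under Data/:
# movies_eda = load_yearly_results("Data", [2000, 2001])
```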

## How many movies had at least some valid financial information (values > 0 for budget OR revenue)?

In [None]:
budget_filter = movies_eda['budget'] > 0
revenue_filter = movies_eda['revenue'] > 0
movies_financial = movies_eda[budget_filter | revenue_filter]
movies_financial.info()
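
`info()` reports the row count among other details. To get the number directly, the boolean masks can be summed. The sketch below uses a small made-up DataFrame (the `imdb_id` values and amounts are hypothetical) rather than the real data, so its counts say nothing about the actual result.

```python
import pandas as pd

# Hypothetical toy data with the same budget/revenue columns
toy = pd.DataFrame(
    {
        "imdb_id": ["tt1", "tt2", "tt3", "tt4"],
        "budget": [0, 1_000_000, 0, 5_000_000],
        "revenue": [0, 0, 250_000, 9_000_000],
    }
)

has_budget = toy["budget"] > 0
has_revenue = toy["revenue"] > 0

# Summing a boolean Series counts its True values
n_either = int((has_budget | has_revenue).sum())
n_both = int((has_budget & has_revenue).sum())

print(f"{n_either} movies had a budget or revenue above 0 ({n_both} had both)")
```

Applied to `movies_eda`, the same pattern would give the count for the question in this section's heading.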

- movies had at