# Part 3 - Producing a MySQL Database

## Business Problem

Create a database to analyze what makes a movie successful, and provide recommendations to the stakeholder on how to make a successful movie.

### Specifications - Database

* Take data that has been cleaned and create a MySQL database. 

* Normalize the tables before adding them to the new database. 
    
    * All data from the TMDB API should be in 1 table together (even though it will not be perfectly normalized).
    
    * Keep only the imdb_id, revenue, budget, and certification columns.
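
Trimming the TMDB data down to the four required columns can be sketched as follows. The toy DataFrame and the variable names `tmdb` and `tmdb_data` are assumptions for illustration; the real data would come from the TMDB API results files.

```python
import pandas as pd

# Toy stand-in for the combined TMDB API results (values are made up)
tmdb = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "title": ["Example One", "Example Two"],
    "revenue": [1000000.0, 0.0],
    "budget": [500000.0, 250000.0],
    "certification": ["PG-13", None],
    "popularity": [1.5, 2.5],
})

# Keep only the columns named in the specification
keep_cols = ["imdb_id", "revenue", "budget", "certification"]
tmdb_data = tmdb[keep_cols].copy()

print(tmdb_data.columns.tolist())
```

Selecting with an explicit list (rather than dropping everything else) means any extra columns returned by the API are excluded automatically.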

## Transformation Steps:

* Normalize Genre:

    * Convert the single string of genres from title_basics into 2 new tables.         
        1. title_genres: with columns:
            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Discard unnecessary information:

    * For the title basics table, drop the following columns:
        * "original_title" (we will use the primary title column instead)
        * "isAdult" ("Adult" will show up in the genres so this is redundant information).
        * "titleType" (every row will be a movie).
        * "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
     
    * Do not include the title_akas table in your SQL database.
        * You have already used this table to filter for the desired movies, and the remaining data is mostly nulls and not of interest to the stakeholder.
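
Dropping the unneeded title basics columns might look like the sketch below. The toy DataFrame is made up, and the exact column spellings in your cleaned file are an assumption (for example `originalTitle` versus `original_title`), so `errors="ignore"` is used to skip any name that is not present.

```python
import pandas as pd

# Toy stand-in for the cleaned title basics data (values are made up)
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "titleType": ["movie", "movie"],
    "primaryTitle": ["Example One", "Example Two"],
    "originalTitle": ["Example One", "Exemple Deux"],
    "isAdult": [0, 0],
    "startYear": [2000, 2001],
    "genres": ["Comedy,Romance", "Drama"],
})

# Candidate names to drop; spellings that do not exist are ignored
cols_to_drop = [
    "originalTitle", "original_title",
    "isAdult",
    "titleType",
    "genres", "genres_split",
]
basics_trimmed = basics.drop(columns=cols_to_drop, errors="ignore")

print(basics_trimmed.columns.tolist())
```

Only drop the genre columns after the title_genres and genres tables have been built, since they are the source for both.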

## MySQL Database Requirements

* Use sqlalchemy with pandas to execute your SQL queries inside your notebook.


* Create a new database on your MySQL server and call it "movies".



* Make sure to have the following tables in your "movies" database:

    * title_basics
    * title_ratings
    * title_genres
    * genres
    * tmdb_data


* Make sure to set a Primary Key for each table that isn't a joiner table (e.g. title_genres is a joiner table).


* After creating each table, show the first 5 rows of that table using a SQL query.


* Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.
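
The requirements above could be tied together roughly as in the sketch below. Everything named here is an assumption rather than part of the assignment: the `pymysql` driver, the `sqlalchemy_utils` helpers, the helper function names, the choice of primary key column for each table, and the username and password placeholders. MySQL does not accept a plain TEXT column as a primary key, so the sketch stores a text key column as a VARCHAR sized to its longest value.

```python
from urllib.parse import quote_plus

import pandas as pd

# Assumed primary keys; title_genres is a joiner table and gets none
PRIMARY_KEYS = {
    "title_basics": "tconst",
    "title_ratings": "tconst",
    "genres": "genre_id",
    "tmdb_data": "imdb_id",
}


def build_connection_url(username, password, db_name="movies", host="localhost"):
    """Build a SQLAlchemy URL for MySQL (assumes the pymysql driver)."""
    return f"mysql+pymysql://{username}:{quote_plus(password)}@{host}/{db_name}"


def primary_key_sql(table):
    """SQL that makes the assumed key column of `table` its primary key."""
    return f"ALTER TABLE {table} ADD PRIMARY KEY (`{PRIMARY_KEYS[table]}`);"


def preview_sql(table, n=5):
    """SQL that returns the first n rows of `table`."""
    return f"SELECT * FROM {table} LIMIT {n};"


def get_engine(connection_url):
    """Create the database if it does not exist yet, then return an engine."""
    from sqlalchemy import create_engine
    from sqlalchemy_utils import create_database, database_exists

    if not database_exists(connection_url):
        create_database(connection_url)
    return create_engine(connection_url)


def load_table(df, table, engine):
    """Write df to MySQL, add a primary key if one is assumed, preview 5 rows."""
    from sqlalchemy import text
    from sqlalchemy.types import String

    dtypes = {}
    key = PRIMARY_KEYS.get(table)
    if key is not None and not pd.api.types.is_numeric_dtype(df[key]):
        # Size the VARCHAR to the longest key value
        max_len = int(df[key].astype(str).str.len().max())
        dtypes[key] = String(max_len + 1)

    df.to_sql(table, engine, dtype=dtypes, if_exists="replace", index=False)
    if key is not None:
        with engine.begin() as conn:
            conn.execute(text(primary_key_sql(table)))
    return pd.read_sql(preview_sql(table), engine)


# Hypothetical usage (requires a running MySQL server and your own credentials):
# engine = get_engine(build_connection_url("root", "your_password"))
# load_table(basics, "title_basics", engine)
# pd.read_sql("SHOW TABLES;", engine)
```

The SQLAlchemy imports sit inside the functions only so that the string helpers can be tried without a database installed; in a notebook they would normally go in the first import cell.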

### Deliverables

Submit a link to your GitHub repository containing the Jupyter Notebook file.



# Getting Started Tips

## Normalizing Genres - Overview

* In order to normalize genres, we will need to:

    * Convert the single string of genres from title basics into 2 new tables.
        
        1. title_genres: with the columns:

            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Creating these tables will be a multi-step process:

    1. Get a list of all individual genres.
    2. Create a new title_genres table with the movie ids duplicated, once for each genre that a movie belongs to.
    3. Create a mapper dictionary with numeric ids for each genre.
    4. Use the mapper dictionary to replace the string genres in title_genres with numeric genre_ids.
    5. Convert the mapper dictionary into a final genres table with the numeric genre_id and the string genre.
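
Steps 3 to 5 can be sketched on a toy table, assuming steps 1 and 2 have already produced one row per movie and genre pair. The data values and the variable name `genre_map` are made up for illustration.

```python
import pandas as pd

# Toy result of steps 1 and 2: one row per (movie, genre) pair
title_genres = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002", "tt0000003", "tt0000003"],
    "genre_name": ["Comedy", "Romance", "Drama", "Comedy", "Drama"],
})

# Step 3: mapper dictionary from genre string to numeric id
unique_genres = sorted(title_genres["genre_name"].unique())
genre_map = dict(zip(unique_genres, range(len(unique_genres))))

# Step 4: replace the string genres with numeric genre_ids
title_genres["genre_id"] = title_genres["genre_name"].map(genre_map)
title_genres = title_genres.drop(columns="genre_name")

# Step 5: turn the mapper into the final genres lookup table
genres = pd.DataFrame({
    "genre_id": list(genre_map.values()),
    "genre_name": list(genre_map.keys()),
})

print(genre_map)
print(title_genres)
print(genres)
```

Sorting the genre names before numbering them keeps the ids stable between runs of the notebook.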

### 1. Getting List of Unique Genres:

* The genres column should be split into individual genres.

    * For example: "Comedy,Fantasy,Romance" is actually 3 genres that the movie belongs to, not one combined genre.


* First, you will need to get a list of all of the unique genres that appear in the column. Right now, the genre column contains a string with the genres separated by a comma.

    1. We are going to convert these strings into lists of strings in a new 'genres_split' column.
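
One possible way to do this on a toy DataFrame is shown below. The rows are made up; only the `genres` and `genres_split` column names come from the text above, and using `explode` to collect the unique values is one option among several.

```python
import pandas as pd

# Toy stand-in for title basics (values are made up)
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "genres": ["Comedy,Fantasy,Romance", "Drama", "Comedy,Drama"],
})

# Turn each comma-separated string into a list of strings
basics["genres_split"] = basics["genres"].str.split(",")

# One row per (movie, genre) pair, then collect the unique genre names
exploded = basics.explode("genres_split")
unique_genres = sorted(exploded["genres_split"].dropna().unique())

print(unique_genres)
```

The exploded DataFrame is also the starting point for the title_genres table, since it already repeats each movie id once per genre.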

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

pd.set_option('display.max_columns',100)

In [8]:
FOLDER = "Data/"
sorted(os.listdir(FOLDER))

['final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'title_akas_cleaned.csv.gz',
 'title_basics_cleaned.csv.gz',
 'title_ratings_cleaned.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json']

In [16]:
# Concatenate the DataFrames
import glob
q  = f"{FOLDER}final*.csv.gz"
files = sorted(glob.glob(q))
df = pd.concat([pd.read_csv(f, lineterminator='\n') for f in files] )
df

Unnamed: 0,imdb_id\r
0,0
0,0
