# Part 3 - Producing a MySQL Database

## Business Problem

Create a database for analyzing what makes a movie successful, and use it to provide recommendations to the stakeholder on how to make a successful movie.

### Specifications - Database

* Take data that has been cleaned and create a MySQL database. 

* Normalize the tables before adding them to the new database. 
    
    * All data from the TMDB API should be in 1 table together (even though it will not be perfectly normalized).
    
    * Keep only the imdb_id, revenue, budget, and certification columns.
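
A minimal sketch of that column selection, using a toy DataFrame in place of the real TMDB API results. The four column names come from the specification above; the sample rows and the extra `title` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the combined TMDB API data (values are illustrative only)
tmdb = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "title": ["Example A", "Example B"],
    "revenue": [1000000.0, 0.0],
    "budget": [500000.0, 0.0],
    "certification": ["PG-13", None],
})

# Keep only the four columns named in the specifications
keep_cols = ["imdb_id", "revenue", "budget", "certification"]
tmdb_data = tmdb[keep_cols].copy()
print(tmdb_data.columns.tolist())
```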

## Transformation Steps:

* Normalize Genre:

    * Convert the single string of genres from title_basics into 2 new tables.         
        1. title_genres: with columns:
            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Discard unnecessary information:

    * For the title basics table, drop the following columns:
        * "originalTitle" (we will use the primary title column instead).
        * "isAdult" ("Adult" will show up in the genres, so this is redundant information).
        * "titleType" (every row will be a movie).
        * "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
     
    * Do not include the title_akas table in your SQL database.
        * You have already used this table to filter for the desired movies, and the remaining data is mostly nulls and not of interest to the stakeholder.

## MySQL Database Requirements

* Use sqlalchemy with pandas to execute your SQL queries inside your notebook.


* Create a new database on your MySQL server and call it "movies".



* Make sure to have the following tables in your "movies" database:

    * title_basics
    * title_ratings
    * title_genres
    * genres
    * tmdb_data


* Make sure to set a Primary Key for each table that isn't a joiner table (e.g. title_genres is a joiner table).


* After creating each table, show the first 5 rows of that table using a SQL query.


* Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.
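
The requirements above could be sketched roughly as follows. This is an assumption-laden outline rather than the required solution: `load_table` is a hypothetical helper, the rows are toy data, and the demo runs against an in-memory SQLite engine only so that it works without a server. The `ALTER TABLE ... ADD PRIMARY KEY` statement is MySQL syntax and is skipped in the SQLite demo. Note that MySQL will not accept a `TEXT` column as a primary key without a key length, which is why the key column is written as a sized `String` (VARCHAR).

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.types import String


def load_table(df, name, engine, dtype=None, primary_key=None):
    """Write df to a SQL table, optionally add a primary key, and return its first 5 rows."""
    df.to_sql(name, engine, dtype=dtype, if_exists="replace", index=False)
    if primary_key is not None:
        # MySQL syntax; the key column must be VARCHAR (String), not TEXT
        with engine.begin() as conn:
            conn.execute(text(f"ALTER TABLE {name} ADD PRIMARY KEY (`{primary_key}`);"))
    with engine.connect() as conn:
        return pd.read_sql(text(f"SELECT * FROM {name} LIMIT 5;"), conn)


# Toy rows standing in for title_basics
demo = pd.DataFrame({"tconst": [f"tt{i:07d}" for i in range(7)],
                     "primaryTitle": [f"Movie {i}" for i in range(7)]})

# Size the VARCHAR from the longest key in the data
key_len = int(demo["tconst"].str.len().max())
schema = {"tconst": String(key_len)}

# In-memory SQLite stand-in so the sketch runs without a server.
# Against MySQL, pass primary_key="tconst" and finish the notebook with
# pd.read_sql(text("SHOW TABLES;"), conn).
engine = create_engine("sqlite://")
first_rows = load_table(demo, "title_basics", engine, dtype=schema)
print(first_rows)
```

Joiner tables such as title_genres would be loaded the same way with `primary_key=None`.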

### Deliverables

Submit a link to your GitHub repository containing the Jupyter Notebook file.



# Getting Started Tips

## Normalizing Genres - Overview

* In order to normalize genres, we will need to:

    * Convert the single string of genres from title basics into 2 new tables.
        
        1. title_genres: with the columns:

            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Creating these tables will be a multi-step process:

    1. Get a list of all individual genres.
    2. Create a new title_genres table with the movie ids duplicated, once for each genre that a movie belongs to.
    3. Create a mapper dictionary with numeric ids for each genre.
    4. Use the mapper dictionary to replace the string genres in title_genres with numeric genre_ids.
    5. Convert the mapper dictionary into a final genres table with the numeric genre_id and the string genre.
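
The five steps can be sketched end to end on a two-row toy table before running them on the full data. The rows are taken from the sample output later in this notebook; the `toy_*` names are only for illustration, and the resulting ids differ from the real ones because only four genres appear here.

```python
import pandas as pd

# Toy stand-in for title_basics
toy = pd.DataFrame({
    "tconst": ["tt0035423", "tt0062336"],
    "genres": ["Comedy,Fantasy,Romance", "Drama"],
})

# Steps 1-2: split each string into a list, then explode to one row per (movie, genre) pair
toy["genres_split"] = toy["genres"].str.split(",")
pairs = toy.explode("genres_split")[["tconst", "genres_split"]]

# Step 3: map each unique genre name to an integer id
names = sorted(pairs["genres_split"].unique())
mapper = {name: i for i, name in enumerate(names)}

# Step 4: swap genre names for ids in the joiner table
toy_title_genres = pairs.assign(genre_id=pairs["genres_split"].map(mapper))
toy_title_genres = toy_title_genres.drop(columns="genres_split").reset_index(drop=True)

# Step 5: turn the mapper into the lookup table
toy_genres = pd.DataFrame({"genre_id": list(mapper.values()),
                           "genre_name": list(mapper.keys())})

print(toy_title_genres)
print(toy_genres)
```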

### 1. Getting List of Unique Genres:

* The genres column should be separated into separate genres.

    * For example: "Comedy,Fantasy,Romance" is actually 3 genres that the movie belongs to, not one combined-genre.


* First, you will need to get a list of all of the unique genres that appear in the column. Right now, the genre column contains a string with the genres separated by a comma.

    1. We are going to convert these strings into lists of strings and store them in a new 'genres_split' column.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os,json
import missingno as ms

pd.set_option('display.max_columns',100)

In [2]:
FOLDER = "Data/"
sorted(os.listdir(FOLDER))

['combined_tmdb_api_data.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'final_tmdb_data_2021.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'title_akas_cleaned.csv.gz',
 'title_basics_cleaned.csv.gz',
 'title_ratings_cleaned.csv.gz']

# Basics - Normalizing Genres

## Basics

In [3]:
## title basics
basics = pd.read_csv(f'{FOLDER}title_basics_cleaned.csv.gz',low_memory=False)
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82204 entries, 0 to 82203
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          82204 non-null  object 
 1   titleType       82204 non-null  object 
 2   primaryTitle    82204 non-null  object 
 3   originalTitle   82204 non-null  object 
 4   isAdult         82204 non-null  int64  
 5   startYear       82204 non-null  float64
 6   runtimeMinutes  82204 non-null  int64  
 7   genres          82204 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 5.0+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0094859,movie,Chief Zabu,Chief Zabu,0,2016.0,74,Comedy


## Ratings

In [4]:
ratings = pd.read_csv(f"{FOLDER}title_ratings_cleaned.csv.gz", low_memory=False)
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67422 entries, 0 to 67421
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         67422 non-null  object 
 1   averageRating  67422 non-null  float64
 2   numVotes       67422 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.5+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,84521
1,tt0062336,6.4,161
2,tt0069049,6.7,7333
3,tt0088751,5.2,323
4,tt0094859,7.9,83


# Transform

## Basics

***
* Normalize and separate genres.

* Drop the following columns:

    * `originalTitle` (use the primary title column instead).

    * `isAdult` ("Adult" will show up in the genres, so this is redundant information).

    * `titleType` (every row will be a movie).

    * `genres` and other variants of genre (genre is now represented in the 2 new tables described above).

***

In [5]:
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82204 entries, 0 to 82203
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          82204 non-null  object 
 1   titleType       82204 non-null  object 
 2   primaryTitle    82204 non-null  object 
 3   originalTitle   82204 non-null  object 
 4   isAdult         82204 non-null  int64  
 5   startYear       82204 non-null  float64
 6   runtimeMinutes  82204 non-null  int64  
 7   genres          82204 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 5.0+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0094859,movie,Chief Zabu,Chief Zabu,0,2016.0,74,Comedy


In [6]:
# columns to drop
columns_dropped = ['originalTitle', 'isAdult', 'titleType']
basics = basics.drop(columns=columns_dropped)
basics.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama
3,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0094859,Chief Zabu,2016.0,74,Comedy


### Normalizing Genres

In [7]:
# Create a new column holding each comma-separated genre string as a list of strings
basics['genres_split'] = basics['genres'].str.split(',')
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,[Drama]
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,[Drama]
3,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi","[Comedy, Horror, Sci-Fi]"
4,tt0094859,Chief Zabu,2016.0,74,Comedy,[Comedy]
...,...,...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74,Drama,[Drama]
82200,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy","[Comedy, Drama, Fantasy]"
82201,tt9916170,The Rehearsal,2019.0,51,Drama,[Drama]
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"


In [8]:
# Explode the dataframe so that each genre gets its own row
exploded_data = basics.explode('genres_split')
exploded_data

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,Drama
...,...,...,...,...,...,...
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Action
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Adventure
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Thriller
82203,tt9916362,Coven,2020.0,92,"Drama,History",Drama


In [10]:
# saving tconst and genres_split as a new df
title_genres = exploded_data[['tconst', 'genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0062336,Drama
2,tt0069049,Drama


In [11]:
# Get a sorted list of the unique genres
unique_data = sorted(title_genres['genres_split'].unique())
unique_data

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [13]:
# Create a dictionary with the unique genres as keys and the new integer ids as values
genre_id_map = dict(zip(unique_data, range(len(unique_data))))
genre_id_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Drama': 7,
 'Family': 8,
 'Fantasy': 9,
 'Game-Show': 10,
 'History': 11,
 'Horror': 12,
 'Music': 13,
 'Musical': 14,
 'Mystery': 15,
 'News': 16,
 'Reality-TV': 17,
 'Romance': 18,
 'Sci-Fi': 19,
 'Short': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

In [14]:
# Replace the genre names in title_genres with their integer ids
title_genres['Genre_ID'] = title_genres['genres_split'].map(genre_id_map)

# drop the original genre column
title_genres.drop(columns=['genres_split'], inplace=True)
title_genres

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
0,tt0035423,9
0,tt0035423,18
1,tt0062336,7
2,tt0069049,7
...,...,...
82202,tt9916190,0
82202,tt9916190,2
82202,tt9916190,23
82203,tt9916362,7


# Creating Tables

## genres table

In [15]:
genre_lookup = pd.DataFrame({'Genre_Name': list(genre_id_map.keys()),
                             'Genre_ID': list(genre_id_map.values())})

genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4
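
One way to sanity-check the two tables is to join them back together and confirm that each movie recovers its original genre string. A small sketch on toy rows, reusing the `Genre_ID` column name and the ids from the mapping above; the frame names `tg` and `lookup` are only for illustration.

```python
import pandas as pd

# Toy versions of the joiner table and the lookup table
tg = pd.DataFrame({"tconst": ["tt0035423"] * 3 + ["tt0062336"],
                   "Genre_ID": [5, 9, 18, 7]})
lookup = pd.DataFrame({"Genre_Name": ["Comedy", "Drama", "Fantasy", "Romance"],
                       "Genre_ID": [5, 7, 9, 18]})

# Join through the shared id, then collapse back to one string per movie
merged = tg.merge(lookup, on="Genre_ID", how="left")
rebuilt = merged.groupby("tconst")["Genre_Name"].agg(",".join)
print(rebuilt)
```

Any missing value in `Genre_Name` after the merge would point to an id in the joiner table that has no entry in the lookup table.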


In [16]:
## Dropping original genre columns 
basics = basics.drop(columns=['genres','genres_split'])
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0069049,The Other Side of the Wind,2018.0,122
3,tt0088751,The Naked Monster,2005.0,100
4,tt0094859,Chief Zabu,2016.0,74
...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74
82200,tt9915872,The Last White Witch,2019.0,97
82201,tt9916170,The Rehearsal,2019.0,51
82202,tt9916190,Safeguard,2020.0,95


# Connecting to MySQL

In [2]:
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy.types import *
connection_str = "mysql+pymysql://root:iamroot@localhost/movies"
engine = create_engine(connection_str)

In [3]:
## Getting mysql server password
import json
with open(r"C:\Users\nbeac\.secret\, 'r') as f:
    results = json.load(f)
results.keys()

SyntaxError: EOL while scanning string literal (192832250.py, line 3)
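
The cell above fails because the path string is never closed. A minimal sketch of the same idea, assuming the credentials live in a JSON file with `username` and `password` keys: the file name, the key names, and the helper `build_connection_str` are all hypothetical, and the demo writes a throwaway file so that it runs anywhere.

```python
import json
import os
import tempfile
from urllib.parse import quote_plus


def build_connection_str(secret_path, database="movies", host="localhost"):
    """Read MySQL credentials from a JSON file and build a SQLAlchemy URL."""
    with open(secret_path, "r") as f:
        creds = json.load(f)
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    password = quote_plus(creds["password"])
    return f"mysql+pymysql://{creds['username']}:{password}@{host}/{database}"


# Demo with a temporary file standing in for the real secret file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "mysql_login.json")
    with open(demo_path, "w") as f:
        json.dump({"username": "root", "password": "p@ss/word"}, f)
    demo_url = build_connection_str(demo_path)

print(demo_url)
```

The resulting string can be passed to `create_engine`, which keeps the password out of the notebook itself.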