# Part 3 - Producing a MySQL Database

## Business Problem

Create a database to analyze what makes a movie successful, and provide recommendations to the stakeholder on how to make a successful movie.



### Specifications - Database

* Take data that has been cleaned and create a MySQL database. 

* Normalize the tables before adding them to the new database. 
    
    * All data from the TMDB API should be in 1 table together (even though it will not be perfectly normalized).
    
    * Keep only the imdb_id, revenue, budget, and certification columns.

## Transformation Steps:

* Normalize Genre:

    * Convert the single string of genres from title_basics into 2 new tables.         
        1. title_genres: with columns:
            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Discard unnecessary information:

    * For the title basics table, drop the following columns:
        * "original_title" (we will use the primary title column instead)
        * "isAdult" ("Adult" will show up in the genres so this is redundant information).
        * "titleType" (every row will be a movie).
        * "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
     
    * Do not include the title_akas table in your SQL database.
        * You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.

## MySQL Database Requirements

* Use sqlalchemy with pandas to execute your SQL queries inside your notebook.


* Create a new database on your MySQL server and call it "movies".



* Make sure to have the following tables in your "movies" database:

    * title_basics
    * title_ratings
    * title_genres
    * genres
    * tmdb_data


* Make sure to set a Primary Key for each table that isn't a joiner table (e.g. title_genres is a joiner table).


* After creating each table, show the first 5 rows of that table using a SQL query.


* Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.

### Deliverables

Submit a link to your GitHub repository containing the Jupyter Notebook file.



# Getting Started Tips

## Normalizing Genres - Overview

* In order to normalize genres, we will need to:

    * Convert the single string of genres from title basics into 2 new tables.
        
        1. title_genres: with the columns:

            * tconst
            * genre_id
        
        2. genres:
            * genre_id
            * genre_name


* Creating these tables will be a multi-step process:

    1. Get a list of all individual genres.

    2. Create a new title_genres table with the movie ids duplicated, once for each genre that a movie belongs to.

    3. Create a mapper dictionary with numeric ids for each genre.

    4. Use the mapper dictionary to replace the string genres in title_genres with numeric genre_ids.

    5. Convert the mapper dictionary into a final genres table with the numeric genre_id and the string genre_name.
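
The five steps can be sketched end to end on a toy dataframe. This is a minimal illustration, not the notebook's actual cells: the `*_demo` names and the three sample rows are made up for the example, and only pandas is assumed.

```python
import pandas as pd

# Toy stand-in for title_basics (values are illustrative only)
basics_demo = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "genres": ["Comedy,Fantasy,Romance", "Drama", "Comedy,Drama"],
})

# Steps 1-2: split each string into a list, then explode to one row per (movie, genre)
title_genres_demo = (
    basics_demo.assign(genre_name=basics_demo["genres"].str.split(","))
    .explode("genre_name")[["tconst", "genre_name"]]
    .reset_index(drop=True)
)

# Step 3: mapper dictionary from genre name to integer id (sorted for a stable order)
unique_genres = sorted(title_genres_demo["genre_name"].unique())
genre_map = {name: idx for idx, name in enumerate(unique_genres)}

# Step 4: replace the text genre with its numeric id
title_genres_demo["genre_id"] = title_genres_demo["genre_name"].map(genre_map)
title_genres_demo = title_genres_demo.drop(columns="genre_name")

# Step 5: turn the mapper into the genres lookup table
genres_demo = pd.DataFrame({
    "genre_id": list(genre_map.values()),
    "genre_name": list(genre_map.keys()),
})

print(title_genres_demo)
print(genres_demo)
```

Sorting the unique genres before numbering them keeps the ids reproducible between runs, which matters if the tables are rebuilt later.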

### 1. Getting List of Unique Genres:

* The genres column should be split into individual genres.

    * For example: "Comedy,Fantasy,Romance" is actually 3 genres that the movie belongs to, not one combined-genre.


* First, you will need to get a list of all of the unique genres that appear in the column. Right now, the genre column contains a string with the genres separated by a comma.

    1. We are going to convert each of these strings into a list of strings, stored in a new 'genres_split' column.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os,json
import missingno as ms

plt.rcParams['figure.facecolor'] = 'white'

pd.set_option('display.max_columns',100)

In [2]:
FOLDER = "Data/"
sorted(os.listdir(FOLDER))

['combined_tmdb_api_data.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'final_tmdb_data_2021.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'title_akas_cleaned.csv.gz',
 'title_basics_cleaned.csv.gz',
 'title_ratings_cleaned.csv.gz']

# Basics - Normalizing Genres

## Basics

In [3]:
## title basics
basics = pd.read_csv(f'{FOLDER}title_basics_cleaned.csv.gz',low_memory=False)
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82204 entries, 0 to 82203
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          82204 non-null  object 
 1   titleType       82204 non-null  object 
 2   primaryTitle    82204 non-null  object 
 3   originalTitle   82204 non-null  object 
 4   isAdult         82204 non-null  int64  
 5   startYear       82204 non-null  float64
 6   runtimeMinutes  82204 non-null  int64  
 7   genres          82204 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 5.0+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020.0,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0094859,movie,Chief Zabu,Chief Zabu,0,2016.0,74,Comedy


## Ratings

In [4]:
ratings = pd.read_csv(f"{FOLDER}title_ratings_cleaned.csv.gz", low_memory=False)
ratings.info()
ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67422 entries, 0 to 67421
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         67422 non-null  object 
 1   averageRating  67422 non-null  float64
 2   numVotes       67422 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.5+ MB


Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,84521
1,tt0062336,6.4,161
2,tt0069049,6.7,7333
3,tt0088751,5.2,323
4,tt0094859,7.9,83


## TMDB API Results

In [5]:
import glob
q = f"{FOLDER}final*.csv.gz"
files = glob.glob(q)
files 

['Data\\final_tmdb_data_2000.csv.gz',
 'Data\\final_tmdb_data_2001.csv.gz',
 'Data\\final_tmdb_data_2002.csv.gz',
 'Data\\final_tmdb_data_2003.csv.gz',
 'Data\\final_tmdb_data_2004.csv.gz',
 'Data\\final_tmdb_data_2005.csv.gz',
 'Data\\final_tmdb_data_2006.csv.gz',
 'Data\\final_tmdb_data_2007.csv.gz',
 'Data\\final_tmdb_data_2008.csv.gz',
 'Data\\final_tmdb_data_2009.csv.gz',
 'Data\\final_tmdb_data_2010.csv.gz',
 'Data\\final_tmdb_data_2011.csv.gz',
 'Data\\final_tmdb_data_2012.csv.gz',
 'Data\\final_tmdb_data_2013.csv.gz',
 'Data\\final_tmdb_data_2014.csv.gz',
 'Data\\final_tmdb_data_2015.csv.gz',
 'Data\\final_tmdb_data_2016.csv.gz',
 'Data\\final_tmdb_data_2017.csv.gz',
 'Data\\final_tmdb_data_2018.csv.gz',
 'Data\\final_tmdb_data_2019.csv.gz',
 'Data\\final_tmdb_data_2020.csv.gz',
 'Data\\final_tmdb_data_2021.csv.gz',
 'Data\\final_tmdb_data_2022.csv.gz']

In [6]:
df = pd.concat([pd.read_csv(f) for f in files])
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62384 entries, 0 to 1679
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  62384 non-null  bool   
 1   backdrop_path          39108 non-null  object 
 2   belongs_to_collection  3980 non-null   object 
 3   budget                 62384 non-null  int64  
 4   genres                 62384 non-null  object 
 5   homepage               15351 non-null  object 
 6   id                     62384 non-null  int64  
 7   imdb_id                62384 non-null  object 
 8   original_language      62384 non-null  object 
 9   original_title         62384 non-null  object 
 10  overview               61003 non-null  object 
 11  popularity             62384 non-null  float64
 12  poster_path            56816 non-null  object 
 13  production_companies   62384 non-null  object 
 14  production_countries   62384 non-null  object 
 15  rel

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,False,5.5,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,False,5.1,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,False,0.0,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,花樣年華,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.9,46,R


In [7]:
# Dropping placeholder rows where imdb_id is '0'
df = df.loc[df['imdb_id']!='0']

# Transform

## Basics

***
* Normalize and separate genre.

* Drop the following columns:

    * `originalTitle` (use the primary title column instead).

    * `isAdult` ("Adult" will show up in the genres so this is redundant information).

    * `titleType` (every row will be a movie).

    * `genres` and other variants of genre (genre is now represented in the 2 new tables described above).

***

In [8]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82204 entries, 0 to 82203
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          82204 non-null  object 
 1   titleType       82204 non-null  object 
 2   primaryTitle    82204 non-null  object 
 3   originalTitle   82204 non-null  object 
 4   isAdult         82204 non-null  int64  
 5   startYear       82204 non-null  float64
 6   runtimeMinutes  82204 non-null  int64  
 7   genres          82204 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 5.0+ MB


In [9]:
# columns to drop
columns_dropped = ['originalTitle', 'isAdult', 'titleType']
basics = basics.drop(columns=columns_dropped)
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama
3,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi"
4,tt0094859,Chief Zabu,2016.0,74,Comedy
...,...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74,Drama
82200,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy"
82201,tt9916170,The Rehearsal,2019.0,51,Drama
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller"


### Normalizing Genres

In [10]:
# Create a new column that holds each single-string genres value as a list of strings
basics['genres_split'] = basics['genres'].str.split(',')
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,[Drama]
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,[Drama]
3,tt0088751,The Naked Monster,2005.0,100,"Comedy,Horror,Sci-Fi","[Comedy, Horror, Sci-Fi]"
4,tt0094859,Chief Zabu,2016.0,74,Comedy,[Comedy]
...,...,...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74,Drama,[Drama]
82200,tt9915872,The Last White Witch,2019.0,97,"Comedy,Drama,Fantasy","[Comedy, Drama, Fantasy]"
82201,tt9916170,The Rehearsal,2019.0,51,Drama,[Drama]
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"


In [11]:
# Explode the dataframe so that each genre gets its own row
exploded_data = basics.explode('genres_split')
exploded_data

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genres,genres_split
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,Kate & Leopold,2001.0,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70,Drama,Drama
2,tt0069049,The Other Side of the Wind,2018.0,122,Drama,Drama
...,...,...,...,...,...,...
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Action
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Adventure
82202,tt9916190,Safeguard,2020.0,95,"Action,Adventure,Thriller",Thriller
82203,tt9916362,Coven,2020.0,92,"Drama,History",Drama


In [12]:
# saving tconst and genres_split as a new df
title_genres = exploded_data[['tconst', 'genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0062336,Drama
2,tt0069049,Drama


In [13]:
# Get a sorted list of the unique genres (these will be mapped to integer IDs next)
unique_genres = sorted(title_genres['genres_split'].unique())
unique_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [14]:
# Create a dictionary with the unique genres as keys and the new integer ids as values
genre_id_map = dict(zip(unique_genres, range(len(unique_genres))))
genre_id_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Drama': 7,
 'Family': 8,
 'Fantasy': 9,
 'Game-Show': 10,
 'History': 11,
 'Horror': 12,
 'Music': 13,
 'Musical': 14,
 'Mystery': 15,
 'News': 16,
 'Reality-TV': 17,
 'Romance': 18,
 'Sci-Fi': 19,
 'Short': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

In [15]:
## Use .map or .replace with our genre_id_map dictionary
title_genres['Genre_ID'] = title_genres['genres_split'].replace(genre_id_map)

## Drop the original genre column
title_genres.drop(columns=['genres_split'],inplace=True)
title_genres

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
0,tt0035423,9
0,tt0035423,18
1,tt0062336,7
2,tt0069049,7
...,...,...
82202,tt9916190,0
82202,tt9916190,2
82202,tt9916190,23
82203,tt9916362,7
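
The comment in the cell above mentions both `.map` and `.replace`. A minimal sketch of one practical difference, using a made-up series and mapper (the `*_demo` names are hypothetical): `.map` turns any value missing from the dictionary into `NaN`, which makes unmapped genres easy to spot, whereas `.replace` would leave the original string in place.

```python
import pandas as pd

# Hypothetical mapper that is deliberately missing one genre
genre_id_map_demo = {"Comedy": 0, "Drama": 1}
genres_demo = pd.Series(["Comedy", "Drama", "Comedy", "Western"], name="genres_split")

# Values without a dictionary entry become NaN
mapped = genres_demo.map(genre_id_map_demo)

# List the genre names that did not receive an id
unmapped = genres_demo[mapped.isna()].unique().tolist()
print(unmapped)
```

In this notebook the mapper is built from the same column it is applied to, so no value should end up unmapped; the check is mainly useful when the mapper comes from elsewhere.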


# Creating Tables

## genres table

In [16]:
## Manually make a dataframe with named columns from the mapper's .keys() and .values()
genre_lookup = pd.DataFrame({'Genre_Name': genre_id_map.keys(),
                             'Genre_ID':genre_id_map.values()})
genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [17]:
## Dropping original genre columns 
basics = basics.drop(columns=['genres','genres_split'])
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0069049,The Other Side of the Wind,2018.0,122
3,tt0088751,The Naked Monster,2005.0,100
4,tt0094859,Chief Zabu,2016.0,74
...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74
82200,tt9915872,The Last White Witch,2019.0,97
82201,tt9916170,The Rehearsal,2019.0,51
82202,tt9916190,Safeguard,2020.0,95


# Connecting to MySQL

In [18]:
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists
from sqlalchemy.types import *
# engine = create_engine(connection_str)

In [19]:
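## Loading the credentials file (as the output shows, it only holds the TMDB 'api-key';
## the MySQL login is written directly into the connection string in the next cell)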
import json
with open(r"C:\Users\nbeac\.secret\tmdb_api.json") as f:
    login = json.load(f)
login.keys()

dict_keys(['api-key'])

In [20]:
connect_str = connection_str = "mysql+pymysql://root:iamroot@localhost/movies"
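
The connection string above embeds the username and password directly. As a hedged alternative, the sketch below builds the same kind of URL from a credentials dictionary and escapes special characters with `urllib.parse.quote_plus`. The helper name `build_mysql_url` and the `username`/`password`/`host` keys are assumptions made for this example, not part of the notebook.

```python
import json
from urllib.parse import quote_plus

def build_mysql_url(creds, database):
    """Build a SQLAlchemy-style MySQL URL from a dict of credentials."""
    user = quote_plus(creds["username"])
    password = quote_plus(creds["password"])  # escapes characters such as @ or /
    host = creds.get("host", "localhost")
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"

# In a notebook the dict would typically come from a JSON file kept outside
# the repository; here it is parsed from an inline string for illustration.
creds = json.loads('{"username": "root", "password": "p@ss/word"}')
print(build_mysql_url(creds, "movies"))
```

Keeping the password out of the notebook also keeps it out of the GitHub repository that is submitted as the deliverable.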

In [21]:
## Check if database exists, if not, create it
if not database_exists(connect_str):
    print("Creating the database.")
    create_database(connect_str)
else:
    print('The database already exists.')

The database already exists.


In [22]:
## create engine
engine = create_engine(connect_str)

In [23]:
## Test your connection by listing the existing tables
## (on a first run there should be none; the output below comes from a re-run)
q = """SHOW TABLES;"""
tables = pd.read_sql(q, engine)
tables

Unnamed: 0,Tables_in_movies
0,genres
1,title_basics
2,title_genres
3,title_ratings
4,tmdb_data_aab
5,tmdb_data_mvp
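
One way to compare this listing with the tables required by the specification is a plain set difference. The sketch below copies the table names from the output above; inside the notebook, `existing` could instead be built from the `Tables_in_movies` column of the `tables` dataframe.

```python
# Tables required by the "MySQL Database Requirements" section
required = {"title_basics", "title_ratings", "title_genres", "genres", "tmdb_data"}

# Table names copied from the SHOW TABLES output above
existing = {"genres", "title_basics", "title_genres", "title_ratings",
            "tmdb_data_aab", "tmdb_data_mvp"}

missing = sorted(required - existing)
extra = sorted(existing - required)
print("missing:", missing)
print("extra:", extra)
```

With the names shown in the output, the check reports `tmdb_data` as missing, because the TMDB table was saved under the names `tmdb_data_mvp` and `tmdb_data_aab`.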


## Saving `title_basics`

In [24]:
## Maximum string lengths, used to size the text columns in the schema
key_len = basics['tconst'].map(len).max()
title_len = basics['primaryTitle'].map(len).max()
key_len, title_len

(10, 242)

In [25]:
basics_schema = {
    "tconst": String(key_len+1), 
    "primaryTitle": Text(title_len+1),
    'startYear':Float(),
    'runtimeMinutes':Integer()
    }
basics_schema

{'tconst': String(length=11),
 'primaryTitle': Text(length=243),
 'startYear': Float(),
 'runtimeMinutes': Integer()}

In [26]:
basics

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0069049,The Other Side of the Wind,2018.0,122
3,tt0088751,The Naked Monster,2005.0,100
4,tt0094859,Chief Zabu,2016.0,74
...,...,...,...,...
82199,tt9914942,Life Without Sara Amat,2019.0,74
82200,tt9915872,The Last White Witch,2019.0,97
82201,tt9916170,The Rehearsal,2019.0,51
82202,tt9916190,Safeguard,2020.0,95


In [27]:
## Saving basics as a table using the schema
basics.to_sql('title_basics',engine,dtype=basics_schema,if_exists='replace',index=False)

## Setting tconst as the primary key of title_basics
engine.execute('ALTER TABLE title_basics ADD PRIMARY KEY (`tconst`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x1f06ddfe820>

In [28]:
## query first rows 
q = """SELECT * FROM title_basics LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes
0,tt0035423,Kate & Leopold,2001.0,118
1,tt0062336,The Tango of the Widower and Its Distorting Mi...,2020.0,70
2,tt0069049,The Other Side of the Wind,2018.0,122
3,tt0088751,The Naked Monster,2005.0,100
4,tt0094859,Chief Zabu,2016.0,74
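
The `engine.execute(...)` calls in this notebook return a `LegacyCursorResult`, which suggests SQLAlchemy 1.4; `Engine.execute` was removed in SQLAlchemy 2.0. A minimal sketch of the 2.0-style equivalent follows, assuming SQLAlchemy is installed. The helper names `primary_key_sql` and `add_primary_key` are hypothetical and not part of any library.

```python
import re
from sqlalchemy import text

def primary_key_sql(table, column):
    """Return an ALTER TABLE statement that makes `column` the primary key."""
    # Identifiers cannot be bound as parameters, so validate them before formatting
    for name in (table, column):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"ALTER TABLE `{table}` ADD PRIMARY KEY (`{column}`);"

def add_primary_key(engine, table, column):
    """Run the statement inside a transaction that commits on success."""
    with engine.begin() as conn:
        conn.execute(text(primary_key_sql(table, column)))

print(primary_key_sql("title_basics", "tconst"))
# add_primary_key(engine, "title_basics", "tconst")  # needs a live MySQL engine
```

The same helper could be reused for the `genres`, `title_ratings`, and TMDB tables, which each get a primary key in the cells that follow.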


## Saving Genre Tables

In [29]:
genre_lookup.head()

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [30]:
## Primary key is Genre_ID
genre_lookup.to_sql('genres',engine,index=False, if_exists='replace')

engine.execute('ALTER TABLE genres ADD PRIMARY KEY (`Genre_ID`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x1f07fa02d00>

In [31]:
q = """SELECT * FROM genres LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,Genre_Name,Genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


### Saving `title_genres` table

In [32]:
## No primary key: tconst values are duplicated in this joiner table
title_genres.to_sql('title_genres',engine,index=False,
                    if_exists='replace' )

# engine.execute('ALTER TABLE title_genres ADD PRIMARY KEY (`tconst`);')

153592

In [33]:
q = """SELECT * FROM title_genres LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,Genre_ID
0,tt0035423,5
1,tt0035423,9
2,tt0035423,18
3,tt0062336,7
4,tt0069049,7
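
To illustrate how the three tables fit together once genres are normalized, the sketch below recreates a tiny version of them and joins a title back to its genre names. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs without a MySQL server (an assumption of this sketch); the single movie and its genre ids are taken from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (tconst TEXT PRIMARY KEY, primaryTitle TEXT);
CREATE TABLE genres (Genre_ID INTEGER PRIMARY KEY, Genre_Name TEXT);
CREATE TABLE title_genres (tconst TEXT, Genre_ID INTEGER);
INSERT INTO title_basics VALUES ('tt0035423', 'Kate & Leopold');
INSERT INTO genres VALUES (5, 'Comedy'), (9, 'Fantasy'), (18, 'Romance');
INSERT INTO title_genres VALUES ('tt0035423', 5), ('tt0035423', 9), ('tt0035423', 18);
""")

# title_genres is the joiner table linking each movie to each of its genres
query = """
SELECT b.primaryTitle, g.Genre_Name
FROM title_basics AS b
JOIN title_genres AS tg ON tg.tconst = b.tconst
JOIN genres AS g ON g.Genre_ID = tg.Genre_ID
ORDER BY g.Genre_ID;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The `SELECT` statement itself should also work against the MySQL tables created in this notebook, for example through `pd.read_sql(query, engine)`.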


### Saving `title_ratings`

In [34]:
ratings_schema = {'tconst':String(key_len+1), 
                 'averageRating':Float(),
                 'numVotes':Integer()}#get_schema(ratings)
ratings_schema

{'tconst': String(length=11), 'averageRating': Float(), 'numVotes': Integer()}

In [35]:
ratings.to_sql('title_ratings',engine,if_exists='replace',index=False,
              dtype=ratings_schema)
engine.execute("ALTER TABLE title_ratings ADD PRIMARY KEY (`tconst`)")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x1f07fa45ac0>

In [36]:
q = """SELECT * FROM title_ratings LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0035423,6.4,84521
1,tt0062336,6.4,161
2,tt0069049,6.7,7333
3,tt0088751,5.2,323
4,tt0094859,7.9,83


### Saving `TMDB API` Data

In [37]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62384 entries, 0 to 1679
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  62384 non-null  bool   
 1   backdrop_path          39108 non-null  object 
 2   belongs_to_collection  3980 non-null   object 
 3   budget                 62384 non-null  int64  
 4   genres                 62384 non-null  object 
 5   homepage               15351 non-null  object 
 6   id                     62384 non-null  int64  
 7   imdb_id                62384 non-null  object 
 8   original_language      62384 non-null  object 
 9   original_title         62384 non-null  object 
 10  overview               61003 non-null  object 
 11  popularity             62384 non-null  float64
 12  poster_path            56816 non-null  object 
 13  production_companies   62384 non-null  object 
 14  production_countries   62384 non-null  object 
 15  rel

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,False,5.5,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,False,5.1,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,False,0.0,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,花樣年華,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.9,46,R


In [38]:
## Keeping the required columns (plus original_language)
cols_to_keep = ['imdb_id','budget','revenue','certification',
                'original_language']
mvp = df[cols_to_keep]
mvp

Unnamed: 0,imdb_id,budget,revenue,certification,original_language
0,tt0113026,10000000,0,,en
1,tt0113092,0,0,,en
2,tt0116391,0,0,,hi
3,tt0118694,150000,12854953,PG,cn
4,tt0118852,0,0,R,en
...,...,...,...,...,...
1675,tt9851854,0,0,,te
1676,tt9854058,0,0,,en
1677,tt9893158,0,0,,en
1678,tt9893160,0,0,,en


In [39]:
mvp.isna().sum()

imdb_id                  0
budget                   0
revenue                  0
certification        47244
original_language        0
dtype: int64

In [40]:
mvp

Unnamed: 0,imdb_id,budget,revenue,certification,original_language
0,tt0113026,10000000,0,,en
1,tt0113092,0,0,,en
2,tt0116391,0,0,,hi
3,tt0118694,150000,12854953,PG,cn
4,tt0118852,0,0,R,en
...,...,...,...,...,...
1675,tt9851854,0,0,,te
1676,tt9854058,0,0,,en
1677,tt9893158,0,0,,en
1678,tt9893160,0,0,,en


In [41]:
## Maximum string lengths, used to size the text columns in the schema
key_len = mvp['imdb_id'].map(len).max()
cert_len = mvp['certification'].fillna('').map(len).max()
lang_len = mvp['original_language'].map(len).max()

key_len, cert_len,lang_len

(10, 31, 2)

In [42]:
## saving schema
api_data_schema = {'imdb_id': String(key_len+1),
                   'budget': Float(),
                   'revenue': Float(),
                   'certification': Text(cert_len+1),
                   'original_language': Text(lang_len+1)}
api_data_schema

{'imdb_id': String(length=11),
 'budget': Float(),
 'revenue': Float(),
 'certification': Text(length=32),
 'original_language': Text(length=3)}

In [43]:
## Primary key is imdb_id
mvp.to_sql('tmdb_data_mvp',engine, index=False,dtype=api_data_schema, if_exists='replace')

engine.execute('ALTER TABLE tmdb_data_mvp ADD PRIMARY KEY (`imdb_id`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x1f07fa2bfd0>

In [44]:
q = """SELECT * FROM tmdb_data_mvp LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,imdb_id,budget,revenue,certification,original_language
0,tt0035423,48000000.0,76019000.0,PG-13,en
1,tt0062336,0.0,0.0,,es
2,tt0069049,12000000.0,0.0,R,en
3,tt0088751,350000.0,0.0,,en
4,tt0094859,187.0,0.0,,en


In [45]:
df.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,False,5.5,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,False,5.1,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,False,0.0,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,花樣年華,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.9,46,R


In [46]:
# # saving AAB version
# ## Primary key is imdb_id
# df.to_sql('tmdb_data_aab',engine, index=False,dtype=api_data_schema, if_exists='replace')
# engine.execute('ALTER TABLE tmdb_data_aab ADD PRIMARY KEY (`imdb_id`);')

In [47]:
def get_schema(table,debug=False):
    ## save pandas dtypes in list, make empty dict
    dtypes = table.dtypes
    schema = {}
    
    # for each column
    for col in dtypes.index:
        ## print info if in debug mode
        if debug:
            print(f"{col} = {dtypes.loc[col]}")

        ## if its a string column (object)
        if dtypes.loc[col]=='object':
            
            ## Fill null values and make sure whole column is str
            data = table[col].fillna('').astype(str)
            
            ## get len first
            len_str = data.map(len).max()
            
            ## if the string is shorter than 21845 use String
            # (forget how i knew it was max size)
            if len_str < 21845:
                schema[col] = String( len_str + 1)
                
            ## If longer use Text
            else:
                schema[col] = Text(len_str+1)
        
        # if float make Float
        elif dtypes.loc[col] == 'float':
            schema[col] = Float()

        ## if int make Integer
        elif dtypes.loc[col] == 'int':
            schema[col] = Integer()
            
        ## if bool make Boolean
        elif dtypes.loc[col] == 'bool':
            schema[col] = Boolean()
            
    return schema
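
The 21845 threshold in `get_schema` is not explained in the notebook. One plausible reading, sketched below, is that it comes from MySQL's 65,535-byte row-size limit divided by the bytes needed per character. The helper names `max_varchar_chars` and `pick_text_type` are hypothetical, and the charset widths (3 bytes for utf8/utf8mb3, 4 bytes for utf8mb4) are assumptions about the server configuration rather than something the notebook states.

```python
# Assumption: MySQL's 65,535-byte row-size limit applies, and the table uses a
# 3-byte-per-character charset (utf8/utf8mb3). Neither is stated in the notebook.
MYSQL_MAX_ROW_BYTES = 65_535


def max_varchar_chars(bytes_per_char: int) -> int:
    """Upper bound on VARCHAR length, in characters, for a given charset width."""
    return MYSQL_MAX_ROW_BYTES // bytes_per_char


def pick_text_type(longest_value: int, bytes_per_char: int = 3) -> str:
    """Mirror get_schema's rule: VARCHAR(n + 1) when it fits, otherwise TEXT."""
    if longest_value < max_varchar_chars(bytes_per_char):
        return f"VARCHAR({longest_value + 1})"
    return "TEXT"


print(max_varchar_chars(3))    # 3-byte utf8
print(max_varchar_chars(4))    # 4-byte utf8mb4
print(pick_text_type(1000))    # short column
print(pick_text_type(30_000))  # too long for VARCHAR under the 3-byte assumption
```

If the server uses utf8mb4, the same arithmetic gives a lower bound on the cutoff, so the fixed 21845 in `get_schema` may be too generous there.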

In [48]:
len(df)

62384

In [49]:
# identifying incompatible rows
bad_titles = (df['original_title']!=df['title']) &\
                (df['original_language']!='en') &\
               ~df['spoken_languages'].str.contains('english',case=False)
df[bad_titles]

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
513,False,/v0yLHXchzHdAESEBx9ix3tbyV5r.jpg,,0,"[{'id': 18, 'name': 'Drama'}]",,73525,tt0216990,es,Sin dejar huella,A woman steals from her drug-dealer boyfriend ...,1.606,/hfoci8y0lOEKsykZDrV47vjNOKt.jpg,"[{'id': 357, 'logo_path': None, 'name': 'Vía D...","[{'iso_3166_1': 'MX', 'name': 'Mexico'}, {'iso...",2001-03-23,0,109,[],Released,,Without a Trace,False,6.5,10,
585,False,,,0,[],,550283,tt0222023,es,Hubo un tiempo en que los sueños dieron paso a...,"Bruno, a young prostitute on the streets of Me...",0.600,/fwyml5vli57318zgS2hqH6jQMeY.jpg,[],"[{'iso_3166_1': 'MX', 'name': 'Mexico'}]",2000-06-16,0,50,[],Released,,Long Sleepless Nights,False,5.0,2,
593,False,,,0,[],,572393,tt0222989,fr,Chittagong: Dernière escale,,0.600,/k2bi20nfkZYmWJZXNk0NrENhdcH.jpg,[],[],2001-12-05,0,0,[],Released,,Chittagong: The Last Stopover,False,0.0,0,
754,False,,,0,[],,713555,tt0243171,fa,Charkh,When he loses his football playing with a coup...,0.600,/yWRRJlIbVUNYcXpNhUc7u6nFTos.jpg,[],[],2000-04-09,0,70,[],Released,,The Cart,False,0.0,0,
843,False,,,0,[],,345083,tt0251353,it,Delitto in Prima Serata,Someone is murdered in a European modeling age...,0.600,/vU1AvCa8klwIsorISI5lkd6kYRc.jpg,[],[],2000-01-01,0,85,[],Released,,Primetime Murder,False,0.0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1331,False,/4f0bOEFJaUm9N3acffJk5J1Fmey.jpg,,0,"[{'id': 18, 'name': 'Drama'}]",,890591,tt20517646,it,Prossimo Tuo (Hotel Milano),Riki and Luca are two young men in search of t...,0.600,/7povyIL3MNYdm8DMNICDzOxdTWw.jpg,"[{'id': 127343, 'logo_path': None, 'name': 'Ze...","[{'iso_3166_1': 'IT', 'name': 'Italy'}]",,0,0,[],Post Production,,The Neighbor,False,0.0,0,
1349,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,993249,tt20877264,ka,Chemi otakhi,"Tina, a young woman who has lost her way in li...",0.600,/hugiYuqHeSwB7CM1st1VWj4h7vH.jpg,[],[],2022-07-04,0,107,[],Released,,Room Of My Own,False,0.0,0,
1392,False,/m7GkAuvm5jaRvdLeue1m0lEvD8p.jpg,,0,"[{'id': 18, 'name': 'Drama'}]",,1005299,tt21378744,fa,جنگ جهانی سوم,,4.034,/xG0nLMvAlDfLhXe6LZBu0LVU38T.jpg,[],[],2022-08-31,0,117,[],In Production,,World War III,False,0.0,0,
1421,False,,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",,269579,tt2818640,es,Un cuento de circo & a love song,A boy who grew up at the circus decides to lea...,1.105,/xXhbxTbNenb4G91b3mvTVMfeq4O.jpg,[],"[{'iso_3166_1': 'MX', 'name': 'Mexico'}]",2022-12-31,0,114,[],In Production,,A circus tale & a love song,False,0.0,0,
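
The filter above combines three conditions: the title differs from the original title, the original language is not English, and `spoken_languages` does not mention English. A minimal sketch on a toy frame may make the logic easier to follow. The rows are invented for illustration (only the first mirrors a title from the output above), and `na=False` is an added assumption about how missing `spoken_languages` values should be treated.

```python
import pandas as pd

# Toy rows, invented for illustration; not a sample of the real dataset.
toy = pd.DataFrame({
    "original_title": ["Sin dejar huella", "Gang", "Le Film", "Shadows"],
    "title": ["Without a Trace", "Gang", "The Film", "Shadows"],
    "original_language": ["es", "hi", "fr", "en"],
    "spoken_languages": [
        "[]",
        "[{'english_name': 'Hindi'}]",
        "[{'english_name': 'English'}]",
        None,
    ],
})

title_differs = toy["original_title"] != toy["title"]
not_english = toy["original_language"] != "en"
# na=False treats a missing value as "does not mention English" instead of
# propagating NaN into the boolean mask.
mentions_english = toy["spoken_languages"].str.contains(
    "english", case=False, na=False
)

bad_titles = title_differs & not_english & ~mentions_english
kept = toy[~bad_titles].reset_index(drop=True)

print(bad_titles.tolist())
print(kept["title"].tolist())
```

Only the first toy row meets all three conditions, so it is the only one removed.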


In [50]:
df = df[~bad_titles].copy()  # copy so later column assignment does not warn
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,False,5.100,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,False,0.000,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,花樣年華,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.900,46,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1675,False,/8tyq1kXM3YQmu3obW6LxWm5TVRO.jpg,,0,"[{'id': 28, 'name': 'Action'}, {'id': 36, 'nam...",,605153,tt9851854,te,మేజర్,Based on the life of real-life Hero Major Sand...,19.029,/sJOfJuyQVZPwNQ8g21Qv0lojQhC.jpg,"[{'id': 69124, 'logo_path': None, 'name': 'G. ...","[{'iso_3166_1': 'IN', 'name': 'India'}]",2022-06-03,0,149,"[{'english_name': 'Telugu', 'iso_639_1': 'te',...",Released,Jaan Doonga Desh Nahi,Major,False,8.233,15,
1676,False,,,0,"[{'id': 80, 'name': 'Crime'}]",,969840,tt9854058,en,Shadows,A young low-level drug dealer is reunited with...,0.600,/2HaAOGM1EmiSwsJrdq1RNhYehce.jpg,[],[],2022-05-13,0,101,[],Released,Family Is The Last Line Of Defense,Shadows,False,0.000,0,
1677,False,,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 10749, 'n...",,796955,tt9893158,en,Clowning,"With his girlfriend pregnant, Dante, a pacifis...",3.136,/xppIANX9DQoRYg3FlNCifDYuFwP.jpg,"[{'id': 109533, 'logo_path': '/xtQJYJg54jp5QVS...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-03-13,0,96,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Warm nights... Feels like death in the sand du...,Clowning,False,8.000,1,
1678,False,/jX5XGqJUTzvpta2RjcX6pMZqxk5.jpg,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 80, 'n...",,606303,tt9893160,en,No Way Out,"Nick, a talented photographer who is new to Lo...",18.247,/df9pAqtYzM40llo9Joxy2ftqSrP.jpg,"[{'id': 13238, 'logo_path': '/kDNZz8imH866Mezx...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-08-12,0,89,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Find what you love and let it kill you.,No Way Out,False,3.000,3,


In [51]:
df['revenue'] = df['revenue'].astype(float)

In [52]:
df = df.reset_index(drop=True)
df

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,The Fantasticks,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0.0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,False,5.500,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,For the Cause,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0.0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,False,5.100,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,Gang,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0.0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,False,0.000,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,花樣年華,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953.0,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,Chinese Coffee,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0.0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,False,6.900,46,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62002,False,/8tyq1kXM3YQmu3obW6LxWm5TVRO.jpg,,0,"[{'id': 28, 'name': 'Action'}, {'id': 36, 'nam...",,605153,tt9851854,te,మేజర్,Based on the life of real-life Hero Major Sand...,19.029,/sJOfJuyQVZPwNQ8g21Qv0lojQhC.jpg,"[{'id': 69124, 'logo_path': None, 'name': 'G. ...","[{'iso_3166_1': 'IN', 'name': 'India'}]",2022-06-03,0.0,149,"[{'english_name': 'Telugu', 'iso_639_1': 'te',...",Released,Jaan Doonga Desh Nahi,Major,False,8.233,15,
62003,False,,,0,"[{'id': 80, 'name': 'Crime'}]",,969840,tt9854058,en,Shadows,A young low-level drug dealer is reunited with...,0.600,/2HaAOGM1EmiSwsJrdq1RNhYehce.jpg,[],[],2022-05-13,0.0,101,[],Released,Family Is The Last Line Of Defense,Shadows,False,0.000,0,
62004,False,,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 10749, 'n...",,796955,tt9893158,en,Clowning,"With his girlfriend pregnant, Dante, a pacifis...",3.136,/xppIANX9DQoRYg3FlNCifDYuFwP.jpg,"[{'id': 109533, 'logo_path': '/xtQJYJg54jp5QVS...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-03-13,0.0,96,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Warm nights... Feels like death in the sand du...,Clowning,False,8.000,1,
62005,False,/jX5XGqJUTzvpta2RjcX6pMZqxk5.jpg,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 80, 'n...",,606303,tt9893160,en,No Way Out,"Nick, a talented photographer who is new to Lo...",18.247,/df9pAqtYzM40llo9Joxy2ftqSrP.jpg,"[{'id': 13238, 'logo_path': '/kDNZz8imH866Mezx...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-08-12,0.0,89,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Find what you love and let it kill you.,No Way Out,False,3.000,3,


In [53]:
schema = get_schema(df)
schema['title'] = Text()
schema['original_title'] = Text()

In [54]:
schema

{'adult': Boolean(),
 'backdrop_path': String(length=33),
 'belongs_to_collection': String(length=198),
 'genres': String(length=257),
 'homepage': String(length=211),
 'imdb_id': String(length=11),
 'original_language': String(length=3),
 'original_title': Text(),
 'overview': String(length=1001),
 'popularity': Float(),
 'poster_path': String(length=33),
 'production_companies': String(length=2944),
 'production_countries': String(length=1061),
 'release_date': String(length=11),
 'revenue': Float(),
 'spoken_languages': String(length=943),
 'status': String(length=16),
 'tagline': String(length=257),
 'title': Text(),
 'video': Boolean(),
 'vote_average': Float(),
 'certification': String(length=32)}

In [55]:
df_for_db = df.drop(columns=['title','original_title'])
df_for_db

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,certification
0,False,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127,tt0113026,en,Two rural teens sing and dance their way throu...,2.289,/hfO64mXz3DgUxkBVU7no2UWRP7x.jpg,"[{'id': 60, 'logo_path': '/2eqFolQI0NLL7ExZts5...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-22,0.0,86,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,False,5.500,22,
1,False,,,0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977,tt0113092,en,Earth is in a state of constant war and two co...,3.133,/h9bWO13nWRGZJo4XVPiElXyrRMU.jpg,"[{'id': 925, 'logo_path': '/dIb9hjXNOkgxu4kBWd...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-11-15,0.0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,False,5.100,8,
2,False,,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869,tt0116391,hi,"After falling prey to underworld, four friends...",1.091,/yB5wRu4uyXXwZA3PEj8cITu0xt3.jpg,[],"[{'iso_3166_1': 'IN', 'name': 'India'}]",2000-04-14,0.0,152,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,False,0.000,0,
3,False,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843,tt0118694,cn,"Hong Kong, 1962: Chow Mo-Wan and Su Li-Zhen mo...",22.892,/iYypPT4bhqXfq1b6EnmxvRt6b2Y.jpg,"[{'id': 539, 'logo_path': None, 'name': 'Block...","[{'iso_3166_1': 'HK', 'name': 'Hong Kong'}]",2000-09-29,12854953.0,99,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",False,8.103,1948,PG
4,False,,,0,"[{'id': 18, 'name': 'Drama'}]",,49511,tt0118852,en,"When Harry Levine, an aging, unsuccessful Gree...",3.913,/nZGWnSuf1FIuzyEuMRZHHZWViAp.jpg,"[{'id': 1596, 'logo_path': None, 'name': 'Shoo...","[{'iso_3166_1': 'US', 'name': 'United States o...",2000-09-02,0.0,99,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,False,6.900,46,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62002,False,/8tyq1kXM3YQmu3obW6LxWm5TVRO.jpg,,0,"[{'id': 28, 'name': 'Action'}, {'id': 36, 'nam...",,605153,tt9851854,te,Based on the life of real-life Hero Major Sand...,19.029,/sJOfJuyQVZPwNQ8g21Qv0lojQhC.jpg,"[{'id': 69124, 'logo_path': None, 'name': 'G. ...","[{'iso_3166_1': 'IN', 'name': 'India'}]",2022-06-03,0.0,149,"[{'english_name': 'Telugu', 'iso_639_1': 'te',...",Released,Jaan Doonga Desh Nahi,False,8.233,15,
62003,False,,,0,"[{'id': 80, 'name': 'Crime'}]",,969840,tt9854058,en,A young low-level drug dealer is reunited with...,0.600,/2HaAOGM1EmiSwsJrdq1RNhYehce.jpg,[],[],2022-05-13,0.0,101,[],Released,Family Is The Last Line Of Defense,False,0.000,0,
62004,False,,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 10749, 'n...",,796955,tt9893158,en,"With his girlfriend pregnant, Dante, a pacifis...",3.136,/xppIANX9DQoRYg3FlNCifDYuFwP.jpg,"[{'id': 109533, 'logo_path': '/xtQJYJg54jp5QVS...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-03-13,0.0,96,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Warm nights... Feels like death in the sand du...,False,8.000,1,
62005,False,/jX5XGqJUTzvpta2RjcX6pMZqxk5.jpg,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 80, 'n...",,606303,tt9893160,en,"Nick, a talented photographer who is new to Lo...",18.247,/df9pAqtYzM40llo9Joxy2ftqSrP.jpg,"[{'id': 13238, 'logo_path': '/kDNZz8imH866Mezx...","[{'iso_3166_1': 'US', 'name': 'United States o...",2022-08-12,0.0,89,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Find what you love and let it kill you.,False,3.000,3,


In [56]:
df_for_db.to_sql('tmdb_data_aab',engine, index=False, 
                            if_exists='replace',dtype=get_schema(df_for_db))
#                           method='multi')

62007

In [57]:
# ## loop through adding one column at a time and record which columns raise errors
# good_cols = [*cols_to_keep]
# bad_cols = []
# all_cols = df.drop(columns=cols_to_keep).columns

# for col in all_cols:
#     print(f"- Adding {col}")
#     try: 
#         cols_to_try = [*good_cols, col]
#         df_filtered = df[ cols_to_try]
#         schema= get_schema(df_filtered)
#         ## Primary key is imdb_id
#         df_filtered.to_sql('tmdb_data_aab',engine, index=False,
#                             if_exists='replace',#dtype=schema,
#                           method='multi')
        
#         # append col name to good_cols if no error
#         good_cols.append(col)
#     except Exception as e:
#         print("   - ERROR")
#         bad_cols.append({col:e})
# bad_cols

In [58]:
engine.execute('ALTER TABLE tmdb_data_aab ADD PRIMARY KEY (`imdb_id`);')

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x1f0010ca160>

In [59]:
q = """SELECT * FROM tmdb_data_aab LIMIT 5"""
pd.read_sql(q,engine)

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,certification
0,0,/ab5yL8zgRotrICzGbEl10z24N71.jpg,,48000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 14, 'nam...",,11232,tt0035423,en,When her scientist ex-boyfriend discovers a po...,15.993,/mUvikzKJJSg9khrVdxK8kg3TMHA.jpg,"[{'id': 14, 'logo_path': '/m6AHu84oZQxvq7n1rsv...","[{'iso_3166_1': 'US', 'name': 'United States o...",2001-12-25,76019000.0,118,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,If they lived in the same century they'd be pe...,0,6.317,1137,PG-13
1,0,/fw5tsNib4QZBEw18xmebpVe3WZ8.jpg,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",http://poetastros.com/el-tango-del-viudo/,602986,tt0062336,es,"A man whose wife has committed suicide, appea...",1.231,/yzbqP9woGq2wGUJh0DzVXlr3Th7.jpg,"[{'id': 96241, 'logo_path': None, 'name': 'Poe...","[{'iso_3166_1': 'CL', 'name': 'Chile'}]",2020-02-21,0.0,63,"[{'english_name': 'Spanish', 'iso_639_1': 'es'...",Released,,0,5.3,3,
2,0,/zjG95oDnBcFKMPgBEmmuNVOMC90.jpg,,12000000,"[{'id': 18, 'name': 'Drama'}]",https://www.netflix.com/title/80085566,299782,tt0069049,en,"Surrounded by fans and skeptics, grizzled dire...",8.22,/kFky1paYEfHxfCYByEc9g7gn6Zk.jpg,"[{'id': 7573, 'logo_path': None, 'name': ""Les ...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2018-11-02,0.0,122,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,40 years in the making,0,6.694,157,R
3,0,,,350000,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",,29163,tt0088751,en,Using soundtracks and extensive footage from m...,1.672,/aYbeNeNID1wLBp9l214w8CU00xd.jpg,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2005-04-22,0.0,100,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,0,3.4,5,
4,0,,,187,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,431608,tt0094859,en,An outrageous social comedy about a New York r...,0.848,,[],"[{'iso_3166_1': 'US', 'name': 'United States o...",2016-10-28,0.0,74,[],Released,An outrageous social comedy about a New York r...,0,0.0,0,
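
The `ALTER TABLE ... ADD PRIMARY KEY` step above only succeeds if every `imdb_id` is present and unique. A small pre-check, sketched here in plain Python, can surface problems before the statement runs. The function name `primary_key_problems` is hypothetical, and the sample list deliberately repeats one id and includes a missing value purely for illustration.

```python
from collections import Counter


def primary_key_problems(values):
    """Return (null_count, duplicated_values) for a candidate key column."""
    nulls = sum(1 for v in values if v is None or v == "")
    counts = Counter(v for v in values if v not in (None, ""))
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return nulls, duplicates


# Illustrative ids: one deliberate repeat and one missing value.
sample_ids = ["tt0113026", "tt0113092", "tt0113026", None, "tt0118694"]

nulls, duplicates = primary_key_problems(sample_ids)
print(nulls, duplicates)
```

In the notebook the same check could be run on `df_for_db['imdb_id'].tolist()` before writing the table.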


# Final Tables Check

In [60]:
## List every table in the movies database to confirm the required tables exist
q = """SHOW TABLES;"""
pd.read_sql(q, engine)

Unnamed: 0,Tables_in_movies
0,genres
1,title_basics
2,title_genres
3,title_ratings
4,tmdb_data_aab
5,tmdb_data_mvp
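
As a final check, the table list can be compared against the names the project brief asks for. The sketch below hard-codes both sets: `required` is copied from the requirements at the top of the notebook and `found` from the `SHOW TABLES` output above. Whether one of the `tmdb_data_*` tables is meant to stand in for `tmdb_data` is not stated, so the comparison is reported without deciding that.

```python
# Names required by the project brief.
required = {"title_basics", "title_ratings", "title_genres", "genres", "tmdb_data"}

# Names returned by SHOW TABLES in the output above.
found = {
    "genres",
    "title_basics",
    "title_genres",
    "title_ratings",
    "tmdb_data_aab",
    "tmdb_data_mvp",
}

missing = sorted(required - found)
extra = sorted(found - required)

print("missing:", missing)
print("extra:", extra)
```

In the notebook, `found` could instead be built from the query result, for example `set(pd.read_sql(q, engine).iloc[:, 0])`.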
