# A. Project Name:  IMDb Successful Movie.
- **Student Name:** Eduardo Galindez.
- **Coding Dojo Bootcamp:** Data Science.
  - **Stack:** Data Enrichment.
- **Date:** September 29th, 2022.

# B. Project Objective
- For Part C of the project, I will practice applying an ETL process to the previously saved movie data. Specifically, I will prepare the data for a relational database, create a new MySQL database from it, and export that database to a .sql file in the repository using MySQL Workbench.

# C. Project Statement


### Specifications:

Our stakeholder wants me to take the data I have been cleaning and collecting in Parts A & B of the project and create a MySQL database for them.

I should normalize the tables as best I can before adding them to the new database.

- Note: an important exception to their request is that they would like me to keep all of the data from the TMDB API together in 1 table (even though it will not be perfectly normalized).
- I only need to keep the imdb_id, revenue, budget, and certification columns.

### Required Transformation steps:
Normalize Genre:

- Convert the single string of genres from title basics into 2 new tables.
1. title_genres: with the columns:
    - tconst
    - genre_id
2. genres:
    - genre_id
    - genre_name

Discard unnecessary information:

- For the title basics table, drop the following columns:
1. "original_title" (we will use the primary title column instead)
2. "isAdult" ("Adult" will show up in the genres so this is redundant information).
3. "titleType" (every row will be a movie).
4. "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
- Do not include the title_akas table in your SQL database.
    - You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.

### MySQL Database Requirements
Use sqlalchemy with pandas to execute your SQL queries inside your notebook.

Create a new database on your MySQL server and call it "movies".

Make sure to have the following tables in your "movies" database:

- title_basics
- title_ratings
- title_genres
- genres
- tmdb_data

Make sure to set a Primary Key for each table that isn't a joiner table (e.g. title_genres is a joiner table).

After creating each table, show the first 5 rows of that table using a SQL query.

Make sure to run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.
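
The primary-key and preview requirements above can be written out as plain SQL strings before they are sent to the server. The following is a minimal sketch, not the notebook's actual code: the helper names `add_primary_key_sql` and `preview_sql` are hypothetical, and the choice of `tconst`, `genre_id`, and `imdb_id` as key columns is an assumption based on the table descriptions in this section.

```python
def add_primary_key_sql(table: str, column: str) -> str:
    """Build a MySQL statement that makes `column` the primary key of `table`."""
    return f"ALTER TABLE `{table}` ADD PRIMARY KEY (`{column}`);"


def preview_sql(table: str, n: int = 5) -> str:
    """Build a query that returns the first `n` rows of `table`."""
    return f"SELECT * FROM `{table}` LIMIT {n};"


# Assumed key column per table; title_genres is a joiner table, so it is left out.
primary_keys = {
    "title_basics": "tconst",
    "title_ratings": "tconst",
    "genres": "genre_id",
    "tmdb_data": "imdb_id",
}

pk_statements = [add_primary_key_sql(t, c) for t, c in primary_keys.items()]
preview_statements = [preview_sql(t) for t in [*primary_keys, "title_genres"]]

for statement in pk_statements + preview_statements:
    print(statement)
```

Each string could then be executed through the SQLAlchemy engine (or passed to `pd.read_sql` for the previews). MySQL rejects a primary key on a `TEXT` column unless a key length is given, so key columns such as `tconst` would need to be stored as `VARCHAR` (SQLAlchemy `String`) for the `ALTER TABLE` statements to succeed.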

### Deliverable:

Submit a link to our GitHub repository containing the Jupyter Notebook file.

# D. Project Development

## 1.- Libraries

In [1]:
# Libraries.
import numpy as np
import pandas as pd

from sqlalchemy.types import Integer, String, Text
from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists

## 2.-  Loading Data

In [2]:
# Load Title Basics dataframe from Part A:
basics_df = pd.read_csv('./Data/title_basics_combined.csv.gz')
basics_df.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery"
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance"
2,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama"
3,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action
4,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama


In [3]:
# Load Title Ratings dataframe from Part A:
title_ratings = pd.read_csv('./Data/title_ratings_combined.csv.gz')
title_ratings.head(5)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0139422,5.7,122
1,tt0139426,4.8,1258
2,tt0139428,6.3,9
3,tt0139436,7.7,19
4,tt0139438,5.2,10


In [4]:
# Load TMDB API dataframe from Part B:
tmbd_data = pd.read_csv('./Data/tmdb_results_combined.csv.gz')
tmbd_data.head(5)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certifcation
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.1,8.0,
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,0.0,0.0,
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.11,1984.0,PG
4,tt0118852,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,0.0,99.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,0.0,6.851,47.0,R


## 3.-  Data Transformation

In [5]:
# Check genres in Basics.
basics_df['genres']

0                   Comedy,Mystery
1                  Musical,Romance
2           Action,Adventure,Drama
3                           Action
4                            Drama
                   ...            
79853                        Drama
79854    Action,Adventure,Thriller
79855                     Thriller
79856                Drama,History
79857                        Drama
Name: genres, Length: 79858, dtype: object

In [6]:
# Convert these strings into lists of strings.
basics_df['genres_split'] = basics_df['genres'].str.split(',')
basics_df

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery","[Comedy, Mystery]"
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance","[Musical, Romance]"
2,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama","[Action, Adventure, Drama]"
3,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action,[Action]
4,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama,[Drama]
...,...,...,...,...,...,...,...,...,...,...
79853,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama,[Drama]
79854,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"
79855,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020.0,,84,Thriller,[Thriller]
79856,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History","[Drama, History]"


In [7]:
# Let's use explode to separate the list of genres into new rows
exploded_genres = basics_df.explode('genres_split')
exploded_genres

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery",Comedy
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery",Mystery
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance",Musical
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance",Romance
2,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama",Action
...,...,...,...,...,...,...,...,...,...,...
79854,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller",Thriller
79855,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020.0,,84,Thriller,Thriller
79856,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History",Drama
79856,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History",History


In [8]:
# Let's use .unique() to get the unique genres.
unique_genres = sorted(exploded_genres['genres_split'].unique())
unique_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [9]:
# Create a new title_genres table.
title_genres = exploded_genres[['tconst', 'genres_split']].copy()
title_genres

Unnamed: 0,tconst,genres_split
0,tt0108549,Comedy
0,tt0108549,Mystery
1,tt0113026,Musical
1,tt0113026,Romance
2,tt0113092,Action
...,...,...
79854,tt9916190,Thriller
79855,tt9916270,Thriller
79856,tt9916362,Drama
79856,tt9916362,History


In [10]:
# Let's replace string genres with integers.
genres_int = range(len(unique_genres))
genre_map = dict(zip(unique_genres, genres_int))
genre_map

{'Action': 0,
 'Adult': 1,
 'Adventure': 2,
 'Animation': 3,
 'Biography': 4,
 'Comedy': 5,
 'Crime': 6,
 'Drama': 7,
 'Family': 8,
 'Fantasy': 9,
 'Game-Show': 10,
 'History': 11,
 'Horror': 12,
 'Music': 13,
 'Musical': 14,
 'Mystery': 15,
 'News': 16,
 'Reality-TV': 17,
 'Romance': 18,
 'Sci-Fi': 19,
 'Short': 20,
 'Sport': 21,
 'Talk-Show': 22,
 'Thriller': 23,
 'War': 24,
 'Western': 25}

In [11]:
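# Attempt to replace the string genres with the new integer ids.
# Note: 'genres_split' holds lists, so .replace(genre_map) leaves the values unchanged (see the output).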
basics_df['genre_name'] = basics_df['genres_split'].replace(genre_map)
basics_df = basics_df.drop(columns='genres_split')
basics_df.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre_name
0,tt0108549,movie,West from North Goes South,West from North Goes South,0,2004.0,,96,"Comedy,Mystery","[Comedy, Mystery]"
1,tt0113026,movie,The Fantasticks,The Fantasticks,0,2000.0,,86,"Musical,Romance","[Musical, Romance]"
2,tt0113092,movie,For the Cause,For the Cause,0,2000.0,,100,"Action,Adventure,Drama","[Action, Adventure, Drama]"
3,tt0114447,movie,The Silent Force,The Silent Force,0,2001.0,,90,Action,[Action]
4,tt0115937,movie,Consequence,Consequence,0,2000.0,,91,Drama,[Drama]
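
In the output above the `genre_name` column still holds lists: `.replace(genre_map)` looks for cells equal to a dictionary key, and each cell here is a list rather than a single genre string. One way to obtain integer ids is to apply the mapping to the exploded, one-genre-per-row table instead. The sketch below uses a small stand-in frame built from two of the rows shown earlier; the variable names are illustrative, and the resulting ids differ from the full `genre_map` because only three genres are present.

```python
import pandas as pd

# Stand-in for basics_df, using two sample rows from the notebook output.
basics = pd.DataFrame({
    "tconst": ["tt0108549", "tt0114447"],
    "genres": ["Comedy,Mystery", "Action"],
})

# Split the comma-separated string and put each genre on its own row.
exploded = (
    basics.assign(genre_name=basics["genres"].str.split(","))
    .explode("genre_name")
    .reset_index(drop=True)
)

# Assign an integer id to every distinct genre, in alphabetical order.
unique_genres = sorted(exploded["genre_name"].unique())
genre_map = {name: genre_id for genre_id, name in enumerate(unique_genres)}

# Joiner table: one row per (movie, genre) pair, with the genre as an id.
title_genres = pd.DataFrame({
    "tconst": exploded["tconst"],
    "genre_id": exploded["genre_name"].map(genre_map),
})

# Lookup table: genre_id -> genre_name.
genres = pd.DataFrame({
    "genre_id": list(genre_map.values()),
    "genre_name": list(genre_map.keys()),
})

print(title_genres)
print(genres)
```

Because each exploded cell is a single string, `.map(genre_map)` can look it up directly, which gives the two-column `title_genres` layout (`tconst`, `genre_id`) described in the specifications.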


In [12]:
# Convert the genre map dictionary into a dataframe
genres = pd.DataFrame({'genre_name': list(genre_map.keys()),
                       'genre_ID': list(genre_map.values())})
genres.head()

Unnamed: 0,genre_name,genre_ID
0,Action,0
1,Adult,1
2,Adventure,2
3,Animation,3
4,Biography,4


In [13]:
# Discard unnecessary information in Title Basics.
basics_df = basics_df.drop(columns=['originalTitle', 'isAdult', 'endYear',
                                   'titleType', 'genres'])
basics_df

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,genre_name
0,tt0108549,West from North Goes South,2004.0,96,"[Comedy, Mystery]"
1,tt0113026,The Fantasticks,2000.0,86,"[Musical, Romance]"
2,tt0113092,For the Cause,2000.0,100,"[Action, Adventure, Drama]"
3,tt0114447,The Silent Force,2001.0,90,[Action]
4,tt0115937,Consequence,2000.0,91,[Drama]
...,...,...,...,...,...
79853,tt9916170,The Rehearsal,2019.0,51,[Drama]
79854,tt9916190,Safeguard,2020.0,95,"[Action, Adventure, Thriller]"
79855,tt9916270,Il talento del calabrone,2020.0,84,[Thriller]
79856,tt9916362,Coven,2020.0,92,"[Drama, History]"


In [14]:
# Let's convert 'startYear' to integer.
basics_df['startYear'] = basics_df['startYear'].astype('int64')
title_basics = basics_df.copy()
title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79858 entries, 0 to 79857
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          79858 non-null  object
 1   primaryTitle    79858 non-null  object
 2   startYear       79858 non-null  int64 
 3   runtimeMinutes  79858 non-null  int64 
 4   genre_name      79858 non-null  object
dtypes: int64(2), object(3)
memory usage: 3.0+ MB


## 4.-  MySQL Database

In [15]:
# Connect to MySQL.
username = 'root' 
password = 'root' 

movies = f'mysql+pymysql://{username}:{password}@localhost/PartC_Project3'
engine = create_engine(movies, pool_size=10, max_overflow=20, pool_pre_ping=True)
engine

Engine(mysql+pymysql://root:***@localhost/PartC_Project3)
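
The connection string above embeds the username and password directly in the notebook. A common alternative is to read them from environment variables and URL-escape the password before building the string. This is a minimal sketch under assumptions: the variable names `MYSQL_USER` and `MYSQL_PASSWORD` and the helper `build_mysql_url` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(database: str, host: str = "localhost") -> str:
    """Assemble a mysql+pymysql connection URL from environment variables."""
    user = os.environ.get("MYSQL_USER", "root")
    password = os.environ.get("MYSQL_PASSWORD", "")
    # quote_plus escapes characters such as '@' or '/' that would otherwise
    # be misread as URL delimiters.
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}/{database}"


# Example values, set here only so the sketch runs on its own.
os.environ["MYSQL_USER"] = "root"
os.environ["MYSQL_PASSWORD"] = "kx@jj5/g"

url = build_mysql_url("PartC_Project3")
print(url)
```

The returned string could be passed to `create_engine` in place of the hard-coded one, which keeps the credentials out of the repository.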

In [16]:
# Display the connection string (note: this prints the password in plain text and does not open a connection).
movies

'mysql+pymysql://root:root@localhost/PartC_Project3'

In [17]:
# Create the database if it does not exist yet.
if not database_exists(movies):
    create_database(movies)
else:
    print('The database already exists.')

The database already exists.


In [18]:
# Verify that the database was created.
database_exists(movies)

True

### 4.1.- Tables creation

In [19]:
# Check the dtypes of our dataframes.
print('\033[1m Title Basic Data:\033[0;0m\n', title_basics.dtypes)
print('\n\033[1m Title Ratings Data:\033[0;0m\n', title_ratings.dtypes)
print('\n\033[1m Title Genres Data:\033[0;0m\n', title_genres.dtypes)
print('\n\033[1m Genres Data:\033[0;0m\n', genres.dtypes)
print('\n\033[1m TMBD Data:\033[0;0m\n', tmbd_data.dtypes)

Title Basic Data:
 tconst            object
primaryTitle      object
startYear          int64
runtimeMinutes     int64
genre_name        object
dtype: object

Title Ratings Data:
 tconst            object
averageRating    float64
numVotes           int64
dtype: object

Title Genres Data:
 tconst          object
genres_split    object
dtype: object

Genres Data:
 genre_name    object
genre_ID       int64
dtype: object

TMBD Data:
 imdb_id                   object
adult                    float64
backdrop_path             object
belongs_to_collection     object
budget                   float64
genres                    object
homepage                  object
id                       float64
original_language         object
original_title            object
overview                  object
popularity               float64
poster_path               object
production_companies      object
production_countries      object
release_date    

#### 4.1.1.- Title Basic Data

In [20]:
# Convert to a sql table.
title_basics.to_sql('title_basics', engine, if_exists = 'replace')

OperationalError: (pymysql.err.OperationalError) (1241, 'Operand should contain 1 column(s)')
[SQL: INSERT INTO title_basics (`index`, tconst, `primaryTitle`, `startYear`, `runtimeMinutes`, genre_name) VALUES (%(index)s, %(tconst)s, %(primaryTitle)s, %(startYear)s, %(runtimeMinutes)s, %(genre_name)s)]
[parameters: ({'index': 0, 'tconst': 'tt0108549', 'primaryTitle': 'West from North Goes South', 'startYear': 2004, 'runtimeMinutes': 96, 'genre_name': ['Comedy', 'Mystery']}, {'index': 1, 'tconst': 'tt0113026', 'primaryTitle': 'The Fantasticks', 'startYear': 2000, 'runtimeMinutes': 86, 'genre_name': ['Musical', 'Romance']}, {'index': 2, 'tconst': 'tt0113092', 'primaryTitle': 'For the Cause', 'startYear': 2000, 'runtimeMinutes': 100, 'genre_name': ['Action', 'Adventure', 'Drama']}, {'index': 3, 'tconst': 'tt0114447', 'primaryTitle': 'The Silent Force', 'startYear': 2001, 'runtimeMinutes': 90, 'genre_name': ['Action']}, {'index': 4, 'tconst': 'tt0115937', 'primaryTitle': 'Consequence', 'startYear': 2000, 'runtimeMinutes': 91, 'genre_name': ['Drama']}, {'index': 5, 'tconst': 'tt0116391', 'primaryTitle': 'Gang', 'startYear': 2000, 'runtimeMinutes': 167, 'genre_name': ['Action', 'Crime', 'Drama']}, {'index': 6, 'tconst': 'tt0116628', 'primaryTitle': 'The Incorporated', 'startYear': 2000, 'runtimeMinutes': 86, 'genre_name': ['Action', 'Thriller']}, {'index': 7, 'tconst': 'tt0116991', 'primaryTitle': 'Mariette in Ecstasy', 'startYear': 2009, 'runtimeMinutes': 101, 'genre_name': ['Drama']}  ... displaying 10 of 79858 total bound parameter sets ...  {'index': 79856, 'tconst': 'tt9916362', 'primaryTitle': 'Coven', 'startYear': 2020, 'runtimeMinutes': 92, 'genre_name': ['Drama', 'History']}, {'index': 79857, 'tconst': 'tt9916538', 'primaryTitle': 'Kuambil Lagi Hatiku', 'startYear': 2019, 'runtimeMinutes': 123, 'genre_name': ['Drama']})]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
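
The `OperationalError` above is raised because `genre_name` still contains Python lists, which cannot be bound to a single SQL column. Since genres are meant to live in `title_genres` and `genres`, one option is to drop any list-valued column before calling `to_sql`. The sketch below illustrates this on two sample rows; it writes to an in-memory SQLite database through the standard-library `sqlite3` module purely so that it runs without a MySQL server, which is an assumption of the example and not what the notebook uses.

```python
import sqlite3

import pandas as pd

# Stand-in for title_basics, using two sample rows from the notebook output.
title_basics = pd.DataFrame({
    "tconst": ["tt0108549", "tt0113026"],
    "primaryTitle": ["West from North Goes South", "The Fantasticks"],
    "startYear": [2004, 2000],
    "runtimeMinutes": [96, 86],
    "genre_name": [["Comedy", "Mystery"], ["Musical", "Romance"]],
})

# Find columns that hold Python lists; these cannot be written as SQL values.
list_cols = [
    col for col in title_basics.columns
    if title_basics[col].map(lambda value: isinstance(value, list)).any()
]
clean_basics = title_basics.drop(columns=list_cols)

# In-memory SQLite stands in for the MySQL engine (assumption for this sketch).
conn = sqlite3.connect(":memory:")
clean_basics.to_sql("title_basics", conn, if_exists="replace", index=False)
preview = pd.read_sql("SELECT * FROM title_basics LIMIT 5;", conn)
conn.close()

print(list_cols)
print(preview)
```

With the list column removed, the same `to_sql` call pattern should also work against the MySQL engine, and `index=False` avoids writing the extra `index` column seen in the failing `INSERT` statement.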

#### 4.1.2.- Title Ratings Data

In [21]:
# Convert to a sql table.
title_ratings.to_sql('title_ratings', engine, if_exists = 'replace')

408868

#### 4.1.3.- Title Genres Data

In [22]:
# Calculate max string lengths for object columns
key_len = title_genres['tconst'].fillna('').map(len).max()
title_len = title_genres['genres_split'].fillna('').map(len).max()

# Create a schema dictionary using SQLAlchemy datatype objects.
title_genres_schema = {
    'tconst': String(key_len+1), 
    'genres_split': Text(title_len+1)}

In [23]:
# Save to sql with dtype and index=False
title_genres.to_sql('title_genres',engine, dtype=title_genres_schema,if_exists='replace',index=False)

149518

In [24]:
# 'title_genres' is a joiner table in which 'tconst' repeats, so no primary key is set (statement kept for reference only).
# engine.execute('ALTER TABLE title_genres ADD PRIMARY KEY (`tconst`);')

#### 4.1.4.- Genres Data

In [25]:
# Calculate max string lengths for object columns
genres_title_len = genres['genre_name'].fillna('').map(len).max()

# Create a schema dictionary using SQLAlchemy datatype objects.
genres_schema = {
    'genre_name': Text(genres_title_len+1), 
    'genre_ID':Integer()}

In [26]:
# Save to sql with dtype and index=False
genres.to_sql('genres',engine, dtype=genres_schema,if_exists='replace',index=False)

26

In [27]:
# Set 'genre_ID' column as the primary key.
# engine.execute('ALTER TABLE genres ADD PRIMARY KEY (`genre_ID`);')

#### 4.1.5.- TMDB Data

In [28]:
# Convert to a sql table (note: the table is written as 'tmbd_data', whereas the requirements name it 'tmdb_data').
tmbd_data.to_sql('tmbd_data', engine, if_exists = 'replace')

2495
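
The specifications ask for only `imdb_id`, `revenue`, `budget`, and `certification` from the TMDB data, whereas the cell above writes every column. A possible way to trim the frame first is sketched below on two of the sample rows shown earlier. It assumes the certification column is spelled `certifcation` in the file, as it appears in the `head()` output, and renames it; the frame names `tmdb_raw` and `tmdb_trimmed` are illustrative.

```python
import pandas as pd

# Stand-in for the loaded TMDB frame, using values from two sample rows.
tmdb_raw = pd.DataFrame({
    "imdb_id": ["tt0118694", "tt0118852"],
    "title": ["In the Mood for Love", "Chinese Coffee"],
    "budget": [150000.0, 0.0],
    "revenue": [12854953.0, 0.0],
    "certifcation": ["PG", "R"],  # spelling as it appears in the loaded data
})

# Keep only the columns requested by the stakeholder and fix the column name.
keep_cols = ["imdb_id", "revenue", "budget", "certifcation"]
tmdb_trimmed = tmdb_raw[keep_cols].rename(columns={"certifcation": "certification"})

print(tmdb_trimmed)
```

The trimmed frame could then be written with `to_sql` under the required table name `tmdb_data`, using `index=False` so that `imdb_id` can serve as the primary key.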

### 4.2.- Showing tables

In [29]:
# List the tables in the database.
q = '''
SHOW TABLES;
'''
pd.read_sql(q, engine)

Unnamed: 0,Tables_in_partc_project3
0,genres
1,title_basics
2,title_genres
3,title_ratings
4,tmbd_data
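
The listing above can be compared against the required table names to catch omissions or misspellings. A small sketch follows, with the `found` set copied by hand from the output above (in the notebook it could instead be taken from the `SHOW TABLES` result):

```python
# Tables required by the project specification.
required = {"title_basics", "title_ratings", "title_genres", "genres", "tmdb_data"}

# Table names copied from the SHOW TABLES output.
found = {"genres", "title_basics", "title_genres", "title_ratings", "tmbd_data"}

missing = sorted(required - found)
unexpected = sorted(found - required)

print("Missing:", missing)
print("Unexpected:", unexpected)
```

Here the check reports `tmdb_data` as missing and `tmbd_data` as unexpected, which points to the transposed letters in the table name used when writing the TMDB data.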


# E. Conclusions

- Xxxxx