# ETL Project

## Extract

From Kaggle.com, we found two data sources in .csv form on related subjects: famous movie catchphrases and titles available on Netflix. Both files have a column holding the title: 'Movie Name' in the catchphrase file and 'title' in the Netflix file.

We read each .csv with the pandas library and assigned it to its own dataframe.

In [1]:
import pandas as pd
from sqlalchemy import create_engine
import datetime
from config import (user, password, host, port, database)

In [2]:
# read catchphrase.csv file
csv_file = "catchphrase.csv"

# store .csv into dataframe
catchphrases_df = pd.read_csv(csv_file)
catchphrases_df.head()

Unnamed: 0,Catchphrase,Movie Name,Context
0,"Beetlejuice, Beetlejuice, Beetlejuice!",BEETLEJUICE,"Lydia, summoning Beetlejuice"
1,It's showtime!,BEETLEJUICE,"Beetlejuice, being summoned."
2,They're heeeere!,POLTERGEIST,"Carol Anne Freeling, notifying her parents of..."
3,Hey you guys!,THE GOONIES,"Sloth, calling the attention of the children ..."
4,"Good morning, Vietnam!","GOOD MORNING, VIETNAM",Adrian Cronauer's greeting on his radio show


In [3]:
# read netflix_titles.csv file
csv_file = "netflix_titles.csv"

# store .csv into dataframe
netflix_df = pd.read_csv(csv_file)
netflix_df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...,...,...,...,...
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...


## Transform

To transform the data, we made sure the column names across both dataframes have consistent lowercase snake_case formatting.
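
The same renaming could also be done programmatically instead of listing each column by hand. This is a minimal sketch, assuming the headers contain only letters, spaces, and hyphens; the helper name `to_snake_case` is hypothetical and is not part of the notebook:

```python
import pandas as pd


def to_snake_case(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lowercase snake_case column names."""
    renamed = df.copy()
    renamed.columns = (
        renamed.columns.str.strip()
        .str.lower()
        .str.replace(r"[\s\-]+", "_", regex=True)
    )
    return renamed


# Empty toy frame with the same headers as catchphrase.csv
sample = pd.DataFrame(columns=["Catchphrase", "Movie Name", "Context"])
snake_columns = list(to_snake_case(sample).columns)
print(snake_columns)
```

Note that this turns 'Movie Name' into 'movie_name', so an explicit rename to 'title' would still be needed for the two dataframes to share a column name.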

Because the titles in the original movie catchphrases .csv file were in ALL CAPS, we decided to make the 'title' column in the netflix dataframe match. To do so, we used the str.upper() method.
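
Before relying on the uppercase titles as a join key, it can help to count how many catchphrase titles actually find a match. The sketch below is only a rough illustration: the rows are placeholders rather than values from either file, and it also strips stray whitespace, an extra step the notebook does not perform.

```python
import pandas as pd

# Placeholder rows, invented for illustration
catch = pd.DataFrame({"title": ["MOVIE A", "MOVIE B", "MOVIE C"]})
netflix = pd.DataFrame({"title": ["Movie a ", "Movie c", "Movie d"]})

# Normalize the join key the same way on both sides
catch_keys = catch["title"].str.strip().str.upper()
netflix_keys = netflix["title"].str.strip().str.upper()

# Boolean mask: which catchphrase titles also appear among the Netflix titles
matched = catch_keys.isin(netflix_keys)
unmatched_titles = sorted(catch_keys[~matched].unique())

print(int(matched.sum()), "matched; unmatched:", unmatched_titles)
```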

For good measure, we dropped duplicates in the netflix dataframe because there were so many records. We also dropped a number of columns from each dataframe that we decided were not relevant or particularly useful for this query.
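
One caveat worth keeping in mind: `drop_duplicates()` with no arguments removes only rows that are identical in every column, so rows that differ in any field, such as `show_id`, are all kept. If the goal were one row per title, a `subset` argument would be needed. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: one fully repeated row and one repeated title
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["ALPHA", "BETA", "BETA", "ALPHA"],
})

# Removes only the fully identical row (the second "s2" / "BETA")
exact = toy.drop_duplicates()

# Keeps the first row for each title, regardless of the other columns
by_title = toy.drop_duplicates(subset="title", keep="first")

print(len(toy), len(exact), len(by_title))
```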

Finally, we set the 'title' column as the index of each dataframe. This way, the dataframes carried no integer index of their own, and we could rely on the serial index columns in the schema we created on the SQL end.

In [4]:
# rename columns of catchphrases_df
catchphrases_df = catchphrases_df.rename(columns={'Catchphrase': 'catchphrase',
                                'Movie Name': 'title',
                                'Context': 'context'})
catchphrases_df

Unnamed: 0,catchphrase,title,context
0,"Beetlejuice, Beetlejuice, Beetlejuice!",BEETLEJUICE,"Lydia, summoning Beetlejuice"
1,It's showtime!,BEETLEJUICE,"Beetlejuice, being summoned."
2,They're heeeere!,POLTERGEIST,"Carol Anne Freeling, notifying her parents of..."
3,Hey you guys!,THE GOONIES,"Sloth, calling the attention of the children ..."
4,"Good morning, Vietnam!","GOOD MORNING, VIETNAM",Adrian Cronauer's greeting on his radio show
...,...,...,...
145,Hasta la vista...baby.,TERMINATOR 2: JUDGMENT DAY,"The Terminator, to T-1000, before shattering ..."
146,"That'll do, Pig.",BABE,"Farmer Hogget, to his sheep-pig, Babe, after ..."
147,"It's over, Johnny.",FIRST BLOOD,"Col. Trautman, to John Rambo, telling him his..."
148,"After all, tomorrow is another day.",GONE WITH THE WIND,"Scarlett O'Hara to herself, convinced she can..."


In [5]:
# drop duplicates of netflix_df, just in case.
netflix_df = netflix_df.drop_duplicates()
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [6]:
# change netflix titles to ALL CAPS to be consistent with catchphrases_df
netflix_df['title'] = netflix_df['title'].str.upper()
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [7]:
# drop unwanted columns in netflix_df
netflix_df = netflix_df[['title', 'date_added', 'release_year', 'rating', 'duration']]

# set 'title' as index
netflix_df = netflix_df.set_index('title')
netflix_df.head()

title,date_added,release_year,rating,duration
3%,"August 14, 2020",2020,TV-MA,4 Seasons
7:19,"December 23, 2016",2016,TV-MA,93 min
23:59,"December 20, 2018",2011,R,78 min
9,"November 16, 2017",2009,PG-13,80 min
21,"January 1, 2020",2008,PG-13,123 min


In [8]:
# drop unwanted columns from catchphrases_df
catchphrases_df = catchphrases_df[['title', 'catchphrase']]

# set 'title' as index
catchphrases_df = catchphrases_df.set_index('title')
catchphrases_df.head()

title,catchphrase
BEETLEJUICE,"Beetlejuice, Beetlejuice, Beetlejuice!"
BEETLEJUICE,It's showtime!
POLTERGEIST,They're heeeere!
THE GOONIES,Hey you guys!
"GOOD MORNING, VIETNAM","Good morning, Vietnam!"


## Load

We took the dataframes we created and loaded them into the respective tables we created in PostgreSQL.

In [9]:
# create engine to postgreSQL
engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{database}')
conn = engine.connect()
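
If the password in `config.py` contains characters such as `@` or `/`, interpolating it directly into the URL can produce a string that is parsed incorrectly. The following is a minimal sketch of escaping the credentials with the standard library first; `build_postgres_url` is a hypothetical helper, and the credentials shown are placeholders:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Build a postgresql:// URL with URL-escaped credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values, not real credentials
url = build_postgres_url("etl_user", "p@ss/word", "localhost", 5432, "etl_db")
print(url)
```

The resulting string could then be passed to `create_engine` in place of the f-string above.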

In [10]:
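# look at tables in database
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector also works on 1.4)
from sqlalchemy import inspect
inspect(engine).get_table_names()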

['catchphrases', 'netflix']

In [11]:
# load catchphrases_df into the catchphrases table
catchphrases_df.to_sql(name='catchphrases', con=engine, if_exists='append', index=True)

In [12]:
# load netflix_df into the netflix table
netflix_df.to_sql(name='netflix', con=engine, if_exists='append', index=True)
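
The load step, and the role of the serial key, can be tried without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in. The table definitions are assumptions for illustration (the actual PostgreSQL schema is not shown here), `INTEGER PRIMARY KEY` plays the part of a serial `id` column, and the rows are placeholders. Because 'title' is the dataframe index, `index=True` writes it as an ordinary column while the database fills in `id`.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical schema: an auto-incrementing id plus the dataframe columns
conn.executescript("""
    CREATE TABLE catchphrases (
        id INTEGER PRIMARY KEY,
        title TEXT,
        catchphrase TEXT
    );
    CREATE TABLE netflix (
        id INTEGER PRIMARY KEY,
        title TEXT,
        release_year INTEGER
    );
""")

# Placeholder rows with 'title' as the index, as in the notebook
catchphrases = pd.DataFrame({
    "title": ["MOVIE A", "MOVIE A", "MOVIE B"],
    "catchphrase": ["first line", "second line", "third line"],
}).set_index("title")
netflix = pd.DataFrame({
    "title": ["MOVIE A", "MOVIE C"],
    "release_year": [2001, 2005],
}).set_index("title")

# Append into the existing tables; the database assigns the id values
catchphrases.to_sql("catchphrases", conn, if_exists="append", index=True)
netflix.to_sql("netflix", conn, if_exists="append", index=True)

# Join the two tables on the shared title column
joined = pd.read_sql(
    """
    SELECT c.id, c.title, c.catchphrase, n.release_year
    FROM catchphrases AS c
    JOIN netflix AS n ON n.title = c.title
    ORDER BY c.id
    """,
    conn,
)
print(joined)
conn.close()
```

Only the titles present in both tables survive the inner join, which is why normalizing the titles to a common case earlier matters.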