# Milestone 2

Author: Berkeley Willis  
Class: Data Preparation 540-T303  
Date: 2020-04-13

#### Goals:  
* Load the data and make sure that it will suit any needs I have of the data
* Apply 5 or more data transformation and/or cleansing steps to the flat file data

In [1]:
# First things first, load up the necessary libraries
# Some of these may not be used right away but may come in handy as the project evolves
import pandas as pd
import numpy as np
import re
import datetime

# And I need to just read the flat file that is local as of now
netflix_data = pd.read_csv("./data/netflix_titles.csv")

# Show a quick sample of records
display(netflix_data.head())

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [2]:
# So first things first I am going to do a quick search for duplicates for show_id or title
if netflix_data['show_id'].duplicated().any():
    print("Duplicates have been detected in show_id column")

if netflix_data['title'].duplicated().any():
    print("Duplicates have been detected in title column")

Duplicates have been detected in title column


In [3]:
# Want to dig in a bit more to confirm whether they are actual duplicates or if there is something different
# keep=False marks every row whose title appears more than once
display(netflix_data[ netflix_data['title'].duplicated(keep=False) ].sort_values(by="title").head(n=10))

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5905,80026224,TV Show,Aquarius,,"David Duchovny, Gethin Anthony, Grey Damon, Em...",United States,"June 16, 2017",2016,TV-MA,2 Seasons,"Crime TV Shows, TV Dramas","Amid the turmoil of 1960s LA, two cops and a p..."
3441,80113667,Movie,Aquarius,Kleber Mendonça Filho,"Sonia Braga, Maeve Jinkings, Irandhir Santos, ...","Brazil, France","January 13, 2017",2016,UR,146 min,"Dramas, International Movies",The final holdout in her historic beachside bu...
2159,80204923,Movie,Benji,Brandon Camp,"Gabriel Bateman, Darby Camp, Kiele Sanchez, Je...","United Arab Emirates, United States","March 16, 2018",2018,TV-PG,87 min,"Children & Family Movies, Dramas",A determined dog comes to the rescue and helps...
1969,296682,Movie,Benji,Joe Camp,"Benji, Deborah Walley, Peter Breck, Edgar Buch...",United States,"March 6, 2018",1974,G,86 min,"Children & Family Movies, Classic Movies",After lovable abandoned mutt Benji is adopted ...
5724,70204957,TV Show,Bleach,,"Masakazu Morita, Fumiko Orikasa, Yuki Matsuoka...",Japan,"November 3, 2018",2006,TV-14,3 Seasons,"Anime Series, International TV Shows",After teenager Ichigo Kurosaki acquires superp...
257,80217733,Movie,Bleach,Shinsuke Sato,"Sota Fukushi, Hana Sugisaki, Ryo Yoshizawa, Er...",Japan,"September 14, 2018",2018,TV-14,109 min,"Action & Adventure, International Movies, Sci-...",When high schooler Ichigo is suddenly given re...
3914,80209153,Movie,Blood Money,Lucky McKee,"John Cusack, Ellar Coltrane, Willa Fitzgerald,...",United States,"February 17, 2018",2017,R,85 min,Action & Adventure,Three friends on a wilderness trip find bags f...
2064,80175623,Movie,Blood Money,Vishal Mahadkar,"Manish Chaudhary, Kunal Khemu, Amrita Puri, Sh...",India,"March 27, 2017",2012,TV-14,110 min,"Dramas, International Movies, Romantic Movies",Ambitious Kunal takes a job in the diamond bus...
2699,352989,Movie,Carrie,Brian De Palma,"Sissy Spacek, Piper Laurie, Amy Irving, Willia...",United States,"June 1, 2019",1976,R,98 min,"Classic Movies, Cult Movies, Horror Movies",An outcast teen with telekinetic ability lashe...
4185,60029461,Movie,Carrie,David Carson,"Angela Bettis, Patricia Clarkson, Rena Sofer, ...","United States, Canada","December 31, 2019",2002,TV-14,132 min,Horror Movies,This remake of the classic 1976 horror flick f...
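
The sample above suggests the repeated titles are distinct works rather than true duplicates. A minimal sketch of how that could be checked more systematically, assuming the combination of `title`, `type`, and `release_year` is enough to tell two works apart (that key is my assumption, nothing in the data guarantees it). The tiny frame below just re-types a few rows from the output above:

```python
import pandas as pd

# A few rows re-typed from the output above (only the columns needed here)
sample = pd.DataFrame({
    "show_id": [80026224, 80113667, 80204923, 296682],
    "type": ["TV Show", "Movie", "Movie", "Movie"],
    "title": ["Aquarius", "Aquarius", "Benji", "Benji"],
    "release_year": [2016, 2016, 2018, 1974],
})

# Title alone flags every row in this sample as a repeat
title_repeats = sample["title"].duplicated(keep=False)

# Assumed composite key: a true duplicate would match on all three columns
key_cols = ["title", "type", "release_year"]
true_duplicates = sample[sample.duplicated(subset=key_cols, keep=False)]

print("Rows sharing a title:", int(title_repeats.sum()))
print("Rows sharing title, type, and release_year:", len(true_duplicates))
```

On the full data the same `duplicated(subset=...)` call could be run against `netflix_data` directly; an empty result would back up the reading that the title collisions are harmless.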


In [4]:
# First going to rename some of the headers to identify some of it as specifically netflix data or for clarity
netflix_data = netflix_data.rename(columns={
        "show_id":"netflix_id", "type":"media_type","cast":"netflix_cast","date_added":"netflix_date_str",
        "listed_in":"netflix_categories", "duration":"netflix_duration", "description":"netflix_desc"
    })

# Going to look at it real quick to check changes
display(netflix_data.head())

Unnamed: 0,netflix_id,media_type,title,director,netflix_cast,country,netflix_date_str,release_year,rating,netflix_duration,netflix_categories,netflix_desc
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [5]:
# The repeated titles seem to differ by type of media, director, release year, and country,
# so they concern me less, especially since their IDs differ as well

# Something I want to do here then is some small sanitization based on their titles:
# strip the punctuation, trim all whitespace, and throw them all to lower case

# To simplify, creating a quick function to sanitize any given string as much as possible for join operations
def sanitize_str(dirty_string):
    # Remove special characters, trim whitespace, lowercase, collapse repeated spaces, then sub in '_' chars
    # (raw string so that \s is read as a regex escape and not a Python string escape)
    no_specials = re.sub(r'[^A-Za-z0-9\s]+', '', dirty_string)
    return re.sub(' ', '_', re.sub(' +', ' ', no_specials.strip().lower()))

# Going to run it for the titles and for media types
netflix_data['clean_title'] = netflix_data.title.apply(sanitize_str)
# Going to also run it for media_type to make sure that we can handle a movie and show having the same name
netflix_data['clean_media_type'] = netflix_data.media_type.apply(sanitize_str)

# Check out the changes
display(netflix_data.head())

Unnamed: 0,netflix_id,media_type,title,director,netflix_cast,country,netflix_date_str,release_year,rating,netflix_duration,netflix_categories,netflix_desc,clean_title,clean_media_type
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,norm_of_the_north_king_sized_adventure,movie
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,jandino_whatever_it_takes,movie
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob...",transformers_prime,tv_show
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...,transformers_robots_in_disguise,tv_show
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,realityhigh,movie
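
One thing to watch with this kind of sanitization: the character class only keeps ASCII letters and digits, so accented letters are dropped outright rather than mapped to a plain equivalent. A small sketch of the behavior, plus one possible workaround. `sanitize_str_ascii` is a hypothetical helper of mine, not part of the notebook, and it assumes Unicode decomposition is an acceptable way to transliterate names:

```python
import re
import unicodedata

def sanitize_str(dirty_string):
    # Same logic as the notebook cell: strip specials, trim, lowercase, underscores
    no_specials = re.sub(r'[^A-Za-z0-9\s]+', '', dirty_string)
    return re.sub(' ', '_', re.sub(' +', ' ', no_specials.strip().lower()))

def sanitize_str_ascii(dirty_string):
    # Hypothetical variant: decompose accented letters (e.g. "ç" becomes "c" plus a
    # combining mark), drop the non-ASCII marks, then sanitize as usual
    decomposed = unicodedata.normalize("NFKD", dirty_string)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return sanitize_str(ascii_only)

print(sanitize_str("Norm of the North: King Sized Adventure"))
print(sanitize_str("Kleber Mendonça Filho"))        # the "ç" is removed entirely
print(sanitize_str_ascii("Kleber Mendonça Filho"))  # the "ç" becomes "c"
```

Whichever version is used, it has to be applied identically to both sides of any later join, otherwise the keys will not line up.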


In [6]:
# Now going to convert the date strings into proper datetime values, because as they stand now they are plain strings
netflix_data['netflix_date_added'] = pd.to_datetime(netflix_data.netflix_date_str.str.strip(), format="%B %d, %Y")

# Show a bit of the data to show the change
display(netflix_data.head())

Unnamed: 0,netflix_id,media_type,title,director,netflix_cast,country,netflix_date_str,release_year,rating,netflix_duration,netflix_categories,netflix_desc,clean_title,clean_media_type,netflix_date_added
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,norm_of_the_north_king_sized_adventure,movie,2019-09-09
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,jandino_whatever_it_takes,movie,2016-09-09
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob...",transformers_prime,tv_show,2018-09-08
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...,transformers_robots_in_disguise,tv_show,2018-09-08
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,realityhigh,movie,2017-09-08
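
With an explicit `format` and the default error handling, `pd.to_datetime` raises on the first string that does not match, which is fine as long as every row is clean. A hedged sketch of a more forgiving variant, assuming it is acceptable to turn unparseable values into `NaT` and look at them afterwards. The sample strings, including the deliberately bad one, are made up for illustration:

```python
import pandas as pd

# Made-up sample: a clean value, one with stray whitespace, a bad value, and a missing one
date_strs = pd.Series(["September 9, 2019", " September 8, 2018", "not a date", None])

# Same format string as the cell above; errors="coerce" yields NaT instead of raising
parsed = pd.to_datetime(date_strs.str.strip(), format="%B %d, %Y", errors="coerce")

# Rows that had a value but failed to parse are worth a manual look
failed = date_strs[parsed.isna() & date_strs.notna()]

print(parsed)
print("Unparseable values:", failed.tolist())
```

The `.str.strip()` matters here: the format string has no allowance for leading whitespace, so the second value would not parse without it.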


In [7]:
# For now let's isolate the netflix records that are movies, since that's what we are most interested in
netflix_movie_data = netflix_data[ netflix_data.clean_media_type=="movie" ]

# Reindex for other operations
netflix_movie_data = netflix_movie_data.reset_index(drop=True)

print("We started with {} records, and only have {} movies".format(netflix_data.shape[0],netflix_movie_data.shape[0]))

display(netflix_movie_data.head())

We started with 6234 records, and only have 4265 movies


Unnamed: 0,netflix_id,media_type,title,director,netflix_cast,country,netflix_date_str,release_year,rating,netflix_duration,netflix_categories,netflix_desc,clean_title,clean_media_type,netflix_date_added
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,norm_of_the_north_king_sized_adventure,movie,2019-09-09
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,jandino_whatever_it_takes,movie,2016-09-09
2,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,realityhigh,movie,2017-09-08
3,70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada","September 8, 2017",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f...",automata,movie,2017-09-08
4,80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,"September 8, 2017",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...,fabrizio_copano_solo_pienso_en_mi,movie,2017-09-08


In [8]:
# Lastly I will change the netflix_duration to a timedelta so that we can possibly use this value later
netflix_movie_data['netflix_timedelta'] = netflix_movie_data.netflix_duration.apply(pd.to_timedelta)

# And show the data table again
display(netflix_movie_data.head())

Unnamed: 0,netflix_id,media_type,title,director,netflix_cast,country,netflix_date_str,release_year,rating,netflix_duration,netflix_categories,netflix_desc,clean_title,clean_media_type,netflix_date_added,netflix_timedelta
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,norm_of_the_north_king_sized_adventure,movie,2019-09-09,01:30:00
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,jandino_whatever_it_takes,movie,2016-09-09,01:34:00
2,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...,realityhigh,movie,2017-09-08,01:39:00
3,70304989,Movie,Automata,Gabe Ibáñez,"Antonio Banderas, Dylan McDermott, Melanie Gri...","Bulgaria, United States, Spain, Canada","September 8, 2017",2014,R,110 min,"International Movies, Sci-Fi & Fantasy, Thrillers","In a dystopian future, an insurance adjuster f...",automata,movie,2017-09-08,01:50:00
4,80164077,Movie,Fabrizio Copano: Solo pienso en mi,"Rodrigo Toro, Francisco Schultz",Fabrizio Copano,Chile,"September 8, 2017",2017,TV-MA,60 min,Stand-Up Comedy,Fabrizio Copano takes audience participation t...,fabrizio_copano_solo_pienso_en_mi,movie,2017-09-08,01:00:00
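
The conversion works because the movie durations shown above all look like `90 min`, which pandas reads as a number plus a unit. A rough sketch of why the movie filter had to come first, and of one way to get plain minutes back out. The `errors="coerce"` handling and the `minutes` name are my additions for illustration, not part of the notebook:

```python
import pandas as pd

# Durations copied from the outputs above: three movies and one TV show
durations = pd.Series(["90 min", "94 min", "60 min", "1 Season"])

# "min" is a unit pandas understands; "Season" is not, so coerce turns it into NaT
deltas = pd.to_timedelta(durations, errors="coerce")

# Dividing by a one-minute Timedelta gives the runtime as a float number of minutes
minutes = deltas / pd.Timedelta(minutes=1)

print(deltas)
print(minutes.tolist())
```

Keeping a numeric minutes column alongside the timedelta may make later comparisons or joins against other movie datasets simpler, depending on how those sources store runtime.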
