<h1>Netflix Data</h1>
<p>In the age of pandemic, Netflix becomes an essential part of life, Kaggle user <b>Shivam Bansal</b> prepared a dataset including the list of movies and TV shows included in Netflix. Let's have a look of what they have for users</p>

<p>First, let's import the essential libraries and the dataset:</p>

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('dataset/netflix_titles.csv')

<h2>Exploratory</h2>
<p>Let's have a look at the data, understand how it is structured, and what data does it include:</p>

In [3]:
df.head(5)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [4]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [5]:
df['type'].value_counts()

Movie      5377
TV Show    2410
Name: type, dtype: int64

<h3>Movie Data</h3>
<p>Here we will be only foucsing on movie data only, so I would filter out the dataset.</p>

In [6]:
movie_df = df[df['type'] == 'Movie'].copy()

<p>As we have are focusing on movie data only, we could drop the 'type' column.</p>

In [7]:
movie_df.drop(columns = 'type', axis = 1, inplace = True)

In [8]:
movie_df

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
6,s7,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,"June 1, 2020",2019,TV-MA,95 min,"Horror Movies, International Movies","After an awful accident, a couple admitted to ..."
...,...,...,...,...,...,...,...,...,...,...,...
7781,s7782,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
7782,s7783,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."


<h3>Missing Data</h3>
<p>The next step is to check if we have any null data, and as we could see there are some missing data for 'director', 'cast', 'country' and 'rating'.</p>

<p>Let's begin with rating, as it only has 5 missing data.</p>

In [9]:
print(movie_df.isnull().sum())
before = movie_df.isnull().sum()

show_id           0
title             0
director        163
cast            426
country         230
date_added        0
release_year      0
rating            5
duration          0
listed_in         0
description       0
dtype: int64


In [10]:
movie_df[movie_df['rating'].isnull()]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
67,s68,13TH: A Conversation with Oprah Winfrey & Ava ...,,"Oprah Winfrey, Ava DuVernay",,"January 26, 2017",2017,,37 min,Movies,Oprah Winfrey sits down with director Ava DuVe...
3736,s3737,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,"April 4, 2017",2017,,74 min,Movies,"Louis C.K. muses on religion, eternal love, gi..."
3737,s3738,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,,84 min,Movies,Emmy-winning comedy writer Louis C.K. brings h...
3738,s3739,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,,66 min,Movies,The comic puts his trademark hilarious/thought...
4323,s4324,My Honor Was Loyalty,Alessandro Pepe,"Leone Frisa, Paolo Vaccarino, Francesco Miglio...",Italy,"March 1, 2017",2015,,115 min,Dramas,"Amid the chaos and horror of World War II, a c..."


<p>With only five missing data, we coul deasily deal with them individual by doing some manual search. Another interesting finding is that the dataset has included "comedy show" as movie, which is quite debatable. Anyway, we could leave them for now.</p>

<p>Speaking of which, we could also drop the <b>"show_id"</b> column as it does not really tell us much.</p>

In [11]:
movie_df.drop(columns = 'show_id', axis = 1, inplace = True)

In [12]:
movie_df.head(4)

Unnamed: 0,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [13]:
movie_df.reset_index(inplace = True)

<p>Missing data of movies should be relatively easy to resolve, here, I would make use of the <a href="https://www.themoviedb.org/">TMDB (The Movie Database) API</a> to fill in the missing data.</p>
<p><em>Oh, and we also need to make sure we have installed and imported the 'requests' package.</em></p>
<p>So I have create 4 simple functions to deal with this problem (please use your own API keys, it is free.).</p>

<h4>Getting Movie ID</h4>
<p>To fill in the missing data, we would need to first create a new column, and get the movie id for each movie.</p>

In [14]:
import requests

In [15]:
TMDB_API = '476e30bfe7878c3cf343c4446566b6ec'

# Find Movie ID in TMDB
def get_movie_id(movie_name):
    response = requests.get(
    'https://api.themoviedb.org/3/search/movie',
    params={'api_key': TMDB_API,
            'query' : movie_name},
    )
    
    try:
        movie_id = response.json()['results'][0]['id']     
#       print('Movie ID found: ', movie_id)
    except:
        movie_id = np.nan
      
    return movie_id

In [16]:
movie_df['movie_id'] = np.nan

In [17]:
movie_df[:10]

Unnamed: 0,index,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,movie_id
0,1,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,
1,2,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",
2,3,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",
3,4,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...,
4,6,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,"June 1, 2020",2019,TV-MA,95 min,"Horror Movies, International Movies","After an awful accident, a couple admitted to ...",
5,7,187,Kevin Reynolds,"Samuel L. Jackson, John Heard, Kelly Rowan, Cl...",United States,"November 1, 2019",1997,R,119 min,Dramas,After one of his high school students attacks ...,
6,8,706,Shravan Kumar,"Divya Dutta, Atul Kulkarni, Mohan Agashe, Anup...",India,"April 1, 2019",2019,TV-14,118 min,"Horror Movies, International Movies","When a doctor goes missing, his psychiatrist w...",
7,9,1920,Vikram Bhatt,"Rajneesh Duggal, Adah Sharma, Indraneil Sengup...",India,"December 15, 2017",2008,TV-MA,143 min,"Horror Movies, International Movies, Thrillers",An architect and his wife move into a castle t...,
8,10,1922,Zak Hilditch,"Thomas Jane, Molly Parker, Dylan Schmid, Kaitl...",United States,"October 20, 2017",2017,TV-MA,103 min,"Dramas, Thrillers",A farmer pens a confession admitting to his wi...,
9,13,2215,Nottapon Boonprakob,Artiwara Kongmalai,Thailand,"March 1, 2019",2018,TV-MA,89 min,"Documentaries, International Movies, Sports Mo...",This intimate documentary follows rock star Ar...,


In [18]:
movie_df['movie_id'] = movie_df.apply(lambda x: get_movie_id(x['title']), axis=1)

In [19]:
movie_df

Unnamed: 0,index,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,movie_id
0,1,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,387805.0
1,2,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",107227.0
2,3,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",12244.0
3,4,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...,535292.0
4,6,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,"June 1, 2020",2019,TV-MA,95 min,"Horror Movies, International Movies","After an awful accident, a couple admitted to ...",512959.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5372,7781,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",14113.0
5373,7782,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...,99478.0
5374,7783,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,455272.0
5375,7784,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast...",748172.0


<p>Now we have filled up most <b>movie_id</b>, we should be able to easily retrieve the other data.</p>
<p>Before that, we should also save the file as a new checkpoint</p>

In [21]:
# movie_df.to_csv('movie_df.csv', index = False)

In [365]:
movie_df = pd.read_csv('movie_df.csv')

In [331]:
movie_df.isnull().sum()

index             0
title             0
director        163
cast            426
country         230
date_added        0
release_year      0
rating            5
duration          0
listed_in         0
description       0
movie_id        207
dtype: int64

In [332]:
def find_director(row):
    movie_id = int(row['movie_id'])   
    response = requests.get(
    'https://api.themoviedb.org/3/movie/{}/credits'.format(movie_id),
    params={'api_key': TMDB_API,},
    )

    director = ''
    for crew in response.json()['crew']:
        if crew['job'] == 'Director':
            director += crew['original_name'] + ', '
#         print(director)
    return str(director[:-1])


def find_cast(row):
    movie_id = int(row['movie_id'])
    response = requests.get(
    'https://api.themoviedb.org/3/movie/{}/credits'.format(movie_id),
    params={'api_key': TMDB_API,},
    )

    cast_list = ''
    for cast in response.json()['cast']:
        cast_list += cast['name'] + ', '
#     print(cast_list)
    return str(cast_list[:-1])


def find_country(row):
    movie_id = int(row['movie_id'])


#             print('finding country')
    response = requests.get(
    'https://api.themoviedb.org/3/movie/{}'.format(movie_id),
    params={'api_key': TMDB_API,},
    )

#     print(response.status_code)
    try:
        country_list = response.json()['production_countries'][0]['name']
#         print('country list: ', country_list)
        return str(country_list[:-1])
    except:
        return(row['country'])





In [333]:
missing_data = movie_df[movie_df[['country','director','cast']].isnull().any(axis=1)]


In [334]:
# drop missing data with no movie_id

missing_data = missing_data.dropna(subset=['movie_id']).copy()

In [335]:
missing_data['country'] = missing_data.apply(lambda x: find_country(x), axis=1)

In [336]:
missing_data['director'] = missing_data.apply(lambda x: find_director(x), axis=1)

In [337]:
missing_data['cast'] = missing_data.apply(lambda x: find_cast(x), axis=1)

In [338]:
missing_data.isnull().sum()

index            0
title            0
director         0
cast             0
country         60
date_added       0
release_year     0
rating           1
duration         0
listed_in        0
description      0
movie_id         0
dtype: int64

In [341]:
missing_data.drop(columns ='index', inplace = True)

In [342]:
missing_data

Unnamed: 0,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,movie_id
14,'89,"Dave Stewart,","Lee Dixon, Ian Wright, Paul Merson, Steve Boul...",United Kingdo,"May 16, 2018",2017,TV-PG,87 min,Sports Movies,"Mixing old footage with interviews, this is th...",486882.0
19,(T)ERROR,"Lyric R. Cabral, David Felix Sutcliffe,",,United States of Americ,"June 30, 2016",2015,NR,84 min,Documentaries,This real-life look at FBI counterterrorism op...,318972.0
22,#cats_the_mewvie,"Michael Margolis,","Mike Bridavsky, Jason Eppink, Amanda Brennan, ...",Canad,"February 5, 2020",2020,TV-14,90 min,"Documentaries, International Movies",This pawesome documentary explores how our fel...,668291.0
27,#Rucker50,"Robert McCullough Jr.,",,United States,"December 1, 2016",2016,TV-PG,56 min,"Documentaries, Sports Movies",This documentary celebrates the 50th anniversa...,433479.0
41,100 Days Of Solitude,"Gerardo Olivares, José Díaz Martínez,","José Díaz,",Spai,"July 6, 2018",2018,TV-MA,93 min,"Documentaries, International Movies",Spanish photographer José Díaz spends 100 days...,534110.0
...,...,...,...,...,...,...,...,...,...,...,...
5332,Yoo Byung Jae: Too Much Information,"Yoo Byung-jae,","Yoo Byung-jae, Yang Hyun-suk,",South Kore,"March 16, 2018",2018,TV-MA,63 min,Stand-Up Comedy,"""Saturday Night Live Korea"" writer-turned-come...",512153.0
5345,Your Excellency,"Funke Akindele,","Akin Lewis, Funke Akindele, Kemi Lala Akindoju...",Nigeri,"July 10, 2020",2019,TV-G,120 min,"Comedies, International Movies","Bumbling through politics, a billionaire busin...",655507.0
5361,Zero Hour,"Hall Bartlett,","Dana Andrews, Linda Darnell, Sterling Hayden, ...",United States of Americ,"December 13, 2019",2018,TV-MA,89 min,"International Movies, Thrillers","After his father passes, the heir to a retail ...",54541.0
5375,Zulu Man in Japan,,"Nasty C,",South Afric,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast...",748172.0


<p>Now we should drop the data from original <b>movie_df</b> and merge the processed (missing_data) dataframe back into the original movie_df</p>

In [366]:
movie_df = movie_df.drop(missing_data.index[:])

In [367]:
movie_df = (movie_df.append(missing_data))

In [368]:
movie_df.reset_index(inplace=True)

In [369]:
movie_df.drop(columns=['level_0', 'index'], inplace = True)

<h4>Dropping some rows with missing data</h4>
<p>We have attempted to fill in most of the data, and let's see if we have reduced the occurence of missing data.</p>

In [370]:
print(movie_df.isnull().sum())
after = movie_df.isnull().sum()

title             0
director         36
cast             25
country         118
date_added        0
release_year      0
rating            5
duration          0
listed_in         0
description       0
movie_id        207
dtype: int64


<p>Let's see how well did we do.</p>

In [371]:
print(before)

show_id           0
title             0
director        163
cast            426
country         230
date_added        0
release_year      0
rating            5
duration          0
listed_in         0
description       0
dtype: int64


<p>Sadly, there are still some missing data. So I guess we would just drop them and movie_id column for now for the sake of analysis. After that, let's save the dataframe into another checkpoint just in case.</p>

In [372]:
movie_df.drop(columns=['movie_id'], axis = 1, inplace=True)
movie_df.dropna(axis=0, how='any', inplace=True)

<p>Ok, let's check if everything looks good before storing the dataframe into a new checkpoint file.</p>

In [373]:
movie_df

Unnamed: 0,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
1,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
2,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
3,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
4,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,"June 1, 2020",2019,TV-MA,95 min,"Horror Movies, International Movies","After an awful accident, a couple admitted to ..."
...,...,...,...,...,...,...,...,...,...,...
5372,Yoo Byung Jae: Too Much Information,"Yoo Byung-jae,","Yoo Byung-jae, Yang Hyun-suk,",South Kore,"March 16, 2018",2018,TV-MA,63 min,Stand-Up Comedy,"""Saturday Night Live Korea"" writer-turned-come..."
5373,Your Excellency,"Funke Akindele,","Akin Lewis, Funke Akindele, Kemi Lala Akindoju...",Nigeri,"July 10, 2020",2019,TV-G,120 min,"Comedies, International Movies","Bumbling through politics, a billionaire busin..."
5374,Zero Hour,"Hall Bartlett,","Dana Andrews, Linda Darnell, Sterling Hayden, ...",United States of Americ,"December 13, 2019",2018,TV-MA,89 min,"International Movies, Thrillers","After his father passes, the heir to a retail ..."
5375,Zulu Man in Japan,,"Nasty C,",South Afric,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."


In [374]:
movie_df.isnull().sum()

title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [375]:
movie_df.to_csv('movie_df_cp2.csv', index = False)

<h4>Exploratory Data Analysis</h4>
<p>The next step is to perform exploratory data analysis on the dataset. We can refer to <a href="https://www.datacourses.com/tips-for-performing-eda-with-python-671/">this page</a>.</p>

In [376]:
movie_df = pd.read_csv('movie_df_cp2.csv')

In [377]:
movie_df.dropna(axis=0, how='any', inplace=True)