## Importing libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

sns.set_style('darkgrid')

In [3]:
!git clone "https://github.com/Yashtiii/Netflix-Content-Strategy"

Cloning into 'Netflix-Content-Strategy'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 13 (delta 4), reused 8 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (13/13), 1.17 MiB | 4.31 MiB/s, done.
Resolving deltas: 100% (4/4), done.


## Data loading and initial inspection

In [4]:
netflix_df = pd.read_csv("/content/Netflix-Content-Strategy/netflix_titles.csv")

In [5]:
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [6]:
netflix_df.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,,Nasty C,,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...
7786,s7787,Movie,ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS,Sam Dunn,,"United Kingdom, Canada, United States","March 1, 2020",2019,TV-MA,90 min,"Documentaries, Music & Musicals",This documentary delves into the mystique behi...


### Getting an idea of how many values we are missing
The output shows that the director column has the most missing values (2389), while rating has the fewest among the columns with missing data (7).

In [7]:
netflix_df.isna().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,2389
cast,718
country,507
date_added,10
release_year,0
rating,7
duration,0
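
Raw counts are easier to compare as percentages. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions, not the Netflix data) that turns the `isna()` mask into a percentage per column:

```python
import pandas as pd

# Toy frame standing in for netflix_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, "Y"],
    "rating": ["PG", "R", None, "PG"],
})

# isna() gives a boolean mask; its column-wise mean is the fraction missing
missing_pct = (toy_df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

The same expression should work on `netflix_df` and would list the columns from most to least incomplete.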


### Checking out the data types
- Now that I know which columns have missing values, I am checking their data types to decide how to handle them. Data is currently missing in director, cast, country, date_added, and rating. All of these have the object data type, so I can't fill them with the mean or median.
- We can also see that the dataset has 7787 rows and 12 columns.
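
One way to check this in a single step is to filter the dtypes by the columns that contain missing values. This is only a sketch on a hypothetical toy frame (`toy_df` is an assumption), not the notebook's own method:

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and two text columns
toy_df = pd.DataFrame({
    "director": ["X", None, "Y"],
    "release_year": [2019, 2020, 2021],
    "rating": [None, "PG", "R"],
})

# Boolean Series: True for every column that has at least one missing value
has_missing = toy_df.isna().any()

# Keep only the dtypes of the incomplete columns
dtypes_of_missing = toy_df.dtypes[has_missing]
print(dtypes_of_missing)

# Mean/median imputation only makes sense for numeric columns
numeric_ok = {
    col: pd.api.types.is_numeric_dtype(toy_df[col])
    for col in dtypes_of_missing.index
}
print(numeric_ok)
```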


In [11]:
netflix_df.shape

(7787, 12)

In [8]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


- The date_added column has the object dtype (a string), not a datetime dtype. This has to be corrected.

In [9]:
netflix_df.describe()

Unnamed: 0,release_year
count,7787.0
mean,2013.93258
std,8.757395
min,1925.0
25%,2013.0
50%,2017.0
75%,2018.0
max,2021.0


From describe() I learned that the release years in my data range from 1925 to 2021.

## Data cleaning and transformation

For null values we are going to do one of two things:
1) Drop the rows if the percentage of missing data is small.
2) Replace the missing values with "Unknown" or the mode.
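
The two options above can be sketched as a simple rule. The 1% cutoff (`DROP_THRESHOLD`), the helper `suggest_strategy`, and the toy data are all assumptions made for illustration; the notebook itself decides column by column:

```python
import pandas as pd

DROP_THRESHOLD = 0.01  # hypothetical cutoff: drop rows when under 1% are missing


def suggest_strategy(df, threshold=DROP_THRESHOLD):
    """Return {column: 'drop rows' or 'fill'} for columns with missing values."""
    missing_frac = df.isna().mean()
    strategy = {}
    for col, frac in missing_frac.items():
        if frac == 0:
            continue  # nothing to handle in complete columns
        strategy[col] = "drop rows" if frac < threshold else "fill"
    return strategy


# Toy frame: 15% of director is missing, 0.5% of rating is missing
toy_df = pd.DataFrame({
    "director": [None] * 30 + ["X"] * 170,
    "rating": [None] + ["PG"] * 199,
    "title": ["T"] * 200,
})
plan = suggest_strategy(toy_df)
print(plan)
```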

In [12]:
netflix_df.isna().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,2389
cast,718
country,507
date_added,10
release_year,0
rating,7
duration,0


In [13]:
netflix_df["director"]

Unnamed: 0,director
0,
1,Jorge Michel Grau
2,Gilbert Chan
3,Shane Acker
4,Robert Luketic
...,...
7782,Josef Fares
7783,Mozez Singh
7784,
7785,


In [22]:
netflix_df["director"].value_counts()

Unnamed: 0_level_0,count
director,Unnamed: 1_level_1
Unknown,2389
"Raúl Campos, Jan Suter",18
Marcus Raboy,16
Jay Karas,14
Cathy Garcia-Molina,13
...,...
Jonathan Helpert,1
Greg Kohs,1
Jacob Schwab,1
Serge Ou,1


- We can't fill the missing director values with the mean or median because the column is categorical, and the mode is not representative either, since most movies/series have a different director (there are a lot of categories).
- So we will fill them with "Unknown".


In [14]:
netflix_df["director"] = netflix_df["director"].fillna("Unknown")

In [17]:
netflix_df["director"].isna().sum()

np.int64(0)

Doing the same for cast.

In [23]:
netflix_df["cast"].value_counts()

Unnamed: 0_level_0,count
cast,Unnamed: 1_level_1
Unknown,718
David Attenborough,18
Samuel West,10
Jeff Dunham,7
"Michela Luci, Jamie Watson, Eric Peterson, Anna Claire Bartlam, Nicolas Aqui, Cory Doran, Julie Lemieux, Derek McGrath",6
...,...
"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed El Fishawy, Mahmoud Hijazi, Jihane Khalil, Asmaa Galal, Tara Emad",1
"Samuel L. Jackson, John Heard, Kelly Rowan, Clifton Collins Jr., Tony Plana",1
"Divya Dutta, Atul Kulkarni, Mohan Agashe, Anupam Shyam, Raayo S. Bakhirta, Yashvit Sancheti, Greeva Kansara, Archan Trivedi, Rajiv Pathak",1
"Rajneesh Duggal, Adah Sharma, Indraneil Sengupta, Anjori Alagh, Rajendranath Zutshi, Vipin Sharma, Amin Hajee, Shri Vallabh Vyas",1


In [18]:
netflix_df["cast"] = netflix_df["cast"].fillna("Unknown")

In [19]:
netflix_df["cast"].isna().sum()

np.int64(0)

Let's work on countries now!

In [20]:
netflix_df["country"].head()

Unnamed: 0,country
0,Brazil
1,Mexico
2,Singapore
3,United States
4,United States


In [21]:
netflix_df["country"].value_counts()

Unnamed: 0_level_0,count
country,Unnamed: 1_level_1
United States,2555
India,923
United Kingdom,397
Japan,226
South Korea,183
...,...
"Germany, United States, United Kingdom, Canada",1
"Peru, United States, United Kingdom",1
"Saudi Arabia, United Arab Emirates",1
"United Kingdom, France, United States, Belgium",1


There are 681 different categories, with United States being the most frequent. 507 rows have missing values in country, and we can fill them with the mode (United States). We didn't do that for director and cast because director had more than 4000 categories and cast had more than 6000.
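
The reasoning here rests on two numbers per column: how many categories it has and how dominant the most frequent one is. A rough sketch on made-up data (`toy` and the `summary` dict are assumptions) could look like this:

```python
import pandas as pd

# Toy data: country has a dominant value, director does not
toy = pd.DataFrame({
    "country": ["United States", "United States", "India", "United States", None],
    "director": ["A", "B", "C", "D", None],
})

summary = {}
for col in ["country", "director"]:
    n_categories = toy[col].nunique()  # NaN is excluded by default
    # Share of non-missing rows taken by the most frequent value
    top_share = toy[col].value_counts(normalize=True).iloc[0]
    summary[col] = (n_categories, top_share)
    print(f"{col}: {n_categories} categories, top value covers {top_share:.0%}")
```

A column where the top value covers a large share is a more reasonable candidate for mode imputation than one where every value is nearly unique.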

In [33]:
mode_country = netflix_df["country"].mode()[0]
# mode() returns a Series, so mode()[0] picks its first value
mode_country
netflix_df["country"] = netflix_df["country"].fillna(mode_country)
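
`mode()` returns a Series because several values can be tied for most frequent. The sketch below uses a made-up Series (`tied` is an assumption) to show what `[0]` picks in that case:

```python
import pandas as pd

# Two values are tied for most frequent; None is ignored by mode()
tied = pd.Series(["India", "Brazil", "India", "Brazil", None])

modes = tied.mode()       # contains every tied value, in sorted order
first_mode = modes[0]     # the first of the tied values
filled = tied.fillna(first_mode)

print(modes.tolist())
print(filled.tolist())
```

With a tie, `[0]` silently picks the value that sorts first, so it is worth checking `len(modes)` when the top counts are close.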

In [32]:
netflix_df.isna().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,0
cast,0
country,0
date_added,10
release_year,0
rating,7
duration,0


In [36]:
netflix_df[["date_added", "rating"]].head()
# We are only missing 7 values in rating and 10 in date_added, so dropping those rows is reasonable given that we have more than 7000 rows.

Unnamed: 0,date_added,rating
0,"August 14, 2020",TV-MA
1,"December 23, 2016",TV-MA
2,"December 20, 2018",R
3,"November 16, 2017",PG-13
4,"January 1, 2020",PG-13


In [39]:
netflix_df.dropna(subset=["date_added","rating"], inplace= True)
# inplace=True applies the change to the original DataFrame instead of returning a new one. By default, inplace is False.
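
To make the difference concrete, here is a small sketch on a made-up frame (`toy` is an assumption) comparing the default behaviour with `inplace=True`:

```python
import pandas as pd

toy = pd.DataFrame({
    "date_added": ["January 1, 2020", None, "March 2, 2019"],
    "rating": ["PG", "R", None],
})

# Default (inplace=False): a new DataFrame is returned, toy is unchanged
cleaned = toy.dropna(subset=["date_added", "rating"])
print(len(toy), len(cleaned))

# inplace=True: toy itself is modified and the call returns None
result = toy.dropna(subset=["date_added", "rating"], inplace=True)
print(result, len(toy))
```

Reassigning the result (`df = df.dropna(...)`) is an equivalent alternative to `inplace=True`.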

In [40]:
netflix_df.isna().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,0
cast,0
country,0
date_added,0
release_year,0
rating,0
duration,0


In [42]:
netflix_df.shape

(7770, 12)

Initially I had 7787 rows. Here I lost 17 rows, which is a very small percentage of the 7787 rows.
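
The size of that loss can be worked out directly from the two shapes reported above:

```python
rows_before = 7787  # shape before dropna
rows_after = 7770   # shape after dropna

rows_dropped = rows_before - rows_after
pct_dropped = rows_dropped / rows_before * 100

print(f"{rows_dropped} rows dropped ({pct_dropped:.2f}% of the data)")
```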

Now let's work on changing the data type of date_added so we can actually use it for our analysis.

- **pd.to_datetime** is essential for handling date operations in pandas. It converts strings, numbers, and other representations of dates into the pandas datetime type.
- Using **format="mixed"** means pandas infers the format of each element individually and converts them all into one datetime type.
- **dayfirst=False:** Tells pandas to read ambiguous dates month first ("month-day-year"), not "day-month-year". For example, "03/05/2021" will be treated as March 5th, not May 3rd. We used this because the original dataset gives the month first.
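
The points above can be tried out on a few made-up strings (the values in `raw` and `ambiguous` are assumptions, not rows from the dataset). Note that `format="mixed"` requires pandas 2.0 or newer:

```python
import pandas as pd

# Strings in different layouts, parsed element by element
raw = pd.Series(["August 14, 2020", "December 23, 2016", "2018-12-20"])
parsed = pd.to_datetime(raw, format="mixed", dayfirst=False)
print(parsed.dt.strftime("%Y-%m-%d").tolist())

# An ambiguous date: dayfirst decides which number is the month
ambiguous = "03/05/2021"
month_first = pd.to_datetime(ambiguous, dayfirst=False)
day_first = pd.to_datetime(ambiguous, dayfirst=True)
print(month_first.strftime("%B %d"), "vs", day_first.strftime("%B %d"))
```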

In [58]:
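netflix_df["date_added"] = pd.to_datetime(netflix_df["date_added"], format="mixed", dayfirst=False)
# Remove the misspelled "data_added" column (presumably left over from an earlier run); pop() raises KeyError if it is absent
netflix_df.pop("data_added")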

Unnamed: 0,data_added
0,2020-08-14
1,2016-12-23
2,2018-12-20
3,2017-11-16
4,2020-01-01
...,...
7782,2020-10-19
7783,2019-03-02
7784,2020-09-25
7785,2020-10-31


In [59]:
netflix_df["date_added"].head()
# year-month-day format

Unnamed: 0,date_added
0,2020-08-14
1,2016-12-23
2,2018-12-20
3,2017-11-16
4,2020-01-01


We are going to extract the year and month from the date_added column to understand the time periods in which the most titles are added. The day in these dates isn't that important.

In [60]:
netflix_df["year_added"] = netflix_df["date_added"].dt.year
netflix_df["year_added"].head()

Unnamed: 0,year_added
0,2020
1,2016
2,2018
3,2017
4,2020


In [61]:
netflix_df["month_added"] = netflix_df["date_added"].dt.month
netflix_df["month_added"].head()

Unnamed: 0,month_added
0,8
1,12
2,12
3,11
4,1
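
Once `year_added` and `month_added` exist, counting titles per period is a short step. The sketch below uses a made-up frame (`toy` and its four dates are assumptions) rather than `netflix_df`:

```python
import pandas as pd

toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2020-08-14", "2016-12-23", "2020-01-01", "2018-12-20"]
    ),
})
toy["year_added"] = toy["date_added"].dt.year
toy["month_added"] = toy["date_added"].dt.month

# Number of titles per year and per month, ordered chronologically
per_year = toy["year_added"].value_counts().sort_index()
per_month = toy["month_added"].value_counts().sort_index()

busiest_year = per_year.idxmax()
busiest_month = per_month.idxmax()
print(per_year.to_dict(), per_month.to_dict())
```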


In [62]:
netflix_df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,s1,TV Show,3%,Unknown,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...,2020,8
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,2016,12


In [67]:
print(netflix_df.dtypes)
print(netflix_df.isna().sum())

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                  object
duration                object
listed_in               object
description             object
year_added               int32
month_added              int32
dtype: object
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
year_added      0
month_added     0
dtype: int64
