# Data Preprocessing + Simple Pandas

The following is an example of good data-preprocessing practices, applied to a dataset of Netflix shows and movies.

Here is the URL to the data: https://www.kaggle.com/shivamb/netflix-shows

## Imports

In [None]:
import pandas as pd
import numpy as np



In [None]:
df = pd.read_csv("netflix_titles.csv")
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [None]:
df.shape

(7787, 12)

## Preprocessing

- Find and fill missing values 
- Check for duplicates
- Categorize data
- Type the data appropriately


And have a reason for each of these steps!

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [None]:
# Check for missing values (NAs)
df.isna().sum()

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

In [None]:
df.isna().sum()/df.shape[0]

show_id         0.000000
type            0.000000
title           0.000000
director        0.306793
cast            0.092205
country         0.065109
date_added      0.001284
release_year    0.000000
rating          0.000899
duration        0.000000
listed_in       0.000000
description     0.000000
dtype: float64
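The same per-column fraction can also be computed with `isna().mean()`, because the mean of a boolean column is the share of `True` values. A minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the Netflix file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Netflix frame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "G"],
})

# Two equivalent ways to get the share of missing values per column
frac_by_count = toy.isna().sum() / toy.shape[0]
frac_by_mean = toy.isna().mean()

print(frac_by_mean)
```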

In [None]:
df["country"].value_counts().sort_values(ascending=False)

United States                                           2555
India                                                    923
United Kingdom                                           397
Japan                                                    226
South Korea                                              183
                                                        ... 
United States, Australia, Samoa, United Kingdom            1
Australia, United Kingdom, Canada                          1
United States, New Zealand, United Kingdom                 1
Ireland, United Kingdom, Greece, France, Netherlands       1
Germany, United Kingdom, United States                     1
Name: country, Length: 681, dtype: int64

In [None]:
df["rating"].value_counts().sort_values(ascending=False)

TV-MA       2863
TV-14       1931
TV-PG        806
R            665
PG-13        386
TV-Y         280
TV-Y7        271
PG           247
TV-G         194
NR            84
G             39
TV-Y7-FV       6
UR             5
NC-17          3
Name: rating, dtype: int64

In [None]:
df[["cast","director", "country"]] = df[["cast", "director", "country"]].fillna("Unknown")

In [None]:
# One option: drop the rows where date_added is missing (shown here, not assigned)
df.loc[df["date_added"].notna(), :]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Unknown,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
...,...,...,...,...,...,...,...,...,...,...,...,...
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...","October 19, 2020",2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...
7783,s7784,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...
7784,s7785,Movie,Zulu Man in Japan,Unknown,Nasty C,Unknown,"September 25, 2020",2019,TV-MA,44 min,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast..."
7785,s7786,TV Show,Zumbo's Just Desserts,Unknown,"Adriano Zumbo, Rachel Khoo",Australia,"October 31, 2020",2019,TV-PG,1 Season,"International TV Shows, Reality TV",Dessert wizard Adriano Zumbo looks for the nex...


In [None]:
# Or replace missing values with the mode. This keeps the rows, at the risk of a few inaccurate data points.
df["date_added"] = df["date_added"].fillna(df["date_added"].mode()[0])
df["rating"] = df["rating"].fillna(df["rating"].mode()[0])
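To make the trade-off between dropping and filling concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical). Dropping loses the whole row; filling with the mode keeps the row but may record a wrong value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "rating": ["TV-MA", "TV-MA", np.nan, "PG", "TV-14"],
})

# Option 1: drop rows with a missing rating (the whole row is lost)
dropped = toy.dropna(subset=["rating"])

# Option 2: fill with the most common value (row is kept, value may be wrong)
most_common = toy["rating"].mode()[0]
filled = toy.assign(rating=toy["rating"].fillna(most_common))

print(len(dropped), len(filled))
```

With only a handful of missing values out of thousands of rows, either choice changes the results very little; the point is to pick one on purpose.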

In [None]:
df.isna().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

In [None]:
df["date_added"] = pd.to_datetime(df["date_added"])
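`pd.to_datetime` guesses the date format by default. As a sketch, assuming the dates look like "August 14, 2020" as in the rows shown above, the format can be spelled out, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising. The `dates` series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: two dates in the dataset's style plus one bad value
dates = pd.Series(["August 14, 2020", "December 23, 2016", "not a date"])

# An explicit format avoids guessing; errors="coerce" maps failures to NaT
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

print(parsed)
print(parsed.dt.year)  # datetime columns expose parts such as the year via .dt
```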

In [None]:
# Check for duplicates
duplicate = df[df.duplicated()]
duplicate

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
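The Netflix data has no fully duplicated rows, so the result is empty. To see what the check does when duplicates exist, here is a sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data where row 2 repeats row 1
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

# duplicated() marks every repeat of an earlier row as True
dupes = toy[toy.duplicated()]

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()

# subset= limits the comparison to chosen columns, e.g. just the ID
n_repeat_ids = toy.duplicated(subset=["show_id"]).sum()

print(dupes)
```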


In [None]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Unknown,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [None]:
# Split into groups by type
movies = df.loc[df["type"] == "Movie", :]
tv_shows = df.loc[df["type"] == "TV Show", :]

# Select titles by genre
dramas = df[df["listed_in"].str.contains("Dramas")]

In [None]:
sj_rows = df[df["cast"].str.contains("Samuel L. Jackson")]
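One thing to keep in mind with `str.contains`: the pattern is treated as a regular expression by default, so the `.` in "Samuel L. Jackson" matches any character. Missing values also come back as missing rather than `False`, which is not an issue here only because the missing cast entries were already filled with "Unknown". A small sketch on made-up names (the `cast` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical cast strings; the second one only matches when "." is a wildcard
cast = pd.Series([
    "Samuel L. Jackson, John Heard",
    "Samuel Lx Jackson",
    np.nan,
])

# Default behaviour: regex match, so "." matches any single character
as_regex = cast.str.contains("Samuel L. Jackson", na=False)

# regex=False does a plain substring match; na=False treats missing as no match
as_literal = cast.str.contains("Samuel L. Jackson", regex=False, na=False)

print(as_regex.tolist(), as_literal.tolist())
```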

In [None]:
dramas

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,Unknown,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
5,s6,TV Show,46,Serdar Akar,"Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan...",Turkey,2017-07-01,2016,TV-MA,1 Season,"International TV Shows, TV Dramas, TV Mysteries",A genetics professor experiments with a treatm...
7,s8,Movie,187,Kevin Reynolds,"Samuel L. Jackson, John Heard, Kelly Rowan, Cl...",United States,2019-11-01,1997,R,119 min,Dramas,After one of his high school students attacks ...
...,...,...,...,...,...,...,...,...,...,...,...,...
7774,s7775,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
7776,s7777,Movie,Zokkomon,Satyajit Bhatkal,"Darsheel Safary, Anupam Kher, Manjari Fadnis, ...",India,2018-11-01,2011,PG,104 min,"Children & Family Movies, Dramas","When his cruel uncle abandons him, a young orp..."
7780,s7781,Movie,Zoo,Shlok Sharma,"Shashank Arora, Shweta Tripathi, Rahul Kumar, ...",India,2018-07-01,2018,TV-MA,94 min,"Dramas, Independent Movies, International Movies",A drug dealer starts having doubts about his t...
7782,s7783,Movie,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...",2020-10-19,2005,TV-MA,99 min,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       7787 non-null   object        
 1   type          7787 non-null   object        
 2   title         7787 non-null   object        
 3   director      7787 non-null   object        
 4   cast          7787 non-null   object        
 5   country       7787 non-null   object        
 6   date_added    7787 non-null   datetime64[ns]
 7   release_year  7787 non-null   int64         
 8   rating        7787 non-null   object        
 9   duration      7787 non-null   object        
 10  listed_in     7787 non-null   object        
 11  description   7787 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 730.2+ KB


## Split Function

In [None]:
example = "Hi, my name is Daniel"

example_date = "01-01-2021"

example_phone_num = "(777)-555-999"

temp_list = example.split(", ")

temp_date_list = example_date.split("-")

area_code = example_phone_num.split("-")[0]

In [None]:
area_code

'(777)'
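pandas offers the same operation for whole columns through `.str.split`. As a sketch on a few made-up duration strings (the `durations` series is hypothetical, in the style of the `duration` column):

```python
import pandas as pd

# Hypothetical values in the style of the duration column
durations = pd.Series(["93 min", "4 Seasons", "1 Season"])

# Plain Python: split one string into a list of pieces
parts = "93 min".split(" ")

# pandas: .str.split does the same for every element; .str[0] takes the first piece
first_tokens = durations.str.split(" ").str[0].astype("int64")

# expand=True returns a DataFrame with one column per piece instead of lists
expanded = durations.str.split(" ", expand=True)

print(first_tokens.tolist())
```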

In [None]:
# Copy first so the column assignments below do not raise SettingWithCopyWarning
movies, tv_shows = movies.copy(), tv_shows.copy()
movies["length"] = movies["duration"].str.split(" ", expand = True)[0].astype("int64")
tv_shows["seasons"] = tv_shows["duration"].str.split(" ", expand = True)[0].astype("int64")
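Adding a column to a filtered subset of a DataFrame is the situation that pandas' `SettingWithCopyWarning` is about: pandas cannot tell whether the subset is meant to be its own table. A common way to make the intent explicit is to call `.copy()` when creating the subset. A minimal sketch on made-up data (the `toy` and `movies_toy` names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data in the style of the Netflix frame
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["93 min", "4 Seasons", "78 min"],
})

# .copy() makes the subset an independent DataFrame
movies_toy = toy.loc[toy["type"] == "Movie", :].copy()

# Adding a column now clearly affects only the copy
movies_toy["length"] = movies_toy["duration"].str.split(" ").str[0].astype("int64")

print(movies_toy)
print(toy.columns.tolist())  # the original frame has no "length" column
```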


In [None]:
movies = movies.drop(["duration", "type"], axis = 1)
tv_shows = tv_shows.drop(["duration", "type"], axis = 1)

In [None]:
movies

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,length
1,s2,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,93
2,s3,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",78
3,s4,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",80
4,s5,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,Dramas,A brilliant group of students become card-coun...,123
6,s7,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,2020-06-01,2019,TV-MA,"Horror Movies, International Movies","After an awful accident, a couple admitted to ...",95
...,...,...,...,...,...,...,...,...,...,...,...
7781,s7782,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",88
7782,s7783,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...",2020-10-19,2005,TV-MA,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...,99
7783,s7784,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,2019-03-02,2015,TV-14,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,111
7784,s7785,Zulu Man in Japan,Unknown,Nasty C,Unknown,2020-09-25,2019,TV-MA,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast...",44


## Some Data Exploration

### Q: Does Netflix have mostly newer or older movies?

In [None]:
movies.groupby("release_year")["show_id"].count().sort_values(ascending = False)

release_year
2017    744
2018    734
2016    642
2019    582
2020    411
       ... 
1963      1
1964      1
1966      1
1947      1
1946      1
Name: show_id, Length: 72, dtype: int64

In [None]:
def is_classic_modern_new(year):
    """Label a release year: classic (before 1990), new (2019 or later), else modern."""
    if year < 1990:
        return "classic"
    elif year >= 2019:
        return "new"
    else:
        return "modern"


In [None]:
movies["classic_modern_new"] = movies["release_year"].apply(is_classic_modern_new)
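`apply` with a Python function is easy to read and fine at this size. For larger data, the same binning can be done in one vectorized call with `pd.cut`. This is a sketch of an alternative, not what the notebook uses; the bin edges below are chosen to reproduce the function's cutoffs for whole-number years, and the `years` series is made up.

```python
import numpy as np
import pandas as pd

def is_classic_modern_new(year):
    if year < 1990:
        return "classic"
    elif year >= 2019:
        return "new"
    else:
        return "modern"

# Hypothetical years, including the boundary values
years = pd.Series([1946, 1989, 1990, 2018, 2019, 2020])

by_apply = years.apply(is_classic_modern_new)

# Bins are right-inclusive: (-inf, 1989], (1989, 2018], (2018, inf)
by_cut = pd.cut(
    years,
    bins=[-np.inf, 1989, 2018, np.inf],
    labels=["classic", "modern", "new"],
)

print(by_cut.tolist())
```

Note that `pd.cut` returns a categorical column, which also takes care of the "categorize data" step from the preprocessing list.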

In [None]:
movies

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,length,classic_modern_new
1,s2,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...,93,modern
2,s3,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,"Horror Movies, International Movies","When an army recruit is found dead, his fellow...",78,modern
3,s4,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi...",80,modern
4,s5,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,Dramas,A brilliant group of students become card-coun...,123,modern
6,s7,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,2020-06-01,2019,TV-MA,"Horror Movies, International Movies","After an awful accident, a couple admitted to ...",95,new
...,...,...,...,...,...,...,...,...,...,...,...,...
7781,s7782,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,2020-01-11,2006,PG,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",88,modern
7782,s7783,Zozo,Josef Fares,"Imad Creidi, Antoinette Turk, Elias Gergi, Car...","Sweden, Czech Republic, United Kingdom, Denmar...",2020-10-19,2005,TV-MA,"Dramas, International Movies",When Lebanon's Civil War deprives Zozo of his ...,99,modern
7783,s7784,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,2019-03-02,2015,TV-14,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,111,modern
7784,s7785,Zulu Man in Japan,Unknown,Nasty C,Unknown,2020-09-25,2019,TV-MA,"Documentaries, International Movies, Music & M...","In this documentary, South African rapper Nast...",44,new


In [None]:
movies.groupby("classic_modern_new")["show_id"].count().sort_values(ascending = False)

classic_modern_new
modern     4164
new        1005
classic     208
Name: show_id, dtype: int64
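Counting rows per group and sorting is what `value_counts` does in one step, and `normalize=True` gives shares instead of counts. A small sketch on made-up labels (the `labels` series is hypothetical):

```python
import pandas as pd

# Hypothetical labels in the style of the classic_modern_new column
labels = pd.Series(
    ["modern", "modern", "new", "classic", "modern"],
    name="classic_modern_new",
)

# Counts per label, sorted from most to least common
counts = labels.value_counts()

# Same thing as a fraction of all rows
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```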

### A: Netflix's movies are mostly modern (1990 to 2018) or new (2019 onward); only 208 of 5,377 are classics.


### Q: How many new movies on Netflix are rated G?

In [None]:
len(movies.loc[(movies["rating"] == "G") & (movies["classic_modern_new"] == "new"), :])

2

### A: Two new movies are rated G.


### Q: Which dramas were added to Netflix before 2017?

In [None]:
dramas.loc[dramas["date_added"] < pd.to_datetime('2017-01-01'), ["type", "title", "date_added"]]

Unnamed: 0,type,title,date_added
1,Movie,7:19,2016-12-23
59,Movie,1000 Rupee Note,2016-12-01
128,Movie,6 Years,2015-09-08
133,Movie,7 años,2016-10-27
211,Movie,A Noble Intention,2016-09-01
...,...,...,...
7684,Movie,X: Past Is Present,2016-07-01
7685,Movie,XOXO,2016-08-26
7726,Movie,You Carry Me,2016-07-01
7767,TV Show,Zindagi Gulzar Hai,2016-12-15
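Because `date_added` is a real datetime column, the same filter can be written against a timestamp or against the year via the `.dt` accessor. A sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a datetime column
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": pd.to_datetime(["2016-12-23", "2017-01-01", "2015-09-08"]),
})

# Compare against a timestamp...
before_2017 = toy.loc[toy["date_added"] < pd.Timestamp("2017-01-01"), "title"]

# ...or pull out the year and compare numbers
by_year = toy.loc[toy["date_added"].dt.year < 2017, "title"]

print(before_2017.tolist())
```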


### A: See the resulting DataFrame above.