# EDA on Netflix Data

In [1]:
# libraries 
import pandas as pd
import numpy as np
import altair as alt

# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

## Tasks
1. Basic Data Wrangling Tasks (including understanding the dataset characteristics)
2. Summary views (both visual and numerical)
3. Generate questions about the data
4. Search for answers by visualizing the data

## 1. Load the Dataset
Understanding the dataset characteristics:
 - What is the size of the dataset?
 - What are the column names?
 - Is the data in an appropriate form for us to encode it with Altair? Adjust as necessary.

In [2]:
url = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/netflix_data_edited.csv'
data = pd.read_csv(url, parse_dates= ['release_year', 'year_added'])
data['release_year'] = pd.DatetimeIndex(data['release_year']).year
data['year_added'] = pd.DatetimeIndex(data['year_added']).year
data.head()

Unnamed: 0,show_id,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,day_added,year_added
0,s7104,Tinker Bell and the Legend of the NeverBeast,Steve Loter,"Ginnifer Goodwin, Mae Whitman, Rosario Dawson,...",United States,2014,G,78,Children & Family Movies,When suspicious scout fairies scheme to captur...,January,1,2008
1,s1764,Dilan 1991,"Fajar Bustomi, Pidi Baiq","Iqbaal Ramadhan, Vanesha Prescilla, Ira Wibowo...",Indonesia,2019,TV-14,118,"Dramas, International Movies, Romantic Movies",Dilan's involvement in the motorbike gang impe...,February,4,2008
2,s3244,Jumping the Broom,Salim Akil,"Angela Bassett, Paula Patton, Laz Alonso, Lore...",United States,2011,PG-13,113,"Comedies, Romantic Movies","After a whirlwind romance, a couple rushes to ...",May,5,2009
3,s3834,Mac & Devin Go to High School,Dylan C. Brown,"Snoop Dogg, Wiz Khalifa, Mike Epps, Teairra Ma...",United States,2012,R,76,Comedies,Devin Overstreet may be the class valedictoria...,November,1,2010
4,s6836,The Rover,David Michôd,"Guy Pearce, Robert Pattinson, Scoot McNairy, D...","Australia, United States",2014,R,103,"International Movies, Thrillers","Set in a chaotic future, this Outback saga fol...",October,1,2011


In [3]:
data.shape

(4959, 13)

In [4]:
data.columns

Index(['show_id', 'title', 'director', 'cast', 'country', 'release_year',
       'rating', 'duration', 'listed_in', 'description', 'month_added',
       'day_added', 'year_added'],
      dtype='object')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4959 entries, 0 to 4958
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       4959 non-null   object
 1   title         4959 non-null   object
 2   director      4959 non-null   object
 3   cast          4959 non-null   object
 4   country       4959 non-null   object
 5   release_year  4959 non-null   int64 
 6   rating        4959 non-null   object
 7   duration      4959 non-null   int64 
 8   listed_in     4959 non-null   object
 9   description   4959 non-null   object
 10  month_added   4959 non-null   object
 11  day_added     4959 non-null   int64 
 12  year_added    4959 non-null   int64 
dtypes: int64(4), object(9)
memory usage: 503.8+ KB


### Data Wrangling
Let us split the listed_in column so that each title has distinct categories, and name the new columns genre_1, genre_2, and genre_3.

In [6]:
# Split on ", " so that genre_2 and genre_3 do not keep a leading space
new = data["listed_in"].str.split(', ', expand = True)
data = data.assign(genre_1 = new[0],
            genre_2 = new[1],
            genre_3 = new[2])
data

Unnamed: 0,show_id,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,day_added,year_added,genre_1,genre_2,genre_3
0,s7104,Tinker Bell and the Legend of the NeverBeast,Steve Loter,"Ginnifer Goodwin, Mae Whitman, Rosario Dawson,...",United States,2014,G,78,Children & Family Movies,When suspicious scout fairies scheme to captur...,January,1,2008,Children & Family Movies,,
1,s1764,Dilan 1991,"Fajar Bustomi, Pidi Baiq","Iqbaal Ramadhan, Vanesha Prescilla, Ira Wibowo...",Indonesia,2019,TV-14,118,"Dramas, International Movies, Romantic Movies",Dilan's involvement in the motorbike gang impe...,February,4,2008,Dramas,International Movies,Romantic Movies
2,s3244,Jumping the Broom,Salim Akil,"Angela Bassett, Paula Patton, Laz Alonso, Lore...",United States,2011,PG-13,113,"Comedies, Romantic Movies","After a whirlwind romance, a couple rushes to ...",May,5,2009,Comedies,Romantic Movies,
3,s3834,Mac & Devin Go to High School,Dylan C. Brown,"Snoop Dogg, Wiz Khalifa, Mike Epps, Teairra Ma...",United States,2012,R,76,Comedies,Devin Overstreet may be the class valedictoria...,November,1,2010,Comedies,,
4,s6836,The Rover,David Michôd,"Guy Pearce, Robert Pattinson, Scoot McNairy, D...","Australia, United States",2014,R,103,"International Movies, Thrillers","Set in a chaotic future, this Outback saga fol...",October,1,2011,International Movies,Thrillers,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4954,s5052,Rabun,Yasmin Ahmad,"M. Rajoli, Kartina Aziz, Rozie Rashid, Irwan I...",Malaysia,2004,TV-PG,85,"Dramas, Independent Movies, International Movies",A free-spirited couple leave the city to retir...,January,16,2021,Dramas,Independent Movies,International Movies
4955,s2490,Good Sam,Kate Melville,"Tiya Sircar, Chad Connell, Marco Grazzini, Jes...",United States,2019,TV-PG,90,"Children & Family Movies, Dramas",A curious reporter finds love while pursuing t...,January,1,2021,Children & Family Movies,Dramas,
4956,s6429,The Guernsey Literary and Potato Peel Pie Society,Mike Newell,"Lily James, Michiel Huisman, Penelope Wilton, ...","United Kingdom, France, United States",2018,TV-PG,124,"Dramas, Romantic Movies",A London writer bonds with the colorful reside...,January,1,2021,Dramas,Romantic Movies,
4957,s3233,Judge Singh LLB,Atharv Baluja,"Ravinder Grewal, B.N. Sharma, Sardar Sohi, Cha...",India,2015,TV-PG,137,"Comedies, Dramas, International Movies","Framed by a politician for committing murder, ...",January,1,2021,Comedies,Dramas,International Movies


## 2. Summary views (both visual and numerical)
 - Univariate Numerical Summaries
 - Univariate Visual Idioms
 - Multivariate Numerical Summaries
 - Multivariate Visual Idioms

### Univariate Numerical Summaries

#### Quantitative
- range (i.e., min, max)
- central tendency (i.e., mean, median)
- spread (i.e., standard deviation)

In [7]:
data.agg(
    {
    "release_year": ['min', 'max', 'mean', 'median', 'std'],
    "duration": ['min', 'max', 'median', 'std'],
    "day_added": ['min', 'max', 'median', 'std'],
    "year_added": ['min', 'max', 'median', 'std']
    }
)

Unnamed: 0,release_year,duration,day_added,year_added
min,1942.0,3.0,1.0,2008.0
max,2021.0,253.0,31.0,2021.0
mean,2012.738859,,,
median,2016.0,99.0,12.0,2019.0
std,9.775297,27.023462,9.919747,1.394333


#### Categorical
 - Frequency of each value (i.e., frequency table)

First determine which attributes you are interested in exploring by inspecting data.columns. Here we use 'rating', 'month_added', 'genre_1', 'genre_2', and 'genre_3'. The attributes 'director', 'cast', and 'country' are more diverse and less interesting at this point.


In [8]:
cat_attr = ['rating', 'month_added', 'genre_1', 'genre_2', 'genre_3'] #'director', 'cast', 'country',  less interesting and more diverse. 

In [9]:
for n in cat_attr:
    print(n)
    print(pd.crosstab(index = data[n], columns = 'count'))
    print("\n")

rating
col_0     count
rating         
G            39
NC-17         2
NR           78
PG          239
PG-13       379
R           643
TV-14      1185
TV-G         94
TV-MA      1697
TV-PG       461
TV-Y         68
TV-Y7        66
TV-Y7-FV      3
UR            5


month_added
col_0        count
month_added       
April          400
August         378
December       563
February       310
January        458
July           386
June           334
March          406
May            336
November       473
October        505
September      410


genre_1
col_0                     count
genre_1                        
Action & Adventure          689
Anime Features               19
Children & Family Movies    426
Classic Movies               75
Comedies                   1016
Cult Movies                  12
Documentaries               669
Dramas                     1323
Horror Movies               238
Independent Movies           20
International Movies         99
LGBTQ Movies                  1

The loop above iterates over the list and prints the frequency table for each attribute.

#### Data Munging

Do we want to combine G with TV-G and PG with TV-PG? Let's also drop the ratings that are missing.
As we come to understand the data, we refine the dataset and perform additional transformations.
Reference: https://movielabs.com/md/ratings/v2.3/html/US_TVPG_Ratings.html
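
The notebook does not show a cell for the drop step, although the rating table printed after the merge no longer lists TV-Y7-FV or UR. The sketch below is one possible way to remove sparse categories, assuming those two ratings are the ones meant. The `ratings` frame is a hypothetical toy example, and `to_drop` and `kept` are illustrative names.

```python
import pandas as pd

# Toy ratings column (hypothetical values, not the Netflix data).
ratings = pd.DataFrame({"rating": ["G", "PG", "TV-G", "UR", "TV-Y7-FV", "R"]})

# Assumption: these sparse categories are the ones to drop.
to_drop = ["TV-Y7-FV", "UR"]

# Keep every row whose rating is not in the drop list.
kept = ratings[~ratings["rating"].isin(to_drop)].reset_index(drop=True)
print(kept["rating"].tolist())
```

In the notebook, the same filter would be applied to `data` instead of `ratings`.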

In [10]:
data["rating"] = data["rating"].replace(["G", "PG"], ["TV-G", "TV-PG"])

In [11]:
pd.crosstab(index = data["rating"], columns = 'count')

col_0,count
rating,Unnamed: 1_level_1
NC-17,2
NR,78
PG-13,379
R,643
TV-14,1185
TV-G,133
TV-MA,1697
TV-PG,700
TV-Y,68
TV-Y7,66


### Categorical Univariate Visual Summaries
 - Use bar charts for categorical attributes, e.g., genre_1, month_added, rating, country, cast, etc.

In [12]:
alt.Chart(data).mark_bar().encode(
    y = alt.Y("genre_1", sort = "-x"),
    x = 'count()'
)

### Quantitative Univariate Visual Summaries
Use histograms and density plots for quantitative attributes, e.g., duration and year_added.

In [13]:
alt.Chart(data).mark_bar().encode(
    x = alt.X("duration", bin = True),
    y = 'count()'
)

Next, draw a density plot, then adjust the number of histogram bins so that the histogram resembles it.

In [14]:
alt.Chart(data).mark_area().encode(
    x = "duration",
    y = 'density:Q'
).transform_density(
    'duration',
    ['duration','density']
)

In [15]:
alt.Chart(data).mark_bar().encode(
    x = alt.X("duration", bin = alt.Bin(maxbins = 30)),
    y = 'count()'
)

### Multivariate Numerical Summaries

#### Categorical
- rating and genre_1
- rating and month_added
- rating and genre_2
HINT: use crosstab

In [16]:
pd.crosstab(index = data["rating"], columns = data["genre_1"])

genre_1,Action & Adventure,Anime Features,Children & Family Movies,Classic Movies,Comedies,Cult Movies,Documentaries,Dramas,Horror Movies,Independent Movies,International Movies,LGBTQ Movies,Movies,Music & Musicals,Romantic Movies,Sci-Fi & Fantasy,Stand-Up Comedy,Thrillers
rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
NC-17,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
NR,9,0,0,3,15,2,21,19,4,0,0,0,0,0,0,0,5,0
PG-13,115,2,2,6,102,0,25,90,27,0,0,0,0,0,0,6,0,4
R,167,0,0,22,119,7,21,206,55,8,6,1,1,1,0,2,7,20
TV-14,177,2,14,19,337,1,171,354,29,1,36,0,1,2,1,1,23,16
TV-G,1,0,72,3,4,0,36,10,0,0,4,0,1,1,1,0,0,0
TV-MA,168,1,0,9,325,1,248,515,121,10,42,0,6,8,1,2,232,8
TV-PG,50,10,217,13,112,1,147,125,2,1,10,0,4,2,0,0,6,0
TV-Y,0,0,61,0,0,0,0,0,0,0,1,0,6,0,0,0,0,0
TV-Y7,0,4,57,0,0,0,0,1,0,0,0,0,4,0,0,0,0,0


In [17]:
alt.Chart(data).mark_rect().encode(
    x = 'rating',
    y = 'genre_1',
    color = 'count()'
)

#### Quantitative
- Correlation Matrix for quantitative attributes

What if we wanted to explore whether there is a strong correlation between the quantitative attributes for a specific genre?

In [18]:
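# Restrict to numeric columns; recent pandas versions raise an error on text columns
data.select_dtypes(include = "number").corr()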

Unnamed: 0,release_year,duration,day_added,year_added
release_year,1.0,-0.192602,-0.01743,-0.002656
duration,-0.192602,1.0,0.017151,-0.020095
day_added,-0.01743,0.017151,1.0,0.04104
year_added,-0.002656,-0.020095,0.04104,1.0
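
To look at a single genre, one option is to filter the rows first and then correlate the numeric columns. The sketch below uses a small sample copied from rows displayed earlier in this notebook rather than the full dataset. `sample` and `dramas_corr` are illustrative names, and in the notebook `data` would take the place of `sample`.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "genre_1": ["Dramas", "Comedies", "Comedies", "Dramas", "Dramas", "Comedies"],
        "release_year": [2019, 2011, 2012, 2004, 2018, 2015],
        "duration": [118, 113, 76, 85, 124, 137],
    }
)

# Filter to one genre, then correlate only the quantitative columns.
dramas = sample[sample["genre_1"] == "Dramas"]
dramas_corr = dramas[["release_year", "duration"]].corr()
print(dramas_corr)
```

With only three rows the coefficient itself is not meaningful; the point is the filter-then-correlate pattern.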


### Multivariate Visual Summaries

#### Stacked Bar Charts  - month and genre_1

#### Overlapping Density Plots - duration and rating (keep 3 ratings you care about)

#### Bivariate Outlier Exploration
- Use a scatter plot to depict the values for one categorical and one quantitative attribute


## Additional Analysis

Now that we have an overview of the data, we can start exploring additional questions of interest.
First, summarize the questions that you have been able to answer with the EDA before formulating additional questions of interest.
The questions should be diverse; use the Stasko classification of low-level tasks (e.g., Retrieve Value, Filter, Find Extremum):
- Retrieve Value - Find the longest movie. What is its name, genre, and length?
- Filter - Present the 20 longest movies released after 2005 that have a PG-13 rating.
- Compute Derived Value - What percentage of movies added to the Netflix catalogue in 2018 were Documentaries?
- Compute Derived Value - What is the average length of the movies in a given primary genre?
- Find Extremum - Which genre has the longest movie?
- Sort - Rank movies by their length.
- Determine Range - What is the duration range for movies released in 2000?
- Characterize Distribution - What is the distribution by genre for movies in a given rating group?
- Find Anomalies - What outliers exist for a given genre and rating in terms of movie length?
- Correlate - Is there a relationship between film duration and year of release for a given genre?
- Does Netflix typically add movies on a specific day of the month?

### Retrieve Value - Find the longest movie, what is its name, genre, and length?
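
A numerical answer can be sketched with `idxmax`, which returns the index label of the largest duration. `sample` holds a few rows copied from the notebook's displayed output, so the result below is the longest title in that sample only. In the notebook, `data` would replace `sample`.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "title": ["Dilan 1991", "Jumping the Broom", "The Rover",
                  "The Guernsey Literary and Potato Peel Pie Society", "Judge Singh LLB"],
        "genre_1": ["Dramas", "Comedies", "International Movies", "Dramas", "Comedies"],
        "duration": [118, 113, 103, 124, 137],
    }
)

# Locate the row with the maximum duration and keep the attributes of interest.
longest = sample.loc[sample["duration"].idxmax(), ["title", "genre_1", "duration"]]
print(longest)
```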

### Filter - What are the 20 longest movies released after 2005 that have a PG-13 rating?
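
A sketch of the filter: combine two boolean conditions, then take the top rows with `nlargest`. `longest_titles` is a hypothetical helper, not part of the notebook, and `sample` holds rows copied from the notebook's displayed output. The sample contains a single PG-13 title, so a TV-PG call is included to show the ordering.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "title": ["Jumping the Broom", "Mac & Devin Go to High School", "Rabun", "Good Sam",
                  "The Guernsey Literary and Potato Peel Pie Society", "Judge Singh LLB"],
        "release_year": [2011, 2012, 2004, 2019, 2018, 2015],
        "rating": ["PG-13", "R", "TV-PG", "TV-PG", "TV-PG", "TV-PG"],
        "duration": [113, 76, 85, 90, 124, 137],
    }
)


def longest_titles(df, rating, after_year, n=20):
    """Return up to n longest titles with the given rating released after after_year."""
    mask = (df["release_year"] > after_year) & (df["rating"] == rating)
    return df[mask].nlargest(n, "duration")[["title", "release_year", "rating", "duration"]]


pg13 = longest_titles(sample, "PG-13", 2005)
tv_pg = longest_titles(sample, "TV-PG", 2005)
print(pg13)
print(tv_pg)
```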

### Compute Derived Value

### What percentage of movies added to the Netflix catalogue in 2018 were Documentaries? 

#### Sorted Bar Chart - Attempt 1

#### Compute Derived Value and then use Pie Chart or Stacked Single Bar - Attempt 2
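
The derived value can be computed in pandas before it is charted. The sketch below assumes that only the primary genre (genre_1) counts toward the percentage; `genre_share` is a hypothetical helper. The sample rows, copied from the notebook's displayed output, contain no titles added in 2018, so the demonstration uses 2021 and Dramas. For the question above the call would be `genre_share(data, 2018, "Documentaries")`.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "genre_1": ["Comedies", "International Movies", "Dramas",
                    "Children & Family Movies", "Dramas", "Comedies"],
        "year_added": [2010, 2011, 2021, 2021, 2021, 2021],
    }
)


def genre_share(df, year, genre):
    """Percentage of titles added in `year` whose primary genre is `genre`."""
    added = df[df["year_added"] == year]
    # The mean of a boolean column is the fraction of True values.
    return 100 * (added["genre_1"] == genre).mean()


share = genre_share(sample, 2021, "Dramas")
print(share)
```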

#### Layered View to Create Proportional Single Bar Chart - Attempt 3

### Compute Derived Value - What is the average length of the movies for each rating
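
A minimal sketch using `groupby` and `mean`. `sample` holds rows copied from the notebook's displayed output (with the ratings as first loaded), and `avg_duration` is an illustrative name. In the notebook, `data` would replace `sample`.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "rating": ["G", "TV-14", "PG-13", "R", "R", "TV-PG", "TV-PG", "TV-PG", "TV-PG"],
        "duration": [78, 118, 113, 76, 103, 85, 90, 124, 137],
    }
)

# Average duration per rating, longest first.
avg_duration = sample.groupby("rating")["duration"].mean().sort_values(ascending=False)
print(avg_duration)
```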

### Find Extremum - Which genre has the longest movie?
We have already done this above, but we can do it again here, or even use squares as marks.
We differentiate between squares and circles because a traditional scatter plot has a specific purpose in statistics.
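
As a numerical companion to the chart, the longest title in each genre can be found with a grouped `idxmax`. `sample` holds rows copied from the notebook's displayed output, and `longest_per_genre` and `top_genre` are illustrative names. In the notebook, `data` would replace `sample`.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "title": ["Tinker Bell and the Legend of the NeverBeast", "Dilan 1991",
                  "Jumping the Broom", "The Rover", "Good Sam",
                  "The Guernsey Literary and Potato Peel Pie Society", "Judge Singh LLB"],
        "genre_1": ["Children & Family Movies", "Dramas", "Comedies", "International Movies",
                    "Children & Family Movies", "Dramas", "Comedies"],
        "duration": [78, 118, 113, 103, 90, 124, 137],
    }
)

# Index label of the longest title within each genre.
idx = sample.groupby("genre_1")["duration"].idxmax()

# Look those rows up and order the genres by their longest title.
longest_per_genre = sample.loc[idx, ["genre_1", "title", "duration"]].sort_values(
    "duration", ascending=False
)
top_genre = longest_per_genre.iloc[0]["genre_1"]
print(longest_per_genre)
print(top_genre)
```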

### Sort - Categorize the catalogue by arranging the movies in each primary genre for each rating 
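
One reading of this task is a multi-key sort: by rating, then by primary genre, then by duration in descending order within each group. The sketch below follows that assumption. `sample` holds rows copied from the notebook's displayed output and `ordered` is an illustrative name.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "title": ["Mac & Devin Go to High School", "The Rover", "Rabun", "Good Sam",
                  "The Guernsey Literary and Potato Peel Pie Society", "Judge Singh LLB"],
        "rating": ["R", "R", "TV-PG", "TV-PG", "TV-PG", "TV-PG"],
        "genre_1": ["Comedies", "International Movies", "Dramas",
                    "Children & Family Movies", "Dramas", "Comedies"],
        "duration": [76, 103, 85, 90, 124, 137],
    }
)

# Rating and genre ascending (alphabetical), duration descending within each group.
ordered = sample.sort_values(
    ["rating", "genre_1", "duration"], ascending=[True, True, False]
).reset_index(drop=True)
print(ordered)
```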

### Determine Range - What is the range from when a movie was released to when it was added to Netflix's catalogue?
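
A sketch of the derived column and its range, using rows copied from the notebook's displayed output. `lag` is an illustrative name for year_added minus release_year. Note that several of the displayed rows have a year_added earlier than their release_year, which yields negative values and is worth checking before interpreting the range.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "title": ["Tinker Bell and the Legend of the NeverBeast", "Dilan 1991",
                  "Jumping the Broom", "Mac & Devin Go to High School", "The Rover",
                  "Rabun", "Good Sam",
                  "The Guernsey Literary and Potato Peel Pie Society", "Judge Singh LLB"],
        "release_year": [2014, 2019, 2011, 2012, 2014, 2004, 2019, 2018, 2015],
        "year_added": [2008, 2008, 2009, 2010, 2011, 2021, 2021, 2021, 2021],
    }
)

# Years between release and addition to the catalogue.
lag = sample["year_added"] - sample["release_year"]
print(lag.min(), lag.max())

# Rows where the title appears to have been added before it was released.
suspicious = sample[lag < 0]
print(suspicious["title"].tolist())
```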

### Characterize Distribution - What is the distribution by Genre for movies in a given rating group?
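
A numerical view of this distribution can be sketched with `pd.crosstab` and `normalize="index"`, which turns each rating's row into proportions that sum to 1. `sample` holds rows copied from the notebook's displayed output and `dist` is an illustrative name.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "rating": ["R", "R", "TV-PG", "TV-PG", "TV-PG", "TV-PG"],
        "genre_1": ["Comedies", "International Movies", "Dramas",
                    "Children & Family Movies", "Dramas", "Comedies"],
    }
)

# Share of each primary genre within each rating group.
dist = pd.crosstab(index=sample["rating"], columns=sample["genre_1"], normalize="index")
print(dist)
```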

### Find Anomalies - What outliers exist for a given genre and rating in terms of movie length
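
One common convention, assumed here rather than taken from the notebook, is the 1.5 × IQR rule applied within each genre and rating group. `flag_outliers` is a hypothetical helper and the `toy` frame contains made-up durations chosen so that one value clearly stands out.

```python
import pandas as pd

# Toy data (hypothetical durations, not the Netflix data).
toy = pd.DataFrame(
    {
        "genre_1": ["Dramas"] * 6 + ["Comedies"] * 4,
        "rating": ["R"] * 6 + ["PG-13"] * 4,
        "duration": [90, 95, 100, 105, 110, 250, 80, 85, 90, 95],
    }
)


def flag_outliers(df, by, col="duration", k=1.5):
    """Return rows whose `col` lies outside [Q1 - k*IQR, Q3 + k*IQR] within each group."""
    grouped = df.groupby(by)[col]
    # transform broadcasts each group's quantile back to the rows of that group.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return df[(df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)]


flagged = flag_outliers(toy, by=["genre_1", "rating"])
print(flagged)
```

In the notebook, the call would be `flag_outliers(data, by=["genre_1", "rating"])`.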

### Correlate - Is there a relationship between film duration and year of release for a given genre
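
To compare genres side by side, one option is to compute the Pearson correlation between duration and release_year within each genre. The sketch below skips genres with fewer than three titles, which is an arbitrary threshold chosen for illustration. `sample` holds rows copied from the notebook's displayed output and `corr_by_genre` is an illustrative name.

```python
import pandas as pd

# Small sample copied from rows displayed earlier in the notebook (not the full dataset).
sample = pd.DataFrame(
    {
        "genre_1": ["Dramas", "Comedies", "Comedies", "International Movies",
                    "Dramas", "Dramas", "Comedies"],
        "release_year": [2019, 2011, 2012, 2014, 2004, 2018, 2015],
        "duration": [118, 113, 76, 103, 85, 124, 137],
    }
)

# Correlation of duration with release year, one value per genre.
# Assumption: genres with fewer than 3 titles are skipped.
corr_by_genre = pd.Series(
    {
        genre: group["duration"].corr(group["release_year"])
        for genre, group in sample.groupby("genre_1")
        if len(group) >= 3
    }
)
print(corr_by_genre)
```

A scatter plot of duration against release_year, filtered to one genre, would be the visual counterpart.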

### Does Netflix typically add movies on a specific day of the month?
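
A first numerical check is the share of titles added on each day of the month, computed with `value_counts(normalize=True)`. The values below are the day_added entries from rows displayed earlier in the notebook, and `day_share` is an illustrative name. In the notebook, `data["day_added"]` would be used instead.

```python
import pandas as pd

# day_added values copied from rows displayed earlier in the notebook (not the full dataset).
day_added = pd.Series([1, 4, 5, 1, 1, 16, 1, 1, 1], name="day_added")

# Proportion of titles per day of the month, most frequent first.
day_share = day_added.value_counts(normalize=True).sort_values(ascending=False)
print(day_share)
```

A bar chart of day_added against `count()` would show the same pattern visually.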