# Netflix Movies Analytics
This is an exploratory data analysis (EDA) of a Netflix dataset found on Kaggle. It uses the Plotly library for visualization and implements a basic content-based recommender system that suggests titles similar to a given movie.

The code that makes this notebook possible lives in a custom module, netflixanalytics, found in this repository.


In [18]:
import netflixanalytics.eda as neda
from pandas import read_csv

In [64]:
# A peek at the raw data
library = read_csv("../data/netflix_titles.csv")
library.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


## Preprocessing the data
In this step we drop unnecessary columns such as show_id, keep only items classified as movies, standardize the date format, remove non-ASCII characters that might cause encoding problems, and prepare the duration and description columns for further analysis. The step also creates a new column, target_audience, that classifies titles according to their intended audience.

The function responsible for this is neda.preprocessNetflix, which reads the raw file located at "../data/netflix_titles.csv".
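
The steps listed above can be approximated with plain pandas. The sketch below is not the module's actual implementation: `preprocess_sketch`, `tokenize`, the stop-word list and most of the rating-to-audience mapping are assumptions made for illustration. Only the TV-MA, R and PG-13 mappings are visible in the notebook output.

```python
import re

import pandas as pd

# Hypothetical rating-to-audience mapping. Only TV-MA, R and PG-13 are
# confirmed by the notebook output; every other entry is an assumption.
AUDIENCE_BY_RATING = {
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
    "PG-13": "Teens", "TV-14": "Teens",
    "PG": "Older Kids", "TV-PG": "Older Kids", "TV-Y7": "Older Kids",
    "G": "Kids", "TV-G": "Kids", "TV-Y": "Kids",
}

# Tiny illustrative stop-word list (an assumption, not the module's list).
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "his", "her"}


def tokenize(text):
    """Drop non-ASCII characters and punctuation, then remove stop words."""
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", ascii_text)
    # The comparison is case-sensitive, so capitalized words are kept.
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def preprocess_sketch(raw):
    """Rough approximation of the preprocessing steps described above."""
    # Keep only movies and drop columns that are no longer needed.
    movies = raw[raw["type"] == "Movie"].drop(columns=["show_id", "type"])
    movies = movies.reset_index(drop=True)

    # "December 23, 2016" -> 2016-12-23
    movies["date_added"] = pd.to_datetime(
        movies["date_added"].str.strip(), format="%B %d, %Y"
    )

    # "93 min" -> 93
    minutes = movies["duration"].str.extract(r"(\d+)", expand=False)
    movies["duration"] = pd.to_numeric(minutes)

    # Description text -> list of tokens for later analysis.
    movies["description"] = movies["description"].apply(tokenize)

    # New column classifying each title by intended audience.
    movies["target_audience"] = movies["rating"].map(AUDIENCE_BY_RATING)
    return movies
```

Mapping ratings through a dictionary keeps the audience classification in one place, so the grouping can be adjusted without touching the rest of the pipeline.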

In [20]:
# Preprocess the raw file and preview the first rows
netflix = neda.preprocessNetflix("../data/netflix_titles.csv")
netflix.head()

Unnamed: 0,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,target_audience
0,719,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93,"Dramas, International Movies","[After, devastating, earthquake, hits, Mexico,...",Adults
1,2359,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,78,"Horror Movies, International Movies","[When, army, recruit, found, dead, fellow, sol...",Adults
2,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,80,"Action & Adventure, Independent Movies, Sci-Fi...","[In, postapocalyptic, world, ragdoll, robots, ...",Teens
3,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,123,Dramas,"[A, brilliant, group, students, become, cardco...",Teens
4,122,Yasir Al Yasiri,"Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed...",Egypt,2020-06-01,2019,TV-MA,95,"Horror Movies, International Movies","[After, awful, accident, couple, admitted, gri...",Adults


## Top directors, countries, actors, categories.
Using the country, cast, director and listed_in columns of the netflix dataframe, we find the values that appear most often on the platform and plot them as horizontal bars.

The counting is done with neda.splitAndSummarize and the plotting with the function frequencyAnalysis from the visualization submodule.
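
Columns such as cast and listed_in hold several comma-separated values per row, so they have to be split before counting. A minimal sketch of that idea follows. The function name `split_and_count` and the output column name `count` are assumptions, and the module's own splitAndSummarize may differ in detail.

```python
import pandas as pd


def split_and_count(column, name):
    """Count how often each comma-separated value appears in a column."""
    values = (
        column.dropna()      # rows without a director, cast, etc. are skipped
        .str.split(",")      # "Dramas, International Movies" -> list of items
        .explode()           # one row per item
        .str.strip()         # remove the spaces left over from the split
    )
    values = values[values != ""]
    # value_counts sorts from most to least frequent.
    return values.value_counts().rename_axis(name).reset_index(name="count")
```

Calling `split_and_count(netflix["listed_in"], "category")` would return one row per category with its number of appearances, most frequent first.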

In [21]:
# Count and sort items in each relevant column
directors = neda.splitAndSummarize(netflix["director"], "director")
countries = neda.splitAndSummarize(netflix["country"], "country")
actors = neda.splitAndSummarize(netflix["cast"], "actor")
categories = neda.splitAndSummarize(netflix["listed_in"], "category")


In [22]:
# Import relevant submodule.
import netflixanalytics.visualization as viz
# Top Directors
viz.frequencyAnalysis(directors, 20)

In [23]:
# Top countries with logarithmic scale.
viz.frequencyAnalysis(countries, logX=True)

In [24]:
# Top actors
viz.frequencyAnalysis(actors, 15)

In [25]:
# Top categories
viz.frequencyAnalysis(categories, 18)

## Most common ratings
Who is the main audience on Netflix? The graph below suggests that Netflix content caters mostly to mature audiences and teenagers.

In [26]:
ratings = neda.splitAndSummarize(netflix["rating"], "rating")
viz.targetAudienceCounts(netflix).show()

## Average length of a movie in minutes.
The following histogram shows that a typical movie runs between 60 and 140 minutes, with the most common length around 100 minutes. The longest movie is "Black Mirror: Bandersnatch", clocking in at a whopping 312 minutes, and the shortest is "Silent" at just 3 minutes.

In [27]:
viz.durationDistplot(netflix)

In [28]:
netflix.sort_values("duration")[["title", "duration"]]

Unnamed: 0,title,duration
3916,silent,3
3972,sol levante,5
1069,cops and robbers,8
878,canvas,9
359,american factory a conversation with the obamas,10
...,...,...
3552,raya and sakina,230
2583,lock your girls in,233
3121,no longer kids,237
4751,the school of mischief,253
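
The extremes can also be read off directly instead of scanning the sorted table. The sketch below uses a toy dataframe with three rows copied from the output above; `movies` is a stand-in name for the preprocessed dataframe, so the longest title here is only the longest among those three rows.

```python
import pandas as pd

# Three rows taken from the sorted output above (not the full dataset).
movies = pd.DataFrame(
    {
        "title": ["silent", "sol levante", "the school of mischief"],
        "duration": [3, 5, 253],
    }
)

# nsmallest/nlargest avoid sorting the whole frame just to read one row.
shortest = movies.nsmallest(1, "duration")[["title", "duration"]]
longest = movies.nlargest(1, "duration")[["title", "duration"]]

# The median is less sensitive to extreme durations than the mean.
typical = movies["duration"].median()
```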


## Content added over the years
The next graph shows the number of titles added to the platform over time. Since 2014 the amount of added content has grown very fast, but last year that number dipped, very likely because of the pandemic. In line with our previous analysis, movies aimed at adults and teenagers were the most frequent additions over the years.

In [29]:
viz.dateAddedAnalysis(netflix)
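
The numbers behind such a chart can be obtained by counting titles per year and audience. The following is a sketch under assumptions: `titles_added_per_year` is a hypothetical helper, and it expects date_added to already be a datetime column and target_audience to exist, as in the preprocessed dataframe.

```python
import pandas as pd


def titles_added_per_year(movies):
    """Count titles per year added (rows) and target audience (columns)."""
    year_added = movies["date_added"].dt.year.rename("year_added")
    # crosstab fills combinations that never occur with 0.
    return pd.crosstab(year_added, movies["target_audience"])
```

Each row of the result is a year and each column an audience group, which is the shape a stacked bar or multi-line chart needs.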

## Content Based Recommender
As a final step in this notebook, we implement a basic content-based recommender function. It works by creating a bag of words from each entry of the dataframe and building a similarity matrix, which is then used to find the 5 items most similar to a valid title.
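
To make the idea concrete, here is a minimal bag-of-words recommender written with the standard library only. It is a sketch, not the module's code: the function names are hypothetical, the toy catalog is invented, and which columns feed the bag of words in the real module is not shown in this notebook.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Map each lowercase word to the number of times it appears."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_matrix(texts):
    """Pairwise cosine similarity between all texts."""
    bags = [bag_of_words(text) for text in texts]
    return [[cosine_similarity(x, y) for y in bags] for x in bags]


def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to the given one."""
    if title not in titles:
        raise KeyError(f"unknown title: {title!r}")
    i = titles.index(title)
    others = [j for j in range(len(titles)) if j != i]
    ranked = sorted(others, key=lambda j: similarity[i][j], reverse=True)
    return [titles[j] for j in ranked[:top_n]]


# Invented toy catalog: each text stands in for a title's combined metadata.
toy_titles = ["Movie A", "Movie B", "Movie C"]
toy_texts = ["space robots war", "space robots love", "cooking pasta show"]
toy_similarity = build_similarity_matrix(toy_texts)
closest = recommend("Movie A", toy_titles, toy_similarity, top_n=1)
```

Note that the similarity matrix and the list of titles must come from the same table in the same row order, otherwise the positions looked up in the matrix point at the wrong titles.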

In [30]:
import netflixanalytics.recommender as rec

In [66]:
import importlib
importlib.reload(rec)

<module 'netflixanalytics.recommender' from 'c:\\Users\\Amada\\Documents\\kaggle\\netflixShows\\code\\netflixanalytics\\recommender.py'>

In [67]:
cosineSim = rec.createCosineSimMatrix(netflix)

In [79]:
# With our similarity matrix, all that is left is to ask for recommendations!
# Let's try "March Comes in Like a Lion"
rec.recommender("March Comes in Like a Lion", library, cosineSim)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3058,s3059,TV Show,Iron Man: Armored Adventures,,"Adrian Petriw, Daniel Bacon, Anna Cummer, Vinc...","Canada, United States, United Kingdom, France,...","August 1, 2020",2011,TV-Y7,2 Seasons,Kids' TV,Teen phenom Tony Stark takes to the skies with...
4362,s4363,Movie,"My Teacher, My Obsession",Damián Romay,"Lucy Loken, Laura Bilgeri, Rusty Joiner, Alexa...",United States,"September 10, 2018",2018,TV-14,86 min,Thrillers,"When Riley changed schools, she didn't expect ..."
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
4613,s4614,TV Show,On Children,,"Chuan-Chen Yeh, Chung Hsin-ling, Frances Wu, C...",Taiwan,"August 5, 2018",2018,TV-14,1 Season,"International TV Shows, TV Dramas, TV Mysteries",These uncanny tales reveal a world where indiv...
1112,s1113,Movie,Brother,Julien Abraham,"MHD, Darren Muselet, Aïssa Maïga, Jalil Lesper...",France,"November 22, 2019",2019,TV-MA,97 min,"Dramas, Independent Movies, International Movies",Thrust from a violent home into a brutal custo...
