# Ciencia de datos
# Exploracion de datos de un dataset de Netflix en Python
### Federico Jaramillo
### Daniel Mesa
### 30 08 2020

# Introduccion
###El dataset contiene datos de las series y peliculas de netflix en el año 2019, en este se encuentran todo tipo de descripcion sobre las producciones

## Carga de datos

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#solo se usa en Google Colab
from google.colab import files
files.upload()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix = pd.read_csv("netflix_titles.csv")

#Primeros 2 datos del dataset


In [110]:
netflix.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,season_count
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019,TV-PG,90,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,2019.0,9.0,
1,80117401,Movie,Jandino: Whatever it Takes,NoExist,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,94,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,2016.0,9.0,


##Tipos de variables en el dataset
###Se observa una gran cantidad de variables cualitativas en donde solo el año podria considerarse una variable cuantitativa

In [None]:
netflix.dtypes

show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

### Descripcion de las variables

In [None]:
netflix.count()

show_id         6234
type            6234
title           6234
director        4265
cast            5664
country         5758
date_added      6223
release_year    6234
rating          6224
duration        6234
listed_in       6234
description     6234
dtype: int64

Cada variable contiene 6234, verifiquemos los valores nulos y las frecuencias

### Valores nulos y frecuencias

In [None]:
netflix.describe()

Unnamed: 0,show_id,release_year
count,6234.0,6234.0
mean,76703680.0,2013.35932
std,10942960.0,8.81162
min,247747.0,1925.0
25%,80035800.0,2013.0
50%,80163370.0,2016.0
75%,80244890.0,2018.0
max,81235730.0,2020.0


Solo posee dos variables cuantitativas, show_id y release_year, ambas nos serviran para la descripcion de las producciones, no tendremos en cuenta datos atipico debido que la mayoria son variables cualitativas, por lo tanto procederemos a modificar las nulas, las que esten muy agrupadas y eliminar las que no consideremos importantes

netflix.head

In [109]:
netflix.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,season_count
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019,TV-PG,90,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,2019.0,9.0,
1,80117401,Movie,Jandino: Whatever it Takes,NoExist,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,94,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,2016.0,9.0,


Verificamos cuales columnas tienen valores nulos

In [None]:
print("title : "+str(netflix.title.isnull().values.any()))
print("type : "+str(netflix.type.isnull().values.any()))
print("director : "+str(netflix.director.isnull().values.any()))
print("description : "+str(netflix.description.isnull().values.any()))
print("cast : "+str(netflix.cast.isnull().values.any()))
print("country : "+str(netflix.country.isnull().values.any()))
print("year : "+str(netflix.release_year.isnull().values.any()))
print("duration : "+str(netflix.duration.isnull().values.any()))
print("listenin : "+str(netflix.listed_in.isnull().values.any()))

title : False
type : False
director : True
description : False
cast : True
country : True
year : False
duration : False
listenin : False


Vemos que 3 de las variables contienen valores en NaN, segun el contenido de las variables se pierde informacion debido a que son variables que contienen el origen de las cintas y quienes estan en ellas

Creamos un nuevo dataframe para conservar los datos originales

In [None]:
df =netflix

Se crean nuevas variables, para separar fecha y tener el año de agregacion de manera individual, luego se paramos los varoles de duracion para las series y las peliculas, las primeras estan en temporadas mientras que la segunda esta en minutos

In [108]:
df["date_added"] = pd.to_datetime(df['date_added'])
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

df['season_count'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" in x['duration'] else "", axis = 1)
df['duration'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" not in x['duration'] else "", axis = 1)
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,season_count
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019,TV-PG,90,"Children & Family Movies, Comedies",2019.0,9.0,
1,80117401,Movie,Jandino: Whatever it Takes,NoExist,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,94,Stand-Up Comedy,2016.0,9.0,


Eliminamos los valores nulos por una variable que sea cuantificable en frecuencia absoluta por variable

In [None]:
df['director'] = df['director'].fillna("NoExist")
df['cast'] = df['cast'].fillna("NoExist")
df['country'] = df['country'].fillna("NoExist")

Vemos que los valores nulos se han eliminado

In [None]:
print("title : "+str(df.title.isnull().values.any()))
print("type : "+str(df.type.isnull().values.any()))
print("director : "+str(df.director.isnull().values.any()))
print("description : "+str(df.description.isnull().values.any()))
print("cast : "+str(df.cast.isnull().values.any()))
print("country : "+str(df.country.isnull().values.any()))
print("year : "+str(df.release_year.isnull().values.any()))
print("duration : "+str(df.duration.isnull().values.any()))
print("listenin : "+str(df.listed_in.isnull().values.any()))

title : False
type : False
director : False
description : False
cast : False
country : False
year : False
duration : False
listenin : False


In [105]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,season_count
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019,TV-PG,90,"Children & Family Movies, Comedies",2019.0,9.0,
1,80117401,Movie,Jandino: Whatever it Takes,NoExist,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,94,Stand-Up Comedy,2016.0,9.0,


Consideramos que la variable description no es de utilidad, es muy larga y los elementos que hay en ella a simple vista estan descritos por otras variables

In [None]:
df = df.drop('description', axis = 1)

In [106]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,season_count
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",2019-09-09,2019,TV-PG,90,"Children & Family Movies, Comedies",2019.0,9.0,
1,80117401,Movie,Jandino: Whatever it Takes,NoExist,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,94,Stand-Up Comedy,2016.0,9.0,


## Relacion entre algunas variables (sodeo)

Esta libreria nos permite construir graficos mas completos y elaborados

In [None]:
#libreria para graficas interactivas
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

Cantidad de pelicuales y series en el dataset (variable "type)

In [104]:
grouped = df["type"].value_counts().reset_index()
grouped = grouped.rename(columns = {"type" : "count", "index" : "type"})

trace = go.Pie(labels=grouped["type"], values=grouped['count'], pull=[0.05, 0], marker=dict(colors=["#6ad49b", "#a678de"]))
layout = go.Layout(title="", width=1000, height=450, legend=dict(x=0.1, y=1.1))
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)

Para la construccion de este grafico contamos el numero de series y peliculas por año y se relaciona con la variable "release_year" que hace las veces de clases para las frecuencias absolutas

In [103]:
col = "release_year"

vc1 = d1[col].value_counts().reset_index()
vc1 = vc1.rename(columns = {col : "count", "index" : col})
vc1['percent'] = vc1['count'].apply(lambda x : 100*x/sum(vc1['count']))
vc1 = vc1.sort_values(col)

vc2 = d2[col].value_counts().reset_index()
vc2 = vc2.rename(columns = {col : "count", "index" : col})
vc2['percent'] = vc2['count'].apply(lambda x : 100*x/sum(vc2['count']))
vc2 = vc2.sort_values(col)

trace1 = go.Bar(x=vc1[col], y=vc1["count"], name="TV Shows", marker=dict(color="#a678de"))
trace2 = go.Bar(x=vc2[col], y=vc2["count"], name="Movies", marker=dict(color="#6ad49b"))
data = [trace1, trace2]
layout = go.Layout(title="Contenido agregado a traves de los años", legend=dict(x=0.1, y=1.1, orientation="h"), width=1000, height=450)
fig = go.Figure(data, layout=layout)
fig.show()

Creamos un nuevo dataframe con los datos por año de agregacion a la plataforma organizados para evitar dañar el actual

In [None]:
peliculasporaño = df.sort_values("year_added", ascending= True)

In [107]:
peliculasporaño.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,season_count
3790,70157452,TV Show,Dinner for Five,NoExist,NoExist,United States,2008-02-04,2007,TV-MA,,Stand-Up Comedy & Talk Shows,2008.0,2.0,
3742,70053412,Movie,To and From New York,Sorin Dan Mihalcescu,"Barbara King, Shaana Diya, John Krisiukenas, Y...",United States,2008-01-01,2006,NR,81.0,"Dramas, Independent Movies, Thrillers",2008.0,1.0,


In [None]:
peliculasporaño.title

3790                                Dinner for Five
3742                           To and From New York
1237                                       Splatter
1590                        Just Another Love Story
1543                    Mad Ron's Prevues from Hell
                           ...                     
6229                                   Red vs. Blue
6230                                          Maron
6231         Little Baby Bum: Nursery Rhyme Friends
6232    A Young Doctor's Notebook and Other Stories
6233                                        Friends
Name: title, Length: 6234, dtype: object

Estas son de las primeras peliculas agregadas a la plataforma

In [None]:
fig = go.Figure(data=[go.Table(
    header=dict(values=['Pelicula', 'año'],
                line_color='darkslategray',
                fill_color='lightskyblue',
                align='left'),
    cells=dict(values=[peliculasporaño.title.head(10),
                       peliculasporaño.year_added.head(10)],
               line_color='darkslategray',
               fill_color='lightcyan',
               align='left'))
])

fig.update_layout(width=1000, height=450)
fig.show()