# How to Develop a Data Science Project

## CRISP-DM
<img src='https://miro.medium.com/v2/resize:fit:640/0*tA5OjppLK627FfFo'>

## Data Preparation

### Data preparation tasks
- a) Data cleaning
    - Are there null, duplicated, or inconsistent values?
    - How will the data be processed?
    - How will we handle missing values?
    - How will we handle duplicated records?
    - How will we handle inconsistent values?
- b) Data integration
    - Is there external data we can use?
    - Is there data that will need to be integrated?
- c) Data reduction
    - Is all of the data relevant to the analysis?
    - Does the data fit in memory?
- d) Data transformation
    - Do the data need to be normalized or standardized?
    - Can we discretize the data?
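
Several of the questions above can be answered with a few pandas one-liners before any cleaning starts. The sketch below uses a small hypothetical DataFrame (`toy` and its columns are illustrative, not part of the Netflix dataset) to measure the share of missing values, count fully duplicated rows, and estimate the memory footprint.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "country": ["Brazil", "India", "India", None],
})

# Percentage of missing values per column.
null_share = toy.isna().mean() * 100

# Number of rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Approximate in-memory size in bytes (deep=True also measures string contents).
memory_bytes = int(toy.memory_usage(deep=True).sum())

print(null_share)
print(n_duplicates, memory_bytes)
```

The same three checks are applied to the real dataset, step by step, in the sections that follow.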

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### 1. Reading the Data

In [2]:
df = pd.read_csv("../data/netflix_titles.csv")
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...
...,...,...,...,...,...,...,...,...,...,...,...,...
6229,80000063,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
6230,70286564,TV Show,Maron,,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,,2016,TV-MA,4 Seasons,TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
6231,80116008,Movie,Little Baby Bum: Nursery Rhyme Friends,,,,,2016,,60 min,Movies,Nursery rhymes and original music for children...
6232,70281022,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       6234 non-null   int64 
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64 
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB


In [4]:
df.describe(include='all')

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
count,6234.0,6234,6234,4265,5664,5758,6223,6234.0,6224,6234,6234,6234
unique,,2,6172,3301,5469,554,1524,,14,201,461,6226
top,,Movie,The Silence,"Raúl Campos, Jan Suter",David Attenborough,United States,"January 1, 2020",,TV-MA,1 Season,Documentaries,A surly septuagenarian gets another chance at ...
freq,,4265,3,18,18,2032,122,,2027,1321,299,3
mean,76703680.0,,,,,,,2013.35932,,,,
std,10942960.0,,,,,,,8.81162,,,,
min,247747.0,,,,,,,1925.0,,,,
25%,80035800.0,,,,,,,2013.0,,,,
50%,80163370.0,,,,,,,2016.0,,,,
75%,80244890.0,,,,,,,2018.0,,,,


### 2. Data Preparation

**Methods you will learn**
- `drop()`: removes rows or columns
- `duplicated()`: flags duplicated rows
- `drop_duplicates()`: removes duplicated rows
- `dropna()`: removes rows or columns that contain null values
- `value_counts()`: counts how many times each distinct value occurs
- `explode()`: turns each item of a list-valued column into its own row
- `split()`: splits a string into a list
- `strip()`: removes leading and trailing whitespace from a string
- `cut()`: bins continuous or numeric data into categories (discretization)
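
Since `cut()` is the least self-explanatory method in the list, here is a minimal sketch of discretization with `pd.cut`. The bin edges and labels are assumptions chosen only for illustration; the years are sample values in the same range as `release_year`.

```python
import pandas as pd

years = pd.Series([1925, 1999, 2013, 2016, 2019], name="release_year")

# Illustrative bin edges and labels (assumptions, not taken from the dataset).
bins = [1900, 2000, 2010, 2020]
labels = ["before 2001", "2001-2010", "2011-2020"]

# Intervals are right-closed by default: (1900, 2000], (2000, 2010], (2010, 2020].
era = pd.cut(years, bins=bins, labels=labels)

print(era.value_counts(sort=False))
```

The result is a categorical Series, so it can be passed straight to `value_counts()` or used as a grouping key.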

##### Data selection

Are all columns important for the analysis?

In [5]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [6]:
# if a column is not important, you can remove it with drop()
df.drop('rating', axis=1)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,99 min,Comedies,When nerdy high schooler Dani finally attracts...
...,...,...,...,...,...,...,...,...,...,...,...
6229,80000063,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
6230,70286564,TV Show,Maron,,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,,2016,4 Seasons,TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
6231,80116008,Movie,Little Baby Bum: Nursery Rhyme Friends,,,,,2016,60 min,Movies,Nursery rhymes and original music for children...
6232,70281022,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
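
Note that `drop()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch with a hypothetical two-column frame (`toy` is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "rating": ["TV-MA", "TV-14"]})

# drop() returns a new object; `toy` itself is not modified.
without_rating = toy.drop(columns="rating")

print(list(toy.columns))             # the original still contains 'rating'
print(list(without_rating.columns))  # the copy does not
```

This is why `rating` still appears in the later `df.head()` outputs: the cell above only displays the result of the drop.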


##### Checking for null values

In [11]:
df.isna().sum()/df.shape[0]*100

show_id          0.000000
type             0.000000
title            0.000000
director        31.584857
cast             9.143407
country          7.635547
date_added       0.176452
release_year     0.000000
rating           0.160411
duration         0.000000
listed_in        0.000000
description      0.000000
dtype: float64
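
Once the share of nulls per column is known, two common options are dropping the affected rows with `dropna()` or filling them with an explicit placeholder. The sketch below shows both on hypothetical data; the `"Unknown"` label and the choice of which column to drop on are assumptions, not decisions made in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "date_added": ["January 1, 2020", None, "March 2, 2019"],
})

# Drop rows only where a rarely-missing column is null.
kept = toy.dropna(subset=["date_added"])

# Keep the remaining rows and make the missing category explicit.
filled = kept.fillna({"director": "Unknown"})

print(filled)
```

Dropping suits columns such as `date_added` or `rating`, where under 1% of rows are affected; for `director`, with roughly 32% missing, dropping would discard a large part of the dataset.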

In [13]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [14]:
#df[df['country'] == '']

##### Checking for duplicated rows using `duplicated()`

In [18]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [17]:
df[df.duplicated()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


In [19]:
# checking whether any rows share the same show_id
df[df.duplicated(subset='show_id')]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


In [20]:
df[df.duplicated(subset=['title', 'director'])]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
378,80133200,TV Show,Maniac,,"Espen Petrus Andersen Lervaag, Håkon Bast Moss...",Norway,"September 1, 2017",2015,TV-MA,1 Season,"International TV Shows, TV Comedies, TV Dramas",A nondescript man in a mental hospital has an ...
843,81167047,TV Show,Tunnel,,"Choi Jin-hyuk, Yoon Hyun-min, Lee Yoo-young, C...",South Korea,"October 1, 2019",2017,TV-MA,1 Season,"Crime TV Shows, International TV Shows, Korean...","While chasing a serial murderer, a detective e..."
2124,81072516,Movie,Sarkar,A.R. Murugadoss,"Vijay, Varalakshmi Sarathkumar, Keerthi Suresh...",India,"March 2, 2019",2018,TV-MA,162 min,"Action & Adventure, Dramas, International Movies",A ruthless businessman’s mission to expose ele...
2388,80238292,Movie,The Silence,Gajendra Ahire,"Raghuvir Yadav, Nagraj Manjule, Anjali Patil, ...",India,"March 1, 2018",2015,TV-MA,91 min,"Dramas, International Movies","After encountering a scene of sexual violence,..."
4063,80063658,TV Show,The Code,,Marcus du Sautoy,United Kingdom,"February 1, 2018",2011,TV-PG,1 Season,"British TV Shows, Docuseries, International TV...","In this three-part documentary series, host Ma..."
4740,80178404,TV Show,Oh My Ghost,,"Park Bo-young, Cho Jung-seok, Lim Ju-hwan, Kim...",South Korea,"August 8, 2019",2015,TV-14,1 Season,"International TV Shows, Korean TV Shows, Roman...","Possessed by the ghost of a lustful virgin, a ..."
5046,80995575,Movie,The Birth Reborn,Eduardo Chauvet,,Brazil,"August 10, 2018",2013,TV-14,91 min,"Documentaries, International Movies",Mothers and medical professionals speak about ...
5327,70225722,TV Show,Life,,David Attenborough,"United Kingdom, United States, Greece, Italy, ...","April 22, 2015",2009,TV-PG,1 Season,"British TV Shows, Docuseries, International TV...",Discover the glorious variety of life on Earth...
5578,80085439,Movie,Frank and Cindy,G.J. Echternkamp,"Rene Russo, Oliver Platt, Johnny Simmons, Jane...",United States,"April 1, 2016",2015,R,102 min,"Comedies, Dramas, Independent Movies",A student filmmaker vengefully turns his camer...
5700,70155629,TV Show,Charmed,,"Holly Marie Combs, Alyssa Milano, Brian Krause...",United States,"October 1, 2017",2005,TV-PG,8 Seasons,"Romantic TV Shows, TV Comedies, TV Dramas",After a trio of sisters discover an ancient te...
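
By default `duplicated()` flags only the second and later occurrences, so the first member of each pair is missing from the output above. Passing `keep=False` flags every member of a duplicated group, which makes side-by-side inspection easier. Missing values are treated as equal to each other in this comparison, which is why rows with a `NaN` director appear in the result. A sketch on hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": [1, 2, 3],
    "title": ["Maniac", "Maniac", "Life"],
    "director": [None, None, None],
    "release_year": [2018, 2015, 2009],
})

# keep=False marks all rows of a duplicated group, not only the repeats.
mask = toy.duplicated(subset=["title", "director"], keep=False)
candidates = toy[mask].sort_values("title")

print(candidates)
```

Comparing other columns of the candidates, such as `release_year`, helps decide whether they are true duplicates or different works with the same title.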


##### Removing duplicated rows using `drop_duplicates()`

In [21]:
shape_before = df.shape[0]
df.drop_duplicates(inplace=True)
print('{} rows were removed'.format(shape_before - df.shape[0]))

0 rows were removed


### 2.1 Data Cleaning and Formatting - Univariate Analysis

##### Analyzing `show_id` - the identifier of the movie or series

In [22]:
df['show_id'].nunique()

6234

In [23]:
df['show_id'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6234 entries, 0 to 6233
Series name: show_id
Non-Null Count  Dtype
--------------  -----
6234 non-null   int64
dtypes: int64(1)
memory usage: 97.4 KB


- All `show_id` values are unique: 6234 distinct values for 6234 rows

##### Analyzing ``type`` - whether the title is a movie or a series

In [24]:
df['type']

0         Movie
1         Movie
2       TV Show
3       TV Show
4         Movie
         ...   
6229    TV Show
6230    TV Show
6231      Movie
6232    TV Show
6233    TV Show
Name: type, Length: 6234, dtype: object

In [25]:
df['type'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6234 entries, 0 to 6233
Series name: type
Non-Null Count  Dtype 
--------------  ----- 
6234 non-null   object
dtypes: object(1)
memory usage: 97.4+ KB


In [26]:
df['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

In [27]:
df['type'].value_counts()

Movie      4265
TV Show    1969
Name: type, dtype: int64

In [28]:
df['type'].value_counts(normalize=True)

Movie      0.684151
TV Show    0.315849
Name: type, dtype: float64

##### Analyzing `title` - the title of the movie or series

In [29]:
df['title'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6234 entries, 0 to 6233
Series name: title
Non-Null Count  Dtype 
--------------  ----- 
6234 non-null   object
dtypes: object(1)
memory usage: 97.4+ KB


In [30]:
df['title'].unique()

array(['Norm of the North: King Sized Adventure',
       'Jandino: Whatever it Takes', 'Transformers Prime', ..., 'Maron',
       "A Young Doctor's Notebook and Other Stories", 'Friends'],
      dtype=object)

In [31]:
df['title'].nunique()

6172

In [33]:
df[df.duplicated('title')]['title'].unique()

array(['Maniac', 'Tunnel', 'Love', 'Drive', 'Manhunt', 'Limitless',
       'The Outsider', 'Shadow', 'Sarkar', 'Benji', 'The Silence',
       'Tiger', 'Wet Hot American Summer', 'Oh My Ghost', 'Us and Them',
       'Deep', 'Retribution', 'Solo', 'Godzilla', 'The Lovers',
       'People You May Know', 'Blood Money', 'The Code', 'The Oath',
       'The Secret', 'Carrie', 'The Gift', 'One Day', 'Don', 'Zoom',
       'The Innocents', 'The Iron Lady', 'The Birth Reborn', 'The Saint',
       'The In-Laws', 'Troy', 'Life', 'We Are Family', 'Persona',
       'Love O2O', 'Prince', 'Lavender', 'Frank and Cindy', 'Locked Up',
       'Zoo', 'Charmed', 'Bleach', 'Top Boy', 'Supergirl',
       "She's Gotta Have It", 'Rosario Tijeras', 'Aquarius', 'Kakegurui',
       'Hostages', 'Lovesick', 'Skins',
       'Little Baby Bum: Nursery Rhyme Friends'], dtype=object)

In [34]:
df[df['title']=='Maniac']
# not a duplicated record

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
150,80124522,TV Show,Maniac,,"Emma Stone, Jonah Hill, Justin Theroux, Sally ...",United States,"September 21, 2018",2018,TV-MA,1 Season,"TV Comedies, TV Dramas, TV Mysteries",Two struggling strangers connect during a mind...
378,80133200,TV Show,Maniac,,"Espen Petrus Andersen Lervaag, Håkon Bast Moss...",Norway,"September 1, 2017",2015,TV-MA,1 Season,"International TV Shows, TV Comedies, TV Dramas",A nondescript man in a mental hospital has an ...


In [35]:
df[df['title']=='Tunnel']
# not a duplicated record: different works can share the same title (it is important to validate this with a focal point)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
477,81142594,TV Show,Tunnel,,"Todsapol Maisuk, Johnny Hao, Morakot Liu, Pemy...",,"October 30, 2019",2019,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Dramas",A detective finds himself 30 years in the futu...
843,81167047,TV Show,Tunnel,,"Choi Jin-hyuk, Yoon Hyun-min, Lee Yoo-young, C...",South Korea,"October 1, 2019",2017,TV-MA,1 Season,"Crime TV Shows, International TV Shows, Korean...","While chasing a serial murderer, a detective e..."
2498,80132626,Movie,Tunnel,Seong-hun Kim,"Jung-woo Ha, Doona Bae, Dal-su Oh, Jongsoo Kim",South Korea,"June 3, 2017",2016,TV-14,126 min,"Dramas, International Movies, Thrillers","A poorly constructed tunnel collapses, leaving..."


##### Analyzing `director` - the director of the title

In [36]:
df['director']
# note that a single row can hold several directors, which will make the analysis harder

0       Richard Finn, Tim Maltby
1                            NaN
2                            NaN
3                            NaN
4               Fernando Lebrija
                  ...           
6229                         NaN
6230                         NaN
6231                         NaN
6232                         NaN
6233                         NaN
Name: director, Length: 6234, dtype: object

In [37]:
df['director'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6234 entries, 0 to 6233
Series name: director
Non-Null Count  Dtype 
--------------  ----- 
4265 non-null   object
dtypes: object(1)
memory usage: 97.4+ KB


##### Converting a column's string values into lists using `split()`

In [41]:
df['director']

0       Richard Finn, Tim Maltby
1                            NaN
2                            NaN
3                            NaN
4               Fernando Lebrija
                  ...           
6229                         NaN
6230                         NaN
6231                         NaN
6232                         NaN
6233                         NaN
Name: director, Length: 6234, dtype: object

In [45]:
df['director'].str.split(',')

0       [Richard Finn,  Tim Maltby]
1                               NaN
2                               NaN
3                               NaN
4                [Fernando Lebrija]
                   ...             
6229                            NaN
6230                            NaN
6231                            NaN
6232                            NaN
6233                            NaN
Name: director, Length: 6234, dtype: object

In [46]:
df['director'] = df['director'].str.split(',')

In [47]:
df['director']

0       [Richard Finn,  Tim Maltby]
1                               NaN
2                               NaN
3                               NaN
4                [Fernando Lebrija]
                   ...             
6229                            NaN
6230                            NaN
6231                            NaN
6232                            NaN
6233                            NaN
Name: director, Length: 6234, dtype: object
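
Splitting on a bare comma keeps the space that follows each separator, which is why the second name above appears as `' Tim Maltby'`. One alternative is to split on a regular expression that absorbs the surrounding whitespace. This is a sketch, assuming every separator is a comma with optional whitespace around it and a pandas version that supports the `regex` argument of `str.split` (1.4 or later).

```python
import pandas as pd

directors = pd.Series(["Richard Finn, Tim Maltby", None, "Fernando Lebrija"])

# Plain split: the space after the comma stays attached to the next name.
raw = directors.str.split(",")

# Regex split: the comma and any whitespace around it are consumed together.
clean = directors.str.split(r"\s*,\s*", regex=True)

print(raw[0])
print(clean[0])
```

Missing values pass through both calls unchanged, so the `NaN` rows remain `NaN`.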

##### Turning list values into DataFrame rows using `explode()`

In [48]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"[Richard Finn, Tim Maltby]","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,[Fernando Lebrija],"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [49]:
df = df.explode('director')

In [50]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
0,81145628,Movie,Norm of the North: King Sized Adventure,Tim Maltby,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...


In [51]:
df['director'].unique()

array(['Richard Finn', ' Tim Maltby', nan, ..., ' Seung-uk Jo',
       'Andy Devonshire', 'Onur Ünlü'], dtype=object)

##### Removing leading and trailing whitespace from strings using `strip()`

In [52]:
df['director'] = df['director'].str.strip()

In [53]:
df['director'].unique()

array(['Richard Finn', 'Tim Maltby', nan, ..., 'Seung-uk Jo',
       'Andy Devonshire', 'Onur Ünlü'], dtype=object)

In [54]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
0,81145628,Movie,Norm of the North: King Sized Adventure,Tim Maltby,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...


#### Resetting the DataFrame index so the labels become ``[0, 1, 2, ..., n]``

In [55]:
df.reset_index(inplace=True, drop=True)

In [56]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,81145628,Movie,Norm of the North: King Sized Adventure,Tim Maltby,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
2,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
3,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
4,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
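
`explode()` repeats the original index label for every item that came from the same row, which is why label `0` appeared twice before the reset. Passing `ignore_index=True` renumbers the rows in the same call. A minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Norm of the North", "#realityhigh"],
    "director": [["Richard Finn", "Tim Maltby"], ["Fernando Lebrija"]],
})

# Default behaviour: index labels are repeated per exploded item.
repeated = toy.explode("director")

# ignore_index=True yields a fresh 0..n-1 index, like reset_index(drop=True).
renumbered = toy.explode("director", ignore_index=True)

print(repeated.index.tolist())
print(renumbered.index.tolist())
```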


##### Analyzing `cast` - the actors involved in the title

Like ``director``, the columns ``cast``, ``country``, and ``listed_in`` are composite: each cell can hold several comma-separated values. We apply the same split, explode, and strip steps to them to speed up the preparation of this data.
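
The three steps can be wrapped in a small helper. The function below is a sketch: `split_and_explode` and the toy data are hypothetical names introduced for illustration, not code from this notebook.

```python
import pandas as pd

def split_and_explode(frame: pd.DataFrame, column: str, sep: str = ",") -> pd.DataFrame:
    """Return a copy of `frame` with one row per item of a delimited string column."""
    out = frame.copy()
    out[column] = out[column].str.split(sep)       # string -> list
    out = out.explode(column, ignore_index=True)   # list items -> rows
    out[column] = out[column].str.strip()          # trim stray whitespace
    return out

# Hypothetical toy data with two composite columns.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "cast": ["Ann, Bob", None],
    "country": ["Brazil, India", "Norway"],
})

result = toy
for name in ["cast", "country"]:
    result = split_and_explode(result, name)

print(result)
```

Each exploded column multiplies the rows of a title by its number of items (title `A` becomes 2 x 2 = 4 rows here), so later counts per title should rely on distinct identifiers rather than on raw row counts.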

In [57]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,81145628,Movie,Norm of the North: King Sized Adventure,Tim Maltby,"Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
2,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
3,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
4,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...


In [58]:
df['cast'] = df['cast'].str.split(',')
df['country'] = df['country'].str.split(',')
df['listed_in'] = df['listed_in'].str.split(',')

In [59]:
df = df.explode('cast', ignore_index=True)
df['cast'] = df['cast'].str.strip()

In [60]:
df['cast'].unique()

array(['Alan Marriott', 'Andrew Toth', 'Brian Dobson', ...,
       'Margaret Clunie', 'Tim Steed', 'Matthew Perry'], dtype=object)

In [61]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,"[United States, India, South Korea, China]","September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Andrew Toth,"[United States, India, South Korea, China]","September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Brian Dobson,"[United States, India, South Korea, China]","September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Cole Howard,"[United States, India, South Korea, China]","September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Jennifer Cameron,"[United States, India, South Korea, China]","September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...


In [62]:
df = df.explode('country', ignore_index=True)
df['country'] = df['country'].str.strip()

In [63]:
df['country'].unique()

array(['United States', 'India', 'South Korea', 'China', 'United Kingdom',
       'Spain', 'Bulgaria', 'Canada', 'Chile', 'Denmark', 'Sweden', nan,
       'Netherlands', 'Belgium', 'France', 'Uruguay', '', 'Thailand',
       'Pakistan', 'Turkey', 'Brazil', 'Italy', 'Indonesia', 'Ireland',
       'Hong Kong', 'Morocco', 'Mexico', 'Vietnam', 'Argentina',
       'Nigeria', 'Japan', 'Germany', 'Greece', 'Norway', 'Switzerland',
       'United Arab Emirates', 'Lebanon', 'Cambodia', 'Russia', 'Poland',
       'Israel', 'Australia', 'Portugal', 'Egypt', 'Taiwan', 'Austria',
       'Czech Republic', 'Senegal', 'Soviet Union', 'Peru', 'Ukraine',
       'Malaysia', 'South Africa', 'Nepal', 'Philippines', 'Serbia',
       'Colombia', 'Singapore', 'Luxembourg', 'West Germany',
       'Cayman Islands', 'New Zealand', 'Venezuela', 'Jordan',
       'Saudi Arabia', 'Iceland', 'Malta', 'Qatar', 'Iran', 'Finland',
       'Iraq', 'Liechtenstein', 'Albania', 'Slovakia', 'Samoa', 'Ghana',
       'Botswana'

##### Note that the ``country`` column contains empty strings

Let's check which rows are affected.

In [64]:
df[df['country'] == '']

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
346,80203094,Movie,City of Joy,Madeleine Gavin,,,"September 7, 2018",2018,TV-MA,77 min,[Documentaries],Women who've been sexually brutalized in war-t...
7594,81168347,Movie,Hotel Pacific,Janusz Majewski,Marek Kondrat,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7596,81168347,Movie,Hotel Pacific,Janusz Majewski,Roman Wilhelmi,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7598,81168347,Movie,Hotel Pacific,Janusz Majewski,Roman Skamene,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7600,81168347,Movie,Hotel Pacific,Janusz Majewski,Čestmír Řanda,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7602,81168347,Movie,Hotel Pacific,Janusz Majewski,Michał Pawlicki,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7604,81168347,Movie,Hotel Pacific,Janusz Majewski,Martin Hron,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7606,81168347,Movie,Hotel Pacific,Janusz Majewski,Stanisława Celińska,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7608,81168347,Movie,Hotel Pacific,Janusz Majewski,Joanna Kasperska,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7610,81168347,Movie,Hotel Pacific,Janusz Majewski,Jaroslava Schallerová,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...


In [65]:
# option 2: match empty strings by their length
df[df['country'].str.len()==0]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
346,80203094,Movie,City of Joy,Madeleine Gavin,,,"September 7, 2018",2018,TV-MA,77 min,[Documentaries],Women who've been sexually brutalized in war-t...
7594,81168347,Movie,Hotel Pacific,Janusz Majewski,Marek Kondrat,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7596,81168347,Movie,Hotel Pacific,Janusz Majewski,Roman Wilhelmi,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7598,81168347,Movie,Hotel Pacific,Janusz Majewski,Roman Skamene,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7600,81168347,Movie,Hotel Pacific,Janusz Majewski,Čestmír Řanda,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7602,81168347,Movie,Hotel Pacific,Janusz Majewski,Michał Pawlicki,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7604,81168347,Movie,Hotel Pacific,Janusz Majewski,Martin Hron,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7606,81168347,Movie,Hotel Pacific,Janusz Majewski,Stanisława Celińska,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7608,81168347,Movie,Hotel Pacific,Janusz Majewski,Joanna Kasperska,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...
7610,81168347,Movie,Hotel Pacific,Janusz Majewski,Jaroslava Schallerová,,"October 1, 2019",1975,TV-MA,96 min,"[Classic Movies, Dramas, International Movies]",A motivated teen strives to rise from dishwash...


In [66]:
shape_before = df.shape[0]
df.drop(df[df['country'] == ''].index, inplace=True)
print('{} rows were removed'.format(shape_before-df.shape[0]))

11 rows were removed
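
Empty strings like these typically appear when the raw text has a trailing or doubled comma: splitting it yields an empty piece, which survives ``explode``. The sketch below illustrates this on made-up data. The `toy` frame and its values are hypothetical, and it assumes the column was split on commas before being exploded.

```python
import pandas as pd

# Hypothetical toy data: the second value ends with a trailing comma.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "country": ["United States, India", "Poland,"],
})

# Split into lists, then give each list element its own row.
toy["country"] = toy["country"].str.split(",")
toy = toy.explode("country", ignore_index=True)
toy["country"] = toy["country"].str.strip()

# The trailing comma produced an empty string, not NaN.
empty_mask = toy["country"] == ""
n_empty = int(empty_mask.sum())

# Keep only the rows with a real country name.
cleaned = toy[~empty_mask].reset_index(drop=True)
print(cleaned)
```

Because the blank is an empty string rather than ``NaN``, ``isna()`` would not catch it, which is why the comparison with ``''`` is needed.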


In [67]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,"September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,"September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,"September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,China,"September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Andrew Toth,United States,"September 9, 2019",2019,TV-PG,90 min,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...


In [68]:
df = df.explode('listed_in', ignore_index=True)
df['listed_in'] = df['listed_in'].str.strip()

In [69]:
df['director'].unique()

array(['Richard Finn', 'Tim Maltby', nan, ..., 'Seung-uk Jo',
       'Andy Devonshire', 'Onur Ünlü'], dtype=object)

In [71]:
# same length-based check (option 2), now for director
df[df['director'].str.len()==0]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


In [70]:
df['listed_in'].unique()

array(['Children & Family Movies', 'Comedies', 'Stand-Up Comedy',
       "Kids' TV", 'Crime TV Shows', 'International TV Shows',
       'Spanish-Language TV Shows', 'International Movies',
       'Sci-Fi & Fantasy', 'Thrillers', 'Docuseries',
       'Science & Nature TV', 'Action & Adventure', 'Dramas',
       'Cult Movies', 'Independent Movies', 'Romantic Movies',
       'Documentaries', 'Horror Movies', 'Romantic TV Shows',
       'TV Comedies', 'TV Dramas', 'TV Thrillers', 'TV Mysteries',
       'British TV Shows', 'Music & Musicals', 'Reality TV',
       'TV Action & Adventure', 'Anime Features', 'Teen TV Shows',
       'Faith & Spirituality', 'Korean TV Shows', 'Anime Series',
       'LGBTQ Movies', 'TV Horror', 'Movies',
       'Stand-Up Comedy & Talk Shows', 'TV Sci-Fi & Fantasy',
       'Classic Movies', 'Sports Movies', 'TV Shows', 'Classic & Cult TV'],
      dtype=object)

In [None]:
df[df['listed_in'].str.len()==0]

#### Since we transformed the data, it is important to check again for duplicated rows

In [72]:
df[df.duplicated()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
16535,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,Denmark,"October 1, 2019",1989,TV-14,93 min,Dramas,"Hoping to help their dissident parents, two br..."
16536,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,Denmark,"October 1, 2019",1989,TV-14,93 min,International Movies,"Hoping to help their dissident parents, two br..."
16537,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,France,"October 1, 2019",1989,TV-14,93 min,Dramas,"Hoping to help their dissident parents, two br..."
16538,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,France,"October 1, 2019",1989,TV-14,93 min,International Movies,"Hoping to help their dissident parents, two br..."
16539,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,Poland,"October 1, 2019",1989,TV-14,93 min,Dramas,"Hoping to help their dissident parents, two br..."
16540,81168338,Movie,300 Miles to Heaven,Maciej Dejczer,Adrianna Biedrzyńska,Poland,"October 1, 2019",1989,TV-14,93 min,International Movies,"Hoping to help their dissident parents, two br..."
56269,81018067,Movie,Blood Will Tell,Miguel Cohan,Oscar Martínez,Argentina,"June 21, 2019",2019,TV-MA,113 min,Dramas,Family patriarch Elías begins to unravel after...
56270,81018067,Movie,Blood Will Tell,Miguel Cohan,Oscar Martínez,Argentina,"June 21, 2019",2019,TV-MA,113 min,Independent Movies,Family patriarch Elías begins to unravel after...
56271,81018067,Movie,Blood Will Tell,Miguel Cohan,Oscar Martínez,Argentina,"June 21, 2019",2019,TV-MA,113 min,International Movies,Family patriarch Elías begins to unravel after...
56272,81018067,Movie,Blood Will Tell,Miguel Cohan,Oscar Martínez,United States,"June 21, 2019",2019,TV-MA,113 min,Dramas,Family patriarch Elías begins to unravel after...


In [73]:
shape_before = df.shape[0]
df.drop_duplicates(inplace=True)
print('{} rows were removed'.format(shape_before-df.shape[0]))

54 rows were removed
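
``duplicated()`` flags a row only from its second occurrence onward, and ``drop_duplicates()`` keeps the first copy of each repeated row. One way such copies can arise is when the same name appears twice in a list that is later exploded. The sketch below uses a hypothetical toy frame to illustrate that assumption; it is not taken from the Netflix file.

```python
import pandas as pd

# Hypothetical toy data: the same cast member is listed twice.
toy = pd.DataFrame({
    "title": ["Some Movie"],
    "cast": [["Ann Lee", "Bob Ray", "Ann Lee"]],
})

exploded = toy.explode("cast", ignore_index=True)

# duplicated() flags only the second and later occurrences (keep='first').
dup_flags = exploded.duplicated()

# drop_duplicates() keeps the first copy of each repeated row.
deduped = exploded.drop_duplicates()
print(dup_flags.tolist())
print(deduped)
```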


In [74]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,"September 9, 2019",2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,"September 9, 2019",2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,"September 9, 2019",2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,"September 9, 2019",2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,"September 9, 2019",2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...


##### Analyzing `date_added`: the date the title was added to the catalog

In [75]:
df['date_added'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: date_added
Non-Null Count   Dtype 
--------------   ----- 
139741 non-null  object
dtypes: object(1)
memory usage: 2.1+ MB


In [76]:
index = df.sample(10).index

In [77]:
df.loc[index, 'date_added']

118843       April 20, 2018
107015      August 31, 2017
102765    December 11, 2015
135350      January 1, 2016
35420          May 31, 2019
126388     October 24, 2017
113856       August 1, 2018
34936           May 5, 2017
111037      August 16, 2018
25127     November 24, 2019
Name: date_added, dtype: object

In [80]:
df['date_added'] = pd.to_datetime(df['date_added'])

In [81]:
df.loc[index, 'date_added']

118843   2018-04-20
107015   2017-08-31
102765   2015-12-11
135350   2016-01-01
35420    2019-05-31
126388   2017-10-24
113856   2018-08-01
34936    2017-05-05
111037   2018-08-16
25127    2019-11-24
Name: date_added, dtype: datetime64[ns]
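
When the layout of the strings is known, it can be passed to ``pd.to_datetime`` explicitly, and ``errors='coerce'`` turns unparseable values into ``NaT`` instead of raising. A small sketch with invented strings, assuming the `Month day, year` layout seen in the sample above:

```python
import pandas as pd

# Invented values in the same layout as the sample, plus one bad entry.
raw = pd.Series(["April 20, 2018", " January 1, 2016", "unknown", None])

parsed = pd.to_datetime(
    raw.str.strip(),        # remove stray leading/trailing spaces
    format="%B %d, %Y",     # full month name, day, four-digit year
    errors="coerce",        # invalid strings become NaT
)
print(parsed)
```

Counting ``parsed.isna()`` before and after the conversion is a quick way to see whether any value failed to parse.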

In [83]:
df['date_added'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: date_added
Non-Null Count   Dtype         
--------------   -----         
139741 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 6.2 MB


##### Analyzing `release_year`: the year the title was released

In [84]:
df['release_year'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: release_year
Non-Null Count   Dtype
--------------   -----
139900 non-null  int64
dtypes: int64(1)
memory usage: 6.2 MB


In [85]:
df['release_year'].unique()

array([2019, 2016, 2013, 2017, 2014, 2015, 2009, 2012, 2010, 2018, 2011,
       2004, 2000, 1983, 1982, 2006, 2005, 2002, 1997, 2008, 2007, 2003,
       1981, 1991, 1994, 1988, 1976, 1973, 1974, 1989, 1986, 1984, 1978,
       1998, 1972, 1979, 1960, 1959, 2001, 1995, 1992, 1990, 1975, 1985,
       1980, 1970, 1996, 1967, 1999, 1987, 1968, 1993, 2020, 1958, 1965,
       1956, 1962, 1955, 1977, 1945, 1946, 1942, 1944, 1947, 1943, 1969,
       1954, 1966, 1971, 1964, 1925, 1963])

##### Analyzing ``duration``: movie length in minutes, or number of seasons for series

In [86]:
df['duration'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: duration
Non-Null Count   Dtype 
--------------   ----- 
139900 non-null  object
dtypes: object(1)
memory usage: 6.2+ MB


In [87]:
df['duration'].unique()

array(['90 min', '94 min', '1 Season', '99 min', '110 min', '60 min',
       '78 min', '95 min', '58 min', '62 min', '65 min', '61 min',
       '135 min', '98 min', '96 min', '79 min', '113 min', '80 min',
       '77 min', '112 min', '106 min', '102 min', '114 min', '125 min',
       '142 min', '133 min', '100 min', '86 min', '146 min', '10 min',
       '122 min', '126 min', '116 min', '130 min', '70 min', '128 min',
       '88 min', '108 min', '93 min', '64 min', '107 min', '92 min',
       '111 min', '103 min', '40 min', '85 min', '72 min', '45 min',
       '89 min', '24 min', '127 min', '163 min', '117 min', '42 min',
       '104 min', '75 min', '82 min', '38 min', '97 min', '81 min',
       '91 min', '152 min', '87 min', '121 min', '101 min', '119 min',
       '83 min', '182 min', '124 min', '63 min', '139 min', '171 min',
       '84 min', '69 min', '76 min', '41 min', '153 min', '141 min',
       '52 min', '137 min', '132 min', '157 min', '109 min', '105 min',
       '25 min', '71

In [88]:
df[df['type']=='Movie']['duration'].unique()

array(['90 min', '94 min', '99 min', '110 min', '60 min', '78 min',
       '95 min', '58 min', '62 min', '65 min', '61 min', '135 min',
       '98 min', '96 min', '79 min', '113 min', '80 min', '77 min',
       '112 min', '106 min', '102 min', '114 min', '125 min', '142 min',
       '133 min', '100 min', '86 min', '146 min', '10 min', '122 min',
       '126 min', '116 min', '130 min', '70 min', '128 min', '88 min',
       '108 min', '93 min', '64 min', '107 min', '92 min', '111 min',
       '103 min', '40 min', '85 min', '72 min', '45 min', '89 min',
       '24 min', '127 min', '163 min', '117 min', '42 min', '104 min',
       '75 min', '82 min', '38 min', '97 min', '81 min', '91 min',
       '152 min', '87 min', '121 min', '101 min', '119 min', '83 min',
       '182 min', '124 min', '63 min', '139 min', '171 min', '84 min',
       '69 min', '76 min', '41 min', '153 min', '141 min', '52 min',
       '137 min', '132 min', '157 min', '109 min', '105 min', '25 min',
       '71 min', '123 

In [89]:
df[df['type']=='TV Show']['duration'].unique()

array(['1 Season', '2 Seasons', '5 Seasons', '3 Seasons', '7 Seasons',
       '4 Seasons', '8 Seasons', '6 Seasons', '9 Seasons', '14 Seasons',
       '10 Seasons', '12 Seasons', '15 Seasons', '11 Seasons',
       '13 Seasons'], dtype=object)

In [90]:
filter_filmes = df['type']=='Movie'

##### Removing characters from strings with `str.replace()`

In [91]:
df[filter_filmes]['duration'].str.replace('min','').str.strip().astype(int)

0          90
1          90
2          90
3          90
4          90
         ... 
123739    102
123740     88
123741     88
123742    109
139914     60
Name: duration, Length: 98515, dtype: int64
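
An alternative to removing the `min` suffix is to capture only the digits with ``str.extract``. The sketch below compares both approaches on made-up values; the extract variant is a different technique from the one used above, shown only for comparison.

```python
import pandas as pd

durations = pd.Series(["90 min", "102 min", "9 min"])

# Approach used above: strip the suffix, then convert.
by_replace = durations.str.replace("min", "", regex=False).str.strip().astype(int)

# Alternative: capture the first run of digits with a regular expression.
by_extract = durations.str.extract(r"(\d+)", expand=False).astype(int)

print(by_extract.tolist())
```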

##### Creating a new column ``duration_movie`` for the movie duration

In [92]:
df['duration_movie'] = np.nan  # the np.NaN alias was removed in NumPy 2.0
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_movie
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...,
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...,
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
139949,70153404,TV Show,Friends,,Matt LeBlanc,United States,NaT,2003,TV-14,10 Seasons,TV Comedies,This hit sitcom follows the merry misadventure...,
139950,70153404,TV Show,Friends,,Matthew Perry,United States,NaT,2003,TV-14,10 Seasons,Classic & Cult TV,This hit sitcom follows the merry misadventure...,
139951,70153404,TV Show,Friends,,Matthew Perry,United States,NaT,2003,TV-14,10 Seasons,TV Comedies,This hit sitcom follows the merry misadventure...,
139952,70153404,TV Show,Friends,,David Schwimmer,United States,NaT,2003,TV-14,10 Seasons,Classic & Cult TV,This hit sitcom follows the merry misadventure...,


In [93]:
filmes_index = df[filter_filmes].index

In [94]:
df.loc[filmes_index, 'duration_movie'] = df.loc[filmes_index, 'duration'].str.replace('min','').str.strip()

In [97]:
# check which types still have null values in duration_movie
df[df['duration_movie'].isna()]['type'].value_counts()

TV Show    41385
Name: type, dtype: int64
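
Assigning through ``df.loc[rows, column]`` writes only to the selected rows, which is why every remaining null belongs to a TV show. A compact sketch on a hypothetical toy frame, using a boolean mask directly instead of an index object:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "45 min"],
})

is_movie = toy["type"] == "Movie"

# Start with an all-missing column, then fill only the movie rows.
toy["duration_movie"] = pd.NA
toy.loc[is_movie, "duration_movie"] = (
    toy.loc[is_movie, "duration"].str.replace("min", "", regex=False).str.strip()
)

# Rows that are still missing should all be TV shows.
missing_types = toy.loc[toy["duration_movie"].isna(), "type"].tolist()
print(missing_types)
```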

In [98]:
df['duration_movie'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: duration_movie
Non-Null Count  Dtype 
--------------  ----- 
98515 non-null  object
dtypes: object(1)
memory usage: 6.2+ MB


In [99]:
df['duration_movie'] = df['duration_movie'].astype('Int64')

In [100]:
df['duration_movie'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: duration_movie
Non-Null Count  Dtype
--------------  -----
98515 non-null  Int64
dtypes: Int64(1)
memory usage: 6.3 MB
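
The capitalized ``'Int64'`` is pandas' nullable integer dtype: unlike NumPy's ``int64``, it can hold missing values, so the column does not have to fall back to floats. A minimal sketch with made-up values; `pd.to_numeric` is used here as an extra, explicit conversion step that does not appear in the cells above.

```python
import pandas as pd

# Movie durations as strings, with a missing entry standing in for a TV show.
s = pd.Series(["90", None, "45"], dtype="object")

as_float = pd.to_numeric(s)              # the missing value forces float64
as_nullable = as_float.astype("Int64")   # integers plus <NA>

print(as_float.dtype, as_nullable.dtype)
print(as_nullable)
```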


#### Excellent! Now let's create a column with the number of seasons of each series

In [101]:
filter_series = df['type']=='TV Show'

In [102]:
df[filter_series]['duration'].unique()

array(['1 Season', '2 Seasons', '5 Seasons', '3 Seasons', '7 Seasons',
       '4 Seasons', '8 Seasons', '6 Seasons', '9 Seasons', '14 Seasons',
       '10 Seasons', '12 Seasons', '15 Seasons', '11 Seasons',
       '13 Seasons'], dtype=object)

In [103]:
## replace using a dictionary
dic = {'Seasons':'', 'Season':''}
df[filter_series]['duration'].replace(dic, regex=True)

161        1 
162        1 
163        1 
164        1 
165        1 
         ... 
139949    10 
139950    10 
139951    10 
139952    10 
139953    10 
Name: duration, Length: 41385, dtype: object

In [104]:
## replace using two lists of the same length
df[filter_series]['duration'].replace(to_replace=['Seasons', 'Season'], 
                                          value=['',''],
                                         regex=True)

161        1 
162        1 
163        1 
164        1 
165        1 
         ... 
139949    10 
139950    10 
139951    10 
139952    10 
139953    10 
Name: duration, Length: 41385, dtype: object
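
Both variants above leave a trailing space in the result (`'1 '`), and replacing `Season` alone would leave an `s` behind for plural values. A single pattern can cover both spellings and the space. The sketch below, on made-up values, is one possible alternative and not the method used in this section.

```python
import pandas as pd

seasons = pd.Series(["1 Season", "2 Seasons", "10 Seasons"])

# Optional whitespace, 'Season', optional 's', anchored at the end.
cleaned = seasons.str.replace(r"\s*Seasons?$", "", regex=True)

# Convert the clean digit strings to the nullable integer dtype.
as_int = pd.to_numeric(cleaned).astype("Int64")
print(as_int.tolist())
```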

In [105]:
## same list-based replace, now stored in a new column
df['qtd_temporada'] = df[filter_series]['duration'].replace(to_replace=['Seasons', 'Season'], 
                                          value=['',''],
                                         regex=True)

In [106]:
df[df['qtd_temporada'].notna()]['type'].value_counts()

TV Show    41385
Name: type, dtype: int64

In [107]:
df['qtd_temporada'].info()

<class 'pandas.core.series.Series'>
Int64Index: 139900 entries, 0 to 139953
Series name: qtd_temporada
Non-Null Count  Dtype 
--------------  ----- 
41385 non-null  object
dtypes: object(1)
memory usage: 6.2+ MB


In [109]:
df['qtd_temporada'] = df['qtd_temporada'].astype('Int64')

In [111]:
df.sample(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_movie,qtd_temporada
129333,70252898,TV Show,Total Drama,,Megan Fahlenbock,United States,2018-05-03,2015,TV-PG,5 Seasons,TV Comedies,Heartwarming moments are interspersed with bac...,,5.0
69592,81062853,Movie,Next Enti?,Kunal Kohli,Poonam Kaur,India,2019-01-09,2018,TV-14,117 min,Comedies,Clashing views on intimacy lead a couple to go...,117.0,
78493,70125232,Movie,Legend of the Guardians: The Owls of Ga'Hoole,Zack Snyder,Abbie Cornish,Australia,2020-01-01,2010,PG,97 min,Children & Family Movies,"Soren, a barn owl kidnapped from his peaceful ...",97.0,
73908,70275751,Movie,Dallas Buyers Club,Jean-Marc Vallée,Steve Zahn,United States,2018-01-16,2013,R,117 min,Dramas,Refusing to accept a death sentence from his d...,117.0,
88362,81067759,Movie,Fatso,Rajat Kapoor,Neil Bhoopalam,India,2019-02-15,2012,TV-14,97 min,International Movies,Killed too soon due to a clerical error in hea...,97.0,
44288,80148210,Movie,The Titan,Lennart Ruff,Diego Boneta,United Kingdom,2018-03-30,2018,TV-MA,97 min,Dramas,"On a bleak future Earth, a soldier endures a r...",97.0,
79983,20159719,Movie,The Peacemaker,Mimi Leder,Alexander Strobele,United States,2020-01-01,1997,R,124 min,Action & Adventure,After terrorists trigger a nuclear blast in Ru...,124.0,
95644,80044536,Movie,Diary of a Chambermaid,Benoît Jacquot,Mélodie Valemberg,France,2016-12-03,2015,NR,96 min,International Movies,An attractive young woman goes to work as a ch...,96.0,
67955,80996343,Movie,Our Shining Days,,Zhaohua Lu,China,2018-07-01,2017,TV-PG,103 min,Comedies,"In a battle of the bands, a percussions studen...",103.0,
137934,80156006,TV Show,Ultimate Beastmaster México,,Kyung-seok Seo,United States,2018-08-31,2018,TV-14,3 Seasons,Reality TV,Competitors representing countries from around...,,3.0


Our original dataset (before the rows were exploded) contains 6234 titles with 12 attributes each, and it has missing data.
Of the attributes available for each catalog entry, we will focus only on the following (since this is just an example):

* `type` - the kind of title (series or movie)
* `release_year` - the year the title was released
* `date_added` - the date the title was added to the catalog
* `director` - the director of the title

##### Deleting a column with ``del``

- ``drop()`` can delete rows and columns, while ``del`` deletes only columns
- ``drop()`` can delete several items at once, while ``del`` deletes one item at a time
- ``drop()`` works in place only when asked (``inplace=True``), while ``del`` always works in place
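
The three differences listed above can be seen on a small hypothetical frame (the column names `a`, `b`, `c` are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop() returns a new frame by default; `toy` is unchanged.
without_ab = toy.drop(columns=["a", "b"])

# drop() can also remove rows by index label.
without_row0 = toy.drop(index=0)

# del removes exactly one column, in place.
del toy["c"]

print(list(without_ab.columns), list(toy.columns))
```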

In [112]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_movie,qtd_temporada
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...,90,
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,90 min,Comedies,Before planning an awesome wedding for his gra...,90,
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,90 min,Children & Family Movies,Before planning an awesome wedding for his gra...,90,


In [113]:
del df['duration']

In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 139900 entries, 0 to 139953
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   show_id         139900 non-null  int64         
 1   type            139900 non-null  object        
 2   title           139900 non-null  object        
 3   director        100989 non-null  object        
 4   cast            138412 non-null  object        
 5   country         134110 non-null  object        
 6   date_added      139741 non-null  datetime64[ns]
 7   release_year    139900 non-null  int64         
 8   rating          139827 non-null  object        
 9   listed_in       139900 non-null  object        
 10  description     139900 non-null  object        
 11  duration_movie  98515 non-null   Int64         
 12  qtd_temporada   41385 non-null   Int64         
dtypes: Int64(2), datetime64[ns](1), int64(2), object(8)
memory usage: 19.2+ MB


### 3. Create two files:
- a) one for the movies
- b) another for the series

### a) Working with the movies only

In [115]:
df_filmes = df[filter_filmes].copy()  # explicit copy, so later in-place edits leave df untouched

In [116]:
df_filmes.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie,qtd_temporada
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90,
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90,
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,


#### Keeping only the columns that apply to movies

In [117]:
df_filmes.drop(['qtd_temporada', 'type'], axis=1)

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
0,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
1,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
2,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
3,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
4,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
...,...,...,...,...,...,...,...,...,...,...,...
123739,80085439,Frank and Cindy,G.J. Echternkamp,Fabianne Therese,United States,2016-04-01,2015,R,Independent Movies,A student filmmaker vengefully turns his camer...,102
123740,80011846,Iverson,Zatella Beatty,Allen Iverson,United States,2016-04-01,2014,NR,Documentaries,This unfiltered documentary follows the rocky ...,88
123741,80011846,Iverson,Zatella Beatty,Allen Iverson,United States,2016-04-01,2014,NR,Sports Movies,This unfiltered documentary follows the rocky ...,88
123742,80064521,Jeremy Scott: The People's Designer,Vlad Yudin,Jeremy Scott,United States,2016-04-01,2015,PG-13,Documentaries,The journey of fashion designer Jeremy Scott f...,109


In [118]:
df_filmes.drop(['qtd_temporada', 'type'], axis=1, inplace=True)

In [119]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie,qtd_temporada
0,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
1,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90,
2,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,
3,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90,
4,81145628,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90,


In [120]:
df_filmes.reset_index(inplace=True, drop=True)

In [121]:
df_filmes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98515 entries, 0 to 98514
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   show_id         98515 non-null  int64         
 1   title           98515 non-null  object        
 2   director        97624 non-null  object        
 3   cast            97494 non-null  object        
 4   country         96594 non-null  object        
 5   date_added      98514 non-null  datetime64[ns]
 6   release_year    98515 non-null  int64         
 7   rating          98500 non-null  object        
 8   listed_in       98515 non-null  object        
 9   description     98515 non-null  object        
 10  duration_movie  98515 non-null  Int64         
dtypes: Int64(1), datetime64[ns](1), int64(2), object(7)
memory usage: 8.4+ MB


#### Checking for null values in the movies data

In [122]:
df_filmes.isna().sum()/df_filmes.shape[0]*100

show_id           0.000000
title             0.000000
director          0.904431
cast              1.036390
country           1.949957
date_added        0.001015
release_year      0.000000
rating            0.015226
listed_in         0.000000
description       0.000000
duration_movie    0.000000
dtype: float64
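
The same percentages can be obtained with ``isna().mean()``, since the mean of a boolean column is the fraction of ``True`` values. A sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "director": ["A", None, "C", None],
    "cast": ["x", "y", None, "z"],
})

# Form used above: count of nulls divided by the number of rows.
pct_via_sum = toy.isna().sum() / toy.shape[0] * 100

# Shorter equivalent: the mean of the boolean null mask.
pct_via_mean = toy.isna().mean() * 100

print(pct_via_mean)
```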

##### We could discard every row with null values, since there are few of them. Instead, let's try to understand each null value

In [123]:
df_filmes[df_filmes['director'].isna()]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
160,80117401,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,2016-09-09,2016,TV-MA,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,94
468,80182480,Krish Trish and Baltiboy,,Damandeep Singh Baggan,,2017-09-08,2009,TV-Y7,Children & Family Movies,"A team of minstrels, including a monkey, cat a...",58
469,80182480,Krish Trish and Baltiboy,,Smita Malhotra,,2017-09-08,2009,TV-Y7,Children & Family Movies,"A team of minstrels, including a monkey, cat a...",58
470,80182480,Krish Trish and Baltiboy,,Baba Sehgal,,2017-09-08,2009,TV-Y7,Children & Family Movies,"A team of minstrels, including a monkey, cat a...",58
491,80182481,Krish Trish and Baltiboy: Part II,,Damandeep Singh Baggan,,2017-09-08,2010,TV-Y7,Children & Family Movies,Animal minstrels narrate stories about a monke...,58
...,...,...,...,...,...,...,...,...,...,...,...
97691,80244992,Ethirmarai,,Ramanathan,India,2018-04-01,2018,TV-14,Thrillers,"While investigating a brutal string of deaths,...",91
97692,80244992,Ethirmarai,,Rakesh,India,2018-04-01,2018,TV-14,International Movies,"While investigating a brutal string of deaths,...",91
97693,80244992,Ethirmarai,,Rakesh,India,2018-04-01,2018,TV-14,Music & Musicals,"While investigating a brutal string of deaths,...",91
97694,80244992,Ethirmarai,,Rakesh,India,2018-04-01,2018,TV-14,Thrillers,"While investigating a brutal string of deaths,...",91


In [124]:
shape_before = df_filmes.shape[0]
# Drop every row whose director is missing
df_filmes.drop(df_filmes[df_filmes['director'].isna()].index, inplace=True)
print('{} rows were removed'.format(shape_before - df_filmes.shape[0]))

891 rows were removed


**Excellent!** We removed every movie row whose director is null.
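Dropping by index label, as done above, gives the same result as `dropna(subset=...)` as long as the index labels are unique. The sketch below, on an invented frame with a deliberately repeated label, shows how the two can differ; it is meant as a caution, not as a claim that `df_filmes` has duplicate labels.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the index label 0 appears twice on purpose
toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "director": [np.nan, "X", "Y"]},
    index=[0, 0, 1],
)

# Drop by label: every row labelled 0 goes, including the one with a director
by_label = toy.drop(toy[toy["director"].isna()].index)

# Drop by null mask: only the row whose director is actually missing goes
by_mask = toy.dropna(subset=["director"])

print(len(by_label), len(by_mask))
```

With a unique index both calls remove exactly the same rows, so `dropna(subset=[...])` can be read as a shorter spelling of the same step.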

In [125]:
df_filmes.isna().sum()/df_filmes.shape[0]*100

show_id           0.000000
title             0.000000
director          0.000000
cast              0.979267
country           1.826395
date_added        0.000000
release_year      0.000000
rating            0.012292
listed_in         0.000000
description       0.000000
duration_movie    0.000000
dtype: float64

In [126]:
df_filmes[df_filmes['cast'].isna()]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
626,80046727,Rolling Papers,Mitch Dickman,,United States,2017-09-08,2015,TV-MA,Documentaries,"As the newspaper industry takes a hit, The Den...",79
627,80046727,Rolling Papers,Mitch Dickman,,Uruguay,2017-09-08,2015,TV-MA,Documentaries,"As the newspaper industry takes a hit, The Den...",79
689,80203094,City of Joy,Madeleine Gavin,,United States,2018-09-07,2018,TV-MA,Documentaries,Women who've been sexually brutalized in war-t...,77
1690,80081155,Amanda Knox,Rod Blackhurst,,Denmark,2016-09-30,2016,TV-MA,Documentaries,She was twice convicted and acquitted of murde...,92
1691,80081155,Amanda Knox,Rod Blackhurst,,United States,2016-09-30,2016,TV-MA,Documentaries,She was twice convicted and acquitted of murde...,92
...,...,...,...,...,...,...,...,...,...,...,...
98447,80097468,The Bad Kids,Louis Pepe,,United States,2017-04-01,2016,TV-MA,Documentaries,"In this documentary, teachers at a Mojave Dese...",101
98448,80171439,The Human Factor: The Untold Story of the Bomb...,Rudradeep Bhattacharjee,,India,2017-04-01,2013,TV-PG,Documentaries,A family of Parsi musicians collectively works...,53
98449,80171439,The Human Factor: The Untold Story of the Bomb...,Rudradeep Bhattacharjee,,India,2017-04-01,2013,TV-PG,International Movies,A family of Parsi musicians collectively works...,53
98450,80171439,The Human Factor: The Untold Story of the Bomb...,Rudradeep Bhattacharjee,,India,2017-04-01,2013,TV-PG,Music & Musicals,A family of Parsi musicians collectively works...,53


In [127]:
df_filmes.drop(df_filmes[df_filmes['cast'].isna()].index, inplace=True)

In [128]:
df_filmes.isna().sum()/df_filmes.shape[0]*100

show_id           0.000000
title             0.000000
director          0.000000
cast              0.000000
country           1.764803
date_added        0.000000
release_year      0.000000
rating            0.010345
listed_in         0.000000
description       0.000000
duration_movie    0.000000
dtype: float64

In [129]:
df_filmes[df_filmes['country'].isna()]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
371,80169755,Joaquín Reyes: Una y no más,José Miguel Contreras,Joaquín Reyes,,2017-09-08,2017,TV-MA,Stand-Up Comedy,Comedian and celebrity impersonator Joaquín Re...,78
471,80182483,Krish Trish and Baltiboy: Battle of Wits,Munjal Shroff,Damandeep Singh Baggan,,2017-09-08,2013,TV-Y7,Children & Family Movies,"An artisan is cheated of his payment, a lion o...",62
472,80182483,Krish Trish and Baltiboy: Battle of Wits,Munjal Shroff,Smita Malhotra,,2017-09-08,2013,TV-Y7,Children & Family Movies,"An artisan is cheated of his payment, a lion o...",62
473,80182483,Krish Trish and Baltiboy: Battle of Wits,Munjal Shroff,Baba Sehgal,,2017-09-08,2013,TV-Y7,Children & Family Movies,"An artisan is cheated of his payment, a lion o...",62
474,80182483,Krish Trish and Baltiboy: Battle of Wits,Munjal Shroff,Deepak Chachra,,2017-09-08,2013,TV-Y7,Children & Family Movies,"An artisan is cheated of his payment, a lion o...",62
...,...,...,...,...,...,...,...,...,...,...,...
95183,81035850,My Wife and My Wifey,Moataz El Tony,Raja Al-Jeddawi,,2019-04-18,2014,TV-14,International Movies,A man finds his marriage to a dedicated women'...,99
95184,81035850,My Wife and My Wifey,Moataz El Tony,Edward,,2019-04-18,2014,TV-14,Comedies,A man finds his marriage to a dedicated women'...,99
95185,81035850,My Wife and My Wifey,Moataz El Tony,Edward,,2019-04-18,2014,TV-14,International Movies,A man finds his marriage to a dedicated women'...,99
95350,81013626,HOMECOMING: A film by Beyoncé,Beyoncé Knowles-Carter,Beyoncé Knowles-Carter,,2019-04-17,2019,TV-MA,Documentaries,"This intimate, in-depth look at Beyoncé's cele...",138


In [130]:
df_filmes.drop(df_filmes[df_filmes['country'].isna()].index, inplace=True)

In [131]:
df_filmes.isna().sum()/df_filmes.shape[0]*100

show_id           0.000000
title             0.000000
director          0.000000
cast              0.000000
country           0.000000
date_added        0.000000
release_year      0.000000
rating            0.010531
listed_in         0.000000
description       0.000000
duration_movie    0.000000
dtype: float64

In [132]:
df_filmes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 94962 entries, 0 to 98513
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   show_id         94962 non-null  int64         
 1   title           94962 non-null  object        
 2   director        94962 non-null  object        
 3   cast            94962 non-null  object        
 4   country         94962 non-null  object        
 5   date_added      94962 non-null  datetime64[ns]
 6   release_year    94962 non-null  int64         
 7   rating          94952 non-null  object        
 8   listed_in       94962 non-null  object        
 9   description     94962 non-null  object        
 10  duration_movie  94962 non-null  Int64         
dtypes: Int64(1), datetime64[ns](1), int64(2), object(7)
memory usage: 8.8+ MB


In [133]:
df_filmes.drop(df_filmes[df_filmes['rating'].isna()].index, inplace=True)

In [134]:
df_filmes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 94952 entries, 0 to 98513
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   show_id         94952 non-null  int64         
 1   title           94952 non-null  object        
 2   director        94952 non-null  object        
 3   cast            94952 non-null  object        
 4   country         94952 non-null  object        
 5   date_added      94952 non-null  datetime64[ns]
 6   release_year    94952 non-null  int64         
 7   rating          94952 non-null  object        
 8   listed_in       94952 non-null  object        
 9   description     94952 non-null  object        
 10  duration_movie  94952 non-null  Int64         
dtypes: Int64(1), datetime64[ns](1), int64(2), object(7)
memory usage: 8.8+ MB


In [135]:
df_filmes['country'].unique()

array(['United States', 'India', 'South Korea', 'China', 'Bulgaria',
       'Spain', 'Canada', 'Chile', 'United Kingdom', 'Denmark', 'Sweden',
       'Netherlands', 'Belgium', 'France', 'Thailand', 'Pakistan',
       'Turkey', 'Brazil', 'Italy', 'Indonesia', 'Hong Kong', 'Morocco',
       'Vietnam', 'Argentina', 'Nigeria', 'Greece', 'Norway', 'Ireland',
       'Switzerland', 'United Arab Emirates', 'Japan', 'Cambodia',
       'Russia', 'Mexico', 'Israel', 'Germany', 'Australia', 'Portugal',
       'Egypt', 'Austria', 'Soviet Union', 'Peru', 'Uruguay', 'Malaysia',
       'South Africa', 'Nepal', 'Philippines', 'Poland', 'Serbia',
       'Singapore', 'Luxembourg', 'Cayman Islands', 'New Zealand',
       'Venezuela', 'Jordan', 'Saudi Arabia', 'Iceland', 'West Germany',
       'Malta', 'Lebanon', 'Qatar', 'Senegal', 'Iran', 'Colombia', 'Iraq',
       'Liechtenstein', 'Taiwan', 'Albania', 'Czech Republic', 'Slovakia',
       'Ghana', 'Finland', 'Hungary', 'Guatemala', 'Malawi', 'Paraguay',


In [136]:
df_filmes[df_filmes['country'].str.contains('')]

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
0,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
1,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
2,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
3,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
4,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
...,...,...,...,...,...,...,...,...,...,...,...
98509,80085439,Frank and Cindy,G.J. Echternkamp,Fabianne Therese,United States,2016-04-01,2015,R,Dramas,A student filmmaker vengefully turns his camer...,102
98510,80085439,Frank and Cindy,G.J. Echternkamp,Fabianne Therese,United States,2016-04-01,2015,R,Independent Movies,A student filmmaker vengefully turns his camer...,102
98511,80011846,Iverson,Zatella Beatty,Allen Iverson,United States,2016-04-01,2014,NR,Documentaries,This unfiltered documentary follows the rocky ...,88
98512,80011846,Iverson,Zatella Beatty,Allen Iverson,United States,2016-04-01,2014,NR,Sports Movies,This unfiltered documentary follows the rocky ...,88
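Note that an empty pattern is found inside every string, so `str.contains('')` returns the whole frame rather than the rows with a blank country. A minimal sketch of one way to look for blank or whitespace-only values, using an invented series:

```python
import pandas as pd

# Illustrative values, including an empty and a whitespace-only entry
country = pd.Series(["United States", "", "  ", "India"])

# The empty pattern matches every string, so this mask is all True
matches_empty_pattern = country.str.contains("")

# Blank entries: strip surrounding whitespace, then compare with ''
is_blank = country.str.strip().eq("")

print(country[is_blank])
```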


In [137]:
df_filmes.reset_index(inplace=True, drop=True)

In [138]:
df_filmes.isna().sum()

show_id           0
title             0
director          0
cast              0
country           0
date_added        0
release_year      0
rating            0
listed_in         0
description       0
duration_movie    0
dtype: int64

In [139]:
df_filmes[df_filmes['country'] == '']

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie


**Converting numeric data to categorical data with `pd.cut()`**

In [141]:
df_filmes.head()

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie
0,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
1,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
2,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90
3,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,India,2019-09-09,2019,TV-PG,Comedies,Before planning an awesome wedding for his gra...,90
4,81145628,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,South Korea,2019-09-09,2019,TV-PG,Children & Family Movies,Before planning an awesome wedding for his gra...,90


In [142]:
df_filmes['duration_movie'].describe()

count       94952.0
mean     107.828924
std       23.865633
min            12.0
25%            93.0
50%           104.0
75%           121.0
max           228.0
Name: duration_movie, dtype: Float64

In [144]:
df_filmes['duration_movie']

0         90
1         90
2         90
3         90
4         90
        ... 
94947    102
94948    102
94949     88
94950     88
94951    109
Name: duration_movie, Length: 94952, dtype: Int64

In [143]:
pd.cut(df_filmes['duration_movie'],
       bins=3,
       labels=['baixa', 'media', 'alta'])

0        media
1        media
2        media
3        media
4        media
         ...  
94947    media
94948    media
94949    media
94950    media
94951    media
Name: duration_movie, Length: 94952, dtype: category
Categories (3, object): ['baixa' < 'media' < 'alta']
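With `bins=3`, `pd.cut` builds three equal-width intervals over the observed range. For the 12 to 228 minute range reported by `describe()`, the inner edges land at 84 and 156 minutes, which is why most movies fall in `media`. The sketch below reproduces those edges on five invented durations and compares them with `pd.qcut`, an alternative that places edges at quantiles; the notebook itself does not use `qcut`.

```python
import numpy as np
import pandas as pd

# Illustrative durations that share the min (12) and max (228) of the real column
minutes = pd.Series([12, 90, 104, 121, 228])
labels = ["baixa", "media", "alta"]

# Equal-width bins: retbins=True also returns the computed edges
by_width, edges = pd.cut(minutes, bins=3, labels=labels, retbins=True)
print(edges)

# Equal-frequency bins: edges sit at the 1/3 and 2/3 quantiles instead
by_quantile = pd.qcut(minutes, q=3, labels=labels)

print(pd.DataFrame({"minutes": minutes, "cut": by_width, "qcut": by_quantile}))
```

Equal-width bins keep the categories easy to explain in minutes, while quantile bins give more balanced groups; which one fits depends on what the later analysis needs.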

In [145]:
df_filmes['duration'] = pd.cut(df_filmes['duration_movie'], 
                               bins=3, 
                               labels=['baixa', 'media', 'alta'])

In [147]:
df_filmes.sample(20)

Unnamed: 0,show_id,title,director,cast,country,date_added,release_year,rating,listed_in,description,duration_movie,duration
51712,70128697,Kill the Irishman,Jonathan Hensleigh,Tony Darrow,United States,2019-07-01,2011,R,Dramas,This true crime tale charts the rise and fall ...,106,media
61536,60034568,New York Minute,Dennie Gordon,Jack Osbourne,United States,2020-01-01,2004,PG,Comedies,When bickering teenage twins accidentally land...,91,media
82841,80174897,Like Father,Lauren Miller Rogen,Kelsey Grammer,United States,2018-08-03,2018,TV-MA,Comedies,"After she's left at the altar, a workaholic ad...",104,media
25470,80097390,Morris from America,Chad Hartigan,Carla Juri,Germany,2018-11-01,2016,R,Dramas,When his father moves from the U.S. to Heidelb...,91,media
12724,60003242,Charlie's Angels,McG,Drew Barrymore,United States,2019-10-01,2000,PG-13,Action & Adventure,"Three beauties use their looks, charm and mart...",98,media
30080,80216541,In Family I Trust,Patricia Font,Alexandra Jiménez,Spain,2019-05-10,2018,TV-MA,Comedies,"After her partner cheats on her, an architect ...",98,media
3628,80092885,ARQ,Tony Elliott,Shaun Benson,United States,2016-09-16,2016,TV-MA,International Movies,"Trapped in a lab and stuck in a time loop, a d...",89,media
5200,81141689,Spookley the Square Pumpkin,Bernie Denk,Michel Perron,United States,2019-09-01,2004,G,Children & Family Movies,"When a storm terrorizes his patch, a pumpkin c...",47,baixa
45407,352989,Carrie,Brian De Palma,P.J. Soles,United States,2019-06-01,1976,R,Cult Movies,An outcast teen with telekinetic ability lashe...,98,media
38780,80200047,The Boy Who Harnessed the Wind,Chiwetel Ejiofor,Chiwetel Ejiofor,Malawi,2019-03-01,2019,TV-14,Independent Movies,"Inspired by a science book, 13-year-old Willia...",114,media


In [148]:
df_filmes.sort_values(['date_added','show_id','director','cast', 'country', 'listed_in'], inplace=True)

In [149]:
df_filmes.to_csv('../data/netflix_filmes_clear.csv', index=False)
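CSV stores plain text, so the `datetime64` and `category` dtypes are not preserved in the saved file and need to be restored when it is read back. A minimal sketch of that round trip, using an in-memory buffer and an invented two-row frame instead of the real file:

```python
import io
import pandas as pd

levels = ["baixa", "media", "alta"]

# Illustrative frame with the same kinds of columns as df_filmes
toy = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2019-09-09", "2016-09-09"]),
    "duration": pd.Categorical(["media", "baixa"], categories=levels, ordered=True),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without hints: dates come back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with hints: parse the dates and rebuild the ordered category
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_added"])
restored["duration"] = pd.Categorical(restored["duration"], categories=levels, ordered=True)

print(plain.dtypes)
print(restored.dtypes)
```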

### b) Working only with the series

In [150]:
df_series = pd.DataFrame(df[filter_series])

In [151]:
df_series.isna().sum()/df_series.shape[0]*100

show_id             0.000000
type                0.000000
title               0.000000
director           91.869035
cast                1.128428
country             9.348798
date_added          0.381781
release_year        0.000000
rating              0.140147
listed_in           0.000000
description         0.000000
duration_movie    100.000000
qtd_temporada       0.000000
dtype: float64

**Note that most series have no ``director``.** With so many nulls (about 92% of the rows), it is better to leave the ``director`` column out for series. We will also drop the ``duration_movie`` column, which only makes sense for movies.
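The same decision can be written as a rule: drop any column whose null fraction exceeds a cut-off. The sketch below uses an invented frame and a hypothetical threshold `MAX_NULL_FRACTION = 0.5`; the notebook chooses the columns by hand, so the threshold is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative frame: director is mostly missing, country only occasionally
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, np.nan, np.nan, "X"],
    "country": ["US", np.nan, "BR", "IN"],
})

MAX_NULL_FRACTION = 0.5  # hypothetical cut-off, not taken from the notebook

# Fraction of nulls per column, then the columns above the cut-off
null_fraction = toy.isna().mean()
too_sparse = null_fraction[null_fraction > MAX_NULL_FRACTION].index.tolist()

trimmed = toy.drop(columns=too_sparse)
print(too_sparse, trimmed.columns.tolist())
```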

In [152]:
df_series.drop(['director','duration_movie'], axis=1, inplace=True)

In [153]:
df_series.isna().sum()/df_series.shape[0]*100

show_id          0.000000
type             0.000000
title            0.000000
cast             1.128428
country          9.348798
date_added       0.381781
release_year     0.000000
rating           0.140147
listed_in        0.000000
description      0.000000
qtd_temporada    0.000000
dtype: float64

In [154]:
df_series[df_series['country'].isna()]

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,listed_in,description,qtd_temporada
726,80244601,TV Show,Castle of Stars,Chaiyapol Pupart,,2018-09-07,2015,TV-14,International TV Shows,As four couples with different lifestyles go t...,1
727,80244601,TV Show,Castle of Stars,Chaiyapol Pupart,,2018-09-07,2015,TV-14,Romantic TV Shows,As four couples with different lifestyles go t...,1
728,80244601,TV Show,Castle of Stars,Chaiyapol Pupart,,2018-09-07,2015,TV-14,TV Comedies,As four couples with different lifestyles go t...,1
729,80244601,TV Show,Castle of Stars,Jintanutda Lummakanon,,2018-09-07,2015,TV-14,International TV Shows,As four couples with different lifestyles go t...,1
730,80244601,TV Show,Castle of Stars,Jintanutda Lummakanon,,2018-09-07,2015,TV-14,Romantic TV Shows,As four couples with different lifestyles go t...,1
...,...,...,...,...,...,...,...,...,...,...,...
139446,80244228,TV Show,Beauty and the Bitches,Mila Thanapa,,2019-04-23,2016,TV-14,TV Comedies,Beauty pageant contestants from all over Thail...,2
139447,80244228,TV Show,Beauty and the Bitches,Mild Anis Suwit,,2019-04-23,2016,TV-14,International TV Shows,Beauty pageant contestants from all over Thail...,2
139448,80244228,TV Show,Beauty and the Bitches,Mild Anis Suwit,,2019-04-23,2016,TV-14,Romantic TV Shows,Beauty pageant contestants from all over Thail...,2
139449,80244228,TV Show,Beauty and the Bitches,Mild Anis Suwit,,2019-04-23,2016,TV-14,TV Comedies,Beauty pageant contestants from all over Thail...,2


In [155]:
# How many series have no country?
df_series[df_series['country'].isna()]['show_id'].nunique()

281
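Counting distinct `show_id` values matters here because each series appears on several rows (one per cast member and genre in the listing above), so the row percentage and the number of titles affected can tell different stories. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Illustrative frame: three titles spread over six rows
toy = pd.DataFrame({
    "show_id": [10, 10, 10, 20, 30, 30],
    "cast": ["a", "b", "c", "d", "e", "f"],
    "country": [np.nan, np.nan, np.nan, "US", np.nan, np.nan],
})

missing = toy[toy["country"].isna()]

rows_missing = len(missing)                    # rows without a country
titles_missing = missing["show_id"].nunique()  # distinct titles without a country
share_of_titles = titles_missing / toy["show_id"].nunique() * 100

print(rows_missing, titles_missing, round(share_of_titles, 1))
```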

In [156]:
df_series

Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,listed_in,description,qtd_temporada
161,70234439,TV Show,Transformers Prime,Peter Cullen,United States,2018-09-08,2013,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",1
162,70234439,TV Show,Transformers Prime,Sumalee Montano,United States,2018-09-08,2013,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",1
163,70234439,TV Show,Transformers Prime,Frank Welker,United States,2018-09-08,2013,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",1
164,70234439,TV Show,Transformers Prime,Jeffrey Combs,United States,2018-09-08,2013,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",1
165,70234439,TV Show,Transformers Prime,Kevin Michael Richardson,United States,2018-09-08,2013,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",1
...,...,...,...,...,...,...,...,...,...,...,...
139949,70153404,TV Show,Friends,Matt LeBlanc,United States,NaT,2003,TV-14,TV Comedies,This hit sitcom follows the merry misadventure...,10
139950,70153404,TV Show,Friends,Matthew Perry,United States,NaT,2003,TV-14,Classic & Cult TV,This hit sitcom follows the merry misadventure...,10
139951,70153404,TV Show,Friends,Matthew Perry,United States,NaT,2003,TV-14,TV Comedies,This hit sitcom follows the merry misadventure...,10
139952,70153404,TV Show,Friends,David Schwimmer,United States,NaT,2003,TV-14,Classic & Cult TV,This hit sitcom follows the merry misadventure...,10


In [157]:
df_series.dropna(subset=['cast','date_added','rating'], inplace=True)

In [158]:
df_series.isna().sum()/df_series.shape[0]*100

show_id          0.000000
type             0.000000
title            0.000000
cast             0.000000
country          9.304211
date_added       0.000000
release_year     0.000000
rating           0.000000
listed_in        0.000000
description      0.000000
qtd_temporada    0.000000
dtype: float64

In [159]:
df_series.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40702 entries, 161 to 139794
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   show_id        40702 non-null  int64         
 1   type           40702 non-null  object        
 2   title          40702 non-null  object        
 3   cast           40702 non-null  object        
 4   country        36915 non-null  object        
 5   date_added     40702 non-null  datetime64[ns]
 6   release_year   40702 non-null  int64         
 7   rating         40702 non-null  object        
 8   listed_in      40702 non-null  object        
 9   description    40702 non-null  object        
 10  qtd_temporada  40702 non-null  Int64         
dtypes: Int64(1), datetime64[ns](1), int64(2), object(7)
memory usage: 3.8+ MB
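The `country` column keeps its nulls in the series data. One thing to keep in mind for later steps is that `groupby` leaves out null keys by default, so those rows silently disappear from per-country counts. A minimal sketch on invented data, assuming pandas 1.1 or later for the `dropna` parameter:

```python
import numpy as np
import pandas as pd

# Illustrative frame: two of the four titles have no country
toy = pd.DataFrame({
    "show_id": [1, 2, 3, 4],
    "country": ["US", np.nan, "US", np.nan],
})

# Default behaviour: rows with a null country are excluded from the groups
default_counts = toy.groupby("country")["show_id"].nunique()

# dropna=False keeps a separate group for the missing key
with_missing = toy.groupby("country", dropna=False)["show_id"].nunique()

print(default_counts)
print(with_missing)
```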


In [162]:
df_series.sort_values(['date_added','show_id','cast', 'country', 'listed_in'], inplace=True)

In [163]:
df_series.to_csv('../data/netflix_series_clear.csv', index=False)

### Excellent. We can now begin the exploratory analysis of the Netflix data