## Library

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## File

In [3]:
url = 'https://raw.githubusercontent.com/albvieiraa/EDA-Streamings/refs/heads/main/datasets/netflix_titles.csv'

In [4]:
df_netflix = pd.read_csv(url)
df_netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [5]:
df_netflix.tail(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


## Getting to know our data

In [None]:
df_netflix.shape

(8807, 12)

In [None]:
df_netflix.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [None]:
df_netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [None]:
df_netflix.isnull().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,2634
cast,825
country,831
date_added,10
release_year,0
rating,4
duration,3


In [None]:
df_netflix.duplicated().sum()

np.int64(0)

In [None]:
df_netflix.nunique()

Unnamed: 0,0
show_id,8807
type,2
title,8807
director,4528
cast,7692
country,748
date_added,1767
release_year,74
rating,17
duration,220


In [None]:
# categories in the type column
df_netflix['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

**Observations:**
- Handle the columns that contain null values
- Check whether it is worth changing the data type of the *date_added* and *duration* columns
- Rename the *listed_in* column to **gender** (it lists each title's genres) and check how to split its values
- Handle the missing data in *director, cast, country, date_added, rating, duration*
- Filter by media type
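
One possible way to split the multi-valued *listed_in* entries mentioned in the observations is to split on the comma and explode into one row per genre. A minimal sketch on toy rows copied from the `head()` output; the `toy` frame and the `genre` column name are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy rows taken from the head() output above.
toy = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Jailbirds New Orleans"],
    "listed_in": ["Documentaries", "Docuseries, Reality TV"],
})

# Split on commas, give each genre its own row, and trim stray spaces.
genres = (
    toy.assign(genre=toy["listed_in"].str.split(","))
       .explode("genre")
       .assign(genre=lambda d: d["genre"].str.strip())
)

# Number of titles per individual genre.
genre_counts = genres["genre"].value_counts()
print(genres[["title", "genre"]])
```

After exploding, a title appears once per genre, so counts per genre no longer add up to the number of titles.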

## Tratamento dos dados

In [None]:
df_netflix_tratando = df_netflix.copy()

In [None]:
df_netflix_tratando = df_netflix_tratando.rename(columns={'listed_in': 'gender'})

Create one new DataFrame with only movies and another with only TV shows

#### Movies

In [None]:
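# filter by media type; .copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_netflix_movies = df_netflix_tratando[df_netflix_tratando['type'] == 'Movie'].copy()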

In [None]:
df_netflix_movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic","September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...


In the *duration* column of `df_netflix_movies`, remove the non-numeric characters so that only the numbers remain, then convert the column to a numeric type

In [None]:
df_netflix_movies['duration'] = df_netflix_movies['duration'].str.replace(' min', '')


In [None]:
# Inspecting the null values
df_netflix_movies[df_netflix_movies['duration'].isnull()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
5541,s5542,Movie,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,"April 4, 2017",2017,74 min,,Movies,"Louis C.K. muses on religion, eternal love, gi..."
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...
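
In the three rows above, the runtime appears under *rating* while *duration* is empty, which suggests the values ended up in the wrong column. One possible repair is sketched below on a toy frame that mimics the notebook's state at this point (the `' min'` suffix already stripped from *duration*). The `toy` frame and the `misplaced` mask are illustrative, and blanking the rating afterwards is a judgment call:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Louis C.K. 2017", "Zoom"],
    "rating": ["74 min", "PG"],
    "duration": [np.nan, "88"],
})

# Rows where duration is missing and rating looks like a runtime.
misplaced = toy["duration"].isna() & toy["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the runtime into duration (without the suffix) and mark the rating as unknown.
toy.loc[misplaced, "duration"] = (
    toy.loc[misplaced, "rating"].str.replace(" min", "", regex=False)
)
toy.loc[misplaced, "rating"] = np.nan
```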


In [None]:
# Convert str to float
df_netflix_movies['duration'] = df_netflix_movies['duration'].astype(float)


In [None]:
df_netflix_movies.describe()

Unnamed: 0,release_year,duration
count,6131.0,6128.0
mean,2013.121514,99.577187
std,9.678169,28.290593
min,1942.0,3.0
25%,2012.0,87.0
50%,2016.0,98.0
75%,2018.0,114.0
max,2021.0,312.0


In [None]:
df_netflix_movies['date_added'] = pd.to_datetime(df_netflix_movies['date_added'], errors='coerce')


In [None]:
df_netflix_movies['date_added'] = df_netflix_movies['date_added'].dt.date


In [None]:
df_netflix_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6131 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       6131 non-null   object 
 1   type          6131 non-null   object 
 2   title         6131 non-null   object 
 3   director      5943 non-null   object 
 4   cast          5656 non-null   object 
 5   country       5691 non-null   object 
 6   date_added    6131 non-null   object 
 7   release_year  6131 non-null   int64  
 8   rating        6129 non-null   object 
 9   duration      6128 non-null   float64
 10  gender        6131 non-null   object 
 11  description   6131 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 751.7+ KB
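
`info()` reports *date_added* as `object` because `.dt.date` returns plain Python `date` objects. If grouping by year or date arithmetic is needed later, one alternative is to keep the `datetime64` dtype. A minimal sketch on toy values (the `toy` frame and the `year_added` name are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {"date_added": ["September 25, 2021", "March 2, 2019", "not a date"]}
)

# Unparseable strings become NaT instead of raising an error.
toy["date_added"] = pd.to_datetime(toy["date_added"], errors="coerce")

# The column stays datetime64, so the .dt accessor keeps working.
toy["year_added"] = toy["date_added"].dt.year
titles_per_year = toy["year_added"].value_counts().sort_index()
```

Counting the `NaT` values right after the conversion shows how many dates `errors='coerce'` silently discarded.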


#### Investigating the null values

In [None]:
df_netflix_movies[df_netflix_movies['country'].isnull()]
# replace 'country' with the name of any other column that has missing values

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,2021-09-24,2021,PG,91.0,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
13,s14,Movie,Confessions of an Invisible Girl,Bruno Garotti,"Klara Castanho, Lucca Picon, Júlia Gomes, Marc...",,2021-09-22,2021,TV-PG,91.0,"Children & Family Movies, Comedies",When the clever but socially-awkward Tetê join...
16,s17,Movie,Europe's Most Dangerous Man: Otto Skorzeny in ...,"Pedro de Echave García, Pablo Azorín Williams",,,2021-09-22,2020,TV-MA,67.0,"Documentaries, International Movies",Declassified documents reveal the post-WWII li...
18,s19,Movie,Intrusion,Adam Salky,"Freida Pinto, Logan Marshall-Green, Robert Joh...",,2021-09-22,2021,TV-14,94.0,Thrillers,After a deadly home invasion at a couple’s new...
22,s23,Movie,Avvai Shanmughi,K.S. Ravikumar,"Kamal Hassan, Meena, Gemini Ganesan, Heera Raj...",,2021-09-21,1996,TV-PG,161.0,"Comedies, International Movies",Newly divorced and denied visitation rights wi...
...,...,...,...,...,...,...,...,...,...,...,...,...
8585,s8586,Movie,Three-Quarters Decent,Mohamed Hamdy,"Mohamed Ragab, Lamitta Frangieh, Mohsen Mansou...",,2019-06-20,2010,TV-14,96.0,"Comedies, Dramas, International Movies","Determined to fight corruption in his country,..."
8602,s8603,Movie,Tom and Jerry: The Magic Ring,Phil Roman,"Richard Kind, Dana Hill, Anndi McAfee, Tony Ja...",,2019-12-15,2001,TV-Y7,60.0,"Children & Family Movies, Comedies",When a young wizard leaves Tom to guard his pr...
8622,s8623,Movie,Tremors 2: Aftershocks,S.S. Wilson,"Fred Ward, Chris Gartin, Helen Shaver, Michael...",,2020-01-01,1995,PG-13,100.0,"Comedies, Horror Movies, Sci-Fi & Fantasy",A rag-tag team of survivalists and scientists ...
8718,s8719,Movie,Westside vs. the World,Michael Fahey,"Ron Perlman, Louie Simmons",,2019-08-09,2019,TV-MA,96.0,"Documentaries, Sports Movies",A look into the journey of influential strengt...
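
Instead of editing the column name by hand for each variable, the same inspection can be looped over every column that has missing values. A minimal sketch on a toy frame whose gaps are invented for illustration; the helper name `rows_with_missing` is hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Director 1", np.nan, "Director 3"],
    "country": [np.nan, "United States", np.nan],
})

def rows_with_missing(df: pd.DataFrame) -> dict:
    """Map each column that contains nulls to the rows where it is null."""
    null_counts = df.isnull().sum()
    cols = null_counts[null_counts > 0].index
    return {col: df[df[col].isnull()] for col in cols}

missing = rows_with_missing(toy)
for col, rows in missing.items():
    print(f"{col}: {len(rows)} rows missing")
```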


#### TV Show

In [None]:
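# .copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_netflix_series = df_netflix_tratando[df_netflix_tratando['type'] == 'TV Show'].copy()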

In [None]:
df_netflix_series.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2676 entries, 1 to 8803
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       2676 non-null   object
 1   type          2676 non-null   object
 2   title         2676 non-null   object
 3   director      230 non-null    object
 4   cast          2326 non-null   object
 5   country       2285 non-null   object
 6   date_added    2666 non-null   object
 7   release_year  2676 non-null   int64 
 8   rating        2674 non-null   object
 9   duration      2676 non-null   object
 10  gender        2676 non-null   object
 11  description   2676 non-null   object
dtypes: int64(1), object(11)
memory usage: 271.8+ KB


In [None]:
df_netflix_series['date_added'] = pd.to_datetime(df_netflix_series['date_added'], errors='coerce') #first


In [None]:
df_netflix_series['date_added'] = df_netflix_series['date_added'].dt.date # second


In [None]:
df_netflix_series['duration'] = df_netflix_series['duration'].str.replace(' Seasons', '')
df_netflix_series['duration'] = df_netflix_series['duration'].str.replace(' Season', '')


In [None]:
# convert the duration column to float
df_netflix_series['duration'] = df_netflix_series['duration'].astype(float)


In [None]:
df_netflix_series.describe()

Unnamed: 0,release_year,duration
count,2676.0,2676.0
mean,2016.605755,1.764948
std,5.740138,1.582752
min,1925.0,1.0
25%,2016.0,1.0
50%,2018.0,1.0
75%,2020.0,2.0
max,2021.0,17.0
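
The suffix removal applied to both frames can also be done in one step with a regular expression that pulls out the leading number, which covers `min`, `Season`, and `Seasons` alike. A sketch on toy values, using `str.extract` and `pd.to_numeric` in place of the notebook's `str.replace` and `astype`:

```python
import pandas as pd

durations = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Capture the first run of digits; rows without a match become NaN.
numbers = durations.str.extract(r"(\d+)", expand=False)

# Convert the captured text to numbers, keeping NaN for missing rows.
duration_num = pd.to_numeric(numbers, errors="coerce")
```

This drops the unit, so minutes and seasons should still be kept in separate frames or columns, as the notebook does.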


## Movie analysis - Movies from Netflix

In [None]:
df_netflix_movies.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90.0,Documentaries,"As her father nears the end of his life, filmm..."
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,2021-09-24,2021,PG,91.0,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...


In [None]:
df_director = df_netflix_movies.groupby('director')['title'].count().sort_values(ascending=False)
df_director

Unnamed: 0_level_0,title
director,Unnamed: 1_level_1
Rajiv Chilaka,19
"Raúl Campos, Jan Suter",18
Suhas Kadav,16
Marcus Raboy,15
Jay Karas,14
...,...
Jude Okwudiafor Johnson,1
Jude Weng,1
Julia Hart,1
Julia Knowles,1


In [None]:
# Directors with more than one movie
df_director_filter = df_director[df_director > 1]
df_director_filter

Unnamed: 0_level_0,title
director,Unnamed: 1_level_1
Rajiv Chilaka,19
"Raúl Campos, Jan Suter",18
Suhas Kadav,16
Marcus Raboy,15
Jay Karas,14
...,...
Alexis Morante,2
Angelina Jolie,2
A. L. Vijay,2
Chris Sivertson,2


In [None]:
df_release_year = df_netflix_movies.groupby('release_year')['title'].count().sort_values(ascending=False)
df_release_year

Unnamed: 0_level_0,title
release_year,Unnamed: 1_level_1
2018,767
2017,767
2016,658
2019,633
2020,517
...,...
1946,1
1963,1
1961,1
1959,1


In [None]:
# Filtering
df_release_year_filter = df_release_year[df_release_year > 11]
df_release_year_filter

Unnamed: 0_level_0,title
release_year,Unnamed: 1_level_1
2018,767
2017,767
2016,658
2019,633
2020,517
2015,398
2021,277
2014,264
2013,225
2012,173


In [None]:
# Count movies per country
df_movies_country = df_netflix_movies.groupby('country')['title'].count().sort_values(ascending=False)
df_movies_country

Unnamed: 0_level_0,title
country,Unnamed: 1_level_1
United States,2058
India,893
United Kingdom,206
Canada,122
Spain,97
...,...
"Ireland, France, Iceland, United States, Mexico, Belgium, United Kingdom, Hong Kong",1
"Ireland, Canada, United Kingdom, United States",1
"Ireland, Canada, Luxembourg, United States, United Kingdom, Philippines, India",1
"Ireland, Canada",1


In [None]:
df_movies_country_filter = df_movies_country[df_movies_country > 10]
df_movies_country_filter

Unnamed: 0_level_0,title
country,Unnamed: 1_level_1
United States,2058
India,893
United Kingdom,206
Canada,122
Spain,97
Egypt,92
Nigeria,86
Indonesia,77
Japan,76
Turkey,76


In [None]:
df_movies_country.head(10)

Unnamed: 0_level_0,title
country,Unnamed: 1_level_1
United States,2058
India,893
United Kingdom,206
Canada,122
Spain,97
Egypt,92
Nigeria,86
Indonesia,77
Japan,76
Turkey,76
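
Because co-productions are stored as a single comma-separated string, `groupby('country')` counts each combination of countries as its own category. To count a movie once for every country involved, the field can be split first. A minimal sketch on toy rows taken from the outputs above (the `toy` frame is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Je Suis Karl", "The Starling", "Zoom", "Zubaan"],
    "country": ["Germany, Czech Republic", "United States", "United States", "India"],
})

# One entry per (movie, country) pair, then count entries per country.
per_country = (
    toy["country"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
```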


In [None]:
# saving the movies dataset
df_netflix_movies.to_csv('netflix_movies.csv', index=False)

In [None]:
# import the Colab helper so the file is downloaded automatically when this cell runs
from google.colab import files
files.download('netflix_movies.csv')
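
`google.colab` is only available inside Colab, so the cell above fails in other environments. A hedged sketch that falls back to just writing the file locally; the `save_and_maybe_download` helper and the demo file name are hypothetical:

```python
import os
import tempfile

import pandas as pd

try:
    from google.colab import files  # available only inside Google Colab
except ImportError:
    files = None

def save_and_maybe_download(df: pd.DataFrame, path: str) -> str:
    """Write df to CSV and trigger a browser download when running in Colab."""
    df.to_csv(path, index=False)
    if files is not None:
        files.download(path)
    return path

toy = pd.DataFrame({"title": ["Zoom"], "duration": [88.0]})
out_path = save_and_maybe_download(
    toy, os.path.join(tempfile.gettempdir(), "netflix_movies_demo.csv")
)
```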
