## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## File

In [2]:
url = 'https://raw.githubusercontent.com/albvieiraa/EDA-Streamings/refs/heads/main/datasets/hulu_titles.csv'

In [3]:
df_hulu = pd.read_csv(url)
df_hulu.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


In [4]:
df_hulu.tail(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
3070,s3071,TV Show,The Fades,,,United Kingdom,,2011,TV-14,1 Season,"Horror, International, Science Fiction",Seventeen-year-old Paul is haunted by apocalyp...
3071,s3072,TV Show,The Twilight Zone,,,United States,,1959,TV-PG,5 Seasons,"Classics, Science Fiction, Thriller",Rod Serling's seminal anthology series focused...
3072,s3073,TV Show,Tokyo Magnitude 8.0,,,Japan,,2009,TV-14,1 Season,"Anime, Drama, International",The devastation is unleashed in the span of se...


## Getting to know our data

In [5]:
df_hulu.shape

(3073, 12)

In [6]:
df_hulu.describe()

Unnamed: 0,cast,release_year
count,0.0,3073.0
mean,,2012.567524
std,,10.844069
min,,1923.0
25%,,2010.0
50%,,2016.0
75%,,2019.0
max,,2021.0


In [7]:
df_hulu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3073 entries, 0 to 3072
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       3073 non-null   object 
 1   type          3073 non-null   object 
 2   title         3073 non-null   object 
 3   director      3 non-null      object 
 4   cast          0 non-null      float64
 5   country       1620 non-null   object 
 6   date_added    3045 non-null   object 
 7   release_year  3073 non-null   int64  
 8   rating        2553 non-null   object 
 9   duration      2594 non-null   object 
 10  listed_in     3073 non-null   object 
 11  description   3069 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 288.2+ KB


In [8]:
df_hulu.isnull().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,3070
cast,3073
country,1453
date_added,28
release_year,0
rating,520
duration,479
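
Raw null counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a small stand-in frame; `toy` and its values are made up for illustration, and with the real data the same calls would presumably be applied to `df_hulu`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_hulu: 4 rows with some missing values
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["Japan", np.nan, np.nan, "Brazil"],
    "cast": [np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Columns with no usable data at all are candidates for dropping
all_null_cols = missing_pct[missing_pct == 100].index.tolist()
print(all_null_cols)
```

A column that is 100% null, like `cast` in the output above, carries no information, so dropping it may be simpler than converting its type.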


In [9]:
df_hulu.duplicated().sum()

np.int64(0)

In [10]:
df_hulu.nunique()

Unnamed: 0,0
show_id,3073
type,2
title,3073
director,3
cast,0
country,150
date_added,1115
release_year,72
rating,88
duration,135


In [11]:
# categories in the type column
df_hulu['type'].unique()

array(['Movie', 'TV Show'], dtype=object)

**Observations:**
- Handle the columns that are entirely (or almost entirely) null
- Check whether it is worth changing the data type of the *date_added* and *duration* columns
- Rename the *listed_in* column to **gender** (it holds the genres) and work out how to split these values
- Handle missing data
- Filter by media type
- Change the type of **cast**
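
One possible way to split the comma-separated genre values mentioned in the list above is `str.split` followed by `explode`. This is a sketch under assumptions: the `toy` frame is invented, and it assumes the separator is always a comma followed by a space, as in the rows shown by `head()`.

```python
import pandas as pd

# Hypothetical stand-in for the `listed_in` column of df_hulu
toy = pd.DataFrame({
    "title": ["Gaia", "The Marksman"],
    "listed_in": ["Horror", "Action, Thriller"],
})

# Split each string into a list, then give every genre its own row
genres_long = (
    toy.assign(listed_in=toy["listed_in"].str.split(", "))
       .explode("listed_in")
       .reset_index(drop=True)
)
print(genres_long)

# Number of titles per genre
genre_counts = genres_long["listed_in"].value_counts()
print(genre_counts)
```

The long format repeats a title once per genre, which makes per-genre counts straightforward but should not be used for counting titles.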

## Data cleaning

In [12]:
df_hulu_tratando = df_hulu.copy()

In [13]:
df_hulu_tratando = df_hulu_tratando.rename(columns={'listed_in': 'gender'})

Create one DataFrame containing only movies and another containing only TV shows.

#### Movies

In [14]:
# filter by media type
df_hulu_movies = df_hulu_tratando[df_hulu_tratando['type'] == 'Movie']

In [15]:
df_hulu_movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


In [16]:
df_hulu_movies.shape

(1484, 12)

In the *duration* column of `df_hulu_movies`, remove the text so that only the number remains, then convert the column to a numeric type.

In [17]:
df_hulu_movies['duration'] = df_hulu_movies['duration'].str.replace(' min', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_hulu_movies['duration'] = df_hulu_movies['duration'].str.replace(' min', '')
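
The message above is pandas' `SettingWithCopyWarning`. It appears because `df_hulu_movies` was created by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original as well. A common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses an invented `toy` frame rather than the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_hulu_tratando
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["94 min", "1 Season", "108 min"],
})

# .copy() makes the filtered frame independent of `toy`,
# so later column assignments are unambiguous
movies = toy[toy["type"] == "Movie"].copy()
movies["duration"] = movies["duration"].str.replace(" min", "", regex=False)

print(movies)
print(toy)  # the original frame is unchanged
```

The same pattern would apply to any filtered frame that is modified afterwards.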


In [18]:
# Inspect the rows where duration is null
df_hulu_movies[df_hulu_movies['duration'].isnull()]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
12,s13,Movie,Out of the Shadows: The Man Behind the Steele ...,,,,"October 18, 2021",2021,,,News,George Stephanopoulos sits down with former MI...
15,s16,Movie,Showtime Championship Boxing: Lopez vs. Salido...,,,,"October 16, 2021",2011,,,Sports,Juan Manuel Lopez vs. Orlando Salido for the W...
37,s38,Movie,BELLATOR MMA: Kongo vs. Johnson 2,,,,"October 13, 2021",2020,,,Sports,Cheick Kongo vs. Tim Johnson battle in a histo...
40,s41,Movie,In the Crease,,,,"October 13, 2021",2020,,,Sports,Post-game analysis and highlight show airing e...
...,...,...,...,...,...,...,...,...,...,...,...,...
2951,s2952,Movie,All Saints (1998),,,Australia,"February 19, 2014",1998,12 Seasons,,"Drama, International",All Saints is an Australian medical drama focu...
2955,s2956,Movie,Getting On,,,United Kingdom,"January 7, 2014",2009,3 Seasons,,"Comedy, International, Sitcom",Care for the elderly is the least glamorous ar...
2958,s2959,Movie,(Sub) Ninja Scroll,,,Japan,"December 15, 2013",1993,93 min,,"Action, Adult Animation, Adventure",When Jubei saves a young ninja woman from the ...
2959,s2960,Movie,Packed to the Rafters,,,Australia,"December 3, 2013",2008,6 Seasons,,"Comedy, Drama, International",Packed to the Rafters is an Australian family-...
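
Several rows in this output (for example `s2952` and `s2959`) show values such as `12 Seasons` or `93 min` in the *rating* column while *duration* is empty, which suggests the two fields are swapped or shifted in the source file for those rows. A hedged sketch for flagging such rows follows; the regular expression and the `toy` frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with two rows whose rating looks like a duration
toy = pd.DataFrame({
    "title": ["Gaia", "(Sub) Ninja Scroll", "Getting On"],
    "rating": ["R", "93 min", "3 Seasons"],
    "duration": ["97 min", np.nan, np.nan],
})

# Digits followed by "min", "Season" or "Seasons"; null ratings count as no match
looks_like_duration = toy["rating"].str.fullmatch(r"\d+ (?:min|Seasons?)", na=False)

# Keep only the rows where duration is also missing
misplaced = toy[looks_like_duration & toy["duration"].isnull()]
print(misplaced)
```

Whether to move these values into *duration* or simply exclude the rows is a judgment call that depends on the analysis.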


In [19]:
# Number of rows where duration is null
df_hulu_movies[df_hulu_movies['duration'].isnull()].shape

(479, 12)

In [20]:
# Convert the string column to float
df_hulu_movies['duration'] = df_hulu_movies['duration'].astype(float)
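
As an alternative to replacing the suffix and then casting, the digits can be extracted and converted in one step. This is only a sketch on invented values; it assumes each non-null entry contains a single number.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the movie duration column
duration = pd.Series(["94 min", "108 min", np.nan])

# Pull out the first run of digits; rows without digits stay missing
digits = duration.str.extract(r"(\d+)", expand=False)

# Convert to a numeric dtype; anything unparseable becomes NaN
minutes = pd.to_numeric(digits, errors="coerce")
print(minutes)
```

Because missing values are kept as `NaN`, the result is a float column, matching what `astype(float)` produces.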

In [21]:
df_hulu_movies.describe()

Unnamed: 0,cast,release_year,duration
count,0.0,1484.0,1005.0
mean,,2012.607143,98.266667
std,,12.142055,21.215423
min,,1923.0,1.0
25%,,2012.0,89.0
50%,,2017.0,97.0
75%,,2019.0,109.0
max,,2021.0,192.0


In [22]:
df_hulu_movies['date_added'] = pd.to_datetime(df_hulu_movies['date_added'], errors='coerce')

In [23]:
df_hulu_movies['date_added'] = df_hulu_movies['date_added'].dt.date
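
Note that `.dt.date` returns plain Python `date` objects, so the column goes back to `object` dtype and loses the `.dt` accessor. If the goal is only to drop the time component, one possible alternative is `.dt.normalize()`, which keeps a datetime dtype. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample of the date_added column
date_added = pd.Series(["October 24, 2021", "not a date", None])

# Unparseable or missing entries become NaT instead of raising
parsed = pd.to_datetime(date_added, errors="coerce")

as_objects = parsed.dt.date         # object dtype holding Python date values
normalized = parsed.dt.normalize()  # still a datetime dtype, time set to midnight

print(as_objects.dtype, normalized.dtype)
print(normalized.dt.year)
```

Keeping a datetime dtype makes later steps such as grouping by year or month simpler.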

In [25]:
# Convert the cast column to string
df_hulu_movies['cast'] = df_hulu_movies['cast'].astype(str)
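
One caveat: in the pandas version used in this notebook, `astype(str)` turns every `NaN` into the literal text `nan`, which is why `info()` afterwards reports *cast* as fully non-null. A possible alternative, sketched here on invented values, is the nullable `"string"` dtype, which keeps missing entries missing.

```python
import numpy as np
import pandas as pd

# Hypothetical all-null column, like cast in this dataset
cast = pd.Series([np.nan, np.nan], dtype="float64")

as_str = cast.astype(str)          # plain Python strings
as_string = cast.astype("string")  # nullable string dtype, NaN becomes <NA>

print(as_str.tolist())
print(as_string.isnull().sum())  # missing values are still counted
```

With the nullable dtype, later calls to `isnull()` or `info()` still reflect that the column is empty.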

In [26]:
df_hulu_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1484 entries, 0 to 2993
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       1484 non-null   object 
 1   type          1484 non-null   object 
 2   title         1484 non-null   object 
 3   director      3 non-null      object 
 4   cast          1484 non-null   object 
 5   country       661 non-null    object 
 6   date_added    1484 non-null   object 
 7   release_year  1484 non-null   int64  
 8   rating        1194 non-null   object 
 9   duration      1005 non-null   float64
 10  gender        1484 non-null   object 
 11  description   1481 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 183.0+ KB


#### Investigating the null values

In [35]:
df_hulu_movies[df_hulu_movies['description'].isnull()]
# to inspect another column, replace 'description' with the name of any other column that has missing values

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,gender,description
908,s909,Movie,Bart & Hahn,,,,2021-03-08,2021,,,Sports,
1158,s1159,Movie,UFC 252,,,,2020-11-25,2020,TV-14,,Sports,
1741,s1742,Movie,MLS,,,,2020-04-06,2009,,,Sports,
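
Rather than editing the column name by hand each time, the same filter can be wrapped in a small helper. `rows_missing` is a hypothetical name introduced here, and the `toy` frame is invented for illustration.

```python
import numpy as np
import pandas as pd


def rows_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows of `df` where `column` is null."""
    return df[df[column].isnull()]


# Hypothetical stand-in for df_hulu_movies
toy = pd.DataFrame({
    "title": ["UFC 252", "Gaia", "MLS"],
    "country": [np.nan, np.nan, np.nan],
    "description": [np.nan, "A forest ranger...", np.nan],
})

# Report how many rows are missing each column of interest
for col in ["country", "description"]:
    print(col, len(rows_missing(toy, col)))
```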


#### TV Show

In [36]:
df_hulu_series = df_hulu_tratando[df_hulu_tratando['type'] == 'TV Show']

In [37]:
df_hulu_series.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1589 entries, 5 to 3072
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       1589 non-null   object 
 1   type          1589 non-null   object 
 2   title         1589 non-null   object 
 3   director      0 non-null      object 
 4   cast          0 non-null      float64
 5   country       959 non-null    object 
 6   date_added    1561 non-null   object 
 7   release_year  1589 non-null   int64  
 8   rating        1359 non-null   object 
 9   duration      1589 non-null   object 
 10  gender        1589 non-null   object 
 11  description   1588 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 161.4+ KB


In [38]:
df_hulu_series['date_added'] = pd.to_datetime(df_hulu_series['date_added'], errors='coerce')  # first: parse the strings to datetime

In [39]:
df_hulu_series['date_added'] = df_hulu_series['date_added'].dt.date  # second: keep only the date part

In [40]:
df_hulu_series['duration'] = df_hulu_series['duration'].str.replace(' Seasons', '')
df_hulu_series['duration'] = df_hulu_series['duration'].str.replace(' Season', '')
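
The two replacements work because the plural suffix is removed before the singular one. A single regular expression can handle both forms in one pass; the sketch below uses invented values and assumes the suffix is always ` Season` or ` Seasons` at the end of the string.

```python
import pandas as pd

# Hypothetical sample of the TV show duration column
duration = pd.Series(["1 Season", "5 Seasons", "34 Seasons"])

# " Seasons?$" matches the singular or plural suffix at the end of the string
seasons = duration.str.replace(r" Seasons?$", "", regex=True).astype(float)
print(seasons.tolist())
```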

In [41]:
# convert the duration column to float
df_hulu_series['duration'] = df_hulu_series['duration'].astype(float)

In [43]:
# Convert the cast column to string
df_hulu_series['cast'] = df_hulu_series['cast'].astype(str)

In [44]:
df_hulu_series.describe()

Unnamed: 0,release_year,duration
count,1589.0,1589.0
mean,2012.530522,2.720579
std,9.476398,3.191016
min,1933.0,1.0
25%,2010.0,1.0
50%,2015.0,1.0
75%,2019.0,3.0
max,2021.0,34.0
