
# Data Scientist Challenge - Indicium | Lighthouse Program

Challenge

You have been assigned to an Indicium team hired by a Hollywood studio called PProductions, and you must now analyze a film database to guide which type of movie should be developed next. Keep in mind that a lot of money is at stake, so the analysis must be thorough and take as many factors as possible into account (bringing in external data is allowed, and encouraged).


# Data Preparation

Data Extraction

In [None]:
# Importing the pandas library
import pandas as pd

In [None]:
# Loading the CSV file
dados = pd.read_csv('/content/desafio_indicium_imdb.csv')
dados.head()

,Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
1,2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
2,3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
3,4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
4,5,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905
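The `Unnamed: 0` column in the preview is typically a row index that was written into the CSV when it was exported. As a minimal sketch (using a small in-memory sample rather than the real file, and assuming the first column really is such an index), `index_col=0` tells `read_csv` to use it as the index instead of loading it as data:

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV: the first, unnamed column is an exported index.
sample_csv = io.StringIO(
    ",Series_Title,Released_Year\n"
    "1,The Godfather,1972\n"
    "2,The Dark Knight,2008\n"
)

# index_col=0 uses the first column as the index instead of creating 'Unnamed: 0'.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With this option the later `drop(columns=['Unnamed: 0'])` step would not be needed.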


Data Cleaning and Treatment

In [None]:
# Checking the dtype of each column and whether there are any null values
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     999 non-null    int64  
 1   Series_Title   999 non-null    object 
 2   Released_Year  999 non-null    object 
 3   Certificate    898 non-null    object 
 4   Runtime        999 non-null    object 
 5   Genre          999 non-null    object 
 6   IMDB_Rating    999 non-null    float64
 7   Overview       999 non-null    object 
 8   Meta_score     842 non-null    float64
 9   Director       999 non-null    object 
 10  Star1          999 non-null    object 
 11  Star2          999 non-null    object 
 12  Star3          999 non-null    object 
 13  Star4          999 non-null    object 
 14  No_of_Votes    999 non-null    int64  
 15  Gross          830 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 125.0+ KB


In [None]:
# Counting the null values in each column
dados.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
Series_Title,0
Released_Year,0
Certificate,101
Runtime,0
Genre,0
IMDB_Rating,0
Overview,0
Meta_score,157
Director,0
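Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (not the challenge data) to show one way of reporting both per column:

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps, standing in for the real DataFrame.
sample = pd.DataFrame({
    "Certificate": ["A", None, "U", None],
    "Meta_score": [100.0, 84.0, np.nan, 96.0],
    "Director": ["a", "b", "c", "d"],
})

# Count of nulls and percentage of rows affected, per column.
null_report = pd.DataFrame({
    "null_count": sample.isnull().sum(),
    "null_pct": (sample.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have missing values.
null_report = null_report[null_report["null_count"] > 0]
print(null_report)
```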


In [None]:
# Handling null values
# Replacing nulls in the 'Certificate' column with 'Not Rated' to indicate that the film has no certificate or that it was not reported
# Assigning the result back avoids the chained inplace call, which pandas warns will stop working in pandas 3.0
dados['Certificate'] = dados['Certificate'].fillna('Not Rated')


In [None]:
# Replacing nulls in the 'Gross' column with 'Unknown' to indicate that the box-office figure is not known
# Assigning the result back avoids the chained inplace call, which pandas warns will stop working in pandas 3.0
dados['Gross'] = dados['Gross'].fillna('Unknown')


In [None]:
# Filling nulls in the 'Meta_score' column with the median so the column can be used later without distorting the analysis with unrealistic values
dados['Meta_score'] = dados['Meta_score'].fillna(dados['Meta_score'].median())

In [None]:
# Checking whether any null values remain
dados.isnull().sum()

Unnamed: 0,0
Series_Title,0
Released_Year,0
Certificate,0
Runtime,0
Genre,0
IMDB_Rating,0
Overview,0
Meta_score,0
Director,0
Star1,0
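The three fills above can also be written as a single `DataFrame.fillna` call that takes a column-to-value mapping, which sidesteps chained `inplace` calls on a single column. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real DataFrame.
sample = pd.DataFrame({
    "Certificate": ["A", None, "U"],
    "Gross": ["134,966,411", None, "4,360,000"],
    "Meta_score": [100.0, np.nan, 96.0],
})

# One fill value per column; the median is computed before filling.
fill_values = {
    "Certificate": "Not Rated",
    "Gross": "Unknown",
    "Meta_score": sample["Meta_score"].median(),
}
sample = sample.fillna(fill_values)
print(sample)
```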


In [None]:
# Removing irrelevant data (the 'Unnamed: 0' column)
dados = dados.drop(columns=['Unnamed: 0'])
dados.head()

,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
1,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
2,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
3,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
4,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905


In [None]:
# Fixing the dtype of the 'Released_Year' column
# Listing all unique values to spot irregularities
dados['Released_Year'].unique()

array(['1972', '2008', '1974', '1957', '2003', '1994', '1993', '2010',
       '1999', '2001', '1966', '2002', '1990', '1980', '1975', '2020',
       '2019', '2014', '1998', '1997', '1995', '1991', '1977', '1962',
       '1954', '1946', '2011', '2006', '2000', '1988', '1985', '1968',
       '1960', '1942', '1936', '1931', '2018', '2017', '2016', '2012',
       '2009', '2007', '1984', '1981', '1979', '1971', '1963', '1964',
       '1950', '1940', '2013', '2005', '2004', '1992', '1987', '1986',
       '1983', '1976', '1973', '1965', '1959', '1958', '1952', '1948',
       '1944', '1941', '1927', '1921', '2015', '1996', '1989', '1978',
       '1961', '1955', '1953', '1925', '1924', '1982', '1967', '1951',
       '1949', '1939', '1937', '1934', '1928', '1926', '1920', '1970',
       '1969', '1956', '1947', '1945', '1930', '1938', '1935', '1933',
       '1932', '1922', '1943', 'PG'], dtype=object)
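Scanning the array by eye works for about a hundred values, but the same check can be automated. A minimal sketch, on a made-up series, that flags every entry `pd.to_numeric` cannot parse:

```python
import pandas as pd

# Made-up sample with one bad entry, mirroring the 'PG' case.
years = pd.Series(["1972", "2008", "PG", "1995"], name="Released_Year")

# errors='coerce' turns anything that is not a number into NaN.
parsed = pd.to_numeric(years, errors="coerce")

# Rows where parsing failed are the irregular values.
invalid = years[parsed.isna()]
print(invalid)
```

The index of `invalid` points directly at the rows that need a manual fix.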

In [None]:
# Counting how many rows have the value 'PG' in the Released_Year column
pg_cont = (dados['Released_Year'] == 'PG').sum()
print(pg_cont)

1


In [None]:
# Finding which row holds the value 'PG'
pg_linhas = dados[dados['Released_Year'] == 'PG']
print(pg_linhas)

    Series_Title Released_Year Certificate  Runtime  \
965    Apollo 13            PG           U  140 min   

                         Genre  IMDB_Rating  \
965  Adventure, Drama, History          7.6   

                                              Overview  Meta_score  \
965  NASA must devise a strategy to return Apollo 1...        77.0   

       Director      Star1        Star2        Star3        Star4  \
965  Ron Howard  Tom Hanks  Bill Paxton  Kevin Bacon  Gary Sinise   

     No_of_Votes        Gross  
965       269197  173,837,933  


In [None]:
# Correcting the value with the film's actual release year. Source: <https://www.imdb.com/pt/title/tt0112384/>
dados.loc[dados['Released_Year'] == 'PG', 'Released_Year'] = 1995

In [None]:
# Checking that the value was updated correctly
dados['Released_Year'].unique()

array(['1972', '2008', '1974', '1957', '2003', '1994', '1993', '2010',
       '1999', '2001', '1966', '2002', '1990', '1980', '1975', '2020',
       '2019', '2014', '1998', '1997', '1995', '1991', '1977', '1962',
       '1954', '1946', '2011', '2006', '2000', '1988', '1985', '1968',
       '1960', '1942', '1936', '1931', '2018', '2017', '2016', '2012',
       '2009', '2007', '1984', '1981', '1979', '1971', '1963', '1964',
       '1950', '1940', '2013', '2005', '2004', '1992', '1987', '1986',
       '1983', '1976', '1973', '1965', '1959', '1958', '1952', '1948',
       '1944', '1941', '1927', '1921', '2015', '1996', '1989', '1978',
       '1961', '1955', '1953', '1925', '1924', '1982', '1967', '1951',
       '1949', '1939', '1937', '1934', '1928', '1926', '1920', '1970',
       '1969', '1956', '1947', '1945', '1930', '1938', '1935', '1933',
       '1932', '1922', '1943', 1995], dtype=object)

In [22]:
# Converting the dtype to int
dados['Released_Year'] = dados['Released_Year'].astype(int)
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   999 non-null    object 
 1   Released_Year  999 non-null    int64  
 2   Certificate    999 non-null    object 
 3   Runtime        999 non-null    object 
 4   Genre          999 non-null    object 
 5   IMDB_Rating    999 non-null    float64
 6   Overview       999 non-null    object 
 7   Meta_score     999 non-null    float64
 8   Director       999 non-null    object 
 9   Star1          999 non-null    object 
 10  Star2          999 non-null    object 
 11  Star3          999 non-null    object 
 12  Star4          999 non-null    object 
 13  No_of_Votes    999 non-null    int64  
 14  Gross          999 non-null    object 
dtypes: float64(2), int64(2), object(11)
memory usage: 117.2+ KB
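`Gross` is still an `object` column in the summary above: as the printed Apollo 13 row shows, its values are strings with thousands separators, and the nulls were replaced by the `'Unknown'` placeholder. If numeric revenue is needed later, one option is a separate numeric column. This is a sketch on made-up values, and the `Gross_num` name is hypothetical, not part of the original notebook:

```python
import pandas as pd

# Made-up sample: strings with thousands separators and the 'Unknown' placeholder.
sample = pd.DataFrame({"Gross": ["134,966,411", "Unknown", "4,360,000"]})

# Strip the separators, then coerce non-numeric entries such as 'Unknown' to NaN.
# 'Gross_num' is a hypothetical column name used only for this illustration.
sample["Gross_num"] = pd.to_numeric(
    sample["Gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(sample)
```

Keeping the original text column alongside the numeric one preserves the distinction between "unknown" and a real value.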


In [30]:
# Creating a new column 'Runtime_min' by removing the 'min' suffix and surrounding spaces, leaving only the numbers
dados['Runtime_min'] = dados['Runtime'].str.replace('min', '', regex=False).str.strip()

In [37]:
# Checking that the 'Runtime_min' column was created correctly
dados.head(3)

,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Runtime_min
0,The Godfather,1972,A,175 min,"crime, drama",9.2,an organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,175
1,The Dark Knight,2008,UA,152 min,"action, crime, drama",9.0,when the menace known as the joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,152
2,The Godfather: Part II,1974,A,202 min,"crime, drama",9.0,the early life and career of vito corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,202


In [34]:
# Changing the dtype of the 'Runtime_min' column to int
dados['Runtime_min'] = dados['Runtime_min'].astype(int)

In [35]:
# Checking that the dtype of 'Runtime_min' was changed correctly
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   999 non-null    object 
 1   Released_Year  999 non-null    int64  
 2   Certificate    999 non-null    object 
 3   Runtime        999 non-null    object 
 4   Genre          999 non-null    object 
 5   IMDB_Rating    999 non-null    float64
 6   Overview       999 non-null    object 
 7   Meta_score     999 non-null    float64
 8   Director       999 non-null    object 
 9   Star1          999 non-null    object 
 10  Star2          999 non-null    object 
 11  Star3          999 non-null    object 
 12  Star4          999 non-null    object 
 13  No_of_Votes    999 non-null    int64  
 14  Gross          999 non-null    object 
 15  Runtime_min    999 non-null    int64  
dtypes: float64(2), int64(3), object(11)
memory usage: 125.0+ KB
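The replace, strip, and cast steps can also be combined by pulling the digits out with a regular expression. A minimal sketch on made-up values, assuming every entry contains a run of digits:

```python
import pandas as pd

# Made-up sample in the same 'NNN min' format as the Runtime column.
sample = pd.DataFrame({"Runtime": ["175 min", "152 min", " 96 min "]})

# Capture the first run of digits and cast it to int in one chain.
# expand=False returns a Series rather than a one-column DataFrame.
sample["Runtime_min"] = (
    sample["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)
)
print(sample["Runtime_min"].tolist())
```

If an entry had no digits, `str.extract` would yield a missing value and the cast to `int` would fail, which makes bad rows visible early.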


In [24]:
# Stripping any extra whitespace from the descriptive 'object' columns
for col in dados.select_dtypes(include=['object']).columns:
  dados[col] = dados[col].str.strip()

In [25]:
# Standardizing the text columns 'Genre' and 'Overview' to lowercase for easier reading
dados_text_lower = ['Genre', 'Overview']
for col in dados_text_lower:
  dados[col] = dados[col].str.lower()

In [26]:
# Checking that the text standardization was applied correctly
dados.head()

,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Godfather,1972,A,175 min,"crime, drama",9.2,an organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
1,The Dark Knight,2008,UA,152 min,"action, crime, drama",9.0,when the menace known as the joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
2,The Godfather: Part II,1974,A,202 min,"crime, drama",9.0,the early life and career of vito corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
3,12 Angry Men,1957,U,96 min,"crime, drama",9.0,a jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
4,The Lord of the Rings: The Return of the King,2003,U,201 min,"action, adventure, drama",8.9,gandalf and aragorn lead the world of men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905
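After lowercasing, each `Genre` value is still a single comma-separated string, so a film tagged `crime, drama` does not count toward either genre on its own. If per-genre counts are wanted, one possible approach is to split and explode the column. This is a sketch on made-up rows, not a step from the original notebook:

```python
import pandas as pd

# Made-up rows in the same lowercase, comma-separated format.
sample = pd.DataFrame({
    "Series_Title": ["The Godfather", "The Dark Knight"],
    "Genre": ["crime, drama", "action, crime, drama"],
})

# Split on the ', ' separator and give every genre its own row.
exploded = sample.assign(Genre=sample["Genre"].str.split(", ")).explode("Genre")

# Count how many films carry each genre.
genre_counts = exploded["Genre"].value_counts()
print(genre_counts)
```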


In [48]:
# Saving the treated data in a copy of the original DataFrame
dados_tratados = dados.copy()

In [49]:
# Dropping the 'Runtime' column so that only the new 'Runtime_min' column is used
dados_tratados.drop(columns=['Runtime'], inplace=True)

In [51]:
# Checking that the change was applied correctly
dados_tratados.head(3)

,Series_Title,Released_Year,Certificate,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Runtime_min
0,The Godfather,1972,A,"crime, drama",9.2,an organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,175
1,The Dark Knight,2008,UA,"action, crime, drama",9.0,when the menace known as the joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,152
2,The Godfather: Part II,1974,A,"crime, drama",9.0,the early life and career of vito corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,202


In [52]:
# Saving the treated data to a new CSV file for future reference
dados_tratados.to_csv('dados_tratados_desafio_indicium.csv', index=False)
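A quick way to confirm the export is to read the file back and compare it with the DataFrame in memory. The sketch below does this with a made-up frame and a temporary file (the file name is illustrative):

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for dados_tratados.
sample = pd.DataFrame({
    "Series_Title": ["The Godfather", "12 Angry Men"],
    "Released_Year": [1972, 1957],
    "Runtime_min": [175, 96],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_tratados.csv")
    # index=False keeps the row index out of the file, so no 'Unnamed: 0' on reload.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns, dtypes, and values after the round trip.
round_trip_ok = reloaded.equals(sample)
print(round_trip_ok)
```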