<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: Machine Learning Fundamentals
**Exercise** Notebook<br>
Professor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Theory;</li>
  <li>Categorical attributes;</li>
  <li>Numeric attributes;</li>
  <li>Missing data.</li>
</ol>

---

# **Exercises**

## 1\. Penguins

In this exercise, we will use a dataset with information about penguins. The idea is to prepare the dataset to predict the penguin's species (response variable) from its physical and geographic characteristics (predictor variables).

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns

In [44]:
data = sns.load_dataset('penguins')

In [45]:
data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


### **1.1. Null values**

The dataset has missing values; use the concepts from the lesson to handle them.

In [46]:
# answer to question 1.1

# removing every row that has at least one null value

data = data.dropna()

In [47]:
# checking whether any null values remain

print(data.isna().sum())

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
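
Dropping rows is only one way to handle missing data. As a minimal sketch, assuming a small hand-made table (`toy`, with hypothetical values, not the real penguins data), the cell below compares dropping rows with imputing them: the median for a numeric column and the most frequent value for a categorical one. Which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins table (not the real dataset).
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0],
    "sex": ["Male", "Female", None, "Female"],
})

# Count missing values per column before deciding how to treat them.
missing_per_column = toy.isna().sum()

# Option A: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option B: impute instead of dropping
# (median for the numeric column, most frequent value for the categorical one).
imputed = toy.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].median())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(missing_per_column)
print(len(dropped), len(imputed))
```

Option A shrinks the table, while option B keeps every row at the cost of introducing estimated values.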


### **1.2. Numeric variables**

Identify the numeric variables and create a new column with their **standardized** values. The new column must have the same name as the original column plus the suffix "*_std*".

> **Note**: Do not transform the response variable.

In [48]:
# checking the data types
data.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [49]:
# inspecting the numeric columns

data[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
4,36.7,19.3,193.0,3450.0
5,39.3,20.6,190.0,3650.0


In [50]:
# standardizing a column requires its mean and its standard deviation.

# computing the mean and standard deviation of every numeric column
# (pandas' std() returns the sample standard deviation, ddof=1, by default)
medias = data.mean(numeric_only=True)
desvios_padrao = data.std(numeric_only=True)

# columns to standardize
colunas_padronizar = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# standardize each column and add the result to the DataFrame
for coluna in colunas_padronizar:
    nome_coluna_padronizada = coluna + '_std'
    data[nome_coluna_padronizada] = (data[coluna] - medias[coluna]) / desvios_padrao[coluna]

data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_std,bill_depth_mm_std,flipper_length_mm_std,body_mass_g_std
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,-0.894695,0.779559,-1.424608,-0.567621
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,-0.821552,0.119404,-1.067867,-0.505525
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,-0.675264,0.424091,-0.425733,-1.188572
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,-1.333559,1.084246,-0.568429,-0.940192
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,-0.858123,1.7444,-0.782474,-0.691811
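
The standardization used here is the z-score, $z = (x - \bar{x}) / s$. As a minimal sketch, the cell below applies the same formula without a loop to a toy table (`toy`, built from a few of the rows shown above, so its means and deviations differ from the full dataset) and checks that each standardized column ends up with mean close to 0 and standard deviation close to 1.

```python
import pandas as pd

# Toy table with a few rows copied from the output above (illustration only).
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
})
columns = ["bill_length_mm", "body_mass_g"]

# Vectorized z-score: (x - mean) / std, using the sample standard
# deviation (ddof=1), which is the pandas default.
standardized = (toy[columns] - toy[columns].mean()) / toy[columns].std()
standardized = standardized.add_suffix("_std")
toy = pd.concat([toy, standardized], axis=1)

# Each standardized column should have mean ~0 and standard deviation ~1.
print(toy[standardized.columns].mean().round(6))
print(toy[standardized.columns].std().round(6))
```

The same sanity check can be run on the `*_std` columns of `data` to confirm the loop above behaved as intended.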


### **1.3. Categorical variables**

Identify the nominal and ordinal categorical variables and create a new column by applying the appropriate encoding technique to their values. The new column must have the same name as the original column plus the suffix "*_nom*" or "*_ord*".

> **Note**: Do not transform the response variable.

In [51]:
# using one-hot encoding for the nominal column "sex"
# (the new columns keep the "_ord" suffix used throughout this notebook)

data['sex_male_ord'] = data['sex'].apply(lambda sex: 1 if sex == 'Male' else 0)
data['sex_female_ord'] = data['sex'].apply(lambda sex: 1 if sex == 'Female' else 0)



In [52]:
# checking the unique values of the "island" column

data['island'].drop_duplicates()


0     Torgersen
20       Biscoe
30        Dream
Name: island, dtype: object

In [53]:
# one-hot encoding the nominal category "island"
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to booleans)

island_dummies = pd.get_dummies(data['island'], prefix='island_nom', dtype=int)
data = pd.concat([data, island_dummies], axis=1)
display(data)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_std,bill_depth_mm_std,flipper_length_mm_std,body_mass_g_std,sex_male_ord,sex_female_ord,island_nom_Biscoe,island_nom_Dream,island_nom_Torgersen
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,-0.894695,0.779559,-1.424608,-0.567621,1,0,0,0,1
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,-0.821552,0.119404,-1.067867,-0.505525,0,1,0,0,1
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,-0.675264,0.424091,-0.425733,-1.188572,0,1,0,0,1
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,-1.333559,1.084246,-0.568429,-0.940192,0,1,0,0,1
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,-0.858123,1.744400,-0.782474,-0.691811,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female,0.586470,-1.759497,0.929884,0.891616,0,1,1,0,0
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,0.513326,-1.454811,1.001232,0.798473,0,1,1,0,0
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,1.171621,-0.743875,1.500670,1.916186,1,0,1,0,0
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,0.220750,-1.200905,0.787187,1.233139,0,1,1,0,0
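
Both `sex` and `island` are nominal, so nothing in this notebook uses ordinal encoding. As a minimal sketch of that technique, the cell below uses a hypothetical `size` column (it does not exist in the penguins dataset) with an assumed order small < medium < large, and maps each category to its rank.

```python
import pandas as pd

# Hypothetical column: "size" is NOT part of the penguins dataset.
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Assumed order for illustration: small < medium < large.
size_order = {"small": 0, "medium": 1, "large": 2}

# Ordinal encoding: replace each category with its position in the order.
toy["size_ord"] = toy["size"].map(size_order)

print(toy)
```

Unlike one-hot encoding, this produces a single column, and it only makes sense when the categories have a meaningful order.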


### **1.4. Cleanup**

Discard the original columns and keep only the response variable and the predictor variables with the suffixes "*_std*", "*_nom*", and "*_ord*".

In [55]:
# building the final DataFrame for species prediction with only the columns that will be used

df_predicao = data[[
    'species',
    'bill_length_mm_std', 'bill_depth_mm_std', 'flipper_length_mm_std', 'body_mass_g_std',
    'sex_male_ord', 'sex_female_ord',
    'island_nom_Biscoe', 'island_nom_Dream', 'island_nom_Torgersen',
]]

df_predicao.head()

Unnamed: 0,species,bill_length_mm_std,bill_depth_mm_std,flipper_length_mm_std,body_mass_g_std,sex_male_ord,sex_female_ord,island_nom_Biscoe,island_nom_Dream,island_nom_Torgersen
0,Adelie,-0.894695,0.779559,-1.424608,-0.567621,1,0,0,0,1
1,Adelie,-0.821552,0.119404,-1.067867,-0.505525,0,1,0,0,1
2,Adelie,-0.675264,0.424091,-0.425733,-1.188572,0,1,0,0,1
4,Adelie,-1.333559,1.084246,-0.568429,-0.940192,0,1,0,0,1
5,Adelie,-0.858123,1.7444,-0.782474,-0.691811,1,0,0,0,1


---