<a href="https://colab.research.google.com/github/carneiro-fernando/EBAC/blob/main/Exercicios/Modulo_22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: Machine Learning Fundamentals
**Exercise** Notebook<br>
Professor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Theory;</li>
  <li>Categorical attributes;</li>
  <li>Numerical attributes;</li>
  <li>Missing data.</li>
</ol>

---

# **Exercises**

## 1\. Penguins

In this exercise, we will use a dataset with information about penguins. The idea is to prepare the dataset to predict the penguin's species (response variable) from its physical and geographic characteristics (predictor variables).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import re

In [2]:
data = sns.load_dataset('penguins')

In [3]:
data.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### **1.1. Null values**

The dataset has missing values; use the concepts from the lesson to handle them.

#### Categorical null values

In [4]:
# answer to question 1.1

# Checking the categorical attributes for null values
data[['species', 'island', 'sex']].isnull().any()

species    False
island     False
sex         True
dtype: bool

* Only the 'sex' column has null values. We will handle them by dropping the affected rows.

In [5]:
# Dropping the rows whose categorical attribute 'sex' is null
data_clean = data.dropna(subset=['sex'])
data_clean[['species', 'island', 'sex']].isnull().any()

species    False
island     False
sex        False
dtype: bool
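Dropping rows is the simplest treatment, but it also discards the other measurements stored in those rows. As a hedged illustration of one alternative, the sketch below fills missing categories with the most frequent value (the mode). The toy DataFrame and the name `toy` are assumptions made for this example and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a categorical column with gaps
toy = pd.DataFrame({'sex': ['Male', 'Female', np.nan, 'Female', np.nan]})

# Option 1: drop the rows where 'sex' is missing (the approach used above)
dropped = toy.dropna(subset=['sex'])

# Option 2: fill the gaps with the most frequent category (mode imputation)
most_frequent = toy['sex'].mode()[0]
imputed = toy.fillna({'sex': most_frequent})

print(len(dropped))             # number of rows kept after dropping
print(imputed['sex'].tolist())  # column contents after imputation
```

Which option is preferable depends on how many rows are affected and on whether the imputed value would bias the model; the exercise itself only requires that the nulls be treated.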

#### Numerical null values

In [6]:
# Checking for null values in the numerical attributes
data_clean[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].isnull().any()

bill_length_mm       False
bill_depth_mm        False
flipper_length_mm    False
body_mass_g          False
dtype: bool
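The numerical columns come back clean because the rows that lacked measurements (row 3 in the `head()` output is one such case) were also missing `sex` and were therefore already dropped. A minimal sketch of how to count missing values per column, rather than only flagging their presence, is shown below; the toy DataFrame is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 lacks everything, row 2 lacks only 'sex'
toy = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 40.3],
    'body_mass_g': [3750.0, np.nan, 3250.0],
    'sex': ['Male', np.nan, np.nan],
})

# isnull().sum() counts the missing entries in each column
null_counts = toy.isnull().sum()
print(null_counts)

# After dropping the rows without 'sex', count the nulls that remain overall
after_drop = toy.dropna(subset=['sex'])
remaining_nulls = int(after_drop.isnull().sum().sum())
print(remaining_nulls)
```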

### **1.2. Numerical variables**

Identify the numerical variables and create a new column **standardizing** their values. The new column must have the same name as the original column plus the suffix "*_std*".

> **Note**: You must not transform the response variable.

In [7]:
# answer to question 1.2

# Creating a DataFrame with the numerical data
std_df = data_clean[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]

# Applying standardization: z = (x - mean) / standard deviation
# np.std uses the population standard deviation (ddof=0)
std_df = std_df.apply(lambda x: (x - np.mean(x)) / np.std(x))

# Checking the standardization. The values should be (approximately): mean = 0 and std = 1.
# describe() reports the sample standard deviation (ddof=1), so std appears slightly above 1.
std_df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 3.840772e-16 | 6.401286e-16 | 2.133762e-16 | -1.70701e-16 |
| std | 1.001505 | 1.001505 | 1.001505 | 1.001505 |
| min | -2.177987 | -2.067291 | -2.069852 | -1.874435 |
| 25% | -0.8227879 | -0.7958519 | -0.7836512 | -0.8172292 |
| 50% | 0.09288742 | 0.06872642 | -0.283462 | -0.1953432 |
| 75% | 0.8437412 | 0.7807321 | 0.8598276 | 0.7063915 |
| max | 2.858227 | 2.204743 | 2.146028 | 2.603144 |
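The `std` row shows 1.001505 rather than exactly 1 because two different estimators are involved: `np.std` divides by n (`ddof=0`), while `describe()` reports the sample standard deviation, which divides by n - 1 (`ddof=1`). With n = 333, the ratio is sqrt(333/332) ≈ 1.001505. The sketch below reproduces the effect on made-up values; the Series `x` is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values, not taken from the penguins data
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardize with the population standard deviation (what np.std does by default)
z_pop = (x - x.mean()) / x.std(ddof=0)

# Standardize with the sample standard deviation (the pandas default)
z_sample = (x - x.mean()) / x.std(ddof=1)

# Measured with the same ddof used to standardize, the spread is 1;
# measured with the other one, it is off by a factor of sqrt(n / (n - 1))
print(z_pop.std(ddof=0), z_pop.std(ddof=1))
print(z_sample.std(ddof=1))
```

Either convention is a valid standardization; what matters is applying the same one consistently to every column.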


In [8]:
# Renaming the columns by appending the '_std' suffix
std_df = std_df.add_suffix('_std')

In [9]:
# Merging with the original DataFrame (rows are matched by index)
data_clean = pd.merge(left=data_clean, right=std_df, how='inner', left_index=True, right_index=True)
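Because both DataFrames share the same index, an index-on-index inner merge like the one above can also be written with `DataFrame.join`, which aligns on the index by default. A small sketch with hypothetical data (the names `left` and `right` and their values are made up for this example):

```python
import pandas as pd

# Hypothetical frames sharing a non-contiguous index, as happens after dropna
left = pd.DataFrame({'body_mass_g': [3750.0, 3800.0, 3450.0]}, index=[0, 1, 4])
right = pd.DataFrame({'body_mass_g_std': [0.5, 0.7, -1.2]}, index=[0, 1, 4])

# Index-on-index inner merge, in the style used in the notebook
merged = pd.merge(left=left, right=right, how='inner', left_index=True, right_index=True)

# The same operation expressed with join
joined = left.join(right, how='inner')

print(merged.equals(joined))
```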

### **1.3. Categorical variables**

Identify the nominal and ordinal categorical variables and create a new column applying the correct conversion technique to their values. The new column must have the same name as the original column plus the suffix "*_nom*" or "*_ord*".

> **Note**: You must not transform the response variable.
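As a hedged illustration of the two techniques, the sketch below one-hot encodes a nominal column and integer-encodes an ordinal one. The `size` column and its ordering are hypothetical, introduced only for this example; the penguins data has no such column.

```python
import pandas as pd

toy = pd.DataFrame({
    'island': ['Biscoe', 'Dream', 'Torgersen', 'Dream'],
    'size': ['small', 'large', 'medium', 'small'],  # hypothetical ordinal column
})

# Nominal: categories have no order, so create one binary column per category
island_nom = pd.get_dummies(toy['island'], prefix='island', dtype=int).add_suffix('_nom')

# Ordinal: categories have a natural order, so map them to increasing integers
size_order = {'small': 0, 'medium': 1, 'large': 2}
toy['size_ord'] = toy['size'].map(size_order)

print(island_nom)
print(toy[['size', 'size_ord']])
```

Encoding a nominal variable as integers would impose an order that does not exist, which is why the two cases are treated differently.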

In [10]:
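# answer to question 1.3

# 'island' and 'sex' are nominal (no natural order), so both are one-hot encoded;
# the predictors contain no ordinal variable, hence no "_ord" column is created.
# dtype=int keeps the 0/1 output (recent pandas versions default to booleans)
nom_df = pd.get_dummies(data=data_clean[['island', 'sex']], dtype=int)

# Renaming the columns by appending the '_nom' suffix
nom_df = nom_df.add_suffix('_nom')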

In [11]:
# Merging with the original DataFrame (rows are matched by index)
data_clean = pd.merge(left=data_clean, right=nom_df, how='inner', left_index=True, right_index=True)

data_clean

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std | island_Biscoe_nom | island_Dream_nom | island_Torgersen_nom | sex_Female_nom | sex_Male_nom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | -0.896042 | 0.780732 | -1.426752 | -0.568475 | 0 | 0 | 1 | 0 | 1 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | -0.822788 | 0.119584 | -1.069474 | -0.506286 | 0 | 0 | 1 | 1 | 0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -0.676280 | 0.424729 | -0.426373 | -1.190361 | 0 | 0 | 1 | 1 | 0 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | -1.335566 | 1.085877 | -0.569284 | -0.941606 | 0 | 0 | 1 | 1 | 0 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | -0.859415 | 1.747026 | -0.783651 | -0.692852 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female | 0.587352 | -1.762145 | 0.931283 | 0.892957 | 1 | 0 | 0 | 1 | 0 |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | 0.514098 | -1.457000 | 1.002739 | 0.799674 | 1 | 0 | 0 | 1 | 0 |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.173384 | -0.744994 | 1.502928 | 1.919069 | 1 | 0 | 0 | 0 | 1 |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.221082 | -1.202712 | 0.788372 | 1.234995 | 1 | 0 | 0 | 1 | 0 |
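Note that `sex_Female_nom` and `sex_Male_nom` always sum to 1, so one of them carries no extra information. Some models, such as unregularized linear models, can be sensitive to this redundancy. A possible variation, not required by the exercise, is to pass `drop_first=True` to `pd.get_dummies`. A minimal sketch on toy data (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data with a two-category nominal column
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female', 'Male']})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, dtype=int)

# Reduced encoding: the first category is dropped and becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```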



### **1.4. Cleanup**

Discard the original columns and keep only the response variable and the predictor variables with the suffixes "*_std*", "*_nom*" and "*_ord*".

In [12]:
# answer to question 1.4

columns_list = ['species']  # List starting with the name of the response variable column

# Checks whether each column name ends in _std, _nom or _ord and adds the matches to the list
for col in data_clean:
  if re.search(r"_(std|nom|ord)$", col):
    columns_list.append(col)

# Creates a DataFrame from the original one with the columns stored in the list
penguins_df = data_clean[columns_list]

In [13]:
penguins_df

| | species | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std | island_Biscoe_nom | island_Dream_nom | island_Torgersen_nom | sex_Female_nom | sex_Male_nom |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | -0.896042 | 0.780732 | -1.426752 | -0.568475 | 0 | 0 | 1 | 0 | 1 |
| 1 | Adelie | -0.822788 | 0.119584 | -1.069474 | -0.506286 | 0 | 0 | 1 | 1 | 0 |
| 2 | Adelie | -0.676280 | 0.424729 | -0.426373 | -1.190361 | 0 | 0 | 1 | 1 | 0 |
| 4 | Adelie | -1.335566 | 1.085877 | -0.569284 | -0.941606 | 0 | 0 | 1 | 1 | 0 |
| 5 | Adelie | -0.859415 | 1.747026 | -0.783651 | -0.692852 | 0 | 0 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | 0.587352 | -1.762145 | 0.931283 | 0.892957 | 1 | 0 | 0 | 1 | 0 |
| 340 | Gentoo | 0.514098 | -1.457000 | 1.002739 | 0.799674 | 1 | 0 | 0 | 1 | 0 |
| 341 | Gentoo | 1.173384 | -0.744994 | 1.502928 | 1.919069 | 1 | 0 | 0 | 0 | 1 |
| 342 | Gentoo | 0.221082 | -1.202712 | 0.788372 | 1.234995 | 1 | 0 | 0 | 1 | 0 |


---