<a href="https://colab.research.google.com/github/VitorFRodrigues/Data-Science-Bootcamp/blob/main/Proj05_final/Notebook/Saiba_mais/Metodologias_de_Categorizacao.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 5 Project of Data Science Bootcamp 3 - Categorization Methodology
---
Author: Vitor Rodrigues

e-mail: vitorfbaiano@gmail.com

---

To train *Machine Learning* models, the input data must be numeric or encoded as categories. We will use the **AGE_PERCENTIL** column as the example of categorization, taken from the project available at this [link](https://raw.githubusercontent.com/VitorFRodrigues/Data-Science-Bootcamp/main/Proj05_final/Notebook/Bootcamp_Proj_05.ipynb).



In [14]:
import pandas as pd

AGE_PERC_URL = 'https://github.com/VitorFRodrigues/Data-Science-Bootcamp/blob/main/Proj05_final/dados/Kaggle_Sirio_Libanes_AGE_PERCENTIL.xlsx?raw=true'

In [15]:
dados = pd.read_excel(AGE_PERC_URL)
dados

Unnamed: 0,PATIENT_VISIT_IDENTIFIER,AGE_PERCENTIL
0,0,60th
1,0,60th
2,0,60th
3,0,60th
4,0,60th
...,...,...
1920,384,50th
1921,384,50th
1922,384,50th
1923,384,50th


In [16]:
dados_categorizar = dados['AGE_PERCENTIL']

The following categorization methods are covered below:
1. Find and Replace;
2. Label Encoding;
3. One Hot Encoding;
4. OrdinalEncoder (from Scikit-Learn);
5. OneHotEncoder (from Scikit-Learn).

## 1. Find and Replace Method

First we find the values present in the dataset. To do so, we apply the function
```
unique()
```
which returns each distinct value only once.

In [17]:
dados_categorizar.unique()

array(['60th', '90th', '10th', '40th', '70th', '20th', '50th', '80th',
       '30th', 'Above 90th'], dtype=object)

In [18]:
dados_categorizar.value_counts()

20th          215
10th          205
30th          205
40th          200
70th          195
50th          190
80th          190
60th          185
Above 90th    185
90th          155
Name: AGE_PERCENTIL, dtype: int64

Now we create a dictionary that maps each label to a number. Then we replace the *strings* with their corresponding numeric value:

In [19]:
substituir = {"60th": 60, 
              "10th": 10, 
              "40th": 40, 
              "70th": 70, 
              "20th": 20, 
              "50th": 50, 
              "80th": 80, 
              "30th": 30,
              "90th": 90, 
              "Above 90th": 100}
dados_categorizado_find_replace = dados_categorizar.replace(substituir)
dados_categorizado_find_replace

0       60
1       60
2       60
3       60
4       60
        ..
1920    50
1921    50
1922    50
1923    50
1924    50
Name: AGE_PERCENTIL, Length: 1925, dtype: int64
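
One weakness of `replace()` is that a label missing from the dictionary is silently left as a string. The sketch below shows a possible safeguard using `Series.map()`, which returns `NaN` for unmapped labels so the gap can be detected. The `sample` Series and the reduced dictionary are hypothetical stand-ins for the real column, not part of the original project.

```python
import pandas as pd

# Hypothetical sample that mirrors the AGE_PERCENTIL labels
sample = pd.Series(["60th", "10th", "Above 90th", "90th"], name="AGE_PERCENTIL")

# Same dictionary idea as above, restricted to the labels in the sample
substituir = {"10th": 10, "60th": 60, "90th": 90, "Above 90th": 100}

# map() yields NaN for any label that is not a key of the dictionary
mapped = sample.map(substituir)

# Stop early if some label was not covered by the mapping
if mapped.isna().any():
    raise ValueError("Some labels are missing from the mapping")

mapped = mapped.astype(int)
print(mapped.tolist())  # [60, 10, 100, 90]
```
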

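## 2. Label Encoding Method

Label Encoding is a quick categorization method that converts each distinct value into an integer code. With pandas, the codes follow the sorted order of the labels, which is why `60th` becomes 5 in the output below.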

In [20]:
dados_categorizado_label_enc = dados_categorizar.astype('category').cat.codes
dados_categorizado_label_enc

0       5
1       5
2       5
3       5
4       5
       ..
1920    4
1921    4
1922    4
1923    4
1924    4
Length: 1925, dtype: int8
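
The codes alone do not say which label each number stands for. As a small illustration, the sketch below rebuilds that lookup from `cat.categories`. The `sample` Series is a hypothetical subset of the labels, used only to keep the example short.

```python
import pandas as pd

# Hypothetical sample that mirrors the AGE_PERCENTIL labels
sample = pd.Series(["60th", "10th", "Above 90th", "90th"], name="AGE_PERCENTIL")

as_category = sample.astype("category")
codes = as_category.cat.codes

# The position of each label in cat.categories is its integer code
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())   # [1, 0, 3, 2]
print(code_to_label)    # {0: '10th', 1: '60th', 2: '90th', 3: 'Above 90th'}
```

Because the categories are sorted as strings, the numeric order of the codes only matches the percentile order when the labels happen to sort that way.
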

## 3. One Hot Encoding

One Hot Encoding, in turn, expands each unique value into its own column, marking in the *Dataset* whether that value is true (1) or false (0) for each row.

In [21]:
pd.get_dummies(dados_categorizar).head()

Unnamed: 0,10th,20th,30th,40th,50th,60th,70th,80th,90th,Above 90th
0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0
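
The 0/1 output above depends on the pandas version: from pandas 2.0 onward, `get_dummies` returns boolean columns by default. A minimal sketch, assuming a hypothetical three-row `sample`, shows how passing `dtype=int` keeps the 0/1 representation and how the original label can be recovered from the encoded rows.

```python
import pandas as pd

# Hypothetical sample that mirrors the AGE_PERCENTIL labels
sample = pd.Series(["60th", "10th", "Above 90th"], name="AGE_PERCENTIL")

# dtype=int forces 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(sample, dtype=int)
print(dummies)

# Each row contains exactly one 1, so idxmax recovers the original label
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())  # ['60th', '10th', 'Above 90th']
```
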


## 4. Scikit-Learn Method - OrdinalEncoder

Like Label Encoding, the OrdinalEncoder converts each value into a numeric code, this time through the Scikit-Learn library. It returns the codes as floats.

In [22]:
from sklearn.preprocessing import OrdinalEncoder

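# Turn the Series into a single-column DataFrame (the encoder expects 2D input)
dados_categorizar_code = dados_categorizar.to_frame()

# Fit the encoder and store the resulting codes in a new column
ord_enc = OrdinalEncoder()
dados_categorizar_code["AGE_PERCENTIL_code"] = ord_enc.fit_transform(dados_categorizar_code[["AGE_PERCENTIL"]])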
dados_categorizar_code

Unnamed: 0,AGE_PERCENTIL,AGE_PERCENTIL_code
0,60th,5.0
1,60th,5.0
2,60th,5.0
3,60th,5.0
4,60th,5.0
...,...,...
1920,50th,4.0
1921,50th,4.0
1922,50th,4.0
1923,50th,4.0
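
By default the encoder orders the categories by sorting the labels. A possible refinement, sketched below, is to pass the intended order explicitly through the `categories` parameter, so the codes reflect the percentile ranking by construction. The `sample` DataFrame and the `order` list are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample that mirrors the AGE_PERCENTIL labels
sample = pd.DataFrame({"AGE_PERCENTIL": ["60th", "10th", "Above 90th", "90th"]})

# Explicit ranking of the labels, from lowest to highest percentile
order = ["10th", "20th", "30th", "40th", "50th",
         "60th", "70th", "80th", "90th", "Above 90th"]

# categories takes one list per input column
enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(sample[["AGE_PERCENTIL"]])
print(codes.ravel().tolist())  # [5.0, 0.0, 9.0, 8.0]

# inverse_transform maps the codes back to the original labels
labels = enc.inverse_transform(codes)
print(labels.ravel().tolist())  # ['60th', '10th', 'Above 90th', '90th']
```
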


## 5. Scikit-Learn Method - OneHotEncoder

Like One Hot Encoding, the OneHotEncoder expands each unique value into its own column. It is provided by the Scikit-Learn library.

In [23]:
from sklearn.preprocessing import OneHotEncoder

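# Turn the Series into a single-column DataFrame (the encoder expects 2D input)
dados_categorizar_code = dados_categorizar.to_frame()

oe_style = OneHotEncoder()
oe_results = oe_style.fit_transform(dados_categorizar_code)
# categories_ is a list with one array per input column, so take the first entry
pd.DataFrame(oe_results.toarray(), columns=oe_style.categories_[0]).head()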

Unnamed: 0,10th,20th,30th,40th,50th,60th,70th,80th,90th,Above 90th
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## References
[Categorical Encoding](https://pbpython.com/categorical-encoding.html)