# **CHALLENGE 01 - SUPERVISED LEARNING: CLASSIFICATION OF THE BREAST CANCER DATASET FROM THE KAGGLE REPOSITORY**

This project aims to develop a Machine Learning model that predicts whether a breast tumor is benign or malignant. The analysis is based on features extracted from digitized images of fine needle aspirate (FNA) biopsies of breast masses. Using the Breast Cancer Wisconsin (Diagnostic) dataset, we apply several classification techniques to features computed for the cell nuclei in each image (radius, texture, perimeter, area, smoothness, compactness, concavity, and concave points, among others), with the goal of building a robust predictive model.

To determine the most effective technique, we systematically compare the following methods, each with its default settings, and identify the one that performs best:

- Naive Bayes
- Support Vector Machines (SVM)
- Logistic Regression
- Instance-Based Learning (KNN)
- Decision Tree
- Random Forest
- XGBoost
- LightGBM
- CatBoost

Each method is evaluated on precision, recall, F1-score, and other relevant metrics, which lets us select the most suitable algorithm for our predictive model.
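As a reminder of what these metrics measure, the minimal sketch below computes precision, recall, and F1-score from two lists of labels. The function name `binary_metrics` and the toy labels are illustrative assumptions and not part of the project; class 1 stands for malignant, matching the encoding used later in this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75 f1=0.75
```

In a diagnostic setting, recall for the malignant class is usually watched closely, since a false negative corresponds to a missed malignant tumor.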

The data were obtained from Kaggle:

https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

# **DATA EXPLORATION, ANALYSIS, AND CLEANING: BREAST CANCER PREDICTION PROJECT**

## **Data Exploration**

In [1]:
import numpy as np
import pandas as pd

In [2]:
dados = pd.read_csv('../database/data_cancer2.csv',
                    sep=',', encoding='utf-8')

In [3]:
dados.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
dados.tail()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
564,926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
565,926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
567,927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,
568,92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,


In [5]:
dados.shape

(569, 33)

## **Variable (Attribute) Analysis**

## **Analysis of Attribute Types**

In [6]:
dados.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     

After an initial look at the DataFrame, we identified two columns that can be dropped: the 'id' column, which is only a sample identifier and adds nothing to the analysis, and the 'Unnamed: 32' column, which is empty in the rows displayed above and appears to be an artifact of the data import. Both are removed to streamline the dataset.

In [7]:
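# Work on a copy so the original DataFrame is preserved
dados_relevante = dados.copy()

In [8]:
# Remove the first column ('id')
dados_relevante.drop(dados.columns[0], axis=1, inplace=True)

In [9]:
# Remove the last column ('Unnamed: 32')
dados_relevante.drop(dados.columns[-1], axis=1, inplace=True)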
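
The two positional drops above depend on 'id' being the first column and 'Unnamed: 32' being the last. As an alternative, the sketch below performs the same cleanup by column name. It uses a small stand-in DataFrame (`exemplo`, with illustrative values) rather than the real file, so the names `exemplo` and `exemplo_relevante` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Small stand-in for the raw data (values are illustrative only)
exemplo = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 12.34, 7.76],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Dropping by name does not depend on column order. drop() returns a new
# DataFrame, so the original is left unchanged and no explicit copy is needed.
exemplo_relevante = exemplo.drop(columns=["id", "Unnamed: 32"])

print(exemplo_relevante.columns.tolist())
```

Naming the columns makes the intent visible in the code and fails loudly (with a `KeyError`) if a column is missing, instead of silently removing whatever happens to be first or last.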

## **Missing Values (NaN)**

In [10]:
dados_relevante.isnull().sum()

diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

## **Descriptive Statistics**

In [13]:
dados_relevante.describe()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [14]:
dados_relevante.mode()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,B,12.34,14.93,82.61,512.2,0.1007,0.1147,0.0,0.0,0.1601,...,12.36,17.7,101.7,284.4,0.1216,0.1486,0.0,0.0,0.2226,0.07427
1,,,15.7,87.76,,,0.1206,,,0.1714,...,,27.26,105.9,402.8,0.1223,0.3416,,,0.2369,
2,,,16.84,134.7,,,,,,0.1717,...,,,117.7,439.6,0.1234,,,,0.2383,
3,,,16.85,,,,,,,0.1769,...,,,,458.0,0.1256,,,,0.2972,
4,,,17.46,,,,,,,0.1893,...,,,,472.4,0.1275,,,,0.3109,
5,,,18.22,,,,,,,,...,,,,489.5,0.1312,,,,0.3196,
6,,,18.9,,,,,,,,...,,,,546.7,0.1347,,,,,
7,,,19.83,,,,,,,,...,,,,547.4,0.1401,,,,,
8,,,20.52,,,,,,,,...,,,,624.1,0.1415,,,,,
9,,,,,,,,,,,...,,,,698.8,,,,,,


## **Saving (Exporting) the Cleaned DataFrame**

In [15]:
dados_relevante.to_csv('../database/data_cancer2_tratado.csv', sep=',', encoding='utf-8', index = False)

# **PRE-PROCESSING**

In [21]:
import numpy as np
import pandas as pd

In [22]:
df_original = pd.read_csv('../database/data_cancer2_tratado.csv',
                    sep=',', encoding='utf-8')

In [23]:
df_original.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [24]:
df_original.shape

(569, 31)

In [25]:
df_original.dtypes

diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst

## **Transforming Nominal Categorical Variables into Ordinal Categorical Variables**

In [26]:
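df_ordinal = df_original.copy()

In [27]:
# Map the diagnosis labels to integers: malignant (M) -> 1, benign (B) -> 0.
# Assigning the result back avoids the in-place replace on a column selection,
# which recent pandas versions warn about.
df_ordinal['diagnosis'] = df_ordinal['diagnosis'].map({'M': 1, 'B': 0})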

In [28]:
df_ordinal.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [29]:
df_ordinal.dtypes

diagnosis                    int64
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst

In [30]:
df_ordinal.shape

(569, 31)

## **LEGEND**

Attributes:

- **id**: An integer that serves as a unique identifier for each sample.
- **diagnosis**: The diagnosis, stored as a string in the raw data ('M' for malignant, 'B' for benign) and encoded as 1 (malignant) and 0 (benign) after the transformation above.
- **radius_mean**: A real number representing the mean radius of the cell nuclei.
- **texture_mean**: A real number representing the mean of the standard deviation of the gray-scale values.
- **perimeter_mean**: A real number representing the mean perimeter of the cell nuclei.
- **area_mean**: A real number representing the mean area of the cell nuclei.
- **smoothness_mean**: A real number representing the mean local variation in the radius lengths of the cell nuclei.
- **compactness_mean**: A real number representing the mean compactness of the cell nuclei, computed as $$\text{perimeter}^2/\text{area} - 1.0$$.
- **concavity_mean**: A real number representing the mean severity of the concave portions of the nuclear contour.
- **concave points_mean**: A real number representing the mean number of concave portions of the nuclear contour.
- **symmetry_mean**: A real number representing the mean symmetry of the cell nuclei.
- **fractal_dimension_mean**: A real number representing the mean "coastline approximation - 1" of the cell nuclei.
- **radius_se**: A real number representing the standard error of the radius of the cell nuclei.
- **texture_se**: A real number representing the standard error of the standard deviation of the gray-scale values.
- **perimeter_se**: A real number representing the standard error of the perimeter of the cell nuclei.
- **area_se**: A real number representing the standard error of the area of the cell nuclei.
- **smoothness_se**: A real number representing the standard error of the local variation in the radius lengths of the cell nuclei.
- **compactness_se**: A real number representing the standard error of the compactness of the cell nuclei.
- **concavity_se**: A real number representing the standard error of the severity of the concave portions of the nuclear contour.
- **concave points_se**: A real number representing the standard error of the number of concave portions of the nuclear contour.
- **symmetry_se**: A real number representing the standard error of the symmetry of the cell nuclei.
- **fractal_dimension_se**: A real number representing the standard error of the "coastline approximation - 1" of the cell nuclei.
- **radius_worst**: A real number representing the "worst" radius of the cell nuclei, that is, the mean of the three largest values.
- **texture_worst**: A real number representing the "worst" (mean of the three largest values) standard deviation of the gray-scale values.
- **perimeter_worst**: A real number representing the "worst" (mean of the three largest values) perimeter of the cell nuclei.
- **area_worst**: A real number representing the "worst" (mean of the three largest values) area of the cell nuclei.
- **smoothness_worst**: A real number representing the "worst" (mean of the three largest values) local variation in the radius lengths of the cell nuclei.
- **compactness_worst**: A real number representing the "worst" (mean of the three largest values) compactness of the cell nuclei.
- **concavity_worst**: A real number representing the "worst" (mean of the three largest values) severity of the concave portions of the nuclear contour.
- **concave points_worst**: A real number representing the "worst" (mean of the three largest values) number of concave portions of the nuclear contour.
- **symmetry_worst**: A real number representing the "worst" (mean of the three largest values) symmetry of the cell nuclei.
- **fractal_dimension_worst**: A real number representing the "worst" (mean of the three largest values) "coastline approximation - 1" of the cell nuclei.

## **PREDICTOR ATTRIBUTES AND TARGET**

In [31]:
df_ordinal.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [32]:
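# Every column after 'diagnosis' is a predictor (the DataFrame has 31 columns,
# so an open-ended slice is clearer than the out-of-range upper bound 32)
previsores = df_ordinal.iloc[:, 1:].values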


In [33]:
previsores

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [34]:
previsores.shape

(569, 30)

In [35]:
alvo = df_ordinal.iloc[:, 0].values

In [36]:
alvo

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

In [37]:
alvo.shape

(569,)
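
Selecting by position works here because `diagnosis` is the first column. As an alternative sketch, the same split into predictors and target can be written by column name. The DataFrame `exemplo` and the names `X` and `y` are hypothetical and used only for illustration.

```python
import pandas as pd

# Small stand-in for df_ordinal (values are illustrative only)
exemplo = pd.DataFrame({
    "diagnosis": [1, 0, 0],
    "radius_mean": [17.99, 12.34, 7.76],
    "texture_mean": [10.38, 14.93, 24.54],
})

# Predictors: every column except the target
X = exemplo.drop(columns="diagnosis").to_numpy()

# Target: the encoded diagnosis column
y = exemplo["diagnosis"].to_numpy()

print(X.shape, y.shape)  # (3, 2) (3,)
```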

## **Analysis of Attribute Scales (Scaling)**

In [38]:
df_ordinal.describe()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.372583,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,0.483918,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,0.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,0.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,0.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,1.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,1.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


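Standardization (uses the mean and standard deviation as reference): each attribute is rescaled to zero mean and unit standard deviation. This is the approach applied in the cells that follow.

Normalization (uses the minimum and maximum values as reference): each attribute is rescaled to the [0, 1] range.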
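
To make the two definitions concrete, the sketch below applies both formulas by hand with NumPy to a toy column whose values are the first five `area_mean` entries shown in the `head()` output. It is only an illustration; in scikit-learn the corresponding transformers are `StandardScaler` and `MinMaxScaler`.

```python
import numpy as np

# Toy column: the first five area_mean values from the head() output
x = np.array([1001.0, 1326.0, 1203.0, 386.1, 1297.0])

# Standardization: z = (x - mean) / std, using the population std (ddof=0)
z = (x - x.mean()) / x.std()

# Normalization (min-max): x' = (x - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())

print(z.round(3))       # centered on 0, unit standard deviation
print(x_norm.round(3))  # smallest value maps to 0, largest to 1
```

Standardization keeps outliers as large positive or negative values, whereas min-max normalization compresses all other values when a single extreme value stretches the range.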

In [39]:
from sklearn.preprocessing import StandardScaler

In [40]:
previsores_esc = StandardScaler().fit_transform(previsores)

In [41]:
previsores_esc

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])

In [42]:
previsores_esc_df = pd.DataFrame(previsores_esc)
previsores_esc_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635


In [43]:
previsores_esc_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,-1.373633e-16,6.868164e-17,-1.248757e-16,-2.185325e-16,-8.366672e-16,1.873136e-16,4.995028e-17,-4.995028e-17,1.74826e-16,4.745277e-16,...,-8.241796e-16,1.248757e-17,-3.746271e-16,0.0,-2.372638e-16,-3.371644e-16,7.492542e-17,2.247763e-16,2.62239e-16,-5.744282e-16
std,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,...,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088
min,-2.029648,-2.229249,-1.984504,-1.454443,-3.112085,-1.610136,-1.114873,-1.26182,-2.744117,-1.819865,...,-1.726901,-2.223994,-1.693361,-1.222423,-2.682695,-1.443878,-1.305831,-1.745063,-2.16096,-1.601839
25%,-0.6893853,-0.7259631,-0.6919555,-0.6671955,-0.7109628,-0.747086,-0.7437479,-0.7379438,-0.7032397,-0.7226392,...,-0.6749213,-0.7486293,-0.6895783,-0.642136,-0.6912304,-0.6810833,-0.7565142,-0.7563999,-0.6418637,-0.6919118
50%,-0.2150816,-0.1046362,-0.23598,-0.2951869,-0.03489108,-0.2219405,-0.3422399,-0.3977212,-0.0716265,-0.1782793,...,-0.2690395,-0.04351564,-0.2859802,-0.341181,-0.04684277,-0.2695009,-0.2182321,-0.2234689,-0.1274095,-0.2164441
75%,0.4693926,0.5841756,0.4996769,0.3635073,0.636199,0.4938569,0.5260619,0.6469351,0.5307792,0.4709834,...,0.5220158,0.6583411,0.540279,0.357589,0.5975448,0.5396688,0.5311411,0.71251,0.4501382,0.4507624
max,3.971288,4.651889,3.97613,5.250529,4.770911,4.568425,4.243589,3.92793,4.484751,4.910919,...,4.094189,3.885905,4.287337,5.930172,3.955374,5.112877,4.700669,2.685877,6.046041,6.846856
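
The `std` row shows 1.00088 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`); for n = 569 the ratio between the two is sqrt(569/568), which is approximately 1.00088. The sketch below checks this with a synthetic column (random values, not the real data).

```python
import numpy as np

n = 569  # number of samples in the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=14.0, scale=3.5, size=n)  # synthetic column, not real data

# Standardize with the population standard deviation, as StandardScaler does
z = (x - x.mean()) / x.std(ddof=0)

# Sample standard deviation of the standardized column, as describe() reports
sample_std = float(z.std(ddof=1))
expected = float(np.sqrt(n / (n - 1)))

print(round(sample_std, 5), round(expected, 5))
```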


## **Encoding Categorical Variables**

### **LabelEncoder: Converting Categorical Variables into Numeric Ones**


In [44]:
from sklearn.preprocessing import LabelEncoder

In [45]:
df_original.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [46]:
# Every column after 'diagnosis' is a predictor
previsores_labelencoder = df_original.iloc[:, 1:].values
previsores_labelencoder

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [47]:
previsores_labelencoder.shape

(569, 30)

In [48]:
alvo_labelencoder = df_original.iloc[:, 0].values
alvo_labelencoder

array(['M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'B', 'M',
       'M', 'M', 'M', 'M', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B',
       'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M',
       'M', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'B', 'M',
       'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'M', 'B', 'B', 'B',
       'M', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B',
       'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M',
       'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B', 'M', 'B',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'M', 'M',
       'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M

In [49]:
alvo_labelencoder_ordinal = LabelEncoder().fit_transform(alvo_labelencoder)
alvo_labelencoder_ordinal

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

In [50]:
alvo_labelencoder_ordinal.shape

(569,)
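
`LabelEncoder` assigns integer codes following the sorted order of the class labels, so 'B' becomes 0 and 'M' becomes 1, which matches the manual mapping used for `df_ordinal`. The sketch below keeps the fitted encoder in a variable so that the learned classes can be inspected and the codes decoded again; the names `rotulos`, `encoder`, and `codigos` and the four sample labels are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

rotulos = np.array(["M", "B", "B", "M"])  # hypothetical labels

# Keep the fitted encoder so it can be reused to decode predictions later
encoder = LabelEncoder()
codigos = encoder.fit_transform(rotulos)

print(encoder.classes_)                    # learned classes, in sorted order
print(codigos)                             # integer code of each label
print(encoder.inverse_transform(codigos))  # back to the original labels
```

Holding on to the encoder is mainly useful when predictions need to be reported as 'M' and 'B' rather than as 1 and 0.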

## **Scaling**

In [51]:
from sklearn.preprocessing import StandardScaler

In [52]:
previsores_labelencoder_esc = StandardScaler().fit_transform(previsores_labelencoder)

In [53]:
previsores_labelencoder_esc

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])

In [54]:
previsores_labelencoder_esc_df = pd.DataFrame(previsores_labelencoder_esc)
previsores_labelencoder_esc_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.886690,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.243890,0.281190
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.511870,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955000,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.935010
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,-0.009560,-0.562450,...,1.298575,-1.466770,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.397100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,-0.312589,-0.931027,...,1.901185,0.117700,1.752563,2.015301,0.378365,-0.273318,0.664512,1.629151,-1.360158,-0.709091
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,-0.217664,-1.058611,...,1.536720,2.047399,1.421940,1.494959,-0.691230,-0.394820,0.236573,0.733827,-0.531855,-0.973978
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,-0.809117,-0.895587,...,0.561361,1.374854,0.579001,0.427906,-0.809587,0.350735,0.326767,0.414069,-1.104549,-0.318409
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,2.137194,1.043695,...,1.961239,2.237926,2.303601,1.653171,1.430427,3.904848,3.197605,2.289985,1.919083,2.219635


In [55]:
previsores_labelencoder_esc_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,-1.373633e-16,6.868164e-17,-1.248757e-16,-2.185325e-16,-8.366672e-16,1.873136e-16,4.995028e-17,-4.995028e-17,1.74826e-16,4.745277e-16,...,-8.241796e-16,1.248757e-17,-3.746271e-16,0.0,-2.372638e-16,-3.371644e-16,7.492542e-17,2.247763e-16,2.62239e-16,-5.744282e-16
std,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,...,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088
min,-2.029648,-2.229249,-1.984504,-1.454443,-3.112085,-1.610136,-1.114873,-1.26182,-2.744117,-1.819865,...,-1.726901,-2.223994,-1.693361,-1.222423,-2.682695,-1.443878,-1.305831,-1.745063,-2.16096,-1.601839
25%,-0.6893853,-0.7259631,-0.6919555,-0.6671955,-0.7109628,-0.747086,-0.7437479,-0.7379438,-0.7032397,-0.7226392,...,-0.6749213,-0.7486293,-0.6895783,-0.642136,-0.6912304,-0.6810833,-0.7565142,-0.7563999,-0.6418637,-0.6919118
50%,-0.2150816,-0.1046362,-0.23598,-0.2951869,-0.03489108,-0.2219405,-0.3422399,-0.3977212,-0.0716265,-0.1782793,...,-0.2690395,-0.04351564,-0.2859802,-0.341181,-0.04684277,-0.2695009,-0.2182321,-0.2234689,-0.1274095,-0.2164441
75%,0.4693926,0.5841756,0.4996769,0.3635073,0.636199,0.4938569,0.5260619,0.6469351,0.5307792,0.4709834,...,0.5220158,0.6583411,0.540279,0.357589,0.5975448,0.5396688,0.5311411,0.71251,0.4501382,0.4507624
max,3.971288,4.651889,3.97613,5.250529,4.770911,4.568425,4.243589,3.92793,4.484751,4.910919,...,4.094189,3.885905,4.287337,5.930172,3.955374,5.112877,4.700669,2.685877,6.046041,6.846856
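The `std` row above reads 1.00088 rather than exactly 1. A likely explanation, assuming default settings: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)) with n = 569. The sketch below reproduces that arithmetic with plain NumPy on synthetic data (an assumption; it does not use the notebook's `previsores_labelencoder`).

```python
import numpy as np

# Synthetic stand-in for the predictors: 569 rows, 3 made-up columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(569, 3))

# Standardize the way StandardScaler does: subtract the column mean and
# divide by the population standard deviation (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

pop_std = Z.std(axis=0, ddof=0)       # the quantity the scaler normalizes to 1
sample_std = Z.std(axis=0, ddof=1)    # the quantity pandas describe() reports
expected_ratio = np.sqrt(569 / 568)   # sqrt(n / (n - 1))

print(pop_std)
print(sample_std)
print(round(float(expected_ratio), 5))
```

The small gap is a reporting convention and does not indicate a scaling problem.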


## **PREPROCESSING SUMMARY**

- **Dataset Variables:**

    - **Target (alvo):** Dependent variable indicating the tumor diagnosis, with 'M' for malignant and 'B' for benign.
    - **Target with LabelEncoder (alvo_labelencoder_ordinal):** Uses LabelEncoder to convert the categorical labels to numeric values without changing the scale.
    - **Target with OneHotEncoder:** Applies LabelEncoder and OneHotEncoder to create binary representations of the categories without introducing a hierarchy.
    - **Predictors (previsores):** Predictor variables converted from categorical to numeric manually, without normalization or scaling.
    - **Predictors with LabelEncoder (previsores_labelencoder):** Converts categorical variables to numeric using LabelEncoder without changing the scale.
    - **Scaled Predictors (previsores_esc):** Predictor variables converted to numeric and then standardized with StandardScaler (zero mean, unit variance).
    - **Scaled Predictors with LabelEncoder (previsores_labelencoder_esc):** Uses LabelEncoder to convert categorical variables to numeric and then standardizes the data.

- **Cell Nucleus Features:**
    - **Radius:** Mean of the distances from the center to points on the perimeter.
    - **Texture:** Standard deviation of the gray-scale values.
    - **Perimeter, Area, Smoothness:** Measure the size and shape of the cell nucleus.
    - **Compactness:** Calculated as ...


## **TRAINING AND TEST SETS**

In [57]:
from sklearn.model_selection import train_test_split

Parameters of train_test_split:
- arrays: the predictor and target arrays to split.
- test_size: fraction of the data reserved for testing. Default is None; if train_size is also None, 0.25 is used.
- train_size: fraction of the data reserved for training. Default is None.
- random_state: seed that controls the shuffling, which makes the split reproducible.
- shuffle: whether to shuffle the data before splitting. With a fixed random_state the same shuffle occurs every time. Default is True.
- stratify: class labels used to split the data in a stratified way. Default is None, in which case the class proportions are not guaranteed to be preserved. Passing the target (for example, stratify=alvo) keeps them: if the dataframe has 30% zeros and 70% ones, the training and test sets keep that same ratio.
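To make the effect of `stratify` concrete, here is a minimal sketch that compares a plain split with a stratified one. The labels are synthetic placeholders built from the class counts implied by the confusion matrices in this notebook (357 zeros and 212 ones); the feature column is a dummy and is an assumption made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the dataset's class balance: 357 zeros, 212 ones.
y = np.array([0] * 357 + [1] * 212)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

# Plain random split: class proportions may drift between train and test.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Stratified split: class proportions are preserved up to rounding.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print("overall positive rate   :", round(float(y.mean()), 3))
print("plain test positive rate:", round(float(y_te_plain.mean()), 3))
print("strat test positive rate:", round(float(y_te_strat.mean()), 3))
```

With a dataset this small, stratifying is a cheap way to keep the test set representative of both classes.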

In [913]:
x_treino, x_teste, y_treino, y_teste = train_test_split(previsores, alvo, test_size = 0.3, random_state = 0)

In [914]:
x_treino.shape

(398, 30)

In [915]:
x_teste.shape

(171, 30)

In [916]:
y_treino.shape

(398,)

In [911]:
y_teste.shape

(171,)

- **Targets:**
    - **1-** alvo
    - **2-** alvo_labelencoder_ordinal

- **Predictor sets tested:**
    - **1-** previsores
    - **2-** previsores_esc
    - **3-** previsores_labelencoder
    - **4-** previsores_labelencoder_esc

# **NAIVE BAYES**

https://scikit-learn.org/stable/modules/naive_bayes.html

Training the algorithm

In [242]:
from sklearn.naive_bayes import GaussianNB

In [243]:
naive = GaussianNB()
naive.fit(x_treino, y_treino)

Evaluating the algorithm

In [244]:
previsoes_naive = naive.predict(x_teste)
previsoes_naive

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [245]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [246]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [247]:
accuracy_score(y_teste, previsoes_naive)

0.9239766081871345

In [248]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_naive) * 100.0))

Acurácia: 92.40%


In [249]:
confusion_matrix(y_teste, previsoes_naive)

array([[101,   7],
       [  6,  57]])

In [71]:
print(classification_report(y_teste, previsoes_naive))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       108
           1       0.89      0.90      0.90        63

    accuracy                           0.92       171
   macro avg       0.92      0.92      0.92       171
weighted avg       0.92      0.92      0.92       171
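The figures in the classification report can be recomputed directly from the confusion matrix shown earlier. The sketch below does that with NumPy only, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[101, 7],
               [6, 57]])

true_pos = np.diag(cm)                  # correct predictions per class
precision = true_pos / cm.sum(axis=0)   # column sums: all samples predicted as the class
recall = true_pos / cm.sum(axis=1)      # row sums: all samples that truly belong to the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for label in (0, 1):
    print(label,
          round(float(precision[label]), 2),
          round(float(recall[label]), 2),
          round(float(f1[label]), 2))
print("accuracy", round(float(accuracy), 4))
```

For class 1, recall is the fraction of those tumors that the model detects, so the 6 misses in the second row are the errors that weigh most on that number.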



**Analysis on the training data**

In [72]:
previsoes_treino = naive.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [73]:
accuracy_score(y_treino, previsoes_treino)

0.9422110552763819

In [74]:
confusion_matrix(y_treino, previsoes_treino)

array([[242,   7],
       [ 16, 133]])

### **Cross-Validation**

In [75]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [76]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [77]:
# Create the model
modelo = GaussianNB()
resultado = cross_val_score(modelo, previsores, alvo, cv = kfold)
resultado

array([0.89473684, 0.89473684, 1.        , 0.94736842, 1.        ,
       0.94736842, 0.94736842, 1.        , 0.94736842, 1.        ,
       1.        , 0.94736842, 0.94736842, 0.94736842, 0.94736842,
       1.        , 0.84210526, 0.89473684, 1.        , 0.94736842,
       0.89473684, 0.78947368, 0.89473684, 1.        , 0.94736842,
       1.        , 0.94736842, 0.94736842, 0.89473684, 0.77777778])

In [78]:
# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 93.82%
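The cell above prints only the mean, although the spread across folds is just as informative. The sketch below recomputes the mean from the 30 fold scores printed earlier and adds the standard deviation. The array is copied from that output; the variable names `mean_acc` and `std_acc` are introduced here for illustration.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above (30 folds).
resultado = np.array([
    0.89473684, 0.89473684, 1.0,        0.94736842, 1.0,
    0.94736842, 0.94736842, 1.0,        0.94736842, 1.0,
    1.0,        0.94736842, 0.94736842, 0.94736842, 0.94736842,
    1.0,        0.84210526, 0.89473684, 1.0,        0.94736842,
    0.89473684, 0.78947368, 0.89473684, 1.0,        0.94736842,
    1.0,        0.94736842, 0.94736842, 0.89473684, 0.77777778,
])

mean_acc = resultado.mean()
std_acc = resultado.std()

print("Mean accuracy: %.2f%%" % (mean_acc * 100.0))
print("Standard deviation: %.2f%%" % (std_acc * 100.0))
```

With 30 folds each test fold holds only 18 or 19 samples, so a single misclassification moves a fold's accuracy by about five percentage points. This accounts for much of the spread.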


- Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo
- Naive Bayes = 91.23% (train/test split) and 93.47% (cross-validation) - previsores_esc, alvo
- Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo
- Naive Bayes = 91.23% (train/test split) and 93.47% (cross-validation) - previsores_labelencoder_esc, alvo

- Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo_labelencoder_ordinal
- Naive Bayes = 91.23% (train/test split) and 93.47% (cross-validation) - previsores_esc, alvo_labelencoder_ordinal
- Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo_labelencoder_ordinal
- Naive Bayes = 91.23% (train/test split) and 93.47% (cross-validation) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best Naive Bayes:**

- **Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo**
- **Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo**
- **Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo_labelencoder_ordinal**
- **Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo_labelencoder_ordinal**

# **SUPPORT VECTOR MACHINES (SVM)**

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [79]:
from sklearn.svm import SVC

In [80]:
svm = SVC(kernel='rbf', random_state=1, C = 2)
svm.fit(x_treino, y_treino)

In [81]:
previsoes_svm = svm.predict(x_teste)
previsoes_svm

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [82]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [83]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [84]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_svm) * 100.0))

Acurácia: 94.74%


In [85]:
confusion_matrix(y_teste, previsoes_svm)

array([[107,   1],
       [  8,  55]])

In [86]:
print(classification_report(y_teste, previsoes_svm))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96       108
           1       0.98      0.87      0.92        63

    accuracy                           0.95       171
   macro avg       0.96      0.93      0.94       171
weighted avg       0.95      0.95      0.95       171



**Analysis on the training data**

In [87]:
previsoes_treino = svm.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [88]:
accuracy_score(y_treino, previsoes_treino)

0.9095477386934674

In [89]:
confusion_matrix(y_treino, previsoes_treino)

array([[247,   2],
       [ 34, 115]])

### **Cross-Validation**

In [90]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [91]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [92]:
# Create the model
modelo = SVC(kernel='rbf', random_state=1, C = 2)
resultado = cross_val_score(modelo, previsores_labelencoder_esc, alvo_labelencoder_ordinal, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 97.88%
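The cross-validation above receives predictors that were standardized on the full dataset beforehand, so the scaler has already seen each fold's test rows. One common alternative, sketched here as an assumption and not as this notebook's method, wraps `StandardScaler` and `SVC` in a pipeline so that scaling is refit inside every training fold. The sketch loads scikit-learn's bundled copy of the dataset through `load_breast_cancer` (the notebook reads a Kaggle CSV instead, and the bundled copy encodes malignant as 0 and benign as 1) and uses 10 folds to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the Wisconsin diagnostic data: 569 rows, 30 features.
X, y = load_breast_cancer(return_X_y=True)

# The scaler is fit on the training part of each fold only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=2, random_state=1))

kfold = KFold(n_splits=10, shuffle=True, random_state=5)
scores = cross_val_score(model, X, y, cv=kfold)

print("Mean accuracy: %.2f%% (std %.2f%%)"
      % (scores.mean() * 100.0, scores.std() * 100.0))
```

On a dataset of this size the difference from scaling up front is usually small, but the pipeline form keeps the estimate free of information from the held-out fold.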


- SVM = 94.74% (train/test split) and 91.72% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores, alvo
- SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo
- SVM = 94.74% (train/test split) and 91.72% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder, alvo
- SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo

- SVM = 94.74% (train/test split) and 91.72% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores, alvo_labelencoder_ordinal
- SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo_labelencoder_ordinal
- SVM = 94.74% (train/test split) and 91.72% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder, alvo_labelencoder_ordinal
- SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best SVM:**

- **SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo**
- **SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo**
- **SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo_labelencoder_ordinal**
- **SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo_labelencoder_ordinal**

# **LOGISTIC REGRESSION**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [270]:
from sklearn.linear_model import LogisticRegression

In [271]:
logistica = LogisticRegression(random_state=1, max_iter=2000, penalty="l2",
                               tol=0.0001, C=1,solver="lbfgs")
logistica.fit(x_treino, y_treino)

In [272]:
logistica.intercept_

array([-0.04258607])

In [273]:
logistica.coef_

array([[ 0.25965337,  0.58891309,  0.27527119,  0.35070364,  0.13501664,
        -0.41458176,  0.67094946,  0.74096029,  0.37987661, -0.03289321,
         1.35087219, -0.14092099,  0.90852156,  0.98001244, -0.25957915,
        -0.92374666,  0.13448413,  0.34937963, -0.16812679, -0.91359164,
         0.8473647 ,  0.91043792,  0.73517119,  0.84764528,  0.56888501,
        -0.17670599,  0.82594672,  1.08228373,  0.48409262,  0.60758107]])
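The intercept and coefficients shown above define a linear score that the logistic function turns into a class probability. The sketch below illustrates that relationship on a synthetic binary problem (an assumption; it does not use the notebook's `x_treino`) and checks the manual computation against `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only to illustrate the formula.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=2000, C=1, solver="lbfgs").fit(X, y)

# Linear score z = X w + b, followed by the logistic (sigmoid) function.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
sklearn_proba = model.predict_proba(X)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

A positive coefficient pushes the score, and therefore the predicted probability of class 1, upward as that feature increases. Magnitudes are only comparable across features when the predictors share a scale.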

In [274]:
previsoes_logistica = logistica.predict(x_teste)
previsoes_logistica

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [275]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [276]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_logistica) * 100.0))

Acurácia: 97.66%


In [277]:
confusion_matrix(y_teste, previsoes_logistica)

array([[107,   1],
       [  3,  60]])

In [278]:
print(classification_report(y_teste, previsoes_logistica))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       108
           1       0.98      0.95      0.97        63

    accuracy                           0.98       171
   macro avg       0.98      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171



**Analysis on the training data**

In [279]:
previsoes_treino = logistica.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [280]:
accuracy_score(y_treino, previsoes_treino)

0.9899497487437185

In [281]:
confusion_matrix(y_treino, previsoes_treino)

array([[249,   0],
       [  4, 145]])

### **Cross-Validation**

In [282]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [283]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [284]:
# Create the model
modelo = LogisticRegression(random_state=1, max_iter=2000, penalty="l2",
                               tol=0.0001, C=1,solver="lbfgs")
resultado = cross_val_score(modelo, previsores_labelencoder_esc, alvo_labelencoder_ordinal, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 98.06%


- Logistic regression = 95.91% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores, alvo
- Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo
- Logistic regression = 95.91% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder, alvo
- Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo

- Logistic regression = 95.91% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores, alvo_labelencoder_ordinal
- Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo_labelencoder_ordinal
- Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder, alvo_labelencoder_ordinal
- Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=600, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best Logistic Regression:**

- **Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo**
- **Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo**
- **Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo_labelencoder_ordinal**
- **Logistic regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=600, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo_labelencoder_ordinal**

# **INSTANCE-BASED LEARNING (KNN)**

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [408]:
from sklearn.neighbors import KNeighborsClassifier

In [409]:
knn = KNeighborsClassifier(n_neighbors=7, metric='minkowski', p=1)
knn.fit(x_treino, y_treino)

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html
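With `metric='minkowski'` and `p=1`, the classifier measures neighbors by the Manhattan (city-block) distance; `p=2` would give the Euclidean distance. The sketch below is a small NumPy illustration of that distance and of majority voting among the k nearest points. The helper names `minkowski` and `knn_predict` and the toy data are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

import numpy as np


def minkowski(a, b, p):
    """Minkowski distance between two vectors: (sum |a_i - b_i|^p)^(1/p)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def knn_predict(X_train, y_train, x, k, p):
    """Label x by majority vote among its k nearest training points."""
    dists = [minkowski(row, x, p) for row in X_train]
    nearest = np.argsort(dists)[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups of three points each.
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(minkowski(np.array([0.0, 0.0]), np.array([3.0, 4.0]), p=1))  # Manhattan
print(minkowski(np.array([0.0, 0.0]), np.array([3.0, 4.0]), p=2))  # Euclidean
print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=3, p=1))
print(knn_predict(X_train, y_train, np.array([5.0, 4.0]), k=3, p=1))
```

Because both distances add up raw feature differences, features with large numeric ranges dominate the neighbor search unless the predictors are scaled.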

In [410]:
previsoes_knn = knn.predict(x_teste)
previsoes_knn

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [411]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [412]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [413]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_knn) * 100.0))

Acurácia: 96.49%


In [414]:
confusion_matrix(y_teste, previsoes_knn)

array([[106,   2],
       [  4,  59]])

In [415]:
print(classification_report(y_teste, previsoes_knn))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       108
           1       0.97      0.94      0.95        63

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171



**Analysis on the training data**

In [416]:
previsoes_treino = knn.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [417]:
accuracy_score(y_treino, previsoes_treino)

0.949748743718593

In [418]:
confusion_matrix(y_treino, previsoes_treino)

array([[242,   7],
       [ 13, 136]])

### **Cross-Validation**

In [419]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [420]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [421]:
# Create the model
modelo = KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1)
resultado = cross_val_score(modelo, previsores_labelencoder, alvo_labelencoder_ordinal, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 93.13%


- KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo
- KNN = 95.91% (train/test split) and 96.65% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_esc, alvo
- KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo
- KNN = 95.91% (train/test split) and 96.65% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder_esc, alvo

- KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo_labelencoder_ordinal
- KNN = 95.91% (train/test split) and 96.65% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_esc, alvo_labelencoder_ordinal
- KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo_labelencoder_ordinal
- KNN = 95.91% (train/test split) and 96.65% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best KNN:**

- **KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo**
- **KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo**
- **KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo_labelencoder_ordinal**
- **KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo_labelencoder_ordinal**

# **DECISION TREE**

https://scikit-learn.org/stable/modules/tree.html

In [567]:
from sklearn.tree import DecisionTreeClassifier

In [568]:
arvore = DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3)
arvore.fit(x_treino, y_treino)
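With `criterion='entropy'`, the tree chooses each split by how much it reduces the entropy of the class labels (the information gain). The sketch below computes that quantity with the standard library only. The parent counts are the training-set class totals implied by the confusion matrices in this notebook (249 and 149); the candidate split and the helper names `entropy` and `information_gain` are hypothetical and not taken from the fitted tree.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Class counts in the training set: 249 of class 0 and 149 of class 1.
parent = [249, 149]

# Hypothetical split that sends most of class 0 left and most of class 1 right.
left, right = [240, 20], [9, 129]

print(round(entropy(parent), 3))
print(round(information_gain(parent, [left, right]), 3))
```

A split that separated the two classes perfectly would have a gain equal to the parent entropy. `max_depth` limits how many such splits can be chained along any path.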

In [569]:
previsoes_arvore = arvore.predict(x_teste)
previsoes_arvore

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [570]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [571]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [572]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_arvore) * 100.0))

Acurácia: 95.32%


In [573]:
confusion_matrix(y_teste, previsoes_arvore)

array([[102,   6],
       [  2,  61]])

In [574]:
print(classification_report(y_teste, previsoes_arvore))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96       108
           1       0.91      0.97      0.94        63

    accuracy                           0.95       171
   macro avg       0.95      0.96      0.95       171
weighted avg       0.95      0.95      0.95       171
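The figures in the report above can be derived directly from the confusion matrix. The following sketch recomputes accuracy, precision, recall and F1 by hand, assuming (as in scikit-learn's convention) that rows are true classes and columns are predicted classes. The helper `per_class_metrics` is an illustrative name, not a library function.

```python
# Confusion matrix from the cell above: rows are true classes, columns are predictions.
cm = [[102, 6],
      [2, 61]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)  # predicted k, actually other
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)  # actually k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.4f}")
for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```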



**Training data analysis**

In [575]:
previsoes_treino = arvore.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [576]:
accuracy_score(y_treino, previsoes_treino)

0.964824120603015

In [558]:
confusion_matrix(y_treino, previsoes_treino)

array([[245,   4],
       [ 10, 139]])

### **Cross-Validation**

In [559]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [560]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [561]:
# Build the model (note: max_depth=7 here, while the tree fitted above uses max_depth=3)
modelo = DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=7)
resultado = cross_val_score(modelo, previsores_esc, alvo, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 92.44%
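With `n_splits = 30`, each test fold is small. The sketch below estimates the fold sizes, assuming the full dataset has 398 + 171 = 569 rows (the training and test sizes shown in this notebook) and that the remainder is spread over the first folds, which is how scikit-learn's `KFold` documents its behaviour. `kfold_sizes` is an illustrative helper, not a library function.

```python
def kfold_sizes(n_samples, n_splits):
    """Sizes of the test folds when the remainder is spread over the first folds."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_samples = 398 + 171   # training rows + test rows shown in this notebook
sizes = kfold_sizes(n_samples, 30)
print(sorted(set(sizes)), sum(sizes))

# With so few rows per fold, one misclassified sample moves that fold's accuracy a lot.
print(f"one error in a fold of {min(sizes)} rows costs {100 / min(sizes):.1f} points")
```

Because every fold holds only 18 or 19 rows, the per-fold accuracies are coarse, so the spread across folds is worth reporting alongside the mean.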


Decision tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3). The same figures were recorded for all eight combinations of predictors and target:

previsores, alvo
previsores_esc, alvo
previsores_labelencoder, alvo
previsores_labelencoder_esc, alvo

previsores, alvo_labelencoder_ordinal
previsores_esc, alvo_labelencoder_ordinal
previsores_labelencoder, alvo_labelencoder_ordinal
previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best decision trees: all eight combinations above tie at 95.32% (train/test split) and 92.44% (cross-validation).**

# **RANDOM FOREST**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [663]:
from sklearn.ensemble import RandomForestClassifier

In [664]:
random = RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4)
random.fit(x_treino, y_treino)

In [665]:
previsoes_random = random.predict(x_teste)
previsoes_random

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [666]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [667]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [668]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_random) * 100.0))

Acurácia: 96.49%


In [669]:
confusion_matrix(y_teste, previsoes_random)

array([[106,   2],
       [  4,  59]])

In [670]:
print(classification_report(y_teste, previsoes_random))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       108
           1       0.97      0.94      0.95        63

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171



**Training data analysis**

In [671]:
previsoes_treino = random.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [672]:
accuracy_score(y_treino, previsoes_treino)

0.9899497487437185

In [673]:
confusion_matrix(y_treino, previsoes_treino)

array([[249,   0],
       [  4, 145]])

### **Cross-Validation**

In [674]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [675]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [676]:
# Build the model
modelo = RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4)
resultado = cross_val_score(modelo, previsores_labelencoder_esc, alvo, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 95.76%
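The cell above prints only the mean of the fold scores. A minimal sketch of reporting the spread as well is shown below. The fold accuracies are hypothetical values invented for illustration; in the notebook they would be the contents of `resultado`.

```python
import statistics

# Hypothetical fold accuracies, invented for illustration; in the notebook these
# would be the values held in `resultado`.
fold_scores = [1.0, 0.9474, 0.8947, 1.0, 0.9474, 0.9474]

mean = statistics.mean(fold_scores)
std = statistics.pstdev(fold_scores)  # population std, same default as NumPy's .std()
print(f"Mean accuracy: {mean * 100:.2f}% (+/- {std * 100:.2f}%)")
```

Reporting the standard deviation makes it easier to judge whether a difference of a fraction of a point between two models is meaningful.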


Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4). The same figures were recorded for all eight combinations of predictors and target:

previsores, alvo
previsores_esc, alvo
previsores_labelencoder, alvo
previsores_labelencoder_esc, alvo

previsores, alvo_labelencoder_ordinal
previsores_esc, alvo_labelencoder_ordinal
previsores_labelencoder, alvo_labelencoder_ordinal
previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best Random Forest: all eight combinations above tie at 96.49% (train/test split) and 95.76% (cross-validation).**

# **XGBOOST**

https://xgboost.readthedocs.io/en/stable/

In [801]:
from xgboost import XGBClassifier

In [802]:
xg = XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3)
xg.fit(x_treino,y_treino)

In [803]:
previsoes_xg = xg.predict(x_teste)
previsoes_xg

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [804]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [805]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_xg) * 100.0))

Acurácia: 96.49%


In [806]:
confusion_matrix(y_teste, previsoes_xg)

array([[106,   2],
       [  4,  59]])

In [807]:
print(classification_report(y_teste, previsoes_xg))

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       108
           1       0.97      0.94      0.95        63

    accuracy                           0.96       171
   macro avg       0.97      0.96      0.96       171
weighted avg       0.96      0.96      0.96       171
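Accuracy alone hides which kind of error each model makes. The sketch below compares the test-set confusion matrices of the three models evaluated so far, focusing on recall for class 1. It assumes that label 1 encodes the malignant diagnosis (the encoding step is outside this section), in which case a false negative is a missed malignant tumor. `summarize` is an illustrative helper.

```python
# Test-set confusion matrices copied from the cells above
# (rows: true class, columns: predicted class).
matrices = {
    "Decision tree": [[102, 6], [2, 61]],
    "Random forest": [[106, 2], [4, 59]],
    "XGBoost":       [[106, 2], [4, 59]],
}

def summarize(cm):
    """Return (accuracy, recall for class 1, number of class-1 samples missed)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall_pos = tp / (tp + fn)      # share of class-1 samples that were found
    return accuracy, recall_pos, fn

for name, cm in matrices.items():
    acc, rec, fn = summarize(cm)
    print(f"{name:14s} accuracy={acc:.4f} recall(1)={rec:.4f} missed class-1={fn}")
```

Under that assumption, the decision tree has the lowest accuracy of the three but misses the fewest class-1 samples, so the choice of "best" model depends on which error is more costly.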



**Training data analysis**

In [808]:
previsoes_treino = xg.predict(x_treino)
previsoes_treino

array([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,

In [809]:
accuracy_score(y_treino, previsoes_treino)

1.0

In [810]:
confusion_matrix(y_treino, previsoes_treino)

array([[249,   0],
       [  0, 149]])
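A perfect score on the training set is a common warning sign of overfitting. One simple check, sketched below, is to compare training and test accuracy for each model. The values are copied from the cells above and rounded to four decimals; the size of gap that should cause concern is a judgment call, not a fixed rule.

```python
# Accuracies copied from the training and test cells above.
scores = {
    "Decision tree": {"train": 0.9648, "test": 0.9532},
    "Random forest": {"train": 0.9899, "test": 0.9649},
    "XGBoost":       {"train": 1.0000, "test": 0.9649},
}

def gap(name):
    """Difference between training and test accuracy for one model."""
    return scores[name]["train"] - scores[name]["test"]

for name, s in scores.items():
    print(f"{name:14s} train={s['train']:.4f} test={s['test']:.4f} gap={gap(name) * 100:.2f} points")

largest = max(scores, key=gap)
print(f"largest train/test gap: {largest}")
```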

### **Cross-Validation**

In [811]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [812]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [813]:
# Build the model
modelo = XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3)
resultado = cross_val_score(modelo, previsores_esc, alvo, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

Acurácia Média: 96.29%


XGBoost = 96.49% (train/test split) and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3). The same figures were recorded for all eight combinations of predictors and target:

previsores, alvo
previsores_esc, alvo
previsores_labelencoder, alvo
previsores_labelencoder_esc, alvo

previsores, alvo_labelencoder_ordinal
previsores_esc, alvo_labelencoder_ordinal
previsores_labelencoder, alvo_labelencoder_ordinal
previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best XGBoost: all eight combinations above tie at 96.49% (train/test split) and 96.29% (cross-validation).
However, every one of them reached 100% accuracy on the training set, which may indicate overfitting.**

# **LIGHTGBM**

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html

In [884]:
# Install the library
!pip install lightgbm



In [885]:
import lightgbm as lgb

In [886]:
# Training dataset
dataset = lgb.Dataset(x_treino,label=y_treino)

**Hyperparameters**

**Fit control**

num_leaves: sets the number of leaves to be formed in a tree. There is no direct relation between num_leaves and max_depth, so the two should not be tied to each other.

max_depth: sets the maximum depth (number of levels) a tree can grow to.

**Speed control**

learning_rate: the learning rate; it determines the impact of each tree on the final result.

max_bin: a smaller max_bin greatly reduces processing time, because feature values are grouped into discrete bins, which is computationally cheaper.

**Accuracy control**

num_leaves: a high value produces deeper trees with higher accuracy, but tends to lead to overfitting.

max_bin: high values have an effect similar to increasing num_leaves and also slow down training.

In [887]:
# Parameters
parametros = {'num_leaves':250, # number of leaves
              'objective':'binary', # binary classification
              'max_depth':2,
              'learning_rate':.05,
              'max_bin':100}
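One point worth checking in the parameters above is how `num_leaves` and `max_depth` interact. A binary tree of depth d has at most 2^d leaves, so with `max_depth` set to 2 the depth limit, rather than `num_leaves`, is presumably the binding constraint here. The sketch below makes that arithmetic explicit; `max_leaves` is an illustrative helper, not a LightGBM function.

```python
def max_leaves(max_depth):
    """A binary tree of depth d has at most 2**d leaves."""
    return 2 ** max_depth

params = {"num_leaves": 250, "max_depth": 2}   # values from the cell above
cap = max_leaves(params["max_depth"])
effective = min(params["num_leaves"], cap)
print(f"depth cap: {cap} leaves, effective limit: {effective}")
```

If larger trees are wanted, raising `max_depth` (or leaving it unlimited) would be needed before `num_leaves` has any effect.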

In [888]:
lgbm=lgb.train(parametros,dataset,num_boost_round=200)

[LightGBM] [Info] Number of positive: 149, number of negative: 249
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004266 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3000
[LightGBM] [Info] Number of data points in the train set: 398, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.374372 -> initscore=-0.513507
[LightGBM] [Info] Start training from score -0.513507
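The `pavg` and `initscore` values in the log above can be reproduced by hand: for a binary objective, the starting score is the log-odds of the positive class in the training data. The sketch below recomputes both from the class counts printed in the log and shows that the sigmoid maps the score back to the probability.

```python
import math

positives, negatives = 149, 249        # counts reported in the log above
pavg = positives / (positives + negatives)
initscore = math.log(pavg / (1 - pavg))          # log-odds of the positive class
back_to_prob = 1 / (1 + math.exp(-initscore))    # sigmoid undoes the log-odds

print(f"pavg={pavg:.6f} initscore={initscore:.6f} sigmoid(initscore)={back_to_prob:.6f}")
```

This is also why `lgbm.predict` returns values between 0 and 1: the raw scores of the boosted trees pass through the same sigmoid.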


In [912]:
# Measure the training time
from datetime import datetime
inicio=datetime.now()
lgbm=lgb.train(parametros,dataset)  # no num_boost_round here, so the library default applies (the cell above used 200)
fim=datetime.now()

tempo = fim - inicio
tempo

[LightGBM] [Info] Number of positive: 149, number of negative: 249
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005734 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3000
[LightGBM] [Info] Number of data points in the train set: 398, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.374372 -> initscore=-0.513507
[LightGBM] [Info] Start training from score -0.513507


datetime.timedelta(microseconds=106295)
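As an alternative to subtracting two `datetime.now()` values, a monotonic clock such as `time.perf_counter` is generally better suited to measuring short durations. The sketch below wraps it in a small helper. `timed` is a hypothetical name, and the workload is a stand-in so that the example runs without LightGBM.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be
# timed(lgb.train, parametros, dataset).
result, seconds = timed(sum, range(1_000_000))
print(f"{seconds:.4f} s")
```

A single run is noisy, so repeating the measurement a few times and taking the minimum or median gives a more stable figure.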

In [890]:
previsoes_lgbm = lgbm.predict(x_teste)
previsoes_lgbm

array([0.97614691, 0.03818063, 0.01016664, 0.02657164, 0.00751271,
       0.00975954, 0.02974303, 0.00874901, 0.01762794, 0.00596233,
       0.20603235, 0.0563609 , 0.01349088, 0.46001366, 0.30390385,
       0.92909284, 0.12182897, 0.99134263, 0.99275262, 0.99218284,
       0.9322574 , 0.9800872 , 0.03181537, 0.00780198, 0.98256166,
       0.00887689, 0.00604247, 0.94910771, 0.0102017 , 0.99372689,
       0.00887689, 0.99061897, 0.08322696, 0.98542273, 0.007809  ,
       0.95929397, 0.01680769, 0.9603403 , 0.00689979, 0.98914629,
       0.47091367, 0.00747961, 0.46703509, 0.00970488, 0.21810155,
       0.99253303, 0.007809  , 0.01477855, 0.01129781, 0.96094977,
       0.99223539, 0.96496678, 0.98693564, 0.00668155, 0.0066986 ,
       0.01027927, 0.02012097, 0.04465835, 0.0130796 , 0.99356544,
       0.94261209, 0.980124  , 0.00461161, 0.01007769, 0.99356544,
       0.41906305, 0.99224767, 0.99385407, 0.99153486, 0.01136134,
       0.23675513, 0.99311673, 0.00927197, 0.5660571 , 0.98572

In [891]:
previsoes_lgbm.shape

(171,)

In [892]:
# Threshold the predicted probabilities at 0.5: below 0.5 becomes 0, otherwise 1
previsoes_lgbm = np.where(previsoes_lgbm >= .5, 1.0, 0.0)

In [893]:
previsoes_lgbm

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1.,
       0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1.,
       1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1.,
       0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
       0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 1., 0., 0.,
       0.])

In [894]:
y_teste

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [895]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [896]:
print("Acurácia: %.2f%%" % (accuracy_score(y_teste, previsoes_lgbm) * 100.0))

Acurácia: 95.32%


In [897]:
confusion_matrix(y_teste, previsoes_lgbm)

array([[105,   3],
       [  5,  58]])
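The confusion matrix above depends on the 0.5 cut-off applied to the predicted probabilities. The sketch below shows how the labels change when the threshold moves, using a few probabilities copied from the prediction output above. The value 0.4 is only an example, and `to_labels` is an illustrative helper.

```python
# A few probabilities copied from the prediction output above.
probs = [0.97614691, 0.46001366, 0.30390385, 0.47091367,
         0.46703509, 0.41906305, 0.5660571]

def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; values at or above the threshold become 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

for t in (0.5, 0.4):
    labels = to_labels(probs, t)
    print(f"threshold={t}: {labels} -> {sum(labels)} predicted as class 1")
```

Lowering the threshold flags more samples as class 1, which typically trades precision for recall. Whether that trade is worthwhile depends on the relative cost of each type of error.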

**Training data analysis**

In [898]:
previsoes_treino = lgbm.predict(x_treino)
previsoes_treino

array([0.01217818, 0.0148552 , 0.00780198, 0.96370672, 0.01027927,
       0.00863329, 0.00696301, 0.007809  , 0.95131085, 0.99385407,
       0.02626927, 0.14177776, 0.82009864, 0.00696301, 0.01739686,
       0.06136193, 0.0113902 , 0.99311673, 0.00461161, 0.01248515,
       0.99273863, 0.97815053, 0.00936556, 0.00746002, 0.99163853,
       0.99385407, 0.01864839, 0.00789434, 0.59725581, 0.01007769,
       0.00596233, 0.81205   , 0.99385407, 0.99254565, 0.007809  ,
       0.01652675, 0.02866246, 0.99385407, 0.01051764, 0.02124003,
       0.00604247, 0.01027927, 0.00751271, 0.99124676, 0.01861639,
       0.99385407, 0.01528155, 0.98780617, 0.01162788, 0.97157499,
       0.00970488, 0.91154483, 0.01102915, 0.05711927, 0.007809  ,
       0.01136134, 0.99123938, 0.02623712, 0.99364765, 0.04011997,
       0.02686039, 0.01334377, 0.99195852, 0.01349088, 0.00751271,
       0.00865529, 0.98282381, 0.00596233, 0.05593339, 0.97962408,
       0.93151309, 0.00780198, 0.99385407, 0.01136134, 0.86546

In [899]:
previsoes_treino.shape

(398,)

In [900]:
# Threshold the predicted probabilities at 0.5: below 0.5 becomes 0, otherwise 1
previsoes_treino = np.where(previsoes_treino >= .5, 1.0, 0.0)

In [901]:
previsoes_treino

array([0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0.,
       1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0.,
       1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0.,
       0., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 1., 0.,
       1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1.,
       0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 1., 0.,
       0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       0., 1., 1., 1., 1.

In [902]:
accuracy_score(y_treino, previsoes_treino)

0.9899497487437185

In [903]:
confusion_matrix(y_treino, previsoes_treino)

array([[249,   0],
       [  4, 145]])

### **Cross-Validation**

In [904]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [905]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)

In [906]:
# Build the model
modelo = lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',
                            max_depth = 2, learning_rate = .05, max_bin =100)
resultado = cross_val_score(modelo, previsores_labelencoder, alvo, cv = kfold)

# Report the mean accuracy across the folds
print("Acurácia Média: %.2f%%" % (resultado.mean() * 100.0))

[LightGBM] [Info] Number of positive: 209, number of negative: 341
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000268 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3000
[LightGBM] [Info] Number of data points in the train set: 550, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.380000 -> initscore=-0.489548
[LightGBM] [Info] Start training from score -0.489548
[LightGBM] [Info] Number of positive: 203, number of negative: 347
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000199 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3000
[LightGBM] [Info] Number of data points in the train set: 550, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.369091 -> initscore=-0.536119
[LightGBM] [Info] Start training from score -0.536119
[LightGBM] [Info] Number

LightGBM results with lgb.LGBMClassifier(num_leaves = 250, objective = 'binary', max_depth = 2, learning_rate = .05, max_bin = 100), for each combination of predictors and target:

previsores, alvo = 95.32% (train/test split) and 96.11% (cross-validation)
previsores_esc, alvo = 97.08% (train/test split) and 96.12% (cross-validation)
previsores_labelencoder, alvo = 95.32% (train/test split) and 96.11% (cross-validation)
previsores_labelencoder_esc, alvo = 97.08% (train/test split) and 85.93% (cross-validation)

previsores, alvo_labelencoder_ordinal = 95.32% (train/test split) and 96.11% (cross-validation)
previsores_esc, alvo_labelencoder_ordinal = 97.08% (train/test split) and 96.12% (cross-validation)
previsores_labelencoder, alvo_labelencoder_ordinal = 95.32% (train/test split) and 96.11% (cross-validation)
previsores_labelencoder_esc, alvo_labelencoder_ordinal = 97.08% (train/test split) and 96.12% (cross-validation)

**Best LightGBM (97.08% on the train/test split):
previsores_esc, alvo = 96.12% (cross-validation)
previsores_labelencoder_esc, alvo = 85.93% (cross-validation)
previsores_esc, alvo_labelencoder_ordinal = 96.12% (cross-validation)
previsores_labelencoder_esc, alvo_labelencoder_ordinal = 96.12% (cross-validation)**
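To compare the tree-based models covered so far, the cross-validation means can be collected and sorted. The sketch below uses the figures reported in the sections above, taking the best LightGBM cross-validation value (96.12%). It is a bookkeeping aid, not a statistical test, since the differences between the top models are small.

```python
# Cross-validation mean accuracies (%) reported in the sections above.
cv_accuracy = {
    "Decision tree": 92.44,
    "Random forest": 95.76,
    "XGBoost": 96.29,
    "LightGBM": 96.12,
}

ranking = sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True)
for position, (name, accuracy) in enumerate(ranking, start=1):
    print(f"{position}. {name}: {accuracy:.2f}%")
```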

# **CATBOOST**

https://catboost.ai/en/docs/

In [917]:
# Installation
!pip install catboost



In [918]:
from catboost import CatBoostClassifier

In [921]:
df_original

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [922]:
previsores_catboost = df_original.iloc[:, 1:32]

In [923]:
previsores_catboost.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [934]:
alvo_catboost = df_original.iloc[:, 0]
alvo_catboost

0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

In [935]:
from sklearn.model_selection import train_test_split

In [951]:
x_treino, x_teste, y_treino, y_teste = train_test_split(previsores_catboost, alvo_catboost, test_size = 0.3, random_state = 0)
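
As a quick sanity check on the split above: with 569 rows and `test_size = 0.3`, scikit-learn rounds the test share up and leaves the remainder for training. The sketch below reproduces that arithmetic in plain Python. The variable names are illustrative and not part of the notebook.

```python
import math

n_samples = 569   # rows in df_original
test_size = 0.3   # same fraction passed to train_test_split

# The test share is rounded up; training receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 398 171
```

These counts agree with the `Length: 171` reported for `y_teste` and with the 398 samples in the training confusion matrix.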

In [952]:
catboost = CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5,
                              eval_metric="Accuracy")

In [953]:
catboost.fit( x_treino, y_treino, plot=True, eval_set=(x_teste, y_teste))

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 0.9597990	test: 0.9239766	best: 0.9239766 (0)	total: 13.7ms	remaining: 1.36s
1:	learn: 0.9648241	test: 0.9239766	best: 0.9239766 (0)	total: 22.9ms	remaining: 1.12s
2:	learn: 0.9748744	test: 0.9181287	best: 0.9239766 (0)	total: 31.3ms	remaining: 1.01s
3:	learn: 0.9673367	test: 0.9239766	best: 0.9239766 (0)	total: 39.6ms	remaining: 950ms
4:	learn: 0.9773869	test: 0.9298246	best: 0.9298246 (4)	total: 47.6ms	remaining: 905ms
5:	learn: 0.9773869	test: 0.9473684	best: 0.9473684 (5)	total: 58.1ms	remaining: 910ms
6:	learn: 0.9824121	test: 0.9473684	best: 0.9473684 (5)	total: 66.4ms	remaining: 883ms
7:	learn: 0.9849246	test: 0.9473684	best: 0.9473684 (5)	total: 74.4ms	remaining: 855ms
8:	learn: 0.9874372	test: 0.9532164	best: 0.9532164 (8)	total: 82.6ms	remaining: 835ms
9:	learn: 0.9899497	test: 0.9590643	best: 0.9590643 (9)	total: 91ms	remaining: 819ms
10:	learn: 0.9874372	test: 0.9590643	best: 0.9590643 (9)	total: 99.9ms	remaining: 808ms
11:	learn: 0.9899497	test: 0.9590643	best: 0

<catboost.core.CatBoostClassifier at 0x7dd801a86f50>

In [954]:
previsoes_cat = catboost.predict(x_teste)
previsoes_cat

array(['M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'M', 'B',
       'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B',
       'M', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'M',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'M',
       'B', 'M', 'M', 'M', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B',
       'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B',
       'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'M', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'M', 'M', 'M', 'B',
       'B', 'B'], dtype=object)

In [955]:
y_teste

512    M
457    B
439    B
298    B
37     B
      ..
7      M
408    M
523    B
361    B
553    B
Name: diagnosis, Length: 171, dtype: object

In [956]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [958]:
print("Accuracy: %.2f%%" % (accuracy_score(y_teste, previsoes_cat) * 100.0))

Accuracy: 97.08%


In [959]:
confusion_matrix(y_teste, previsoes_cat)

array([[106,   2],
       [  3,  60]])
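
The matrix above holds enough information to derive the other metrics this project set out to compare (precision, recall, F1). The sketch below works them out by hand, treating malignant (`'M'`) as the positive class. It assumes scikit-learn's default layout: rows are true labels, columns are predictions, and labels are sorted as `'B'`, `'M'`.

```python
# Test-set confusion matrix copied from the output above.
# Assumption: rows = true labels, columns = predictions, order ('B', 'M').
matrix = [[106, 2],
          [3, 60]]

tn, fp = matrix[0]  # true B: predicted B, predicted M
fn, tp = matrix[1]  # true M: predicted B, predicted M

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted M that really are M
recall = tp / (tp + fn)      # share of real M that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Under that reading, three malignant tumors were predicted as benign (false negatives), which in a diagnostic setting is usually the more costly kind of error.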

**Training Data Analysis**

In [960]:
previsoes_treino = catboost.predict(x_treino)
previsoes_treino

array(['B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M',
       'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'M',
       'B', 'B', 'M', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
       'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'M',
       'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'B', 'B', 'M', 'B', 'B',
       'B', 'M', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'M', 'B', 'M', 'M',
       'M', 'M', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M', 'B', 'B', 'M',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'M', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B',
       'M', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B',
       'B', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B',
       'B', 'M', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B',
       'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B

In [961]:
accuracy_score(y_treino, previsoes_treino)

0.992462311557789

In [962]:
confusion_matrix(y_treino, previsoes_treino)

array([[249,   0],
       [  3, 146]])
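
Comparing training and test accuracy gives a rough indication of overfitting. The sketch below computes both from the two confusion matrices printed in this section and reports the gap between them. The helper name `accuracy_from_matrix` is hypothetical, introduced only for this illustration.

```python
def accuracy_from_matrix(matrix):
    """Fraction of samples on the main diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

train_matrix = [[249, 0], [3, 146]]  # output of In [962]
test_matrix = [[106, 2], [3, 60]]    # output of In [959]

train_acc = accuracy_from_matrix(train_matrix)
test_acc = accuracy_from_matrix(test_matrix)
gap = train_acc - test_acc

print(f"train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")
```

A gap of roughly two percentage points is small, which suggests the model is not memorizing the training set to any large degree.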

### **Cross-Validation**

In [963]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [964]:
# Split the data into folds
kfold = KFold(n_splits = 30, shuffle=True, random_state = 5)
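
With 30 folds over 569 samples, each validation fold is very small. The sketch below reproduces how `KFold` distributes samples (the first `n_samples % n_splits` folds get one extra sample) using plain Python. The variable names are illustrative.

```python
n_samples = 569
n_splits = 30

# KFold gives the first (n_samples % n_splits) folds one extra sample.
base, extra = divmod(n_samples, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

print(fold_sizes.count(19), "folds of 19;", fold_sizes.count(18), "fold of 18")
print("training size for the first fold:", n_samples - fold_sizes[0])
```

Each fold validates on only 18 or 19 samples, so a single misclassification moves that fold's accuracy by more than five percentage points. The mean over all folds is therefore far more informative than any individual fold. The training size of 550 is consistent with the `learn` values in the CatBoost log for the first fold (for example, 0.9581818 = 527/550).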

In [965]:
# Create the model
modelo = CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5,
                              eval_metric="Accuracy")
# Cross-validate on the same CatBoost predictors and target used for the split above
resultado = cross_val_score(modelo, previsores_catboost, alvo_catboost, cv = kfold)

# Report the mean accuracy across folds
print("Mean accuracy: %.2f%%" % (resultado.mean() * 100.0))

0:	learn: 0.9581818	total: 10.4ms	remaining: 1.03s
1:	learn: 0.9654545	total: 18.7ms	remaining: 916ms
2:	learn: 0.9727273	total: 26.9ms	remaining: 871ms
3:	learn: 0.9727273	total: 35.5ms	remaining: 851ms
4:	learn: 0.9836364	total: 45ms	remaining: 856ms
5:	learn: 0.9800000	total: 53.5ms	remaining: 839ms
6:	learn: 0.9890909	total: 61.7ms	remaining: 819ms
7:	learn: 0.9890909	total: 69.3ms	remaining: 797ms
8:	learn: 0.9890909	total: 77ms	remaining: 779ms
9:	learn: 0.9909091	total: 84.6ms	remaining: 761ms
10:	learn: 0.9927273	total: 92.3ms	remaining: 747ms
11:	learn: 0.9945455	total: 100ms	remaining: 735ms
12:	learn: 0.9945455	total: 109ms	remaining: 729ms
13:	learn: 0.9945455	total: 117ms	remaining: 720ms
14:	learn: 0.9945455	total: 125ms	remaining: 710ms
15:	learn: 0.9945455	total: 133ms	remaining: 701ms
16:	learn: 0.9945455	total: 143ms	remaining: 696ms
17:	learn: 0.9945455	total: 151ms	remaining: 688ms
18:	learn: 0.9963636	total: 159ms	remaining: 676ms
19:	learn: 0.9945455	total: 166ms	

CatBoost = 97.08% (train/test split) and 97.16% (cross-validation) - CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5, eval_metric="Accuracy") - previsores_catboost, alvo_catboost

**Best CatBoost:**
CatBoost = 97.08% (train/test split) and 97.16% (cross-validation) - CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5, eval_metric="Accuracy") - previsores_catboost, alvo_catboost

# **Saving Data for Deployment**

In [None]:
previsores

In [None]:
alvo

In [None]:
np.savetxt('../output/previsores.csv', previsores, delimiter=',')

In [None]:
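# fmt='%s' assumes alvo may hold string labels ('M'/'B'); the default float format would reject them
np.savetxt('../output/alvo.csv', alvo, delimiter=',', fmt='%s')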

# **CHALLENGE 1 CONCLUSION**

DEVELOP AND SELECT THE BEST MACHINE LEARNING CLASSIFICATION ALGORITHM FOR THE DATASET AT THE FOLLOWING LINK:

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

**Best Methods:**
**-1st-** Logistic Regression = 97.66% (train/test split) and 98.06% (cross-validation), with scaled predictors
**-2nd-** SVM = 97.66% (train/test split) and 97.88% (cross-validation), with scaled predictors
**-3rd-** LightGBM = 97.08% (train/test split) and 96.12% (cross-validation) - lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',  max_depth = 2, learning_rate = .05, max_bin =100) - with scaled predictors
**-4th-** CatBoost = 97.08% (train/test split) and 97.16% (cross-validation) - CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5, eval_metric="Accuracy") - with the predictors prepared for CatBoost
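
One way to make the selection criterion explicit is to sort the best result of each algorithm programmatically. The sketch below uses the accuracies reported in the summary lists of this conclusion. The criterion (test accuracy first, cross-validation accuracy as tie-break) is an assumption made for illustration, so the positions of models with equal test accuracy can differ from a ranking made by hand.

```python
# (test accuracy %, cross-validation accuracy %) copied from the summary lists.
results = {
    "Logistic Regression": (97.66, 98.06),
    "SVM": (97.66, 97.88),
    "LightGBM": (97.08, 96.12),
    "CatBoost": (97.08, 97.16),
    "Random Forest": (96.49, 95.76),
    "XGBoost": (96.49, 96.29),
    "KNN": (96.49, 93.13),
    "Decision Tree": (95.32, 92.44),
    "Naive Bayes": (92.40, 93.82),
}

# Assumed criterion: higher test accuracy first, ties broken by CV accuracy.
ranking = sorted(results, key=lambda name: results[name], reverse=True)

for position, name in enumerate(ranking, start=1):
    test_acc, cv_acc = results[name]
    print(f"{position}. {name}: test={test_acc:.2f}% cv={cv_acc:.2f}%")
```

Swapping the tuple order in `results` would rank by cross-validation accuracy first, which is often the more stable estimate on a dataset of this size.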

**Best Naive Bayes:**
Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo
Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo
Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores, alvo_labelencoder_ordinal
Naive Bayes = 92.40% (train/test split) and 93.82% (cross-validation) - previsores_labelencoder, alvo_labelencoder_ordinal

**Best SVM:**
SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo
SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo
SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_esc, alvo_labelencoder_ordinal
SVM = 97.66% (train/test split) and 97.88% (cross-validation) - SVC(kernel='rbf', random_state=1, C = 2) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best Logistic Regression:**
Logistic Regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo
Logistic Regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo
Logistic Regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=2000, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_esc, alvo_labelencoder_ordinal
Logistic Regression = 97.66% (train/test split) and 98.06% (cross-validation with previsores_labelencoder_esc) - LogisticRegression(random_state=1, max_iter=600, penalty="l2", tol=0.0001, C=1,solver="lbfgs") - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best KNN:**
KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo
KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo
KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores, alvo_labelencoder_ordinal
KNN = 96.49% (train/test split) and 93.13% (cross-validation) - KNeighborsClassifier(n_neighbors=7, metric='minkowski', p = 1) - previsores_labelencoder, alvo_labelencoder_ordinal

**Best Decision Tree:**
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores, alvo
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_esc, alvo
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_labelencoder, alvo
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_labelencoder_esc, alvo
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores, alvo_labelencoder_ordinal
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_esc, alvo_labelencoder_ordinal
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_labelencoder, alvo_labelencoder_ordinal
Decision Tree = 95.32% (train/test split) and 92.44% (cross-validation) - DecisionTreeClassifier(criterion='entropy', random_state = 0, max_depth=3) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best Random Forest:**
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores, alvo
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_esc, alvo
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_labelencoder, alvo
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_labelencoder_esc, alvo
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores, alvo_labelencoder_ordinal
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_esc, alvo_labelencoder_ordinal
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_labelencoder, alvo_labelencoder_ordinal
Random Forest = 96.49% (train/test split) and 95.76% (cross-validation) - RandomForestClassifier(n_estimators=150, criterion='entropy', random_state = 0, max_depth=4) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best XGBoost: however, all of them scored 100% on the test set, which may indicate overfitting**
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores, alvo
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_esc, alvo
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_labelencoder, alvo
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_labelencoder_esc, alvo
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores, alvo_labelencoder_ordinal
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_esc, alvo_labelencoder_ordinal
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_labelencoder, alvo_labelencoder_ordinal
XGBoost = 96.49% and 96.29% (cross-validation) - XGBClassifier(max_depth=2, learning_rate=0.05, n_estimators=250, objective='binary:logistic', random_state=3) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best LightGBM:**
LightGBM = 97.08% (train/test split) and 96.12% (cross-validation) - lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',  max_depth = 2, learning_rate = .05, max_bin =100) - previsores_esc, alvo
LightGBM = 97.08% (train/test split) and 85.93% (cross-validation) - lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',  max_depth = 2, learning_rate = .05, max_bin =100) - previsores_labelencoder_esc, alvo
LightGBM = 97.08% (train/test split) and 96.12% (cross-validation) - lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',  max_depth = 2, learning_rate = .05, max_bin =100) - previsores_esc, alvo_labelencoder_ordinal
LightGBM = 97.08% (train/test split) and 96.12% (cross-validation) - lgb.LGBMClassifier(num_leaves = 250, objective = 'binary',  max_depth = 2, learning_rate = .05, max_bin =100) - previsores_labelencoder_esc, alvo_labelencoder_ordinal

**Best CatBoost:**
CatBoost = 97.08% (train/test split) and 97.16% (cross-validation) - CatBoostClassifier(task_type='CPU', iterations=100, learning_rate=0.1, depth = 8, random_state = 5, eval_metric="Accuracy") - previsores_catboost, alvo_catboost