---
# Breast Cancer Exercise - scaler, cross-validation, pipeline  
---
**Machine Learning em Projetos (Erick Muzart and Fernando Melo)**  
Topics:
- normalization (StandardScaler)
- cross-validation
- pipeline





#### **Machine learning project description**
Before writing any code, we need to understand the problem we want to solve and write a concise project description, so that the project's goal can be communicated simply and quickly to technical staff, managers, and collaborators.  
  
**1- Description of the problem or task:**  
Predict whether a breast cancer is malignant or benign based on the features of a digitized image of a fine needle aspirate (FNA) of a breast mass.  
**2- Description of the AI solution:**  
Supervised training of a breast cancer classification model with 2 classes (benign/malignant), based on the characteristics of the cell nuclei present in the image.  
**3- Data source:**  
The data were obtained from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.  
Data source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html  
**4- Independent variables (predictors or "features"):**  
'mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'     
**5- Dependent variable (response or "target"):**  
Cancer type: benign or malignant (in this dataset, 0 = malignant and 1 = benign)

## Load libraries

In [None]:
# Import pandas, ConfusionMatrixDisplay, train_test_split, matplotlib, seaborn, and metrics
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt 
from sklearn import metrics
import seaborn as sns

## Exploratory data analysis

In [None]:
# Import load_breast_cancer from sklearn and load the dataset


In [None]:
# Convert the sklearn dataset to a pandas DataFrame

# Create a new column named target

# Show the dataset dimensions and the first rows


Dataset dimensions:  (569, 31)


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
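
One possible way to fill in the two cells above is sketched here. The variable names `cancer` and `df` are assumptions made for illustration, not names taken from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset as a Bunch object (data, target, feature_names, ...)
cancer = load_breast_cancer()

# Build a DataFrame from the feature matrix and add the target as a new column
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

# Show the dimensions and the first rows
print("Dataset dimensions: ", df.shape)
print(df.head())
```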


In [None]:
# Check the number of samples per class in the target variable (.value_counts)


1    357
0    212
Name: target, dtype: int64
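
A short sketch of the class count, again assuming the hypothetical names `cancer` and `df`. It also prints `target_names`, which shows how the labels 0 and 1 map to class names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target

# Count the samples in each class of the target column
counts = df["target"].value_counts()
print(counts)

# Index 0 of target_names is the label for class 0, index 1 for class 1
print(list(cancer.target_names))
```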

### Normalization (StandardScaler)

In [None]:
# Assign the independent variables to X

# Assign the dependent variable to y

# Split the data into train and test sets (80/20) with stratify, since the classes are imbalanced


In [None]:
# Import StandardScaler

# Instantiate a StandardScaler object

# Fit the scaler to the training data


StandardScaler()
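
The split and the scaler fit could look like the sketch below. The value `random_state=42` is an assumption, since the seed used in the original notebook is not shown, so results may differ from the outputs printed in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X holds the 30 features, y holds the target (returned as pandas objects)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80/20 split; stratify=y keeps the class proportions in both subsets
# random_state=42 is an assumed seed, used only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, so that no test-set
# statistics leak into the preprocessing step
scaler = StandardScaler()
scaler.fit(X_train)
```

Fitting on `X_train` alone means the mean and standard deviation of each feature are learned without looking at the test set.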

In [None]:
# Create a DataFrame with the training data transformed by the scaler, only to visualize the transformation


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,-1.072001,-0.658425,-1.08808,-0.939274,-0.13594,-1.008718,-0.968359,-1.102032,0.281062,-0.113231,-0.704861,-0.440938,-0.743949,-0.629805,0.000748,-0.991573,-0.69376,-0.983284,-0.591579,-0.428972,-1.034094,-0.623497,-1.070773,-0.876534,-0.169982,-1.038836,-1.078995,-1.350527,-0.352658,-0.54138
1,1.748743,0.066502,1.751157,1.745559,1.274468,0.842288,1.519852,1.994664,-0.293045,-0.32018,0.127567,-0.381383,0.094075,0.317524,0.639656,0.087389,0.708451,1.18215,0.426212,0.074797,1.228342,-0.092833,1.187467,1.104386,1.517001,0.249655,1.178594,1.549916,0.191078,-0.173739
2,-0.974734,-0.931124,-0.997709,-0.867589,-0.613515,-1.138154,-1.092292,-1.243358,0.434395,-0.429247,-0.254445,1.23713,-0.338634,-0.413827,0.52024,-0.833114,-1.006736,-1.857894,1.356046,-1.00656,-0.973231,-1.036772,-1.008044,-0.834168,-1.097823,-1.16726,-1.282241,-1.707442,-0.307734,-1.213033
3,-0.145103,-1.215186,-0.123013,-0.253192,0.664482,0.286762,-0.129729,-0.098605,0.555635,0.029395,-0.531049,-1.262281,-0.411682,-0.4366,-0.39358,-0.129997,-0.219965,-0.527278,-0.26945,-0.316623,-0.251266,-1.369643,-0.166633,-0.330292,0.234006,0.096874,-0.087521,-0.344838,0.242198,-0.118266
4,-0.771617,-0.081211,-0.8037,-0.732927,-0.672282,-1.006099,-0.798502,-0.684484,0.737495,-0.457213,-0.498529,1.322961,-0.440597,-0.521457,-0.174225,-0.628196,-0.581187,-0.278344,1.528534,-0.313022,-0.801135,0.07923,-0.824381,-0.74183,-0.911367,-0.984612,-0.93319,-0.777604,0.555118,-0.761639
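
A sketch of how the scaled training data can be inspected. The name `X_train_scaled` and the seed `random_state=42` are assumptions. After the transformation, each column of the training data should have a mean close to 0 and a standard deviation close to 1.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed seed
)

scaler = StandardScaler().fit(X_train)

# transform returns a NumPy array, so wrap it in a DataFrame
# with the original column names to make it readable
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
print(X_train_scaled.head())

# Per-column mean and (population) standard deviation after scaling
print(X_train_scaled.mean().round(3).head())
print(X_train_scaled.std(ddof=0).round(3).head())
```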


Train the model on normalized data

In [None]:
# Instantiate the logistic regression model

# Instantiate a StandardScaler object

# Fit the scaler to the training data

# Train the model (learn the coefficients)


LogisticRegression()

In [None]:
# Check the model's accuracy (.score)


0.9824561403508771

In [None]:
# Check the model's f1_score, a more informative metric when the classes are imbalanced


0.9824561403508771
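
The training and evaluation cells above can be sketched as follows. The seed `random_state=42` and the variable names are assumptions, so the scores are not guaranteed to match the values printed in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed seed
)

# Fit the scaler on the training data, then apply the same
# transformation to both the training and the test data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model on the scaled training data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Accuracy on the scaled test data
accuracy = model.score(X_test_scaled, y_test)

# F1 score; by default f1_score treats label 1 as the positive class
f1 = f1_score(y_test, model.predict(X_test_scaled))

print(accuracy, f1)
```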

## Pipeline with StandardScaler and cross-validation

In [None]:
# Import make_pipeline, StandardScaler, LogisticRegression, cross_val_score


In [None]:
# Assign the independent variables to X

# Assign the dependent variable to y


In [None]:
# Create a pipeline with the steps StandardScaler() and LogisticRegression(), using make_pipeline.


Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
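
A minimal sketch of the pipeline creation, with `pipe` as an assumed variable name. `make_pipeline` names each step automatically after the lowercased class name, which explains the step names in the output above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Chain the scaler and the classifier into a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(pipe)

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe.steps]
print(step_names)
```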

In [None]:
# Run 10-fold cross-validation on the pipeline and take the mean of the scores


0.9806704260651629
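
The cross-validation step might be written as in the sketch below, with `pipe` and `scores` as assumed names. Because the scaler is part of the pipeline, it is refit on the training portion of each fold, so the held-out fold never influences the scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# cv=10 runs 10-fold cross-validation; for a classifier the default
# score is accuracy, and one score is returned per fold
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```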

In [None]:
# The pipeline can be trained and used like any other model,
# and it prevents data leakage from the test data into the training data.

# Split the data into train and test sets (80/20) with stratify, since the classes are imbalanced

# Train the pipeline with .fit

# Check the pipeline's score with .score


0.9824561403508771

In [None]:
# Predict the first 2 rows of X_test with the pipeline trained above.


array([0, 1])
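
The last two cells can be sketched together as follows. The seed `random_state=42` and the variable names are assumptions, so the score and the predicted labels may differ from the outputs shown above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # assumed seed
)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit scales the training data and trains the classifier in one call
pipe.fit(X_train, y_train)

# score and predict apply the already fitted scaler before the classifier,
# so the raw (unscaled) test data can be passed in directly
test_score = pipe.score(X_test, y_test)
first_two = pipe.predict(X_test[:2])

print(test_score)
print(first_two)
```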

### Pipelines have some advantages:

1- Your training file stays the same and does not grow because of one-hot encoding.  
2- When predicting on new data, there is no need to run pandas `get_dummies` on the new file. This also avoids problems when the new data do not contain every category present in the training data, in which case the new dataset would have different dimensions and raise an error.  
3- You can run a grid search over both the preprocessing parameters and the model parameters.  
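
To illustrate the third point, here is a hedged sketch of a grid search over one preprocessing parameter and one model parameter. The parameter values in `param_grid`, the choice of `cv=5`, and `max_iter=1000` are assumptions picked for illustration, not settings from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# max_iter is raised (an assumption) to give the solver room to converge
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    "standardscaler__with_mean": [True, False],  # preprocessing parameter
    "logisticregression__C": [0.1, 1.0, 10.0],   # model parameter
}

# Every combination is evaluated with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

The step names used as prefixes are the ones `make_pipeline` generates, as shown in the printed pipeline earlier in this notebook.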
