# Using the Pipeline from the imblearn library

The [imblearn Pipeline](https://imbalanced-learn.org/stable/references/generated/imblearn.pipeline.Pipeline.html), unlike the scikit-learn one, accepts samplers as steps. A sampler transforms both X and y, so the pipeline can be used to apply oversampling to the data. Samplers run only while the pipeline is being fitted; they are skipped at prediction time. Here is a brief example of how it works, using the [diabetes](https://www.kaggle.com/datasets/mathchi/diabetes-data-set) dataset.

In [28]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score

df = pd.read_csv('diabetes.csv')  # Read the data

df.head()

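| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |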


In [29]:
X = df.drop('Outcome', axis=1)

y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y)  # Train/test split

In [30]:
from imblearn.pipeline import Pipeline  # Use the Pipeline class from imblearn
from imblearn.over_sampling import SMOTE

kfold = KFold(n_splits=5)

sm = SMOTE(random_state=42)


In [31]:
# Create the pipeline with SMOTE:
pipeline = Pipeline([
    ('oversampling', sm), ('modelo', HistGradientBoostingClassifier())
])

# Run the cross-validation:
print("Mean accuracy: ", np.mean(cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='accuracy')))

Mean accuracy:  0.7310194902548727
