### SMOTE-ENN

#### ENN (Edited Nearest Neighbors)

The algorithm of ENN can be explained as follows.

1. Given a dataset with N observations, choose K, the number of nearest neighbors. If K is not specified, K=3.

2. For an observation, find its K nearest neighbors among the other observations in the dataset, then take the majority class of those neighbors.

3. If the class of the observation differs from the majority class of its K nearest neighbors, delete the observation from the dataset.

4. Repeat steps 2 and 3 for every observation in the dataset.
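The ENN steps above can be sketched in plain Python. This is a minimal illustration and not the imbalanced-learn implementation: the function name `edited_nearest_neighbours` and the toy data are made up for the example, distances are Euclidean, and "deleted" is read as removing the misclassified observation itself, which is how ENN is usually defined.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Return the indices of the observations that ENN keeps (hypothetical helper)."""
    keep = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Step 2: rank every other observation by its distance to observation i.
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours = [j for _, j in others[:k]]
        # Majority class among the k nearest neighbours.
        majority = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        # Step 3: keep the observation only if it agrees with its neighbours.
        if majority == label:
            keep.append(i)
    return keep


# Toy data: two clusters, plus one class-1 point (index 4) inside the class-0 cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = edited_nearest_neighbours(points, labels, k=3)
print(kept)  # index 4 is dropped; every other observation is kept
```

The neighbours are always searched in the original dataset, so the result does not depend on the order in which observations are visited.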

#### SMOTE-ENN process

The process of SMOTE-ENN can be explained as follows.

1. (Start of SMOTE: Synthetic Minority Oversampling Technique) Choose a random observation from the minority class.

2. Find the k nearest minority-class neighbors of that observation, pick one of them, and compute the difference between the neighbor and the chosen observation.

3. Multiply the difference by a random number between 0 and 1 and add the result to the chosen observation. The resulting point is added to the minority class as a synthetic sample.

4. Repeat steps 1–3 until the desired proportion of the minority class is met. (End of SMOTE)

5. (Start of ENN) Choose K, the number of nearest neighbors. If K is not specified, K=3.

6. For an observation, find its K nearest neighbors among the other observations in the dataset, then take the majority class of those neighbors.

7. If the class of the observation differs from the majority class of its K nearest neighbors, delete the observation from the dataset.

8. Repeat steps 6 and 7 for every observation in the dataset. (End of ENN)
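To make the two stages concrete, here is a small standard-library sketch that chains them. It rests on several assumptions that are not from imbalanced-learn: the helper names `k_nearest` and `smote_enn` are hypothetical, the problem is binary, SMOTE oversamples to a 1:1 ratio, and the ENN stage removes only the observation that disagrees with its neighbors.

```python
import math
import random
from collections import Counter


def k_nearest(i, points, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    ranked = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    return [j for _, j in ranked[:k]]


def smote_enn(points, labels, minority, k_smote=2, k_enn=3, seed=0):
    """Oversample `minority` to a 1:1 ratio, then clean both classes with ENN."""
    rng = random.Random(seed)
    points, labels = list(points), list(labels)

    # SMOTE stage: interpolate between a minority point and one of its neighbours.
    pool = [p for p, y in zip(points, labels) if y == minority]
    for _ in range(len(points) - 2 * len(pool)):      # majority count minus minority count
        i = rng.randrange(len(pool))                  # random minority observation
        j = rng.choice(k_nearest(i, pool, k_smote))   # one of its minority neighbours
        gap = rng.random()                            # random number in [0, 1)
        points.append(tuple(a + gap * (b - a) for a, b in zip(pool[i], pool[j])))
        labels.append(minority)
    after_smote = Counter(labels)

    # ENN stage: drop every observation whose neighbours vote for the other class.
    def majority_vote(i):
        votes = Counter(labels[j] for j in k_nearest(i, points, k_enn))
        return votes.most_common(1)[0][0]

    keep = [i for i in range(len(points)) if majority_vote(i) == labels[i]]
    return [points[i] for i in keep], [labels[i] for i in keep], after_smote


points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.4, 0.1),
          (0.1, 0.4), (0.3, 0.3), (0.9, 0.9),    # class 0 (majority)
          (1.0, 1.0), (1.2, 1.0), (1.0, 1.2)]    # class 1 (minority)
labels = [0] * 8 + [1] * 3

clean_points, clean_labels, after_smote = smote_enn(points, labels, minority=1)
print("after SMOTE:", dict(after_smote))
print("after ENN:  ", dict(Counter(clean_labels)))
```

Because ENN runs after oversampling, it can remove original and synthetic points alike, so the final class counts are usually close to balanced rather than exactly equal.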

In [2]:
import pandas as pd
import numpy as np
from imblearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier # for the sake of example, could use anything else
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

In [3]:
data=pd.read_csv("diabetes.csv")
data.head()

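|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |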


In [4]:
Y=data['Outcome'].values
X=data.drop('Outcome',axis=1)

In [5]:
#Define model
model_ori=AdaBoostClassifier()
#Define evaluation procedure (here we use Repeated Stratified K-Fold CV)
cv_ori=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
#Evaluate model
scoring=['accuracy','precision_macro','recall_macro']
scores_ori = cross_validate(model_ori, X, Y, scoring=scoring, cv=cv_ori, n_jobs=-1)

# summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores_ori['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores_ori['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores_ori['test_recall_macro']))

Mean Accuracy: 0.7535
Mean Precision: 0.7346
Mean Recall: 0.7122


In [6]:
##Using SMOTE-ENN to balance the data
#Define model
model=AdaBoostClassifier()
#Define SMOTE-ENN
resample=SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='all'))
#Define pipeline
pipeline=Pipeline(steps=[('r', resample), ('m', model)])
#Define evaluation procedure (here we use Repeated Stratified K-Fold CV)
cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
#Evaluate model
scoring=['accuracy','precision_macro','recall_macro']
scores = cross_validate(pipeline, X, Y, scoring=scoring, cv=cv, n_jobs=-1)

# summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))

Mean Accuracy: 0.7353
Mean Precision: 0.7279
Mean Recall: 0.7468


The sampling_strategy used in EditedNearestNeighbours is 'all', because the purpose of ENN here is to delete observations from both classes: any observation whose class differs from the majority class of its K nearest neighbors is removed.
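The effect of cleaning both classes rather than only one can be illustrated with a small standard-library sketch. The function `enn_keep` and its `target_classes` parameter are hypothetical stand-ins for what `sampling_strategy` controls, not the imbalanced-learn API, and the toy data is invented for the example.

```python
import math
from collections import Counter


def enn_keep(points, labels, target_classes, k=3):
    """Indices kept by ENN when only classes in `target_classes` may be deleted."""
    keep = []
    for i, (p, label) in enumerate(zip(points, labels)):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        votes = Counter(labels[j] for _, j in others[:k])
        disagrees = votes.most_common(1)[0][0] != label
        # An observation is deleted only if it disagrees AND its class is targeted.
        if not (disagrees and label in target_classes):
            keep.append(i)
    return keep


# Index 5 is a class-1 point inside the class-0 cluster;
# index 10 is a class-0 point inside the class-1 cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.2), (0.05, 0.05),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1), (1.05, 1.05)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

only_class_0 = enn_keep(points, labels, target_classes={0})
both_classes = enn_keep(points, labels, target_classes={0, 1})
print(only_class_0)  # drops index 10 only
print(both_classes)  # drops indices 5 and 10
```

When only class 0 is cleaned, the misplaced class-1 point at index 5 survives; targeting both classes removes the misplaced points on either side of the boundary.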