## Oversampling and Undersampling

Oversampling methods duplicate examples or create new synthetic examples in the minority class, whereas undersampling methods delete examples in the majority class. Both types of sampling can be effective when used in isolation, although results can be better still when the two are used together.

After completing this tutorial, you will know:

- How to define a sequence of oversampling and undersampling methods to be applied to a training dataset or when evaluating a classifier model.

- How to manually combine oversampling and undersampling methods for imbalanced classification.

- How to use pre-defined and well-performing combinations of sampling methods for imbalanced classification.

#### Binary Test Problem and Decision Tree Model (baseline on imbalanced dataset)
 
Tying this together, the complete example of creating an imbalanced classification dataset and evaluating a decision tree on it is listed below.

Running the example reports the average ROC AUC for the decision tree on the dataset over three repeats of 10-fold cross-validation (i.e. the average over 30 different model evaluations).

```python
# baseline: evaluate a decision tree on the imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# generate 2 class dataset (about 99% majority, 1% minority)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Mean ROC AUC: 0.771


Now that we have a test problem, model, and test harness, let’s look at manual combinations of oversampling and undersampling methods.

The imbalanced-learn Python library provides a range of sampling techniques, as well as a Pipeline class that can be used to chain a sequence of sampling methods and a final model. The example below uses the Pipeline to first oversample the minority class with RandomOverSampler, then undersample the majority class with RandomUnderSampler, before fitting the decision tree.

```python
# combination of random oversampling and random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define model
model = DecisionTreeClassifier()
# define sampling: oversample the minority class, then undersample the majority class
over = RandomOverSampler(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
# define pipeline
pipeline = Pipeline(steps=[('o', over), ('u', under), ('m', model)])
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Mean ROC AUC: 0.814
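
The two `sampling_strategy` values used above can be read as target minority-to-majority ratios. The sketch below works through the arithmetic with plain Python, assuming class counts of 9,900 and 100 (the full dataset, not a single training fold). The helper names `oversample_counts` and `undersample_counts` are hypothetical and are not part of imbalanced-learn.

```python
def oversample_counts(n_majority, n_minority, ratio):
    """Counts after growing the minority class to ratio * n_majority."""
    return n_majority, max(n_minority, int(ratio * n_majority))


def undersample_counts(n_majority, n_minority, ratio):
    """Counts after shrinking the majority class to n_minority / ratio."""
    return min(n_majority, int(n_minority / ratio)), n_minority


# assumed starting distribution: 9,900 majority and 100 minority examples
counts = (9900, 100)
# oversampling with a ratio of 0.1 raises the minority class to 10% of the majority
counts = oversample_counts(*counts, ratio=0.1)
print(counts)  # (9900, 990)
# undersampling with a ratio of 0.5 cuts the majority class to twice the minority
counts = undersample_counts(*counts, ratio=0.5)
print(counts)  # (1980, 990)
```

Under these assumptions the model is fit on roughly a 2:1 class distribution rather than the original 99:1.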


#### SMOTE and Random Undersampling

Perhaps the most popular oversampling method is the Synthetic Minority Oversampling Technique, or SMOTE for short.  SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample as a point along that line.  The authors of the technique recommend using SMOTE on the minority class, followed by an undersampling technique on the majority class. 
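
The interpolation step can be illustrated with a few lines of NumPy. This is a minimal sketch of the idea rather than the imbalanced-learn implementation: the function name `smote_like_samples`, the brute-force neighbor search, and the toy points are all assumptions made for illustration.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # pairwise Euclidean distances between minority examples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # an example is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                        # pick a minority example
        neighbor = neighbors[base, rng.integers(k)]   # pick one of its k neighbors
        gap = rng.random()                            # position along the line, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic


# toy minority class with four examples
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(X_min, n_new=6, k=2)
print(new_points.shape)  # (6, 2)
```

Because each synthetic point lies on a line segment between two existing minority examples, it always falls inside the region already occupied by the minority class.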

```python
# combination of SMOTE and random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define model
model = DecisionTreeClassifier()
# define pipeline: SMOTE on the minority class, then random undersampling of the majority class
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under), ('m', model)]
pipeline = Pipeline(steps=steps)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Mean ROC AUC: 0.843


### Standard Combined Data Sampling Methods

There are combinations of oversampling and undersampling methods that have proven effective and that together may be considered sampling techniques in their own right. Two examples are the combination of SMOTE with Tomek Links undersampling and SMOTE with Edited Nearest Neighbors undersampling.


#### SMOTE and Tomek Links Undersampling

SMOTE is an oversampling method that synthesizes new plausible examples in the minority class.  Tomek Links refers to a method for identifying pairs of nearest neighbors in a dataset that have different classes.  Removing one or both of the examples in these pairs (such as the examples in the majority class) has the effect of making the decision boundary in the training dataset less noisy or ambiguous.
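
To make the definition concrete, the following sketch finds Tomek Links in a tiny hand-made dataset: two examples form a link when each is the other's nearest neighbor and their class labels differ. The function name `find_tomek_links` and the toy data are assumptions for illustration; this is not the imbalanced-learn implementation.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise Euclidean distances, ignoring the distance of a point to itself
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        # i < j reports each pair once; nearest[j] == i makes the relation mutual
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# three tight pairs of points; only the middle pair mixes the two classes
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [3.0, 3.0], [3.1, 3.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(find_tomek_links(X_demo, y_demo))  # [(2, 3)]
```

Removing the majority-class member of each reported pair (or both members) is what thins out the ambiguous region along the class boundary.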

Gustavo Batista, et al. tested combining these methods in their 2003 paper titled "Balancing Training Data for Automated Annotation of Keywords: A Case Study". Specifically, the SMOTE method is first applied to oversample the minority class to a balanced distribution, then examples in Tomek Links from the majority class are identified and removed.

The combination was shown to provide a reduction in false negatives at the cost of an increase in false positives for a binary classification task.

The imbalanced-learn library implements this combination in the SMOTETomek class. The SMOTE configuration can be set via the smote argument, which takes a configured SMOTE instance. The Tomek Links configuration can be set via the tomek argument, which takes a configured TomekLinks object.

The default is to balance the dataset with SMOTE and then remove Tomek Links from all classes. This is the approach used in another paper that explores this combination, titled "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data".

Alternatively, we can configure the combination to remove links only from the majority class, as described in the 2003 paper, by setting the tomek argument to a TomekLinks instance whose sampling_strategy argument is set to 'majority'.

In this case, the combined strategy improves on the baseline (about 0.815 versus 0.771) but offers no benefit over SMOTE with random undersampling (about 0.843) for this model on this dataset.

```python
# combined SMOTE and Tomek Links sampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define model
model = DecisionTreeClassifier()
# define sampling: SMOTE, then remove Tomek Links from the majority class only
resample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))
# define pipeline
pipeline = Pipeline(steps=[('r', resample), ('m', model)])
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Mean ROC AUC: 0.815


#### SMOTE and Edited Nearest Neighbors Undersampling

SMOTE may be the most popular oversampling technique and can be combined with many different undersampling techniques. Another very popular undersampling method is the Edited Nearest Neighbors, or ENN, rule. This rule uses k = 3 nearest neighbors to locate those examples in a dataset that are misclassified, which are then removed.

Gustavo Batista, et al. explore many combinations of oversampling and undersampling methods, compared to the methods used in isolation, in their 2004 paper titled "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data". This includes the combinations:

- Condensed Nearest Neighbors + Tomek Links
- SMOTE + Tomek Links
- SMOTE + Edited Nearest Neighbors

Regarding this final combination, the authors comment that ENN is more aggressive at downsampling the majority class than Tomek Links, providing more in-depth cleaning. They apply the method to remove examples from both the majority and minority classes.

*ENN is used to remove examples from both classes.  Thus, any example that is misclassified by its three nearest neighbors is removed from the training set.*
From "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data", 2004.
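
The rule in the quote can be sketched directly: classify each example by a majority vote of its three nearest neighbors and drop it if the vote disagrees with its own label. The function name `enn_keep_mask` and the toy data are assumptions for illustration, and the imbalanced-learn EditedNearestNeighbours class exposes further options, so its behavior may differ from this simplified version.

```python
import numpy as np


def enn_keep_mask(X, y, k=3):
    """Return a boolean mask that is True for examples the ENN rule would keep."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise Euclidean distances, ignoring the distance of a point to itself
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        # majority vote among the k nearest neighbors
        labels, votes = np.unique(y[neighbors[i]], return_counts=True)
        predicted = labels[votes.argmax()]
        keep[i] = predicted == y[i]
    return keep


# a class-1 example (index 4) sits in the middle of a class-0 cluster
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
                   [5, 5], [5, 6], [6, 5], [6, 6]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
keep = enn_keep_mask(X_demo, y_demo)
print(np.flatnonzero(~keep).tolist())  # [4]
```

Only the isolated class-1 example is removed here; applied after SMOTE, the same rule can also remove synthetic minority examples that land deep inside the majority region.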

The imbalanced-learn library implements this combination in the SMOTEENN class. The SMOTE configuration can be set as a SMOTE object via the smote argument, and the ENN configuration can be set as an EditedNearestNeighbours object via the enn argument. SMOTE defaults to balancing the distribution, followed by ENN, which by default removes misclassified examples from all classes. We could change the ENN to only remove examples from the majority class by setting the enn argument to an EditedNearestNeighbours instance with the sampling_strategy argument set to 'majority'.


In this case, we see a small further lift in performance over SMOTE with random undersampling, from about 0.843 to about 0.849, and a larger lift over SMOTE with Tomek Links (about 0.815).

```python
# combined SMOTE and Edited Nearest Neighbors sampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
# define model
model = DecisionTreeClassifier()
# define sampling: SMOTE, then ENN cleaning with the default configuration
resample = SMOTEENN()
# define pipeline
pipeline = Pipeline(steps=[('r', resample), ('m', model)])
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Mean ROC AUC: 0.849


This result highlights that editing the oversampled minority class may also be an important consideration that could easily be overlooked.  

This matches the finding of the 2004 paper, where the authors report that SMOTE with Tomek Links and SMOTE with ENN perform well across a range of datasets.