<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Data_Sampling_2b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores three additional oversampling techniques for imbalanced datasets: Borderline SMOTE, SVMSMOTE, and ADASYN.

In [None]:
!pip install six

# **Borderline SMOTE**

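Borderline SMOTE is used when you want to be selective about which minority class examples are oversampled.

Borderline SMOTE selects minority class examples that are misclassified by a k-nearest neighbor model, that is, examples whose nearest neighbors mostly belong to the majority class. <br>
These misclassified examples most likely overlap another class (they lie on or near the decision boundary), so synthetic examples are generated around them.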
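
To make the selection step concrete, here is a minimal sketch of how borderline minority examples could be identified with a k-nearest neighbor search. It is not imbalanced-learn's implementation: the neighborhood size `m`, the variable names (`X_demo`, `n_majority`, `danger`), and the noise/danger/safe thresholds (taken from the usual description of the method) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Same synthetic dataset that is used in the cells of this notebook
X_demo, y_demo = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, weights=[0.99],
                                     flip_y=0, random_state=1)

m = 10              # assumed neighborhood size (hypothetical choice)
minority_label = 1  # the rare class in this dataset
X_min = X_demo[y_demo == minority_label]

# Ask for m + 1 neighbors: each minority point is its own nearest neighbor
# (assuming no exact duplicate points), so column 0 is skipped below.
nn = NearestNeighbors(n_neighbors=m + 1).fit(X_demo)
_, idx = nn.kneighbors(X_min)

# Number of majority class examples among the m neighbors of each minority point
n_majority = (y_demo[idx[:, 1:]] != minority_label).sum(axis=1)

noise = n_majority == m                             # surrounded by the majority class
danger = (n_majority >= m / 2) & (n_majority < m)   # borderline: candidates for oversampling
safe = n_majority < m / 2                           # well inside the minority region

print("noise:", noise.sum(), "danger:", danger.sum(), "safe:", safe.sum())
```

Only the points flagged as `danger` would be used as seeds for synthetic examples in this sketch. The points flagged as `safe` or `noise` are left alone.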

**Import libraries**

In [None]:
# borderline-SMOTE for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
from matplotlib import pyplot
from numpy import where

**Define a dataset**

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

In [None]:
# transform the dataset
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()



---






# **Using SMOTE with a Support Vector Machine**

Recall that SVMs rely on the data instances closest to the decision boundary (the support vectors). If we use SMOTE or Borderline-SMOTE, we could skew that boundary. <br>

Instead, use SVMSMOTE, which is specifically designed for SVMs.

SVMSMOTE uses an SVM to find the decision boundary defined by the support vectors and identifies instances in the minority class that are close to the support vectors. These instances become the focus for generating synthetic examples.
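
As a rough illustration of the first step described above, the sketch below fits an SVM and extracts the support vectors that belong to the minority class. It uses scikit-learn's `SVC` directly and is not imbalanced-learn's internal code. The kernel settings and the names `X_demo`, `svm_demo`, and `minority_sv` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Same synthetic dataset that is used in the cells of this notebook
X_demo, y_demo = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, weights=[0.99],
                                     flip_y=0, random_state=1)

# Fit an SVM to approximate the decision boundary (assumed settings)
svm_demo = SVC(kernel="rbf", gamma="scale").fit(X_demo, y_demo)

# support_ holds the indices of the support vectors in the training data
sv_idx = svm_demo.support_
sv_labels = y_demo[sv_idx]

# Keep only the support vectors from the minority class (label 1)
minority_sv = X_demo[sv_idx[sv_labels == 1]]

print("support vectors:", len(sv_idx))
print("minority support vectors:", len(minority_sv))
```

In this sketch, `minority_sv` marks the region near the boundary where synthetic minority examples would be concentrated.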

**Import libraries**

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE
from matplotlib import pyplot
from numpy import where

**Create an imbalanced dataset**

In [None]:
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Use the SVMSMOTE class to transform the dataset**

In [None]:
# transform the dataset
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()



---






# **Adaptive Synthetic Sampling (ADASYN)**

ADASYN is used on imbalanced datasets to generate synthetic minority class instances in regions where minority examples are sparse, typically near the decision boundary. It generates few or no instances in regions where minority examples are dense.<br>
Minority examples with the most class overlap receive the most focus.
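
The weighting idea can be sketched as follows: for each minority example, measure the fraction of majority class points among its nearest neighbors, then allocate synthetic examples in proportion to that fraction. This is a simplified sketch based on the usual description of ADASYN, not imbalanced-learn's implementation. The neighborhood size `k` and the names `r`, `r_hat`, `G`, and `g` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Same synthetic dataset that is used in the cells of this notebook
X_demo, y_demo = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, weights=[0.99],
                                     flip_y=0, random_state=1)

k = 5  # assumed neighborhood size (hypothetical choice)
X_min = X_demo[y_demo == 1]

# k + 1 neighbors because each point is returned as its own nearest neighbor
# (assuming no exact duplicate points)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)
_, idx = nn.kneighbors(X_min)

# r[i]: fraction of majority class points among the k neighbors of minority point i
r = (y_demo[idx[:, 1:]] == 0).mean(axis=1)

# Normalize so the weights sum to 1 (fall back to uniform weights if all are zero)
r_hat = r / r.sum() if r.sum() > 0 else np.full(len(r), 1.0 / len(r))

# G: number of synthetic examples needed to balance the two classes
G = int((y_demo == 0).sum() - (y_demo == 1).sum())

# g[i]: synthetic examples to generate around minority point i
g = np.rint(r_hat * G).astype(int)

print("minority points that receive no synthetic examples:", int((g == 0).sum()))
print("largest allocation for a single point:", int(g.max()))
```

Minority points whose neighbors are all minority examples get a weight of zero, which matches the statement that dense minority regions receive no new instances.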

**Import libraries**

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from matplotlib import pyplot
from numpy import where

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Modify the dataset with ADASYN**

In [None]:
oversample = ADASYN()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)


In [None]:
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()