<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Data_Sampling_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://imbalanced-learn.org/dev/under_sampling.html#controlled-under-sampling


In [None]:
# Clone the entire repo.
!git clone https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo/images

In [None]:
from IPython.display import Image

# **Undersampling Techniques**

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import OneSidedSelection
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import CondensedNearestNeighbour

**Create a dataset**

In [None]:
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label 
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) 
pyplot.legend()
pyplot.show()

**Undersample imbalanced dataset with NearMiss-1**<br>
[NearMiss](https://imbalanced-learn.org/dev/under_sampling.html#controlled-under-sampling) adds heuristic rules for choosing which majority-class samples to keep. It implements 3 different heuristics, which are selected with the `version` parameter.

NearMiss-1 selects the majority-class samples whose average distance to their nearest minority-class neighbours is the smallest. The following figure uses 3 nearest neighbours to compute the average distance for 2 specific majority-class samples. The point linked by the green dashed lines is selected because its average distance is smaller.
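
The NearMiss-1 rule can be sketched in plain Python to make the selection step visible. This is only an illustration under simplifying assumptions, not the imbalanced-learn implementation: the function name `nearmiss1` and the parameters `n_keep` and `k` are hypothetical, and points are plain tuples.

```python
from math import dist

def nearmiss1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k nearest minority points is smallest (hypothetical helper)."""
    def score(p):
        # distances from p to every minority point, k smallest kept
        nearest = sorted(dist(p, q) for q in minority)[:k]
        return sum(nearest) / len(nearest)
    # smallest score first, so the closest majority points are retained
    return sorted(majority, key=score)[:n_keep]

majority = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
kept = nearmiss1(majority, minority, n_keep=2, k=3)
print(kept)
```

In this toy data the two majority points next to the minority cluster are kept and the two distant ones are dropped.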


In [None]:
Image("Screen Shot 2022-02-15 at 2.20.04 PM.png" , width=640)

**Example of NearMiss-1 undersampling**<br>

In [None]:
# define the undersampling method
undersample = NearMiss(version=1, n_neighbors=3)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Example of Undersampling with NearMiss-2**<br>
NearMiss-2 selects the majority-class samples whose average distance to their farthest minority-class neighbours is the smallest. With the same configuration as before, the sample linked by the green dashed lines is selected because its average distance to the 3 farthest neighbours is the smallest.
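
As a rough illustration of the NearMiss-2 rule, the sketch below scores each majority point by its mean distance to the `k` farthest minority points. The names `nearmiss2`, `n_keep`, and `k` are hypothetical, and the code is a plain-Python approximation rather than the library's implementation.

```python
from math import dist

def nearmiss2(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k farthest minority points is smallest (hypothetical helper)."""
    def score(p):
        # distances from p to every minority point, k largest kept
        farthest = sorted((dist(p, q) for q in minority), reverse=True)[:k]
        return sum(farthest) / len(farthest)
    return sorted(majority, key=score)[:n_keep]

# minority points form two groups, one at each end of the x axis
minority = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
majority = [(5.0, 1.0), (0.0, 1.0), (12.0, 1.0)]
kept = nearmiss2(majority, minority, n_keep=1, k=2)
print(kept)
```

Here the majority point in the middle is kept: it is not the closest to any single minority point, but it is the least far from the minority points on the opposite side.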

In [None]:
Image("Screen Shot 2022-02-15 at 2.27.39 PM.png" , width=640)

In [None]:
# define the undersampling method
undersample = NearMiss(version=2, n_neighbors=3)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()


**Undersampling with NearMiss-3**<br>
NearMiss-3 works in 2 steps. First, for each minority-class sample, its nearest majority-class neighbours are short-listed (the highlighted samples in the following plot). Then, from this short list, the majority-class samples with the largest average distance to their k nearest minority-class neighbours are selected.
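
The two steps can be sketched as follows. This is a simplified plain-Python illustration, not the library code: `nearmiss3` is a hypothetical helper, `m` stands in for the short-list size (the role `n_neighbors_ver3` plays in `NearMiss`), and `k` for the neighbourhood used in the second step.

```python
from math import dist

def nearmiss3(majority, minority, n_keep, m=3, k=3):
    """Two-step NearMiss-3 sketch (hypothetical helper)."""
    # Step 1: short-list the m nearest majority points of each minority point
    shortlist = set()
    for q in minority:
        shortlist.update(sorted(majority, key=lambda p: dist(p, q))[:m])

    # Step 2: among short-listed points, prefer the largest mean distance
    # to the k nearest minority points
    def score(p):
        nearest = sorted(dist(p, q) for q in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(shortlist, key=score, reverse=True)[:n_keep]

majority = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 9.0)]
minority = [(0.0, 0.0), (1.0, 0.0)]
kept = nearmiss3(majority, minority, n_keep=1, m=2, k=2)
print(kept)
```

The point `(0.0, 9.0)` is the farthest from the minority class overall, but it never enters the short list, so the selected sample is the farthest one among the short-listed neighbours.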

In [None]:
Image("Screen Shot 2022-02-15 at 2.33.33 PM.png" , width=640)

In [None]:
# Undersample imbalanced dataset with NearMiss-3
# define the undersampling method
undersample = NearMiss(version=3, n_neighbors_ver3=3)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Condensed Nearest Neighbor Undersampling**<br>
[CondensedNearestNeighbour](https://imbalanced-learn.org/stable/under_sampling.html#condensed-nearest-neighbors) uses a 1-nearest-neighbour rule to decide iteratively whether each majority-class sample should be kept or removed.
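
A minimal sketch of the condensing idea, assuming the classic formulation: start a store with every minority sample plus one majority seed, then add any sample that the store misclassifies with a 1-NN rule. The function `condensed_nn` and its arguments are hypothetical, and the library's own procedure may differ in details such as how seeds are drawn.

```python
from math import dist

def condensed_nn(points, labels, minority_label, seed_index):
    """Return sorted indices kept by a simple condensing pass (sketch)."""
    # the store starts with all minority samples and one majority seed
    store = [i for i, lab in enumerate(labels) if lab == minority_label]
    store.append(seed_index)
    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for i, p in enumerate(points):
            if i in store:
                continue
            # classify p by its single nearest neighbour in the store
            nearest = min(store, key=lambda j: dist(p, points[j]))
            if labels[nearest] != labels[i]:
                store.append(i)  # misclassified, so it carries information
                changed = True
    return sorted(store)

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0), (7.0, 0.0)]
labels = [1, 0, 0, 0, 0]  # index 0 is the only minority sample
kept = condensed_nn(points, labels, minority_label=1, seed_index=2)
print(kept)
```

The majority point next to the minority sample is kept because the initial store misclassifies it, while the two points deep inside the majority region are discarded as redundant.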

In [None]:
# undersample and plot imbalanced dataset with the Condensed Nearest Neighbor Rule
# define the undersampling method
undersample = CondensedNearestNeighbour(n_neighbors=1)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

CondensedNearestNeighbour is sensitive to noise: noisy samples tend to be misclassified by the 1-nearest-neighbour rule, so they are added to the retained set.



**One Sided Selection Undersampling**<br>
In contrast to CondensedNearestNeighbour, OneSidedSelection uses TomekLinks to remove noisy samples.
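
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below finds such pairs in plain Python; `tomek_links` is a hypothetical helper written for illustration and is not the `TomekLinks` class from the library.

```python
from math import dist

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links (sketch)."""
    def nearest(i):
        # index of the closest other point
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # mutual nearest neighbours with different labels
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links

points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
labels = [0, 1, 0, 0, 1]
links = tomek_links(points, labels)
print(links)
```

Only the first two points form a link. A cleaning step built on this idea would then drop the majority-class member of each link, which sits right on the class boundary or inside the other class.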

In [None]:

# undersample and plot imbalanced dataset with One-Sided Selection (uses Tomek Links)
# define the undersampling method
oss = OneSidedSelection(random_state=0)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = oss.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Edited Nearest Neighbor Undersampling**<br>
EditedNearestNeighbours applies a nearest-neighbours algorithm and “edits” the dataset by removing samples that do not agree “enough” with their neighbourhood. For each sample in the class to be under-sampled, the nearest neighbours are computed, and if the selection criterion is not fulfilled, the sample is removed.
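
The editing rule can be sketched as below. This is a simplified plain-Python illustration: `edited_nn` is a hypothetical helper, and its `kind_sel` argument mimics the two selection criteria of the same name in `EditedNearestNeighbours` (`"all"` requires every neighbour to share the sample's class, `"mode"` only requires the most common neighbour class to match).

```python
from collections import Counter
from math import dist

def edited_nn(points, labels, target_label, k=3, kind_sel="all"):
    """Return indices kept after editing the target class (sketch)."""
    keep = []
    for i, p in enumerate(points):
        if labels[i] != target_label:
            keep.append(i)  # only the target class is under-sampled
            continue
        # the k nearest other points, regardless of class
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: dist(p, points[j]))[:k]
        neighbour_labels = [labels[j] for j in neighbours]
        if kind_sel == "all":
            agrees = all(lab == labels[i] for lab in neighbour_labels)
        else:  # "mode": majority vote among the neighbours
            agrees = Counter(neighbour_labels).most_common(1)[0][0] == labels[i]
        if agrees:
            keep.append(i)
    return keep

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (9.0, 0.0), (10.0, 0.0), (9.5, 0.1)]
labels = [0, 0, 0, 0, 1, 1, 0]  # the last majority point sits among the minority
kept = edited_nn(points, labels, target_label=0, k=3)
print(kept)
```

The majority sample at index 6 is removed because two of its three nearest neighbours belong to the minority class. All other samples are kept.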

In [None]:
# undersample and plot imbalanced dataset with the Edited Nearest Neighbor rule
# define the undersampling method
undersample = EditedNearestNeighbours(n_neighbors=3)
# transform the dataset; X and y stay untouched so later cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

In [None]:
# undersample and plot imbalanced dataset with One-Sided Selection, using 200 seed samples
# define the undersampling method
undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)
# transform the dataset; X and y stay untouched so other cells start from the original data
X_res, y_res = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y_res)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y_res == label)[0]
  pyplot.scatter(X_res[row_ix, 0], X_res[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

**Assignment**<br>
1. Change the class weights (the `weights` argument of `make_classification`). What is the impact on the performance of each undersampling method?