# Rationale
This document describes how to proceed when you are dealing with unbalanced data, for example a binary categorical dependent variable that takes the values 1 or 0, where one of the values occurs far more often than the other.

First, it is worth mentioning that there is no silver bullet for this issue, but one possible solution is to generate new samples for the classes that are under-represented. Several oversampling methods exist; one of the most popular is the Synthetic Minority Oversampling Technique (SMOTE), which synthetically creates new samples for the under-represented category. This document shows how to use this method.
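
To make the idea of "synthetically creating" samples concrete, here is a minimal pure-Python sketch of the interpolation step usually described for SMOTE: pick a minority sample, pick one of its nearest minority-class neighbours, and place a new point somewhere on the segment between them. The function name `smote_like_samples` and its parameters are hypothetical, chosen for illustration only; this is not the imbalanced-learn implementation.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Illustrative sketch: build n_new synthetic points by interpolating
    between minority samples and their nearest minority neighbours.
    Assumes `minority` holds at least two points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # The new point sits at a random position on the segment base -> neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already covered by the minority class rather than being exact duplicates.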


Now, let's generate a 3-class classification problem using `make_classification` in which the third class has far more samples than the other two (see the `weights` parameter):

In [8]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)

If we check the counts for each of the classes:

In [9]:
from collections import Counter

print(sorted(Counter(y).items()))

[(0, 64), (1, 262), (2, 4674)]
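
As a quick way to quantify the imbalance, the counts printed above can be turned into class shares and a majority-to-minority ratio. This is only a small sketch; the variable names (`counts`, `shares`, `imbalance_ratio`) are illustrative, and the counts are copied by hand from the output above.

```python
# Class counts copied from the output above
counts = {0: 64, 1: 262, 2: 4674}

total = sum(counts.values())
# Fraction of all samples that belongs to each class
shares = {label: n / total for label, n in counts.items()}
# How many majority samples exist for every sample of the rarest class
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label}: {share:.2%}")
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

The largest class holds roughly 93% of the 5000 samples, about 73 times as many as the smallest one.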


The classes are clearly unbalanced, so we use SMOTE to oversample the minority classes:

In [11]:
from imblearn.over_sampling import SMOTE

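# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
X_resampled, y_resampled = SMOTE().fit_resample(X, y)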

We can see that the counts are now balanced:

In [13]:
print(sorted(Counter(y_resampled).items()))

[(0, 4674), (1, 4674), (2, 4674)]
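
To see how much of the resampled data is synthetic, the counts before and after oversampling can be compared. This is a small sketch using the two outputs shown in this document; the names `before`, `after`, and `added` are illustrative.

```python
# Counts copied from the two outputs in this document
before = {0: 64, 1: 262, 2: 4674}
after = {0: 4674, 1: 4674, 2: 4674}

# Number of synthetic samples created for each class
added = {label: after[label] - before[label] for label in before}
total_added = sum(added.values())

print(added)
print(total_added)
```

In this run, each minority class was brought up to the size of the majority class, which itself was left unchanged. Most of the samples in the two smaller classes are therefore synthetic, which is worth keeping in mind when evaluating a model trained on the resampled data.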
