# Under- and Over-representation

In some cases, the dataset may not be representative of the distribution. One way of tackling this is by enriching the dataset through additional sampling.

- *Under*sampling: out of many, choose few. Shrinks the dataset, may reduce information
- *Over*sampling: out of few, make many. Enlarges the dataset, may introduce noise

How do we sample?

| Strategy      | Speed    | Likelihood    | Variance | Assumption                          |
| ------------- | -------- | ------------- | -------- | ----------------------------------- |
| Randomly      | High     | Low           | High     | High space density                  |
| Interpolation | High     | Medium to Low | Medium   | High space density                  |
| Learning      | Low      | High          |          | We can induce the data distribution |

Once sampled, the new instances are added to the original dataset, hopefully improving downstream tasks.

---

In [24]:
import pandas as pd
import numpy as np

adult = pd.read_csv('./../data/adult_imbalanced.csv')

In [25]:
adult

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,relationship_num,race_num,is_male_num,occupation_num,marital_status_num,workclass_num,native_country_num,label_num
0,39,77516,13,2174,0,40,1,4,1,1,4,7,39,0
1,50,83311,13,0,0,13,0,4,1,4,2,6,39,0
2,38,215646,9,0,0,40,1,4,1,6,0,4,39,0
3,53,234721,7,0,0,40,0,2,1,6,2,4,39,0
4,28,338409,13,0,0,40,5,2,0,10,2,4,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29812,40,237601,13,0,0,55,1,3,0,12,4,4,39,1
29813,39,329980,13,0,2415,60,0,4,1,12,2,5,39,1
29814,24,145964,13,0,0,40,4,4,1,4,4,4,39,1
29815,60,207665,9,0,0,40,5,4,0,13,2,4,39,1


In [26]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29817 entries, 0 to 29816
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   age                 29817 non-null  int64
 1   fnlwgt              29817 non-null  int64
 2   education_num       29817 non-null  int64
 3   capital_gain        29817 non-null  int64
 4   capital_loss        29817 non-null  int64
 5   hours_per_week      29817 non-null  int64
 6   relationship_num    29817 non-null  int64
 7   race_num            29817 non-null  int64
 8   is_male_num         29817 non-null  int64
 9   occupation_num      29817 non-null  int64
 10  marital_status_num  29817 non-null  int64
 11  workclass_num       29817 non-null  int64
 12  native_country_num  29817 non-null  int64
 13  label_num           29817 non-null  int64
dtypes: int64(14)
memory usage: 3.2 MB


Let's check how data are distributed

In [27]:
adult['label_num'].value_counts(True)

label_num
0    0.829057
1    0.170943
Name: proportion, dtype: float64

In [28]:
from sklearn.model_selection import train_test_split
label = adult.pop('label_num')
train_set, test_set, train_label, test_label = train_test_split(adult, label, stratify =label, test_size=0.30)

### DecisionTree

In [29]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='gini', splitter='best', 
                                  max_depth=10, 
                                  min_samples_split=3, min_samples_leaf=4)
dt = dt.fit(train_set, train_label)

In [30]:
from sklearn.metrics import classification_report
test_pred = dt.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.90      0.96      0.93      7417
         >50       0.73      0.49      0.58      1529

    accuracy                           0.88      8946
   macro avg       0.82      0.72      0.76      8946
weighted avg       0.87      0.88      0.87      8946



The recall for class >50 is 0.49. This is due to the fact that the dataset is imablanced.
Let's try a different classifier.

### SVM

In [32]:
from sklearn.svm import SVC
svm = SVC(gamma='scale')
svm.fit(train_set, train_label)

In [33]:
test_pred = svm.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.84      1.00      0.91      7417
         >50       0.98      0.10      0.18      1529

    accuracy                           0.85      8946
   macro avg       0.91      0.55      0.55      8946
weighted avg       0.87      0.85      0.79      8946



### AdaBoost

In [34]:
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
clf.fit(train_set, train_label)



In [35]:
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.90      0.97      0.94      7417
         >50       0.78      0.49      0.60      1529

    accuracy                           0.89      8946
   macro avg       0.84      0.73      0.77      8946
weighted avg       0.88      0.89      0.88      8946



Even if with AdaBoost we can obtain a better result, the perfomance are still suffering from the great imbalance. Let's see possible solutions.

### Weights to the model

We can give weights to the model, in this way we can say that the examples of a given class are more important, hence if an error occurs, it is worst than an error for the other class. 
This approach is available in any classifier from sklearn and keras, pytorch.

In [36]:
weights = {0:1.0, 1:100.0}
svm = SVC(gamma='scale', class_weight=weights)
svm.fit(train_set, train_label)

In [37]:
test_pred = svm.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       1.00      0.00      0.00      7417
         >50       0.17      1.00      0.29      1529

    accuracy                           0.17      8946
   macro avg       0.59      0.50      0.15      8946
weighted avg       0.86      0.17      0.05      8946



Clearly, these weights are not well suited for the problem at hand. We can run a grid search to see what combination of weights is better in this setting.

In [38]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:10}, 'balanced']
param_grid = dict(class_weight=balance)
grid_search = GridSearchCV(estimator=svm, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')

In [39]:
grid_result = grid_search.fit(test_set, test_label)

In [40]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.632689 using {'class_weight': 'balanced'}
0.476906 (0.030871) with: {'class_weight': {0: 100, 1: 1}}
0.592562 (0.024993) with: {'class_weight': {0: 10, 1: 1}}
0.532323 (0.023162) with: {'class_weight': {0: 1, 1: 10}}
0.632689 (0.025197) with: {'class_weight': 'balanced'}


In [41]:
svm = SVC(gamma='scale', class_weight='balanced')
svm.fit(train_set, train_label)

In [42]:
test_pred = svm.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.86      0.99      0.92      7417
         >50       0.84      0.19      0.30      1529

    accuracy                           0.85      8946
   macro avg       0.85      0.59      0.61      8946
weighted avg       0.85      0.85      0.81      8946



### Undersampling or Oversampling 

In [43]:
import imblearn
from imblearn.over_sampling import RandomOverSampler
#duplicate
oversample = RandomOverSampler(sampling_strategy='minority')
adult_o, label_o = oversample.fit_resample(train_set, train_label)


In [44]:
label_o.value_counts(True)

label_num
1    0.5
0    0.5
Name: proportion, dtype: float64

In [45]:
from imblearn.under_sampling import RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
adult_u, label_u = undersample.fit_resample(train_set, train_label)

In [46]:
label_u.value_counts(True)

label_num
0    0.5
1    0.5
Name: proportion, dtype: float64

In [47]:
oversample = RandomOverSampler(sampling_strategy=0.60)
adult_o_o, label_o_o = oversample.fit_resample(train_set, train_label)
undersample = RandomUnderSampler(sampling_strategy=0.70)
adult_o_u, label_o_u = undersample.fit_resample(adult_o_o, label_o_o)

In [48]:
label_o_u.value_counts(True)

label_num
0    0.588235
1    0.411765
Name: proportion, dtype: float64

In [49]:
clf = AdaBoostClassifier()
clf.fit(adult_o, label_o)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))



              precision    recall  f1-score   support

        <=50       0.97      0.81      0.88      7417
         >50       0.48      0.86      0.62      1529

    accuracy                           0.82      8946
   macro avg       0.72      0.83      0.75      8946
weighted avg       0.88      0.82      0.83      8946



In [50]:
clf = AdaBoostClassifier()
clf.fit(adult_u, label_u)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.97      0.80      0.88      7417
         >50       0.48      0.86      0.61      1529

    accuracy                           0.81      8946
   macro avg       0.72      0.83      0.75      8946
weighted avg       0.88      0.81      0.83      8946





In [51]:
clf = AdaBoostClassifier()
clf.fit(adult_o_u, label_o_u)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))



              precision    recall  f1-score   support

        <=50       0.95      0.85      0.90      7417
         >50       0.53      0.79      0.63      1529

    accuracy                           0.84      8946
   macro avg       0.74      0.82      0.77      8946
weighted avg       0.88      0.84      0.85      8946



In [52]:
clf = SVC()
clf.fit(adult_o_u, label_o_u)
test_pred = clf.predict(adult_o_u)
print(classification_report(label_o_u, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.63      0.99      0.77     14830
         >50       0.95      0.18      0.30     10381

    accuracy                           0.66     25211
   macro avg       0.79      0.59      0.54     25211
weighted avg       0.77      0.66      0.58     25211



In [53]:
oversample = RandomOverSampler(sampling_strategy=0.30)
adult_o, label_o = oversample.fit_resample(train_set, train_label)

In [54]:
label_o.value_counts(True)

label_num
0    0.769262
1    0.230738
Name: proportion, dtype: float64

In [55]:
clf = SVC()
clf.fit(adult_o, label_o)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.85      1.00      0.92      7417
         >50       0.98      0.14      0.24      1529

    accuracy                           0.85      8946
   macro avg       0.92      0.57      0.58      8946
weighted avg       0.87      0.85      0.80      8946



In [56]:
undersample = RandomUnderSampler(sampling_strategy=0.30)
adult_u, label_u = undersample.fit_resample(train_set, train_label)

In [57]:
clf = SVC()
clf.fit(adult_u, label_u)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.85      1.00      0.92      7417
         >50       0.98      0.11      0.20      1529

    accuracy                           0.85      8946
   macro avg       0.91      0.56      0.56      8946
weighted avg       0.87      0.85      0.79      8946



### SMOTE

In [58]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy=0.3)
adult_s, label_s = oversample.fit_resample(train_set, train_label)

In [59]:
clf = SVC(kernel='sigmoid', gamma='scale')
clf.fit(adult_s, label_s)
test_pred = clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.84      0.78      0.81      7417
         >50       0.20      0.26      0.22      1529

    accuracy                           0.69      8946
   macro avg       0.52      0.52      0.52      8946
weighted avg       0.73      0.69      0.71      8946



In [63]:
dt = tree.DecisionTreeClassifier(criterion='gini', splitter='best', 
                                  max_depth=1, 
                                  min_samples_split=3, min_samples_leaf=4)
dt = dt.fit(adult_s, label_s)
test_pred_dt = dt.predict(test_set)
print(classification_report(test_label, test_pred_dt, target_names=['<=50', '>50']))

              precision    recall  f1-score   support

        <=50       0.83      1.00      0.91      7417
         >50       0.00      0.00      0.00      1529

    accuracy                           0.83      8946
   macro avg       0.41      0.50      0.45      8946
weighted avg       0.69      0.83      0.75      8946



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [64]:
oversample = SMOTE(sampling_strategy=0.3)
adult_s, label_s = oversample.fit_resample(train_set, train_label)
undersample = RandomUnderSampler(sampling_strategy=0.40)
adult_s_u, label_s_u = undersample.fit_resample(adult_s, label_s)

In [65]:
clf = AdaBoostClassifier()
clf.fit(adult_s_u, label_s_u)
clf.predict(test_set)
print(classification_report(test_label, test_pred, target_names=['<=50', '>50']))



              precision    recall  f1-score   support

        <=50       0.84      0.78      0.81      7417
         >50       0.20      0.26      0.22      1529

    accuracy                           0.69      8946
   macro avg       0.52      0.52      0.52      8946
weighted avg       0.73      0.69      0.71      8946



# Automatic Search

### Oversampling

In [2]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.over_sampling import ADASYN

import pandas
import numpy


categorical_features = dataset.select_dtypes(exclude="number").columns.tolist()
models = [
    RandomOverSampler(random_state=RANDOM_STATE),
    SMOTE(random_state=RANDOM_STATE, k_neighbors=10),
    SMOTENC(random_state=RANDOM_STATE, k_neighbors=10, categorical_features=categorical_features),
    ADASYN(random_state=RANDOM_STATE, n_neighbors=10)
]
oversampling_algorithms = [
    "random",
    "smote_interpolation",
    "smote_interpolation_w_categorical",
    "adasyn"
]
oversampled_datasets = list()

for algorithm, model in zip(oversampling_algorithms, models):
    if algorithm in ("smote_interpolation", "adasyn"):
        oversampled_data, oversampled_labels = model.fit_resample(data_only_dataset.select_dtypes(include="number"), label_only_dataset)
        oversampled_dataset = pandas.DataFrame(numpy.hstack((oversampled_data, oversampled_labels)), columns=dataset.select_dtypes(include="number").columns)
    else:
        oversampled_data, oversampled_labels = model.fit_resample(data_only_dataset, label_only_dataset)
        oversampled_dataset = pandas.DataFrame(numpy.hstack((oversampled_data, oversampled_labels)), columns=dataset.columns)
    oversampled_dataset["algorithm"] = algorithm

    oversampled_datasets.append(oversampled_dataset)

---

### Undersampling

In [66]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import CondensedNearestNeighbour
from imblearn.under_sampling import ClusterCentroids

import pandas
import numpy


models = [
    RandomUnderSampler(random_state=RANDOM_STATE),
    CondensedNearestNeighbour(random_state=RANDOM_STATE),
    ClusterCentroids(random_state=RANDOM_STATE)
]
undersampling_algorithms = [
    "random",
    "condensed_rule",
    "centroids"
]
undersampled_datasets = list()

for algorithm, model in zip(undersampling_algorithms, models):
    print(algorithm)
    if algorithm in ("condensed_rule", "centroids"):
        undersampled_data, undersampled_labels = model.fit_resample(data_only_dataset.select_dtypes(include="number"), label_only_dataset)
        undersampled_dataset = pandas.DataFrame(numpy.hstack((undersampled_data, undersampled_labels)), columns=dataset.select_dtypes(include="number").columns)
    else:
        undersampled_data, undersampled_labels = model.fit_resample(data_only_dataset, label_only_dataset)
        undersampled_dataset = pandas.DataFrame(numpy.hstack((undersampled_data, undersampled_labels)), columns=dataset.columns)
    undersampled_dataset["algorithm"] = algorithm

    undersampled_datasets.append(undersampled_dataset)

---

# Validation

How good is the generated data? We use a statistical test ([Kolmogorov-Smirnov 2-sample test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)) to verify that indeed, the sampled and original dataset are sampled from the same distribution.

In [None]:
from scipy.stats import ks_2samp


tests_per_algorithm = list()
for algorithm, oversampled_dataset in zip(oversampling_algorithms, oversampled_datasets):
    columns = oversampled_dataset.columns
    test_results = [
        ks_2samp(
            dataset[column],
            oversampled_dataset[column],
            alternative="two-sided"
        )
        for column in columns if column != "algorithm"
    ]
    test_data = [(
        test.statistic,
        test.pvalue,
        test.statistic_location
        )
        for test in test_results        
    ]
    test_data = pandas.DataFrame(test_data, columns=["KS_test", "p_value", "margin"])
    test_data["algorithm"] = algorithm

    tests_per_algorithm.append(test_data)

validation = pandas.concat(tests_per_algorithm, axis="rows")
validation.groupby("algorithm").describe()