# Q2 Imbalanced Data Classification

## Reference

* Class Imbalance in Machine Learning (3): Sampling Methods: https://www.cnblogs.com/massquantity/p/9382710.html
* Handling Imbalanced Datasets: https://www.cnblogs.com/kamekin/p/9824294.html
* imbalanced-learn documentation: https://imbalanced-learn.org/stable/index.html

In [244]:
import numpy as np
import pandas as pd
from collections import Counter

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import RUSBoostClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Binary-class Datasets

`v_train.csv` and `p_train.csv` are binary-class datasets (labels: positive, negative).

### Use SMOTE on `v_train.csv` dataset

In [71]:
Xv = pd.read_csv("v/v_train.csv", names=[0,1,2,3,4,5,6,7,8,9,'label'])
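# Encode the labels as integers (the raw labels carry a leading space).
# A single .loc call avoids chained assignment and its SettingWithCopyWarning.
Xv.loc[Xv.label == ' negative', 'label'] = 0
Xv.loc[Xv.label == ' positive', 'label'] = 1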

In [207]:
X = Xv.iloc[:,0:10]
y = Xv.iloc[:, 10]

In [212]:
print("Before SMOTE")
print(Counter(Xv.label))

smote = SMOTE(random_state=0) 

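# fit_resample replaces the older fit_sample, which later imbalanced-learn releases no longer provide
X_smote, y_smote = smote.fit_resample(X, y.astype('int'))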

print("After SMOTE")
print(Counter(y_smote))

X_train, X_val, y_train, y_val = train_test_split(X_smote, y_smote, test_size=0.2)

# Random Forest Model
rf = RandomForestClassifier(n_estimators=5, random_state=0, max_depth=2)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)  # predict on the validation set
print("Random Forest evaluation on validation set")
print(classification_report(y_val, y_pred))  # show the evaluation result

Before SMOTE
Counter({0: 867, 1: 88})
After SMOTE
Counter({1: 867, 0: 867})
Random Forest evaluation on validation set
              precision    recall  f1-score   support

           0       0.92      0.88      0.90       160
           1       0.90      0.93      0.92       187

    accuracy                           0.91       347
   macro avg       0.91      0.91      0.91       347
weighted avg       0.91      0.91      0.91       347
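
SMOTE balances the classes (88 to 867 positives here) by creating synthetic minority samples rather than duplicating existing ones: each new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that interpolation step, not the imbalanced-learn implementation; `smote_sketch` and the toy points are hypothetical and only illustrate the idea.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D minority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already covered by the minority class.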



In [213]:
# Predict on testing set
# Note: read_csv treats the first row as a header here; pass header=None if the file has none
Xv_test = pd.read_csv("v/v_test.csv")
X_test = Xv_test.iloc[:, :10]
pred = rf.predict(X_test)

# Map 0/1 back to negative/positive in one step (avoids chained assignment)
Xv_test['label'] = pd.Series(pred, index=Xv_test.index).map({0: 'negative', 1: 'positive'})


In [214]:
# Save to CSV file
Xv_test.to_csv("v_test_pred.csv", index=False)
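
In the SMOTE cell the validation split is drawn after resampling, so the validation set contains synthetic points and is roughly balanced (160 vs 187), unlike the original 867 to 88 ratio. The reported scores may therefore be optimistic. One common alternative is to split first and resample only the training portion. The sketch below shows that ordering on toy labels with the same class counts; plain random duplication stands in for SMOTE, and `stratified_split` and `oversample` are hypothetical helpers, not imbalanced-learn or scikit-learn functions.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, val_idx), holding out test_size of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


def oversample(indices, labels, seed=0):
    """Duplicate minority indices at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy labels with the class counts of v_train.csv (867 negative, 88 positive).
labels = [0] * 867 + [1] * 88
train_idx, val_idx = stratified_split(labels)
train_balanced = oversample(train_idx, labels)

print(Counter(labels[i] for i in train_balanced))  # balanced training set
print(Counter(labels[i] for i in val_idx))         # validation keeps the original ratio
```

With this ordering the validation set contains only original samples at the original class ratio, so its metrics are closer to what the model would see on unseen data.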

### Use SMOTEENN on `p_train.csv` dataset

In [185]:
Xp = pd.read_csv("p/p_train.csv", names=[0,1,2,3,4,5,6,7,'label'])
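# Encode the labels as integers (the raw labels carry a leading space).
# A single .loc call avoids chained assignment and its SettingWithCopyWarning.
Xp.loc[Xp.label == ' negative', 'label'] = 0
Xp.loc[Xp.label == ' positive', 'label'] = 1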

In [190]:
X = Xp.iloc[:,0:8]
y = Xp.iloc[:, 8]

In [191]:
print("Before SMOTEENN")
print(Counter(Xp.label))

sme = SMOTEENN(random_state=27)
X_sme, y_sme = sme.fit_resample(X, y.astype('int'))

print("After SMOTEENN")
print(Counter(y_sme))

X_train, X_val, y_train, y_val = train_test_split(X_sme, y_sme, test_size=0.2)

# Random Forest Model
rf = RandomForestClassifier(n_estimators=5, random_state=0, max_depth=2)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)  # predict on the validation set
print("Random Forest evaluation on validation set")
print(classification_report(y_val, y_pred))  # show the evaluation result

Before SMOTEENN
Counter({0: 484, 1: 261})
After SMOTEENN
Counter({1: 266, 0: 213})
Random Forest evaluation on validation set
              precision    recall  f1-score   support

           0       0.86      0.76      0.81        41
           1       0.83      0.91      0.87        55

    accuracy                           0.84        96
   macro avg       0.85      0.83      0.84        96
weighted avg       0.85      0.84      0.84        96
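
Unlike plain SMOTE, SMOTEENN can return fewer samples than it received (class 0 drops from 484 to 213 here). After SMOTE oversamples the minority class, an Edited Nearest Neighbours (ENN) step removes samples whose label disagrees with their neighbourhood, which thins out overlapping regions. The sketch below is a simplified majority-vote version of that cleaning step in pure Python; `enn_clean` and the toy data are hypothetical, and the selection rule in imbalanced-learn may differ.

```python
import math
from collections import Counter


def enn_clean(X, y, k=3):
    """Keep only samples whose label matches the majority label of their
    k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Indices of the k closest other samples by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        majority_label = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority_label == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Two 1-D clusters plus one class-1 point sitting inside the class-0 cluster.
X = [[0.0], [0.1], [0.2], [0.3], [0.15], [5.0], [5.1], [5.2], [5.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
X_clean, y_clean = enn_clean(X, y)
print(Counter(y), "->", Counter(y_clean))
```

In this toy example only the class-1 point at 0.15 is dropped, since all of its nearest neighbours belong to class 0.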



In [205]:
# Predict on testing set
# Note: read_csv treats the first row as a header here; pass header=None if the file has none
Xp_test = pd.read_csv("p/p_test.csv")
X_test = Xp_test.iloc[:, :8]
pred = rf.predict(X_test)

# Map 0/1 back to negative/positive in one step (avoids chained assignment)
Xp_test['label'] = pd.Series(pred, index=Xp_test.index).map({0: 'negative', 1: 'positive'})


In [206]:
# Save to CSV file
Xp_test.to_csv("p_test_pred.csv", index=False)

## Multi-class Datasets

`y_train.csv`, `e_train.csv` and `a_train.csv` are multi-class datasets.

## `y_train.csv`

In [222]:
Xy = pd.read_csv("y/y_train.csv", names=[0,1,2,3,4,5,6,7,'label'])

# Encode the ten class names as integers 0-9 with a single mapping
class_to_id = {'MIT': 0, 'NUC': 1, 'CYT': 2, 'ME1': 3, 'EXC': 4,
               'ME2': 5, 'ME3': 6, 'VAC': 7, 'POX': 8, 'ERL': 9}
Xy['label'] = Xy['label'].map(class_to_id)

In [223]:
Counter(Xy.label)

Counter({0: 234,
         1: 410,
         2: 450,
         3: 36,
         4: 33,
         5: 46,
         6: 140,
         7: 30,
         8: 18,
         9: 5})
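
To quantify how skewed these counts are, the short sketch below computes the total, the ratio between the largest and smallest class, and each class's share. The `counts` dictionary is copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the Counter output for y_train.csv.
counts = {0: 234, 1: 410, 2: 450, 3: 36, 4: 33, 5: 46, 6: 140, 7: 30, 8: 18, 9: 5}

total = sum(counts.values())
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total samples: {total}")
print(f"imbalance ratio (largest / smallest class): {imbalance_ratio:.0f}")
# List the classes from rarest to most common with their share of the data.
for label, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"class {label}: {n:4d} ({n / total:.1%})")
```

With only 5 samples in class 9, a 20% validation split holds about one of them, so per-class scores for the rarest classes are likely to vary a lot between random splits.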

In [232]:
X = Xy.iloc[:,0:8]
y = Xy.iloc[:, 8]

In [254]:
X_train, X_val, y_train, y_val = train_test_split(X, y.astype('int'), test_size=0.2)

# Balanced Random Forest model
brf = BalancedRandomForestClassifier(n_estimators=20, max_depth=2, random_state=28)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_val)  # predict on the validation set
print("Balanced Random Forest evaluation on validation set")
print(classification_report(y_val, y_pred))  # show the evaluation result

Balanced Random Forest evaluation on validation set
              precision    recall  f1-score   support

           0       0.23      0.27      0.25        41
           1       0.46      0.47      0.47        87
           2       0.38      0.03      0.06        94
           3       0.33      0.14      0.20         7
           4       0.25      0.75      0.38         4
           5       0.50      0.22      0.31         9
           6       0.64      0.74      0.69        31
           7       0.00      0.00      0.00         4
           8       0.02      0.50      0.03         2
           9       0.40      1.00      0.57         2

    accuracy                           0.31       281
   macro avg       0.32      0.41      0.29       281
weighted avg       0.40      0.31      0.30       281
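
The `macro avg` and `weighted avg` rows differ because the macro average treats every class equally, while the weighted average weights each class by its support. On imbalanced data the macro value is usually the more informative one for minority classes. As a worked check, the sketch below recomputes both recall averages from the per-class rows of the report above; the `rows` list is copied from that table.

```python
# (recall, support) per class, copied from the Balanced Random Forest report.
rows = [(0.27, 41), (0.47, 87), (0.03, 94), (0.14, 7), (0.75, 4),
        (0.22, 9), (0.74, 31), (0.00, 4), (0.50, 2), (1.00, 2)]

total_support = sum(support for _, support in rows)
# Macro average: plain mean over classes, regardless of class size.
macro_recall = sum(recall for recall, _ in rows) / len(rows)
# Weighted average: each class contributes in proportion to its support.
weighted_recall = sum(recall * support for recall, support in rows) / total_support

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted recall equals the overall accuracy, which is why both are dominated by the poor recall on class 2, the largest class in this split.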



In [247]:
X_train, X_val, y_train, y_val = train_test_split(X, y.astype('int'), test_size=0.2)

# RUSBoost model
rusb = RUSBoostClassifier(n_estimators=50, learning_rate=1e-3, random_state=6)
rusb.fit(X_train, y_train)
y_pred = rusb.predict(X_val)  # predict on the validation set
print("RUSBoost Classifier evaluation on validation set")
print(classification_report(y_val, y_pred))  # show the evaluation result

RUSBoost Classifier evaluation on validation set
              precision    recall  f1-score   support

           0       0.59      0.19      0.29        52
           1       0.37      0.23      0.28        80
           2       0.42      0.54      0.47        90
           3       0.00      0.00      0.00         4
           4       0.60      0.30      0.40        10
           5       0.09      0.17      0.12         6
           6       0.21      0.23      0.22        26
           7       0.05      0.25      0.08         8
           8       0.67      0.50      0.57         4
           9       0.50      1.00      0.67         1

    accuracy                           0.33       281
   macro avg       0.35      0.34      0.31       281
weighted avg       0.40      0.33      0.33       281
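
Both `BalancedRandomForestClassifier` and `RUSBoostClassifier` rely on random undersampling: each tree or boosting round is trained on a subset in which the larger classes have been cut down. The sketch below illustrates that resampling step alone in pure Python, assuming every class is reduced to the size of the smallest one; `random_undersample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices where every class is cut to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        # Sample without replacement so no index is repeated.
        kept.extend(rng.sample(members, n_min))
    return sorted(kept)


# Toy labels built from the class counts of y_train.csv.
counts = {0: 234, 1: 410, 2: 450, 3: 36, 4: 33, 5: 46, 6: 140, 7: 30, 8: 18, 9: 5}
labels = [label for label, n in counts.items() for _ in range(n)]
kept = random_undersample(labels)
print(Counter(labels[i] for i in kept))
```

With a smallest class of only 5 samples, each undersampled subset is tiny compared with the 1402 available rows, which may partly explain the low accuracy of both models on this dataset.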



## `e_train.csv`

## `a_train.csv`