## Demo Handling of Imbalance Data For Classification Problems
- Demos the use of SMOTE - Synthetic Minority Oversampling TEchnique for resolving a data imbalance problem
- The demo uses the [imblearn api](http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html)
- Also I refer to the [Nick Becker's blog](https://beckernick.github.io/oversampling-modeling/) which explains the correct way to apply oversampling to correct data imbalance

### Installing imbalanced-learn
pip install -U imbalanced-learn

In [1]:
!pip install -U imbalanced-learn

Collecting imbalanced-learn
[?25l  Downloading https://files.pythonhosted.org/packages/80/a4/900463a3c0af082aed9c5a43f4ec317a9469710c5ef80496c9abc26ed0ca/imbalanced_learn-0.3.3-py3-none-any.whl (144kB)
[K    100% |████████████████████████████████| 153kB 8.2MB/s ta 0:00:01
[?25hRequirement not upgraded as not directly required: scikit-learn in /home/nbuser/anaconda3_501/lib/python3.6/site-packages (from imbalanced-learn) (0.19.1)
Requirement not upgraded as not directly required: scipy in /home/nbuser/anaconda3_501/lib/python3.6/site-packages (from imbalanced-learn) (0.19.1)
Requirement not upgraded as not directly required: numpy in /home/nbuser/anaconda3_501/lib/python3.6/site-packages (from imbalanced-learn) (1.14.3)
[31mgrpcio 1.11.0 has requirement protobuf>=3.5.0.post1, but you'll have protobuf 3.4.1 which is incompatible.[0m
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.3.3


### Define the imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

### Define global variables

In [3]:
SEED = 10
data_path = './data/pima-data.csv'
# Loading some example data
raw_data = pd.read_csv(data_path, delimiter=',')
data = raw_data.values
X, y = data[:,0:9], data[:,-1]
N_SAMPLES = X.shape[0]
N_FEATURES = X.shape[1]
N_RESPONSES = 1
SPLIT_FRACTION = 0.30
CV = 5
raw_data.head(3)

Unnamed: 0,num_preg,glucose_conc,diastolic_bp,thickness,insulin,bmi,diab_pred,age,skin,diabetes
0,6,148,72,35,0,33.6,0.627,50,1.379,True
1,1,85,66,29,0,26.6,0.351,31,1.1426,False
2,8,183,64,0,0,23.3,0.672,32,0.0,True


### Encode the Label y to binary

In [4]:
encoder = LabelEncoder()
encoder.fit(y)
y_encoded = encoder.transform(y) 

### Compute the label imbalance

In [5]:
def computeImbalance(y):
    label_imbalance = (np.count_nonzero(y==0)/y.shape[0])*100, (np.count_nonzero(y==1)/y.shape[0])*100
    return label_imbalance

computeImbalance(y_encoded)

(65.10416666666666, 34.89583333333333)

### Split the data into Training and Test partitions

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.1, random_state=SEED)

### Split Training data to get the Validation partition

In [7]:
X_train_new, X_valid, y_train_new, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=SEED)

### Apply SMOTE Oversampling to the Training Partition

In [8]:
sm = SMOTE(random_state=SEED, ratio = 'minority')
X_train_resample, y_train_resample = sm.fit_sample(X_train_new, y_train_new)

### Compute the Class Imbalance on the Training set after SMOTE

In [9]:
print("Class imbalance \n\nBefore Oversampling:\n{0}\n\nAfter Oversampling:\n{1}".format(computeImbalance(y_train_new), computeImbalance(y_train_resample)))

Class imbalance 

Before Oversampling:
(63.929146537842186, 36.07085346215781)

After Oversampling:
(50.0, 50.0)


### Use Random Forest Classifier Model for training

In [10]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=SEED)
clf_rf.fit(X_train_resample, y_train_resample)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
            oob_score=False, random_state=10, verbose=0, warm_start=False)

### Evalaute the trained model

In [11]:
def printEvaluationReport(model, data):
    X_valid, X_test, y_valid, y_test = data
    print('Validation Results')
    print("Score: {}".format(model.score(X_valid, y_valid)))
    print("Recall: {}".format(recall_score(y_valid, clf_rf.predict(X_valid))))
    print('\nTest Results')
    print("Score: {}".format(model.score(X_test, y_test)))
    print("Recall: {}".format(recall_score(y_test, clf_rf.predict(X_test))))

data = (X_valid, X_test, y_valid, y_test)
printEvaluationReport(clf_rf, data)

Validation Results
Score: 0.8142857142857143
Recall: 0.7777777777777778

Test Results
Score: 0.7142857142857143
Recall: 0.6153846153846154


### Using Cross Validation with oversampling

In [25]:
#X_train_resample2, y_train_resample2 = sm.fit_sample(X_train, y_train)
rf_class_weight = [{0:0.01, 1:0.99}, {0:0.10, 1:0.90}, {0:0.80, 1:0.20}]
parameters = {'classifier__n_estimators':range(10, 100, 10),
             'classifier__max_features':['sqrt', 'log2'],
             'classifier__max_depth': range(1, 50, 5),
             'classifier__min_samples_split': [100, 200],
             'classifier__min_samples_leaf': [5, 10],
             'classifier__criterion': ['entropy', 'gini']}
rf_cv = RandomForestClassifier(random_state=SEED)
model = Pipeline([
        ('sampling', sm),
        ('classifier', rf_cv)
    ])
clf_grid_search = GridSearchCV(model, parameters, cv=CV,verbose=True, n_jobs=-1)
clf_grid_search.fit(X_train, y_train)
print("Best: {0:.4f} using {1}".format(clf_grid_search.best_score_, str(clf_grid_search.best_params_)))

Fitting 5 folds for each of 1440 candidates, totalling 7200 fits


[Parallel(n_jobs=-1)]: Done 344 tasks      | elapsed:   16.5s
[Parallel(n_jobs=-1)]: Done 1094 tasks      | elapsed:   58.3s
[Parallel(n_jobs=-1)]: Done 2344 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 4094 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 6344 tasks      | elapsed:  5.4min


Best: 0.7786 using {'classifier__criterion': 'gini', 'classifier__max_depth': 6, 'classifier__max_features': 'sqrt', 'classifier__min_samples_leaf': 5, 'classifier__min_samples_split': 100, 'classifier__n_estimators': 20}


[Parallel(n_jobs=-1)]: Done 7200 out of 7200 | elapsed:  6.1min finished


### Evaluate the trained CV Model

In [26]:
data = (X_train, X_test, y_train, y_test)
printEvaluationReport(clf_grid_search, data)


Validation Results
Score: 0.808972503617945
Recall: 0.9834710743801653

Test Results
Score: 0.7402597402597403
Recall: 0.6153846153846154
