**Implementation Note:** For each task, the sections **Solution** and **Validating student's solution** are not meant to be seen by the student before attempting the task.

**Motivation**: This assignment contains exercises on various methods to handle imbalanced datasets.

# Handling Imbalanced pima-indians dataset

## Setting

We have worked with the `pima-indians` dataset before. Here we use an artificially imbalanced version of it.

* We have discussed that accuracy alone is not a good measure of an algorithm's performance on imbalanced datasets.
* Hence, we will use Cohen's kappa along with accuracy to get a more rounded picture of the performance.
* With the base XGBoost model, we get
    - Accuracy: 84.34%
    - Kappa: 35.32 (Cohen's kappa multiplied by 100)
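
To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch of Cohen's kappa written in plain Python. The helper `cohen_kappa` and the toy labels are illustrative assumptions, not part of the assignment; in the notebook itself the metric comes from `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction equals truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / float(n)
    # Chance agreement: product of the marginal class counts, summed over classes.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / float(n * n)
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (not the pima-indians data): 90 negatives, 10 positives,
# and a "classifier" that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))
kappa = cohen_kappa(y_true, y_pred)
print("accuracy=%.2f kappa=%.2f" % (accuracy, kappa))
```

The always-negative classifier reaches 90% accuracy on this toy data, yet its kappa is 0 because it agrees with the truth no more often than chance would.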

In [1]:
from imblearn.datasets import make_imbalance
from numpy import loadtxt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score
from xgboost import XGBClassifier



In [2]:
dataset = loadtxt('./data/pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]

In [3]:
X, Y = make_imbalance(X, Y, ratio=0.20, min_c_=1, random_state=42)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [5]:
xgbmodel = XGBClassifier(seed=42)
xgbmodel.fit(X_train, y_train);

In [6]:
y_pred = xgbmodel.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("Kappa: %.2f" % (kappa * 100.0))

Accuracy: 84.34%
Kappa: 35.32


## 1: Handling Imbalanced Data with Algorithmic Approach

* Your task is to achieve at least 45% kappa with 84%+ accuracy on the given X_train, X_test, y_train, y_test. However:
    * You are only allowed to use the `XGBoost` model. 
    * You are only allowed to change at most one hyperparameter.

You need to write a function called `myImbalanced()` that

* Accepts the following parameters:
    - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
    - `**kwargs` (the model parameter that you wish to change)
  

* Should return
    - Accuracy on test data
    - Cohen's kappa on test data
    - Trained model
    
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.
1. [Imp] Since the function only accepts the final parameter, you may need to run a few experiments before submitting the final answer.

**Hint**:
1. If I were you, I would probably try to recall what we discussed in 'algorithmic approach'.

In [7]:
from sklearn.metrics import accuracy_score, cohen_kappa_score
from xgboost import XGBClassifier

In [8]:
def myImbalanced(X_train, X_test, y_train, y_test, **kwargs):
    
    model = XGBClassifier(seed=42)
    model.set_params(**kwargs)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    kappa = cohen_kappa_score(y_test, y_pred)
    return accuracy, kappa, model

In [9]:
myImbalanced_acc, myImbalanced_kappa, myImbalanced_model =\
    myImbalanced(X_train, X_test, y_train, y_test, scale_pos_weight=3.6)

In [10]:
print("accuracy %s" % myImbalanced_acc)
print("kappa: %s" % myImbalanced_kappa)

accuracy 0.843434343434
kappa: 0.455947527034


In [11]:
myImbalanced_model

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=3.6, seed=42, silent=True, subsample=1)
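
`scale_pos_weight` controls the balance between positive and negative weights in XGBoost. Its documentation suggests the ratio of negative to positive training samples as a typical value to consider; the 3.6 used above is this notebook's own choice, presumably found by experiment. A minimal sketch of that heuristic follows, using made-up labels rather than the notebook's `y_train`; `neg_pos_ratio` is a hypothetical helper name.

```python
def neg_pos_ratio(labels, positive=1):
    """Ratio of negative to positive samples, a common starting value
    for XGBoost's scale_pos_weight (not necessarily the best one)."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / float(n_pos)


# Made-up labels: 8 negatives and 2 positives.
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
start_value = neg_pos_ratio(toy_labels)
print(start_value)  # 4.0
```

On the real data one could compute `neg_pos_ratio(y_train)` and then try a few values around it through `myImbalanced(..., scale_pos_weight=value)`.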

#### **Note**: For the remaining exercises, do not tune hyperparameters of the XGBoost model, unless advised otherwise.

## 2: Implement Oversampling with SMOTE as a part of ML pipeline

* Next, we will tackle the same dataset using the oversampling technique SMOTE.
* You need to write a function that applies SMOTE to the dataset and then trains an XGBoost model on the resampled data.

* Write a function `SMOTEpipeline` that
    * Accepts no parameters
    * Should return
        - The pipeline object discussed above
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.

In [12]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

In [13]:
def SMOTEpipeline():
    smote = SMOTE(random_state=42)
    model = XGBClassifier(seed=42)
    pipeline = make_pipeline(smote, model)
    return pipeline

In [14]:
SMOTE_pipeline = SMOTEpipeline()

In [15]:
SMOTE_pipeline.named_steps

{'smote': SMOTE(k=None, k_neighbors=5, kind='regular', m=None, m_neighbors=10, n_jobs=1,
    out_step=0.5, random_state=42, ratio='auto', svm_estimator=None),
 'xgbclassifier': XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
        gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
        min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
        objective='binary:logistic', reg_alpha=0, reg_lambda=1,
        scale_pos_weight=1, seed=42, silent=True, subsample=1)}
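
For intuition about the `smote` step, here is a rough sketch of the core idea behind regular SMOTE: a synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. This is a simplified illustration in plain Python, not imbalanced-learn's implementation; `smote_like_sample` and the toy points are assumptions made for the example.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(42)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points (made up, two features each).
toy_minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(toy_minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

In this sketch `k` plays the role of SMOTE's `k_neighbors`: a larger value lets synthetic points be drawn towards more distant neighbours.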

## 3: Implement Undersampling with Edited Nearest Neighbours as a part of ML pipeline

* Next, we will tackle the same dataset using the undersampling technique Edited Nearest Neighbours (ENN).
* You need to write a function that applies ENN to the dataset and then trains an XGBoost model on the resampled data.
* Write a function `ENNpipeline` that
    * Accepts no parameters
    * Should return
        - The pipeline object discussed above
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.

In [16]:
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import EditedNearestNeighbours, RepeatedEditedNearestNeighbours
from sklearn.model_selection import GridSearchCV

In [17]:
def ENNpipeline():
    enn = EditedNearestNeighbours(random_state=42)
    model = XGBClassifier(seed=42)
    pipeline = make_pipeline(enn, model)
    return pipeline

In [18]:
ENN_pipeline = ENNpipeline()

In [19]:
ENN_pipeline.named_steps

{'editednearestneighbours': EditedNearestNeighbours(kind_sel='all', n_jobs=1, n_neighbors=3,
             random_state=42, ratio='auto', return_indices=False,
             size_ngh=None),
 'xgbclassifier': XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
        gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
        min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
        objective='binary:logistic', reg_alpha=0, reg_lambda=1,
        scale_pos_weight=1, seed=42, silent=True, subsample=1)}
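
As a rough illustration of what the `editednearestneighbours` step does, the sketch below drops a majority-class point unless all of its nearest neighbours share its label, which mirrors the `kind_sel='all'` rule shown above. It is a simplified plain-Python version, not the library's code. It assumes that only the majority class is cleaned, and `enn_keep_indices` and the toy data are hypothetical.

```python
def enn_keep_indices(points, labels, n_neighbors=3, majority_label=0):
    """Indices kept by a simplified Edited Nearest Neighbours pass.

    A majority-class point is dropped unless all of its n_neighbors
    nearest neighbours share its label (the kind_sel='all' rule).
    """
    kept = []
    for i, (point, label) in enumerate(zip(points, labels)):
        if label != majority_label:
            kept.append(i)  # assumption: minority samples are never removed
            continue
        # Squared Euclidean distance from this point to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(point, other)), j)
            for j, other in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:n_neighbors]]
        if all(l == label for l in neighbour_labels):
            kept.append(i)
    return kept


# Toy 1-D data: one majority point (index 4) sits inside the minority cluster.
toy_points = [[0.0], [0.1], [0.2], [0.3], [2.05], [2.0], [2.1], [2.2]]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1]
kept = enn_keep_indices(toy_points, toy_labels, n_neighbors=3)
print(kept)  # [0, 1, 2, 3, 5, 6, 7]
```

The majority point at 2.05 is removed because its three nearest neighbours are all minority samples, which cleans up the class boundary before the classifier is trained.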

## 4: Implement Grid Search on the Pipeline Object

* You need to run `GridSearchCV` on a `pipeline` object from the previous exercises in order to find the best parameters for its resampling step (SMOTE oversampling or ENN undersampling).

* You need to write a function called `myGridSearch()` that
    * Accepts the following parameters:
        - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
        - pipeline object
        - paramgrid (parameter grid for `GridSearchCV`)
    * Should return
        - Accuracy on the test set
        - Kappa on the test set
        - Classification report for the test set
        - Trained `GridSearchCV` model
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.

In [39]:
from imblearn.metrics import classification_report_imbalanced

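def myGridSearch(X_train, X_test, y_train, y_test, pipeline, paramgrid):
    # 3-fold cross-validated search over the pipeline's parameter grid.
    # No `scoring` is given, so candidates are ranked by the pipeline's
    # default score, which is accuracy for a classifier.
    gridsearch = GridSearchCV(pipeline, param_grid=paramgrid, cv=3)
    gridsearch.fit(X_train, y_train)
    # predict() uses the best pipeline, refit on the whole training set.
    y_pred = gridsearch.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    kappa = cohen_kappa_score(y_test, y_pred)
    clf_report = classification_report_imbalanced(y_test, y_pred)
    return accuracy, kappa, clf_report, gridsearch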

In [40]:
myGridSearch_pipeline = ENNpipeline()
myGridSearch_paramgrid = {"editednearestneighbours__n_neighbors": [2,3,4,5,6,7,8]}

myGridSearch_acc, myGridSearch_kappa, myGridSearch_clf_report, myGridSearch_params = \
myGridSearch(X_train, X_test, y_train, y_test, myGridSearch_pipeline, myGridSearch_paramgrid)

In [41]:
myGridSearch_acc

0.85353535353535348

In [42]:
myGridSearch_kappa

0.5234854771784232

In [43]:
print(myGridSearch_clf_report)

                   pre       rec       spe        f1       geo       iba       sup

        0.0       0.92      0.90      0.66      0.91      0.73      0.55       163
        1.0       0.57      0.66      0.90      0.61      0.73      0.51        35

avg / total       0.86      0.85      0.70      0.86      0.73      0.54       198
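
The `pre`, `rec` and `spe` columns stand for precision, recall and specificity. The sketch below recomputes them for class 1.0 from a confusion matrix. The four counts are not printed by the notebook; they are inferred here from the reported supports (163 and 35), accuracy and kappa, so treat them as a reconstruction.

```python
# Confusion counts for the positive class (1.0), inferred from the report
# above (supports 163 and 35, accuracy 169/198); the notebook does not
# print them.
tn, fp, fn, tp = 146, 17, 12, 23

precision = tp / float(tp + fp)      # `pre`: share of predicted positives that are correct
recall = tp / float(tp + fn)         # `rec`: share of actual positives that are found
specificity = tn / float(tn + fp)    # `spe`: share of actual negatives that are found
accuracy = (tp + tn) / float(tp + tn + fp + fn)

print("pre=%.3f rec=%.3f spe=%.3f acc=%.4f" % (precision, recall, specificity, accuracy))
```

For class 0.0 the roles swap: its recall equals the specificity of class 1.0 and vice versa, which is why the `rec` and `spe` columns mirror each other in the two rows.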



In [44]:
myGridSearch_params.best_estimator_.steps[0][1]

EditedNearestNeighbours(kind_sel='all', n_jobs=1, n_neighbors=3,
            random_state=42, ratio='auto', return_indices=False,
            size_ngh=None)

In [45]:
myGridSearch_params.best_estimator_.steps[1][1]

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=42, silent=True, subsample=1)
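
The key `editednearestneighbours__n_neighbors` in the parameter grid follows the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. A small sketch of that naming rule is given below. `FakeENN` and `FakeXGB` are empty stand-in classes created only so the example runs without the real libraries, and `step_name` and `split_grid_key` are hypothetical helpers.

```python
# Empty stand-ins that only carry the real class names; they are not the
# actual imbalanced-learn or XGBoost estimators.
FakeENN = type("EditedNearestNeighbours", (object,), {})
FakeXGB = type("XGBClassifier", (object,), {})


def step_name(estimator):
    # make_pipeline names each step after its lowercased class name.
    return type(estimator).__name__.lower()


def split_grid_key(key):
    # "<step>__<parameter>" -> ("<step>", "<parameter>")
    step, _, param = key.partition("__")
    return step, param


steps = [step_name(FakeENN()), step_name(FakeXGB())]
step, param = split_grid_key("editednearestneighbours__n_neighbors")
print(steps)
print("%s -> %s (known step: %s)" % (step, param, step in steps))
```

A typo in the step name, for example writing `enn__n_neighbors`, would make the grid search fail because no step with that name exists in the pipeline.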

## 5: Figure out the optimum `k_neighbors` value for `SMOTEpipeline`:

* You need to achieve results similar to the previous exercise (45%+ kappa with 84%+ accuracy).

You need to write a function called `k_neighbours()` that

* Accepts the following parameters:
    - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
* Should return
    - Accuracy on the test set
    - Kappa on the test set
    - Classification report for the test set
    - Trained `GridSearchCV` model for `k_neighbors`
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.
2. You may need to hardcode the parameter grid in your function.

In [46]:
def k_neighbours(X_train, X_test, y_train, y_test):
    
    k_neighbours_paramsgrid = {"smote__k_neighbors": [2,3,4,5,6,7,8]}
    return myGridSearch(X_train, X_test, y_train, y_test, SMOTEpipeline(), k_neighbours_paramsgrid)

In [47]:
k_neighbours_acc, k_neighbours_kappa, k_neighbours_clf_report, k_neighbours_grid = \
    k_neighbours(X_train, X_test, y_train, y_test)

In [48]:
print(k_neighbours_acc)
print(k_neighbours_kappa)
print(k_neighbours_clf_report)

0.787878787879
0.317129249466
                   pre       rec       spe        f1       geo       iba       sup

        0.0       0.89      0.85      0.49      0.87      0.61      0.38       163
        1.0       0.41      0.49      0.85      0.45      0.61      0.35        35

avg / total       0.80      0.79      0.55      0.79      0.61      0.38       198



In [49]:
print(k_neighbours_grid.best_estimator_.steps[0][1])

SMOTE(k=None, k_neighbors=2, kind='regular', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=42, ratio='auto', svm_estimator=None)


## 6: Figure out optimum `kind` for `SMOTEpipeline`:

* You need to achieve results similar to the previous exercise (45%+ kappa with 84%+ accuracy).

You need to write a function called `kind()` that

* Accepts the following parameters:
    - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
* Should return
    - Accuracy on the test set
    - Kappa on the test set
    - Classification report for the test set
    - Trained `GridSearchCV` model for the optimum `kind` of SMOTE
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.
2. You may need to hardcode the parameter grid in your function.

In [50]:
def kind(X_train, X_test, y_train, y_test):
    
    kind_paramsgrid = {"smote__kind": ['regular', 'borderline1', 'borderline2', 'svm']}
    return myGridSearch(X_train, X_test, y_train, y_test, SMOTEpipeline(), kind_paramsgrid)

In [51]:
kind_acc, kind_kappa, kind_clf_report, kind_grid = kind(X_train, X_test, y_train, y_test)

In [52]:
print(kind_acc)
print(kind_kappa)
print(kind_clf_report)

0.843434343434
0.468018720749
                   pre       rec       spe        f1       geo       iba       sup

        0.0       0.91      0.90      0.57      0.90      0.71      0.52       163
        1.0       0.56      0.57      0.90      0.56      0.71      0.49        35

avg / total       0.85      0.84      0.63      0.84      0.71      0.52       198



In [53]:
print(kind_grid.best_estimator_.steps[0][1])

SMOTE(k=None, k_neighbors=5, kind='svm', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=42, ratio='auto', svm_estimator=None)


## 7: Fine-tune both `kind` and `k_neighbors` for `SMOTEpipeline`:

* You need to achieve results similar to the previous exercise (45%+ kappa with 84%+ accuracy).
* You need to write a function called `optimum()` that

    * Accepts the following parameters:
        - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
    * Should return
        - Accuracy on the test set
        - Kappa on the test set
        - Classification report for the test set
        - Trained `GridSearchCV` object for the optimum `kind` and `k_neighbors` combination
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.
2. You may need to hardcode the parameter grid in your function.

In [54]:
def optimum(X_train, X_test, y_train, y_test):
    
    optimum_paramsgrid = {"smote__k_neighbors": [2,3,4,5,6,7,8],
                  "smote__kind": ['regular', 'borderline1', 'borderline2', 'svm']}
    return myGridSearch(X_train, X_test, y_train, y_test, SMOTEpipeline(), optimum_paramsgrid)

In [55]:
optimum_acc, optimum_kappa, optimum_clf_report, optimum_grid = optimum(X_train, X_test, y_train, y_test)

In [56]:
print(optimum_acc)
print(optimum_kappa)
print(optimum_clf_report)

0.843434343434
0.468018720749
                   pre       rec       spe        f1       geo       iba       sup

        0.0       0.91      0.90      0.57      0.90      0.71      0.52       163
        1.0       0.56      0.57      0.90      0.56      0.71      0.49        35

avg / total       0.85      0.84      0.63      0.84      0.71      0.52       198



In [57]:
print(optimum_grid.best_estimator_.steps[0][1])

SMOTE(k=None, k_neighbors=5, kind='svm', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=42, ratio='auto', svm_estimator=None)
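
A grid over two parameters is searched as the Cartesian product of their values. The sketch below expands the grid used in `optimum()` in plain Python to show how many candidates that is. `expand_grid` is a hypothetical helper written for illustration, not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as in optimum() above.
param_grid = {
    "smote__k_neighbors": [2, 3, 4, 5, 6, 7, 8],
    "smote__kind": ["regular", "borderline1", "borderline2", "svm"],
}


def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(expand_grid(param_grid))
n_folds = 3  # cv=3 in myGridSearch
print(len(candidates))            # 28
print(len(candidates) * n_folds)  # 84 cross-validation fits, before the final refit
```

Each added parameter multiplies the number of fits, so a joint search costs noticeably more than the two single-parameter searches from the previous exercises.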


## 8: Figure out the optimum `n_neighbors` for `ENNpipeline`:

* You need to achieve results similar to the previous exercise (45%+ kappa with 84%+ accuracy).
* You need to write a function called `n_neighbour()` that

    * Accepts the following parameters:
        - X_train, X_test, y_train, y_test (NumPy arrays for training and testing; any format accepted by sklearn will work)
    * Should return
        - Accuracy on the test set
        - Kappa on the test set
        - Classification report for the test set
        - Trained `GridSearchCV` object for the optimum `n_neighbors` value
    
**Note**: 

1. Keep seed=42, random_state=42 wherever applicable.
2. You may need to hardcode the parameter grid in your function.

In [85]:
def n_neighbour(X_train, X_test, y_train, y_test):
    
    n_neighbour_paramsgrid = {"editednearestneighbours__n_neighbors": [2,3,4,5,6,7,8]}
    return myGridSearch(X_train, X_test, y_train, y_test, ENNpipeline(), n_neighbour_paramsgrid)

In [86]:
n_neighbour_acc, n_neighbour_kappa, n_neighbour_clf_report, n_neighbour_grid = n_neighbour(X_train, X_test, y_train, y_test)

In [87]:
print(n_neighbour_acc)
print(n_neighbour_kappa)
print(n_neighbour_clf_report)

0.853535353535
0.523485477178
                   pre       rec       spe        f1       geo       iba       sup

        0.0       0.92      0.90      0.66      0.91      0.73      0.55       163
        1.0       0.57      0.66      0.90      0.61      0.73      0.51        35

avg / total       0.86      0.85      0.70      0.86      0.73      0.54       198



In [88]:
print(n_neighbour_grid.best_estimator_.steps[0][1])

EditedNearestNeighbours(kind_sel='all', n_jobs=1, n_neighbors=3,
            random_state=42, ratio='auto', return_indices=False,
            size_ngh=None)
