# Meta Techniques with classifiers

In this session, we'll experiment with classifiers and with techniques for tuning and combining them.

## Step 0 - Imports and load training data

In [21]:
import sys
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Add input as import path
sys.path.insert(0,'../input')

import joblib #or your dataset handler
X, Y = joblib.load("traindata.pkl")
X_test, Y_test = joblib.load("testdata.pkl")

# Double-check that everything loaded correctly
print("TrainData=", X[0:10], " with shape ", X.shape)
print("TrainLabels=", Y[0:10], Y.shape)

print("TestData=", X_test[0:10], " with shape ", X_test.shape)
print("TestLabels=", Y_test[0:10], Y_test.shape)  # We don't have test labels, so these should be NaNs

TrainData= [[ -1.54609786e+00  -1.34499549e+00   2.75868709e-16   4.81287772e-01
   -4.44999502e-01   1.27192065e+00  -2.40989649e+00   1.28580768e+00
    5.98139336e-01]
 [ -3.52090723e-01  -1.34499549e+00   9.22992157e-03  -4.79086761e-01
   -4.44999502e-01  -4.42748897e-01   5.07835440e-01   1.23087575e+00
   -7.19428002e-01]
 [ -1.54609786e+00  -1.34499549e+00  -4.56670909e-01   2.40203684e+00
    1.86652569e+00   1.27192065e+00  -1.92360784e+00  -6.68199626e-01
   -7.19428002e-01]
 [  8.41916418e-01   7.43496915e-01   2.75868709e-16   4.81287772e-01
   -4.44999502e-01  -1.30008367e+00   5.07835440e-01  -8.35669181e-02
    2.68747501e-01]
 [  8.41916418e-01  -1.34499549e+00  -1.00022188e+00  -4.79086761e-01
   -4.44999502e-01  -1.30008367e+00   5.07835440e-01  -4.87708992e-01
   -7.19428002e-01]
 [ -3.52090723e-01   7.43496915e-01  -2.23720494e-01   4.81287772e-01
   -4.44999502e-01   4.14585875e-01   5.07835440e-01   1.58008589e+00
    2.68747501e-01]
 [  8.41916418e-01   7.434969

## 1. HyperParameters

Hyperparameters are high-level settings chosen before training (for an SVM: `C`, `gamma` and `kernel`) that affect how a given classifier is fitted. Unlike the model's internal parameters, they are not learned from the data.
_If you have implemented the dataset loader, you could also add feature variables to the parameter search._

Here we experiment with an SVM classifier. Try more on your own:

* Q1. Experiment with the RandomForests
* Q2. Increase search space
* Q3. Use the __hyperopt__ library to explore the search space further
    * Hint : [http://hyperopt.github.io/hyperopt/]

The relevant scikit-learn module is __sklearn.model_selection__.
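
For Q1, one possible starting point is to reuse the grid-search pattern with a random forest. The sketch below is only an illustration under stated assumptions, not a reference answer: it uses synthetic data from `make_classification` in place of `traindata.pkl`, and the grid values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the session's data (assumption: 9 features, 2 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                          random_state=0)

# Hyperparameters are fixed before fitting; these grid values are illustrative.
rf_grid = {'n_estimators': [10, 50],
           'max_depth': [3, None],
           'min_samples_leaf': [1, 5]}

rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid,
                         cv=3, scoring='precision_macro')
rf_search.fit(x_tr, y_tr)

print("Best parameters:", rf_search.best_params_)
# best_estimator_ is the forest refitted on all of x_tr with the best parameters.
print("Held-out accuracy:", rf_search.best_estimator_.score(x_te, y_te))
```

Swapping `X_demo, y_demo` for the session's `X, Y` should be enough to run the same search on the real data.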

In [25]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from utils import accuracy_score_numpy


# Set the parameters by cross-validation
parameters_to_tune = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']


# We don't have labelled test data, so we split the training data into new train and test sets.
# Note the capitalization: uppercase X, Y, X_test are the loaded data; lowercase x_train, x_test, y_train, y_test are the splits made here.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0)

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clfgrid = GridSearchCV(SVC(), parameters_to_tune, cv=5,
                       scoring='%s_macro' % score)
    clfgrid.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clfgrid.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clfgrid.cv_results_['mean_test_score']
    stds = clfgrid.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clfgrid.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clfgrid.predict(x_test)
    print(classification_report(y_true, y_pred))
    print("Accuracy : ", np.count_nonzero(y_true==y_pred)/len(y_pred))
    print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.300 (+/-0.001) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.300 (+/-0.001) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.740 (+/-0.073) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.300 (+/-0.001) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.757 (+/-0.061) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.738 (+/-0.079) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
0.766 (+/-0.060) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
0.752 (+/-0.064) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
0.765 (+/-0.070) for {'C': 1, 'kernel': 'linear'}
0.755 (+/-0.044) for {'C': 10, 'kernel': 'linear'}
0.755 (+/-0.044) for {'C': 100, 'kernel': 'linear'}
0.755 (+/-0.044) for {'C': 1000, 'kernel': 'linear'}

Detailed classification report:

The model is trained on the full development set.
The scores are computed
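
The twelve rows of grid scores above correspond to every combination in `parameters_to_tune`. As a rough sketch of how a list of sub-grids expands (the helper name `expand_grid` is hypothetical; scikit-learn provides `ParameterGrid` for the same purpose), the count can be reproduced with the standard library:

```python
from itertools import product

# Same grid as in the cell above.
parameters_to_tune = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                       'C': [1, 10, 100, 1000]},
                      {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

def expand_grid(grids):
    """Yield every parameter combination from a list of sub-grids."""
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

candidates = list(expand_grid(parameters_to_tune))
n_folds = 5  # matches cv=5 above
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits per scoring metric")
# prints: 12 candidates, 60 fits per scoring metric
```

This is why Q2 gets expensive quickly: each extra value in a list multiplies the number of fits for that sub-grid.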

In [32]:
from sklearn.model_selection import RandomizedSearchCV
import scipy

#And now, let's try some Random searches instead,
param_space = {'C': scipy.stats.expon(scale=1000), 'gamma': scipy.stats.expon(scale=.1),
  'kernel': ['rbf','linear'], 'class_weight':['balanced', None]}

scores = ['precision', 'recall']
n_iter_search = 20
clf_to_optimize = SVC()

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    
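    # Pass the scoring metric so that each pass of the loop actually optimizes it
    clfrnd = RandomizedSearchCV(clf_to_optimize, param_distributions=param_space,
                                n_iter=n_iter_search, scoring='%s_macro' % score)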
    clfrnd.fit(x_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clfrnd.best_params_)
    print()
    print("Random scores on development set:")
    print()
    means = clfrnd.cv_results_['mean_test_score']
    stds = clfrnd.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clfrnd.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clfrnd.predict(x_test)
    print(classification_report(y_true, y_pred))
    print("Accuracy : ", np.count_nonzero(y_true==y_pred)/len(y_pred))
    print()

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 1790.4038321743124, 'class_weight': 'balanced', 'gamma': 0.029690219996218577, 'kernel': 'linear'}

Random scores on development set:

0.771 (+/-0.082) for {'C': 2875.1204361646783, 'class_weight': 'balanced', 'gamma': 0.070024172813174995, 'kernel': 'linear'}
0.743 (+/-0.029) for {'C': 569.00502651030354, 'class_weight': None, 'gamma': 0.03381156095995333, 'kernel': 'linear'}
0.667 (+/-0.061) for {'C': 1059.733773577402, 'class_weight': 'balanced', 'gamma': 0.16400436046777886, 'kernel': 'rbf'}
0.679 (+/-0.018) for {'C': 1396.5177333418828, 'class_weight': 'balanced', 'gamma': 0.057010758820175339, 'kernel': 'rbf'}
0.679 (+/-0.049) for {'C': 4056.1042518863442, 'class_weight': None, 'gamma': 0.072668920684137453, 'kernel': 'rbf'}
0.670 (+/-0.092) for {'C': 1973.1937577691358, 'class_weight': 'balanced', 'gamma': 0.28874897795011328, 'kernel': 'rbf'}
0.664 (+/-0.085) for {'C': 924.355678411508
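
Random search draws each candidate from the distributions in `param_space` rather than from a fixed list. To get a feel for what those draws look like, here is a small sketch that uses the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon`; the seed and sample size are arbitrary.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed, for repeatability only

def sample_expon(scale, n):
    """Draw n values from an exponential distribution whose mean is `scale`."""
    return [rng.expovariate(1.0 / scale) for _ in range(n)]

c_draws = sample_expon(1000, 10000)     # mirrors expon(scale=1000) for C
gamma_draws = sample_expon(0.1, 10000)  # mirrors expon(scale=.1) for gamma

mean_c = sum(c_draws) / len(c_draws)
mean_gamma = sum(gamma_draws) / len(gamma_draws)
share_below = sum(c < 1000 for c in c_draws) / len(c_draws)

print("mean C     ~ %.1f" % mean_c)
print("mean gamma ~ %.4f" % mean_gamma)
# In theory P(X < scale) = 1 - exp(-1), i.e. about 63% of draws.
print("share of C draws below 1000: %.2f (theory: %.2f)"
      % (share_below, 1 - math.exp(-1)))
```

With `scale=.1`, only about 1% of `gamma` draws fall below 0.001, the value the grid search preferred, which may partly explain why the `rbf` candidates score lower here.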

In [33]:
# Let's compare the two test accuracies.
# clfgrid and clfrnd hold the searches from the last iteration of their loops (the 'recall' pass).
y_pred_grid = clfgrid.predict(X_test)
y_pred_rnd = clfrnd.predict(X_test)
print("Accuracy Grid -", accuracy_score_numpy(y_pred_grid))
print("Accuracy RandomSearch -", accuracy_score_numpy(y_pred_rnd))


Accuracy Grid - 0.798473282443
Accuracy RandomSearch - 0.772519083969
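
The `accuracy_score_numpy` helper comes from the session's `utils` module and is not shown here. When the labels are at hand, the inline accuracy used in the earlier cells (`np.count_nonzero(y_true==y_pred)/len(y_pred)`) can be wrapped as a small function. This is a sketch only, and the name `accuracy_numpy` is hypothetical:

```python
import numpy as np

def accuracy_numpy(y_true, y_pred):
    """Return the fraction of predictions that match the labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))

# Toy labels, only to show the call: 3 of 4 predictions match.
print(accuracy_numpy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```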


## 2. Ensembling Techniques.

Boosting is a powerful technique, and scikit-learn already implements several ensemble methods.
* Q1. Implement bootstrapping in the traditional sense. Look at _sklearn.cross_validation.Bootstrap_ (available only in old scikit-learn releases; it has since been removed)
    * Hint : [http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.cross_validation.Bootstrap.html]

The relevant scikit-learn module is __sklearn.ensemble__.
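
For Q1, a bootstrap in the traditional sense resamples the training rows with replacement and evaluates on the rows left out of each resample. Since `sklearn.cross_validation.Bootstrap` is gone from later scikit-learn releases, the sketch below uses plain NumPy instead. `bootstrap_splits` is a hypothetical helper name, and the sample count, number of iterations and seed are arbitrary.

```python
import numpy as np

def bootstrap_splits(n_samples, n_iter=3, seed=0):
    """Yield (train_idx, oob_idx) pairs.

    train_idx is drawn with replacement; oob_idx holds the rows never drawn.
    """
    rng = np.random.RandomState(seed)
    all_idx = np.arange(n_samples)
    for _ in range(n_iter):
        train_idx = rng.randint(0, n_samples, size=n_samples)
        oob_idx = np.setdiff1d(all_idx, train_idx)
        yield train_idx, oob_idx

for train_idx, oob_idx in bootstrap_splits(1000):
    # On average about 36.8% of rows (1/e) end up out-of-bag.
    print("unique train rows:", len(np.unique(train_idx)),
          "| out-of-bag rows:", len(oob_idx))
```

Each pair can be used like a cross-validation fold, for example `clf.fit(x_train[train_idx], y_train[train_idx])` followed by `clf.score(x_train[oob_idx], y_train[oob_idx])`, assuming `x_train` and `y_train` are NumPy arrays.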
        

In [42]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
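# AdaBoost
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(x_train, y_train)
# Note: cross_val_score clones clf and refits it on folds of the held-out split,
# so this is a cross-validated score on x_test, not the score of the fit above.
scores = cross_val_score(clf, x_test, y_test)
print("Score : ", scores.mean())

# Gradient boosting
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
clf.fit(x_train, y_train)
print("Score :", clf.score(x_test, y_test))

# Bagging
# The base estimator is passed positionally because its keyword name differs across scikit-learn versions.
clf = BaggingClassifier(SVC(), n_estimators=10)  # default SVC; pass clfgrid.best_estimator_ to bag the tuned model
clf.fit(x_train, y_train)
print("Score :", clf.score(x_test, y_test))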

Score :  0.770642201835
Score : 0.813455657492
Score : 0.807339449541


## 3. Extras - AutoSklearn

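auto-sklearn automates both model selection and hyperparameter tuning, which takes most of the fun out of the sections above. We hate showing this, but here goes...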

In [None]:
import autosklearn.classification as AutoClf
from sklearn.metrics import accuracy_score

clf = AutoClf.AutoSklearnClassifier(time_left_for_this_task=300) #Default is 1 hour
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
print("Accuracy score", accuracy_score(y_test, y_pred))

Time limit for a single run is higher than total time limit. Capping the limit for a single run to the total time given to SMAC (299.561931)
