#### Cross-validation is an important part of machine learning, and there are several ways to do it.
We can use it to decide:
- which models are more effective
- which parameters to use for a specific model
- which features to select

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# note: the old sklearn.cross_validation module (train_test_split, cross_val_score, KFold)
# was deprecated in scikit-learn 0.18 and removed in 0.20; use sklearn.model_selection instead
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn import preprocessing, metrics
from sklearn.svm import SVC

import math, time

#### Model Evaluation Metrics
To compare models, we need evaluation metrics.
- for classification: the targets are categorical, so we measure accuracy with ***metrics.accuracy_score***, or equivalently the error rate (the names in bold are XGBoost metric names):
  * **error** - binary classification error rate, calculated as #(wrong cases) / #(all cases). Predictions with probability p > 0.5 are treated as positive
  * **merror** - multiclass classification error rate, calculated as #(wrong cases) / #(all cases)
- for regression: the targets are continuous, and the goal is to ___minimize___ the error between the true values $y_i$ and the predictions $\hat{y}_i$:
  * **Mean Absolute Error (MAE): metrics.mean_absolute_error**
  $$mae = \frac{1}{n}\sum_{i=1}^n|y_{i} - \hat{y}_{i}|$$
  * **Mean Squared Error (MSE): metrics.mean_squared_error**
  $$mse = \frac{1}{n}\sum_{i=1}^n(y_{i} - \hat{y}_{i})^2$$
  * **Root Mean Squared Error (RMSE)**
  $$rmse = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_{i} - \hat{y}_{i})^2}$$
- other metrics, used mostly for probabilistic classification and ranking:
  * **Logloss** - negative log-likelihood              (***Minimize this***)
  * **AUC**  - area under the ROC curve                (***Maximize this***)
  * **NDCG** - normalized discounted cumulative gain   (***Maximize this***)
  * **MAP**  - mean average precision                  (***Maximize this***)
- by default, older XGBoost versions report **error** for classification objectives (since version 1.3 the default for binary:logistic is **logloss**)
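
As a small illustration of the regression formulas above, the sketch below computes MAE, MSE and RMSE both with `sklearn.metrics` and directly from the definitions. The target and prediction values are made up for the example.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # RMSE is simply the square root of MSE

# The same quantities written out directly from the formulas
mae_manual = np.mean(np.abs(y_true - y_hat))
mse_manual = np.mean((y_true - y_hat) ** 2)

print("MAE :", mae, mae_manual)
print("MSE :", mse, mse_manual)
print("RMSE:", rmse)
```

Because RMSE is in the same units as the target, it is often easier to interpret than MSE.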

### without cross_validation
- run only once

In [None]:
from sklearn.datasets import load_iris

# fetch data first
X = load_iris().data
y = load_iris().target

# preprocessing data and split it into train and test sets
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# model
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)
print("The model score is:", clf.score(X_test, y_test))

When we use train_test_split, part of the samples is held out for testing. However, a single split gives a high-variance estimate, since changing which observations happen to land in the test set can significantly change the test accuracy!
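
One way to see this variance is to repeat the split with different random seeds and compare the scores. The sketch below does that for the same iris/SVC setup; the number of repetitions (20) and the variable names are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_demo = preprocessing.scale(X_demo)

holdout_scores = []
for split_seed in range(20):  # 20 different random splits (arbitrary choice)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=split_seed)
    model = SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print("min / max accuracy:", holdout_scores.min(), holdout_scores.max())
print("standard deviation:", holdout_scores.std())
```

The spread between the minimum and maximum shows how much a single hold-out score depends on the particular split.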

How can we make use of the test data for training as well? K-fold cross-validation solves this problem by splitting the data into k folds and rotating which fold is held out:
- a model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining fold (i.e., that fold is used as a test set to compute a performance measure such as accuracy);
- the performance measure reported by k-fold cross-validation is then the average of the values computed in the loop (using different test sets).

In [None]:
# Here is the step by step cross validation (cv)
from sklearn.datasets import load_iris  # load_boston was unused here and was removed in scikit-learn 1.2

# fetch data first
X = load_iris().data
X = preprocessing.scale(X)
y = load_iris().target

# cv fold
nfolds = 10
kf = KFold(n_splits=nfolds, shuffle=True, random_state=int(time.time()))

clf = SVC(kernel='linear', C=1)
fold_scores = []   # accuracy on each held-out fold
for train_index, test_index in kf.split(X):
    #print("%s, %s" % (train_index, test_index))
    #print("%d, %d" % (len(train_index), len(test_index)))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # the targets are class labels, so accuracy is a more meaningful score than RMSE
    fold_scores.append(metrics.accuracy_score(y_test, y_pred))

print(fold_scores)
print(np.mean(fold_scores))   # the cross-validated score is the plain average over the folds

The **good** news is that you usually don't need to implement the details of cross-validation yourself. The sklearn package provides a high-level function ***cross_val_score()*** that does all of the above.
- In addition, for classification problems, ***stratified sampling*** is recommended for creating the folds; that is
  * each class should appear in each of the K folds in roughly the same proportion as in the whole dataset.
  * **sklearn.model_selection.cross_val_score()** does this by default when the estimator is a classifier and cv is an integer.
- Options for the scoring parameter include:
    - ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
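
To make the effect of stratification concrete, the sketch below compares plain `KFold` with `StratifiedKFold` on a made-up imbalanced label vector (90 samples of class 0 followed by 10 of class 1) and counts how many minority samples land in each test fold. The data and the helper name `minority_counts` are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 90 samples of class 0, then 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values are irrelevant for splitting

def minority_counts(splitter):
    """Count class-1 samples in the test part of every fold."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain = minority_counts(KFold(n_splits=5))            # no shuffling, no stratification
strat = minority_counts(StratifiedKFold(n_splits=5))  # keeps class proportions

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

Without shuffling, `KFold` puts all ten minority samples into the last fold, while `StratifiedKFold` places two in every fold.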

In [None]:
# Here is the simplified version of cross validation (cv)

X = load_iris().data
X = preprocessing.scale(X)
y = load_iris().target

#scores = -cross_val_score(SVC(), X, y, cv=10, scoring='neg_mean_absolute_error')
scores = cross_val_score(SVC(), X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())

#### Now let's use this to tune model parameters

In [None]:
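# bst is the booster returned by xgb.train(...) when early_stopping_rounds is set
print("Booster best train score: {}".format(bst.best_score))
print("Booster best iteration: {}".format(bst.best_iteration))
# note: best_ntree_limit exists only in older XGBoost releases (it was removed in 2.0)
print("Booster best number of trees limit: {}".format(bst.best_ntree_limit))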

### Cross validating results
The native XGBoost package provides an option for cross-validating results (though not as sophisticated as the one in sklearn).

The next cell shows a basic run.

***Notice that we pass only a single DMatrix, so it would be good to merge train and test into one object to have more training samples***
- by default, we get a pandas DataFrame back (this can be changed with the as_pandas param)
- metrics are passed as an argument (multiple values are allowed)
- we can use our own evaluation metrics (params feval and maximize)

In [None]:
num_rounds = 10   # how many estimators

hist = xgb.cv(params, dtrain, num_rounds, nfold=10, metrics={'error'}, seed=seed)
hist

## More hyper-parameter tuning
- many parameters are tunable, and each setting produces different output. The question is which combination produces the best results.
- scikit-learn provides several modules for this!

In [None]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
# Note: sklearn.cross_validation.StratifiedKFold was removed in scikit-learn 0.20

from scipy.stats import randint, uniform
seed = 342  # fixed seed makes results reproducible
np.random.seed(seed)

# generate artificial dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define a cross-validation strategy for testing. Let's use ***StratifiedKFold***, which keeps the proportion of each target label roughly the same in every fold.

In [None]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cv.get_n_splits(X, y)

### Grid Search
In grid-search, we start by defining a dictionary holding possible parameter values we want to test. 
- All combinations will be evaluated
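
To get a feel for how quickly "all combinations" grows, scikit-learn's `ParameterGrid` can enumerate a grid without fitting anything. The sketch below uses a grid of the same shape as the one in this section (3 x 4 x 3 values); the name `demo_grid` is just for the illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# A grid with 3 x 4 x 3 candidate values
demo_grid = {'max_depth': [1, 2, 3],
             'n_estimators': [5, 10, 25, 50],
             'learning_rate': np.linspace(1e-16, 1, 3)}

combos = list(ParameterGrid(demo_grid))
n_folds = 10

print("combinations:", len(combos))
print("model fits with 10-fold CV:", len(combos) * n_folds)
print("first combination:", combos[0])
```

Adding one more parameter with, say, five values would multiply the number of fits by five, which is why it pays to estimate the cost before running a search.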

In [None]:
params_grid = { 'max_depth' : [1,2,3], 
                'n_estimators' : [5, 10, 25, 50],
                'learning_rate' : np.linspace(1e-16, 1, 3)}

#### add a dictionary for fixed parameters

In [None]:
params_fixed = { 'objective' : 'binary:logistic',
                 'verbosity' : 0}   # replaces the deprecated 'silent' : 1 of older XGBoost releases

Create a GridSearchCV estimator. We will look for the combination giving the best accuracy.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

# fetch data first
X = load_iris().data
#X = preprocessing.scale(X)
y = load_iris().target

k_range = range(1,31)
k_scores = []

for i in k_range:
    knn = KNeighborsClassifier(n_neighbors=i)
    k_scores.append(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
    
print(k_scores)

# plotting
plt.plot(k_range, k_scores)

#### The following will be done in XGBoost
- the dataset will be taken from Mushroom dataset

In [None]:
import xgboost as xgb
from pprint import pprint

# for reproducibility, if you don't want this, you could use time.time() to get different value every time
seed = 123
np.random.seed(seed)

dtrain = xgb.DMatrix('../../data/agaricus.txt.train')
dtest  = xgb.DMatrix('../../data/agaricus.txt.test')

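# train parameters - we are going to use 5 shallow decision trees with a moderate learning rate.
# in older XGBoost versions the default evaluation metric for this objective is 'error'
params = {'objective' : 'binary:logistic',
          'max_depth' : 2,
          'verbosity' : 0,   # replaces the deprecated 'silent' : 1 of older XGBoost releases
          'eta' : 0.5}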
num_rounds = 5
watch_list = [(dtest, 'eval'), (dtrain, 'train')]

# training
bst = xgb.train(params, dtrain, num_rounds, watch_list)

### let's change the error metric to logloss

In [None]:
params['eval_metric'] = 'logloss'
bst = xgb.train(params, dtrain, num_rounds, watch_list)

### we could use multiple error metrics

In [None]:
params['eval_metric'] = ['auc', 'map']
bst = xgb.train(params, dtrain, num_rounds, watch_list)

### Creating custom evaluation metric
To create our own evaluation metric, we only need to write a function taking two arguments: the ***predicted probabilities*** and the ***DMatrix*** object holding the data being evaluated (its labels are the ground truth).

In this example, our classification metric simply counts the number of misclassified examples, assuming that examples with p > 0.5 are positive. You can raise this threshold if you want more certainty before predicting the positive class.

The model is improving when the number of misclassified examples goes down. Remember to also set the argument ***maximize=False*** while training.
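
The effect of the threshold can be tried out without training anything. In the sketch below, `FakeDMatrix` is a hypothetical stand-in that only mimics the `get_label()` method of a real DMatrix, `misclassified_at` is a helper invented for this example, and the probabilities and labels are made up.

```python
import numpy as np

class FakeDMatrix:
    """Hypothetical stand-in for xgb.DMatrix exposing only get_label()."""
    def __init__(self, labels):
        self._labels = np.asarray(labels)

    def get_label(self):
        return self._labels

def misclassified_at(threshold):
    """Build a metric that counts errors for a given probability threshold."""
    def metric(pred_probs, dmatrix):
        labels = dmatrix.get_label()      # true labels
        preds = pred_probs > threshold    # predicted classes
        return 'misclassified', float(np.sum(labels != preds))
    return metric

# Made-up predictions and labels
probs = np.array([0.1, 0.4, 0.85, 0.9, 0.7])
data = FakeDMatrix([0, 0, 1, 1, 0])

print(misclassified_at(0.5)(probs, data))  # the 0.7 prediction is a false positive
print(misclassified_at(0.8)(probs, data))  # a stricter threshold removes it
```

The returned value is a `(name, value)` pair, which is the shape XGBoost expects from a custom metric function.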

In [None]:
def misclassified(pred_probs, dtrain):
    labels = dtrain.get_label()   # obtain true labels
    preds  = pred_probs > 0.5     # obtain predicted values
    return 'misclassified', np.sum(labels != preds)

params['eval_metric'] = []
# the argument order is important! if you switch them, you will get error messages
bst = xgb.train(params, dtrain, num_rounds, watch_list, feval=misclassified, maximize=False)

### Extracting the evaluation results
We can get the evaluation scores by declaring a dictionary to hold them and passing it as the ***evals_result*** argument.

In [None]:
evals_results = {}
bst = xgb.train(params, dtrain, num_rounds, watch_list, feval=misclassified, maximize=False, evals_result=evals_results)

In [None]:
# now reuse these scores for other purposes (such as plotting)
pprint(evals_results)

### Early stopping
There is a nice optimization trick when fitting many trees.

You can train the model until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue. This tends to give a simpler model, because training stops near the smallest number of trees that reaches the best validation score.

In the following example up to 1500 trees may be created, but we tell XGBoost to stop if the validation score has not improved in the last ten iterations.
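
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the idea, not XGBoost's actual implementation; the function name, the `patience` argument and the error values are all made up.

```python
def early_stopping(val_errors, patience):
    """Return (best_iteration, best_error, stopped_at) for a list of validation errors.

    Training stops once the error has not improved for `patience` consecutive rounds.
    """
    best_iter, best_err = 0, float('inf')
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_iter, best_err = i, err      # new best score, keep going
        elif i - best_iter >= patience:
            return best_iter, best_err, i     # no improvement for `patience` rounds
    return best_iter, best_err, len(val_errors) - 1

# Made-up validation errors, one per boosting round
errors = [0.30, 0.22, 0.18, 0.19, 0.18, 0.20, 0.21, 0.17, 0.16]

print(early_stopping(errors, patience=3))   # stops at round 5, best was round 2
print(early_stopping(errors, patience=10))  # never triggers, runs to the end
```

With a patience of 3 the loop stops before reaching the later, better rounds, which shows the trade-off: a small value saves time but can stop too early.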

In [None]:
num_rounds = 1500
params['eval_metric'] = 'error'

bst = xgb.train(params, dtrain, num_rounds, watch_list, early_stopping_rounds=10)

When using the early_stopping_rounds parameter, the resulting model has 3 additional fields: ***bst.best_score***, ***bst.best_iteration*** and ***bst.best_ntree_limit***

- Note: train() returns the model from the last iteration, not the best one
- Note: when the watch list holds several datasets, the last entry is the one used for early stopping

In [None]:
bst_grid = GridSearchCV( estimator=XGBClassifier(**params_fixed, seed=seed),
                         param_grid=params_grid,
                         cv=cv,
                         scoring='accuracy')

Before running the calculations, notice that we will have 3 \* 4 \* 3 \* 10 = 360 models created to test all combinations.
- you should always have rough estimations about what is going to happen

In [None]:
bst_grid.fit(X, y)

Now we can look at all the scores obtained and try to see manually what matters and what does not. A quick glance suggests that the larger n_estimators is, the higher the accuracy.

In [None]:
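# grid_scores_ was removed in scikit-learn 0.20; the same information lives in cv_results_
pd.DataFrame(bst_grid.cv_results_)[['params', 'mean_test_score', 'std_test_score']]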

In [None]:
bst_grid.cv_results_

If there are too many results, we can filter them to get the best combination
- Note: looking for the best parameters is an iterative process. You should start with a coarse grid and then move to finer-grained values.
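
One convenient way to filter a long result list is to load `cv_results_` into a pandas DataFrame and sort by rank. The sketch below uses a small stand-in search (k-nearest neighbours on iris, with arbitrary grid values) so that it runs on its own; the same pattern should apply to any fitted `GridSearchCV` object.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Small stand-in search; the grid values are arbitrary
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 5, 10, 20]},
                      cv=5, scoring='accuracy')
search.fit(X_iris, y_iris)

results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[columns].sort_values('rank_test_score').head(3)  # three best combinations
print(top)
```

Looking at `std_test_score` next to the mean helps to judge whether the gap between the top candidates is meaningful.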

In [None]:
print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

### Randomized Grid-Search
When the number of parameters and their candidate values gets large, the traditional grid-search approach quickly becomes impractical.
- A possible solution is to sample parameter values randomly from their distributions. While it's not exhaustive, it's worth a shot!

In [None]:
# Create a parameters distribution dictionary:
params_dist_grid = { 'max_depth' : [1, 2, 3, 4],
                     'gamma' : [0, 0.5, 1],
                     'n_estimators' : randint(1, 1001),   # uniform discrete distribution over 1..1000
                     'learning_rate' : uniform(),         # uniform distribution on [0, 1]
                     'subsample' : uniform(),             # uniform distribution on [0, 1]
                     #'colsample_bytree' : uniform()      # uniform distribution on [0, 1]
                   }

Initialize ***RandomizedSearchCV*** to randomly pick 10 combinations of parameters. 
- with this approach you can easily control the number of tested models
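
To see what "randomly pick 10 combinations" looks like, scikit-learn's `ParameterSampler` draws candidates from a distribution dictionary without fitting any model. The sketch below uses a reduced dictionary similar to the one in this section; the name `demo_dist` and the seed are arbitrary.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

demo_dist = {'max_depth': [1, 2, 3, 4],          # lists are sampled uniformly
             'n_estimators': randint(1, 1001),   # integers 1..1000
             'learning_rate': uniform(),         # floats in [0, 1]
             'subsample': uniform()}             # floats in [0, 1]

# Draw 10 candidate parameter sets; a fixed seed makes the draw reproducible
samples = list(ParameterSampler(demo_dist, n_iter=10, random_state=342))

for candidate in samples[:3]:
    print(candidate)
```

The number of draws (`n_iter`) directly sets the number of candidates, regardless of how many parameters the dictionary contains.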