### Example workflow for cross-validating a machine learning algorithm in Python with scikit-learn

Trusting your local CV score over the public leaderboard (LB) score is the first principle every Kaggler should know. In the previous notebook we already used cross-validation: `GridSearchCV` searched for the optimal hyperparameters based on the CV score.

This time we calculate the CV score for a given algorithm, which can also be viewed as a form of model evaluation. The code in this notebook is based on the following reference: https://www.kaggle.com/ogrellier/lgbm-with-words-and-chars-n-gram/code

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

In [2]:
iris = load_iris()
X, y = iris.data[:, :2], iris.target
X.shape, y.shape

((150, 2), (150,))

#### Calculate cross validation score for a given model (estimator)

The simplest option is `cross_val_score()`, which works whenever the estimator implements `fit` and `predict`. However, it returns only the scores, so we cannot see which part of the data was used in each fold.

In [3]:
lr_clf = LogisticRegression()
lr_clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [4]:
cv_scores = cross_val_score(lr_clf, X, y, scoring='accuracy', cv=10)
cv_scores

array([ 0.73333333,  0.8       ,  0.66666667,  0.8       ,  0.6       ,
        0.73333333,  0.8       ,  0.8       ,  0.8       ,  0.86666667])

In [5]:
np.mean(cv_scores)

0.76000000000000001
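One way to see which rows end up in each fold is to build the splitter explicitly. For a classifier, an integer `cv` makes scikit-learn use `StratifiedKFold` without shuffling, so the sketch below should give the same folds as `cv=10`. The variable names (`skf`, `scores_int`, `scores_skf`, `fold_class_counts`) are illustrative, and the scores themselves depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()
X, y = iris.data[:, :2], iris.target

# Explicit splitter: stratified, unshuffled, 10 folds
skf = StratifiedKFold(n_splits=10)

# Same estimator scored with an integer cv and with the explicit splitter
scores_int = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
scores_skf = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=skf)

# Number of samples of each class in every validation fold
fold_class_counts = [np.bincount(y[val_idx], minlength=3)
                     for _, val_idx in skf.split(X, y)]

print(np.allclose(scores_int, scores_skf))
print(fold_class_counts[0])
```

Because the iris classes are balanced (50 samples each), every stratified validation fold holds the same number of samples per class.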

#### Calculate the cross-validation score on manually split data

In [6]:
scores = []
folds = KFold(n_splits=10, shuffle=True, random_state=233)
type(folds)

sklearn.model_selection._split.KFold
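Before fitting anything, it can help to check what `KFold.split()` actually yields. The sketch below is a minimal check on a placeholder array with the same number of rows as the iris data (`X_dummy`, `val_sizes`, and `seen` are hypothetical names); it only inspects the index arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 150                    # same number of rows as the iris data
X_dummy = np.zeros((n_samples, 1)) # placeholder features, only the length matters

folds = KFold(n_splits=10, shuffle=True, random_state=233)

val_sizes = []  # size of each validation fold
seen = []       # every validation index, collected across folds
for n_fold, (train_idx, val_idx) in enumerate(folds.split(X_dummy)):
    # Training and validation indices never overlap within a fold
    assert len(np.intersect1d(train_idx, val_idx)) == 0
    val_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())

print(val_sizes)
print(sorted(seen) == list(range(n_samples)))
```

Each sample appears in exactly one validation fold, so every row is used for validation once and for training in the remaining folds.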

In [7]:
print(' Start cross validation . . .\n')
for n_fold, (train_idx, val_idx) in enumerate(folds.split(X, y)):
    
    # prepare data
    X_train, y_train, X_val, y_val = X[train_idx], y[train_idx], X[val_idx], y[val_idx]
    
    # fit model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    
    # evaluate model
    y_pred_val = clf.predict(X_val)
    curr_score = accuracy_score(y_val, y_pred_val)
    scores.append(curr_score)
    print('\t Fold {} accuracy: {}'.format(n_fold + 1, curr_score))
    
print('\n Total CV score is {}'.format(np.mean(scores)))    

 Start cross validation . . .

	 Fold 1 accuracy: 0.5333333333333333
	 Fold 2 accuracy: 0.8666666666666667
	 Fold 3 accuracy: 0.8
	 Fold 4 accuracy: 0.6666666666666666
	 Fold 5 accuracy: 0.7333333333333333
	 Fold 6 accuracy: 0.26666666666666666
	 Fold 7 accuracy: 0.9333333333333333
	 Fold 8 accuracy: 0.8
	 Fold 9 accuracy: 0.6666666666666666
	 Fold 10 accuracy: 0.8

 Total CV score is 0.7066666666666667
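The manual loop and `cross_val_score()` should agree when both use the same splitter. The following sketch, with illustrative names `manual_scores` and `auto_scores`, passes the `KFold` object as `cv` and compares the per-fold accuracies. It also suggests why the two sections of this notebook report different means: an integer `cv` uses stratified, unshuffled folds for classifiers, while the `KFold` here is shuffled and not stratified.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
X, y = iris.data[:, :2], iris.target
folds = KFold(n_splits=10, shuffle=True, random_state=233)

# Manual loop: fit on the training indices, score on the validation indices
manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    clf = LogisticRegression()
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))

# Same folds, handled by cross_val_score
auto_scores = cross_val_score(LogisticRegression(), X, y,
                              scoring='accuracy', cv=folds)

print(np.allclose(manual_scores, auto_scores))
```

With a fixed integer `random_state`, the splitter produces the same folds on every call to `split()`, which is what makes the comparison meaningful.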
