# Cross-validation

Cross-validation is a technique used to assess how a statistical analysis will generalize to an independent data set.

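When creating a predictive model, the model is trained on one dataset, called the training dataset. Its accuracy is then tested on a separate dataset that the model has not seen, called the testing dataset. Cross-validation repeats this train-and-test procedure over several different splits of the same data, so that every sample is used for both training and testing.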
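The simplest version of this idea is a single train/test split. The sketch below is illustrative only: the choice of `train_test_split`, the decision tree, the 25% test fraction, and the fixed random seeds are assumptions made for the example, not settings taken from this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the samples; the model never sees them during training.
# (test_size and random_state are illustrative choices.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=10, random_state=0)
clf.fit(X_train, y_train)

# Accuracy on data the model was fitted to vs. data it has not seen.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

A single split gives one estimate that depends on which samples happen to land in the test set, which is the motivation for repeating the split several times.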

Useful links
* https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
* http://scikit-learn.org/stable/modules/cross_validation.html
* http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
* https://en.wikipedia.org/wiki/F1_score

In [35]:
import sklearn
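# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection provides the same helpers, aliased here to the old name.
from sklearn import model_selection as cross_validation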
import pandas as pd

scikit-learn makes it easy to use several methods of cross-validation. A basic approach is k-fold cross-validation. The dataset is split into k smaller sets (folds); one fold is used to validate the model while the remaining k-1 folds are used to train it. This is repeated k times, so that each fold serves as the validation set exactly once. The performance measure reported by k-fold cross-validation is the average of the values computed over these k rounds.
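The k-fold procedure described above can be written out by hand. This is a minimal sketch, assuming `KFold` with shuffling, k = 5, a decision tree, and fixed seeds; those choices are illustrative and are not the settings used later in this post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# k = 5: every sample ends up in the validation fold exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in kfold.split(X):
    # Train a fresh model on the k-1 training folds ...
    clf = DecisionTreeClassifier(max_depth=10, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # ... and score it on the one fold that was held out.
    fold_scores.append(clf.score(X[valid_idx], y[valid_idx]))

# The reported performance is the average over the k rounds.
mean_score = float(np.mean(fold_scores))
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(mean_score, 3))
```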

In [36]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

The "Wisconsin Breast Cancer" dataset is used to demonstrate cross-validation.

In [37]:
target = pd.Series(dataset.target, dtype='category')
# rename_categories returns a new Series (the inplace argument was removed in pandas 2.0)
target = target.cat.rename_categories(['malignant', 'benign'])

This dataset has 569 samples, of which 357 are benign and 212 are malignant. It contains 30 features in total; only the first ten (the mean values of each measurement) are used here to predict breast cancer.
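The sample and class counts can be checked directly from the loaded data. This is a small sketch; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Label 0 is 'malignant' and label 1 is 'benign' (see dataset.target_names).
counts = np.bincount(dataset.target)
class_counts = {str(name): int(n)
                for name, n in zip(dataset.target_names, counts)}

print(dataset.data.shape)  # (samples, features) of the full dataset
print(class_counts)
```

The full feature matrix has 30 columns; this post works with the first ten.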

In [38]:
column_names = [
    'radius', 'texture', 'perimeter', 'area',
    'smoothness', 'compactness', 'concavity', 'concave_points',
    'symmetry', 'fractal_dimension']
df = pd.DataFrame(data=dataset.data[:, :10], columns=column_names)

In addition to precision and recall, the F1 score is calculated. The F1 score is the harmonic mean of precision and recall, and weights the two equally. An F1 score reaches its highest value at 1 and its lowest value at 0.

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
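As a worked example of the formula, the sketch below computes precision, recall, and F1 from raw counts. The helper `f1_from_counts` and the counts (90 true positives, 10 false positives, 30 false negatives) are made up for illustration and are not results from this post.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
precision, recall, f1 = f1_from_counts(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# precision=0.900 recall=0.750 F1=0.818
```

The harmonic mean (0.818) is slightly below the arithmetic mean of precision and recall (0.825), because it is pulled toward the smaller of the two values.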

In [39]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

def get_metrics(clf, data, target, name):
    # Each sample is predicted by a model trained on the other 4 folds.
    predicted = cross_val_predict(clf, data, target, cv=5)
    # Every metric is a single scalar computed over the pooled out-of-fold
    # predictions; the positive class is label 1 (benign).
    return {
        'classifier': name,
        'accuracy': accuracy_score(target, predicted),
        'precision': precision_score(target, predicted),
        'recall': recall_score(target, predicted),
        'f1_score': f1_score(target, predicted)
    }
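An alternative to pooling the out-of-fold predictions is to score each fold separately and average the per-fold values. The sketch below does this with `cross_validate` and a list of scorer names from scikit-learn's scoring parameter; the decision tree and its fixed seed are illustrative choices, and `fold_means` is a name introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :10]  # the first ten features, as in this post

clf = DecisionTreeClassifier(max_depth=10, random_state=0)

# One score per fold and per metric; results are keyed as 'test_<name>'.
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(clf, X, y, cv=5, scoring=scoring)

# Average each metric over the 5 folds.
fold_means = {name: float(cv_results[f'test_{name}'].mean())
              for name in scoring}
print(fold_means)
```

Per-fold averages and pooled-prediction metrics usually agree closely but need not be identical, since precision, recall, and F1 are not simple averages over samples.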

In [40]:
from sklearn import linear_model
# C is the inverse of the regularization strength (smaller values mean stronger regularization)
logreg = linear_model.LogisticRegression(C=1e5)
result1a = get_metrics(logreg, df.values, dataset.target, 'logistic regression')

In [41]:
from sklearn.svm import SVC
clf = SVC(kernel='rbf')
result2a = get_metrics(clf, df.values, dataset.target, 'support vector (radial basis)')

In [42]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=10)
result3a = get_metrics(clf, df.values, dataset.target, 'decision tree')

In [43]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)
result4a = get_metrics(clf, df.values, dataset.target, 'random forest')

The four classifiers (logistic regression, support vector machine, decision tree, and random forest) are compared on their cross-validation scores. They perform much worse on held-out test data than on the data they were trained on. Compare these results with those in the previous post.

In [44]:
pd.DataFrame([result1a, result2a, result3a, result4a], columns=['classifier', 'accuracy', 'precision', 'recall', 'f1_score'])

| | classifier | accuracy | precision | recall | f1_score |
|---|---|---|---|---|---|
| 0 | logistic regression | 0.926186 | 0.938719 | 0.943978 | 0.941341 |
| 1 | support vector (radial basis) | 0.717047 | 0.704167 | 0.946779 | 0.807646 |
| 2 | decision tree | 0.905097 | 0.922006 | 0.927171 | 0.924581 |
| 3 | random forest | 0.947276 | 0.955432 | 0.960784 | 0.958101 |
