# Module 2: Cross Validation - Practice

In this practice you will apply **20-fold cross validation** to a **Gaussian Naive Bayes model** fitted to the **titanic** dataset.

+ Look for **placeholders** in the code and fill in the blanks.
+ Expect requirements in **bold** font when provided.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [13]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB

np.random.seed(18937) # please ignore and leave this line alone

## Load Dataset

Load the dataset from a CSV file into a pandas DataFrame and shuffle its rows.

In [14]:
# Dataset location
DATASET = 'datasets/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890 entries, 0 to 889
Data columns (total 8 columns):
pclass      890 non-null int64
sex         890 non-null int64
age         890 non-null float64
sibsp       890 non-null int64
parch       890 non-null int64
fare        890 non-null float64
embarked    890 non-null int64
survived    890 non-null int64
dtypes: float64(2), int64(6)
memory usage: 55.7 KB
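
The load-and-shuffle idiom used above can be tried in isolation. This is a minimal sketch on a hypothetical four-row frame (the `toy` frame and its values are made up for illustration and are not taken from titanic.csv). It passes `random_state` to make the shuffle repeatable, whereas the notebook presumably relies on the `np.random.seed` call for that.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic.csv
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0],
                    "survived": [0, 1, 1, 0]})

# sample(frac=1) returns every row exactly once, in random order;
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

Shuffling before splitting matters here because the manual cross validation in Part 2 cuts the rows into contiguous folds.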


## Part 1: Cross validation with sklearn

Run 20-fold cross validation with cross_val_score() from sklearn.model_selection.

In [15]:
model = GaussianNB()

X = np.array(dataset.iloc[:,:-1])
y = np.array(dataset.survived)

sklearn.model_selection.cross_val_score(model, X, y, cv=20) # <placeholder>

array([ 0.76086957,  0.60869565,  0.7826087 ,  0.8       ,  0.68888889,
        0.75555556,  0.77777778,  0.84090909,  0.77272727,  0.77272727,
        0.79545455,  0.77272727,  0.77272727,  0.79545455,  0.75      ,
        0.72727273,  0.72727273,  0.79545455,  0.88636364,  0.77272727])
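
Twenty per-fold accuracies are easier to compare once they are reduced to a mean and a standard deviation. The sketch below shows one way to do that. It assumes synthetic data from `make_classification` in place of titanic.csv, so the names `X_demo` and `y_demo` are illustrative and the numbers it prints will not match the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: synthetic binary data with the same shape as the notebook's X (890 x 7)
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)

# One accuracy per fold, returned as a NumPy array of length 20
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=20)

print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```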

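## Part 2: Create cross validation manually

Implement the same 20-fold cross validation by hand, splitting the data into contiguous folds with np.array_split().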

In [28]:
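def cross_val_score(model, X, y, cv):
    """Yield one test accuracy per fold of a manual cv-fold cross validation."""
    # Split into cv contiguous folds of near-equal size
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))

    for i in range(cv):
        # Fold i is held out for testing; the other folds form the training set
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)  # mean accuracy on the held-out fold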

print("Cross validation:")
for i, score in enumerate(cross_val_score(model, X, y, cv=20)):
    print(('\tscore[%d] ='%i), score)

Cross validation:
X_folds Counter({(44, 7): 10, (45, 7): 10}) y_folds Counter({(45,): 10, (44,): 10})
	score[0] = 0.733333333333
	score[1] = 0.688888888889
	score[2] = 0.711111111111
	score[3] = 0.822222222222
	score[4] = 0.688888888889
	score[5] = 0.755555555556
	score[6] = 0.755555555556
	score[7] = 0.844444444444
	score[8] = 0.777777777778
	score[9] = 0.777777777778
	score[10] = 0.795454545455
	score[11] = 0.772727272727
	score[12] = 0.727272727273
	score[13] = 0.772727272727
	score[14] = 0.840909090909
	score[15] = 0.681818181818
	score[16] = 0.659090909091
	score[17] = 0.863636363636
	score[18] = 0.886363636364
	score[19] = 0.772727272727
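
The manual scores differ from those in Part 1 because the two approaches split the data differently. When `cv` is an integer and the model is a classifier, sklearn's `cross_val_score` uses `StratifiedKFold`, which keeps the class ratio roughly equal across folds, while the manual loop uses plain contiguous folds. The sketch below checks this by passing an unshuffled `KFold` splitter, which should reproduce the manual folds. It assumes synthetic data from `make_classification` instead of titanic.csv, and `manual_scores` is a hypothetical list-returning variant of the generator above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: synthetic binary data with the same shape as the notebook's X (890 x 7)
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)
cv = 20

def manual_scores(model, X, y, cv):
    """Hypothetical helper: contiguous-fold cross validation, returned as an array."""
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    scores = []
    for i in range(cv):
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)

manual = manual_scores(GaussianNB(), X_demo, y_demo, cv)

# Unshuffled KFold: contiguous folds, no stratification
kfold = cross_val_score(GaussianNB(), X_demo, y_demo, cv=KFold(n_splits=cv))

# Integer cv with a classifier: sklearn switches to StratifiedKFold
stratified = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv)

print("manual matches KFold:", np.allclose(manual, kfold))
print("mean (manual / stratified): %.3f / %.3f" % (manual.mean(), stratified.mean()))
```

Neither split is wrong. Stratification mainly reduces the fold-to-fold variation that comes from uneven class balance.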


# Save your notebook!