# Module 2: Cross-Validation - Practice

In this practice you will apply **20-fold cross-validation** to a **Gaussian Naive Bayes model** 
fitted to the **Titanic** dataset.

+ Look for **placeholders** in the code and fill in the appropriate code.
+ Requirements, when provided, appear in **bold** font.
+ The presentation of printouts is not strict, as long as they are readable and equivalent.


In [1]:
import os, sys
from collections import Counter
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB

np.random.seed(18937) # please ignore and leave this line alone

## Load Dataset

Load the dataset from the CSV file into a DataFrame and shuffle its rows.

In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| count | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 | 890.0 |
| mean | 2.31236 | 0.642697 | 29.548697 | 0.503371 | 0.351685 | 32.865772 | 0.895506 | 0.389888 |
| std | 0.837241 | 0.479475 | 13.379025 | 1.095286 | 0.790069 | 52.639685 | 0.529535 | 0.487999 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 1.0 | 0.0 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 13.775 | 1.0 | 0.0 |
| 75% | 3.0 | 1.0 | 37.0 | 1.0 | 0.0 | 29.925 | 1.0 | 1.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 2.0 | 1.0 |


In [3]:
dataset.head()

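| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 30.0 | 0 | 0 | 164.8667 | 1 | 1 |
| 1 | 2 | 1 | 36.0 | 0 | 0 | 10.5 | 1 | 0 |
| 2 | 3 | 1 | 19.0 | 0 | 0 | 0.0 | 1 | 0 |
| 3 | 3 | 1 | 29.0 | 3 | 1 | 22.025 | 1 | 1 |
| 4 | 3 | 1 | 18.0 | 1 | 1 | 20.2125 | 1 | 0 |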


## Part 1: Cross-validation with sklearn

Make a **20-fold** cross-validation using `cross_val_score()` provided by sklearn.

In [None]:
model = GaussianNB()

X = np.array(<placeholder>)
y = np.array(<placeholder>)

sklearn.model_selection.cross_val_score(<placeholder>, <placeholder>, <placeholder>, cv=<placeholder>)

In [4]:
model = GaussianNB()

X = np.array(dataset.iloc[:,:-1])
y = np.array(dataset.survived)

sklearn.model_selection.cross_val_score(model,X,y,cv=20)

array([ 0.76086957,  0.60869565,  0.7826087 ,  0.8       ,  0.68888889,
        0.75555556,  0.77777778,  0.84090909,  0.77272727,  0.77272727,
        0.79545455,  0.77272727,  0.77272727,  0.79545455,  0.75      ,
        0.72727273,  0.72727273,  0.79545455,  0.88636364,  0.77272727])
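
`cross_val_score()` returns one accuracy value per fold, so the 20 numbers above are usually summarized by their mean and spread. The following is a minimal sketch of that summary. It assumes a synthetic stand-in dataset (`X_demo`, `y_demo`, generated with `make_classification` to have the same 890 rows and 7 feature columns), because the course file path is only available inside the course environment; the names are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: synthetic data standing in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)

# One accuracy score per fold
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=20)

# Summarize the per-fold scores
print("folds: %d  mean: %.3f  std: %.3f" % (len(scores), scores.mean(), scores.std()))
```

In the notebook itself, the same two calls (`.mean()` and `.std()`) can be applied directly to the array returned in the cell above.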

## Part 2: Create cross-validation manually

Run a 20-fold cross-validation without using the cross-validation scoring function provided by scikit-learn.
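
The manual version relies on `np.array_split`, which divides an array into a given number of contiguous pieces and, unlike `np.split`, tolerates a length that is not evenly divisible. A small sketch on a toy array (the array and variable names are illustrative only):

```python
import numpy as np

# Split 10 elements into 3 contiguous parts; the earlier parts absorb the remainder
parts = np.array_split(np.arange(10), 3)
sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]

# Concatenating the parts restores the original order
restored = np.concatenate(parts)
```

With 890 rows and 20 folds the same rule gives 10 folds of 45 rows followed by 10 folds of 44 rows.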

In [None]:
def cross_val_score(model, X, y, cv):
    X_folds = np.array_split(<placeholder>, <placeholder>)
    y_folds = np.array_split(<placeholder>, <placeholder>)
    print('X_folds', Counter([i.shape for i in X_folds]), 'y_folds', Counter([i.shape for i in y_folds]))
    
    for i in range(cv):
        X_train = np.concatenate([X_folds[<placeholder>] for j in range(cv) if <placeholder>])
        X_test = X_folds[<placeholder>]
        y_train = np.concatenate([y_folds[<placeholder>] for j in range(cv) if <placeholder>])
        y_test = y_folds[<placeholder>]
        model.fit(<placeholder>, <placeholder>)
        yield model.score(<placeholder>, <placeholder>)

print("Cross-validation:")
for i, score in enumerate(cross_val_score(model, X, y, cv=<placeholder>)):
    print(('\tscore[%d] ='%i), score)

In [5]:
def cross_val_score(model, X, y, cv):
    # Split features and labels into `cv` contiguous folds
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    print('X_folds',Counter([i.shape for i in X_folds]),'y_folds', Counter([i.shape for i in y_folds]))
    
    for i in range(cv):
        # Train on every fold except fold i, then score on the held-out fold i
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)
        
        
print("Cross-validation:")
for i,score in enumerate(cross_val_score(model,X,y,cv=20)):
    print(('\tscore[%d]='%i),score)

Cross-validation:
X_folds Counter({(44, 7): 10, (45, 7): 10}) y_folds Counter({(45,): 10, (44,): 10})
	score[0]= 0.733333333333
	score[1]= 0.688888888889
	score[2]= 0.711111111111
	score[3]= 0.822222222222
	score[4]= 0.688888888889
	score[5]= 0.755555555556
	score[6]= 0.755555555556
	score[7]= 0.844444444444
	score[8]= 0.777777777778
	score[9]= 0.777777777778
	score[10]= 0.795454545455
	score[11]= 0.772727272727
	score[12]= 0.727272727273
	score[13]= 0.772727272727
	score[14]= 0.840909090909
	score[15]= 0.681818181818
	score[16]= 0.659090909091
	score[17]= 0.863636363636
	score[18]= 0.886363636364
	score[19]= 0.772727272727
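
The manual scores differ from those in Part 1 because, when `cv` is an integer and the estimator is a classifier, sklearn's `cross_val_score()` builds stratified folds, whereas the manual loop uses plain contiguous folds. The sketch below checks that the manual splitting scheme corresponds to passing an unshuffled `KFold` splitter instead. It assumes a synthetic stand-in dataset, and `manual_cv_scores` is a hypothetical helper that mirrors the manual loop but returns an array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def manual_cv_scores(model, X, y, cv):
    """Hypothetical helper: contiguous-fold cross-validation, as in the manual loop."""
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    scores = []
    for i in range(cv):
        # Train on all folds except fold i, score on fold i
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)

# Assumption: synthetic data standing in for the Titanic features and labels
X_demo, y_demo = make_classification(n_samples=890, n_features=7, random_state=0)

manual = manual_cv_scores(GaussianNB(), X_demo, y_demo, cv=20)
library = cross_val_score(GaussianNB(), X_demo, y_demo, cv=KFold(n_splits=20))

# Compare the two sets of per-fold scores
same = np.allclose(manual, library)
print(same)
```

Stratified folds keep the class ratio roughly constant across folds, which is often preferable for classification; the contiguous version is used here only because it is simpler to write by hand.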


# Save your notebook!