# Cross-Validation

## Evaluation with Cross-Validation
To evaluate different methods and parameters with as little risk of overfitting as possible, one always faces the same question: which data do I train my methods on, and which data do I test them on? The result of the evaluation obviously depends strongly on the concrete choice of test and training set.

A method of systematic evaluation that is well established in the literature is cross-validation (Link: [Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)). The basic idea of k-fold cross-validation (Link: [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#k-fold_cross-validation)) is as follows: the full set of class-annotated records $T$ is randomly split into $k$ equally sized subsets (folds) $T_1 \dots T_k$. Then $k$ test iterations $i = 1 \dots k$ are performed. In iteration $i$, the subset $T_i$ serves as the test set and the remaining data $T \setminus T_i$ as the training set, so every record is used for testing exactly once. The overall result of the cross-validation is the mean of the accuracies of the individual iterations.
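The partitioning described above can be sketched as follows. This is only a minimal illustration, not the reference solution of the exercise: the helper name `kfold_indices` and the fixed seed are assumptions made for this example. If the number of records is not divisible by $k$, the fold sizes differ by at most one.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration: shuffles the indices
    0..n_samples-1 once and splits them into k folds of (almost) equal size.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        # Fold i is the test set, all other folds together form the training set.
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each index appears in exactly one test fold, which is what makes the averaged accuracy an estimate over the whole data set.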

Other methods include holdout (Link: [Holdout](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#Holdout_method)), nested cross-validation (Link: [Nested cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#Nested_cross-validation)), etc.
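For comparison, the holdout method uses a single random split instead of $k$ rotating ones. A minimal sketch, assuming a hypothetical helper `holdout_split` and a test fraction of 30 % that is chosen purely for illustration:

```python
import numpy as np

def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for one random holdout split.

    Hypothetical helper for illustration; test_fraction is an assumed default.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled indices are held out for testing.
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))
```

Because only one split is evaluated, the estimate depends more strongly on which records happen to land in the test set. k-fold cross-validation averages out exactly this dependence.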

<figure>
<img src="./Figures/k-fold-cross-validation.png" alt="drawing" style="width:600px;">
    <figcaption>k-fold cross-validation, source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/500px-K-fold_cross_validation_EN.svg.png
        </figcaption>
</figure>

Extend your implementation of the KNN algorithm from the previous part with the <b>k-fold cross-validation</b> procedure. Choose a suitable value for the number of folds, or experiment with different values.

In [None]:
import numpy as np
import csv as csv
import matplotlib.pyplot as plt
import pandas as pd
import itertools
%matplotlib inline

variables = ["Age", "Pclass",  "Sex", "Fare", "SibSp", "Parch"]
# (1) Fill gaps in the data: impute missing Age and Fare values with the per-sex median
def prepareData(df):
    df.loc[(df.Age.isnull()) & (df.Sex == 'male'),'Age'] = df[(df.Sex == 'male')].Age.median()
    df.loc[(df.Age.isnull()) & (df.Sex == 'female'),'Age'] = df[(df.Sex == 'female')].Age.median()
    df.loc[df.Age.isnull(),'Age']= df.Age.median()
    df.loc[(df.Fare.isnull()) & (df.Sex == 'male'),'Fare'] = df[(df.Sex == 'male')].Fare.median()
    df.loc[(df.Fare.isnull()) & (df.Sex == 'female'),'Fare'] = df[(df.Sex == 'female')].Fare.median()
    df.loc[df.Fare.isnull(),'Fare']= df.Fare.median()
    df.loc[(df.Sex == 'male'), 'Sex'] = 1
    df.loc[(df.Sex == 'female'), 'Sex'] = 0

    return df

def extractFeatureVector(row):
    return np.array([row.Age, row.Pclass, row.Sex, row.Fare, row.SibSp, row.Parch])

class KNN(object):
    
    def __init__(self, k):
        self.k = k

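    def euklid(self, vector1, vector2):
        # Euclidean distance over all features
        total = 0
        for item in range(len(vector1)):
            total += (vector1[item] - vector2[item])**2
        return np.sqrt(total)
    
    def cityBlock(self, vector1, vector2):
        # City block (Manhattan) distance over all features
        total = 0
        for item in range(len(vector1)):
            total += abs(vector1[item] - vector2[item])
        return total

    def fit(self, df):
        # KNN is a lazy learner: fitting only stores the training data
        self.trainData = df
        self.trainLabel = [0, 1]  # class labels: index 0 = did not survive, index 1 = survived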
    
    def predict(self, x):
        distances = list()
        for row in self.trainData.itertuples():
            dst = self.cityBlock(extractFeatureVector(row), extractFeatureVector(x))
            distances.append((row, dst))
        distances.sort(key=lambda tup: tup[1])
        # Majority vote among the k nearest neighbours
        survivedCount = 0
        for i in range(self.k):
            survivedCount += distances[i][0].Survived
        if survivedCount > self.k / 2:
            return self.trainLabel[1]
        return self.trainLabel[0]


def testFunction(df_train_norm, df_test_norm):
    knn = KNN(3)
    knn.fit(df_train_norm)
    truePositives = 0
    trueNegatives = 0
    falsePositives = 0
    falseNegatives = 0
    for item in df_test_norm.itertuples():
        predicted = knn.predict(item)
        # Sort each prediction into the confusion matrix
        if predicted == 1 and item.Survived == 1:
            truePositives += 1
        elif predicted == 0 and item.Survived == 0:
            trueNegatives += 1
        elif predicted == 1 and item.Survived == 0:
            falsePositives += 1
        else:
            falseNegatives += 1
    return (truePositives+trueNegatives)/(truePositives+trueNegatives+falsePositives+falseNegatives)


def normalize(df, variables, means, stds):
    # z-score normalization on a copy, so the caller's DataFrame (and the folds) stay unchanged
    result = df.copy()
    for v in variables:
        result[v] = (result[v].astype(float) - means[v]) / stds[v]
    return result

def calcNormModel(df, variables):
    # Per-feature mean and population standard deviation
    means = {}
    stds = {}
    for x in variables:
        values = df[x].astype(float)
        means[x] = values.mean()
        stds[x] = values.std(ddof=0)
    return means, stds

In [None]:
def createFolds(df, crossK):
    # Split df into crossK consecutive folds of (almost) equal size
    folds = []
    for i in range(crossK):
        limitLower = round(((i/crossK)*len(df)))
        limitUpper = round((((i+1)/crossK)*len(df)))
        folds.append(df[limitLower:limitUpper])
    return folds


In [None]:
def crossFold(df, crossK):
    folds = createFolds(df, crossK)
    mean = 0
    for i in range(len(folds)):
        # Fold i is the test set, all remaining folds form the training set
        df_test = folds[i]
        df_train = pd.concat([x for j, x in enumerate(folds) if j != i])

        # Estimate the normalization parameters on the training data only,
        # then apply them to both training and test data
        means, stds = calcNormModel(df_train, variables)
        df_train_normalized = normalize(df_train, variables, means, stds)
        df_test_normalized = normalize(df_test, variables, means, stds)

        result = testFunction(df_train_normalized, df_test_normalized)
        print("run " + str(i) + ": " + str(result))
        mean += result
    print(mean/crossK)
        


In [None]:

DATA_FILE = './Data/original_titanic.csv'

df_original = pd.read_csv(DATA_FILE, header=0)

df_prepared = prepareData(df_original)

df_shuffled =  df_prepared.sample(frac=1)

for i in range(2, 10):
    print("crossK: " + str(i))
    crossFold(df_shuffled, i)

In [None]:
crossK: 2
run 0: 0.7446483180428135
run 1: 0.6274809160305344
0.686064617036674
crossK: 3
run 0: 0.6788990825688074
run 1: 0.8054919908466819
run 2: 0.6077981651376146
0.6973964128510346
crossK: 4
run 0: 0.6941896024464832
run 1: 0.3363914373088685
run 2: 0.6859756097560976
run 3: 0.5963302752293578
0.5782217311852018
crossK: 5
run 0: 0.6145038167938931
run 1: 0.549618320610687
run 2: 0.735632183908046
run 3: 0.6564885496183206
run 4: 0.5877862595419847
0.6288058260945862
crossK: 6
run 0: 0.6238532110091743
run 1: 0.5642201834862385
run 2: 0.42660550458715596
run 3: 0.410958904109589
run 4: 0.7293577981651376
run 5: 0.5596330275229358
0.5524381048133719
crossK: 7
run 0: 0.732620320855615
run 1: 0.39572192513368987
run 2: 0.7272727272727273
run 3: 0.6096256684491979
run 4: 0.45989304812834225
run 5: 0.6203208556149733
run 6: 0.5828877005347594
0.5897631779984722
crossK: 8
run 0: 0.6219512195121951
run 1: 0.3312883435582822
run 2: 0.6341463414634146
run 3: 0.7300613496932515
run 4: 0.6524390243902439
run 5: 0.5853658536585366
run 6: 0.6134969325153374
run 7: 0.5975609756097561
0.5957887550501273
crossK: 9
run 0: 0.6758620689655173
run 1: 0.4452054794520548
run 2: 0.6068965517241379
run 3: 0.6506849315068494
run 4: 0.593103448275862
run 5: 0.6095890410958904
run 6: 0.5172413793103449
run 7: 0.5753424657534246
run 8: 0.6068965517241379
0.58675799086758

In [None]:
crossK: 2
0.7478850059527977
crossK: 3
0.7509464726835037
crossK: 4
0.7524683933765943
crossK: 5
0.7456114182094703
crossK: 6
0.7509460572801028
crossK: 7
0.7555385790679907
crossK: 8
0.7510100254376777
crossK: 9
0.7532042198079042

In [None]:
crossK: 2
run0: 0.753822629969419
run1: 0.7297709923664122
crossK: 3
run0: 0.7752293577981652
run1: 0.7322654462242563
run2: 0.7362385321100917
crossK: 4
run0: 0.7889908256880734
run1: 0.7522935779816514
run2: 0.7560975609756098
run3: 0.7431192660550459
crossK: 5
run0: 0.7709923664122137
run1: 0.7404580152671756
run2: 0.7777777777777778
run3: 0.7633587786259542
run4: 0.732824427480916
crossK: 6
run0: 0.7752293577981652
run1: 0.7477064220183486
run2: 0.7201834862385321
run3: 0.7671232876712328
run4: 0.7431192660550459
run5: 0.7201834862385321
crossK: 7
run0: 0.7647058823529411
run1: 0.7700534759358288
run2: 0.7593582887700535
run3: 0.7647058823529411
run4: 0.7593582887700535
run5: 0.7486631016042781
run6: 0.7165775401069518
crossK: 8
run0: 0.7682926829268293
run1: 0.8220858895705522
run2: 0.7560975609756098
run3: 0.7423312883435583
run4: 0.7865853658536586
run5: 0.7439024390243902
run6: 0.7852760736196319
run7: 0.6951219512195121
crossK: 9
run0: 0.7448275862068966
run1: 0.821917808219178
run2: 0.7586206896551724
run3: 0.7054794520547946
run4: 0.7586206896551724
run5: 0.7534246575342466
run6: 0.7655172413793103
run7: 0.7465753424657534
run8: 0.7241379310344828