# K-fold cross validation

Implement a random k-fold cross validation algorithm from scratch.

Your algorithm should:
- load the iris dataset and split its columns into features and target
- split the dataset into k-fold to perform cross validation

You can use the code bellow to implement your algorithm or implement yourself from scratch.



In [71]:
# we will implement a k-fold cross validation from scratch
# we will use the iris dataset

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

def k_fold_cross_validation(X, y, k, model):
    accuracies = [] #initialisation de la variable accuracies
    # X is the data
    # y is the target
    # k is the number of folds
    # model is the model to use
    # we will return the accuracy of the model
    # we will use the accuracy as a metric
  

    #################################

    # shuffle the data and create X and y ready to be used to fit the model
    # in a way that if I say X[0] the algorithm will return the first fold of the data, the same for y

    #################################
    
    # we will need to define a for loop to iterate over the folds and guarantee that each fold is used as a test set at least once
    # inside this for loop we will call the functions fit and accuracy for each one of the folds
    # X_train, y_train, X_test, y_test are build each time the for loop is called by using X and y divided before
    
    x_y_stack = np.column_stack((X, y)) # on refusione les colonnes avec le label et les datas ensemble
    np.random.shuffle(x_y_stack) # on effectue le redandom shuffle
    #on resépare les colonnes entre les datas et les labels
    X = x_y_stack[:, :-1]  
    y = x_y_stack[:, -1]
    # on prend 80% des colonnes pour le test
    test_size = 0.8 
    #creation des fold 
    fold_size = len(X) // k
    #on split les datas pour effectué les k fold
    Xfolds = np.array_split(X, k)
    Yfolds = np.array_split(y, k)
    
    #######Your for loop here
    
    
    for i in range(k):
        #Indices de l'ensemble de test et d'entrainement
        test_start = int(len(X) * i / k)
        test_end = int(len(X) * (i + test_size) / k) 
        test_indices = list(range(test_start, test_end))
        train_indices = [j for j in range(len(X)) if j not in test_indices]
        # Your code to define X_train, y_train, X_test, y_test
        X_train, y_train = X[train_indices], y[train_indices]
        X_test, y_test = X[test_indices], y[test_indices]

    # we will fit the model on the train data
        model.fit(X_train, y_train)
        
        # we will predict on the test data
        y_pred = model.predict(X_test)
        
        # we will compute the accuracy
        accuracy = np.mean(y_pred == y_test)
        
        # we will append the accuracy to the list
        accuracies.append(accuracy)
    
    # we will return the mean accuracy
        return np.mean(accuracies)

In [72]:
#You can use the code bellow to test your function

#import the random forest model
from sklearn.ensemble import RandomForestClassifier

# we will use the random forest model
model = RandomForestClassifier()

# we will use the k_fold_cross_validation function
k_fold_cross_validation(X, y, 5, model)

0.9166666666666666