
# K-Fold Cross-Validation

## 1) Import the required packages

In [None]:
import pandas
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression  
import numpy as np

## 2) Import the dataset

In [None]:
dataset = pandas.read_csv('students_placement_data.csv')
dataset.head()

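|   | Roll No | Gender | Section | SSC Percentage | inter_Diploma_percentage | B.Tech_percentage | Backlogs | registered_for_ Placement_Training | placement status |
|---|---------|--------|---------|----------------|--------------------------|-------------------|----------|------------------------------------|------------------|
| 0 | 1 | M | A | 87.3 | 65.3 | 40.0 | 18 | NO | Not placed |
| 1 | 2 | F | A | 89.0 | 92.4 | 71.45 | 0 | yes | Placed |
| 2 | 3 | F | A | 67.0 | 68.0 | 45.26 | 13 | yes | Not placed |
| 3 | 4 | M | A | 71.0 | 70.4 | 36.47 | 17 | yes | Not placed |
| 4 | 5 | M | A | NaN | 65.5 | 42.52 | 17 | yes | Not placed |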


## 3) Fill in missing values, if any

In [None]:
dataset = dataset.ffill()  # Forward fill is used here. You may use any other method of your choice.
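
To make the effect of forward fill concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset. In the data above, row 4 has no SSC Percentage; after the forward fill it takes the value 71.0 from row 3.

```python
import numpy as np
import pandas

# Toy frame (hypothetical values) with one missing entry in the second row.
toy = pandas.DataFrame({"score": [71.0, np.nan, 80.0]})

# Forward fill copies the last valid value above each gap.
filled = toy.ffill()
print(filled)

# A leading NaN has no previous value to copy, so forward fill leaves it in place.
leading = pandas.Series([np.nan, 1.0]).ffill()
print(leading.isna().tolist())
```

If the first row of a column can be missing, forward fill alone is not enough and a second strategy (for example, a column mean) would be needed.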

## 4) Divide the data into features and target
Let's predict **B.Tech_percentage** from **SSC Percentage** and **inter_Diploma_percentage**, i.e., perform multiple linear regression.

In [None]:
X = dataset.iloc[:, [3, 4]] # SSC Percentage and inter_Diploma_percentage are features.
y = dataset.iloc[:, 5]  # B.Tech percentage.

print(X[0:5]) # Display the first 5 rows of the features.
print("\n")
print(y[0:5]) # Display the first 5 target values.

   SSC Percentage  inter_Diploma_percentage
0            87.3                      65.3
1            89.0                      92.4
2            67.0                      68.0
3            71.0                      70.4
4            71.0                      65.5


0    40.00
1    71.45
2    45.26
3    36.47
4    42.52
Name: B.Tech_percentage, dtype: float64


## 5) Normalize the data

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1)) # Let's normalize the data in range of 0 to 1 using MinMaxScaler
X = scaler.fit_transform(X)
X[0:5] # let's check first 5 values

array([[0.86131705, 0.29306488],
       [0.89838639, 0.89932886],
       [0.4186655 , 0.35346756],
       [0.50588748, 0.40715884],
       [0.50588748, 0.29753915]])
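
As a sanity check on what `MinMaxScaler` computes, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to a small made-up matrix and compares the result with the scaler. `X_toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns with different ranges.
X_toy = np.array([[60.0, 50.0],
                  [75.0, 70.0],
                  [90.0, 100.0]])

# Manual min-max scaling, applied column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# MinMaxScaler with the default feature_range=(0, 1) should give the same values.
X_scaled = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print(np.allclose(X_manual, X_scaled))
```

Each column is scaled independently, so the smallest value in a column maps to 0 and the largest to 1.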

## 6) Fit and evaluate the model with cross-validation

In [None]:
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=3)  # Here we perform 3-fold cross-validation.
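
Note that step 5 fitted the scaler on the full dataset, so the minimum and maximum used for scaling also reflect rows that later serve as test folds. One common way to keep preprocessing inside each fold is to wrap the scaler and the model in a pipeline. The sketch below uses synthetic stand-in data (`X_raw`, `y_syn` are assumptions for illustration), not the placement dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the placement dataset).
rng = np.random.default_rng(0)
X_raw = rng.uniform(40, 100, size=(90, 2))
y_syn = 0.5 * X_raw[:, 0] + 0.3 * X_raw[:, 1] + rng.normal(0, 5, size=90)

# cross_val_score clones and refits the whole pipeline per fold,
# so the scaler only ever sees that fold's training rows.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe_scores = cross_val_score(pipe, X_raw, y_syn, cv=3)

print(pipe_scores)
print(pipe_scores.mean())
```

With this pattern the unscaled features are passed to `cross_val_score`, and the scaling step travels with the model.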

## 7) Check the performance of the model.

In [None]:
print(scores) # We get 3 scores, one per fold, because cv=3.

# Average the scores for the overall performance of the model. Note: the R2 score is the default measure for regressors.
print("The overall r2 score of the model using cross_val_score",np.mean(scores)) 

[0.55849429 0.73188684 0.12129382]
The overall r2 score of the model using cross_val_score 0.47055831314866764
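
The fold scores above vary a lot, so it helps to recall how the R2 score is computed on held-out data: `1 - SS_res / SS_tot`, where `SS_tot` is measured around the mean of the held-out targets. The sketch below reproduces `model.score` by hand on synthetic data (`X_demo`, `y_demo` and the 40/20 split are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.3, size=60)

# Train on the first 40 rows, evaluate on the remaining 20.
reg = LinearRegression().fit(X_demo[:40], y_demo[:40])
y_true = y_demo[40:]
y_pred = reg.predict(X_demo[40:])

# R2 = 1 - SS_res / SS_tot, computed on the held-out rows only.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, reg.score(X_demo[40:], y_true))
```

Because the baseline is the mean of the held-out targets, a fold whose targets the model predicts poorly can score close to zero or even negative.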


# We have built and evaluated a model using **cross_val_score**.

## Now, let's perform k-fold cross-validation using `KFold` from sklearn to demonstrate how k-fold cross-validation works (optional, for understanding only)
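
Before running this on the full dataset, a minimal sketch on six hypothetical samples shows how `KFold` partitions the row indices. `X_small` is an illustrative array, not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples, just enough to see the partitioning.
X_small = np.arange(12).reshape(6, 2)

# Without shuffling, KFold hands out consecutive blocks of rows as test sets.
folds = [(train.tolist(), test.tolist())
         for train, test in KFold(n_splits=3).split(X_small)]

for i, (train, test) in enumerate(folds, start=1):
    print(f"Iteration {i}: train={train} test={test}")
# Iteration 1: train=[2, 3, 4, 5] test=[0, 1]
# Iteration 2: train=[0, 1, 4, 5] test=[2, 3]
# Iteration 3: train=[0, 1, 2, 3] test=[4, 5]
```

Every row appears in exactly one test fold. If the rows are ordered in some meaningful way, passing `shuffle=True` (optionally with `random_state`) to `KFold` mixes them before splitting.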

In [None]:
scores = []  # Start with an empty list
model1 = LinearRegression()  # The regression model
cross_val = KFold(n_splits=3)  # Use 3-fold cross-validation
count_itr = 0  # Initialize a counter (to display the iteration number)
for train_index, test_index in cross_val.split(X):
    count_itr += 1  # Increment the iteration number
    print("Iteration", count_itr)  # Print the iteration number
    print("Train Index: ", train_index, "\n")  # Indexes of the training data in this iteration
    print("Test Index: ", test_index)  # Indexes of the test data in this iteration
    print("\n")
    # Divide the data into train and test.
    # X is a NumPy array after scaling; y is a pandas Series, so use .iloc for positional indexing.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model1.fit(X_train, y_train)  # Fit the model on the training data
    scores.append(model1.score(X_test, y_test))  # Store the score of each fold in "scores"
    # Note: the default score is the R2 score
print("The individual r2 scores of the model using KFold are", scores)
print("The overall r2 score of the model using KFold is", np.mean(scores))

Iteration 1
Train Index:  [ 39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56
  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74
  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92
  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110
 111 112 113 114 115 116] 

Test Index:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38]


Iteration 2
Train Index:  [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92
  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110
 111 112 113 114 115 116] 

Test Index:  [39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77]


Iteration 3
Train Index:  [ 0  1  2  3