## Cross Validation
#### What is Cross Validation?
Cross Validation is the practice of evaluating and <strong>tuning</strong> your <strong>predictive model</strong> on data it was not trained on, so that the model neither <strong>overfits</strong> nor <strong>underfits</strong> a dataset.

Let's dive deeper into what these terms mean, beginning with the term predictive model.
<ul>
    <li>Predictive Model</li>
    <small>It refers to an algorithm that learns patterns from a dataset and uses them to make predictions on new data.</small>
    <hr>
    <li>Tuning</li>
    <small>The process of adjusting your model so that it makes better predictions.</small>
    <hr>
    <li>Overfitting</li>
    <small>The idea that your algorithm/model becomes too familiar with only a certain set of training data, so that it performs poorly on data it has not seen before.</small>
    <hr>
    <li>Underfitting</li>
    <small>The idea that your model is too simple to capture the patterns in the data and needs more tuning.</small>
</ul>
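
To make overfitting and underfitting concrete, here is a minimal sketch. It assumes a decision tree as a stand-in model (it is not the classifier used later in this tutorial), because its depth is an easy knob to turn: a depth-1 tree is too simple, while an unrestricted tree can memorise the training data.

```Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A depth-1 tree has only two leaves, too few for three classes: it tends to underfit.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
# An unrestricted tree can memorise the training set, which risks overfitting.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

for name, model in [("stump", stump), ("deep", deep)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print("{}: train {:.2f}, test {:.2f}".format(name, train_acc, test_acc))
```

A large gap between training and test accuracy hints at overfitting, while low accuracy on both hints at underfitting.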

Cross-validation begins by splitting the prepared dataset into a training set and a test set. This is standard practice in a professional environment.<br>
Common ways to split the dataset are:
<ul>
    <li>80/20, where 80% of the data is used for training and the remaining 20% for testing</li>
    <li>70/30, where 70% of the data is used for training and the remaining 30% for testing</li>
</ul>

These two splits are the most commonly used. <br>
We are free to experiment with other ratios, but a 50/50 split is not advised, because it leaves too little data for training and the model is likely to underfit.
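
As a quick illustration of these ratios, the sketch below splits the Wine dataset (introduced in the next section) with `test_size` set to 0.2 and 0.3 and prints how many rows land in each part. The variable name `split_sizes` is only for illustration.

```Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

split_sizes = {}
for test_size in (0.2, 0.3):
    # test_size is the fraction of rows reserved for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print("test_size={}: {} training rows, {} test rows".format(
        test_size, len(X_train), len(X_test)))
```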

Cross Validation helps with understanding how your model might perform with new datasets. <br>
And in the following example we shall see how to implement it.
<ul>
    <li>We will begin by importing all the required modules</li>
    <li>Importing the Wine dataset. It contains 13 features describing each wine, and each data point is labelled with one of three wine classes. load_wine() lives in the sklearn.datasets module.</li>
</ul>

```Python
from sklearn.datasets import load_wine
```
<ul>
    <li>Importing the classifier (a pre-defined algorithm for running an analysis).</li>
    <li>To be more specific, we will be using a naive Bayes classifier. The choice of classifier is arbitrary.</li>
</ul>

```Python
from sklearn.naive_bayes import GaussianNB
```
<ul>
    <li>The following functions do the main work in Cross Validation.</li>
    <li>train_test_split splits the dataset into training and test sets, for example 70/30 or 80/20.</li>
    <li>confusion_matrix tabulates correct and incorrect predictions per class, from which we will calculate the accuracy further on</li>
</ul>

```Python
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
```

<ul>
    <li>Extracting the data from the module</li> 
    <li><i>data</i> variable contains the 13 features for the wine in a numpy 2d array</li>
    <li><i>target</i> variable contains the class (0,1,2) of each set of features in the corresponding index</li>
</ul>

```Python
# Load the dataset once and pull out the features and the class labels
wine = load_wine()
data = wine['data']
target = wine['target']
```
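
A quick sanity check of what was loaded can be useful before going further. The sketch below prints the shape of the feature array, the distinct class labels, and how many samples belong to each class.

```Python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
data, target = wine["data"], wine["target"]

print(data.shape)           # (number of samples, number of features)
print(np.unique(target))    # the distinct class labels
print(np.bincount(target))  # number of samples per class
```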
<ul>
    <li>We instantiate our classifier here</li>
    <li>This classifier allows us to analyse our dataset by using its functions </li>
    <li>This would be considered an example of a predictive model</li>

</ul>

```Python
clNB = GaussianNB()
```

<ul>
    <li>Line 1: This is the first step in training our model, where we split the dataset into training and test sets</li>
    <small>In this case we have <i>train_w</i> and <i>target_train_w</i>,
    which, as the names suggest, hold the data we are going to use to train our model. The variables <i>test_w</i> and <i>target_test_w</i> hold our test data.</small>
    <img src='cv.PNG' width="150" height="150">
    <li>Line 2: We use our classifier to fit the data, which we do through the .fit() method</li>
    <small>The .fit() method is provided by almost all classifiers; the basic idea is that the model familiarizes itself with the data. You might also encounter .fit_transform(), which belongs to preprocessing objects such as scalers rather than classifiers: it fits to the data and then goes a step further by returning the transformed (for example, standardized) data.</small>
    <li>Line 3: Here we have 2 processes to think about. </li>
    <small>
        <ol>
            <li>The .predict() method, called inside confusion_matrix(), returns an array of predicted values.</li>
            <li>confusion_matrix() returns a 2D array whose diagonal holds the counts of correctly predicted values; the sum of the diagonal divided by the number of test samples gives the accuracy.</li>
        </ol>
    </small>
</ul>



```Python
# Line 1 #
train_w, test_w, target_train_w, target_test_w = train_test_split(data, target, test_size=0.25, random_state=0)
# Line 2 #
clNB = clNB.fit(train_w, target_train_w)
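# Line 3 #
# confusion_matrix expects the true labels first, then the predictions
matrix = confusion_matrix(target_test_w, clNB.predict(test_w))
print("Bayes' Analysis",  matrix)
# Correct predictions sit on the diagonal of the matrix
print('Accuracy {:.2f}'.format((matrix[0][0]+matrix[1][1]+matrix[2][2])/len(test_w) * 100))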
```
The Results

<img src='cm.PNG' height="150" width="150">
<img src='acc.PNG' height="150" width="150">
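
The diagonal-sum calculation can be cross-checked against scikit-learn's own `accuracy_score`. The sketch below repeats the split and fit so that it runs on its own; the names `manual_accuracy` and `library_accuracy` are only for illustration.

```Python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Correct predictions are on the diagonal; all predictions are in the whole matrix
matrix = confusion_matrix(y_test, y_pred)
manual_accuracy = matrix.trace() / matrix.sum()
library_accuracy = accuracy_score(y_test, y_pred)

print("Manual:  {:.2f}".format(manual_accuracy * 100))
print("Library: {:.2f}".format(library_accuracy * 100))
```

Both numbers should agree, since accuracy is simply the share of predictions that fall on the diagonal.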

Our trained model managed to get an accuracy of 93%, but a better practice is to run a cross-validation. <br>
Since this dataset is small, it could be that our model did well only on the particular samples that ended up in the test set,<br>
meaning that in a hypothetical scenario with more testing samples our model wouldn't work that well.  <br>
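
To see how much a single split can matter, the sketch below repeats the same train/test procedure with five different `random_state` values (an arbitrary choice for illustration) and prints the accuracy from each.

```Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)

scores = []
for seed in range(5):
    # Each seed shuffles the rows differently, giving a different test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    scores.append(GaussianNB().fit(X_train, y_train).score(X_test, y_test))

print(["{:.2f}".format(s * 100) for s in scores])
print("Spread: {:.2f}".format((max(scores) - min(scores)) * 100))
```

If the accuracies differ noticeably between seeds, a single score should be read with caution.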

For that reason we will cross-validate our model.

<ol>
    <li>We begin by splitting the training data further into training and test sets</li>
    <li>We will then create a new model for analysis and fit our constructed data</li>
    <li>Then we run an analysis on the constructed dataset and check our accuracy</li>
    <li>After which we can use this model to check the accuracy on the main dataset</li>
</ol>

<img src='KCV.PNG' height="300" width="500">

```Python
# 1.
train_cv_w, test_cv_w, target_cv_train_w, target_cv_test_w = train_test_split(train_w, target_train_w, test_size=0.25, random_state=0)
# 2.
clNB1 = GaussianNB()
clNB1 = clNB1.fit(train_cv_w, target_cv_train_w)

# 3.
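# confusion_matrix expects the true labels first, then the predictions
matrix1 = confusion_matrix(target_cv_test_w, clNB1.predict(test_cv_w))
print('Bayes CV Analysis')
print(matrix1)
# Divide by the size of the set we actually predicted on (test_cv_w, not test_w)
print('Accuracy {:.2f}'.format((matrix1[0][0]+matrix1[1][1]+matrix1[2][2])/len(test_cv_w) * 100))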
```
The Results<br>
<img src='CV.PNG' height="150" width="150">

This gives us a good indication that our model is working.


```Python
# 4.
# True labels first, then the predictions
matrix2 = confusion_matrix(target_test_w, clNB1.predict(test_w))
print("Bayes' Analysis",  matrix2)
print('Accuracy {:.2f}'.format((matrix2[0][0]+matrix2[1][1]+matrix2[2][2])/len(test_w) * 100))
```
The Results<br>

<img src='PostKCV.PNG' height="150" width="150">



The results give us confidence that our model is at least not overfitting, but in the next section we will learn how to make this classifier even better.



Exercise: Using the classifier and dataset imported below, run a cross-validated analysis on the dataset.

```Python
# Exercise
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()['data']
target = load_breast_cancer()['target']
```



#### Solution

```Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()['data']
target = load_breast_cancer()['target']

cls = KNeighborsClassifier(n_neighbors=3)
train, test, target_train, target_test = train_test_split(data, target, test_size=0.25, random_state=0)
train_cv_w, test_cv_w, target_cv_train_w, target_cv_test_w = train_test_split(train, target_train, test_size=0.25, random_state=0)

cls = cls.fit(train_cv_w, target_cv_train_w)
result = cls.predict(test_cv_w)

accuracy = confusion_matrix(target_cv_test_w, result)
accuracy = (accuracy[0][0]+accuracy[1][1])/len(target_cv_test_w)
accuracy = accuracy*100

print("Cross Validated Accuracy: {:.2f}".format(accuracy))

final_result = cls.predict(test)
accuracy = confusion_matrix(target_test, final_result)
accuracy = (accuracy[0][0]+accuracy[1][1])/len(target_test)
accuracy = accuracy*100
print("Accuracy: {:.2f}".format(accuracy))
```

## K-Fold Cross Validation
#### What does K-Fold mean?
<i>K</i> refers to the number of splits (folds) we make in the training set.
The concept here is that we can train and evaluate our model more than once, each time with a different portion of the data. <br>
<ol>
    <li>We first split the training dataset into K buckets (folds), each holding roughly the same number of training data points. The bucket sizes are worked out automatically by the splitting method, so any remainder data points are taken care of for you.</li>
    <img src="KFold.png" height="300" width="500">
    <li>Then we loop through the buckets: on each pass one bucket is held out for testing, we .fit on the remaining buckets, and we .predict on the held-out bucket to tally its accuracy</li>
    <img src="KFold1.png" height="300" width="500">
    <li>At the end, we tend to use this trained model for our final prediction</li>
</ol>
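
The bucket handling described above can be seen on a toy example. This sketch assumes seven made-up data points and three folds, so the points cannot be divided evenly; the names `toy` and `fold_sizes` are only for illustration.

```Python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(7)          # seven hypothetical data points
kfold = KFold(n_splits=3)   # no shuffling, so each bucket is a contiguous block

fold_sizes = []
for train_index, test_index in kfold.split(toy):
    # test_index is the held-out bucket; train_index is everything else
    fold_sizes.append(len(test_index))
    print("train:", train_index, "test:", test_index)

print("Bucket sizes:", fold_sizes)
```

Seven points do not divide evenly by three, so the first bucket receives the extra point and the sizes come out as 3, 2 and 2.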

Let's go back to our Cross-Validation example, build a K-fold cross-validated model, and then compare our results.

<ul>
    <li>Import the libraries you need</li>
    <li>KFold: This class is used to make buckets of our training data. It works similarly to train_test_split(), but it splits the data into exactly the number of folds requested. It returns the splits as arrays of indices, and we will see how we can use them </li>
</ul>


```Python
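from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
# GaussianNB lives in sklearn.naive_bayes, not sklearn.neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split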
```

<ul>
    <li>Loading the data</li>
    <li>Splitting the data to make train and test set</li>
    <li>Instantiate the classifier</li>
</ul>

```Python
data = load_wine()['data']
target = load_wine()['target']

train_w, test_w, target_train_w, target_test_w = train_test_split(data, target, test_size=0.25, random_state=0)

clNBm = GaussianNB()
```

<ul>
    <li>We first create a KFold object. <small>Refresher: an object bundles data together with the methods that operate on it</small></li>
    <li>We then use the .split() method to get pairs of train_index and test_index arrays.</li>
    <li>On every iteration we build training and testing data from those indices and measure the accuracy on that particular fold, so the model sees a different dataset every time</li>
</ul>

```Python
kfold = KFold(n_splits=10)

for train_index, test_index in kfold.split(train_w):
    # Select this fold's rows using the index arrays returned by .split()
    X_train, X_test = train_w[train_index], train_w[test_index]
    y_train, y_test = target_train_w[train_index], target_train_w[test_index]
    clNBm = clNBm.fit(X_train, y_train)
    y_pred1 = clNBm.predict(X_test)
    # labels=[0, 1, 2] keeps the matrix 3x3 even if a small fold lacks a class
    matrix1 = confusion_matrix(y_test, y_pred1, labels=[0, 1, 2])
    print('GB',((matrix1[0][0]+matrix1[1][1]+matrix1[2][2])/len(X_test))*100)
```
The Results<br>
<img src="kfoldr.png" height="150" width="150">

We see 10 values for the 10 splits, which shows how the accuracy varies from fold to fold.
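
The per-fold loop can also be expressed with scikit-learn's `cross_val_score` helper, which runs the fit/predict cycle for every fold and returns one score per fold. The sketch below is an alternative to the manual loop, not the method used above: it passes a `KFold(n_splits=10)` object so the folds match, and it works on internal copies of the classifier, so it does not leave you with a fitted model.

```Python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One accuracy value per fold, computed on that fold's held-out bucket
scores = cross_val_score(GaussianNB(), X_train, y_train, cv=KFold(n_splits=10))

print("Per-fold accuracy:", (scores * 100).round(2))
print("Mean: {:.2f}  Std: {:.2f}".format(scores.mean() * 100, scores.std() * 100))
```

Reporting the mean together with the standard deviation summarises the 10 values in a single line.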

Now we run our classifier on the held-out test set.

```Python
pred1 = clNBm.predict(test_w)
matrix = confusion_matrix(target_test_w, pred1)
# Divide by the size of the wine test set (target_test_w)
accuracy = (matrix[0][0]+matrix[1][1]+matrix[2][2])/len(target_test_w)
accuracy = accuracy*100
print("Accuracy on test set: {:.2f}".format(accuracy))
```
The Results<br>
95.56% accuracy. 


Exercise: Run a K-Fold validation with 10 splits on a new model, using the dataset from the previous exercise.


```Python
# Solution
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

data = load_breast_cancer()['data']
target = load_breast_cancer()['target']

clg = KNeighborsClassifier(n_neighbors=3)
train, test, target_train, target_test = train_test_split(data, target, test_size=0.25, random_state=0)
kfold = KFold(n_splits=10)

for train_index, test_index in kfold.split(train):
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = target_train[train_index], target_train[test_index]
    
    clg = clg.fit(X_train, y_train)
    
    y_pred = clg.predict(X_test)
    
    matrix = confusion_matrix(y_test, y_pred)
    
    print('KN',((matrix[0][0]+matrix[1][1])/len(X_test))*100)

    
matrix1 = confusion_matrix(target_test,clg.predict(test))    
accuracy = (matrix1[0][0]+matrix1[1][1])/len(target_test)
accuracy = accuracy*100
print("Accuracy on test set: {:.2f}".format(accuracy))
```