# Training and Testing Datasets
Author: Ravin Poudel

The main goal in statistical or machine learning modeling is to build a generalized predictive model. We often start with a set of data, build a model, and describe the model fit and other properties. However, it is equally important to test the model with new data (data that was not used in fitting the model) and evaluate its performance. From an agricultural perspective, this would mean running an additional experiment to generate data for model validation. Instead, we can __randomly__ divide a single dataset into two parts, using one part for learning and the other for testing model performance.

<img src="../nb-images/Train_test.png">

> Train data set: A data set used to __construct/train/learn__ a model. 

> Test data set: A data set used to __evaluate__ the model.



#### How do we split a single dataset into two?

There is no single best solution. By convention, more data is used for training the model than for testing/evaluating it. Schemes such as `75%/25% train/test` or `90%/10% train/test` are common. A larger training dataset lets the model learn better, while a larger testing dataset gives more confidence in the model evaluation. 
> Can we apply a similar data-splitting scheme when we have a small dataset? This is often the case in agriculture or the life sciences, at least as of now.

> Does a single random split make our predictive model random? Do we want a stable model or a random model?
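One way to explore the second question is to repeat the same split with different random seeds and watch how the test accuracy moves. The sketch below is only an illustration. It uses the iris data and the K-Nearest Neighbors classifier that appear later in this notebook, and the ten seeds and the names `seeds` and `accuracies` are arbitrary choices made here.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# Repeat the same 80/20 split with different seeds (illustrative choice of seeds)
seeds = range(10)
accuracies = []
for seed in seeds:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = KNeighborsClassifier().fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))  # accuracy on the held-out part

accuracies = np.array(accuracies)
print(accuracies)
print("min: %.3f, max: %.3f, sd: %.3f"
      % (accuracies.min(), accuracies.max(), accuracies.std()))
```

If the printed accuracies differ from seed to seed, the estimate from any single split depends partly on chance. That is the motivation for the cross-validation sections further down.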


We will use the `iris dataset` to explore the concept of data splitting. The data set contains:

- 50 samples from each of 3 species of iris flower (150 samples in total)
- Iris species: Setosa, Versicolor, and Virginica
- Measurements: sepal length, sepal width, petal length, petal width
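The description above can be checked directly against the copy of the data that ships with scikit-learn. This is a small sketch; the variable name `counts` is chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples belong to each coded class (0, 1, 2)
counts = np.bincount(iris["target"])
for name, count in zip(iris["target_names"], counts):
    print(name, count)

# Rows are samples, columns are the four measurements
print(iris["data"].shape)
```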


In [28]:
# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split


In [29]:
# import iris data from scikit and data preparation
iris = datasets.load_iris() # inbuilt data 
iris_X = iris['data'] # feature data (the four measurements)
iris_y = iris['target'] # this has information about the flower type, but has been coded as 0, 1, or 2.
names = iris['target_names'] # flower type
feature_names = iris['feature_names'] # features name


In [30]:
# check data shape
iris_X.shape


(150, 4)

In [31]:
print(iris_y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [32]:
print(names)

['setosa' 'versicolor' 'virginica']


In [33]:
print(feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [34]:
# splitting into train and test data
# test dataset = 20% of the original dataset

X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.2, random_state=0)

In [35]:
# shape of train dataset
X_train.shape, y_train.shape

((120, 4), (120,))

In [36]:
# shape of test dataset
X_test.shape, y_test.shape


((30, 4), (30,))
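A plain random split does not guarantee that each species is equally represented in the test set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below compares the class counts under both approaches. The names ending in `_plain` and `_strat` are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# Plain random split, same settings as the cell above
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: class proportions of y are preserved in train and test
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("plain test counts:      ", np.bincount(y_test_plain, minlength=3))
print("stratified test counts: ", np.bincount(y_test_strat, minlength=3))
print("stratified train counts:", np.bincount(y_train_strat, minlength=3))
```

With 50 samples per species and a 20% test size, the stratified split should place 10 samples of each species in the test set and 40 in the training set.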

In [37]:
# instantiate a K-Nearest Neighbors(KNN) model, and fit with X and y
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model_tt = model.fit(X_train, y_train)

In [38]:
# check the accuracy on the test set
model_tt.score(X_test, y_test)


0.9666666666666667

In [39]:
# predict class labels for the test set
predicted = model_tt.predict(X_test)
print (predicted)

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 2 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0]


In [40]:
print(y_test)

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0]


In [41]:
# generate evaluation metrics
from sklearn import metrics
print (metrics.accuracy_score(y_test, predicted))

0.9666666666666667


In [42]:
print (metrics.confusion_matrix(y_test, predicted))


[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]


In [43]:
print (metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
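The numbers in the classification report can be recovered by hand from the confusion matrix, which may help in reading both. The sketch below copies the matrix printed above into an array (rows are the true classes, columns are the predicted classes). The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = np.array([[11, 0, 0],
               [0, 12, 1],
               [0, 0, 6]])

correct = np.diag(cm)                 # correctly classified samples per class
recall = correct / cm.sum(axis=1)     # correct / number of true samples
precision = correct / cm.sum(axis=0)  # correct / number of predicted samples
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(accuracy, 4))
```

The single off-diagonal entry is one class 1 flower predicted as class 2. It lowers the recall of class 1 (12 of 13 found) and the precision of class 2 (6 of 7 predictions correct).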



## NOTE:

> Never train model on your test dataset.

> Be suspicious: if your model ever reaches 100% accuracy on the test data, double-check that you have not used the test dataset for training your model. 

> __Over-fitting__: if the model performs very well on the training data but poorly on the test data, then it’s overfit.
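A simple way to look for over-fitting is to score the same fitted model on both the training and the test data and compare. The sketch below is a minimal illustration. It uses `n_neighbors=1`, an assumption made here rather than the setting used elsewhere in this notebook, because with a single neighbor each training point is usually its own nearest neighbor, so training accuracy tends to be very high.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.2, random_state=0
)

# n_neighbors=1 is an illustrative choice that favors memorizing the training data
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out data

print("train accuracy: %.3f" % train_acc)
print("test accuracy:  %.3f" % test_acc)
print("gap:            %.3f" % (train_acc - test_acc))
```

A large gap between the two scores is the warning sign. On a small, well-separated dataset such as iris the gap may turn out to be modest.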



### Model Evaluation via Cross-Validation

In [44]:
# evaluate the model using 5-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(KNeighborsClassifier(), iris_X, iris_y, cv=5)
print (scores)


[0.96666667 1.         0.93333333 0.96666667 1.        ]


In [45]:
print (scores.mean())

0.9733333333333334


In [46]:
# The mean score and sd:
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))

Accuracy: 97.333% (2.494%)



### K-Folds Cross Validation
In K-fold cross-validation, we first divide the dataset randomly into k subsets (bins, or folds). One fold is used to validate the model, while the remaining folds are used to train it. We repeat the process k times so that each fold serves as the validation set exactly once, and then summarize the k scores.
K fold Cross Validation
<img src="../nb-images/CV.png">
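To make the procedure concrete, the loop below writes out by hand roughly what `cross_val_score` does with a `KFold` splitter. It is a sketch for illustration. The counter `times_tested` is a helper added here to confirm that every sample lands in a test fold exactly once.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

kfold = KFold(n_splits=5, shuffle=True, random_state=12323)

fold_scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each sample is held out

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Fit on four folds, evaluate on the remaining one
    clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
    score = clf.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    times_tested[test_idx] += 1
    print("fold %d: train=%d, test=%d, accuracy=%.3f"
          % (fold, len(train_idx), len(test_idx), score))

print("mean accuracy: %.3f" % np.mean(fold_scores))
print("each sample tested once:", bool((times_tested == 1).all()))
```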

In [47]:
from sklearn import model_selection
model = KNeighborsClassifier()
kfold = model_selection.KFold(n_splits=5, random_state=12323, shuffle=True)

In [48]:
results = model_selection.cross_val_score(model, iris_X, iris_y, cv=kfold)
results

array([0.93333333, 0.96666667, 0.96666667, 1.        , 0.96666667])

In [49]:
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 96.667% (2.108%)


### K fold with randomization in split? -- might help us to understand why accuracy of the model with `cv=5` is different from K fold?

In [23]:
model = KNeighborsClassifier()
kfold = model_selection.KFold(n_splits=5, random_state=12323, shuffle=True)

In [24]:
results = model_selection.cross_val_score(model, iris_X, iris_y, cv=kfold)
results

array([0.93333333, 0.96666667, 0.96666667, 1.        , 0.96666667])

In [25]:
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 96.667% (2.108%)
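One likely reason for the difference is the splitter. When `cv` is given as an integer and the estimator is a classifier, scikit-learn uses `StratifiedKFold` without shuffling, whereas the cells above use a shuffled `KFold`. The sketch below compares the two. The variable names are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]
model = KNeighborsClassifier()

# cv=5 with a classifier: stratified folds, no shuffling
scores_int = cross_val_score(model, X, y, cv=5)

# The same splitter written out explicitly
scores_strat = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

# Shuffled, non-stratified folds, as in the cells above
shuffled = KFold(n_splits=5, shuffle=True, random_state=12323)
scores_shuffled = cross_val_score(model, X, y, cv=shuffled)

print("cv=5 matches StratifiedKFold:", np.allclose(scores_int, scores_strat))
print("cv=5 scores:          ", scores_int)
print("shuffled KFold scores:", scores_shuffled)
```

Because the folds contain different samples, the per-fold scores and their mean can differ even though the model and the data are the same.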


### LOOCV

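Leave-one-out cross-validation (LOOCV): each sample is held out once as a single-observation test set, and the model is trained on all remaining samples.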
<img src="../nb-images/LOOV.png">

In [26]:
model = KNeighborsClassifier()
loocv = model_selection.LeaveOneOut()
results = model_selection.cross_val_score(model, iris_X, iris_y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 96.667% (17.951%)
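The standard deviation reported for LOOCV looks large compared with the K-fold results. A probable explanation is that each LOOCV test set holds a single flower, so every per-fold score is either 0 or 1. The sketch below inspects the individual scores. The name `p` for the mean accuracy is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

results = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())

print("number of folds:", len(results))         # one fold per sample
print("distinct scores:", np.unique(results))   # each fold is right or wrong

# For 0/1 scores the standard deviation follows from the mean alone
p = results.mean()
print("sd from scores:  %.5f" % results.std())
print("sqrt(p*(1-p)):   %.5f" % np.sqrt(p * (1 - p)))
```

Read this way, the spread reflects the all-or-nothing scoring of single samples rather than instability of the model across folds.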


# Exercise


