![alt text](https://qph.fs.quoracdn.net/main-qimg-c5ed87e938b35f372db3f8fbddc19290)




Cross-validation is a technique in which we train a model on one subset of the dataset and then evaluate it on the complementary subset.

Steps:

1. Reserve a sample of the dataset.

2. Train the model on the remaining part of the dataset.

3. Evaluate the model on the reserved (validation) sample. This helps you gauge how well the model performs on data it has not seen. If the model performs well on the validation data, go ahead with the current model.

**Holdout Validation Approach - Train and Test Set Split**

In the holdout approach we split the dataset once into a training set and a test set, for example training on 50% of the data and testing on the remaining 50%. The major drawback is that the held-out 50% may contain important information that the model never sees during training, which leads to higher bias. Note that the code below calls `train_test_split` with its default setting, which holds out 25% of the rows rather than 50%.

In [76]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import pandas as pd

In [77]:
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [78]:
X = df.iloc[:,:-1]
Y = df.iloc[:,-1]

In [79]:
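# hold out the default 25% of rows for testing; a different random_state gives a different split
# (the target was stored as Y above, so pass Y rather than the undefined name y)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=6)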


In [80]:
X_train.shape, X_test.shape

((112, 4), (38, 4))

In [81]:
# check classification accuracy of KNN with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

0.9736842105263158
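
The holdout estimate depends on which rows happen to land in the test set. The sketch below is a minimal illustration of that, assuming the same iris data and KNN model as above; the names `X_demo`, `y_demo`, and `accuracies` and the choice of seeds 0 to 4 are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load iris as plain arrays (illustrative names, separate from X and Y above)
X_demo, y_demo = load_iris(return_X_y=True)

accuracies = []
for seed in range(5):  # assumed seeds, chosen only for illustration
    # Each seed produces a different 75/25 train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=seed)
    knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    accuracies.append(knn_demo.score(X_te, y_te))

print(accuracies)
print("spread:", max(accuracies) - min(accuracies))
```

Any spread between the printed accuracies comes only from the choice of split, which is the motivation for averaging over several splits in the methods that follow.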

**Leave one out cross validation (LOOCV)**

In this approach, we reserve only one data point from the available dataset, and train the model on the rest of the data. This process iterates for each data point. 

*  We make use of all data points, hence the bias will be low.

*  We repeat the cross validation process n times (where n is number of data points) which results in a higher execution time

*  This approach leads to higher variance in the estimate of model effectiveness, because each iteration tests against a single data point

In [82]:
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold

In [83]:
loocv = LeaveOneOut()
loocv.get_n_splits(X)

150

In [84]:
# printing the training and validation data
#for train_index, test_index in loocv.split(X):
  #print(train_index,test_index)

In [85]:
knn_loocv = KNeighborsClassifier(n_neighbors=5)
results_loocv = model_selection.cross_val_score(knn_loocv, X, Y, cv=loocv)
print("Accuracy: %.2f%%" % (results_loocv.mean()*100.0))

Accuracy: 96.67%
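
To make the points about execution time and variance concrete, the following sketch looks at the individual leave-one-out scores rather than only their mean. It assumes the same iris data and KNN model; `X_demo`, `y_demo`, and `scores` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# One model fit per data point, so one score per data point
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=LeaveOneOut()
)

print(len(scores))        # number of fits equals the number of data points
print(np.unique(scores))  # each single-point test is either wrong (0.0) or right (1.0)
```

Because every fold tests one point, each fold score can only be 0 or 1; only the average over all folds is informative.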


**K-fold cross validation**

1. Randomly split your entire dataset into k "folds".

2. For each fold, build your model on the other k – 1 folds of the dataset. Then test the model on the held-out fold and record the error you see on its predictions.

3. Repeat this until each of the k-folds has served as the test set

4. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model.
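
The four steps above can be sketched as an explicit loop. This is a minimal illustration, assuming iris data, a KNN classifier, and misclassification rate as the error; the names `kf_demo`, `fold_errors`, and `cv_error` and the choice of `random_state=0` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Step 1: randomly split the data into k = 5 folds
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, val_idx in kf_demo.split(X_demo):
    # Step 2: train on k - 1 folds, test on the held-out fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    accuracy = model.score(X_demo[val_idx], y_demo[val_idx])
    # Record the error (here: misclassification rate) for this fold
    fold_errors.append(1.0 - accuracy)
# Step 3: the loop ends once every fold has served as the test set

# Step 4: the cross-validation error is the average of the k fold errors
cv_error = float(np.mean(fold_errors))
print(fold_errors)
print(cv_error)
```

`cross_val_score` performs the same loop internally and reports the per-fold scores, so the explicit version is mainly useful for seeing what is being averaged.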

![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2015/11/22.png)

In [86]:
#defining number of folds for model
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False)
kf

KFold(n_splits=5, random_state=None, shuffle=False)

In [87]:
for fold, (train_idx, val_idx) in enumerate(kf.split(X,Y)):
  print(len(train_idx), len(val_idx))

120 30
120 30
120 30
120 30
120 30


In [88]:
model_kfold = KNeighborsClassifier(n_neighbors=5)
results_kfold = model_selection.cross_val_score(model_kfold,X,Y, cv=kf)
print("Accuracy: %.2f%%" % (results_kfold.mean()*100.0)) 

Accuracy: 91.33%


K-fold cross-validation fits the model only K times, whereas leave-one-out cross-validation fits it n times (once per data point), so K-fold runs roughly n/K times faster. With n = 150 and K = 5, that is 5 fits instead of 150.

**Stratified k-fold cross validation**

Stratification is the process of rearranging the data so that each fold is a good representative of the whole. For example, in a binary classification problem where each class makes up 50% of the data, it is best to arrange the data so that in every fold each class makes up about half of the instances.
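
As a rough illustration of what stratification changes, the sketch below counts how many samples of each class land in each validation fold, once with an unshuffled `KFold` and once with `StratifiedKFold`. It assumes the iris data, whose rows are ordered by class (50 of each); the helper `class_counts_per_fold` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X_demo, y_demo = load_iris(return_X_y=True)

def class_counts_per_fold(splitter):
    """Return [count of class 0, class 1, class 2] for each validation fold."""
    return [
        np.bincount(y_demo[val_idx], minlength=3).tolist()
        for _, val_idx in splitter.split(X_demo, y_demo)
    ]

# Unshuffled K-fold takes contiguous blocks of rows, so folds are dominated by one class
plain_counts = class_counts_per_fold(KFold(n_splits=5, shuffle=False))
# Stratified K-fold keeps the 1/3, 1/3, 1/3 class balance in every fold
strat_counts = class_counts_per_fold(StratifiedKFold(n_splits=5))

print(plain_counts)  # [[30, 0, 0], [20, 10, 0], [0, 30, 0], [0, 10, 20], [0, 0, 30]]
print(strat_counts)  # [[10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10]]
```

With unshuffled folds on class-ordered data, some validation folds contain classes that are under-represented in the corresponding training data, which is one plausible reason for a lower plain K-fold score on this dataset.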

![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/skfold-768x530.png)

In [89]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)

In [90]:
model_skfold = KNeighborsClassifier(n_neighbors=5)
# store the stratified scores under their own name instead of overwriting results_kfold
results_skfold = model_selection.cross_val_score(model_skfold, X, Y, cv=skf)
print("Accuracy: %.2f%%" % (results_skfold.mean()*100.0))

Accuracy: 97.33%


**RepeatedKFold**

This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.

In [91]:
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

In [92]:
model_rkfold = KNeighborsClassifier(n_neighbors=5)
# store the repeated K-fold scores under their own name instead of overwriting results_kfold
results_rkfold = model_selection.cross_val_score(model_rkfold, X, Y, cv=rkf)
print("Accuracy: %.2f%%" % (results_rkfold.mean()*100.0))

Accuracy: 96.27%
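
To see what the repetition does, the sketch below inspects the splits that `RepeatedKFold` generates on a small toy array. The 10-row array, `n_repeats=3`, and `random_state=0` are assumptions chosen to keep the output short; they are not the settings used above.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy data: 10 rows, 2 columns (illustrative only)
X_toy = np.arange(20).reshape(10, 2)

rkf_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Total number of train/validation splits is n_splits * n_repeats
n_total = rkf_demo.get_n_splits(X_toy)
print(n_total)  # 15

# Group the validation indices by repetition (5 consecutive splits per repetition)
val_folds = [val_idx.tolist() for _, val_idx in rkf_demo.split(X_toy)]
repetitions = [val_folds[i:i + 5] for i in range(0, n_total, 5)]
for rep in repetitions:
    print(rep)  # each repetition partitions the 10 rows after a fresh shuffle
```

Within every repetition each row is used for validation exactly once; across repetitions the grouping of rows into folds generally differs, which is what averages out the dependence on one particular split.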


**Cross Validation for time series**

Splitting a time-series dataset randomly does not work, because it destroys the temporal order of the data and lets the model train on observations that come after the ones it is tested on. For a time-series forecasting problem, we perform cross-validation in the following manner.

Folds for time-series cross-validation are created in a forward-chaining fashion. Suppose we have a time series of yearly consumer demand for a product over a period of n years. The folds would be created like this:

![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/ts_1step-850x414.png)

fold 1: training [1], test [2]

fold 2: training [1 2], test [3]

fold 3: training [1 2 3], test [4]

fold 4: training [1 2 3 4], test [5]

fold 5: training [1 2 3 4 5], test [6]

...

fold n-1: training [1 2 3 ... n-1], test [n]
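
The forward-chaining scheme can be sketched with scikit-learn's `TimeSeriesSplit`. The example assumes a toy series of 6 yearly observations; the array `demand` holds placeholder values, and the years are numbered from 1 to match the listing above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder yearly demand for years 1..6 (values are made up for illustration)
demand = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 21.0]).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)

folds = []
for train_idx, test_idx in tscv.split(demand):
    # Convert 0-based row indices to 1-based year numbers
    train_years = (train_idx + 1).tolist()
    test_years = (test_idx + 1).tolist()
    folds.append((train_years, test_years))
    print("training", train_years, "test", test_years)
# training [1] test [2]
# training [1, 2] test [3]
# training [1, 2, 3] test [4]
# training [1, 2, 3, 4] test [5]
# training [1, 2, 3, 4, 5] test [6]
```

The training window only ever grows forward in time, so no fold is trained on data that comes after its test year.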

**Underfitting – High bias and low variance**

Techniques to reduce underfitting:
1. Increase model complexity.
2. Increase the number of features by performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or the duration of training to get better results.

**Techniques to reduce overfitting:**

1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the loss on held-out validation data over the training period and stop training as soon as it begins to increase).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
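
Early stopping from the list above can be sketched without any training framework. The function `early_stopping_epoch`, the `patience` parameter, and the loss values below are all hypothetical and serve only to show the stopping rule; a real training loop would compute the validation loss after each epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Loss kept improving (or patience never ran out): train to the end
    return len(val_losses) - 1

# Made-up validation losses: improving until epoch 3, then rising
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
best_epoch = val_losses.index(min(val_losses))
print(stop_epoch)  # 5
print(best_epoch)  # 3
```

In practice one would keep the model weights from the best epoch (epoch 3 here) rather than those from the epoch at which training stopped.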