While explaining overfitting, I divided the data into two parts: I trained the
model on one part and checked its performance on the other. This is also a kind
of cross-validation, commonly known as a **hold-out set**. We use this kind of
(cross-) validation when we have a large amount of data and model inference is a
time-consuming process.
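
As a rough illustration of a hold-out split, here is a minimal sketch that uses train_test_split from scikit-learn. The data is synthetic and the 80/20 ratio is an assumption made for the example, not something prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data, purely for illustration: 1000 samples, 5 features
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# keep 20% of the rows aside as the hold-out set (assumed ratio)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# the model would be fit on X_train and evaluated only on X_valid
print(X_train.shape, X_valid.shape)
```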

There are many ways to do cross-validation, and it is the most critical step in
building a machine learning model that generalizes to unseen data. **Choosing the
right cross-validation** depends on the dataset you are dealing with, and a scheme
that works on one dataset may or may not apply to another. However, a few types of
cross-validation are the most popular and widely used.

### Types of Cross-Validation

* k-fold cross-validation
* stratified k-fold cross-validation
* hold-out based validation
* leave-one-out cross-validation
* group k-fold cross-validation

![Image](images/CrossValidation.png)

When you get a dataset to build machine learning models, you
separate it into **two different sets: training and validation**. Many people also
split off a third set and call it a **test set**. We will, however, be using only two
sets. As you can see, we divide the samples and the targets associated with them.
We can divide the data into k different sets which are exclusive of each other. This
is known as **k-fold cross-validation**.

![Image](images/k-fold.png)

We can split any data into k equal parts using KFold from scikit-learn. Each sample
is assigned a value from 0 to k-1 when using k-fold cross-validation.

In [9]:
# import pandas and model_selection module of scikit-learn
import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called cat_train.csv
    df = pd.read_csv("../cat_train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the kfold class from model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)

In [10]:
df

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,kfold
0,62725,0.0,1.0,1.0,F,Y,Red,Trapezoid,Lion,India,...,2.0,Grandmaster,Freezing,k,O,mi,2.0,7.0,1,0
1,433657,1.0,0.0,0.0,,N,Blue,Trapezoid,Dog,Russia,...,3.0,Expert,Cold,f,H,Io,7.0,11.0,1,0
2,201749,0.0,0.0,0.0,T,Y,Red,Trapezoid,Lion,India,...,3.0,Novice,Freezing,a,Y,Nh,2.0,3.0,0,0
3,24058,0.0,0.0,0.0,F,Y,Red,Polygon,Dog,Finland,...,3.0,Expert,Cold,,A,OM,1.0,2.0,0,0
4,499615,0.0,0.0,0.0,F,N,,Polygon,Cat,Finland,...,3.0,Expert,Lava Hot,k,M,DR,1.0,5.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
599995,546779,0.0,1.0,,T,Y,Red,Circle,Axolotl,Finland,...,1.0,Novice,Cold,e,M,Mg,4.0,7.0,0,4
599996,312095,0.0,0.0,1.0,F,N,Red,Polygon,Dog,India,...,1.0,Master,Boiling Hot,a,D,th,6.0,12.0,0,4
599997,498824,0.0,1.0,0.0,F,N,Red,Triangle,Hamster,,...,1.0,Novice,Cold,c,M,SS,5.0,6.0,0,4
599998,272140,0.0,0.0,0.0,F,N,Blue,Triangle,Hamster,Costa Rica,...,3.0,Expert,Freezing,a,A,Dj,5.0,11.0,0,4


In [11]:
set(df.kfold)

{0, 1, 2, 3, 4}

In [14]:
df['kfold'].value_counts()

0    120000
1    120000
2    120000
3    120000
4    120000
Name: kfold, dtype: int64

In [15]:
len(df)

600000
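
Once every row carries a fold number, the kfold column can drive a training loop: for each fold, train on the rows whose kfold differs from it and validate on the rows that match. The sketch below is only an illustration under assumptions: the data is synthetic, the feature columns f0 to f2 are hypothetical, and LogisticRegression is an arbitrary choice of model.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic stand-in for a folds file (hypothetical columns f0..f2)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f0", "f1", "f2"])
df_demo["target"] = (df_demo["f0"] + 0.5 * rng.normal(size=500) > 0).astype(int)
df_demo["kfold"] = -1

# assign fold numbers the same way as in the cell above
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, val_idx) in enumerate(kf.split(X=df_demo)):
    df_demo.loc[val_idx, "kfold"] = fold

features = ["f0", "f1", "f2"]
scores = []
for fold in range(5):
    # rows of the current fold are used for validation, the rest for training
    train = df_demo[df_demo.kfold != fold]
    valid = df_demo[df_demo.kfold == fold]
    model = LogisticRegression()
    model.fit(train[features], train["target"])
    preds = model.predict(valid[features])
    scores.append(accuracy_score(valid["target"], preds))

# one accuracy value per fold
print(scores)
```

Averaging the per-fold scores gives a single estimate of how the model behaves on data it has not seen.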

### Stratified k-fold cross-validation

The next important type of cross-validation is **stratified k-fold**.

**If you have a
skewed dataset for binary classification with 90% positive samples and only 10%
negative samples, you don't want to use random k-fold cross-validation.**

Using simple k-fold cross-validation for a dataset like this can result in folds whose
label ratio is far from that of the full dataset, and in the extreme case in folds that
contain samples of only one class. In these cases, we prefer using stratified k-fold
cross-validation. Stratified k-fold cross-validation keeps the ratio of labels in each
fold constant. So, in each fold, you will have the same 90% positive and 10% negative
samples. Thus, whatever metric you choose to evaluate, it will give similar results
across all folds.
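
The difference can be seen on a small synthetic example. The sketch below assumes a worst case that is not taken from the text: the labels are sorted and plain KFold is used without shuffling, so one fold ends up with negative samples only, while StratifiedKFold keeps the 90/10 ratio in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# synthetic, sorted labels: 900 positives followed by 100 negatives
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # features do not matter for the split itself

# plain k-fold without shuffling: fraction of positives in each validation fold
plain = [y[val].mean() for _, val in KFold(n_splits=10).split(X)]

# stratified k-fold: fraction of positives in each validation fold
strat = [y[val].mean() for _, val in StratifiedKFold(n_splits=10).split(X, y)]

print("plain     :", plain)
print("stratified:", strat)
```

Shuffling before a plain k-fold makes single-class folds unlikely, but the label ratio can still drift from fold to fold, which is what stratification prevents.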

It’s easy to modify the code for creating k-fold cross-validation to create stratified
k-folds. We are only changing from model_selection.KFold to
model_selection.StratifiedKFold and in the kf.split(...) function, we specify the
target column on which we want to stratify. We assume that our CSV dataset has a
column called “target” and it is a classification problem!

In [19]:
# import pandas and model_selection module of scikit-learn
import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called cat_train.csv
    df = pd.read_csv("../cat_train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # fetch targets
    y = df.target.values
    # initiate the kfold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    # fill the new kfold column
    for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
        df.loc[v_, 'kfold'] = f
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)

In [20]:
df

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,kfold
0,525955,0.0,0.0,0.0,F,N,Red,Triangle,Lion,Costa Rica,...,3.0,Novice,Freezing,o,K,UV,5.0,8.0,0,0
1,112545,0.0,0.0,0.0,F,Y,Red,Polygon,,China,...,3.0,Grandmaster,Freezing,f,R,,5.0,6.0,0,0
2,142126,0.0,1.0,1.0,F,N,Red,Trapezoid,Lion,Costa Rica,...,2.0,Contributor,Freezing,c,Y,oJ,3.0,5.0,0,0
3,87767,0.0,0.0,0.0,F,Y,Red,Circle,Axolotl,Finland,...,1.0,Contributor,Hot,a,B,th,1.0,11.0,0,0
4,153417,0.0,0.0,1.0,F,,Red,Polygon,Lion,Russia,...,2.0,Contributor,Freezing,n,P,HK,6.0,7.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
599995,4147,0.0,1.0,0.0,F,Y,Blue,Polygon,Axolotl,Costa Rica,...,1.0,Expert,,n,A,OZ,2.0,7.0,0,4
599996,106780,0.0,1.0,1.0,F,Y,Blue,Triangle,Hamster,India,...,2.0,Expert,Hot,d,U,dh,1.0,4.0,0,4
599997,484260,0.0,0.0,1.0,F,N,Red,Circle,Dog,Finland,...,1.0,Master,Warm,a,F,mD,1.0,2.0,0,4
599998,130183,0.0,0.0,0.0,F,Y,Red,Triangle,Axolotl,India,...,1.0,Expert,Warm,b,U,Hk,5.0,8.0,0,4


In [21]:
set(df.kfold)

{0, 1, 2, 3, 4}

In [22]:
df['kfold'].value_counts()

0    120000
1    120000
2    120000
3    120000
4    120000
Name: kfold, dtype: int64

### For the wine dataset, let’s look at the distribution of labels.

![Image](images/wine_dist.png)

Some classes have a lot of samples, and some don’t have that many. If we do a simple k-fold, we
won’t have an equal distribution of targets in every fold. Thus, we choose stratified
k-fold in this case.

The rule is simple. If it’s a standard classification problem, choose stratified k-fold
blindly.
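
To see what stratification does for a multi-class target, one can count the labels per fold. The sketch below is hedged: the class labels and their counts are made up for illustration and are not the wine dataset's actual distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# synthetic multi-class labels with made-up, uneven class counts
labels = np.repeat([3, 4, 5, 6, 7, 8], [10, 50, 680, 640, 200, 20])
df_demo = pd.DataFrame({"target": labels})
df_demo = df_demo.sample(frac=1, random_state=42).reset_index(drop=True)
df_demo["kfold"] = -1

# stratify on the target column, as in the cell above
skf = StratifiedKFold(n_splits=5)
for fold, (_, val_idx) in enumerate(skf.split(X=df_demo, y=df_demo.target.values)):
    df_demo.loc[val_idx, "kfold"] = fold

# rows: fold number, columns: class label, cells: number of samples
counts = pd.crosstab(df_demo.kfold, df_demo.target)
print(counts)
```

Each column of the resulting table should hold nearly identical numbers, meaning every fold sees the rare classes as often as the others do.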

But what should we do if we have a large amount of data? Suppose we have 1
million samples. A 5-fold cross-validation would mean training on 800k samples
and validating on 200k. Depending on which algorithm we choose, training and
even validation can be very expensive for a dataset of this size. In these
cases, we can opt for a **hold-out based validation**.

### Hold-out based validation

The process for creating the hold-out remains the same as stratified k-fold. For a
dataset which has 1 million samples, we can create ten folds instead of five and keep
one of those folds as the hold-out. This means we will have 100k samples in the
hold-out; we will always calculate loss, accuracy and other metrics on this set and
train on the remaining 900k samples.
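
A minimal sketch of this idea, under assumptions: the data is synthetic and has 10,000 rows rather than 1 million so that it runs quickly, and the column names are hypothetical. Only the first of the ten stratified splits is taken, and its validation part serves as the hold-out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in: 10,000 rows instead of 1 million, to keep this fast
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "target": rng.integers(0, 2, size=10_000),
})

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# take only the first of the ten splits; its validation part is the hold-out
train_idx, holdout_idx = next(skf.split(X=df_demo, y=df_demo.target.values))

df_train = df_demo.iloc[train_idx]
df_holdout = df_demo.iloc[holdout_idx]

# 90% of the rows remain for training, 10% form the hold-out
print(len(df_train), len(df_holdout))
```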

**Hold-out is also used very frequently with time-series data.**

Let’s assume the problem we are provided with is predicting the sales of a store for
2020, and you are provided all the data from 2015 to 2019. In this case, you can
select all the data for 2019 as a hold-out and train your model on all the data from
2015 to 2018.
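
The time-based split described above can be sketched with pandas. The monthly sales values and the column names date and sales are made up for this example.

```python
import pandas as pd

# synthetic monthly sales from 2015 to 2019 (values are made up)
dates = pd.date_range("2015-01-01", "2019-12-01", freq="MS")
sales = pd.DataFrame({"date": dates, "sales": range(len(dates))})

# everything before 2019 is used for training, 2019 is the hold-out
train = sales[sales.date.dt.year < 2019]
holdout = sales[sales.date.dt.year == 2019]

print(len(train), len(holdout))
```

The rows are deliberately not shuffled here: the hold-out must lie after the training period in time, otherwise the model would be validated on the past using information from the future.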

In many cases, we have to deal with small datasets, and creating big validation sets
means losing a lot of data for the model to learn from. **In those cases, we can opt
for a type of k-fold cross-validation where k=N, where N is the number of samples in
the dataset.** This is known as leave-one-out cross-validation. It means that in all
folds of training, we will be training on all data samples except one. The number of
folds for this type of cross-validation is the same as the number of samples that we
have in the dataset.

One should note that this type of cross-validation can be costly in terms of the time
it takes if the model is not fast enough, but since it’s only preferable to use this
cross-validation for small datasets, it doesn’t matter much.
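
The k=N scheme corresponds to the LeaveOneOut splitter in scikit-learn. The sketch below uses a tiny synthetic array of six samples (an assumption made for illustration) and also checks that KFold with n_splits equal to the number of samples produces folds of the same sizes.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# tiny synthetic dataset: 6 samples, 2 features
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
n_folds = loo.get_n_splits(X)

# every fold trains on N-1 samples and validates on the single remaining one
sizes = [(len(trn), len(val)) for trn, val in loo.split(X)]
print(n_folds, sizes)

# KFold with k = N yields folds of the same sizes
kf_sizes = [(len(trn), len(val)) for trn, val in KFold(n_splits=len(X)).split(X)]
print(kf_sizes == sizes)
```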