# Concept of Cross-Validation

Machine learning is an iterative process. You will face choices about which predictive variables to use, what types of models to use, what arguments to
supply to those models, and so on. We make these choices in a data-driven way by measuring the quality of the
various alternatives.

You've already learned to use ``train_test_split`` to split the data, so you can measure model quality on the
test data. Cross-validation extends this approach to model scoring (or "model validation"). Compared to
``train_test_split``, cross-validation gives you a more reliable measure of your model's quality, though it
takes longer to run.

### The Shortcoming of Train-Test Split

Imagine you have a dataset with 5000 rows. The ``train_test_split`` function has an argument for ``test_size``
that you can use to decide how many rows go to the training set and how many go to the test set. The
larger the test set, the more reliable your measures of model quality will be. At an extreme, you could
imagine having only 1 row of data in the test set. If you compare alternative models, which one makes
the best predictions on a single data point will be mostly a matter of luck.

You will typically keep about 20% as a test dataset. But even with 1000 rows in the test set, there's some
random chance in determining model scores. A model might do well on one set of 1000 rows, even if it
would be inaccurate on a different 1000 rows. The larger the test set, the less randomness (aka "noise")
there is in our measure of model quality.
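To put rough numbers on this noise, here is a minimal sketch. It assumes the test rows are independent and that the model has a fixed true accuracy; the value 0.9 and the helper name `accuracy_standard_error` are illustrative assumptions, not part of this lesson. Under those assumptions the measured accuracy follows a binomial distribution, so its standard error shrinks with the square root of the test-set size.

```python
import numpy as np

def accuracy_standard_error(true_accuracy, n_test_rows):
    """Binomial standard error of accuracy measured on n_test_rows independent rows."""
    return np.sqrt(true_accuracy * (1 - true_accuracy) / n_test_rows)

# 0.9 is a hypothetical true accuracy, chosen only for illustration
for n_test_rows in (1, 100, 1000, 5000):
    se = accuracy_standard_error(0.9, n_test_rows)
    print(f"{n_test_rows:>5} test rows -> standard error of about {se:.4f}")
```

With a single test row the standard error is 0.3, which is why comparing models on one data point is mostly luck. Going from 1000 to 5000 rows cuts the noise by a bit more than half.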

### The Cross-Validation Procedure

In cross-validation, we run our modeling process on different subsets of the data to get multiple
measures of model quality. For example, we could have 5 folds or experiments. We divide the data into
5 pieces, each being 20% of the full dataset.

![Cross val](assets/crossval.png)

We first run experiment 1, which uses the first fold as a holdout set and everything
else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we
got from using the simple train-test split.

We then run a second experiment, where we hold out data from the second fold (using everything
except the 2nd fold for training the model). This gives us a second estimate of model quality. We repeat
this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a
holdout at some point.

Returning to our example above from train-test split, if we have 5000 rows of data, we end up with a
measure of model quality based on 5000 rows of holdout (even if we don't use all 5000 rows
simultaneously).
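The bookkeeping behind this procedure can be sketched with plain NumPy. This is only an illustration of the index handling for the 5000-row, 5-fold example above; the variable names are assumptions, and no model is trained here.

```python
import numpy as np

n_rows, n_folds = 5000, 5
row_indices = np.arange(n_rows)

# Cut the row indices into 5 consecutive pieces of 1000 rows each
folds = np.array_split(row_indices, n_folds)

times_held_out = np.zeros(n_rows, dtype=int)
for k, holdout_idx in enumerate(folds):
    # Everything except fold k is training data for experiment k + 1
    train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != k])
    times_held_out[holdout_idx] += 1
    print(f"Experiment {k + 1}: {len(train_idx)} training rows, {len(holdout_idx)} holdout rows")

# Every row ends up in a holdout set exactly once
print(bool((times_held_out == 1).all()))
```

Each experiment trains on 4000 rows and scores on the remaining 1000, and across the five experiments all 5000 rows serve as holdout data once.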


### Trade-offs Between Cross-Validation and Train-Test Split

Cross-validation gives a more accurate measure of model quality, which is especially important if you
are making a lot of modeling decisions. However, it can take more time to run, because it fits
one model per fold and so does more total work.
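As a back-of-the-envelope sketch of that extra work, the snippet below counts how many training rows are processed in total. It assumes training cost grows roughly in proportion to the number of rows, which is a simplification, and `rows_processed` is a hypothetical helper name.

```python
def rows_processed(n_rows, n_folds):
    """Total training rows seen across all fits in k-fold cross-validation."""
    rows_per_fit = n_rows - n_rows // n_folds  # each fit leaves one fold out
    return n_folds * rows_per_fit

single_split = 5000 - 1000           # one fit on 80% of 5000 rows
five_fold = rows_processed(5000, 5)  # five fits on 4000 rows each

print(single_split, five_fold, five_fold / single_split)
```

For the 5000-row example, 5-fold cross-validation processes 20000 training rows against 4000 for a single 80/20 split, so it does about five times the work.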

Given these tradeoffs, when should you use each approach? On small datasets, the extra computational
burden of running cross-validation isn't a big deal. These are also the problems where model quality
scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.

For the same reasons, a simple train-test split is sufficient for larger datasets. It will run faster, and you
may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what constitutes a large vs small dataset. If your model takes a couple
of minutes or less to run, it's probably worth switching to cross-validation. If your model takes much longer
to run, cross-validation may slow down your workflow more than it's worth.

Alternatively, you can run cross-validation and see if the scores for each experiment seem close. If each
experiment gives nearly the same results, train-test split is probably sufficient.

Now, let's see an example of how this can be done with the iris dataset. We will start with the library imports and the data import.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("datasets/iris.csv")
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |


We have covered the data-preprocessing steps multiple times before, so now we will directly make a feature dataframe and a labels Series.

In [3]:
X_train = df.drop("Class", axis=1)
y_train = df["Class"]
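The training loop later in this notebook one-hot encodes the labels inside each fold with `pd.get_dummies`. A hedged alternative is to encode the labels once, before splitting, so that every fold sees the same columns even if a holdout set happens to miss a class. The sketch below uses a tiny hypothetical label Series rather than the real data.

```python
import pandas as pd

# Hypothetical labels, standing in for the "Class" column
labels = pd.Series(["Setosa", "Versicolor", "Virginica", "Setosa", "Virginica"])

# Encode once up front: one column per class, in a fixed order
one_hot_all = pd.get_dummies(labels).astype("float32")

# A holdout set that happens to contain only one class
holdout_rows = [0, 3]
encoded_per_fold = pd.get_dummies(labels.iloc[holdout_rows])
encoded_up_front = one_hot_all.iloc[holdout_rows]

print(encoded_per_fold.shape)   # (2, 1): only the class present in the fold
print(encoded_up_front.shape)   # (2, 3): all three class columns preserved
```

With shuffled folds of reasonable size a missing class is unlikely, so this is a robustness choice rather than a required change.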

Now we need a model. Let's use an artificial neural network (ANN) for this example.

In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Before proceeding, we need to set up k-fold cross-validation, where k is the number of folds we want. sklearn already provides a class for this, ``KFold``, and we will use it.

In [5]:
from sklearn.model_selection import KFold

K-Folds cross-validator

Provides train/test indices to split data in train/test sets. Split
dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining
folds form the training set.

In [6]:
kfolds = KFold(n_splits=5, shuffle=True, random_state=42)
# shuffle=True matters: without it the folds are consecutive blocks of rows,
# so a file sorted by class would yield holdout sets that miss whole classes.
# random_state=42 just makes the shuffle reproducible.
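To see why shuffling matters, here is a small sketch with synthetic labels. It assumes a dataset of 150 rows sorted by class with 50 rows per class; that layout and the helper name `classes_per_holdout` are assumptions for illustration, not something read from `datasets/iris.csv`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic, class-sorted labels (an assumed layout, not the real file)
labels = np.repeat(["Setosa", "Versicolor", "Virginica"], 50)
rows = np.arange(len(labels))

def classes_per_holdout(splitter):
    """List the distinct classes that appear in each holdout fold."""
    return [sorted({str(c) for c in labels[test_idx]})
            for _, test_idx in splitter.split(rows)]

unshuffled = classes_per_holdout(KFold(n_splits=5))
shuffled = classes_per_holdout(KFold(n_splits=5, shuffle=True, random_state=42))

print("Without shuffling:", unshuffled)
print("With shuffling:   ", shuffled)
```

Without shuffling, each holdout fold is a consecutive block of 30 rows, so it contains at most two of the three classes and the first fold contains only one. A model would then be scored on classes that are under-represented or over-represented in its training data.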

Now we will use the indices to split the data for each fold, fit a model, and evaluate its performance. Lastly, we will take the average score across all the folds as a measure of how well our model performs. Once we are happy with the average metrics from k-fold cross-validation, we can finalize the model architecture and hyperparameters and train the model on all of the data.

In [12]:
scores_list = []
for i, (train_index, test_index) in enumerate(kfolds.split(X_train)):
    # Select this fold's training rows and holdout rows by position
    X_train_kf, X_test_kf = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_kf, y_test_kf = y_train.iloc[train_index], y_train.iloc[test_index]

    # Build a fresh model for every fold so no weights carry over
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation='relu'))  # hidden layer with 8 neurons, 4 input features
    model.add(Dense(3, activation='softmax'))  # one output per class

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(f"Training for fold : {i + 1}")
    # One-hot encode the labels to match the softmax output
    model.fit(X_train_kf, pd.get_dummies(y_train_kf), epochs=50, batch_size=1, verbose=0)

    print("Evaluating model")
    scores = model.evaluate(X_test_kf, pd.get_dummies(y_test_kf))  # returns [loss, accuracy]
    print(f"Fold {i + 1}: {model.metrics_names[1]} of {scores[1]*100}")
    scores_list.append(scores[1]*100)
    print("\n")

print(f"Average accuracy: {np.mean(scores_list)}")
print(f"Standard Deviation: {np.std(scores_list)}")

Training for fold : 1
Evaluating model
Fold 1: accuracy of 96.66666388511658


Training for fold : 2
Evaluating model
Fold 2: accuracy of 96.66666388511658


Training for fold : 3
Evaluating model
Fold 3: accuracy of 96.66666388511658


Training for fold : 4
Evaluating model
Fold 4: accuracy of 93.33333373069763


Training for fold : 5
Evaluating model
Fold 5: accuracy of 100.0


Average accuracy: 96.66666507720947
Standard Deviation: 2.1081849811218007
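One way to act on the earlier suggestion of checking whether fold scores are close is to summarize the five accuracies printed above. The helper `summarize_folds` and its `max_spread` threshold of 5 percentage points are arbitrary assumptions for illustration, not an established rule.

```python
import numpy as np

# Fold accuracies copied from the output above (in percent)
fold_accuracies = [96.66666388511658, 96.66666388511658, 96.66666388511658,
                   93.33333373069763, 100.0]

def summarize_folds(scores, max_spread=5.0):
    """Report mean, standard deviation, and best-minus-worst spread of fold scores."""
    scores = np.asarray(scores, dtype=float)
    spread = scores.max() - scores.min()
    return {
        "mean": scores.mean(),
        "std": scores.std(),
        "spread": spread,
        "folds_agree": bool(spread <= max_spread),  # threshold is a judgment call
    }

summary = summarize_folds(fold_accuracies)
print(summary)
```

The gap between the best and worst fold is about 6.7 points. The accuracies are consistent with 30-row holdout sets, where a single row is worth about 3.3 points, so that gap corresponds to only two misclassified rows. On a dataset this small, fold-to-fold differences of this size are expected, which supports using cross-validation here.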


Cool, an average accuracy of about 96.7%. You can try some other models with k-fold cross-validation to see if this improves.
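As one hedged example of trying another model, the sketch below scores a logistic regression with the same kind of shuffled 5-fold split, using scikit-learn's `cross_val_score` to run the loop. It loads iris through `sklearn.datasets.load_iris` rather than `datasets/iris.csv` so that it is self-contained; that substitution, the choice of `LogisticRegression`, and `max_iter=1000` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in for the CSV used above: the copy of iris bundled with scikit-learn
X, y = load_iris(return_X_y=True)

kfolds = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# cross_val_score fits a fresh clone of the model on each training split
# and returns one accuracy per holdout fold
scores = cross_val_score(model, X, y, cv=kfolds, scoring="accuracy")

print(f"Fold accuracies: {scores * 100}")
print(f"Average accuracy: {np.mean(scores) * 100}")
print(f"Standard Deviation: {np.std(scores) * 100}")
```

Because the data source differs from the notebook's CSV, the numbers are not guaranteed to be directly comparable with the ANN results above. For a fair comparison, run both models on the same dataframe and the same `KFold` object.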