# Validation strategies

## Validation types
* Holdout
* K-fold
* Leave-one-out

### Holdout : `ngroups = 1`

```
sklearn.model_selection.ShuffleSplit
```
![holdout](img/holdout.png)

* Fit our model on the training data frame
* Evaluate its quality on the validation data frame
* Using the scores from the evaluation we select the best model
* When we are ready to make a submission, **we can retrain our model on all of our labeled data**

#### Using `holdout` as validation is a good choice when ...
* we have enough data
* we are likely to get similar scores from the same model with different splits
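
The holdout procedure described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the synthetic data, the `LogisticRegression` candidates, the values of `C`, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

# Synthetic data, purely for illustration (assumption, not from the notes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Holdout: a single random split into 80% train and 20% validation.
splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(X))

# Hypothetical candidates: the same model with different regularization strengths.
candidates = {C: LogisticRegression(C=C) for C in (0.01, 1.0, 100.0)}

# Fit each candidate on the training part and score it on the validation part.
scores = {}
for C, model in candidates.items():
    model.fit(X[train_idx], y[train_idx])
    scores[C] = accuracy_score(y[valid_idx], model.predict(X[valid_idx]))

# Select the candidate with the best validation score.
best_C = max(scores, key=scores.get)

# Before submitting, retrain the selected configuration on all labeled data.
final_model = LogisticRegression(C=best_C).fit(X, y)
```

If the scores in `scores` change a lot when `random_state` changes, that is a hint that a single holdout split may not be reliable enough for this data.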



### K-fold : `ngroups = k`

The core idea of K-fold is:
* **that we want to use every sample for validation exactly once!**

```
sklearn.model_selection.KFold
```

![kfold](img/kfold.png)

#### Using `KFold` as validation is a good choice when
* we have a limited amount of data
* we can get either a sufficiently big difference in performance or different optimal parameters between folds
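
A short sketch of K-fold with scikit-learn is shown here. It checks that every sample lands in a validation fold exactly once and then looks at how much the score varies between folds. The synthetic data, the choice of five folds, and the `LogisticRegression` model are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, purely for illustration (assumption, not from the notes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample is used for validation.
validation_counts = np.zeros(len(X), dtype=int)
for train_idx, valid_idx in kf.split(X):
    validation_counts[valid_idx] += 1

# One accuracy score per fold; the spread shows how much folds disagree.
fold_scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

A large standard deviation across `fold_scores` is the kind of between-fold difference that makes averaging over folds more trustworthy than a single holdout split.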


### Leave-one-out : `ngroups = len(train)`
* A special case of K-fold where `K = len(train)`
* It iterates through every sample in our data, using each one as the validation set once

#### Using `leave-one-out` as validation is a good choice when
* we have very little data and a model that is fast enough to retrain many times
* we can get either a sufficiently big difference in performance or different optimal parameters between folds
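
To illustrate that leave-one-out is K-fold with `K = len(train)`, the sketch below compares the splits produced by `LeaveOneOut` and by an unshuffled `KFold` on a tiny toy array. The array itself is a made-up example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six toy samples (assumption, purely for illustration).
X = np.arange(12).reshape(6, 2)

# Collect (train indices, validation indices) for both splitters.
loo_splits = [(tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X)]
kfold_splits = [
    (tr.tolist(), va.tolist()) for tr, va in KFold(n_splits=len(X)).split(X)
]

# Each validation set holds exactly one sample, and both splitters agree.
for train_idx, valid_idx in loo_splits:
    print("train:", train_idx, "validation:", valid_idx)
```

Because the model is refit once per sample, the cost grows linearly with the size of the dataset, which is why this scheme is only practical for small data and fast models.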


## Stratification

`KFold` is the usual choice for validation. But sometimes, especially if you do not have enough samples for some class, a random split can fail.

For example, suppose we have a binary classification task and a small dataset with eight samples: 4 of class `0` and 4 of class `1`.

* Let's split the data into four folds (blue, orange, green, red)
* Notice that we do not always get both `0` and `1` in the same fold
* If we use the second fold for validation, the average target value in the training set will be **two thirds instead of one half**
  * This can drastically change predictions of our model

![stratification](img/stratification.png)
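
The arithmetic behind this example can be reproduced with a small sketch. The label ordering below is a hypothetical arrangement chosen so that the second fold contains only class `0`; it is not necessarily the arrangement shown in the figure.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight samples, four per class. The ordering is hypothetical.
y = np.array([0, 1, 0, 0, 1, 0, 1, 1])
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

# Without shuffling, the four folds are consecutive pairs of samples.
kf = KFold(n_splits=4)

train_means = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(X), start=1):
    train_means.append(y[train_idx].mean())
    print(f"fold {fold}: validation labels {y[valid_idx].tolist()}, "
          f"train target mean {train_means[-1]:.3f}")
```

When the second fold (labels `[0, 0]`) is held out, the remaining six training samples contain four of class `1`, so the training target mean is 4/6 = 2/3 rather than 1/2.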

### What we need to handle this problem: `stratification`
Stratification is a way to ensure a similar target distribution across the different parts.
* If we split the data into four parts with stratification, the average target value in each part will be equal to one half.
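
A minimal sketch of a stratified split on an eight-sample dataset like the one in the example, using `sklearn.model_selection.StratifiedKFold`. The label ordering is a hypothetical one chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Eight samples, four per class. The ordering is hypothetical.
y = np.array([0, 1, 0, 0, 1, 0, 1, 1])
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

# StratifiedKFold needs the labels so it can balance classes across folds.
skf = StratifiedKFold(n_splits=4)

train_means = []
valid_means = []
for train_idx, valid_idx in skf.split(X, y):
    train_means.append(y[train_idx].mean())
    valid_means.append(y[valid_idx].mean())

print("train target means:", train_means)
print("validation target means:", valid_means)
```

Each validation fold receives one sample of each class, so both the training and the validation target means stay at one half for every fold.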

### `Stratification` is useful for:
* Small datasets
* Unbalanced datasets
* Multiclass classification

For large, well-balanced classification datasets, a stratified split will be quite similar to a simple shuffle split (random split).

## Conclusion

There are three main validation strategies:
1. Holdout
2. KFold
3. LOO

**Stratification** preserves the same target distribution across different folds.