# Problems occurring during Validation

## Validation
1. We discussed the concept of validation and overfitting
2. We understood how to choose a validation strategy
3. We learned to identify the data split made by organizers
4. **Validation problems**
  - Validation stage
    * local validation problems are usually caused by inconsistency in the data
    * a widespread example is getting different optimal parameters for different folds
    * **In this case we need to make more thorough validation**
  - Submission stage
    * these problems show up only when we send our submissions to the platform and find that our validation score and the actual score shown on the leaderboard are different
    * **they usually occur because we can't mimic the exact train/test split in our validation**



## Validation stage

![holidays](../img/holidays-russia.png)

* Suppose we need to predict sales in a shop for February.
* Say we have target values for the last year.
* We would usually take the last month for validation.
  * This means `January`, but January clearly has many more holidays than February
  * and people tend to buy more, which makes the target values higher overall
  * **so the mean squared error of our predictions for January will be greater than for February**
  
### Does this mean that the model will perform worse for February?
* **Probably not** - at least in terms of overfitting

### Sometimes this kind of model behavior can be expected
* But what if there is no clear reason why scores differ for different folds?
* **`Let's identify several common reasons for this and see what we can do about it.`**

#### Causes of different scores and optimal parameters
1. Too little data
2. Too diverse and inconsistent data
  * **(Quick note)** in the sales example we can reduce this diversity a bit if we validate on February of the previous year!
  
#### We should do extensive validation
1. Average scores from different KFold splits (usually K=5 will do)
2. Tune the model on one set of KFold splits and evaluate it on another set of KFold splits
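
The two steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy data, the one-parameter ridge model, the helper names (`kfold_indices`, `cv_score`) and the seed lists are all made up for the example. In practice a library splitter such as scikit-learn's `KFold` with different `random_state` values could play the role of `kfold_indices`.

```python
import random
import statistics

def kfold_indices(n, k, seed):
    """Shuffle indices 0..n-1 with the given seed and cut them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for the toy model y ~ w * x (no intercept)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + alpha)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cv_score(xs, ys, alpha, k, seed):
    """Mean validation MSE over the K folds of one split (defined by `seed`)."""
    scores = []
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        scores.append(mse([ys[i] for i in fold], [w * xs[i] for i in fold]))
    return statistics.mean(scores)

# Hypothetical toy data: a noisy linear relationship.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

alphas = [0.0, 0.1, 1.0, 10.0]   # candidate hyperparameter values (assumed)
tune_seeds = [0, 1, 2]           # splits used for tuning
eval_seeds = [10, 11, 12]        # different splits used only for evaluation

# Step 1: average the score over several KFold splits for each candidate.
avg_by_alpha = {
    a: statistics.mean(cv_score(xs, ys, a, 5, s) for s in tune_seeds)
    for a in alphas
}
best_alpha = min(avg_by_alpha, key=avg_by_alpha.get)

# Step 2: evaluate the tuned model on splits that were not used for tuning.
eval_scores = [cv_score(xs, ys, best_alpha, 5, s) for s in eval_seeds]
print("best alpha:", best_alpha)
print("evaluation MSE: mean", statistics.mean(eval_scores),
      "std", statistics.stdev(eval_scores))
```

Keeping the tuning splits and the evaluation splits separate means the reported score is not flattered by the same splits that selected the parameter.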

## Submission stage

We can observe that:
* Leaderboard score is consistently higher/lower than validation score
* Leaderboard score is not correlated with validation score at all
  * `In the worst case, the higher the validation score is, the lower the leaderboard score becomes.`
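
One way to tell these two situations apart is to log the validation and leaderboard scores of past submissions and look at their gap and their correlation. The sketch below assumes such a log exists; the score values are hypothetical and `pearson` is a small helper written for the example.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation; assumes both lists are non-constant."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical log: one entry per submission.
val_scores = [0.70, 0.72, 0.75, 0.78, 0.80]
lb_scores = [0.66, 0.68, 0.71, 0.73, 0.76]

gaps = [lb - val for val, lb in zip(val_scores, lb_scores)]
mean_gap = statistics.mean(gaps)       # a stable gap means a consistent offset
corr = pearson(val_scores, lb_scores)  # near 1: validation still ranks models well

print("mean gap:", round(mean_gap, 3))
print("correlation:", round(corr, 3))
```

A stable gap with high correlation is the milder case, since validation still orders models correctly. A correlation near zero or below is the case where validation cannot be used to pick models.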

### `EDA is your friend when it comes down to finding the root of the problems!`

### `Remember that the main rule of making a reliable validation is to mimic the train/test split made by organizers!`

Let's first sort out causes we could observe during validation stage.

0. **We may already have quite different scores in KFold**
  - we can calculate the mean and standard deviation of the validation scores and estimate whether the leaderboard score is expected. If it is not, look at the next two causes.
1. **Too little data in public leaderboard**
2. **Train and test data are from different distributions**
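
The check in item 0 can be sketched as below. The fold scores are hypothetical, and the tolerance of two standard deviations (`n_std=2.0`) is an assumption chosen for the example, not a rule from the material above.

```python
import statistics

def leaderboard_score_is_expected(fold_scores, lb_score, n_std=2.0):
    """Return (is_expected, mean, std) for a leaderboard score vs. fold scores."""
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores)  # sample standard deviation
    return abs(lb_score - mean) <= n_std * std, mean, std

# Hypothetical validation scores from 5 folds.
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.79]

ok, mean, std = leaderboard_score_is_expected(fold_scores, lb_score=0.81)
print("mean:", round(mean, 3), "std:", round(std, 4), "expected:", ok)

ok_far, _, _ = leaderboard_score_is_expected(fold_scores, lb_score=0.70)
print("expected for 0.70:", ok_far)
```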

## Submission stage: different distributions

![dist-heights](../img/dist-heights.png)

Suppose the target is height, the train data consists only of women and the test data consists only of men (and gender is not labeled).
* All the model predictions will be around the average height for women
* Our model will have a terrible score on the test data (men)

### The simplest way to solve this particular situation in a competition is
* to try to **figure out the optimal constant prediction for train and test data**
* and **shift your predictions by the difference**
* `Mean for train` = calculate from the train data
* `Mean for test` = `Leaderboard probing`
  * we send two constant submissions
  * write down a simple formula
  * find out that the average target value for the test data is 70 inches
* Now that we know the average target value for both train and test, we can adapt our predictions to the test data by adding the difference between the two means (7 inches in this example)
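
The "simple formula" depends on the competition metric. As a sketch, assume the metric is mean squared error (an assumption, the notes do not name the metric). For a constant prediction `c`, `MSE(c) = mean(y^2) - 2*c*mean(y) + c^2`, so the scores of two constant submissions are enough to solve for `mean(y)`. The function name `probe_test_mean` and all numbers below are hypothetical; the hidden test targets are simulated only to stand in for the leaderboard.

```python
def mse_of_constant(y_true, constant):
    """MSE the leaderboard would report for a constant submission."""
    return sum((y - constant) ** 2 for y in y_true) / len(y_true)

def probe_test_mean(c1, mse1, c2, mse2):
    """Recover the mean target from the MSE of two constant submissions.

    MSE(c1) - MSE(c2) = -2 * (c1 - c2) * mean(y) + (c1**2 - c2**2)
    """
    return ((c1 ** 2 - c2 ** 2) - (mse1 - mse2)) / (2 * (c1 - c2))

# Simulated hidden test targets (in a real competition these are unknown
# and only the two leaderboard scores are available).
hidden_test_targets = [68, 70, 72, 69, 71]
c1, c2 = 0.0, 100.0
mse1 = mse_of_constant(hidden_test_targets, c1)
mse2 = mse_of_constant(hidden_test_targets, c2)

est_test_mean = probe_test_mean(c1, mse1, c2, mse2)

# Hypothetical train targets and model predictions centred on the train mean.
train_targets = [61, 63, 65, 62, 64]
train_mean = sum(train_targets) / len(train_targets)
shift = est_test_mean - train_mean

predictions = [62.5, 63.0, 64.1]
shifted_predictions = [p + shift for p in predictions]

print("estimated test mean:", est_test_mean)
print("shift:", shift)
print("shifted predictions:", shifted_predictions)
```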


### Another example
The ratios of `men` to `women` in train and test data are different.
![men-women-ratio](../img/men-women-ratio.png)

* If the test data consists mostly of `men`, **force the validation to have the same distribution**
* In that case, you ensure that your validation will be fair
  - This holds both for getting correct raw scores and for finding the right optimal parameters

![men-women-ratio2](../img/men-women-ratio2.png)
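
A minimal sketch of forcing the validation set to a given ratio follows. It assumes each validation row carries a group label and that the test ratio is known or has been estimated (for example 4 men to 1 woman). The function name `match_group_ratio`, the `"group"` key and the ratio are hypothetical.

```python
import random
from collections import defaultdict

def match_group_ratio(rows, target_parts, seed=0):
    """Subsample rows so group counts follow integer parts, e.g. {'men': 4, 'women': 1}."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row)
    # Largest number of whole "units" every group can supply.
    units = min(len(by_group[g]) // parts for g, parts in target_parts.items())
    rng = random.Random(seed)
    sample = []
    for g, parts in target_parts.items():
        sample.extend(rng.sample(by_group[g], units * parts))
    return sample

# Hypothetical validation pool with a 50/50 split.
pool = [{"group": "men", "id": i} for i in range(50)]
pool += [{"group": "women", "id": i} for i in range(50)]

validation = match_group_ratio(pool, {"men": 4, "women": 1})
n_men = sum(1 for r in validation if r["group"] == "men")
n_women = len(validation) - n_men
print("men:", n_men, "women:", n_women)
```

Subsampling discards some rows. Weighting the validation metric by group would be an alternative that keeps all rows.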

## Submission stage - conclusion

Causes of validation problems:
* too little data in public leaderboard
* incorrect train/test split
* different distributions in train and test

### If you have too little data in public leaderboard, just trust your validation. 
* If that's not the case, make sure that you did not overfit.
* Then check if you made correct train/test split as we discussed.
* Finally check if you have different distributions in train and test.

## Leaderboard Shuffle

Say a team ranked 3rd on the private leaderboard but ranked 392nd on the public leaderboard.

### Expect LB shuffle because of

### **Randomness**
  * Ex1. easy-to-predict competition
    - Scores of competitors were very close; most of them overfitted.
    - Randomness did not play a major role in the differences in performance.
  * Ex2. hard-to-predict competition
    - The financial data in that competition was highly unpredictable.
    - Randomness played a major role in the differences in performance.
  * **`So one could say that the leaderboard shuffle there was among the biggest shuffles on the Kaggle platform`**

### **Little amount of data**
  * Ex. the train set consisted of fewer than 200 rows and the test set of fewer than 400 rows.
    - The shuffle here was larger than expected.

### **Different public/private distributions**
  * This is usually the case with `time-series predictions`
  * When we have a time-based split, we usually have the first few weeks in the public leaderboard and the next few weeks in the private leaderboard
  * As people tend to adjust their models to the public leaderboard and overfit, we can expect worse results on the private leaderboard
  * **`Here again, trust your validation and everything will be fine.`**

## Conclusion

* **`If we have a big dispersion of scores at the validation stage, we should do extensive validation`**
  - Average scores from different KFold splits
  - Tune model on one split, evaluate on the other
* **`If the submission score does not match the local validation score, we should`**
  - Check if we have too little data in public leaderboard
  - Check if we overfitted
  - Check if we chose correct splitting strategy
  - Check if train/test have different distributions
* **`Leaderboard shuffle may occur because of`**
  - Randomness
  - Little amount of data
  - Different public/private distributions