# Data splitting strategies

### Concrete example
Say we're solving a competition with a time series prediction task: we have to predict the number of customers a shop will have in the next month.

How should we divide the data into `train` and `validation` here?

* Take random rows
* Make a time-based split

![different-approahces-to-validation](img/different-approahces-to-validation.png)
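
A minimal sketch of the two options, assuming hypothetical daily rows of `(date, customers)`. The toy data, the function names and the 25% hold-out size are illustrative assumptions, not part of any competition setup.

```python
import random
from datetime import date, timedelta

# Hypothetical toy data: one row per day with a made-up customer count.
start = date(2023, 1, 1)
rows = [(start + timedelta(days=i), 100 + i % 7) for i in range(120)]

def random_split(rows, valid_fraction=0.25, seed=0):
    """Hold out randomly chosen rows, ignoring their dates."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

def time_split(rows, cutoff):
    """Train on rows before the cutoff date, validate on the rest."""
    train = [r for r in rows if r[0] < cutoff]
    valid = [r for r in rows if r[0] >= cutoff]
    return train, valid

rand_train, rand_valid = random_split(rows)
time_train, time_valid = time_split(rows, cutoff=date(2023, 4, 1))
```

With the random split, validation days are scattered between training days. With the time-based split, every validation day comes after every training day, which is closer to predicting the next month.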

### Imagine we have a pool of different models trained on different features

![different-approaches-to-validation](img/different-approaches-to-validation.png)

* Suppose we select the best model for each type of validation
* Question: will these models differ from each other? If so, how significantly?
  * **`YES!`** - the most useful features for one model could be useless for another!

#### If you want to predict `what will happen a few points later`,
* then a model that favors features like `previous and next target values` will perform poorly
  * because in this case we don't have such observations for the test data
  * but we still have to give the model something as the feature value, and it will probably be NaN (not a number) or missing values
  * the model has had very little experience with such values, so we cannot expect it to handle them well

#### Now, let's recall the second case, the `time-based split`
* Here the model needs to rely more on the time trend
* So the features the model really needs here are things like:
  * `what was the trend in the last couple of months or weeks?`
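
One possible way to turn that question into features is sketched below. It assumes a hypothetical list of daily target values that ends the day before the period being predicted; the window lengths (7 and 28 days) and the feature names are illustrative choices.

```python
from statistics import mean

def trend_features(history, short=7, long=28):
    """Build features from past target values only.

    `history` is a list of daily targets ordered oldest to newest and
    ending the day before the period being predicted.
    """
    if len(history) < long:
        raise ValueError("not enough history for the long window")
    short_mean = mean(history[-short:])
    long_mean = mean(history[-long:])
    return {
        "mean_last_week": short_mean,
        "mean_last_4_weeks": long_mean,
        # Positive when recent values sit above the longer-run level.
        "trend": short_mean - long_mean,
    }

# Hypothetical series that grows by one customer per day.
history = list(range(100, 128))  # 28 days: 100 ... 127
features = trend_features(history)
```

Because the function only looks backwards from the end of `history`, the same features can be computed for validation and for test rows.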

#### The model selected as the best model for the first type of validation will perform poorly for the second type of validation.

#### Conversely, the best model for the second type of validation was trained to predict many points **ahead** and will not use adjacent target values.

### If we carefully generate features that draw attention to time-based patterns, will we get a reliable validation with a random split?
* **`NO!`**


[**FIRST**: random split / **SECOND**: time-based split]
![different-approaches-to-validation2](img/different-approaches-to-validation2.png)

#### Model predictions will be close to the target mean value calculated on the train data
* If the validation points are closer to this mean value than the test points are, we will get a better score on validation than on test.

#### But in the second case (`time-based split`), the validation points are roughly as far from the target mean value as the test points
* So the validation score will be more similar to the test score
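
The effect can be reproduced with a toy example. The sketch below assumes a hypothetical target that grows linearly over time (`y_t = t`) and a "model" that always predicts the train mean; both are simplifying assumptions made only to illustrate the argument.

```python
import random
from statistics import mean

def mae(targets, prediction):
    """Mean absolute error of a single constant prediction."""
    return mean(abs(t - prediction) for t in targets)

# Hypothetical target with a steady upward trend: y_t = t.
available = list(range(200))    # labelled data, t = 0 ... 199
test = list(range(200, 220))    # future points, t = 200 ... 219

# Random split: 20 validation points scattered over the whole history.
rng = random.Random(0)
valid_idx = set(rng.sample(range(200), 20))
rand_train = [y for t, y in enumerate(available) if t not in valid_idx]
rand_valid = [y for t, y in enumerate(available) if t in valid_idx]

# Time-based split: the last 20 points form the validation set.
time_train, time_valid = available[:180], available[180:]

scores = {}
for name, train, valid in [("random", rand_train, rand_valid),
                           ("time-based", time_train, time_valid)]:
    prediction = mean(train)  # the model just predicts the train mean
    scores[name] = (mae(valid, prediction), mae(test, prediction))
    print(name, "validation MAE:", scores[name][0],
          "test MAE:", scores[name][1])
```

For the time-based split the validation MAE is 100.0 and the test MAE is 120.0, so validation gives a fair idea of the test error. For the random split the validation points surround the train mean, so the validation error should come out much lower than the test error.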

## Time-based splits

### Important outcome
 
**Different splitting strategies can differ significantly**
1. in the generated features
2. in the way the model relies on those features
3. in the target leaks they may introduce

### To be able to find smart ideas for `feature generation` and consistently improve our model, we ABSOLUTELY want to identify the train/test split made by the organizers and reproduce it.

## Splitting data into train and validation

1. Random, row-wise
2. **Timewise**
  * generally, data before a certain date = `train`
  * data after a certain date = `test`
  * This can be a **signal** to use a special approach to feature generation - **especially to make useful features based on the target**
    * (Example) If we are to predict the number of customers for the shop for each day of the next week, useful features could be:
      * **`the number of customers for the same day in the previous week`**
      * **`the average number of customers for the past month`**
3. By (unique) ID
  * Here we can probably conclude that features based on a user's history are of limited use
    * **`how many songs the user listened to last week`**
    * Such a feature will not help for completely new users!
  * (Example) Nature Conservancy fisheries monitoring
    * There were **photos of fish** from several different **fishing boats**
    * One could easily overfit by ignoring this risk and making a random split
    * Competitors had to derive the unique IDs by themselves
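
An ID-based split can be sketched as follows. The example assumes the grouping ID is a hypothetical `boat_id` attached to each photo; the records, the `group_split` helper and the hold-out fraction are illustrative. scikit-learn's `GroupKFold` offers similar behaviour for cross-validation.

```python
import random

# Hypothetical records: (photo_id, boat_id). The IDs are made up.
photos = [(f"img_{i:03d}", f"boat_{i % 6}") for i in range(60)]

def group_split(records, group_of, valid_fraction=0.33, seed=0):
    """Split so that no group appears in both train and validation."""
    groups = sorted({group_of(r) for r in records})
    random.Random(seed).shuffle(groups)
    n_valid = max(1, round(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [r for r in records if group_of(r) not in valid_groups]
    valid = [r for r in records if group_of(r) in valid_groups]
    return train, valid

train, valid = group_split(photos, group_of=lambda r: r[1])
```

Every photo from a held-out boat lands in validation, so the validation score measures how well the model handles boats it has never seen.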

### Moving window
A special case of validation for the time-based split

![moving-window](img/moving-window.png)
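
A rough sketch of how such folds could be generated for time-ordered data is given below. The function name, its parameters and the choice between a fixed-size and an expanding training window are assumptions for illustration; the figure may use a different window layout. scikit-learn's `TimeSeriesSplit` provides a ready-made variant.

```python
def moving_window_folds(n_points, train_size, valid_size,
                        step=None, expanding=False):
    """Yield (train_indices, valid_indices) for time-ordered data.

    Each validation block directly follows its training window, and the
    window then moves forward by `step` points.
    """
    step = step or valid_size
    start = 0
    while start + train_size + valid_size <= n_points:
        # An expanding window always starts at the first point.
        train_start = 0 if expanding else start
        train_end = start + train_size
        yield (list(range(train_start, train_end)),
               list(range(train_end, train_end + valid_size)))
        start += step

folds = list(moving_window_folds(n_points=10, train_size=4, valid_size=2))
expanding_folds = list(
    moving_window_folds(n_points=10, train_size=4, valid_size=2,
                        expanding=True))
```

With 10 points this produces three folds; in each one, all training indices precede all validation indices.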

### Combined split
Examples

* If the task is to predict sales in a shop, we can choose a split date for each shop independently, instead of using one date for every shop in the data.
* If we have search queries from multiple users who use several search engines, we can split the data by a **combination of user ID and search engine ID**.
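
The first example could look roughly like this. The sketch assumes hypothetical rows of `(shop_id, day, sales)` and holds out the last two dates of each shop; the helper name and the hold-out length are illustrative.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sales rows: (shop_id, day, sales). Shops have
# histories of different lengths.
rows = []
for shop_id, n_days in [("shop_a", 10), ("shop_b", 6)]:
    for i in range(n_days):
        rows.append((shop_id, date(2023, 1, 1) + timedelta(days=i), 5 + i))

def split_per_shop(rows, valid_days=2):
    """Hold out the last `valid_days` dates of each shop separately."""
    days_by_shop = defaultdict(set)
    for shop_id, day, _ in rows:
        days_by_shop[shop_id].add(day)
    # Per-shop cutoff: the first date that belongs to validation.
    cutoff = {shop: sorted(days)[-valid_days]
              for shop, days in days_by_shop.items()}
    train = [r for r in rows if r[1] < cutoff[r[0]]]
    valid = [r for r in rows if r[1] >= cutoff[r[0]]]
    return train, valid, cutoff

train, valid, cutoff = split_per_shop(rows)
```

Each shop gets its own cutoff date, so a shop with a short history still contributes both training and validation rows.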

## Conclusion
1. In most cases data is split by
  * Row number
  * Time
  * ID
2. The logic of feature generation depends on the data splitting strategy
3. Set up your validation to mimic the train/test split of the competition