# Splitting the data

To compare all trained models on equal grounds, every model must be trained and evaluated on the same train/test split.

Depending on how each patient's time windows are used, a model may be fed one or several rows per patient.
This makes the usual row-wise split inconsistent:
- If the split is done before the rows are processed, many patients will almost certainly have some of their time windows in the train set and the rest in the test set.
    - This would break any procedure that aggregates a patient's windows.
- If the split is done after the rows are processed, it can't be done consistently by row, because the different processing procedures produce datasets with different numbers of rows.

For this reason, it's better here to split by patient *id* rather than by individual row.
This way, each set always contains the same patients, regardless of how the time windows are handled, and each patient's windows can then be processed without issue.
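
The difference between the two strategies can be illustrated with a minimal sketch. It uses plain `(id, window)` tuples as a stand-in for the dataset rows; the toy sizes (10 patients, 12 held-out rows, 2 held-out patients) and the seed are arbitrary assumptions chosen for illustration, not values from this notebook.

```python
import random

# Toy stand-in for the dataset: one (patient_id, window) pair per row.
windows = ['0-2', '2-4', '4-6', '6-12', 'above_12']
rows = [(pid, window) for pid in range(10) for window in windows]

rng = random.Random(0)  # arbitrary seed, only to make the sketch repeatable

# Row-wise split: shuffle the rows and hold out 12 of them.
# Since 12 is not a multiple of 5 (windows per patient), at least one
# patient necessarily ends up with windows on both sides.
shuffled_rows = rows[:]
rng.shuffle(shuffled_rows)
row_test, row_train = shuffled_rows[:12], shuffled_rows[12:]
straddling = {pid for pid, _ in row_test} & {pid for pid, _ in row_train}

# Id-wise split: shuffle the patient ids and hold out 2 of them.
ids = sorted({pid for pid, _ in rows})
rng.shuffle(ids)
test_ids = set(ids[:2])
id_test = [row for row in rows if row[0] in test_ids]
id_train = [row for row in rows if row[0] not in test_ids]
overlap = {pid for pid, _ in id_test} & {pid for pid, _ in id_train}

print('patients in both sets (row-wise):', sorted(straddling))
print('patients in both sets (id-wise): ', sorted(overlap))
```

With the id-wise split, `test_ids` stays meaningful even if the five windows of each patient are later collapsed into a single row, because set membership depends only on the id.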

## Importing the data

In [1]:
from pathlib import Path
import pickle


data_path = Path('../data/data.pkl')
with data_path.open('rb') as file:
    data = pickle.load(file)

data

id,window,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,0-2,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,2-4,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,4-6,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,0
0,6-12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,-1.000000,-1.000000,,,,,-1.000000,-1.000000,0
0,above_12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384,0-2,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,2-4,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,4-6,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,6-12,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0


## Defining the splitting procedure

In [2]:
def split(data, random_seed=None, n=100):
    '''
    Split data into train and test sets by patient id.

    The test set holds all rows of n randomly chosen patient ids and the
    train set holds the rows of the remaining ids. The ids are taken from
    the first level of the index.
    '''
    from random import seed, shuffle

    # Index objects are immutable, so shuffle can't reorder them in place;
    # convert the unique ids to a mutable array first.
    idx = data.index.get_level_values(0).unique().array

    if random_seed is not None:
        seed(random_seed)

    shuffle(idx)

    train_idx = sorted(idx[n:])
    test_idx = sorted(idx[:n])

    train_data = data.loc[train_idx, :]
    test_data = data.loc[test_idx, :]

    return train_data, test_data
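
A quick sanity check on the result of such a split could look like the sketch below. `shares_no_patient` is a hypothetical helper, not part of this notebook, and the toy frame only mimics the `(id, window)` index of the real data.

```python
import pandas as pd


def shares_no_patient(train_data, test_data):
    '''Return True if no patient id appears in both sets (hypothetical helper).'''
    train_ids = set(train_data.index.get_level_values(0))
    test_ids = set(test_data.index.get_level_values(0))
    return train_ids.isdisjoint(test_ids)


# Toy frame with the same index layout as the data: (id, window).
index = pd.MultiIndex.from_product(
    [[0, 1, 2, 3], ['0-2', '2-4']], names=['id', 'window']
)
toy = pd.DataFrame({'icu': [0, 0, 0, 1, 0, 0, 1, 1]}, index=index)

# Selecting by first-level labels keeps each patient's windows together.
by_id_ok = shares_no_patient(toy.loc[[0, 1, 2], :], toy.loc[[3], :])

# Cutting by row position leaves patient 2 with one window in each set.
by_row_ok = shares_no_patient(toy.iloc[:5], toy.iloc[5:])
```

The same helper can be called on the `train_data` and `test_data` returned by `split` before they are exported.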

## Exporting split data

In [3]:
# A fixed seed makes the split reproducible across runs.
train_data, test_data = split(data, random_seed=8001672212340744)

train_data_path = Path('../data/train_data.pkl')

# Write each file only if it doesn't exist yet, so an existing split is
# never overwritten.
if not train_data_path.exists():
    with train_data_path.open('wb') as file:
        pickle.dump(train_data, file)


test_data_path = Path('../data/test_data.pkl')

if not test_data_path.exists():
    with test_data_path.open('wb') as file:
        pickle.dump(test_data, file)