# Exploratory data analysis

This notebook explores the data further.

## Importing data

In [1]:
from pathlib import Path
import pickle

data_path = Path('../data/data.pkl')

with data_path.open('rb') as file:
    data = pickle.load(file)

data

id,window,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,0-2,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,2-4,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,4-6,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,0
0,6-12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,-1.000000,-1.000000,,,,,-1.000000,-1.000000,0
0,above_12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384,0-2,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,2-4,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,4-6,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,6-12,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0


## Dealing with rows

Each patient's data comprises 5 distinct rows, one for each time window.
Since these rows represent the same patient, several different strategies can be considered for dealing with them.

This section explores those strategies.

### Note about time windows after patient admission

Although the documentation suggests discarding the rows (time windows) in which the patient had already been admitted to the ICU, there might be value in keeping them.

Regardless of when the patient was admitted, every one of their rows holds values for a patient who is admitted at some point in time.

In [2]:
data.loc[[11, 14], 'icu'].unstack(1)

window,0-2,2-4,4-6,6-12,above_12
id,,,,,
11,0,0,0,1,1
14,0,0,1,1,1



Patient 11, for example, was admitted in the fourth time window.
On the one hand, this means that in the first three time windows the values are those of a patient who would be admitted in the future.
On the other hand, the last time window corresponds to values of a patient who had already been admitted.
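
The admission window of each patient can also be read programmatically. The following is a minimal sketch on a toy `icu` series that reproduces only the flags displayed above for patients 11 and 14; the names `toy_icu`, `icu_wide` and `admission_window` are illustrative and not part of this notebook.

```python
import pandas as pd

windows = ['0-2', '2-4', '4-6', '6-12', 'above_12']

# Toy copy of the icu flags displayed above for patients 11 and 14.
toy_icu = pd.Series(
    [0, 0, 0, 1, 1,
     0, 0, 1, 1, 1],
    index=pd.MultiIndex.from_product(
        [[11, 14], windows], names=['id', 'window']
    ),
    name='icu',
)

# One row per patient, one column per time window, in chronological order.
icu_wide = toy_icu.unstack('window')[windows]

# idxmax returns the first column holding the row maximum, i.e. the first
# window flagged with icu == 1. Patients never admitted are masked out.
ever_admitted = icu_wide.max(axis=1) == 1
admission_window = icu_wide.idxmax(axis=1).where(ever_admitted)

print(admission_window)
```

Masking with `ever_admitted` matters because, for a patient whose flags are all 0, `idxmax` would otherwise report the first window.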


Therefore, a model trained only on rows in which patients had not yet been admitted would consider patient 11's first three time windows.
It could then be used to predict patient 14's admission, which was registered during the third time window.

If patient 14 had values similar to those of patient 11 in the first two time windows, this model would be expected to guess ICU need correctly.
But if patient 14 arrived at the hospital with values similar to those of patient 11's *last two* time windows, there would be no reason to expect a correct prediction, simply because the model *may never have seen similar data*.
That's because patient 11's last time windows would have been dropped due to ICU status.

In contrast, a model trained on *every row* would have a chance of predicting patient 14's admission regardless of the time window considered.

In conclusion, every time window presents values for a patient that *will* be admitted (or not) at some point in time and should be considered.
This is the approach that's going to be used in this work.
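
To make the difference between the two row selections concrete, here is a small sketch on toy data that mimics the `icu` flags of patients 11 and 14. The `ever_icu` column is an assumption made for illustration (one possible way to label every row with the patient's eventual outcome), not something defined in the dataset.

```python
import pandas as pd

windows = ['0-2', '2-4', '4-6', '6-12', 'above_12']

toy = pd.DataFrame(
    {'icu': [0, 0, 0, 1, 1,
             0, 0, 1, 1, 1]},
    index=pd.MultiIndex.from_product(
        [[11, 14], windows], names=['id', 'window']
    ),
)

# Selection suggested by the documentation: only pre-admission windows.
before_admission = toy[toy['icu'] == 0]

# Selection adopted here: every window is kept. Each row is labelled with
# whether the patient is admitted at any point (hypothetical column name).
all_rows = toy.assign(ever_icu=toy.groupby('id')['icu'].transform('max'))

rows_per_patient = pd.DataFrame({
    'before_admission': before_admission.groupby('id').size(),
    'all_rows': all_rows.groupby('id').size(),
})

print(rows_per_patient)
```

In this toy case the first selection keeps three rows for patient 11 and two for patient 14, while the second keeps all five rows for both.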

### One row per patient, first time window

The documentation suggests using only the first time window for each patient.
This is a straightforward way of dealing with the problem and may make the model more clinically relevant.
This is because, by being fit to the first time window, the model might be able to identify the patients in need as soon as they arrive at the hospital.

In this case, care must be taken to make sure *icu* indicates whether the patient was admitted at any point in time and not just on the first time window.

In [3]:
import pandas as pd

first_time_window = data.loc[(slice(None), '0-2'), :].droplevel('window').copy()

# The aggregation max() gives 1 for the patients that were admitted at some
# point in time.
first_time_window.loc[:, 'icu'] = (
    data.loc[:, 'icu']
    .groupby('id')
    .max()
)

first_time_window

id,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1
1,1,90th,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1
2,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,1
3,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,0
4,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1
381,1,above_90th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,0
382,0,50th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,1
383,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,0


### One row per patient, aggregating

Another way to reduce each patient's data to a single row is to aggregate the rows.
This procedure could have the advantage of supplying the model with somewhat more information.
Whether it produces better results than taking only the first time window remains to be tested.

- Since the *demographics* are constant across time windows, the aggregating function can be just taking the *first* value.

- As already seen, the *comorbidities* actually vary.
Considering they're about the existence of diseases, it seems reasonable to consider them to be present whenever at least one time window is positive.

- For the *labs* and *vitals*, the *mean* will be used.

- *icu* should be aggregated as in the previous case.
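
The rules listed above can be tried out on a toy frame before applying them to the real data. This is only a sketch: the column names mimic the dataset, but the values are made up, and just one column per group is included.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'gender': [0, 0, 0, 1, 1, 1],                         # demographic
        'htn': [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],                # comorbidity
        'heart_rate_mean': [-0.2, None, 0.4, 0.0, 0.1, 0.2],  # vital
        'icu': [0, 0, 1, 0, 0, 0],
    },
    index=pd.MultiIndex.from_product(
        [[0, 1], ['0-2', '2-4', '4-6']], names=['id', 'window']
    ),
)

# One aggregation function per column, following the rules listed above.
toy_aggregated = toy.groupby('id').agg(
    {
        'gender': 'first',          # constant across time windows
        'htn': 'max',               # present if any window is positive
        'heart_rate_mean': 'mean',  # missing values are skipped
        'icu': 'max',               # admitted at any point in time
    }
)

print(toy_aggregated)
```

Note that `mean` ignores missing windows by default, so a patient with a single measured window still gets a value.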

In [4]:
import json


groups_path = Path('../data/groups.json')
with groups_path.open('r') as file:
    groups = json.load(file)

agg_funcs_dict = {
    'demographics': 'first',
    'comorbidities': 'max',
    'labs': 'mean',
    'vitals': 'mean',
}

agg_funcs = {
    feature: agg_funcs_dict[group]
    for group, feature_list in groups.items()
    for feature in feature_list
}

# icu is not in the groups dictionary.
agg_funcs['icu'] = 'max'

aggregated_data = (
    data.groupby('id')
    .agg(agg_funcs)
)

aggregated_data

id,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-0.607843,-0.809524,-0.954545,-0.796656,-0.530814,-0.743487,-0.634409,-0.810570,-0.953608,1
1,1,90th,1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,-0.600000,-0.747619,-0.959596,-0.718228,-0.726155,-0.836096,-0.634409,-0.748573,-0.960463,1
2,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.803922,-0.750000,-0.885522,-0.595604,-0.419448,-0.681860,-0.792832,-0.752732,-0.887561,1
3,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.784314,-0.682540,-0.723906,-0.769565,-0.685906,-0.689698,-0.776583,-0.682540,-0.724145,0
4,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.901961,-0.761905,-0.959596,-0.884058,-0.826611,-0.839287,-0.896057,-0.766042,-0.960291,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,0,40th,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,-0.838235,-0.880952,-0.929293,-0.876430,-0.779962,-0.888383,-0.811492,-0.883840,-0.929354,1
381,1,above_90th,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.941176,-0.654762,-0.979798,-0.991304,-0.935754,-0.902335,-0.939068,-0.654898,-0.980026,0
382,0,50th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.725490,-0.833333,-0.946128,-0.837999,-0.745459,-0.872459,-0.730617,-0.832872,-0.945017,1
383,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.862745,-0.857143,-0.946128,-0.850932,-0.861989,-0.852417,-0.857826,-0.857536,-0.946175,0


### One row per patient, but using every value

Another option is to put every value a given patient has in a single row by using some kind of a "pivot" operation.
The upside is that it results in a single row per patient, but without losing any information.
The downside is that it can be much more complicated to use these values to make a predictive model.
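
As a small illustration of such a pivot, the sketch below unstacks a toy frame with two patients and two time windows, then flattens the resulting two-level column labels. The values are made up and the `toy_pivoted` name is only illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'heart_rate_mean': [-0.2, 0.1, 0.4, 0.0],
        'temperature_mean': [0.3, 0.2, 0.5, 0.6],
    },
    index=pd.MultiIndex.from_product(
        [[0, 1], ['0-2', '2-4']], names=['id', 'window']
    ),
)

# Move the 'window' index level into the columns: one row per patient and
# one column per (feature, window) pair.
toy_pivoted = toy.unstack('window')

# Flatten the (feature, window) tuples into plain string labels.
toy_pivoted.columns = [
    '__'.join(labels).replace('-', '_') for labels in toy_pivoted.columns
]

print(toy_pivoted)
```

Each feature is multiplied by the number of time windows, which is why the resulting table is much wider than the original.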

In [5]:
# Only labs and vitals have different values for each time window. The
# resulting DataFrame has a MultiIndex (feature, time_window) for its columns.
pivoted = data[groups['labs']+groups['vitals']].unstack('window')

def flatten_index(labels, sep='__'):
    '''
    Flatten one ``MultiIndex`` entry into a single label.

    Joins the level values of the given tuple with ``sep`` and replaces
    dashes (-) with underscores (_). Meant to be passed to
    ``MultiIndex.map``, which calls it once per entry.
    '''
    flat_label = sep.join(labels).replace('-', '_')

    return flat_label

pivoted.columns = pivoted.columns.map(flatten_index)

# Reconstruct whole DataFrame by concatenating with other features aggregated.
pivoted = pd.concat(
    [
        data[groups['demographics']+groups['comorbidities']].groupby('id').max(),
        pivoted,
        data['icu'].groupby('id').max(),
    ],
    axis=1,
)

pivoted

id,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,temperature_diff_rel__2_4,temperature_diff_rel__4_6,temperature_diff_rel__6_12,temperature_diff_rel__above_12,oxygen_saturation_diff_rel__0_2,oxygen_saturation_diff_rel__2_4,oxygen_saturation_diff_rel__4_6,oxygen_saturation_diff_rel__6_12,oxygen_saturation_diff_rel__above_12,icu
0,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.0,,-1.000000,-0.242282,-1.0,-1.0,,-1.000000,-0.814433,1
1,1,90th,1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,-1.0,-1.000000,-0.882574,0.139709,-1.0,-1.0,-1.000000,-1.000000,-0.802317,1
2,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,-1.000000,-0.505464,,,-0.961262,-0.801293,-0.900129,1
3,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,-1.000000,-1.000000,-0.047619,,,-1.000000,-1.000000,-0.172436,0
4,0,10th,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,,-0.953536,-0.698797,-0.645793,,,-0.980333,-0.960463,-0.940077,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,0,40th,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,,-1.000000,-1.000000,-0.535361,-1.0,,-1.000000,-1.000000,-0.717417,1
381,1,above_90th,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,-0.612627,-0.697169,,,,-1.000000,-0.960052,0
382,0,50th,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,,-1.000000,-0.498615,,-1.0,,-1.000000,-0.835052,1
383,0,40th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,-1.000000,-1.000000,-0.572609,,,-1.000000,-1.000000,-0.838524,0


### Every row

Finally, all of the dataset's rows can be used.
This can be viewed as a kind of data augmentation, in which the data for each patient is perceived as data for 5 distinct patients (one for each time window).
Given that each time window brings different values, this procedure supplies the model with a lot more data, which may be helpful for getting better predictions.

In [6]:
data

id,window,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,0-2,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,2-4,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,4-6,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,0
0,6-12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,-1.000000,-1.000000,,,,,-1.000000,-1.000000,0
0,above_12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384,0-2,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,2-4,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,4-6,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,6-12,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0


### Summary

Given the dataset's nature, it can be worked with in several different ways with regard to the rows. For each patient, it is possible to:

- Take the first time window;
- Aggregate every time window into a single value;
- Pivot the data so that every value for the given patient is contained in a single row;
- Use every row as is.

In this project, each of these methods will be tried when modelling this predictive problem.
Also, the time windows after the patient's admission will not be discarded.

## Discarding redundant columns

Every *lab* and *vital* feature has an associated *diff* variable (defined as *max* - *min*). The *vitals* also have a *_diff_rel* variable (defined as *diff* / *median*).
Since those two variables are obtained directly from the others, it could be a good option to leave them out of a first model, for the sake of dimensionality reduction.
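
The two definitions can be illustrated with a short sketch. The readings below are made-up raw values (the dataset's own values look rescaled, so the relations are shown here on unscaled numbers), and the regular expression is a simplified variant used only for this toy frame.

```python
import pandas as pd

# Hypothetical raw heart-rate statistics for two time windows.
toy = pd.DataFrame({
    'heart_rate_min': [60.0, 70.0],
    'heart_rate_median': [80.0, 75.0],
    'heart_rate_max': [100.0, 85.0],
})

# Both variables are fully determined by the other statistics.
toy['heart_rate_diff'] = toy['heart_rate_max'] - toy['heart_rate_min']
toy['heart_rate_diff_rel'] = toy['heart_rate_diff'] / toy['heart_rate_median']

# Select the derived columns by suffix and leave them out.
is_derived = toy.columns.str.contains(r'_diff(?:_rel)?$')
toy_reduced = toy.loc[:, ~is_derived]

print(toy)
print(toy_reduced.columns.tolist())
```

Because the derived columns carry no information beyond *min*, *max* and *median*, removing them shrinks the feature set without discarding any measurement.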

In [7]:
def drop_redundant_columns(df):
    '''
    Drop redundant columns from the ``DataFrame``.

    Returns the ``DataFrame`` with the ``_diff`` and ``_diff_rel`` columns
    removed.
    '''
    redundant_features = (
        df.columns
        .str.extract(r'(\w+_diff(?:_rel)?(?:__.+)?)')
        .squeeze()
        .dropna()
    )

    out_df = df.drop(redundant_features, axis=1)

    return out_df

drop_redundant_columns(data)

id,window,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_min,temperature_min,oxygen_saturation_min,bloodpressure_diastolic_max,bloodpressure_sistolic_max,heart_rate_max,respiratory_rate_max,temperature_max,oxygen_saturation_max,icu
0,0-2,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-0.500000,0.208791,0.898990,-0.247863,-0.459459,-0.432836,-0.636364,-0.420290,0.736842,0
0,2-4,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-0.500000,0.714286,0.838384,-0.076923,-0.459459,-0.313433,-0.636364,0.246377,0.578947,0
0,4-6,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,0
0,6-12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,0.318681,0.898990,,,,,-0.275362,0.736842,0
0,above_12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-0.857143,0.098901,0.797980,-0.076923,0.286486,0.298507,0.272727,0.362319,0.947368,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384,0-2,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.428571,0.714286,0.919192,-0.299145,-0.502703,-0.164179,-0.575758,0.246377,0.789474,0
384,2-4,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.500000,0.472527,0.838384,-0.247863,-0.567568,-0.298507,-0.636364,-0.072464,0.578947,0
384,4-6,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.500000,0.472527,0.898990,-0.247863,-0.459459,-0.343284,-0.636364,-0.072464,0.736842,0
384,6-12,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.571429,0.560440,0.797980,-0.162393,-0.567568,-0.358209,-0.696970,0.043478,0.473684,0


## Removing duplicated columns

Some columns in this dataset are duplicated and should be removed.
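
The detection used in the cells below relies on transposing the frame, so that `DataFrame.duplicated` (which compares rows) ends up comparing columns. A minimal sketch on made-up values, with column names that only mimic the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    'albumin_median': [0.6, -0.9, 0.1],
    'albumin_mean': [0.6, -0.9, 0.1],
    'albumin_max': [0.6, -0.9, 0.1],
    'urea_median': [-0.8, -0.8, 0.3],
})

# After transposing, each original column is a row. duplicated() flags every
# row that repeats an earlier one, so the first occurrence is not flagged.
is_duplicate = toy.T.duplicated()
duplicated_columns = toy.columns[is_duplicate]

print(duplicated_columns)
```

Since the first occurrence is never flagged, dropping the flagged columns keeps exactly one copy of each distinct column.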

### Comorbidities

In [8]:
duplicated_comorbidities = (
    data.loc[:, groups['comorbidities']]
    .columns[
        data[groups['comorbidities']]
        .T
        .duplicated()
    ]
)

duplicated_comorbidities

Index([], dtype='object')

### Demographics

In [9]:
duplicated_demographics = (
    data.loc[:, groups['demographics']]
    .columns[
        data[groups['demographics']]
        .T
        .duplicated()
    ]
)

duplicated_demographics

Index([], dtype='object')

### Labs

In [10]:
duplicated_labs = (
    data.loc[:, groups['labs']]
    .columns[
        data[groups['labs']]
        .T
        .duplicated()
    ]
)

duplicated_labs

Index(['albumin_mean', 'albumin_min', 'albumin_max', 'be_arterial_mean',
       'be_arterial_min', 'be_arterial_max', 'be_arterial_diff',
       'be_venous_mean', 'be_venous_min', 'be_venous_max',
       ...
       'ttpa_max', 'ttpa_diff', 'urea_mean', 'urea_min', 'urea_max',
       'urea_diff', 'dimer_mean', 'dimer_min', 'dimer_max', 'dimer_diff'],
      dtype='object', length=143)

In [11]:
data_dropped_labs = (
    data
    .loc[:, groups['labs']]
    .drop(duplicated_labs, axis=1)
)

(
    data[groups['labs']]
    .columns
    .str.extract('(max|mean|median|min|diff|diff_rel)$')
    .value_counts()
)

diff      36
max       36
mean      36
median    36
min       36
dtype: int64

The counts above cover the original *labs* columns: there are 36 lab results, each with five statistics (*diff*, *max*, *mean*, *median* and *min*), for 180 columns in total.
With 143 of those columns flagged as duplicates of an earlier column, the different statistics of each lab result must repeat one another.
That probably means only one lab test was collected in each time window (which would explain equal values for all statistics).
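
The single-measurement hypothesis can be illustrated with made-up numbers: when a window holds only one value, every summary statistic collapses to that value, whereas several values give distinct statistics.

```python
import pandas as pd

stats = ['min', 'max', 'mean', 'median']

# Hypothetical lab readings within one time window.
single = pd.Series([3.2])
several = pd.Series([3.2, 4.0, 5.4])

single_stats = single.agg(stats)
several_stats = several.agg(stats)

print(single_stats)
print(several_stats)
```

With a single reading, *max* - *min* is also zero, which would be consistent with a constant *diff* column.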

### Vitals

In [12]:
duplicated_vitals = (
    data.loc[:, groups['vitals']]
    .columns[
        data[groups['vitals']]
        .T
        .duplicated()
    ]
)

duplicated_vitals

Index([], dtype='object')

### Removing duplicated columns

Only the *labs* results are duplicated and for each feature all statistics are equal.
For that reason, a function can be defined to drop all *labs* features but *max*.

In [13]:
def drop_duplicate_columns(data):
    '''
    Drop ``DataFrame``'s repeated columns.

    Since the data only has duplicated columns in the ``labs`` category, they
    can be dropped directly.
    '''
    cols_to_drop = (
        data[groups['labs']].columns
        .str.extract(r'(\w+_(?:mean|median|min|diff|diff_rel)(?:__.+)?)')
        # With only one capturing group, extract returns a single-column
        # DataFrame, which squeeze then turns into a Series.
        .squeeze()
        .dropna()
    )

    data_dropped = data.drop(cols_to_drop, axis=1)

    return data_dropped

data_dropped_labs = drop_duplicate_columns(data)

data_dropped_labs

id,window,age_above65,age_percentil,gender,disease_grouping_1,disease_grouping_2,disease_grouping_3,disease_grouping_4,disease_grouping_5,disease_grouping_6,htn,...,respiratory_rate_diff,temperature_diff,oxygen_saturation_diff,bloodpressure_diastolic_diff_rel,bloodpressure_sistolic_diff_rel,heart_rate_diff_rel,respiratory_rate_diff_rel,temperature_diff_rel,oxygen_saturation_diff_rel,icu
0,0-2,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,2-4,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
0,4-6,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,0
0,6-12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,,-1.000000,-1.000000,,,,,-1.000000,-1.000000,0
0,above_12,1,60th,0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.176471,-0.238095,-0.818182,-0.389967,0.407558,-0.230462,0.096774,-0.242282,-0.814433,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384,0-2,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,2-4,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,4-6,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
384,6-12,0,50th,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,0
