# Notebook: Indexing on an event

In [1]:
import os
import sys
from pathlib import Path
import logging
import torch

from FastEHR.dataloader import FoundationalDataModule, index_inclusion_method
from example_wrappers import t2d_inclusion_method

torch.manual_seed(1337)

logging.basicConfig(level=logging.INFO)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"Using device: {device}.")
!pwd
%load_ext autoreload
%autoreload 2

Using device: cpu.
/home/ubuntu/Documents/GitHub/SurvivEHR/FastEHR/examples/3_build_fine_tuning_datasets/1_indexing


## Example for creating a supervised dataset indexed on T2D with Hypertension outcomes

In [2]:
print(t2d_inclusion_method.__doc__)

None


In [3]:
# Build the dataset indexed on T2D, with hypertension as the outcome
dm = FoundationalDataModule(path_to_db="../../data/_built/example_database.db",
                            path_to_ds="../../data/_built/indexed_datasets/T2D_hypertension/",
                            load=False,
                            include_diagnoses=True,
                            include_measurements=True,
                            drop_missing_data=False,
                            drop_empty_dynamic=True,
                            tokenizer="tabular",
                            overwrite_practice_ids="../../data/_built/dataset/practice_id_splits.pickle",
                            overwrite_meta_information="../../data/_built/dataset/meta_information.pickle",
                            study_inclusion_method=t2d_inclusion_method(outcomes=["HYPERTENSION"]),
                            # num_threads=1
                           )

INFO:root:Creating unsupervised collator for DataModule
INFO:root:Building Polars datasets and saving to ../../data/_built/indexed_datasets/T2D_hypertension/
INFO:root:Using train/test/val splits from ../../data/_built/dataset/practice_id_splits.pickle
INFO:root:Processing test split...
Thread generating parquet for 1 practices: 100%|██| 1/1 [00:00<00:00, 19.69it/s]
INFO:root:Created dataset at ../../data/_built/indexed_datasets/T2D_hypertension/split=test with 0 number of samples
Getting file row counts. This allows the creation of an index to file map, increasing read efficiency: 0it [00:00, ?it/s]
INFO:root:	 Obtained with a total of 0 samples
INFO:root:Processing train split...
Thread generating parquet for 18 practices: 100%|█| 18/18 [00:00<00:00, 32.04it/
INFO:root:Created dataset at ../../data/_built/indexed_datasets/T2D_hypertension/split=train with 0 number of samples
Getting file row counts. This allows the creation of an index to file map, increasing read efficiency: 1it [00

### Meta information

When we generated the pre-training dataset, we created a meta information file containing estimated quantiles, event counts, etc. for the full dataset.

We can create a new meta information file here by leaving ``overwrite_meta_information`` blank, or re-use the one we already have. Re-using it
- saves redundant computation,
- keeps the following consistent between pre-training and fine-tuning:
    - standardisation strategy,
    - tokenisation strategy.
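
To illustrate why re-using the file keeps standardisation consistent, here is a minimal sketch. It does not reflect FastEHR's actual file layout: the dictionary keys, the ``TOY_MEASUREMENT`` name, the ``standardise`` helper and the median/IQR scaling are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from statistics import quantiles

# Toy values standing in for a measurement seen while building the pre-training dataset.
pretraining_values = [4.0, 5.0, 6.0, 7.0, 8.0]

# Hypothetical "meta information": quartiles estimated once on the full dataset.
q1, median, q3 = quantiles(pretraining_values, n=4)
meta_information = {"TOY_MEASUREMENT": {"median": median, "iqr": q3 - q1}}


def standardise(value, stats):
    """Scale a value using previously estimated statistics (assumed scheme)."""
    return (value - stats["median"]) / stats["iqr"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "meta_information.pickle"

    # Pre-training step: persist the statistics.
    with open(path, "wb") as f:
        pickle.dump(meta_information, f)

    # Fine-tuning step: reload them instead of re-estimating on the smaller cohort.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# Values from the fine-tuning cohort are scaled with the pre-training statistics,
# so the same raw value maps to the same standardised value in both stages.
fine_tuning_values = [6.0, 9.0]
standardised = [standardise(v, reloaded["TOY_MEASUREMENT"]) for v in fine_tuning_values]
print(standardised)
```

Had the statistics been re-estimated on the fine-tuning cohort alone, the same raw value would generally map to a different standardised value than the one the pre-trained model saw.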

### Other information on setup

In [4]:
#TODO ADD

## Datasets

We observe that only two samples from our example training set remain. This is because no other samples met the inclusion criteria specified by ``t2d_inclusion_method`` (our example dataset contains only two cases of Type II diabetes).

For these two examples, one patient experienced the outcome of interest within the study period, whilst the other did not. In fact, the second patient died before the outcome could be observed.

In both cases, the penultimate event is the index event, and the last event is either the first observed occurrence of a specified outcome or the last observation within the study period. All other events following the index event are removed.

We also see that events beginning before the study period are included.
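
The truncation rule described above can be sketched in plain Python. This is not the implementation behind ``t2d_inclusion_method``; the ``index_on_event`` function, the toy event lists and the choice to exclude patients with no follow-up are assumptions for illustration, and the study-period limits are not modelled.

```python
def index_on_event(events, index_token, outcomes):
    """Truncate a patient's history around an index event (illustrative only).

    events: list of (token, age_in_days) tuples, sorted by age.
    Returns the history up to and including the index event, followed by one
    final event: the first outcome after the index event if there is one,
    otherwise the last observation. Returns None if the patient is excluded.
    """
    # Find the first occurrence of the index event.
    index_pos = next(
        (i for i, (token, _) in enumerate(events) if token == index_token), None
    )
    if index_pos is None:
        return None  # Inclusion criteria not met: no index event.

    history = events[: index_pos + 1]  # Everything before the index event is kept.
    follow_up = events[index_pos + 1 :]
    if not follow_up:
        return None  # Assumption: nothing observed after the index event.

    first_outcome = next((e for e in follow_up if e[0] in outcomes), None)
    target = first_outcome if first_outcome is not None else follow_up[-1]
    return history + [target]


# Toy patients (made-up ages in days).
with_outcome = [
    ("OSTEOPOROSIS", 16000),
    ("TYPE2DIABETES", 20000),
    ("ASTHMA", 21000),
    ("HYPERTENSION", 22000),
    ("DEATH", 24000),
]
without_outcome = [
    ("OSTEOPOROSIS", 16000),
    ("TYPE2DIABETES", 20000),
    ("ASTHMA", 21000),
    ("DEATH", 24000),
]
no_index_event = [("OSTEOPOROSIS", 16000), ("HYPERTENSION", 22000)]

indexed_with = index_on_event(with_outcome, "TYPE2DIABETES", {"HYPERTENSION"})
indexed_without = index_on_event(without_outcome, "TYPE2DIABETES", {"HYPERTENSION"})
indexed_excluded = index_on_event(no_index_event, "TYPE2DIABETES", {"HYPERTENSION"})
```

In both retained toy patients the index event ends up second to last, and the intermediate ``ASTHMA`` event is dropped, mirroring the behaviour described in the text.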

In [5]:
for dataset in [dm.train_set, dm.val_set, dm.test_set]:
    for sample in range(len(dataset)):
        dataset.view_sample(sample)

SEX                 | F
IMD                 | 2.0
ETHNICITY           | WHITE
birth_year          | 1940.0
Sequence of 6 events

Token                                                                       | Age at event (days)         | Standardized value
OTHER_CHRONIC_LIVER_DISEASE_OPTIMAL                                        | 16425.0                       | nan
LYMPHOMA_PREVALENCE_V2                                                     | 16425.0                       | nan
HAEMOCHROMATOSIS_V2                                                        | 16425.0                       | nan
OSTEOPOROSIS                                                               | 16425.0                       | nan
TYPE2DIABETES                                                              | 20075.0                       | nan
DEATH                                                                      | 23725.0                       | nan
