# Table of Contents

<a name="outline"></a>

## Setup

- [A](#seca) External Imports
- [B](#secb) Internal Imports
- [C](#secc) Configurations and Paths
- [D](#secd) Patient Interface and Train/Val/Test Partitioning


## Clustering

- [1](#sec1) Disease Embeddings Clustering
- [2](#sec2) Subject Embeddings Clustering

<a name="seca"></a>

### A External Imports [^](#outline)

In [1]:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path
from IPython.display import display
from upsetplot import from_contents, plot, UpSet, from_indicators


<a name="secb"></a>

### B Internal Imports [^](#outline)

In [2]:
# HOME and DATA_STORE are machine-specific; change them as appropriate.
HOME = os.environ.get('HOME')
DATA_STORE = os.path.join(HOME, 'GP', 'ehr-data')
SOURCE_DIR = os.path.abspath("..")

# Input data file and the directories holding the training artefacts.
DATA_FILE = os.path.join(DATA_STORE, 'cprd-data', 'DUMMY_DATA.csv')
ARTEFACTS_DIR = 'cprd_artefacts'
TRAIN_DIR = os.path.join(ARTEFACTS_DIR, 'train')


%load_ext autoreload
%autoreload 2

import analysis as A
import common as C





<a name="secc"></a>

### C Configurations and Paths [^](#outline)

In [3]:
with C.modified_environ(DATA_FILE=DATA_FILE):
    cprd_dataset = C.datasets['CPRD']
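
The cell above relies on `C.modified_environ` to expose `DATA_FILE` only while the dataset is loaded. Its implementation is not shown in this notebook; a minimal sketch of how such a context manager could work, assuming it temporarily sets environment variables and restores them on exit, is given below. The name `temp_environ` is hypothetical and is not part of the `common` module.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_environ(**updates):
    """Temporarily set environment variables (hypothetical helper)."""
    # Remember the previous value of each key; None means "was not set".
    saved = {key: os.environ.get(key) for key in updates}
    os.environ.update({key: str(value) for key, value in updates.items()})
    try:
        yield
    finally:
        # Restore the environment exactly as it was before entering.
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

Restoring in a `finally` clause keeps the environment clean even if loading the dataset raises.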

In [4]:
relative_auc_config = {
    'pvalue': 0.01, 
    'min_auc': 0.9
}
top_k_list = [1, 2, 3, 5, 7, 10, 15, 20]
percentile_range = 20
n_percentiles = 100 // percentile_range


import matplotlib.font_manager as font_manager
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({'font.family': 'sans-serif',
                     'font.sans-serif': 'Helvetica',
                     'font.weight':  'normal'})

In [5]:
output_dir = 'cprd_analysis_artefacts'
Path(output_dir).mkdir(parents=True, exist_ok=True)


<a name="secd"></a>

### D Patient Interface and Train/Val/Test Partitioning [^](#outline)

In [6]:
code_scheme = {
    'dx': 'dx_cprd_ltc9809',
    'dx_outcome': 'dx_cprd_ltc9809'
}

cprd_interface = C.Subject_JAX.from_dataset(cprd_dataset, code_scheme=code_scheme)

In [7]:
cprd_splits = cprd_interface.random_splits(split1=0.7, split2=0.85, random_seed=42)
cprd_train_ids, cprd_valid_ids, cprd_test_ids = cprd_splits
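
The arguments `split1=0.7` and `split2=0.85` suggest cumulative cut points, giving roughly a 70/15/15 train/validation/test partition. The internals of `random_splits` are not shown here; the sketch below illustrates that reading with NumPy. The function `random_split_ids` is a hypothetical stand-in, not the `Subject_JAX` method.

```python
import numpy as np


def random_split_ids(subject_ids, split1=0.7, split2=0.85, random_seed=42):
    """Shuffle IDs and cut them at two cumulative fractions (assumed semantics)."""
    rng = np.random.default_rng(random_seed)
    shuffled = rng.permutation(np.asarray(list(subject_ids)))
    cut1 = int(split1 * len(shuffled))  # end of the training portion
    cut2 = int(split2 * len(shuffled))  # end of the validation portion
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]


train_ids, valid_ids, test_ids = random_split_ids(range(100))
```

Fixing the seed makes the partition reproducible across notebook runs, which matters when model selection on the validation set is compared later.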


In [8]:
# Group outcome codes into percentile bands computed on the training split only.
cprd_percentiles = cprd_interface.dx_outcome_by_percentiles(percentile_range, cprd_train_ids)
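
One plausible reading of the call above is that outcome codes are ranked by how often they occur among the training subjects and then grouped into bands of 20 percentiles each (five bands). The actual logic of `dx_outcome_by_percentiles` is not shown in this notebook, so the following is only a sketch under that assumption; `codes_by_percentile` and the toy counts are hypothetical.

```python
import numpy as np
import pandas as pd


def codes_by_percentile(code_counts, percentile_range=20):
    """Split codes into equally sized bands, rarest first (assumed semantics)."""
    n_bands = 100 // percentile_range
    # Sort codes from least to most frequent; a stable sort keeps ties in input order.
    ranked = pd.Series(code_counts).sort_values(kind="stable").index.to_numpy()
    return [set(chunk) for chunk in np.array_split(ranked, n_bands)]


# Toy frequencies: code 'c0' occurs once, 'c9' occurs ten times.
toy_counts = {f"c{i}": i + 1 for i in range(10)}
bands = codes_by_percentile(toy_counts)
```

Computing the bands on the training split alone avoids leaking frequency information from the validation and test subjects.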


<a name="sec0"></a>

## 0 Snooping/Selecting Best Models from the Validation Set [^](#outline)

In [9]:
from glob import glob
clfs = [os.path.basename(d) for d in glob(f"{TRAIN_DIR}/*")]
model_dir = dict(zip(clfs, clfs))

In [10]:

cprd_top = A.get_trained_models(clfs=clfs, train_dir={'cprd': TRAIN_DIR},
                                model_dir=model_dir, data_tag='cprd',
                                criterion='MICRO-AUC', comp=max)
display(cprd_top['summary'])


| | Clf | Best_i | MICRO-AUC |
|---|---|---|---|
| 0 | ICE-NODE_UNIFORM | 0 | 0.440169 |
| 1 | RETAIN | 38 | 0.898837 |
| 2 | LogReg | 0 | 0.499683 |
| 3 | ICE-NODE | 0 | 0.440169 |
| 4 | GRU | 0 | 0.358739 |
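
The `Best_i` column reports, for each classifier, the training iteration that wins under the chosen `criterion` and `comp` (here the maximum validation `MICRO-AUC`). How `A.get_trained_models` does this is not shown; the sketch below illustrates the idea on a per-iteration evaluation table. The function `best_iteration` and the scores are hypothetical and are not results from this notebook.

```python
import pandas as pd


def best_iteration(eval_df, criterion="MICRO-AUC", comp=max):
    """Return (index, value) of the row that wins under `comp` (illustrative)."""
    scores = eval_df[criterion]
    best_value = comp(scores)
    # On ties, keep the earliest iteration that reaches the best value.
    best_i = scores[scores == best_value].index[0]
    return best_i, best_value


# Made-up validation scores for three evaluation steps.
toy_eval = pd.DataFrame({"MICRO-AUC": [0.5, 0.9, 0.7]})
best_i, best_auc = best_iteration(toy_eval)
```

Passing `comp=min` instead would suit criteria where lower is better, such as a loss.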


In [11]:

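def select_predictor(clf):
    """Rebuild the model for `clf` and restore its best parameters."""
    # Configuration and parameters saved at the best validation iteration.
    config = cprd_top['config'][clf]
    params = cprd_top['params'][clf]
    model = C.model_cls[clf].create_model(config, cprd_interface, [])
    state = model.init_with_params(config, params)
    return model, state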

cprd_predictors = {clf: select_predictor(clf) for clf in clfs}

<a name="sec1"></a>

## 1 Disease Embeddings Clustering on CPRD [^](#outline)

<a name="sec2"></a>

## 2 Subject Embeddings Clustering on CPRD [^](#outline)