# Table of Contents

<a name="outline"></a>

## Setup

- [A](#seca) External Imports
- [B](#secb) Internal Imports
- [C](#secc) Configurations and Paths 
- [D](#secd) Patient Interface and Train/Val/Test Partitioning
- [E](#sece) Setup Metrics


## 1. [Load Models: Uninitialised](#models)
## 2. [Snapshot Selection](#snapshot)
## 3. [Disease Embeddings Clustering](#disease-clusters)
## 4. [Subject Embeddings Clustering](#subject-clusters)


<a name="seca"></a>

### A External Imports [^](#outline)

In [1]:
import sys
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pathlib import Path
from IPython.display import display
import jax

jax.config.update('jax_platform_name', 'cpu')

<a name="secb"></a>

### B Internal Imports [^](#outline)

In [2]:
sys.path.append("..")


from lib import utils as U
from lib.ehr.dataset import load_dataset

%load_ext autoreload
%autoreload 2


ModuleNotFoundError: No module named 'lib'
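The `ModuleNotFoundError` above usually means that `..` did not resolve to the repository root from the kernel's working directory. A minimal sketch of a more robust lookup, assuming the `lib` package lives in some ancestor of the notebook's directory (the helper name `add_repo_root` is hypothetical and not part of `lib`):

```python
import sys
from pathlib import Path


def add_repo_root(start=None, package="lib"):
    """Walk upwards from `start` until a directory containing `package` is found,
    then put that directory at the front of sys.path.

    Returns the directory that was found, or None if there is none.
    """
    here = Path(start or Path.cwd()).resolve()
    for candidate in (here, *here.parents):
        if (candidate / package).is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    return None
```

Calling `add_repo_root()` before `from lib import utils as U` would then make the import independent of where the kernel was started.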

<a name="secc"></a>

### C Configurations and Paths [^](#outline)

In [None]:
training_dir = 'cprd_artefacts/train'
output_dir = 'cprd_clustering_artefacts'

Path(output_dir).mkdir(parents=True, exist_ok=True)

In [None]:
# Assign the path of the dataset file to `DATA_FILE`.
HOME = os.environ.get('HOME')
# Alternative (dummy) dataset; unused because `DATA_FILE` is set below.
# DATA_FILE = f'{HOME}/GP/ehr-data/cprd-data/DUMMY_DATA.csv'
DATA_STORE = f'{HOME}/Documents/DS211/users/tb1009/DATA'
DATA_FILE = os.path.join(DATA_STORE, 'ICE_TEST_50.csv')
SOURCE_DIR = os.path.abspath("..")

with U.modified_environ(DATA_FILE=DATA_FILE):
    cprd_dataset = load_dataset('CPRD')
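`U.modified_environ` is used here to expose `DATA_FILE` to `load_dataset` through the process environment. Its implementation is not shown in this notebook; the sketch below only illustrates what such a context manager might look like, assuming it sets the variables temporarily and restores the previous state on exit (the name `modified_environ_sketch` is hypothetical):

```python
import os
from contextlib import contextmanager


@contextmanager
def modified_environ_sketch(**updates):
    """Temporarily set environment variables and restore the old values on exit."""
    # Remember the previous value (or None if the variable was unset).
    saved = {key: os.environ.get(key) for key in updates}
    os.environ.update({key: str(value) for key, value in updates.items()})
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```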

In [3]:
df = pd.read_csv(DATA_FILE, sep='\t')


NameError: name 'DATA_FILE' is not defined

<a name="secd"></a>

### D Patient Interface and Train/Val/Test Partitioning [^](#outline)

**Configurations should match the training notebook**

In [6]:
from lib.ehr.coding_scheme import DxLTC212FlatCodes, DxLTC9809FlatMedcodes, EthCPRD5, EthCPRD16
from lib.ehr import OutcomeExtractor, SurvivalOutcomeExtractor
from lib.ehr import Subject_JAX
from lib.ehr import StaticInfoFlags

%load_ext autoreload
%autoreload 2

code_scheme = {
    #'dx': DxLTC9809FlatMedcodes(), # other options 
    #'outcome': SurvivalOutcomeExtractor('dx_cprd_ltc9809'),
    # Comment above^, and uncomment below, to consider only the first occurrence of codes per subject.
    # 'outcome': SurvivalOutcomeExtractor('dx_cprd_ltc9809'),
    'dx': DxLTC212FlatCodes(),
    'outcome': SurvivalOutcomeExtractor('dx_cprd_ltc212'),
    'eth': EthCPRD5()
}


static_info_flags = StaticInfoFlags(
    gender=True,
    age=True,
    idx_deprivation=True,
    ethnicity=EthCPRD5(),  # <- include it by the category of interest, not just 'True'.
)

cprd_interface = Subject_JAX.from_dataset(cprd_dataset, code_scheme=code_scheme, static_info_flags=static_info_flags)
cprd_splits = cprd_interface.random_splits(split1=0.7, split2=0.85, random_seed=42)


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
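The call `random_splits(split1=0.7, split2=0.85, random_seed=42)` presumably yields a 70/15/15 train/validation/test partition, if `split1` and `split2` are read as cumulative fractions. That reading is an assumption; the sketch below (with the hypothetical name `random_splits_sketch`) only illustrates the idea and is not the implementation in `Subject_JAX`:

```python
import numpy as np


def random_splits_sketch(subject_ids, split1=0.7, split2=0.85, random_seed=42):
    """Shuffle the subject IDs and cut them at two cumulative fractions."""
    ids = np.array(sorted(subject_ids))
    rng = np.random.default_rng(random_seed)
    rng.shuffle(ids)
    n1 = int(round(split1 * len(ids)))
    n2 = int(round(split2 * len(ids)))
    # train: [0, n1), validation: [n1, n2), test: [n2, end)
    return ids[:n1].tolist(), ids[n1:n2].tolist(), ids[n2:].tolist()
```

Fixing the seed keeps the partition reproducible across notebooks, which is why the configuration here should match the one used for training.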


<a name="sece"></a>

### E Setup Metrics [^](#outline)


In [7]:
from lib.metric import (CodeAUC, UntilFirstCodeAUC, AdmissionAUC, CodeGroupTopAlarmAccuracy, LossMetric, MetricsCollection)
# percentile_range=20 partitions the codes into five groups, where each group contains
# codes that together account for 20% of the code occurrences in all visits of the specified 'subjects' list.
code_freq_partitions = cprd_interface.outcome_by_percentiles(percentile_range=20, subjects=cprd_splits[0])



# Evaluate for different k values
top_k_list = [3, 5, 10, 15, 20]

metrics = {'code_auc': CodeAUC(cprd_interface),
           'code_first_auc': UntilFirstCodeAUC(cprd_interface),
           'admission_auc': AdmissionAUC(cprd_interface),
           'loss': LossMetric(cprd_interface),
           'code_group_acc': CodeGroupTopAlarmAccuracy(cprd_interface, top_k_list=top_k_list, code_groups=code_freq_partitions)}

metric_extractor = {
    'code_auc': metrics['code_auc'].aggregate_extractor({'field': 'auc', 'aggregate': 'mean'}),
    'code_first_auc': metrics['code_first_auc'].aggregate_extractor({'field': 'auc', 'aggregate': 'mean'}),
    'admission_auc': metrics['admission_auc'].aggregate_extractor({'field': 'auc', 'aggregate': 'mean'}),
    'loss': metrics['loss'].value_extractor({'field': 'focal_softmax'}),
}
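The comment on `outcome_by_percentiles` describes grouping codes so that each group covers a fixed share of all code occurrences. A minimal sketch of one way to do this, assuming codes are sorted by ascending frequency and assigned to bands of cumulative share (the function name and the tie-breaking are assumptions; the library may order or cut the groups differently):

```python
import numpy as np


def codes_by_percentiles(code_counts, percentile_range=20):
    """Group codes into bands of `percentile_range` percent of cumulative frequency."""
    codes = sorted(code_counts, key=code_counts.get)  # ascending frequency
    counts = np.array([code_counts[c] for c in codes], dtype=float)
    cum_pct = 100.0 * np.cumsum(counts) / counts.sum()
    n_groups = int(np.ceil(100 / percentile_range))
    # Band index of each code: (0, 20] -> 0, (20, 40] -> 1, ...
    group_ids = np.clip(np.ceil(cum_pct / percentile_range).astype(int) - 1,
                        0, n_groups - 1)
    groups = [[] for _ in range(n_groups)]
    for code, g in zip(codes, group_ids):
        groups[g].append(code)
    return groups
```

With a skewed frequency distribution, the bands of rare codes contain many codes and the bands of frequent codes contain few, so some bands can be empty when a single code crosses several cut points.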

<a name="models"></a>

## 1. Loading Models (Uninitialised) [^](#outline)

In [8]:
from lib.ml import ICENODE, ICENODE_UNIFORM, GRU, RETAIN, WindowLogReg
from lib.vis import models_from_configs, performance_traces, probe_model_snapshots

model_cls = {
    'ICE-NODE': ICENODE,
    'ICE-NODE_UNIFORM': ICENODE_UNIFORM,
    'GRU': GRU,
}  
#'RETAIN': RETAIN,
#'LogReg': WindowLogReg

cprd_models = models_from_configs(training_dir, model_cls, cprd_interface, cprd_splits)




<a name="snapshot"></a>


## 2. Snapshot Selection [^](#outline)

In [9]:
result = probe_model_snapshots(train_dir=training_dir, metric_extractor=metric_extractor, 
                               selection_metric='admission_auc_val', models=cprd_models)
display(result)

# `cprd_models` now holds the selected snapshots

Unnamed: 0,model,code_auc_idx,code_auc_val,code_first_auc_idx,code_first_auc_val,admission_auc_idx,admission_auc_val,loss_idx,loss_val
RETAIN,RETAIN,14,0.552814,5,0.521491,23,0.871129,23,0.000915
GRU,GRU,51,0.584395,67,0.584309,62,0.873847,6,0.000957
ICE-NODE,ICE-NODE,1,0.489967,1,0.520575,1,0.639506,1,0.003561
ICE-NODE_UNIFORM,ICE-NODE_UNIFORM,27,0.488168,27,0.469039,59,0.655015,59,0.001046
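The table reports, for each metric, a `*_idx` column and a `*_val` column, which read as the best snapshot index and the value at that snapshot; `selection_metric='admission_auc_val'` then decides which snapshot is kept. How `probe_model_snapshots` reads the training history is not shown, so the sketch below uses a hypothetical per-snapshot table with made-up numbers purely to illustrate the selection step:

```python
import pandas as pd


def select_snapshot(history, metric, maximize=True):
    """Return (snapshot index, value) of the best row of `history[metric]`.

    `history` is assumed to be indexed by snapshot number, one column per metric.
    """
    series = history[metric]
    best_idx = series.idxmax() if maximize else series.idxmin()
    return int(best_idx), float(series.loc[best_idx])


# Illustrative values only, not results from this notebook.
history = pd.DataFrame(
    {'admission_auc': [0.61, 0.70, 0.68], 'loss': [0.0030, 0.0012, 0.0015]},
    index=[1, 2, 3],
)
best_auc = select_snapshot(history, 'admission_auc')        # higher is better
best_loss = select_snapshot(history, 'loss', maximize=False)  # lower is better
```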


<a name="disease-clusters"></a>

## 3. Disease Embeddings Clustering on CPRD [^](#outline)

In [10]:


# Should be the same scheme used by the JAX interface in the training notebook.
#dx_scheme = DxLTC9809FlatMedcodes()
dx_scheme = DxLTC212FlatCodes()

In [11]:
# scheme indices (textual code -> integer index)
dx_scheme.index

# reverse index (integer index -> textual code)
idx2code = {idx: code for code, idx in dx_scheme.index.items()}

### 3.A GloVe-Based Disease Embeddings

Get the co-occurrence matrix.

In [12]:
cprd_all_subjects = cprd_interface.keys()
# Co-occurrence within a time-window context
cprd_cooc_timewin = cprd_interface.dx_coocurrence(cprd_all_subjects, window_size_days=365)

# Co-occurrence within a sequence context
cprd_cooc_seqwin = cprd_interface.dx_coocurrence(cprd_all_subjects, context_size=20)

from lib.embeddings import train_glove

cprd_glove_timewin = train_glove(cprd_cooc_timewin, embeddings_size=100, iterations=500, prng_seed=0)
cprd_glove_seqwin = train_glove(cprd_cooc_seqwin, embeddings_size=100, iterations=500, prng_seed=0)

df_glove = pd.DataFrame(cprd_glove_timewin).reset_index().rename(columns={'index':'disease_num'})
df_glove

Unnamed: 0,disease_num,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,0,0.001242,-0.000807,-0.004115,0.002093,-0.004294,-0.005540,-0.004192,0.008911,-0.000257,...,0.002184,0.004009,0.003964,0.001470,0.002865,0.000099,0.005078,0.000796,0.001688,-0.000031
1,1,0.000366,0.000912,-0.002171,0.003847,0.000241,0.002782,0.000597,-0.006066,-0.005391,...,-0.002519,0.001920,0.001107,0.006656,0.003977,0.007349,-0.003042,0.001407,-0.001205,0.001454
2,2,0.006486,0.183002,0.003330,-0.231654,-0.269090,-0.285032,-0.307540,-0.151689,-0.146009,...,-0.237354,0.174808,-0.536512,0.545282,0.325026,0.180104,-0.287489,-0.296969,0.290399,0.057354
3,3,-0.002718,-0.145560,0.200045,-0.183807,-0.163657,-0.172980,-0.215122,-0.151889,-0.085377,...,-0.158136,0.155777,-0.288254,0.235776,0.245836,0.136115,-0.207578,-0.205647,0.210379,0.133025
4,4,-0.000156,0.005045,-0.004965,-0.005312,0.004280,-0.001236,0.003074,-0.000903,0.001274,...,-0.001816,0.002555,0.001290,0.003993,0.006069,-0.003770,-0.005270,0.001809,0.003092,0.005363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,207,0.002067,-0.000658,-0.003971,0.000463,0.001474,0.007575,-0.002212,-0.003356,0.005706,...,-0.001927,-0.001482,0.004272,-0.003058,-0.004056,0.003591,0.006935,0.003737,0.007271,-0.001460
208,208,0.004861,0.000750,0.001153,0.005485,-0.002193,0.003985,0.003017,0.005489,0.003782,...,0.007154,-0.006716,0.001663,-0.002820,-0.003687,0.001244,-0.006166,0.006520,0.007019,0.003399
209,209,-0.001237,-0.001956,-0.001938,-0.002694,-0.003505,0.001826,0.001536,-0.001186,0.001391,...,-0.002413,0.009046,0.003344,0.000054,-0.004120,0.001274,-0.000224,0.000407,-0.002698,0.002112
210,210,0.004282,0.003583,0.002399,-0.003177,0.001700,0.002057,0.000873,0.005088,0.006114,...,0.006795,0.000482,0.003718,-0.000696,-0.000829,-0.003460,0.001645,-0.003976,0.001729,0.000613
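The cell above builds two co-occurrence matrices: one where the context is a time window (`window_size_days=365`) and one where it is a window over the sequence of codes (`context_size=20`). A minimal sketch of how such counts could be accumulated, assuming each subject's history is a list of `(day, code_index)` pairs sorted by day (the function name and data layout are assumptions, not the implementation of `dx_coocurrence`):

```python
import numpy as np


def cooccurrence_sketch(histories, n_codes, window_size_days=None, context_size=None):
    """Count ordered pairs of codes that fall into each other's context."""
    if window_size_days is None and context_size is None:
        raise ValueError("give either window_size_days or context_size")
    mat = np.zeros((n_codes, n_codes))
    for events in histories:
        for i, (day_i, code_i) in enumerate(events):
            for j, (day_j, code_j) in enumerate(events):
                if i == j:
                    continue
                if window_size_days is not None:
                    # Time-window context: events at most N days apart.
                    in_context = abs(day_i - day_j) <= window_size_days
                else:
                    # Sequence context: events at most N positions apart.
                    in_context = abs(i - j) <= context_size
                if in_context:
                    mat[code_i, code_j] += 1
    return mat
```

The time-window variant is insensitive to how many codes are recorded in between, while the sequence variant is insensitive to the time gaps, so the two matrices can differ noticeably for sparse histories.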


In [13]:
# Read in the disease/medcode file and attach disease names to the embeddings.
code_root = "///home//tb1009//Documents//DS211//users/tb1009//CODES//"
disease = pd.read_csv(code_root+'212_LTC_ALL.csv')
disease = disease[['disease','disease_num']]
disease = disease.drop_duplicates()
# Shift the 0-based row index to the 1-based `disease_num`.
# NOTE: this assumes that row i of the GloVe matrix corresponds to disease_num i + 1.
df_glove['disease_num'] = df_glove['disease_num']+1
df_glove = df_glove.merge(disease, on='disease_num', how='left')
del df_glove['disease_num']
df_glove = df_glove.set_index('disease')
df_glove

disease,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
Abdominal Aortic Aneurysm,0.001242,-0.000807,-0.004115,0.002093,-0.004294,-0.005540,-0.004192,0.008911,-0.000257,-0.002310,...,0.002184,0.004009,0.003964,0.001470,0.002865,0.000099,0.005078,0.000796,0.001688,-0.000031
Abdominal Hernia,0.000366,0.000912,-0.002171,0.003847,0.000241,0.002782,0.000597,-0.006066,-0.005391,-0.002157,...,-0.002519,0.001920,0.001107,0.006656,0.003977,0.007349,-0.003042,0.001407,-0.001205,0.001454
Acne,0.006486,0.183002,0.003330,-0.231654,-0.269090,-0.285032,-0.307540,-0.151689,-0.146009,-0.308472,...,-0.237354,0.174808,-0.536512,0.545282,0.325026,0.180104,-0.287489,-0.296969,0.290399,0.057354
Alcohol Misuse,-0.002718,-0.145560,0.200045,-0.183807,-0.163657,-0.172980,-0.215122,-0.151889,-0.085377,-0.185361,...,-0.158136,0.155777,-0.288254,0.235776,0.245836,0.136115,-0.207578,-0.205647,0.210379,0.133025
Alcoholic liver disease,-0.000156,0.005045,-0.004965,-0.005312,0.004280,-0.001236,0.003074,-0.000903,0.001274,-0.001732,...,-0.001816,0.002555,0.001290,0.003993,0.006069,-0.003770,-0.005270,0.001809,0.003092,0.005363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venous thromboembolic disease (Excl PE),0.002067,-0.000658,-0.003971,0.000463,0.001474,0.007575,-0.002212,-0.003356,0.005706,-0.000402,...,-0.001927,-0.001482,0.004272,-0.003058,-0.004056,0.003591,0.006935,0.003737,0.007271,-0.001460
Ventricular tachycardia,0.004861,0.000750,0.001153,0.005485,-0.002193,0.003985,0.003017,0.005489,0.003782,0.001129,...,0.007154,-0.006716,0.001663,-0.002820,-0.003687,0.001244,-0.006166,0.006520,0.007019,0.003399
Visual impairment and blindness,-0.001237,-0.001956,-0.001938,-0.002694,-0.003505,0.001826,0.001536,-0.001186,0.001391,0.003171,...,-0.002413,0.009046,0.003344,0.000054,-0.004120,0.001274,-0.000224,0.000407,-0.002698,0.002112
Vitamin B12 deficiency anaemia,0.004282,0.003583,0.002399,-0.003177,0.001700,0.002057,0.000873,0.005088,0.006114,-0.000337,...,0.006795,0.000482,0.003718,-0.000696,-0.000829,-0.003460,0.001645,-0.003976,0.001729,0.000613
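The `+1` shift in cell In [13] relies on row `i` of the GloVe matrix being `disease_num == i + 1`. The predictor-based table in cell In [15] lists the scheme's codes in the order `1, 10, 100, 101, ...`, which suggests that `dx_scheme.index` may be ordered lexicographically. If the GloVe rows follow the scheme index (an assumption that is worth checking), the names would be better attached through `idx2code`. A toy sketch of the difference, with hypothetical codes and names:

```python
import pandas as pd

# Toy stand-ins (hypothetical): a scheme index ordered lexicographically by code.
index = {code: i for i, code in enumerate(sorted(['1', '2', '10']))}
idx2code = {i: code for code, i in index.items()}
names = pd.DataFrame({'disease_num': ['1', '2', '10'],
                      'disease': ['A', 'B', 'J']})
emb = pd.DataFrame([[0.1], [0.2], [0.3]])  # row i <-> scheme index i

# The "+1" convention labels the rows '1', '2', '3' ...
shifted = [str(i + 1) for i in emb.index]
# ... whereas the scheme maps them to '1', '10', '2'.
mapped = [idx2code[i] for i in emb.index]

labelled = emb.assign(disease_num=mapped).merge(names, on='disease_num', how='left')
```

Comparing `shifted` with `mapped` on the real scheme is a quick way to confirm whether the two conventions agree before interpreting the labelled similarities.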


In [24]:
# determine cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine = pd.DataFrame(cosine_similarity(df_glove))
cosine

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,202,203,204,205,206,207,208,209,210,211
0,1.000000,0.028115,-0.057225,0.011354,0.042538,0.035975,-0.004608,0.040601,0.066750,0.074798,...,0.106073,-0.076187,0.054631,-0.027858,0.094131,0.060488,-0.029715,0.111526,-0.069061,-0.096497
1,0.028115,1.000000,-0.078819,0.001376,0.050626,-0.116568,0.031903,-0.060540,-0.006487,-0.197531,...,0.003031,-0.077121,0.073024,0.065553,0.119106,0.160582,0.068591,0.070397,-0.114254,-0.076077
2,-0.057225,-0.078819,1.000000,0.762191,-0.047297,0.064413,0.626958,0.620441,-0.103104,0.054687,...,0.386834,0.105699,-0.377809,0.657383,0.102163,-0.060283,-0.063798,0.001875,-0.038389,0.130777
3,0.011354,0.001376,0.762191,1.000000,0.084905,0.082571,0.676913,0.822966,-0.035494,-0.033328,...,0.623674,0.023393,-0.062557,0.659471,0.511139,0.052620,-0.021326,-0.022180,-0.037553,0.018160
4,0.042538,0.050626,-0.047297,0.084905,1.000000,-0.008709,0.086351,0.105986,0.092880,-0.058936,...,0.122986,0.104195,-0.031028,-0.013691,0.215767,0.188557,-0.017651,0.060367,0.063786,-0.051598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,0.060488,0.160582,-0.060283,0.052620,0.188557,-0.108490,0.031771,0.061729,0.089349,-0.009207,...,0.171953,-0.104530,0.022256,-0.001979,0.093061,1.000000,0.040306,0.089099,0.032521,-0.092307
208,-0.029715,0.068591,-0.063798,-0.021326,-0.017651,-0.020798,0.120379,0.018924,0.076156,-0.016907,...,0.024885,-0.054360,0.058072,-0.037588,0.030481,0.040306,1.000000,-0.079439,0.048213,-0.104387
209,0.111526,0.070397,0.001875,-0.022180,0.060367,-0.042225,-0.001755,0.123407,-0.077764,0.059242,...,0.109090,-0.089859,-0.004408,-0.034577,-0.006463,0.089099,-0.079439,1.000000,-0.047750,-0.047651
210,-0.069061,-0.114254,-0.038389,-0.037553,0.063786,-0.030345,0.148015,-0.112604,-0.187482,0.135790,...,-0.119860,0.100183,0.106066,0.088281,0.067855,0.032521,0.048213,-0.047750,1.000000,0.046593


In [26]:
# Distribution of the pairwise cosine similarities (NumPy's default of 10 equal-width bins)
np.histogram(cosine)

(array([   16,    42,    58,  1762, 21960, 15090,  2134,  2080,  1380,
          422]),
 array([-0.94979489, -0.7548154 , -0.55983591, -0.36485642, -0.16987694,
         0.02510255,  0.22008204,  0.41506153,  0.61004102,  0.80502051,
         1.        ]))
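Because `np.histogram(cosine)` runs over the full 212 × 212 matrix, the counts sum to 212² = 44944: every pair is counted twice, and the 212 self-similarities on the diagonal land in the last bin. To look at distinct pairs only, the strict upper triangle can be used. A minimal sketch on random data (the helper name is hypothetical, and the cosine is computed with NumPy to keep the block self-contained):

```python
import numpy as np


def pairwise_cosine_upper(embeddings):
    """Cosine similarity of every distinct pair of rows (strict upper triangle)."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    rows, cols = np.triu_indices(len(x), k=1)  # k=1 excludes the diagonal
    return sim[rows, cols]


rng = np.random.default_rng(0)
values = pairwise_cosine_upper(rng.normal(size=(212, 100)))
counts, edges = np.histogram(values, bins=10)
```

For 212 codes this leaves 212 · 211 / 2 = 22366 values, and the spike at 1.0 disappears from the histogram.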

### 3.B Predictor-Based Disease Embeddings



In [14]:
def disease_embeddings_dictionary(model_name):
    """Return {code: embedding vector} for every code in `dx_scheme`."""
    model = cprd_models[model_name]

    # Code history of all subjects, needed to build the embeddings matrix.
    dx_for_emb = cprd_interface.dx_batch_history_vec(cprd_all_subjects)
    # Embeddings matrix
    dx_G = model.dx_emb.compute_embeddings_mat(dx_for_emb)

    embeddings_dict = {}
    for code, idx in dx_scheme.index.items():
        # Encode a one-hot vector that activates this code only.
        in_vec = np.zeros((cprd_interface.dx_dim, ))
        in_vec[idx] = 1.
        embeddings_dict[code] = model.dx_emb.encode(dx_G, in_vec)
    return embeddings_dict

#icenode_emb = disease_embeddings_dictionary('ICE-NODE')
icenode_uni_emb = disease_embeddings_dictionary('ICE-NODE_UNIFORM')
#retain_emb = disease_embeddings_dictionary('RETAIN')
#gru_emb = disease_embeddings_dictionary('GRU')

In [15]:
df_icenode = pd.DataFrame(icenode_uni_emb).transpose().reset_index().rename(columns={'index':'disease_num'})
df_icenode

Unnamed: 0,disease_num,0,1,2,3,4,5,6,7,8,...,190,191,192,193,194,195,196,197,198,199
0,1,0.034354,0.006377,0.020291,0.095340,0.039363,0.013324,0.015236,-0.019774,-0.009697,...,0.023654,0.107989,0.029968,-0.034746,-0.025487,0.008924,-0.084223,0.070025,0.011552,0.041005
1,10,0.045588,-0.008883,0.023119,0.063800,-0.032382,-0.034550,-0.063825,-0.046611,0.093856,...,-0.009938,0.055628,0.005696,-0.000249,-0.017055,0.027488,-0.047985,0.031725,-0.036866,-0.021010
2,100,0.012472,-0.022207,-0.037610,-0.004317,0.086994,-0.087752,0.010708,-0.044408,0.098859,...,0.053491,0.109065,0.090753,-0.049716,-0.052463,0.026228,0.000080,0.088452,-0.113319,0.001631
3,101,0.007556,0.004765,0.019028,0.072645,0.011264,-0.026325,0.044112,0.050443,-0.013260,...,-0.031748,0.051171,0.006230,-0.006869,0.004506,0.055581,-0.083482,-0.011786,-0.074751,0.076787
4,102,0.038017,0.027850,0.092761,0.034729,-0.023006,-0.032593,-0.007070,-0.010825,0.067998,...,0.072086,0.113120,0.092962,-0.042078,-0.056799,0.017845,0.017546,0.074893,-0.070090,0.041081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,95,0.030927,-0.000351,-0.035067,0.009752,0.019077,-0.056351,-0.070181,0.037748,0.060266,...,0.024758,0.053686,0.029046,0.008311,-0.085794,-0.002919,-0.084193,0.078561,-0.072987,0.052101
208,96,0.077079,0.108659,-0.016877,0.001349,0.019904,-0.010782,0.017054,-0.040028,-0.000538,...,0.071200,0.110472,0.075783,-0.071217,-0.046773,0.007033,0.000078,0.088295,0.001796,-0.015136
209,97,0.076456,0.011903,-0.036558,0.066446,0.009414,0.024636,-0.057594,0.021504,0.115226,...,0.020698,0.068961,0.074940,0.022505,-0.016051,0.026839,0.005004,0.025117,-0.007339,-0.045194
210,98,0.072218,0.089632,-0.005358,0.057925,0.005210,-0.085189,-0.016467,0.038324,0.056914,...,0.084068,0.046616,0.109373,-0.096822,-0.030381,0.008976,-0.068383,0.045340,-0.121349,-0.004477


In [16]:
# Read in the disease/medcode file; `dtype='str'` keeps `disease_num` textual,
# so it can be merged with the string codes used as keys of the embeddings dictionary.
code_root = "///home//tb1009//Documents//DS211//users/tb1009//CODES//"
disease = pd.read_csv(code_root+'212_LTC_ALL.csv', dtype='str')
disease = disease[['disease','disease_num']]
disease = disease.drop_duplicates()
df_icenode = df_icenode.merge(disease, on='disease_num', how='left')
del df_icenode['disease_num']
df_icenode = df_icenode.set_index('disease')
df_icenode

disease,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
Abdominal Aortic Aneurysm,0.034354,0.006377,0.020291,0.095340,0.039363,0.013324,0.015236,-0.019774,-0.009697,0.058446,...,0.023654,0.107989,0.029968,-0.034746,-0.025487,0.008924,-0.084223,0.070025,0.011552,0.041005
Ankylosing spondylitis,0.045588,-0.008883,0.023119,0.063800,-0.032382,-0.034550,-0.063825,-0.046611,0.093856,-0.024544,...,-0.009938,0.055628,0.005696,-0.000249,-0.017055,0.027488,-0.047985,0.031725,-0.036866,-0.021010
Myocardial Infarction,0.012472,-0.022207,-0.037610,-0.004317,0.086994,-0.087752,0.010708,-0.044408,0.098859,-0.063765,...,0.053491,0.109065,0.090753,-0.049716,-0.052463,0.026228,0.000080,0.088452,-0.113319,0.001631
Neuropathic Bladder,0.007556,0.004765,0.019028,0.072645,0.011264,-0.026325,0.044112,0.050443,-0.013260,-0.010744,...,-0.031748,0.051171,0.006230,-0.006869,0.004506,0.055581,-0.083482,-0.011786,-0.074751,0.076787
Non-Hodgkin Lymphoma,0.038017,0.027850,0.092761,0.034729,-0.023006,-0.032593,-0.007070,-0.010825,0.067998,-0.028483,...,0.072086,0.113120,0.092962,-0.042078,-0.056799,0.017845,0.017546,0.074893,-0.070090,0.041081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Motor neurone disease,0.030927,-0.000351,-0.035067,0.009752,0.019077,-0.056351,-0.070181,0.037748,0.060266,-0.025363,...,0.024758,0.053686,0.029046,0.008311,-0.085794,-0.002919,-0.084193,0.078561,-0.072987,0.052101
Multiple sclerosis,0.077079,0.108659,-0.016877,0.001349,0.019904,-0.010782,0.017054,-0.040028,-0.000538,-0.034795,...,0.071200,0.110472,0.075783,-0.071217,-0.046773,0.007033,0.000078,0.088295,0.001796,-0.015136
Multiple valve disorder,0.076456,0.011903,-0.036558,0.066446,0.009414,0.024636,-0.057594,0.021504,0.115226,-0.014006,...,0.020698,0.068961,0.074940,0.022505,-0.016051,0.026839,0.005004,0.025117,-0.007339,-0.045194
Myasthenia gravis,0.072218,0.089632,-0.005358,0.057925,0.005210,-0.085189,-0.016467,0.038324,0.056914,-0.034435,...,0.084068,0.046616,0.109373,-0.096822,-0.030381,0.008976,-0.068383,0.045340,-0.121349,-0.004477


In [31]:
np.histogram(df_icenode)

(array([ 893, 2437, 3944, 5710, 7369, 7595, 6020, 4539, 2804, 1089]),
 array([-0.13681182, -0.10950255, -0.08219327, -0.05488399, -0.02757471,
        -0.00026543,  0.02704384,  0.05435312,  0.0816624 ,  0.10897168,
         0.13628095], dtype=float32))

In [18]:
# determine cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
cosine = pd.DataFrame(cosine_similarity(df_icenode))
cosine

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,202,203,204,205,206,207,208,209,210,211
0,1.000000,0.540794,0.525982,0.496396,0.480316,0.589160,0.523688,0.543232,0.490020,0.483961,...,0.519885,0.525111,0.493414,0.433989,0.532337,0.553694,0.530708,0.511494,0.546437,0.542154
1,0.540794,1.000000,0.506664,0.483411,0.511454,0.556887,0.562089,0.438324,0.492719,0.537083,...,0.528795,0.525292,0.485547,0.463777,0.518194,0.591763,0.462852,0.553915,0.479925,0.550744
2,0.525982,0.506664,1.000000,0.472108,0.513673,0.589576,0.498230,0.502664,0.397320,0.515280,...,0.482030,0.573971,0.422347,0.527007,0.579013,0.582660,0.528903,0.514321,0.521522,0.570155
3,0.496396,0.483411,0.472108,1.000000,0.543452,0.484830,0.492034,0.511931,0.477389,0.515689,...,0.481874,0.514130,0.406230,0.465198,0.511790,0.542642,0.512290,0.535904,0.488877,0.533151
4,0.480316,0.511454,0.513673,0.543452,1.000000,0.540103,0.467915,0.457112,0.455300,0.540509,...,0.434567,0.545156,0.505217,0.454443,0.484131,0.534188,0.498232,0.488250,0.512932,0.467654
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207,0.553694,0.591763,0.582660,0.542642,0.534188,0.577664,0.589052,0.520269,0.482539,0.607509,...,0.553967,0.611005,0.503073,0.561594,0.626168,1.000000,0.575216,0.550135,0.512592,0.554270
208,0.530708,0.462852,0.528903,0.512290,0.498232,0.527440,0.500111,0.472544,0.491700,0.480596,...,0.496479,0.494781,0.448402,0.485051,0.552111,0.575216,1.000000,0.528941,0.470046,0.454606
209,0.511494,0.553915,0.514321,0.535904,0.488250,0.503216,0.458338,0.524169,0.458468,0.501104,...,0.526470,0.529191,0.472809,0.501644,0.529292,0.550135,0.528941,1.000000,0.549030,0.526599
210,0.546437,0.479925,0.521522,0.488877,0.512932,0.508700,0.502219,0.501407,0.484258,0.490368,...,0.559466,0.479839,0.497034,0.463614,0.549754,0.512592,0.470046,0.549030,1.000000,0.570474


In [19]:
np.histogram(cosine)

(array([  230,  4932, 23390, 15158,  1018,     4,     0,     0,     0,
          212]),
 array([0.3274313 , 0.39468822, 0.46194515, 0.5292021 , 0.59645903,
        0.66371596, 0.7309729 , 0.7982298 , 0.86548674, 0.93274367,
        1.0000006 ], dtype=float32))

In [38]:
icenode_m = cprd_models['ICE-NODE']
cosine = cosine_similarity(icenode_m.dx_emb.linear.weight.T)
np.histogram(cosine)

(array([ 1016, 17698, 23686,  2318,    14,     0,     0,     0,     0,
          212]),
 array([-0.2682722 , -0.14144492, -0.01461766,  0.11220961,  0.23903687,
         0.36586416,  0.4926914 ,  0.6195187 ,  0.74634594,  0.87317324,
         1.0000005 ], dtype=float32))

In [39]:
icenode_m = cprd_models['ICE-NODE']
cosine = cosine_similarity(icenode_m.dx_emb.linear.weight.T+icenode_m.dx_emb.linear.bias)
np.histogram(cosine)

(array([  230,  4932, 23390, 15158,  1018,     4,     0,     0,     0,
          212]),
 array([0.32743126, 0.39468816, 0.4619451 , 0.529202  , 0.5964589 ,
        0.66371584, 0.7309727 , 0.79822963, 0.86548656, 0.93274343,
        1.0000004 ], dtype=float32))
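The histogram obtained from `weight.T + bias` has the same counts, and up to rounding the same bin edges, as the one computed from the one-hot encodings in In [19]. That is what one would expect if `dx_emb.encode` is an affine map: encoding the one-hot vector for code `i` returns column `i` of the weight matrix plus the bias. Whether `encode` is purely affine in `lib` is an assumption; the sketch below checks the identity on a random, hypothetical layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes, emb_dim = 6, 4
W = rng.normal(size=(emb_dim, n_codes))  # hypothetical linear weight
b = rng.normal(size=emb_dim)             # hypothetical bias


def encode(x):
    """Affine embedding of an input vector over the codes."""
    return W @ x + b


# Encode each one-hot vector separately ...
one_hot_embeddings = np.stack([encode(np.eye(n_codes)[i]) for i in range(n_codes)])
# ... and compare with the direct construction: row i is column i of W plus the bias.
direct = W.T + b
assert np.allclose(one_hot_embeddings, direct)
```

This also explains why the similarities from `weight.T` alone (In [38]) are centred near zero while those including the bias are shifted upwards: a shared bias vector adds a common component to every embedding.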

In [69]:
cosine.iloc[10:25,10:25]

disease,Oesophageal varices,Osteoarthritis (excl spine),Anterior and Intermediate Uveitis,Osteoporosis,Other haemolytic anaemias,Pancreatitis,Parkinson's disease,Pericardial Effusion,Peripheral Vascular Disease,Peripheral Neuropathy,Personality disorders,Plasma Cell Malignancy,Pleural effusion,Anxiety disorders,Pleural plaque
Oesophageal varices,1.0,-0.218016,-0.006784,-0.122762,-0.122489,0.043772,-0.048974,-0.012213,-0.126672,0.035129,-0.03765,0.05752,0.036886,-0.148663,-0.036938
Osteoarthritis (excl spine),-0.218016,1.0,-0.0179,0.384354,0.111929,0.00424,0.144729,0.054201,0.03661,0.258057,0.313421,0.002432,0.134533,0.347338,0.149827
Anterior and Intermediate Uveitis,-0.006784,-0.0179,1.0,0.089688,-0.05055,0.15157,-0.119308,0.046075,0.008253,0.005142,0.132492,-0.121242,0.058971,0.081326,0.04071
Osteoporosis,-0.122762,0.384354,0.089688,1.0,0.038504,0.010492,0.096438,0.14703,0.008058,0.294078,0.329224,-0.161708,0.081187,0.449707,0.020847
Other haemolytic anaemias,-0.122489,0.111929,-0.05055,0.038504,1.0,-0.003024,0.029227,0.026076,0.015906,-0.020655,-0.023166,0.016605,0.250249,0.077008,0.076591
Pancreatitis,0.043772,0.00424,0.15157,0.010492,-0.003024,1.0,0.161491,-0.041991,0.022429,0.045276,0.075352,0.042117,0.051955,0.149495,-0.053777
Parkinson's disease,-0.048974,0.144729,-0.119308,0.096438,0.029227,0.161491,1.0,0.058128,-0.010466,0.222913,-0.00789,-0.036663,-0.041321,0.07288,0.095121
Pericardial Effusion,-0.012213,0.054201,0.046075,0.14703,0.026076,-0.041991,0.058128,1.0,-0.045256,-0.014608,0.134099,-0.042889,0.072465,-0.008318,-0.141926
Peripheral Vascular Disease,-0.126672,0.03661,0.008253,0.008058,0.015906,0.022429,-0.010466,-0.045256,1.0,0.046707,-0.136183,-0.16394,-0.05989,-0.040896,0.016147
Peripheral Neuropathy,0.035129,0.258057,0.005142,0.294078,-0.020655,0.045276,0.222913,-0.014608,0.046707,1.0,0.338374,-0.132435,-0.174813,0.463161,0.118288


In [77]:
cosine.loc[cosine.index=='Raised Total Cholesterol', 'Raised LDL-C']

disease
Raised Total Cholesterol    0.862697
Name: Raised LDL-C, dtype: float64
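Name-based lookups such as `cosine.loc[..., 'Raised LDL-C']` need a similarity matrix that carries the disease names on both axes, whereas `pd.DataFrame(cosine_similarity(df))` on its own produces integer labels. A minimal sketch of building a labelled matrix, using toy data and a NumPy cosine to stay self-contained (the helper name `labelled_cosine` is hypothetical):

```python
import numpy as np
import pandas as pd


def labelled_cosine(df):
    """Cosine similarity between the rows of `df`, labelled with df's index."""
    x = df.to_numpy(dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return pd.DataFrame(x @ x.T, index=df.index, columns=df.index)


toy = pd.DataFrame([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]],
                   index=pd.Index(['A', 'B', 'C'], name='disease'))
sim = labelled_cosine(toy)
pair = sim.loc['A', 'B']  # scalar lookup by name
```

With scikit-learn, the equivalent would be to pass `index=df.index, columns=df.index` when wrapping the result of `cosine_similarity(df)` in a DataFrame.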

<a name="subject-clusters"></a>

## 4. Subject Embeddings Clustering on CPRD [^](#outline)

In [14]:
def subject_embeddings_dictionary(model_name):
    model = cprd_models[model_name]
    # Pass all subjects in the study.
    return model.subject_embeddings(cprd_interface, cprd_all_subjects)

icenode_subj_emb = subject_embeddings_dictionary('ICE-NODE')
#icenode_subj_uni_emb = subject_embeddings_dictionary('ICE-NODE_UNIFORM')
#retain_subj_emb = subject_embeddings_dictionary('RETAIN')
#gru_subj_emb = subject_embeddings_dictionary('GRU')

In [4]:
#pd.DataFrame(icenode_subj_emb)

In [5]:
y = set(map(int, cprd_all_subjects))

NameError: name 'cprd_all_subjects' is not defined