## Create RxNorm -> CUI lookup with UMLS API

First, acquire a [UMLS license](https://www.nlm.nih.gov/databases/umls.html#license_request) if you don't already have one.

Next, replace `'MY-SECRET-KEY'` in the cell below with your UMLS API key.

The cells below then build a lookup from each RxNorm code in the annotations to its UMLS concept identifiers (CUIs), so that embeddings for those codes can be initialized from existing CUI embeddings.

In [1]:
import time

from umls_api_auth import Authentication
import requests as rq
from collections import defaultdict
from tqdm import tqdm



In [2]:
with open('vocab.txt') as f:
    rxns = {line.strip()[3:] for line in f if line.startswith('RX_')}

In [3]:
auth = Authentication('MY-SECRET-KEY')
tgt = auth.gettgt()

URI = "https://uts-ws.nlm.nih.gov/rest"

In [4]:
rxn2cuis = defaultdict(set)
for med in tqdm(rxns):
    route = f'/content/current/source/RXNORM/{med}/atoms'
    query = {'ticket': auth.getst(tgt)}
    res = rq.get(URI+route, params=query)
    if res.status_code == 200:
        cuis = [result['concept'].split('/')[-1] for result in res.json()['result']]
        rxn2cuis[med].update(cuis)
    # rate limit to 20 requests/sec
    time.sleep(0.05)

100%|██████████| 587/587 [09:24<00:00,  1.12it/s]
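
The loop above takes the last path segment of each atom's `concept` URL as the CUI. A minimal sketch of that parsing step, run on a hand-written payload rather than real API output: the helper name `extract_cuis`, the example URLs, and the CUI values are all made up for illustration, and the payload shape is only inferred from the code in the cell above.

```python
def extract_cuis(payload):
    """Collect the CUIs referenced by the atoms in a response-like dict.

    Assumes each entry under 'result' may carry a 'concept' URL whose
    final path segment is the CUI, as the notebook cell above does.
    """
    cuis = set()
    for atom in payload.get('result', []):
        concept_url = atom.get('concept')
        if concept_url:
            cuis.add(concept_url.rstrip('/').split('/')[-1])
    return cuis


# Illustrative payload only: placeholder host and placeholder CUIs.
sample_payload = {
    'result': [
        {'concept': 'https://example.org/rest/content/CUI/C0000001'},
        {'concept': 'https://example.org/rest/content/CUI/C0000001'},
        {'concept': 'https://example.org/rest/content/CUI/C0000002'},
    ]
}

print(sorted(extract_cuis(sample_payload)))  # ['C0000001', 'C0000002']
```

Collecting into a set mirrors `rxn2cuis[med].update(cuis)`: several atoms of one RxNorm code can point at the same concept, and duplicates should collapse.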


In [5]:
with open('data/rxn2cuis.txt', 'w') as of:
    for rxn, cuis in rxn2cuis.items():
        for cui in cuis:
            of.write(','.join([rxn, cui]) + '\n')
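
As a rough illustration of how the saved lookup could be used downstream, the sketch below reads `rxn,cui` lines and initializes each RxNorm code as the mean of its available CUI vectors. Averaging is an assumption made for this example; the method that `init_embed.py` actually uses is not shown in this notebook. The helper names, codes, and toy vectors are hypothetical.

```python
from collections import defaultdict


def read_rxn2cuis(lines):
    """Parse 'rxn,cui' lines (the format written above) into a dict of sets."""
    rxn2cuis = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rxn, cui = line.split(',')
        rxn2cuis[rxn].add(cui)
    return rxn2cuis


def mean_cui_vector(cuis, cui_vectors):
    """Average the vectors of the CUIs that have an embedding.

    Returns None when no CUI is covered, so the caller can fall back
    to some other initialization (for example a random one).
    """
    found = [cui_vectors[c] for c in sorted(cuis) if c in cui_vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy data: placeholder codes and 2-dimensional vectors.
lines = ['111,C0000001\n', '111,C0000002\n', '222,C0000003\n']
cui_vectors = {'C0000001': [1.0, 0.0], 'C0000002': [0.0, 1.0]}

rxn2cuis = read_rxn2cuis(lines)
init = {rxn: mean_cui_vector(cuis, cui_vectors) for rxn, cuis in rxn2cuis.items()}
print(init)  # {'111': [0.5, 0.5], '222': None}
```

Codes that come back as `None` here are the ones with no covered CUI, which is the case a fallback initialization would need to handle.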

## Initialize problem and target embeddings
Section 3.2 "Initialization and pre-processing"

In [13]:
%run init_embed.py embeddings/claims_codes_hs_300.txt w2v 300

loading dx code embeddings
loading cui embeddings
loading CPT, LOINC code embeddings
loading problem definitions
building vocab
loading random init embeddings for missing codes
frac of rxn codes with embeddings: 292/587
frac of lab codes with embeddings: 287/451
frac of CPT codes with embeddings: 333/451


## Table 2: held-out triplets, Choi et al embeddings
Section 4.1 "Held-out triplets"

In [4]:
%run train.py embeddings/clinicalml.txt\
              vocab.txt \
              data/train_rand.csv \
              --patience 10\
              --max_epochs 100\
              --criterion mr\
              --px_codes data/intersect_pxs.txt\
              --rxn_codes data/intersect_rxns.txt\
              --loinc_codes data/intersect_loincs.txt\
              --use_negs\
              --lr 1e-4\
              --split_type triplets\
              --run_test

starting!
COMMAND: train.py embeddings/clinicalml.txt vocab.txt data/train_rand.csv --patience 10 --max_epochs 100 --criterion mr --px_codes data/intersect_pxs.txt --rxn_codes data/intersect_rxns.txt --loinc_codes data/intersect_loincs.txt --use_negs --lr 1e-4 --split_type triplets --run_test


. . . . . . . . . . . . . !!! early stopping hit !!!

Reloading and evaluating model with best mr (epoch 2)

RUNNING TEST
METRICS
MR,MRR,RX_MRR,RX_H@1,PX_MRR,PX_H@1,LAB_MRR,LAB_H@1
Choi et al,1.56,0.827,0.816,0.690,0.892,0.812,0.736,0.571

THIS RUN'S RESULT DIR IS: results/distmult_clinicalml_Jul_31_16:24:33





Compare the above to line 4 "Choi et al (2016)" of Table 2 in the paper. 
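
For readers unfamiliar with the metric columns, the sketch below computes MR, MRR, and H@k from a list of ranks, assuming the standard definitions (mean rank, mean reciprocal rank, and the fraction of correct items ranked in the top k). The exact computation inside `train.py` is not shown here, and the function name and the ranks are made up for illustration.

```python
def ranking_metrics(ranks, ks=(1, 5)):
    """Summarize 1-based ranks of the correct items.

    MR   : mean rank (lower is better)
    MRR  : mean reciprocal rank (higher is better)
    H@k  : fraction of items ranked at position k or better
    """
    n = len(ranks)
    metrics = {
        'MR': sum(ranks) / n,
        'MRR': sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f'H@{k}'] = sum(r <= k for r in ranks) / n
    return metrics


# Four hypothetical test triplets whose correct targets were ranked 1, 2, 4, 1.
print(ranking_metrics([1, 2, 4, 1]))
# {'MR': 2.0, 'MRR': 0.6875, 'H@1': 0.5, 'H@5': 1.0}
```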

## Table 3: held-out problems, Choi et al embeddings
Section 4.2 "Held-out problems"

In [5]:
%run train.py embeddings/clinicalml.txt\
              vocab.txt \
              data/train_probs.csv \
              --patience 10\
              --max_epochs 100\
              --criterion mr\
              --px_codes data/intersect_pxs.txt\
              --rxn_codes data/intersect_rxns.txt\
              --loinc_codes data/intersect_loincs.txt\
              --use_negs\
              --lr 1e-4\
              --run_test

starting!
COMMAND: train.py embeddings/clinicalml.txt vocab.txt data/train_probs.csv --patience 10 --max_epochs 100 --criterion mr --px_codes data/intersect_pxs.txt --rxn_codes data/intersect_rxns.txt --loinc_codes data/intersect_loincs.txt --use_negs --lr 1e-4 --run_test
. . . . . . . . . . . . . . . !!! early stopping hit !!!

Reloading and evaluating model with best mr (epoch 4)

RUNNING TEST
METRICS
MR,MRR,RX_MRR,RX_H@5,PX_MRR,PX_H@5,LAB_MRR,LAB_H@5
Choi et al,7.08,0.392,0.375,0.493,0.451,0.800,0.377,0.765

MATRIX
lab, sleep_apnea: 1.000
lab, hypokalemia: 0.100
lab, thrombocytopenia: 0.667
lab, hypertension: 1.000
lab, uti: 0.970
medication, sleep_apnea: 0.000
medication, hypokalemia: 0.000
medication, thrombocytopenia: 0.333
medication, hypertension: 0.528
medication, uti: 0.789
procedure, sleep_apnea: 1.000
procedure, hypokalemia: 1.000
procedure, thrombocytopenia: 0.750
procedure, hypertension: 0.833
procedure, uti: 0.667

THIS RUN'S RESULT DIR IS: results/distmult_clinicalml_Ju

Compare the above results (under "METRICS") to line 4 "Choi et al (2016)" of Table 3 in the paper. 

## Table 5: Examples

The output will not match Table 5 in the paper exactly, because the paper's table is derived from a model that also uses the site-specific data features.

In [6]:
# Paste the result directory (printed after "THIS RUN'S RESULT DIR IS: " above) into the path below.
with open('results/distmult_clinicalml_Jul_31_16:25:33/html_examples.txt') as f:
    s = f.read().split('\n')
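
As an alternative to pasting the path by hand, the sketch below picks the most recently modified result directory. It assumes result directories live under `results/` and start with `distmult_clinicalml_`, as in the paths printed above; `latest_result_dir` is a hypothetical helper, not part of the repository, and it returns the latest run of any kind, which may not be the run you want.

```python
import glob
import os


def latest_result_dir(root='results', prefix='distmult_clinicalml_'):
    """Return the most recently modified matching directory, or None."""
    pattern = os.path.join(root, prefix + '*')
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


result_dir = latest_result_dir()
if result_dir is not None:
    examples_path = os.path.join(result_dir, 'html_examples.txt')
    # Only read the file if this run actually wrote one.
    if os.path.exists(examples_path):
        with open(examples_path) as f:
            s = f.read().split('\n')
```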

In [7]:
from IPython.display import HTML
s = '\n'.join(s)
h = HTML(s); h

Hypertension

| Medication | Procedure | Lab |
|---|---|---|
| 6918 | 93000 | 2160-0 |
| 11170 | 93224 | 30934-4 |
| 1202 | 93288 | 2345-7 |
| 3443 | 93306 | 5902-2 |
| 17767 | 93351 | 2951-2 |
| 69749 | 93010 | 2823-3 |
| 25789 | 93005 | 3094-0 |
| 6057 | 93225 | 6301-6 |
| 20352 | 93798 | 2028-9 |

Uti

| Medication | Procedure | Lab |
|---|---|---|
| 7454 | 76770 | 6463-4 |
| 2551 | 51702 | 43409-2 |
| 8120 | 51700 | 18970-4 |
| 10180 | 74150 | 18868-0 |
| 10829 | 51798 | 18879-7 |
| 82122 | 74400 | 18895-3 |
| 733 | 52000 | 2106-3 |
| 4450 | 76856 | 18865-6 |
| 2194 | 52005 | 18864-9 |

Thrombocytopenia

| Medication | Procedure | Lab |
|---|---|---|
| 105694 | 76700 | 17849-1 |
| 24947 | J1441 | 15061-5 |
| 8704 | J1440 | 43399-5 |
| 7806 | 38221 | 2160-0 |
| 4511 | 38220 | 5902-2 |
| 6918 | 76705 | 718-7 |
| 8640 | 74150 | 14979-9 |
| 338036 | G0306 | 6301-6 |
| 10180 | 70450 | 4542-7 |

Sleep apnea

| Medication | Procedure | Lab |
|---|---|---|
| 67108 | 94660 | 6301-6 |
| 88249 | 95811 | 10366-3 |
| 60548 | 95810 | 3854-7 |
| 7213 | 95807 | 33762-6 |
| 10355 | 94762 | 30934-4 |
| 7804 | 94760 | 2160-0 |
| 4603 | 93010 | 718-7 |
| 86009 | 93000 | 2276-4 |
| 41126 | 94010 | 10535-3 |

Hypokalemia

| Medication | Procedure | Lab |
|---|---|---|
| 67108 | 93010 | 13969-1 |
| 6185 | 93306 | 30934-4 |
| 4850 | 93005 | 33762-6 |
| 3423 | 71250 | 2160-0 |
| 6918 | 93225 | 5902-2 |
| 7806 | 93000 | 718-7 |
| 3443 | 93351 | 2777-1 |
| 11289 | 93732 | 6301-6 |
| 5224 | 93280 | 1763-2 |


## Table 3 (cont): Ontology baselines

In [8]:
%run compute_ndfrt_baseline.py

loading problem definitions
num all meds: 587
num rxnorm meds: 413
processing NEGATIVES


processing POSITIVES
MR: 30.069767441860463, MRR: 0.04327666122621905, H@1: 0.0, H@5: 0.023255813953488372, H@10: 0.023255813953488372, H@30: 0.7093023255813954


Compare the above results (MRR and H@5) to line 1 "Ontology baselines" of Table 3 in the paper: columns "Medications MRR" and "Medications H@5"

Before running the cell below, replace `'MY-SECRET-KEY'` in `compute_cpt_baseline.py` with your UMLS API key.

In [10]:
%run compute_cpt_baseline.py

loading problem definitions
num all procs: 541
num cpt procs: 425
MR: 24.42063492063492, MRR: 0.14042622893554568, H@1: 0.031746031746031744, H@5: 0.20634920634920634, H@10: 0.30158730158730157, H@30: 0.7619047619047619


Compare the above results (MRR and H@5) to line 1 "Ontology baselines" of Table 3 in the paper: columns "Procedures MRR" and "Procedures H@5"