This repository contains code accompanying publication of the paper:
Y. Choi, Y. Chiu, D. Sontag. Learning Low-Dimensional Representations of Medical Concepts. To appear in Proceedings of the AMIA Summit on Clinical Research Informatics (CRI), 2016.
In the base directory there are three files, containing the two best 300-dimensional embeddings learned in the paper and the embeddings from previous work that we compared against:
claims_codes_hs_300.txt.gz: Embeddings of ICD-9 diagnosis and procedure codes, NDC medication codes, and LOINC laboratory codes, derived from a large claims dataset from 2005 to 2013 for roughly 4 million people.
stanford_cuis_svd_300.txt.gz: Embeddings of UMLS concept unique identifiers (CUIs), derived from 20 million clinical notes spanning 19 years of data from Stanford Hospital and Clinics, using a data set released in a paper by Finlayson, LePendu & Shah.
DeVine_etal_200.txt.gz: Embeddings of UMLS CUIs learned by De Vine et al. CIKM '14, derived from 348,566 medical journal abstracts (courtesy of the authors).
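As a rough illustration of how one of these files might be read, here is a minimal sketch. The on-disk layout is an assumption, not something stated above: it supposes each line holds a concept identifier followed by whitespace-separated numbers, possibly preceded by a word2vec-style header line of two integers (count and dimension). The function name load_embeddings is hypothetical and not part of this repository. Inspect the first few lines of a file before relying on it.

```python
import gzip


def load_embeddings(path):
    """Read a gzipped text file into a dict mapping concept -> list of floats.

    Assumed layout (not confirmed by this README): one concept per line,
    written as "<identifier> <v1> <v2> ... <vN>".
    """
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumption: an optional first line "<count> <dimension>".
            is_header = (
                line_number == 0
                and len(fields) == 2
                and all(field.isdigit() for field in fields)
            )
            if is_header:
                continue
            vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Example call (the path is relative to the base directory):
# embeddings = load_embeddings("claims_codes_hs_300.txt.gz")
```

Reading through gzip.open avoids having to decompress the file first, which may be convenient when you only want to load the vectors into memory.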
In the eval directory there are three files of interest:
eval/Embedding_Evaluation.ipynb, an iPython notebook which reproduces the main results of the paper. If you come up with your own embeddings, you can use this benchmark to quantitatively compare them to our embeddings.
eval/visualize_claims_embeddings.py, a Python program you can run to look at nearest neighbors for the claims_codes_hs_300.txt embeddings (after decompressing the file using gunzip).
eval/visualize_stanford_embeddings.py, same as above but for the stanford_cuis_svd_300.txt embeddings.
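To make "nearest neighbors" concrete, the sketch below ranks concepts by cosine similarity to a query concept. This is an illustration only: the README does not say which similarity measure the visualization scripts use, so cosine similarity is an assumption, and the names cosine_similarity, nearest_neighbors, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, query, k=5):
    """Return the k concepts most similar to `query` as (name, similarity) pairs."""
    query_vector = vectors[query]
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in vectors.items()
        if name != query  # do not report the query as its own neighbor
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]


# Toy two-dimensional vectors standing in for real 300-dimensional embeddings.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
neighbors = nearest_neighbors(toy, "A", k=2)
print(neighbors)  # "B" is ranked first, then "C"
```

This brute-force scan compares the query against every other concept, which is simple but linear in the vocabulary size per query.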
Note that you may need to decompress, using gunzip, files in the eval directory prior to being able to run some of the programs. Additionally, to run the iPython notebook, you need to place the file MRCONSO.RRF from the UMLS Metathesaurus into the eval directory (we do not distribute this).
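If gunzip is not available on your system, the same decompression step can be sketched with Python's standard library. The helper name decompress is hypothetical and not part of this repository. Unlike gunzip's default behavior, this version leaves the original .gz file in place.

```python
import gzip
import shutil


def decompress(gz_path):
    """Write a decompressed copy next to the archive and return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz: " + gz_path)
    out_path = gz_path[: -len(".gz")]
    # Stream the contents so large files are never held fully in memory.
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return out_path


# Example call (replace the placeholder with the file you need to decompress):
# decompress("path/to/some_file.txt.gz")
```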