# Hyperbolic Embedding Methods for Medical Ontology Networks
<br>


**Drew Wilimitis** <br> 
Department of Biomedical Informatics, Vanderbilt University Medical Center

![title](images/three_models.png)

## Writing/Research Project Outline <br>

____

### Abstract:
<br>
 

### Intro: Broader Introduction and Contextual Framing of My Investigation
<br>

As discussed in a recent survey of representation learning for Electronic Health Records **[1]**, clinical data currently come from a vast, disparate set of sources, through which we attempt to understand a multi-layered and highly complex social network of patient-clinician interactions. In addition, the growing prevalence of high-throughput genomics data, together with collaboration across clinical research, is promising for a more complete scientific understanding of biomedicine. However, accounting for the dynamical processes involved in this system significantly complicates the challenge of learning meaningful, informative data representations.
<br>
<br>

This representation learning problem challenges progress in biomedical informatics and clinical science in two ways: it can lead to less accurate predictive models, and it can erode human interpretability and explainability of these algorithms. Many state-of-the-art (SOTA) representation learning methods are highly sophisticated deep-learning algorithms that rely on immensely expensive transfer learning, with models such as BERT reaching potentially hundreds of millions of parameters. Despite BERT's undeniable success in NLP tasks, models of this scale are costly to train and hard to interpret.<br>

### 2. Background Overview of Hyperbolic Geometry:

<br>

**Note**: other data sources may also be amenable to hyperbolic approaches. For example, one could train a better joint embedding in hyperbolic space from the hierarchy of medical ontologies, the textual descriptions of concepts, and even patients’ EHR data.

### 3.1-3.4: Hyperbolic Embeddings <br>

- Method 1: Apply the Poincaré Embedding Algorithm

- Method 2: Apply Lorentz Embedding

- Method 3: Lorentzian Distance Learning??
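
The methods listed above differ mainly in which model of hyperbolic space they optimize in. As a minimal sketch (not the implementation used in this project), the distance functions of the Poincaré ball and the hyperboloid (Lorentz) model can be written with the standard library alone, assuming curvature -1. The function names `poincare_distance`, `to_hyperboloid`, and `lorentz_distance` are illustrative choices.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (Poincaré model)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

def to_hyperboloid(u):
    """Map a Poincaré-ball point to the hyperboloid model (first coordinate is time-like)."""
    sq_norm = sum(a * a for a in u)
    return [(1 + sq_norm) / (1 - sq_norm)] + [2 * a / (1 - sq_norm) for a in u]

def lorentz_distance(x, y):
    """Distance on the hyperboloid: arcosh of the negated Lorentzian inner product."""
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    # clamp guards against values slightly below 1 caused by rounding
    return math.acosh(max(1.0, -inner))

# illustrative points, not taken from any dataset
u, v = [0.1, 0.2], [-0.5, 0.3]
d_ball = poincare_distance(u, v)
d_hyp = lorentz_distance(to_hyperboloid(u), to_hyperboloid(v))
print(d_ball, d_hyp)  # the two models should agree up to rounding
```

Because the two models are isometric, the choice between them is mostly about numerical behavior during optimization rather than about the geometry being represented.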


### 4. Evaluation (Comparison to Euclidean & Earlier Approaches) <br>
<br>

### 5. Apply Hyperbolic KMeans/Clustering?<br>
<br>

### 6. Conclusion/Discussion <br>
<br>
<br>

___
### References: <br>


**[1]**. Weng, Wei-Hung and Peter Szolovits. “Representation Learning for Electronic Health Records.” ArXiv abs/1909.09248 (2019): n. pag.

Beaulieu-Jones, Brett, Isaac Kohane, and Andrew Beam. “Learning Contextual Hierarchical Structure of Medical Concepts with Poincaré Embeddings to Clarify Phenotypes.” Pacific Symposium on Biocomputing 24 (2019): 8-17.

Cao, Jiazhen. “A Case Study for Predicting in-Hospital Mortality by Utilizing the Hyperbolic Embedding of ICD-9 Medical Ontology.” (2019).

Top FB AI poincare compare approaches with ICD: https://arxiv.org/pdf/1902.00913.pdf

Snomed2Vec and Poincaré Embeddings of a Clinical Knowledge Base for Healthcare Analytics: https://arxiv.org/pdf/1907.08650.pdf

gave public access to their results/embeddings/data: https://drive.google.com/drive/folders/1zre60Kd0nmQubgQO4iaf0TtWpVLaEKZO

Learning Electronic Health Records through Hyperbolic Embedding of Medical Ontologies: https://dl.acm.org/doi/pdf/10.1145/3307339.3342148

Independent ICD embedding researcher, who also produced these 3D embedding visuals and provides his data sources:

http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/aaronteoh/icd-embeddings/master/projector-config.json
https://tech.aaronteoh.com/medical-diagnosis-codes-embeddings/
https://raw.githubusercontent.com/aaronteoh/icd-embeddings/master/meta_desc.tsv
<br>

___

# Initial Data and Process Exploration 

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
%matplotlib inline
import networkx as nx
import sys
import os

# ignore warnings
import warnings
warnings.filterwarnings('ignore');

# display multiple outputs within a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all";

In [2]:
# load the ICD lookup tables (pipe-delimited files)
all_codes = pd.read_csv('data/tmp/allcodes.csv', sep="|", encoding='latin1', false_values=['"'])
majors = pd.read_csv('data/tmp/majors.csv', sep="|")
# chapters and sub-chapters are stored column-wise, so transpose to one row per entry
chapters = pd.read_csv('data/tmp/chapters.csv', sep="|").transpose()
sub_chapters = pd.read_csv('data/tmp/subchapters.csv', sep="|").transpose()

df = pd.DataFrame(columns=['parent', 'child'])
# print(all_codes.head(3))

# handle chapters: one row per chapter with its name and start/end codes
chapters = chapters.reset_index()
chapters.columns = ['name', 'start', 'end']
chapters['range'] = 'c_' + chapters['start'].map(str) + '_' + chapters['end'].map(str)

chap_name_dict = dict(zip(chapters['name'], chapters['range']))
chap_range_dict = dict(zip(chapters['range'], chapters['name']))

# handle sub-chapters the same way
sub_chapters = sub_chapters.reset_index()
sub_chapters.columns = ['name', 'start', 'end']
sub_chapters['range'] = 's_' + sub_chapters['start'].map(str) + '_' + sub_chapters['end'].map(str)

subchap_name_dict = dict(zip(sub_chapters['name'], sub_chapters['range']))
subchap_range_dict = dict(zip(sub_chapters['range'], sub_chapters['name']))

In [22]:
all_codes.head()

Unnamed: 0,code,billable,short_desc,long_desc,three_digit,major,sub_chapter,chapter
1,1,False,Cholera,Cholera,1,Cholera,Intestinal Infectious Diseases,Infectious And Parasitic Diseases
2,10,True,Cholera d/t vib cholerae,Cholera due to vibrio cholerae,1,Cholera,Intestinal Infectious Diseases,Infectious And Parasitic Diseases
3,11,True,Cholera d/t vib el tor,Cholera due to vibrio cholerae el tor,1,Cholera,Intestinal Infectious Diseases,Infectious And Parasitic Diseases
4,19,True,Cholera NOS,"Cholera, unspecified",1,Cholera,Intestinal Infectious Diseases,Infectious And Parasitic Diseases
5,2,False,Typhoid and paratyphoid fevers,Typhoid and paratyphoid fevers,2,Typhoid and paratyphoid fevers,Intestinal Infectious Diseases,Infectious And Parasitic Diseases


In [24]:
majors.head()
majors.shape
chapters.head()
chapters.shape
sub_chapters.head()
sub_chapters.shape

Unnamed: 0,x
Cholera,1
Typhoid and paratyphoid fevers,2
Other salmonella infections,3
Shigellosis,4
Other food poisoning (bacterial),5


(1234, 1)

Unnamed: 0,name,start,end,range
0,Infectious.And.Parasitic.Diseases,1,139,c_1_139
1,Neoplasms,140,239,c_140_239
2,Endocrine..Nutritional.And.Metabolic.Diseases....,240,279,c_240_279
3,Diseases.Of.The.Blood.And.Blood.Forming.Organs,280,289,c_280_289
4,Mental.Disorders,290,319,c_290_319


(19, 4)

Unnamed: 0,name,start,end,range
0,Intestinal.Infectious.Diseases,1,9,s_1_9
1,Tuberculosis,10,18,s_10_18
2,Zoonotic.Bacterial.Diseases,20,27,s_20_27
3,Other.Bacterial.Diseases,30,41,s_30_41
4,Human.Immunodeficiency.Virus..Hiv..Infection,42,42,s_42_42


(164, 4)

In [26]:
# filter letter codes for now: keep only the leading rows, which hold the numeric codes
print(chapters.shape)
chapters = chapters[:17]
print(chapters.shape)

print(majors.shape)
majors = majors[:928]
print(majors.shape)

sub_chapters = sub_chapters[:119]

(19, 4)
(17, 4)
(1234, 1)
(928, 1)
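
The slices above depend on the row order of the source files. A hedged alternative sketch, assuming the letter-prefixed codes sit in the same column as the numeric ones (for example as strings like `V01`), filters on whether the code consists only of digits. The helper name `keep_numeric_codes` and the toy frame are hypothetical and not part of the original data.

```python
import pandas as pd

def keep_numeric_codes(frame, column):
    """Return only the rows whose value in `column` is made up of digits."""
    is_numeric = frame[column].astype(str).str.isdigit()
    return frame[is_numeric]

# toy stand-in for `majors`: one code column named 'x', arbitrary index labels
toy_majors = pd.DataFrame(
    {'x': ['1', '2', '999', 'V01', 'E800']},
    index=['a', 'b', 'c', 'd', 'e'],
)
numeric_majors = keep_numeric_codes(toy_majors, 'x')
print(numeric_majors.shape)
```

Filtering on content rather than position keeps working if the files are re-exported in a different order, at the cost of assuming how letter codes are spelled.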


In [27]:
# build (child, parent) edges of the code hierarchy
relations = []

# cast the range bounds once instead of on every loop iteration
chap_start, chap_end = chapters['start'].astype(int), chapters['end'].astype(int)
sub_start, sub_end = sub_chapters['start'].astype(int), sub_chapters['end'].astype(int)

for _, major in majors.iterrows():
    code = int(major['x'])
    # major code -> enclosing chapter
    relations.append((major['x'], chapters.loc[(code >= chap_start) & (code <= chap_end), 'range'].values[0]))
    # major code -> enclosing sub-chapter, only when exactly one sub-chapter range matches
    sub_chapter_formajor = sub_chapters.loc[(code >= sub_start) & (code <= sub_end), 'range'].values
    if len(sub_chapter_formajor) == 1:
        relations.append((major['x'], sub_chapter_formajor[0]))

# sub-chapter -> enclosing chapter, matched on the sub-chapter's start code
for _, sub in sub_chapters.iterrows():
    start = int(sub['start'])
    relations.append((sub['range'], chapters.loc[(start >= chap_start) & (start <= chap_end), 'range'].values[0]))

# chapter -> artificial root node
for _, chapter in chapters.iterrows():
    relations.append((chapter['range'], 'center'))

print(relations[-50:])
print(len(relations))

[('s_725_729', 'c_710_739'), ('s_730_739', 'c_710_739'), ('s_760_763', 'c_760_779'), ('s_764_779', 'c_760_779'), ('s_780_789', 'c_780_799'), ('s_790_796', 'c_780_799'), ('s_797_799', 'c_780_799'), ('s_800_829', 'c_800_999'), ('s_800_804', 'c_800_999'), ('s_805_809', 'c_800_999'), ('s_810_819', 'c_800_999'), ('s_820_829', 'c_800_999'), ('s_830_839', 'c_800_999'), ('s_840_848', 'c_800_999'), ('s_850_854', 'c_800_999'), ('s_860_869', 'c_800_999'), ('s_870_897', 'c_800_999'), ('s_870_879', 'c_800_999'), ('s_880_887', 'c_800_999'), ('s_890_897', 'c_800_999'), ('s_900_904', 'c_800_999'), ('s_905_909', 'c_800_999'), ('s_910_919', 'c_800_999'), ('s_920_924', 'c_800_999'), ('s_925_929', 'c_800_999'), ('s_930_939', 'c_800_999'), ('s_940_949', 'c_800_999'), ('s_950_957', 'c_800_999'), ('s_958_959', 'c_800_999'), ('s_960_979', 'c_800_999'), ('s_980_989', 'c_800_999'), ('s_990_995', 'c_800_999'), ('s_996_999', 'c_800_999'), ('c_1_139', 'center'), ('c_140_239', 'center'), ('c_240_279', 'center'), ('
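
An edge list like the one printed above can be saved as a two-column file for later steps. The following is a minimal sketch of such an export, assuming tab-separated `child<TAB>parent` rows with numeric codes zero-padded to three digits. The padding rule, the helper name `format_node`, and the toy edges are assumptions for illustration, not the author's actual export step.

```python
import csv
import io

def format_node(node):
    """Zero-pad numeric codes to three digits; leave range labels and 'center' untouched."""
    text = str(node)
    return text.zfill(3) if text.isdigit() else text

# toy edges in the same (child, parent) shape as `relations`
toy_relations = [(1, 'c_1_139'), (1, 's_1_9'), ('s_1_9', 'c_1_139'), ('c_1_139', 'center')]

buffer = io.StringIO()  # swap for open(path, 'w', newline='') to write a real file
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
for child, parent in toy_relations:
    writer.writerow([format_node(child), format_node(parent)])

tsv_text = buffer.getvalue()
print(tsv_text)
```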

In [18]:
# `cols` is not a read_csv argument; header=None yields the integer column labels 0 and 1
pd.read_csv('data/icd_data/icd.tsv', sep='\t', header=None)

Unnamed: 0,0,1
0,001,c_1_139
1,001,s_1_9
2,002,c_1_139
3,002,s_1_9
4,003,c_1_139
...,...,...
1891,c_710_739,center
1892,c_740_759,center
1893,c_760_779,center
1894,c_780_799,center
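
Each row of this file is a (child, parent) edge, and the chapters point to an artificial `center` root. As a sanity check before embedding, one might confirm that every node can reach that root. The sketch below uses only the standard library; `toy_edges` and the helper `reaches_root` are hypothetical stand-ins for the rows of the loaded frame.

```python
from collections import defaultdict

def reaches_root(edges, root='center'):
    """Map each node to True if following parent links can reach `root`."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    result = {}
    for node in nodes:
        seen, stack = set(), [node]
        found = False
        while stack:
            current = stack.pop()
            if current == root:
                found = True
                break
            if current in seen:
                continue  # already explored; also protects against cycles
            seen.add(current)
            stack.extend(parents.get(current, ()))
        result[node] = found
    return result

# toy edges: the last one deliberately points at a parent that never reaches the root
toy_edges = [('001', 'c_1_139'), ('001', 's_1_9'), ('s_1_9', 'c_1_139'),
             ('c_1_139', 'center'), ('002', 's_orphan')]
status = reaches_root(toy_edges)
unreachable = sorted(node for node, ok in status.items() if not ok)
print(unreachable)
```

If the loaded frame were assigned to a variable (say a hypothetical `edges_df`), the edges could be passed as `list(zip(edges_df[0], edges_df[1]))`.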
