# MedLinker-Social Tutorial

In this notebook, we'll show how to load our Entity Linker and CUI embeddings to run some experiments.

Before following this tutorial, make sure you have followed the installation instructions at https://github.com/danlou/MedLinker-Social.

Let's start by importing and initializing MedLinker-Social:

In [1]:
from medlinkersocial import MedLinkerSocial

# should have previously downloaded these files into the repository
db_path='data/SimString/umls_2020_aa_cat0129_ext.3gram.5toks.db'
map_path='data/SimString/umls_2020_aa_cat0129_ext.5toks.alias.map'

linker = MedLinkerSocial(db_path, map_path, alpha=0.5, n=5)

08-Dec-20 18:10:50 - INFO - Loading DB ...
08-Dec-20 18:11:30 - INFO - Loading Mapping ...
08-Dec-20 18:11:47 - INFO - Creating Searcher ...


In this tutorial, we start with the parameters used in the experiments described in the report:

- 'alpha': similarity/confidence threshold; matches below it are discarded
- 'n': maximum number of tokens per candidate keyword (n-gram size)




Initializing MedLinker-Social this way is enough to start finding matches. Let's try it with this sentence (or edit the next cell to try a different one):

In [24]:
sent = "But I often check on her because I'm paranoid and scared of positional asphyxiation."

Now we can simply call the .search() method to run both Mention Detection (based on YAKE) and Entity Linking (based on SimString).

In [36]:
results = linker.search(sent)
results

{'sentence': "But I often check on her because I'm paranoid and scared of positional asphyxiation.",
 'tokens': ['But',
  'I',
  'often',
  'check',
  'on',
  'her',
  'because',
  'I',
  "'m",
  'paranoid',
  'and',
  'scared',
  'of',
  'positional',
  'asphyxiation',
  '.'],
 'matches': [{'keyword': 'scared positional asphyxiation',
   'cui': 'C0004044',
   'stys': ['T046'],
   'alias': 'asphyxiation',
   'start': 11,
   'end': 15,
   'score': 0.4890754108971043,
   'similarity': 0.6546536707079772},
  {'keyword': 'asphyxiation',
   'cui': 'C0004044',
   'stys': ['T046'],
   'alias': 'asphyxiation',
   'start': 14,
   'end': 15,
   'score': 0.012911142200214176,
   'similarity': 1.0},
  {'keyword': 'check',
   'cui': 'C1283174',
   'stys': ['T052'],
   'alias': 'check',
   'start': 3,
   'end': 4,
   'score': 0.007802980443496247,
   'similarity': 1.0},
  {'keyword': 'scared',
   'cui': 'C0015726',
   'stys': ['T041'],
   'alias': 'scared',
   'start': 11,
   'end': 12,
   'score': 

This function returns a dictionary with lots of details. Let's look at the matches more closely.

In [26]:
for match in results['matches']:
    print(match)

{'keyword': 'scared positional asphyxiation', 'cui': 'C0004044', 'stys': ['T046'], 'alias': 'asphyxiation', 'start': 11, 'end': 15, 'score': 0.4890754108971043, 'similarity': 0.6546536707079772}
{'keyword': 'asphyxiation', 'cui': 'C0004044', 'stys': ['T046'], 'alias': 'asphyxiation', 'start': 14, 'end': 15, 'score': 0.012911142200214176, 'similarity': 1.0}
{'keyword': 'check', 'cui': 'C1283174', 'stys': ['T052'], 'alias': 'check', 'start': 3, 'end': 4, 'score': 0.007802980443496247, 'similarity': 1.0}
{'keyword': 'scared', 'cui': 'C0015726', 'stys': ['T041'], 'alias': 'scared', 'start': 11, 'end': 12, 'score': 0.007802980443496247, 'similarity': 1.0}
{'keyword': 'positional', 'cui': 'C0240795', 'stys': ['T033'], 'alias': 'positional', 'start': 13, 'end': 14, 'score': 0.007802980443496247, 'similarity': 1.0}
{'keyword': 'paranoid scared positional', 'cui': 'C0240795', 'stys': ['T033'], 'alias': 'positional', 'start': 9, 'end': 14, 'score': 0.2903808748157886, 'similarity': 0.62017367294

The linker was initialized with the default overlapping=True, which allows multiple extractions from subparts of the same phrase. Let's try switching off this feature to get cleaner results.

In [27]:
results = linker.search(sent, overlapping=False)

for match in results['matches']:
    print(match)

{'keyword': 'scared positional asphyxiation', 'cui': 'C0004044', 'stys': ['T046'], 'alias': 'asphyxiation', 'start': 11, 'end': 15, 'score': 0.4890754108971043, 'similarity': 0.6546536707079772}
{'keyword': 'check', 'cui': 'C1283174', 'stys': ['T052'], 'alias': 'check', 'start': 3, 'end': 4, 'score': 0.007802980443496247, 'similarity': 1.0}
{'keyword': 'paranoid', 'cui': 'C1456786', 'stys': ['T048'], 'alias': 'paranoid state', 'start': 9, 'end': 10, 'score': 0.007802980443496247, 'similarity': 0.7559289460184544}
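
To build some intuition for what overlapping=False does, here is a minimal sketch of one plausible strategy: keep matches greedily by descending score and drop any match whose token span overlaps one already kept. The helper name `drop_overlaps` is hypothetical and this is not necessarily how MedLinker-Social implements it; on the matches printed above it happens to select the same three keywords.

```python
def drop_overlaps(matches):
    """Greedily keep the highest-scoring matches whose token spans do not overlap."""
    kept = []
    # sorted() is stable, so matches with equal scores keep their original order
    for cand in sorted(matches, key=lambda m: m['score'], reverse=True):
        # half-open spans [start, end) are disjoint when one ends before the other starts
        if all(cand['end'] <= k['start'] or cand['start'] >= k['end'] for k in kept):
            kept.append(cand)
    return kept


# keyword, start, end and score values copied from the two outputs above
matches = [
    {'keyword': 'scared positional asphyxiation', 'start': 11, 'end': 15, 'score': 0.4890754108971043},
    {'keyword': 'asphyxiation', 'start': 14, 'end': 15, 'score': 0.012911142200214176},
    {'keyword': 'check', 'start': 3, 'end': 4, 'score': 0.007802980443496247},
    {'keyword': 'scared', 'start': 11, 'end': 12, 'score': 0.007802980443496247},
    {'keyword': 'positional', 'start': 13, 'end': 14, 'score': 0.007802980443496247},
    {'keyword': 'paranoid scared positional', 'start': 9, 'end': 14, 'score': 0.2903808748157886},
    {'keyword': 'paranoid', 'start': 9, 'end': 10, 'score': 0.007802980443496247},
]

kept_keywords = [m['keyword'] for m in drop_overlaps(matches)]
print(kept_keywords)
```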


Now you can see three distinct concepts (CUIs) matched against separate portions of our sentence, along with additional fields that we describe below:

- 'stys': semantic types associated with the matched CUI, as defined in the UMLS ontology.
- 'alias': which of the CUI's aliases was matched against the text span.
- 'start': index of the first token of the matched span, to be used with the token list in `results['tokens']`.
- 'end': index one past the last token of the matched span (exclusive), so that `results['tokens'][start:end]` yields the matched tokens.
- 'score': score returned by YAKE for this text span (i.e. keyword).
- 'similarity': similarity computed by SimString, which can be understood as our confidence metric.
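
To see where similarity values like these can come from, here is a small sketch that computes a cosine similarity over sets of character trigrams, padding each string with one space on either side. This is a reconstruction under those assumptions, not SimString's own code, and the function names are hypothetical; it does reproduce the values shown above (about 0.655 for 'scared positional asphyxiation' against the alias 'asphyxiation', and about 0.756 for 'paranoid' against 'paranoid state').

```python
import math


def char_trigrams(text):
    """Set of character 3-grams, with one space of padding on each side (assumed)."""
    padded = f' {text} '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_cosine(a, b):
    """Cosine similarity between the trigram sets of two strings."""
    x, y = char_trigrams(a), char_trigrams(b)
    return len(x & y) / math.sqrt(len(x) * len(y))


sim_long = trigram_cosine('scared positional asphyxiation', 'asphyxiation')
sim_alias = trigram_cosine('paranoid', 'paranoid state')
sim_exact = trigram_cosine('asphyxiation', 'asphyxiation')
print(round(sim_long, 3), round(sim_alias, 3), round(sim_exact, 3))
```

Under this measure, a long keyword that merely contains an alias scores lower than an exact match, because the extra trigrams enlarge the denominator.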

We also provide some auxiliary methods to help interpret this information, such as getting the names of semantic type codes, or the most frequent aliases for concept IDs.

In [29]:
from utils import sty_labels
sty_labels['T046']

'Pathologic Function'

In [30]:
from utils import cui_mfa
cui_mfa['C0004044']

'Suffocate'

In [31]:
# or show all aliases
from utils import cui_alias_map
cui_alias_map['C0004044']

['asphyxia',
 'suffocation nos',
 'suffocations',
 'asphyxiation event',
 'asphyxiation',
 'suffocating',
 'asphyxias',
 'suffocation',
 'asphyxiation nos',
 'suffocate',
 'suffocating finding',
 'not able breathe',
 'unable breath',
 'cannot breathe',
 "can't breathe"]

Now let's have a closer look at our top match.

In [41]:
top_match = results['matches'][0]
top_match_cui = top_match['cui']
top_match_tokens = results['tokens'][top_match['start']:top_match['end']]
top_match_conf = round(top_match['similarity'], 3)
(top_match_cui, top_match_tokens, top_match_conf)

('C0004044', ['scared', 'of', 'positional', 'asphyxiation'], 0.655)
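
Since we'll inspect the top match a few more times, the lines above could be wrapped in a small helper. This is only a convenience sketch: `summarize_match` is a hypothetical name, not part of MedLinker-Social, and the example dictionary is a trimmed copy of the result shown earlier.

```python
def summarize_match(results, index=0):
    """Return (cui, tokens, rounded similarity) for one match of a search() result."""
    match = results['matches'][index]
    # 'end' is exclusive, so start:end slices exactly the matched tokens
    tokens = results['tokens'][match['start']:match['end']]
    return match['cui'], tokens, round(match['similarity'], 3)


# trimmed copy of the result shown above; with the linker loaded, pass linker.search(sent) instead
example = {
    'tokens': ['But', 'I', 'often', 'check', 'on', 'her', 'because', 'I', "'m", 'paranoid',
               'and', 'scared', 'of', 'positional', 'asphyxiation', '.'],
    'matches': [{'cui': 'C0004044', 'start': 11, 'end': 15, 'similarity': 0.6546536707079772}],
}
summary = summarize_match(example)
print(summary)
```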

If you find this span too long, and you'd prefer a more focused (higher confidence) match, you can achieve that through two mechanisms.

1. Increase the similarity matching threshold:

In [42]:
results = linker.search(sent, alpha=0.75)  # default is 0.5

top_match = results['matches'][0]
top_match_cui = top_match['cui']
top_match_tokens = results['tokens'][top_match['start']:top_match['end']]
top_match_conf = round(top_match['similarity'], 3)
(top_match_cui, top_match_tokens, top_match_conf)

('C0004044', ['asphyxiation'], 1.0)

2. Ignore the YAKE extraction score and consider only the matching similarity, so that the number of tokens is no longer a factor:

In [43]:
results = linker.search(sent, add_yake_score=False)  # default is True

top_match = results['matches'][0]
top_match_cui = top_match['cui']
top_match_tokens = results['tokens'][top_match['start']:top_match['end']]
top_match_conf = round(top_match['similarity'], 3)
(top_match_cui, top_match_tokens, top_match_conf)

('C0004044', ['asphyxiation'], 1.0)

Now let's see how the linker handles some variability in writing about these conditions.

In [55]:
sent = "She also had pulmonary hypertension and she is now on medication for that."

results = linker.search(sent, alpha=0.5, overlapping=False, add_yake_score=True)

for match in results['matches']:
    print(match)

{'keyword': 'pulmonary hypertension', 'cui': 'C0020542', 'stys': ['T046'], 'alias': 'pulmonary hypertension', 'start': 3, 'end': 5, 'score': 0.6724221003398605, 'similarity': 1.0}
{'keyword': 'medication', 'cui': 'C0013227', 'stys': ['T121'], 'alias': 'medication', 'start': 10, 'end': 11, 'score': 0.10919263322004649, 'similarity': 1.0}


Some variations are covered by the thesaurus: they are replaced before matching, so the similarity is unaffected.

In [61]:
# pulmonary -> pulomary (variation covered by thesaurus)
sent = "She also had pulomary hypertension and she is now on medication for that."

results = linker.search(sent, alpha=0.5, overlapping=False, add_yake_score=True)

for match in results['matches']:
    print(match)

{'keyword': 'pulmonary hypertension', 'cui': 'C0020542', 'stys': ['T046'], 'alias': 'pulmonary hypertension', 'start': 3, 'end': 5, 'score': 0.6724221003398605, 'similarity': 1.0}
{'keyword': 'medication', 'cui': 'C0013227', 'stys': ['T121'], 'alias': 'medication', 'start': 10, 'end': 11, 'score': 0.10919263322004649, 'similarity': 1.0}


Others will be caught by the approximate matching of SimString, which lowers the similarity; whether they are admissible depends on the threshold.

In [62]:
# medication -> medicacion (variation not covered by thesaurus)
sent = "She also had pulmonary hypertension and she is now on medicacion for that."

results = linker.search(sent, alpha=0.5, overlapping=False, add_yake_score=True)

for match in results['matches']:
    print(match)

{'keyword': 'pulmonary hypertension', 'cui': 'C0020542', 'stys': ['T046'], 'alias': 'pulmonary hypertension', 'start': 3, 'end': 5, 'score': 0.6724221003398605, 'similarity': 1.0}
{'keyword': 'medicacion', 'cui': 'C0013227', 'stys': ['T121'], 'alias': 'medication', 'start': 10, 'end': 11, 'score': 0.10919263322004649, 'similarity': 0.7}


In [63]:
# medication -> medicacion (variation not covered by thesaurus)
sent = "She also had pulmonary hypertension and she is now on medicacion for that."

# increasing threshold 0.5 -> 0.8
results = linker.search(sent, alpha=0.8, overlapping=False, add_yake_score=True)

for match in results['matches']:
    print(match)

{'keyword': 'pulmonary hypertension', 'cui': 'C0020542', 'stys': ['T046'], 'alias': 'pulmonary hypertension', 'start': 3, 'end': 5, 'score': 0.6724221003398605, 'similarity': 1.0}
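
If you already have results computed with a permissive threshold, you can also tighten them afterwards without calling the linker again. The sketch below uses a hypothetical helper, `filter_by_similarity`, on values copied from the alpha=0.5 output above. Note that this is not guaranteed to be equivalent to passing a higher alpha to .search(): as we saw with the asphyxiation example, a stricter alpha can surface a different, shorter span rather than just removing matches.

```python
def filter_by_similarity(matches, min_similarity):
    """Keep only matches whose similarity reaches the given threshold."""
    return [m for m in matches if m['similarity'] >= min_similarity]


# keyword and similarity values copied from the alpha=0.5 output above
matches = [
    {'keyword': 'pulmonary hypertension', 'similarity': 1.0},
    {'keyword': 'medicacion', 'similarity': 0.7},
]
strict = filter_by_similarity(matches, 0.8)
print([m['keyword'] for m in strict])
```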


Now let's see how we may make use of the precomputed UMLS concept embeddings. We can start by loading the embeddings learned from our Reddit and EuroPMC corpora.

In [65]:
from vectorspace import VSM

reddit_vsm = VSM('data/reddit/reddit-vectors-subset.txt')
europmc_vsm = VSM('data/europmc/europmc-vectors-subset.txt')

Now we can easily find the most similar concepts and compare the differences between the two spaces.

In [69]:
reddit_neighbors = reddit_vsm.most_similar('C0004044', topn=10)
[(cui, cui_mfa[cui], sim) for cui, sim in reddit_neighbors]  # including MFA to make it easier to follow

[('C0004044', 'Suffocate', 1.0),
 ('C0150082', 'Suffocation risk', 0.8749221563339233),
 ('C0231811', 'Suffocated', 0.8133983612060547),
 ('C1536743', 'Accidental suffocation nos', 0.7687143087387085),
 ('C0038644', 'Sids', 0.7321560978889465),
 ('C0546947', 'Choking sensation', 0.683613121509552),
 ('C0549159', 'Infant deaths', 0.6491534113883972),
 ('C0598697', 'Hazard', 0.6375374794006348),
 ('C0421611', 'Place death', 0.6327787637710571),
 ('C0021278', 'Infant death', 0.6298763751983643)]

In [70]:
europmc_neighbors = europmc_vsm.most_similar('C0004044', topn=10)
[(cui, cui_mfa[cui], sim) for cui, sim in europmc_neighbors]  # including MFA to make it easier to follow

[('C0004044', 'Suffocate', 0.9999999403953552),
 ('C0150082', 'Suffocation risk', 0.870628833770752),
 ('C0231811', 'Suffocated', 0.8408768177032471),
 ('C1536743', 'Accidental suffocation nos', 0.8010740280151367),
 ('C0038644', 'Sids', 0.7536582946777344),
 ('C0410916', 'Newborn death', 0.7269656658172607),
 ('C0011071', 'Death sudden', 0.7264581322669983),
 ('C0159020', 'Fits newborn', 0.698747992515564),
 ('C0011057', 'Sudden hear loss', 0.686185359954834),
 ('C0413297', 'Dry drowning', 0.6743696928024292)]
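
To make the comparison between the two spaces concrete, we can look at which neighbours they share. The sketch below works on the CUIs copied from the two outputs above; with the models loaded, you could build the same lists with `[cui for cui, _ in reddit_neighbors]` and `[cui for cui, _ in europmc_neighbors]`.

```python
# CUIs copied from the two neighbour lists above
reddit_cuis = ['C0004044', 'C0150082', 'C0231811', 'C1536743', 'C0038644',
               'C0546947', 'C0549159', 'C0598697', 'C0421611', 'C0021278']
europmc_cuis = ['C0004044', 'C0150082', 'C0231811', 'C1536743', 'C0038644',
                'C0410916', 'C0011071', 'C0159020', 'C0011057', 'C0413297']

# list comprehensions (rather than set operations) preserve the rank order
shared = [cui for cui in reddit_cuis if cui in europmc_cuis]
reddit_only = [cui for cui in reddit_cuis if cui not in europmc_cuis]
europmc_only = [cui for cui in europmc_cuis if cui not in reddit_cuis]

# Jaccard overlap: shared neighbours divided by all distinct neighbours
jaccard = len(shared) / len(set(reddit_cuis) | set(europmc_cuis))

print(len(shared), round(jaccard, 3))
print('Reddit only:', reddit_only)
print('EuroPMC only:', europmc_only)
```

Here the top five neighbours coincide, while the remaining five differ in each space.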

Since these concept embeddings are based on the word embeddings learned directly from each corpus, we can combine them with natural language terms.

So we'll also need to load the full fastText models.

In [71]:
import fasttext

europmc_fasttext = fasttext.load_model('data/europmc/europmc-vectors.bin')
reddit_fasttext = fasttext.load_model('data/reddit/reddit-vectors.bin')

Using these models we can, for example, see how UMLS concepts of a certain type are related to distressed emotional states:

In [94]:
import numpy as np

target_type = 'T047'  # Disease or Syndrome
for emotion in ['sad', 'miserable', 'angry', 'anxious', 'obsessed', 'depressed', 'exhausted', 'disgusted', 'outraged', 'confused']:
    emotion_vec = reddit_fasttext[emotion]
    emotion_vec = emotion_vec / np.linalg.norm(emotion_vec)  # normalize vector before cosine
    emotion_nns = reddit_vsm.most_similar_vec(emotion_vec, topn=None)  # returns all similarities
    emotion_nns = [(cui, sim) for cui, sim in emotion_nns if target_type in linker.get_types(cui)][:3]
    emotion_nns = [(cui, linker.get_mfa(cui), round(sim, 3)) for (cui, sim) in emotion_nns]
    print(emotion, emotion_nns)

sad [('C0018801', 'Heart failure', 0.471), ('C3665704', 'Crie', 0.444), ('C0000814', 'Missed miscarriage', 0.407)]
miserable [('C1960870', 'Chronic migraine', 0.474), ('C0014481', 'Three day sickness', 0.468), ('C0013467', 'East coast fever', 0.462)]
angry [('C0004095', 'Eye tired', 0.502), ('C0749539', 'Callous toe', 0.47), ('C0040264', 'Ear ringing', 0.43)]
anxious [('C0234533', 'Generalized seizure', 0.435), ('C0271489', 'Paralyzing vertigo', 0.42), ('C3203733', 'Texidor twinge', 0.406)]
obsessed [('C0424868', 'Chewing problem', 0.447), ('C0263940', 'Curb', 0.394), ('C0399397', 'Rampant dental caries', 0.393)]
depressed [('C0003123', 'Anorexia', 0.534), ('C0234533', 'Generalized seizure', 0.528), ('C0683323', 'Physical illness', 0.498)]
exhausted [('C0004095', 'Eye tired', 0.522), ('C0276573', 'Aids with fatigue', 0.473), ('C0009088', 'Cluster headache', 0.447)]
disgusted [('C0003123', 'Anorexia', 0.439), ('C0423086', 'Staring', 0.438), ('C0856619', 'Sexual problem', 0.436)]
outrage

Similarly, we may use the distances to two terms at opposite ends of a scale to determine where each concept falls on that scale.
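
In symbols, the quantity computed in the next cell for a concept $c$ can be written as follows. We assume here that all vectors have unit length, so that a dot product equals a cosine similarity; the term vectors are normalized explicitly in the code, while unit-length CUI vectors are an assumption about what the VSM returns.

$$
d(c) = \hat{v}_{t_2} \cdot v_c \; - \; \hat{v}_{t_1} \cdot v_c
$$

where $t_1$ is 'extremely preterm', $t_2$ is 'late preterm', $\hat{v}_t$ is the normalized fastText vector of term $t$, and $v_c$ is the embedding of concept $c$. A negative $d(c)$ means the concept is closer to $t_1$, and a positive one means it is closer to $t_2$.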



In [100]:
from utils import efcni_cuis

term1 = 'extremely preterm'
term2 = 'late preterm'

term1_vec = reddit_fasttext[term1]
term1_vec = term1_vec / np.linalg.norm(term1_vec)
term2_vec = reddit_fasttext[term2]
term2_vec = term2_vec / np.linalg.norm(term2_vec)

cui_dists = []
for cui in efcni_cuis:
    if cui in reddit_vsm.labels:
        cui_vec = reddit_vsm.get_vec(cui)
        term1_sim = np.dot(term1_vec, cui_vec)
        term2_sim = np.dot(term2_vec, cui_vec)
        cui_dists.append((cui, term2_sim - term1_sim))

cui_dists = sorted(cui_dists, key=lambda x: x[1])

print('EFCNI Concepts ordered by relatedness to early preterms vs. late preterms:\n')
for cui, dist in cui_dists:
    print(cui, linker.get_mfa(cui), dist)

EFCNI Concepts ordered by relatedness to early preterms vs. late preterms:

C3494262 Extremely preterm infant -0.19304973
C0270971 Floppy baby -0.08424756
C0022346 Jaundice -0.03182125
C0746102 Chronic lung disease -0.018420875
C0039231 Heart racing -0.015542209
C0020672 Hypothermia -0.01441738
C0022353 Baby jaundice -0.0125117
C0231835 Fast breathing -0.010804266
C0026827 Low muscle tone -0.009894878
C0004044 Suffocate -0.0058891177
C0746961 Desaturation -0.0028530657
C0003578 Apnea 0.0017508268
C0269810 Sepsis during labor 0.0029874444
C0369768 Oxygen saturation 0.0075264573
C0428977 Pulse slowed 0.008679509
C0559477 Perinatal depression 0.010171503
C0020615 Low blood sugar 0.01761666
C0036690 Sepsis 0.019049704
C0728731 Babies born premature 0.02091676
C0025289 Meningitis 0.023334652
C0242184 Decreased oxygen supply 0.024100065
C0035236 Rsv 0.032326072
C0020542 Pulmonary hypertension 0.04187584
C0000832 Placental abruption 0.046814203
C0032285 Pneumonia 0.05292332
C0032326 Pneumotho