In [1]:
import json
from pathlib import Path
from typing import List

from dacite import from_dict
from covid import DATA_DIR, logger
from covid.structs import Paper
from covid.io import read_data

## Toolbox

Since a paper's main content is text (abstract and body), we'll use some text processing tools, among them spaCy and scispaCy. spaCy is an easy-to-use Python library for many NLP tasks that probably needs no introduction. In any case, here's the link to [the docs](https://spacy.io/usage).

### scispaCy 


[scispaCy](https://allenai.github.io/scispacy/) is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
There are already trained models available for different biomedical entity types:

model|F1|Entity Types
--- | --- | ---
en_ner_craft_md|76.60|cell types, chemicals, proteins, genes
en_ner_jnlpba_md|74.26|cell lines, cell types, DNAs, RNAs, proteins
en_ner_bc5cdr_md|85.02|chemicals and diseases
en_ner_bionlp13cg_md|78.28|cancer genetics

For more details on the corpora, please refer to the original paper and the references therein: [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing](https://www.aclweb.org/anthology/W19-5034/)

#### Why?
Standard NER tools cover only the standard entity types (Person, Location, Organization etc.). Such entities are rare in biomedical or healthcare data, where the predominant entity types are disease, protein, body part, medical device etc.

### scispacy - UMLS linker
As per the scispacy [docs](https://github.com/allenai/scispacy#umlsentitylinker-alpha-feature)
> The UmlsEntityLinker is a SpaCy component which performs linking to the Unified Medical Language System. Note that this is currently an alpha feature. The linker simply performs a string overlap search on named entities, comparing them with a knowledge base of 2.7 million concepts using an approximate nearest neighbours search.

That is, scispaCy comes with a feature that lets us ground entity mentions to the concept they (most likely) refer to. For instance, *'rRNA'* and *'ribosomal RNA'* both refer to the UMLS concept 'C0035701', i.e. *Ribosomal ribonucleic acid*. As mentioned in the quote above, name ambiguity is not fully handled, but let's hope that biomedical entities are less ambiguous than person names (e.g. *John Smith*).
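
To get a feeling for what "string overlap search" means, here is a minimal sketch that links mentions to a tiny hand-written knowledge base via character trigram overlap (Jaccard similarity). This is not scispaCy's implementation: the toy knowledge base, the alias lists, and the names `char_ngrams`, `jaccard` and `link` are assumptions made for illustration, and the real linker uses an approximate nearest neighbours index instead of the brute-force loop shown here.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy knowledge base: concept ID -> aliases (hand-written, illustrative subset only).
TOY_KB: Dict[str, List[str]] = {
    "C0035701": ["ribosomal RNA", "rRNA", "ribosomal ribonucleic acid"],
    "C0002520": ["amino acids", "amino acid"],
    "C0042769": ["virus diseases", "virus infections", "viral infection"],
}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(mention: str, threshold: float = 0.85) -> Optional[Tuple[str, float]]:
    """Return (concept_id, score) for the best matching alias, or None below the threshold."""
    mention_grams = char_ngrams(mention)
    best: Optional[Tuple[str, float]] = None
    # Brute-force comparison against every alias in the toy knowledge base.
    for concept_id, aliases in TOY_KB.items():
        for alias in aliases:
            score = jaccard(mention_grams, char_ngrams(alias))
            if best is None or score > best[1]:
                best = (concept_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


for mention in ["rRNA", "Ribosomal RNA", "John Smith"]:
    print(mention, "->", link(mention))
```

Both spellings of ribosomal RNA end up at the same concept ID, while a mention without a sufficiently similar alias is left unlinked.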

What is this UMLS thing and why do we want to use it?

#### UMLS
As per the UMLS [docs](https://www.nlm.nih.gov/research/umls/index.html)
> The UMLS, or Unified Medical Language System, is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems.

It does so by providing a hierarchy of uniquely identified concepts that doctors, software, or papers can refer to, ensuring that the topic they *talk* about is the same, just like Wikipedia URIs (if interested, google 'Wikification' or 'Named Entity Linking' ;-) )

#### Why?
When we want to investigate the topics of a text, grounded entities reduce the number of items we need to tackle, because different surface forms of the same concept collapse into a single identifier.
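
A small sketch of that effect, assuming a hypothetical mention-to-concept mapping as a linker might produce it (the mention list and the mapping are made up for illustration; only the concept IDs are real UMLS identifiers):

```python
from collections import Counter

# Hypothetical linker output: surface form -> UMLS concept ID.
mention_to_concept = {
    "rRNA": "C0035701",
    "ribosomal RNA": "C0035701",
    "Ribosomal ribonucleic acid": "C0035701",
    "amino acids": "C0002520",
}

# Made-up sequence of mentions as they might appear in a text.
mentions = ["rRNA", "ribosomal RNA", "amino acids", "rRNA", "Ribosomal ribonucleic acid"]

# Counting raw strings keeps every spelling apart ...
surface_counts = Counter(mentions)
# ... while counting concept IDs merges the spellings of the same concept.
concept_counts = Counter(mention_to_concept[m] for m in mentions)

print(len(surface_counts), "distinct surface forms:", dict(surface_counts))
print(len(concept_counts), "distinct concepts:", dict(concept_counts))
```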

### Text Rank

TextRank applies a graph algorithm to find the most important keywords and sentences in a text ([Mihalcea & Tarau](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)). The [PyTextRank implementation](https://pypi.org/project/pytextrank/) can be plugged into the [spaCy pipeline](https://spacy.io/universe/project/spacy-pytextrank) which is super convenient. We may use it to see if we can rank the important concepts in the paper by something more elaborate than pure frequency.
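
To make the idea a bit more concrete, here is a rough sketch of keyword-style TextRank in plain Python: words that co-occur within a small window are connected in an undirected graph, and a PageRank-style iteration scores the nodes. It is a simplification under stated assumptions and not the PyTextRank implementation: the token list is a toy example, and `build_graph`, `pagerank`, the window size and the damping factor of 0.85 are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_graph(tokens: List[str], window: int = 2) -> Dict[str, Set[str]]:
    """Connect each token to the tokens that follow it within `window` positions."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != token:
                graph[token].add(other)
                graph[other].add(token)
    return graph


def pagerank(graph: Dict[str, Set[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Score nodes of an undirected graph with a fixed number of PageRank iterations."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Each node receives an equal share of the score of each of its neighbours.
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in nodes
        }
    return scores


# Toy token sequence (already lower-cased and filtered to candidate words).
tokens = ["fibrillarin", "rrna", "snrnas", "fibrillarin", "amino",
          "acids", "fibrillarin", "virus", "infections"]
scores = pagerank(build_graph(tokens))
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.4f}  {word}")
```

Unlike a pure frequency count, a word's score here also depends on how well connected its neighbours are.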

In [None]:
import spacy
from scispacy.abbreviation import AbbreviationDetector
from scispacy.umls_linking import UmlsEntityLinker
import pytextrank

#### load models
Let's stick with *en_ner_bc5cdr_md* for now.

In [4]:

models = {
    #"base": spacy.load("en_core_sci_sm"),
    #"craft": spacy.load("en_ner_craft_md"),
    "bc5": spacy.load("en_ner_bc5cdr_md"),  
    #"jnlpba" : spacy.load("en_ner_jnlpba_md"), 
} 
textRank = pytextrank.TextRank()

In [None]:
linker = UmlsEntityLinker(resolve_abbreviations=True)

In [6]:
import csv
umls_types = {}
# manually downloaded from this gist: https://gist.github.com/joelkuiper/4869d148333f279c2b2e
with open(DATA_DIR / "umls_types.txt") as f:
    types_reader = csv.reader(f, delimiter=',')
    for row in types_reader:
        umls_types[row[0]] = row[1]

In [7]:
nlp = models["bc5"]
abbreviation_pipe = AbbreviationDetector(nlp)
if not nlp.has_pipe("abbrev"):
    nlp.add_pipe(abbreviation_pipe, name="abbrev")
if not nlp.has_pipe("textRank"):
    nlp.add_pipe(textRank.PipelineComponent, name="textRank", last=True)
if not nlp.has_pipe("linker"):
    nlp.add_pipe(linker, name="linker") 

In [8]:
biorxiv_dir: Path = DATA_DIR / "CORD/2020-03-13/"
corpus = read_data(biorxiv_dir, 1)

processed = []
for paper in corpus:
    body_text = "\n".join([f.text for f in paper.body_text])
    abstract_text = "\n".join([f.text for f in paper.abstract])
    full_text = "\n".join([abstract_text, body_text])  # not used yet
    # for now, only the abstract is run through the pipeline
    doc = nlp(abstract_text)
    processed.append((paper, doc))

2020-03-23 19:37:19 INFO: [covid - read_data()] Read 1 files from /home/anja/sandbox/covid-nlp/data/CORD/2020-03-13.


What are the entities in the document?  
What are their UMLS concepts?  

For the latter, we'll trust scispacy and take the highest scoring candidate if the linking score is greater than 0.85 (85%). This is a simplification but hopefully ok here.  
The UmlsEntity in scispacy is defined [here](https://github.com/allenai/scispacy/blob/master/scispacy/umls_utils.py), but let's briefly describe its attributes:
* `concept_id`: the unique identifier of the UMLS concept
* `canonical_name`: probably the dominant name for the concept, similar to Wikipedia titles
* `aliases`: alias names, including abbreviations
* `types`: the semantic types of a term, e.g. T123 = 'Biologically Active Substance'. [Here](https://gist.github.com/joelkuiper/4869d148333f279c2b2e) is a gist listing them. 
* `definition`: the definition of the concept, a textual description
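
The selection rule described above can be sketched as a small helper. It assumes that the candidates come as `(concept_id, score)` tuples; the function name `best_candidate` and the example scores are made up for illustration.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (concept_id, linking score)


def best_candidate(candidates: List[Candidate],
                   threshold: float = 0.85) -> Optional[Candidate]:
    """Return the highest scoring candidate above the threshold, or None if there is none."""
    eligible = [c for c in candidates if c[1] > threshold]
    # max() with default=None handles the case of no eligible candidate.
    return max(eligible, key=lambda c: c[1], default=None)


# Hypothetical candidate lists for two mentions.
print(best_candidate([("C0035709", 0.88), ("C0035701", 0.97)]))
print(best_candidate([("C0035701", 0.60)]))
```

Taking the maximum explicitly means the helper does not rely on the candidates being sorted by score.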

In [9]:
from collections import defaultdict, Counter
# Collect all UMLS entities
umls_entities = defaultdict(Counter)
# collect all entities
entities = defaultdict(lambda: defaultdict(Counter))

paper, doc = processed[0]
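for entity in doc.ents:
    # note: the Counter is keyed by Span objects, so two mentions with the same
    # text at different positions are counted separately
    entities[paper.paper_id][entity.label_].update([entity])
    # the umls candidate entities are attached to the entity mention as (cui, score) pairs;
    # we take the first one above the threshold (assuming they are sorted by score)
    top_candidate = next(iter([e for e in entity._.umls_ents if e[1] > 0.85]), None)
    if top_candidate:
        cuid = top_candidate[0]
        umls_entities[paper.paper_id].update([cuid])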

In [10]:
for cuid, freq in umls_entities[paper.paper_id].most_common():
    entity = linker.umls.cui_to_entity[cuid]
    print(f"{freq} UID: {entity.concept_id}, name: {entity.canonical_name}, types: {entity.types}")
       

2 UID: C0002520, name: Amino Acids, types: ['T116', 'T121', 'T123']
1 UID: C1414538, name: FBL gene, types: ['T028']
1 UID: C0035701, name: Ribosomal RNA, types: ['T114', 'T123']
1 UID: C0035709, name: Small Nuclear RNA, types: ['T114', 'T123']
1 UID: C0042769, name: Virus Diseases, types: ['T047']
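
The type codes in this output are hard to read on their own. Here is a minimal sketch of how they could be mapped to readable names; the dictionary below is a small hand-written subset of the UMLS semantic types, and `type_names` is a hypothetical helper. In the notebook, the `umls_types` mapping loaded earlier could be passed as `lookup` instead.

```python
from typing import Dict, List

# Hand-written subset of UMLS semantic types, for illustration only.
SEMANTIC_TYPES: Dict[str, str] = {
    "T028": "Gene or Genome",
    "T047": "Disease or Syndrome",
    "T114": "Nucleic Acid, Nucleoside, or Nucleotide",
    "T116": "Amino Acid, Peptide, or Protein",
    "T121": "Pharmacologic Substance",
    "T123": "Biologically Active Substance",
}


def type_names(type_ids: List[str], lookup: Dict[str, str] = SEMANTIC_TYPES) -> List[str]:
    """Map semantic type IDs to names, falling back to the raw ID if it is unknown."""
    return [lookup.get(type_id, type_id) for type_id in type_ids]


print(type_names(["T116", "T121", "T123"]))
print(type_names(["T047", "T999"]))
```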


In [11]:
entities[paper.paper_id]

defaultdict(collections.Counter,
            {'CHEMICAL': Counter({Fibrillarin: 1,
                      rRNA: 1,
                      snRNAs: 1,
                      amino acids: 1,
                      amino acids: 1}),
             'DISEASE': Counter({virus infections: 1})})

What are the top ranked phrases and the entities within them?

In [17]:
def entities_in_range(doc, start:int, end:int):
    """Collect all entities contained in the boundary defined by start and end."""
    return [e for e in doc.ents if e.start >= start and e.end <= end]

In [18]:
for p in doc._.phrases[:40]:
    phrased = entities_in_range(doc, p.chunks[0].start, p.chunks[0].end)
    if phrased:
        print(f"{p.rank:.4f} {p.count:5d}  {p.text}, entity/ies: {phrased}")

0.0926     2  amino acids, entity/ies: [amino acids]
0.0846     1  rrna, entity/ies: [rRNA]
0.0818     3  fibrillarin, entity/ies: [Fibrillarin]
0.0708     1  virus infections, entity/ies: [virus infections]
0.0402     1  snrnas, entity/ies: [snRNAs]


This is a bit more helpful than the plain counter list.

## Next steps
* Get a UMLS account so that we can use the [UMLS API](https://documentation.uts.nlm.nih.gov/rest/relations/index.html) to query additional information. This repository contains very useful [sample code](https://github.com/HHS/uts-rest-api/blob/master/samples/python/retrieve-cui-or-code.py).