# Merging NLP and Standards-based Documentation

In this notebook, we sketch the basic ideas behind a _post hoc_ system: clinicians write text reports as they normally would, and standards-based documentation is then generated by using natural language processing to identify key concepts in the text and linking those concepts to ontologies.

## [spaCy](https://spacy.io/)

In this notebook we will use spaCy, a widely used NLP library for Python, together with a statistical language model trained on English clinical texts.

In [None]:
import json
import os

import medspacy
import spacy
from IPython.display import YouTubeVideo, clear_output
from ipywidgets import IntSlider, interact
from medspacy.visualization import visualize_ent

# Wildcard import kept from the original notebook; it provides ConceptSelector, used below.
from cdsutils.bioportal.metadataCollector import *

In [None]:
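REPORT_DIR = "/home/shared/gen_reports"

# Read the BioPortal API key from a local file so it is not hard-coded in the notebook.
with open(".apikey") as f0:
    apiKey = f0.read().strip()

# Generated reports go to a per-user subdirectory, created if it does not exist.
username = "bcchap"
OUTDIR = os.path.join(REPORT_DIR, username)
os.makedirs(OUTDIR, exist_ok=True)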

In [None]:
nlp = medspacy.load()

In [None]:
nlp.pipe_names

# We can use a machine learning system trained on some real medical reports marked up by humans to mark up problems, tests, and treatments in a clinical note

## Here's a really quick description of how a machine learning algorithm works:

In [None]:
YouTubeVideo("UEm7H8cfz80", start=2600, end=2708, rel=0)

# V. Using a Pre-Trained Machine Learning Model
With **statistical NLP**, you train a machine learning classifier to extract concepts based on annotated datasets.

We'll use a model trained on the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). This model was trained on data for the first subtask in the shared task, referred to in the challenge as **"Clinically relevant events"**. For the purpose of this module, I specifically restricted it to the following labels of **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies


The model has been pre-installed and is available under the name **"en_info_3700_i2b2_2012"**. To install it on your own machine, run:
```bash
pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
```

We can load this model with either spaCy or medspaCy.

In [None]:
# Using spaCy
# nlp = spacy.load("en_info_3700_i2b2_2012")
# Using medspaCy
nlp = medspacy.load("en_info_3700_i2b2_2012")

### spaCy Pipelines

spaCy processes text with a pipeline of components, which can be refined by inserting, deleting, or overwriting steps. The current pipeline consists of the following steps:

In [None]:
nlp.pipe_names

Let's see what labels will be predicted by the NER component:

In [None]:
ner = nlp.get_pipe("ner")
ner.labels

Now let's see what concepts our model extracts. Every target concept in `doc.ents` was extracted by the statistical NER model, while medspaCy continues to extract the modifiers and section titles.

In [None]:
text = """Past Medical History:
1. Type II DM
2. Afib
3. CKD Stage 3

Family History:
1. Breast Cancer


Reason for this examination: Possible pneumonia.

IMPRESSION:
No evidence of pneumonia.

Assessment/Plan:
Continue metformin for type 2 dm."""
doc = nlp(text)

In [None]:
print(doc.ents)

In [None]:
visualize_ent(doc)
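
For a tabular view alongside the visualization, the entities can also be read directly off the processed document. The sketch below assumes only the standard spaCy span attributes (`text`, `label_`, `start_char`, `end_char`); the helper names `entity_rows` and `label_counts` are illustrative and are not part of spaCy or medspaCy.

```python
from collections import Counter


def entity_rows(doc):
    """Return one (text, label, start_char, end_char) tuple per entity.

    `doc` is expected to be a spaCy Doc; only `doc.ents` and the
    standard span attributes are used.
    """
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]


def label_counts(doc):
    """Count how many entities were assigned each label."""
    return Counter(ent.label_ for ent in doc.ents)


# In the notebook, after `doc = nlp(text)`:
# for row in entity_rows(doc):
#     print(row)
# print(label_counts(doc))
```

The character offsets make it possible to check each extracted span against the original note, and the label counts give a quick sense of how the three concept types are distributed.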

## How well does this NLP model work on radiology texts?
### Here is a transcript of the report of my bone scan from 6 April 1976

In [None]:
text = """SCAN IMPRESSION: Negative, essentially normal bone scan, with nonspecific increased
activity in right ankle and foot markedly enlarged right kidney.


BONE SCAN

DOSE:     8.0 Millicuries diphosphonate.

ADMISSION DIAGNOSIS:    Right kidney mass.

BRIEF CLINICAL HISTORY:

Patient had selective renal arteriogram, 4/5/76, which demonstrated a large vascular tumor 
of the right kidney suggestive of a Wilm's tumor.

SCAN DESCRIPTION:

This is a technetium polyphosphate bone scan. The study includes standard anterior
and posterior views of major skeletal structures with lateral views of the skull and
cervical spine. The scan appears entirely normal except that the right kidney is greatly
enlarged, perhaps, 2.5 to 3 times the size of the left kidney. Also, in the region
of the right ankle and foot there is increased diffuse activity, perhaps in soft tissue.
The epiphyses are prominent in both ankles as in other long bones. The activity in the right 
foot and ankle does not appear to represent metastatic disease. The etiology of this is uncertain."""
doc = nlp(text)

## Let's iterate through each entity in `doc.ents` (i.e., all of the problems, tests, and treatments identified by the machine learning system) and see if they are in BioPortal

In [None]:
visualize_ent(doc)
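
The entity-by-entity lookup named in the heading above could be sketched against BioPortal's REST search endpoint using only the standard library. This is a minimal sketch, not the `cdsutils` implementation: the helper names (`build_search_url`, `top_hits`, `search_bioportal`) are hypothetical, and the endpoint, the `q`, `ontologies`, and `require_exact_match` parameters, and the `collection`/`prefLabel`/`@id` response fields are taken from the public BioPortal API as I understand it, so check them against the current documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term, ontologies, exact=False):
    """Build a BioPortal search URL for one term and a list of ontology acronyms."""
    params = {
        "q": term,
        "ontologies": ",".join(ontologies),
        "require_exact_match": str(exact).lower(),
    }
    return f"{BIOPORTAL_SEARCH}?{urlencode(params)}"


def top_hits(payload, n=3):
    """Pull (preferred label, class IRI) pairs from a decoded search response."""
    hits = payload.get("collection", [])
    return [(hit.get("prefLabel"), hit.get("@id")) for hit in hits[:n]]


def search_bioportal(term, ontologies, api_key, n=3, timeout=10):
    """Query BioPortal (network call) and return the first n matches."""
    request = Request(
        build_search_url(term, ontologies),
        headers={"Authorization": f"apikey token={api_key}"},
    )
    with urlopen(request, timeout=timeout) as response:
        return top_hits(json.load(response), n)


# In the notebook, with `doc` and `apiKey` defined above:
# for ent in doc.ents:
#     print(ent.text, ent.label_, search_bioportal(ent.text, ["RADLEX"], apiKey))
```

An entity that returns an empty list has no match in the chosen ontologies, which is one simple way to see how much of the model's output the ontology covers.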

## Identifying Ontologies

Probably no single ontology is sufficient for medical documentation. The RSNA bone scan template explored in [Templated Documentation](templated_documentation.ipynb) used RadLex, LOINC, and SNOMED CT. It may be desirable to use a different ontology for each of the three types of named entities that the NLP system recognizes: 'PROBLEM', 'TEST', and 'TREATMENT'. You can use the `ConceptSelector` instance below to look for matches to an identified term in various ontologies. The first argument is the term to search for, and the second is a list of ontologies to search (e.g. `["LOINC"]` or `["RADLEX", "DOID"]`).

In [None]:
termFinder = ConceptSelector("concept", ["LOINC"], bioportal_api_key=apiKey)
termFinder.display()
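
The idea of using a different ontology for each entity type can be sketched as a simple lookup table. The mapping below is purely illustrative (the choice of ontologies per label is an assumption, not a recommendation from this notebook), and `plan_searches` is a hypothetical helper that only prepares the search terms and does not call BioPortal.

```python
# Hypothetical label-to-ontology mapping; adjust to the ontologies you want to explore.
ONTOLOGIES_BY_LABEL = {
    "PROBLEM": ["SNOMEDCT", "DOID"],
    "TEST": ["LOINC", "RADLEX"],
    "TREATMENT": ["RXNORM"],
}


def plan_searches(entities, mapping=ONTOLOGIES_BY_LABEL):
    """Pair each distinct entity with the ontologies chosen for its label.

    `entities` is an iterable of (text, label) pairs. Case-insensitive
    duplicates and labels missing from the mapping are skipped.
    """
    seen = set()
    plan = []
    for text, label in entities:
        key = (text.lower(), label)
        if key in seen or label not in mapping:
            continue
        seen.add(key)
        plan.append((text, mapping[label]))
    return plan


# In the notebook, with `doc` defined above:
# for term, ontologies in plan_searches((ent.text, ent.label_) for ent in doc.ents):
#     ConceptSelector(term, ontologies, bioportal_api_key=apiKey).display()
```

Deduplicating first keeps the number of searches small when a report mentions the same finding several times.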

#### You can also explore using some additional bone scan reports extracted from MIMIC II demo data

In [None]:
with open("/home/shared/bone_scans.json","r") as f:
    reports = json.load(f)

In [None]:
# The slider indexes into `reports`, so the largest valid value is len(reports) - 1.
@interact(i=IntSlider(min=0, max=len(reports) - 1, value=0))
def view_bone_scan_reports(i):
    clear_output()
    visualize_ent(nlp(reports[i]))


### Questions and Ideas

1. The named entity recognition model identifies only three entity types. What additional categories would we need to recognize to make this useful?
1. How much of an issue do you think it is that the language model was trained without reference to any of these ontologies?
1. We are searching the entire ontology. However, you can change the BioPortal search to explore subsets of the ontology. How might this be useful?