
In [0]:
import spacy
from spacy import displacy
from IPython.display import HTML

In [0]:
# Colors displaCy will use to highlight each entity label
DISPLAY_COLORS = {
    "PROBLEM": "#1f77b4",
    "TREATMENT": "#ff7f0e",
    "TEST": "#2ca02c",
}

# I. NLP in Healthcare
In the [last notebook](https://colab.research.google.com/drive/1ZANeSDAYsFVFgkIrxDVrCxk0_KKm_y0c?usp=sharing), we used a general-domain NLP model to extract information from news articles. However, NLP can be used in many specialties, and each specific domain will be interested in different concepts and tasks. In this notebook, we'll look at a few examples of how NLP can be used in the **clinical** domain for clinical concept extraction.

Healthcare is an **information rich** industry. Every time a patient sees a doctor, fills a prescription, or takes a test, data is generated and stored. Healthcare providers store their data in an **Electronic Health Record (EHR)**. The field of **Healthcare Informatics** deals with how to manage this data and use it to make decisions regarding patient care, clinical research, and healthcare operations.

One important type of data stored in the EHR is **clinical text**. These free-text narratives include detailed, rich information about a patient's treatment, which makes them an excellent target for NLP. However, the language used in clinical text is notoriously complex and difficult to understand, and extracting relevant information from EHR text is often very challenging.


In this notebook, we'll look at how a pre-trained statistical model can extract clinical concepts such as problems, treatments, and tests. Next, we'll see how to write rules to supplement this statistical model. Finally, we'll use the [**cycontext**](https://github.com/medspacy/cycontext) package to detect negation, temporality, uncertainty, and family history.

# II. Statistical Clinical NER

In statistical NLP, we **annotate** a large corpus of text to identify the concepts we're interested in. Then we use those annotations as examples to train a machine learning model. Here, we'll use a statistical model trained to extract the following labels of **clinical concepts**:
- **Problems:** Diagnoses, signs, and symptoms
- **Tests:** Lab and vital measurements
- **Treatments:** Medications, procedures, and therapies
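
To make "annotation" concrete, here is a minimal sketch of what one annotated example might look like: the raw text plus a list of `(start_char, end_char, label)` spans. The `annotate` helper is hypothetical, and the `(text, {"entities": [...]})` layout is an assumption chosen for illustration; the notebook does not say how the training annotations were actually stored.

```python
def annotate(text, phrases):
    """Build character-offset annotations for the first occurrence of each phrase.

    Hypothetical helper for illustration; real corpora are annotated by people.
    """
    entities = []
    for phrase, label in phrases:
        start = text.index(phrase)  # raises ValueError if the phrase is absent
        entities.append((start, start + len(phrase), label))
    return entities


text = "The patient was started on abx for his infection"
entities = annotate(text, [("abx", "TREATMENT"), ("his infection", "PROBLEM")])

# One training example: the text and the labeled spans found in it
example = (text, {"entities": entities})

for start, end, label in entities:
    print(text[start:end], "->", label)
```

A model trained on many such examples learns to predict the spans and labels for text it has not seen before.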

To train this model, I used data from the i2b2 2012 shared task: [**"Evaluating temporal relations in clinical text"**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3756273/). This model was trained on data for the first subtask in the shared task, referred to in the challenge as **"Clinically relevant events"**.

## Using our model
I've uploaded this model to my GitHub. We can install it by pointing pip to the location of the **tar.gz** file:

In [3]:
!pip install https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz

Collecting https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
  Using cached https://github.com/abchapman93/spacy_models/raw/master/releases/en_info_3700_i2b2_2012-0.1.0/dist/en_info_3700_i2b2_2012-0.1.0.tar.gz
Building wheels for collected packages: en-info-3700-i2b2-2012
  Building wheel for en-info-3700-i2b2-2012 (setup.py) ... done
  Created wheel for en-info-3700-i2b2-2012: filename=en_info_3700_i2b2_2012-0.1.0-cp36-none-any.whl size=12270780 sha256=0106ecbd80d2e3596ac72ca4a9c5a275c3d531650f2b1d0487207bba769b622d
  Stored in directory: /root/.cache/pip/wheels/83/ea/74/70f179fd9dea68dccac6b8ab23f8427e3c981282c7bd8082b2
Successfully built en-info-3700-i2b2-2012


Now we can load this model with `spacy.load`, just as we would any other spaCy model. Make sure that you installed this in the [introduction](./00-introduction.ipynb) notebook. I named this model **"en_info_3700_i2b2_2012"**. The **"en"** stands for **"English"** (every spaCy model needs to start with its language), and the rest is named after my UVU class and the shared task.

### **Note**: 
You may have to restart this kernel after installing the model. You can do this and re-run all the cells above this by clicking **"Runtime" >> "Restart runtime" >> "Run before"**.

In [0]:
nlp = spacy.load("en_info_3700_i2b2_2012")

Our pipeline looks similar to the regular spaCy pipeline. But let's look at what labels are included in the `ner` component:

In [5]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [6]:
ner = nlp.get_pipe("ner")
ner.labels

('PROBLEM', 'TEST', 'TREATMENT')

We can now use our i2b2 model in the exact same way that we would use any other model. Let's call our clinical model on an example text.

In [0]:
text = "The patient was started on abx for his infection"
doc = nlp(text)

Let's look at what entities were extracted:

In [8]:
doc.ents

(abx, his infection)

In [9]:
HTML(displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS}))

We can see that our model extracted two clinical entities: **"abx"**, a **treatment**, and **"his infection"**, a **problem**. 

Let's now process a few more documents and view the model output. We'll combine all of the following sentences and process them all as one `doc` so that we can view all the output simultaneously.

In [0]:
texts = [
    "87-year-old man with htn and end-stage renal disease.",
    "His wife recently died from end stage renal disease.",
    "The patient was started on abx for his infection.",
    "There is continued mild-to-moderate congestive heart failure. ",
    "The patient is s/p median sternotomy and right thoracotomy.",
    "The pt presents for ckd stage 4",
    "He previously had CKD stage 3.",
    "The patient presented to the emergency room with cough and fever, concern for infections.",
    "Patient prescribed coumadin for her atrial fibrillation.",
    "Patient prescribed coumadin for her AF.",
]
# Let's join them together to make one long text so we can see all of the examples at the same time
long_text = "\n".join(texts)

In [0]:
doc = nlp(long_text)

In [12]:
HTML(displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS}))

### Discussion
Look at the output of the model above. Do you see any errors in the predictions? Are there any other limitations to the information extracted by the model?

# III. Rule-Based NER
Many models popularly used in the NLP community today are **deep learning-based, statistical models**. These are great when you have large amounts of data which represent the concepts you want to extract. However, one disadvantage is that if a new concept arises which is not present in your dataset, you can't easily add it to your model without adding a large amount of new data.

For example, one application of clinical NLP is **biosurveillance**, where we use NLP to identify occurrences of outbreaks of diseases or syndromes. This is particularly important now during COVID-19. However, the dataset used to train our NLP model is from 2012, long before COVID-19 even existed.

Let's see if our model recognizes COVID-19 as a **"Problem"**:

In [0]:
doc = nlp("He presents to the ER for possible COVID-19.")

In [14]:
doc.ents

()

In [15]:
HTML(displacy.render(doc, style="ent"))



Our model doesn't recognize COVID-19!

For cases like this, we may instead want to use a **rule-based system**. A rule-based system is defined by manually curated rules which define the concepts we're interested in and the patterns in text which we should extract. This may sometimes require more manual effort than a statistical model which learns directly from the data, but it also allows us to extract highly specific concepts.
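
As a rough illustration of the idea, a rule-based extractor can be as small as a list of hand-written phrases and a loop that searches for them. The sketch below uses only Python's `re` module; the `RULES` list and the `match_rules` function are hypothetical and are not part of spaCy or any other library.

```python
import re

# Hypothetical hand-curated rules: (label, phrase)
RULES = [
    ("PROBLEM", "COVID-19"),
    ("PROBLEM", "coronavirus disease"),
]


def match_rules(text, rules=RULES):
    """Return (start, end, label, matched_text) for every rule hit, ignoring case."""
    matches = []
    for label, phrase in rules:
        # The lookarounds keep a phrase from matching inside a longer word
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), label, m.group()))
    return sorted(matches)


print(match_rules("He presents to the ER for possible covid-19."))
```

Adding a new concept here means adding one line to `RULES`, with no retraining. The trade-off is that every spelling variant has to be anticipated by hand.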

One way to implement a rule-based system in spaCy is the [EntityRuler](https://spacy.io/api/entityruler). This class allows us to extract entities by writing rules that match tokens based on various attributes. The matches are then added to `doc.ents`.

A spaCy pattern rule consists of a **"label"** and a **"pattern"**. The simplest pattern is just the string we want to match, but spaCy also allows for more complex pattern matching using token attributes. For more details and examples, see the spaCy documentation on [rule-based matching](https://spacy.io/usage/rule-based-matching).
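
The two styles of pattern might look like the following. The first is a plain string; the second is a token pattern with one dictionary per token, matching on the lowercased token text. The "coronavirus disease" rule is an illustrative addition that is not used elsewhere in this notebook.

```python
# A phrase pattern: matches the exact string
phrase_rule = {"label": "PROBLEM", "pattern": "COVID-19"}

# A token pattern: one dict per token. Matching on LOWER means that
# "Coronavirus Disease" and "coronavirus disease" would both be found.
token_rule = {
    "label": "PROBLEM",
    "pattern": [{"LOWER": "coronavirus"}, {"LOWER": "disease"}],
}

patterns = [phrase_rule, token_rule]

# With an EntityRuler in hand, these would be registered with:
#     ruler.add_patterns(patterns)
```

Token patterns take more typing than strings, but they let a single rule cover variations in case and, with other attributes, in spelling or part of speech.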

Let's instantiate an `EntityRuler`, add a pattern for "COVID-19", and then add it to our pipeline. We can add new components to spaCy by calling `nlp.add_pipe`. This allows us to create custom NLP workflows, which is one of the features that makes spaCy such a powerful library.

In [0]:
from spacy.pipeline import EntityRuler 
ruler = EntityRuler(nlp)

In [0]:
pattern = {"label": "PROBLEM", "pattern": "COVID-19"}
ruler.add_patterns([pattern])

In [0]:
nlp.add_pipe(ruler)

In [19]:
doc = nlp("He presents to the ER for possible COVID-19.")
doc.ents

(COVID-19,)

Now, let's see if we correctly identified COVID-19 as a problem:

In [20]:
HTML(displacy.render(doc, style="ent"))

# IV. Contextual Analysis

We now know how to extract clinical concepts using both a pre-trained statistical model and a rule-based model. We'll now look at one additional common step in clinical NLP: **contextual analysis**.

Do you see any issues with the example below? Did the model predict correctly? Is there any other information we need to extract from the text?

In [0]:
doc = nlp("There is no evidence of pneumonia.")

In [22]:
HTML(displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS}))

In this example, our model correctly extracts **"pneumonia"** as a problem. However, the patient does not actually *have* pneumonia, so we don't want to extract it as one of the patient's problems.

To address this, we'll use a custom spaCy component from a library called `cycontext`.

## The ConText algorithm
One method for performing this analysis is the **ConText** algorithm. This algorithm was originally proposed in this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2757457/).

There are several implementations of ConText and clinical NLP systems which use ConText, including:
- [cycontext](https://github.com/medspacy/cycontext)
- [cTAKES](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995668/)
- [Leo](https://department-of-veterans-affairs.github.io/Leo/index.html)
- [pyConText](https://github.com/chapmanbe/pyConTextNLP)

## How ConText works

ConText connects certain **modifiers**, such as **"no evidence of"** or **"is negative"**, with the target concepts we are extracting. In our example sentence:

---
There is **no evidence of** **_pneumonia_**

---
the **target** is **_pneumonia_**: this is the clinical concept we are trying to extract. The **modifier** is **no evidence of**: this shows that the concept is **negated**. 

ConText finds these targets and modifiers in text and builds a **directed graph** between them, where the targets and modifiers are **nodes** and the edges between them show that the modifier applies to the target. 
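
The graph idea can be sketched in a few lines of plain Python. This toy version is not the cycontext implementation: the `MODIFIERS` and `TARGETS` lexicons are hypothetical, and the only scope rule is "a modifier applies to every target that follows it in the sentence", which is a deliberate simplification.

```python
import re

# Toy lexicons for illustration only
MODIFIERS = {"no evidence of": "NEGATED", "negative for": "NEGATED"}
TARGETS = ["pneumonia", "influenza"]


def find_spans(text, phrases):
    """Return sorted (start, end, phrase) for each case-insensitive occurrence."""
    lowered = text.lower()
    spans = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), lowered):
            spans.append((m.start(), m.end(), phrase))
    return sorted(spans)


def build_graph(sentence):
    """Return edges (modifier, category, target) for targets after a modifier."""
    modifiers = find_spans(sentence, MODIFIERS)
    targets = find_spans(sentence, TARGETS)
    edges = []
    for _, m_end, m_text in modifiers:
        for t_start, _, t_text in targets:
            if t_start >= m_end:  # simplified forward-only scope
                edges.append((m_text, MODIFIERS[m_text], t_text))
    return edges


print(build_graph("There is no evidence of pneumonia."))
```

Each edge plays the role of an arrow from a modifier node to a target node. A target with an incoming "NEGATED" edge would be treated as negated.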

## cycontext
[`cycontext`](https://github.com/medspacy/cycontext) is one component from the [`medSpaCy`](https://github.com/medspacy) project, an ongoing project from a group of NLP developers at the University of Utah (including myself). The goal of medSpaCy is to create a set of clinical NLP tools implemented in Python and using the spaCy framework. MedSpaCy also includes a [clinical sectionizer](https://github.com/medspacy/sectionizer) and will have more components released in the near future.

Let's use cycontext to detect negation in this sentence and then visualize how the algorithm works. We can install cycontext using pip:


In [23]:
# Install the package
!pip install cycontext==1.0.2



First, we'll instantiate the `context` component and add it to our pipeline.

In [0]:
from cycontext import ConTextComponent
from cycontext.viz import visualize_dep, visualize_ent

In [0]:
context = ConTextComponent(nlp)

In [26]:
nlp.add_pipe(context)
nlp.pipe_names

['tagger', 'parser', 'ner', 'entity_ruler', 'context']

When we visualize this text using functions from `cycontext`, we can see that in addition to the **"PROBLEM"** ent we also extracted the phrase **"no evidence of"**, which has an arrow showing that it modifies **"pneumonia"**:

In [27]:
doc = nlp("There is no evidence of pneumonia")
visualize_ent(doc)

In [28]:
visualize_dep(doc)

A new attribute is also added to our entity telling us that this entity is negated:

In [29]:
for ent in doc.ents:
    print(ent, ent._.is_negated)

pneumonia True


## Beyond negation
Clinical text is full of cases where a condition is mentioned, but the patient does not actually have it. In addition to negation, some other cases of this are:
- **Experiencer**: Did someone other than the patient experience the problem, such as a family member?
    - "Her _mother_ had breast cancer"
- **Temporality**: Did the condition occur recently or in the past?
    - "Past medical history significant for afib, CHF, and CKD."
- **Certainty**: Is the diagnosis definite or uncertain?
    - "He presents to the ER for possible COVID-19."
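
To show how these categories might turn into yes/no attributes, here is a deliberately simple sketch that checks a sentence for cue phrases from each category. The `LEXICON` and `context_flags` names are hypothetical, and the check is sentence-level only: unlike ConText, it does not decide which entity a cue applies to.

```python
import re

# Hypothetical cue phrases per category; a real lexicon would be far larger
LEXICON = {
    "negated": ["no evidence of", "negative for"],
    "family": ["mother", "father"],
    "historical": ["past medical history"],
    "uncertain": ["possible"],
}


def context_flags(sentence):
    """Return one boolean per category: does any cue phrase occur in the sentence?"""
    lowered = sentence.lower()
    return {
        category: any(
            re.search(r"\b" + re.escape(cue) + r"\b", lowered) for cue in cues
        )
        for category, cues in LEXICON.items()
    }


for sentence in [
    "Her mother had breast cancer.",
    "Past medical history significant for afib, CHF, and CKD.",
    "He presents to the ER for possible COVID-19.",
]:
    print(sentence, context_flags(sentence))
```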

Let's look at some additional examples of cycontext in action:

In [0]:
texts = [
    "Her mother had breast cancer.",
    "Past medical history significant for afib, CHF, and CKD.",
    "He presents to the ER for possible COVID-19.",
    "Negative for Influenza.",
    "He is not currently taking any beta blockers.",
]

In [0]:
docs = list(nlp.pipe(texts))

In [32]:
doc = docs[4]
visualize_ent(doc)

In [33]:
visualize_dep(doc)

In [34]:
# Check the entity attributes
for ent in doc.ents:
    print(ent, "is_negated:", ent._.is_negated, "\t", "is_family:", ent._.is_family, "\t", "is_historical:", ent._.is_historical, "\t", "is_uncertain:", ent._.is_uncertain)

beta blockers is_negated: True 	 is_family: False 	 is_historical: False 	 is_uncertain: False


# V. Next Steps

These notebooks provided a very high-level overview of NLP. If you're interested in learning more about Natural Language Processing in Python, here are some resources:

- [Advanced NLP with spaCy](https://course.spacy.io/en/): SpaCy's free online course (now available in multiple languages)
- [fast.ai NLP course](https://www.fast.ai/2019/07/08/fastai-nlp/): A free online course from fast.ai using PyTorch and the fast.ai library for NLP
- [Natural language processing: an introduction.](https://www.ncbi.nlm.nih.gov/pubmed/21846786): A 2011 paper with a good, high-level overview of NLP in medicine
- The complete NLP module for UVU's [INFO-3700](https://github.com/abchapman93/info_3700_spring_2020), which includes more in-depth tutorials and interactive examples
- cycontext's GitHub repository: https://github.com/medspacy/cycontext