In [None]:
import spacy
import medspacy

from IPython.display import Image

# I. Overview
As we saw in the last notebook, spaCy's default models don't work well on clinical text out of the box. The information we want to extract from clinical text is different from what we'd extract from news or Wikipedia articles, and clinical text itself is very different from general-domain language:
- **It is very messy**, with semi-structured formatting from the EHR (electronic health record)
- Clinical documents include **many abbreviations**, some of which are ambiguous
- There are **specific tasks** needed in clinical NLP, such as **detecting negation or uncertainty** for concepts in the text

One of the most powerful features of spaCy is that it is **very customizable**. In addition to working with the default models provided in the core library, you can create your own [custom components](https://spacy.io/usage/processing-pipelines#custom-components) or add your own [extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-attributes). Developers and researchers can then publish their spaCy extensions to the open-source community. Some examples of these openly available libraries are:

- [scispacy](https://allenai.github.io/scispacy/): Includes models trained on biomedical literature
- [medCAT](https://github.com/CogStack/MedCAT): Models trained for medical concept extraction
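
To make the idea of custom components and extension attributes concrete, here is a minimal sketch. It assumes the spaCy v3 API (`@Language.component` and `nlp.add_pipe` with a string name); the attribute name `is_abbrev`, the component name `abbrev_flagger`, and the small abbreviation set are all hypothetical and only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Hypothetical extension attribute: marks tokens that look like abbreviations
if not Token.has_extension("is_abbrev"):
    Token.set_extension("is_abbrev", default=False)

# Illustrative abbreviation list, not a real clinical vocabulary
ABBREVIATIONS = {"htn", "ckd", "abx"}

@Language.component("abbrev_flagger")
def abbrev_flagger(doc):
    """Set token._.is_abbrev for tokens found in ABBREVIATIONS (case-insensitive)."""
    for token in doc:
        if token.lower_ in ABBREVIATIONS:
            token._.is_abbrev = True
    return doc

# A blank English pipeline is enough here; no trained model is needed
nlp = spacy.blank("en")
nlp.add_pipe("abbrev_flagger")

doc = nlp("Patient with htn and CKD.")
flagged = [token.text for token in doc if token._.is_abbrev]
print(flagged)
```

Libraries like the ones listed above package components of this kind so that they can be added to a pipeline without being rewritten.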

In the next two notebooks, we'll use [medspacy](https://github.com/medspacy/medspacy), a newly released package for performing clinical NLP tasks in spaCy. 

# medspacy
<img alt="MedSpaCy logo" src="https://github.com/medspacy/medspacy/raw/master/images/medspacy_logo.png">


[medspaCy](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. The goal of medspaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

One of the early uses of medspaCy was a [biosurveillance system for identifying positive cases of COVID-19](https://openreview.net/forum?id=ZQ_HvBxcdCv).


In [None]:
Image(url="https://github.com/medspacy/medspacy/blob/master/images/medspacy_logo.png?raw=true")

## Getting started with medspaCy
To get started with medspaCy, we'll load a model using the `medspacy.load()` function, similar to spaCy:

In [None]:
nlp = medspacy.load()

Let's look at the pipeline components that are loaded. Note that these are different from the components loaded when we called `spacy.load("en_core_web_sm")`.

In [None]:
nlp.pipe_names

Here's a summary of what each of these components does in our pipeline:
- `sentencizer`: Splits a clinical document up into sentences
- `target_matcher`: Rule-based extraction for identifying entities
- `context`: Asserts attributes such as negation and family history for clinical concepts

We'll look at the `context` component in a future notebook. For now, we'll start with `target_matcher`.

In [None]:
# Disable context for now
_ = nlp.remove_pipe("context")

# II. Concept extraction
The first step we'll take is to define the **target concepts** we're interested in. In the previous notebook, spaCy extracted concepts like **"PERSON"** and **"ORG"**. In this notebook, we'll extract the following labels:
- **"PROBLEM"**
- **"TREATMENT"**
- **"TEST"**

Look at the text below. What examples of these clinical concepts can you find in the text?

In [None]:
text = "76 year old man with hypotension, CKD Stage 3, previously ckd stage two, status post RIJ line placement and Swan."

We'll start by building a **rule-based system**. In rule-based NLP, we define patterns to match concepts in text. SpaCy offers many [rule-based methods](https://spacy.io/usage/rule-based-matching). MedSpaCy uses a pipeline component called `TargetMatcher` and rules defined by a class called `TargetRule`. Extracted concepts will be stored as `Span` objects in `doc.ents`.

We can access the target matcher through the `get_pipe()` method:

In [None]:
# Import class for defining rules
from medspacy.ner import TargetRule

In [None]:
target_matcher = nlp.get_pipe("target_matcher")
target_matcher

Target rules require two positional arguments:
- `literal`: A span of text to match in the text (case insensitive)
- `category`: The label to assign to extracted concepts
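
To get a feel for what a literal rule does, the sketch below approximates it with spaCy's core `PhraseMatcher`, matching on the lower-cased token text. This is only a stand-in to illustrate case-insensitive literal matching and is not how medspaCy implements `TargetRule`; the shortened example sentence is also just for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")

# attr="LOWER" compares lower-cased token text, so matching ignores case
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
literals = ["hypotension", "CKD Stage 3"]
matcher.add("PROBLEM", [nlp.make_doc(literal) for literal in literals])

doc = nlp("76 year old man with hypotension, CKD Stage 3, previously ckd stage two.")

# Turn each match into a labeled Span and store the spans as entities
doc.ents = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Note that "ckd stage two" is not matched: a literal only covers the exact sequence of tokens it was given.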

Let's define rules to extract a few of the relevant clinical concepts in the text above.


In [None]:
print(text)

In [None]:
target_rules = [
    TargetRule("hypotension", "PROBLEM"),
    TargetRule("CKD Stage 3", "PROBLEM"),
]

We then add these rules to our target matcher:

In [None]:
target_matcher.add(target_rules)

### Extracting matches
Now, let's process our example text and see what our model extracts. Afterwards, we'll use medspaCy's `visualize_ent` to display the results.

### TODO
Create a new `doc` by calling `nlp()` on the `text` variable. Then print out the entities.

In [None]:
doc = nlp(text)

In [None]:
print(doc.ents)
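
Each entity in `doc.ents` is a spaCy `Span`, so beyond printing the tuple you can inspect its text, label, and character offsets. The sketch below builds a small doc by hand with a blank spaCy pipeline so that it runs on its own; with the medspaCy `doc` above, the final list comprehension should work the same way.

```python
import spacy

nlp = spacy.blank("en")
text = "76 year old man with hypotension, CKD Stage 3."
doc = nlp(text)

# Build entity spans by hand (a stand-in for what the target matcher produces)
spans = []
for phrase in ("hypotension", "CKD Stage 3"):
    start = text.index(phrase)
    spans.append(doc.char_span(start, start + len(phrase), label="PROBLEM"))
doc.ents = spans

# Text, label, and character offsets for each extracted entity
rows = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
for row in rows:
    print(row)
```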

As we can see, we've now extracted the two literals we defined in our **target matcher**. Let's visualize this using `medspacy.visualization.visualize_ent()`, which offers extended visualizations for spaCy output:

In [None]:
from medspacy.visualization import visualize_ent

In [None]:
visualize_ent(doc)

# Advanced pattern matching
So far we have passed simple strings to our `target_matcher` to extract exact matches. However, there may be lots of small variations in the text we want to extract, and it quickly becomes cumbersome to type out every possible string. Instead, we'll do some more advanced matching by using **token attribute matching**.

SpaCy allows us to write patterns based on not only the exact text, but other linguistic attributes such as **part-of-speech tag**, **numerical properties**, **regular expressions**, and much more. 

## Example: Chronic Kidney Disease
In the above text, we extracted two entities, including **"CKD Stage 3"**. However, there's a very similar span of text we want to extract: **"ckd stage two"**. We could write a new pattern to match this, but we would also want to match **"CKD Stage 2"**, **"ckd Stage 4"**, **"CKD Stage 5"**, etc. Instead of trying to think of the near-infinite number of variations, let's write one pattern which will match all of these clinical problems.

An advanced pattern in spaCy is a Python **list**. Each element in that list is a **dictionary** representing each of the **tokens** (individual words) in a span of text. The **keys** of the dictionary represent the token attributes to look at and the **values** represent the values which should trigger a match:

---
```python
[
    {"ATTRIBUTE": value}, # First token
    {"ATTRIBUTE": value}, # Second token
    {"ATTRIBUTE": value} # Third token
]
```

---

Let's now write a pattern which will match both **"CKD Stage 3"** and **"ckd stage two"**. What attributes are similar between these two spans of text? What is a general pattern that you could match?

Both spans of text start out with the text **"CKD"**, although one is upper-case and one is lower-case. To match either, we will match on the **"LOWER"** attribute of the token:

```python
{"LOWER": "ckd"}
```

The second token is **"Stage"**, but again there's a difference in case. So let's use the **"LOWER"** attribute again:

```python
{"LOWER": "stage"}
```

Finally, the last token is a number. In this text there are **"3"** and **"two"**, but there could potentially be any number **1-5**. So let's just match any number. SpaCy can also recognize that the word **"two"** is a number by using the **"LIKE_NUM"** attribute, which is a boolean:

```python
{"LIKE_NUM": True}
```
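
If you're curious what these attributes look like on real tokens, the short sketch below prints them using a blank spaCy English pipeline (an assumption made so the example runs without loading medspaCy). `token.lower_` and `token.like_num` are the token-level counterparts of the `"LOWER"` and `"LIKE_NUM"` pattern keys.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("CKD Stage 3, previously ckd stage two")

# Show the attributes our pattern will be compared against, token by token
attrs = [(token.text, token.lower_, token.like_num) for token in doc]
for text_, lower, like_num in attrs:
    print(f"{text_!r:15} LOWER={lower!r:15} LIKE_NUM={like_num}")
```

Both "3" and "two" report `LIKE_NUM=True`, which is why a single pattern can cover both spellings.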

When we put it all together, here is our pattern.

### TODO
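Fill in the blanks (`__`) in the cell below: the category for this rule and the missing pieces of the three token dictionaries shown above, which go in the **"pattern"** slot.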

In [None]:
ckd_rule = TargetRule(
    "CKD Stage X",
    __,  # Category
    pattern=[
        {"LOWER": __},       # Token 1
        {__: "stage"},       # Token 2
        {"LIKE_NUM": True},  # Token 3
    ],
)

In [None]:
target_matcher.add([ckd_rule])

In [None]:
doc = nlp(text)

In [None]:
doc.ents

In [None]:
visualize_ent(doc)

It worked! Our pattern will also match other variations of chronic kidney disease. Feel free to try it out yourself.
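
One way to try it out is to run the same three-token pattern over a handful of variants. The sketch below uses spaCy's core `Matcher` on a blank English pipeline rather than the medspaCy target matcher, so it runs on its own; the variant strings and the `CKD_STAGE` key are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The three-token pattern described above
pattern = [{"LOWER": "ckd"}, {"LOWER": "stage"}, {"LIKE_NUM": True}]
matcher.add("CKD_STAGE", [pattern])

variants = ["CKD Stage 2", "ckd Stage 4", "CKD stage five", "CKD stage unknown"]
results = {}
for variant in variants:
    doc = nlp(variant)
    results[variant] = [doc[start:end].text for _, start, end in matcher(doc)]
    print(variant, "->", results[variant])
```

The last variant produces no match because "unknown" is not number-like, which shows where the pattern's coverage ends.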

## Treatment entities
We've now extracted all of the **"PROBLEM"** entities from our text. The other class we're interested in now is **"TREATMENT"**, which could include medication, procedures, or therapies. In our text, the two treatments are **"RIJ line placement"** and **"Swan"**. 

### TODO
Add two new patterns to match these treatments. You could either match on exact strings or more complex attributes (like lower-casing) as seen in the examples above.

In [None]:
new_rules = [
    TargetRule(__, __),
    ____
    
]

In [None]:
target_matcher.add(new_rules)

In [None]:
doc = nlp(text)

Now check which ents are extracted. Did you get all of the PROBLEM and TREATMENT entities?

In [None]:
doc.ents

In [None]:
visualize_ent(doc)

# TODO: Write your own rules

Add rules to `target_matcher` that will extract the following concepts from these texts:
- "PROBLEM"
- "TREATMENT"

First, identify all of the **problems** and **treatments** in the texts below. Then write rules and add them to `target_matcher` to extract them from the text. You can use either simple string matching (where only the **"literal"** string is given and is matched case-insensitively) or more complex patterns (where the **"pattern"** value is a list of dicts).
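
As a hint for spelling variations such as a phrase written with or without a hyphen, spaCy token patterns support an `"OP"` key that makes a token optional. The sketch below uses spaCy's core `Matcher` on a blank English pipeline; since the list-of-dicts patterns above follow spaCy's token pattern format, the same pattern should carry over to the `pattern` argument of a `TargetRule`, but treat that as an assumption to verify. The `ESRD` key is just an illustrative name.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "OP": "?" makes the hyphen token optional (zero or one occurrence)
esrd_pattern = [
    {"LOWER": "end"},
    {"ORTH": "-", "OP": "?"},
    {"LOWER": "stage"},
    {"LOWER": "renal"},
    {"LOWER": "disease"},
]
matcher.add("ESRD", [esrd_pattern])

examples = [
    "87-year-old man with htn and end-stage renal disease.",
    "His wife recently died from end stage renal disease.",
]
matched = []
for example in examples:
    doc = nlp(example)
    matched.append([doc[start:end].text for _, start, end in matcher(doc)])
print(matched)
```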

In [None]:
texts = [
    "87-year-old man with htn and end-stage renal disease.",
    "His wife recently died from end stage renal disease.",
    "The patient was started on abx for his infection",
    "There is continued mild-to-moderate congestive heart failure. ",
    "The patient is s/p median sternotomy and right thoracotomy.",
    "The pt presents for ckd stage 4",
    "He previously had CKD stage 3.",
    "The patient presented to the emergency room with cough and fever, concern for infections.",
    "Patient prescribed coumadin for her atrial fibrillation",
    "Patient prescribed coumadin for her AF",
]


In [None]:
rules = [
    TargetRule(__, __),
    # Add additional target rules here
]

In [None]:
target_matcher.add(rules)

Now, we'll process all of our texts:

In [None]:
docs = list(nlp.pipe(texts))

Next, we'll need to go through the results and see if all the concepts we specified are being extracted. SpaCy has great visualization methods and medspaCy extends some of these for inspecting the results of a model.

In the cell below, we'll use a Jupyter Widget to interactively scroll through the docs and visualize the output. Scroll through the docs using either the slider or the **"Previous"/"Next"** buttons.

In [None]:
from medspacy.visualization import MedspaCyVisualizerWidget

In [None]:
w = MedspaCyVisualizerWidget(docs)

#### Note
If the output above won't display, you may need to do some extra configuration to get widgets to show up in notebooks. First, try running these commands in your terminal:

```bash
pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension
```

Then restart your kernel and try again.

If that doesn't work, you can manually change the index and visualize each doc one at a time:

In [None]:
# idx = 0 # Change this to go through the docs
# visualize_ent(docs[idx])
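
If neither visualization is available, a plain-text summary of the entities in each doc is often enough to check your rules. The sketch below is self-contained: it uses spaCy's built-in `entity_ruler` (spaCy v3 API assumed) with two made-up rules as a stand-in for the medspaCy pipeline. With the `docs` list created above, only the final loop is needed.

```python
import spacy

# Stand-in pipeline: a blank English model with spaCy's rule-based entity ruler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PROBLEM", "pattern": [{"LOWER": "infection"}]},
    {"label": "TREATMENT", "pattern": [{"LOWER": "coumadin"}]},
])

texts = [
    "The patient was started on abx for his infection",
    "Patient prescribed coumadin for her AF",
]
docs = list(nlp.pipe(texts))

# One line per doc: the text followed by its (entity text, label) pairs
summary = [(doc.text, [(ent.text, ent.label_) for ent in doc.ents]) for doc in docs]
for doc_text, ents in summary:
    print(doc_text, "->", ents)
```

Entities that are missing from the summary (here "abx" and "AF") point to rules that still need to be written.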

# Next Steps
Rule-based systems can be very effective at extracting specific, targeted information from text. But they have disadvantages: developing a comprehensive set of rules is a largely manual, time-consuming effort.

In the next notebook we'll see how a **statistical model** can be used to extract information without writing specific rules.

[03-statistical-nlp.ipynb](03-statistical-nlp.ipynb)