<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [None]:
from helpers import *

# NLP with medspaCy
This notebook will introduce the Python package `medspaCy`, a toolkit for clinical NLP.

# I. Overview
Clinical text has several unique challenges that make doing NLP with EHR notes difficult. Some examples are:
- **It is very messy**, with semi-structured formatting from the EHR
- Clinical documents include **many abbreviations**, some of which are ambiguous
- There are **specific tasks** needed in clinical NLP, such as **detecting negation or uncertainty** for concepts in the text

Because of these unique challenges, we need specialized tools for working with clinical data. One package we can use for this is called `medspaCy`.

## medspaCy
<img alt="MedSpaCy logo" src="https://github.com/medspacy/medspacy/raw/master/images/medspacy_logo.png">


[`medspaCy`](https://github.com/medspacy/medspacy) is an open-source package maintained by NLP developers at the University of Utah and the US Department of Veterans Affairs. It's built using the popular [spaCy](https://spacy.io/) library and is specifically designed for working with clinical notes. 

The goal of medspaCy is to provide flexible, easy-to-use spaCy components for common clinical NLP tasks, such as:

- Concept extraction
- Negation detection
- Document section splitting

Here are a few papers that used medspaCy:

- [Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8861690/)
- [A Natural Language Processing System for National COVID-19 Surveillance in the US Department of Veterans Affairs](https://aclanthology.org/2020.nlpcovid19-acl.10.pdf)
- [ReHouSED: A novel measurement of Veteran housing stability using natural language processing](https://www.sciencedirect.com/science/article/pii/S153204642100232X?via%3Dihub)
- [Assessing mortality prediction through different representation models based on concepts extracted from clinical notes](https://arxiv.org/pdf/2207.10872.pdf)
- [A Study into patient similarity through representation learning from medical records](https://arxiv.org/pdf/2104.14229.pdf)

## Getting started with medspaCy
This notebook will show how to use medspaCy to process clinical text and introduce some of the spaCy infrastructure. We'll then design some rules to extract concepts from clinical texts.


To get started with medspaCy, we'll import the library and then load a **model** which we will call `nlp`. A model in spaCy is the object which processes a note and performs the various steps of text processing.

In [None]:
import medspacy

### `nlp`
The simplest way to create a model in medspaCy is `medspacy.load()`. This returns an instance of spaCy's `English` class. You can also load models for other languages.

In [None]:
nlp = medspacy.load()
nlp

### `Doc`
To process a text, we call `nlp(text)` and save the result to `doc`. Calling `nlp` on a text returns an object from the `Doc` class. In spaCy, `Doc` objects represent a single text.

In [None]:
text = "Chief complaint: Fever and SOB"
doc = nlp(text)
doc

In [None]:
type(doc)

### `Token`
A `Token` is a single word, symbol, or whitespace in a `doc`. When we create a `doc` object, the text is broken up into individual tokens. This is called **"tokenization"**.

**Discussion**: Look at the tokens generated from this text snippet. What can you say about the tokenization method? Is it as simple as splitting up into words every time we reach a whitespace?

In [None]:
for token in doc:
    print(token)

In [None]:
print(type(token))
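
To make the discussion question concrete, here is a minimal sketch comparing plain whitespace splitting with a simple rule that also separates punctuation. The regular expression is only a rough stand-in chosen for illustration; it is not spaCy's actual tokenizer, which applies many more rules and exceptions.

```python
import re

text = "Chief complaint: Fever and SOB"

# Naive approach: split only on whitespace, so punctuation stays attached.
whitespace_tokens = text.split()

# Rough approximation of rule-based tokenization: runs of word characters
# and single punctuation marks become separate tokens.
# (Illustration only; this is NOT spaCy's tokenizer.)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, the colon stays glued to `complaint:`, while the second approach separates it into its own token.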

If we access a single index of a doc, we get a token:

In [None]:
token = doc[0]
token

In [None]:
MultipleChoiceQuiz("""What would the value be if, instead of <p style="font-family:courier;">doc[0]</p> we had run: <p style="font-family:courier;">text[0]</p>?""",
                  options=["'Chief'", "'C'"], answer="'C'")

### `Span`
While a `Token` represents a single word, a `Span` represents one or more words from a `Doc`. We can get a `Span` by slicing a `Doc` object:

In [None]:
span = doc[0:3]
span

## Pipeline Components
Under the hood, the `nlp` object goes through a number of sequential steps to process the text. This is called a **pipeline** and it allows us to create modular, independent processing steps when analyzing text. We can see the names of our pipeline components through the `nlp.pipe_names` attribute:

In [None]:
nlp.pipe_names

There's also a hidden component called the `tokenizer`, which runs before all of the others. It splits the text up into tokens and creates a `Doc` object, which is then passed on to the rest of the components.

In [None]:
nlp.tokenizer(text)
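
The idea of a pipeline can be sketched in plain Python: a tokenizer creates the document, and each component takes that document, adds information to it, and passes it along. Everything below (`toy_tokenizer`, `mark_uppercase`, `count_tokens`, and the dictionary used as a "doc") is hypothetical and only meant to illustrate the concept; it is not how spaCy or medspaCy is implemented.

```python
def toy_tokenizer(text):
    """Create a minimal 'doc' (a plain dict here) from raw text."""
    return {"text": text, "tokens": text.split()}

def mark_uppercase(doc):
    """Hypothetical component: record tokens written in all caps."""
    doc["uppercase"] = [t for t in doc["tokens"] if t.isupper()]
    return doc

def count_tokens(doc):
    """Hypothetical component: store the number of tokens."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

# Components are stored with a name and applied in order.
toy_pipeline = [("mark_uppercase", mark_uppercase), ("count_tokens", count_tokens)]

def run_pipeline(text):
    doc = toy_tokenizer(text)          # the tokenizer always runs first
    for name, component in toy_pipeline:
        doc = component(doc)           # each step receives and returns the doc
    return doc

toy_doc = run_pipeline("Chief complaint: Fever and SOB")
print([name for name, _ in toy_pipeline])
print(toy_doc["uppercase"], toy_doc["n_tokens"])
```

Because each step only depends on the document it receives, components can be added, removed, or reordered independently.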

We'll learn more about some of these pipeline components in the following notebooks. First, we'll start with the `target_matcher` component and learn how to extract clinical concepts from text.

In [None]:
# For now, remove some components we don't need
nlp.remove_pipe("medspacy_pyrush")
nlp.remove_pipe("medspacy_context")
nlp.pipe_names

## Concept Extraction
One of the first steps in many clinical NLP tasks is identifying particular **concepts** in text. These will vary in each use case, but some common examples of concepts are:
- Diagnoses
- Signs and symptoms
- Medications
- Tests

### TODO
For each of the texts below, identify the best description of the concepts **in bold**.

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_1

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_2

In [None]:
# RUN CELL TO SEE QUIZ
quiz_medical_concepts_3

The task of extracting these spans of text is called **named entity recognition (NER)**. This can be done using either machine learning models or rule-based models. In this class, we'll focus on building rule-based systems. In rule-based NLP, we define patterns to match concepts in text. spaCy offers many [rule-based methods](https://spacy.io/usage/rule-based-matching). medspaCy uses a pipeline component called `TargetMatcher` and rules defined by a class called `TargetRule`. Extracted concepts will be stored as `Span` objects in `doc.ents`.

### `target_matcher`
To start adding rules, we'll first need to access the pipeline component. We can do this by calling `nlp.get_pipe(pipe_name)`:

In [None]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")

Next we need to actually write some rules using the `TargetRule` class. Target rules require two positional arguments:
- `literal`: A span of text to match in the text (case insensitive)
- `category`: The label to assign to extracted concepts

(There are also a few keyword arguments that we'll explore later, but these are the two required arguments.)
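
Conceptually, a rule with only a `literal` and a `category` behaves like a case-insensitive search over the tokens of a text. The sketch below illustrates that idea in plain Python under simplifying assumptions: `simple_tokenize` and `match_literal` are hypothetical helpers, and the regex tokenizer is only a rough stand-in for spaCy's. It is not medspaCy's implementation.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def match_literal(tokens, literal, category):
    """Return (start, end, category) for each case-insensitive match."""
    target = [t.lower() for t in simple_tokenize(literal)]
    lowered = [t.lower() for t in tokens]
    n = len(target)
    return [
        (i, i + n, category)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == target
    ]

tokens = simple_tokenize(
    "Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN and hyperlipidemia"
)
# The literal differs in case from the text but still matches.
matches = match_literal(tokens, "Metastatic Carcinoid Tumor", "DIAGNOSIS")
for start, end, label in matches:
    print(" ".join(tokens[start:end]), "->", label)
```

The match is reported as token positions plus a label, which mirrors how an extracted concept is a span of tokens with a category.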

Let's say that we want to extract patient diagnoses from the following text:

In [None]:
dx_text = "Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN and hyperlipidemia"

There are three diagnoses in this text. The first is `"metastatic carcinoid tumor"`. Let's write a rule to capture this:

In [None]:
from medspacy.target_matcher import TargetRule
rule = TargetRule("metastatic carcinoid tumor", "DIAGNOSIS")

We can then add it to our target matcher:

In [None]:
target_matcher.add(rule)

In [None]:
target_matcher.rules

Now let's process the text above and see if it's extracted by our NLP model by looking at `doc.ents`:

In [None]:
doc = nlp(dx_text)
doc.ents

The `target_matcher` added a `Span` to the doc's entities representing the concept we just extracted. Let's assign this span to the variable `ent`. We can see the concept category by checking the `ent.label_` attribute.

In [None]:
ent = doc.ents[0]
print(ent)
print(type(ent))
print(ent.label_)

`medspaCy` provides some visualization functions which make it easier to look at what has been extracted from the notes:

In [None]:
from medspacy.visualization import visualize_ent


In [None]:
visualize_ent(doc)

### TODO
Edit the cell below to write a list of rules for extracting the two remaining diagnoses from `dx_text`. Then add them to the target matcher and reprocess the doc.

In [None]:
rules = [
    TargetRule("HTN", ____),
    TargetRule(____, ____),
]

In [None]:
target_matcher.____(rules)

In [None]:
doc_dx = nlp(dx_text)

In [None]:
visualize_ent(doc_dx)

In [None]:
doc_dx.ents

In [None]:
# RUN CELL TO TEST VALUE
test_dx_text.test(doc_dx)

### Advanced pattern matching
So far, we have passed simple strings to our `target_matcher` to extract exact matches. However, there may be lots of small variations in the text we want to extract, and it will grow cumbersome to type out every single possible string. Instead, we'll do some more advanced matching by using **token attribute matching**.

spaCy allows us to write patterns based on not only the exact text, but other linguistic attributes such as **part-of-speech tag**, **numerical properties**, **regular expressions**, and much more.

### Example: Chronic Kidney Disease
Each of the texts below mentions a different stage of Chronic Kidney Disease:

---
- 76 year old man with CKD Stage 3.
- relevant diagnoses: ckd stage 4
- The patient has progressed to ckd stage 5
---

We could write different target rules to match each text, but sometimes there are too many combinations to feasibly write out every option. Instead of trying to think of the near-infinite number of variations, let's write one pattern which will match all of these clinical problems.

An advanced pattern in spaCy is a Python **list**. Each element in that list is a **dictionary** representing each of the **tokens** (individual words) in a span of text. The **keys** of the dictionary represent the token attributes to look at and the **values** represent the values which should trigger a match:

---
```python
[
    {"ATTRIBUTE": value}, # First token
    {"ATTRIBUTE": value}, # Second token
    {"ATTRIBUTE": value} # Third token
]
```

Let's now write a pattern which will match both **"CKD Stage 3"** and **"ckd stage two"**. What attributes are similar between these two spans of text? What is a general pattern that you could match?

Both spans of text start out with the text **"CKD"**, although one is upper-case and one is lower-case. To match either, we will match on the **"LOWER"** attribute of the token (which is the lower-case text):

```python
{"LOWER": "ckd"}
```

The second token is **"Stage"**, but again there's a difference in case. So let's use the **"LOWER"** attribute again:

```python
{"LOWER": "stage"}
```

Finally, the last token is a number. In this text there are **"3"** and **"two"**, but there could potentially be any number **1-5**. So let's just match any number. SpaCy can also recognize that the word **"two"** is a number by using the **"LIKE_NUM"** attribute, which is a boolean:

```python
{"LIKE_NUM": True}
```

When we put it all together, here is our pattern:
```python
pattern = [
    {"LOWER": "ckd"}, # Token 1
    {"LOWER": "stage"}, # Token 2
    {"LIKE_NUM": True} # Token 3
]
```

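Once we've written a pattern like this, we can attach it to a target rule using the `pattern` keyword argument. When a pattern is provided, it is used for matching instead of the literal, so the literal (`"CKD Stage X"` here) serves as a descriptive name for the concept.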

```python
TargetRule("CKD Stage X", "DIAGNOSIS", pattern=pattern)
```
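
To show how a list of token dictionaries is compared against text, here is a small pure-Python sketch of the matching logic. It is a simplification for illustration only: `simple_tokenize`, `like_num`, `token_matches`, `find_matches`, and the short `NUMBER_WORDS` set are hypothetical stand-ins, and spaCy's real `Matcher` supports many more attributes.

```python
import re

# Assumption: a tiny list of number words, enough for this example.
NUMBER_WORDS = {"one", "two", "three", "four", "five"}

def simple_tokenize(text):
    """Rough stand-in for a tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def like_num(token):
    """Approximate LIKE_NUM: digits or a known number word."""
    return token.isdigit() or token.lower() in NUMBER_WORDS

def token_matches(token, spec):
    """Check one token against one pattern dictionary."""
    for attribute, expected in spec.items():
        if attribute == "LOWER" and token.lower() != expected:
            return False
        if attribute == "LIKE_NUM" and like_num(token) != expected:
            return False
    return True

def find_matches(tokens, pattern):
    """Slide the pattern over the tokens; return (start, end) of each match."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(t, s) for t, s in zip(tokens[i:i + n], pattern))
    ]

ckd_pattern = [{"LOWER": "ckd"}, {"LOWER": "stage"}, {"LIKE_NUM": True}]

matched_spans = []
for example in ["76 year old man with CKD Stage 3.", "relevant diagnoses: ckd stage two"]:
    tokens = simple_tokenize(example)
    for start, end in find_matches(tokens, ckd_pattern):
        matched_spans.append(" ".join(tokens[start:end]))

print(matched_spans)
```

One pattern covers both the upper-case span with a digit and the lower-case span with a number word.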

Another helpful attribute for advanced pattern matching is `"OP"`, which lets the pattern be flexible on how many times the token is matched:

- `"OP": "?"` matches zero or one time
- `"OP": "*"` matches zero or more times (up to any number)
- `"OP": "+"` matches one or more times

So if we modified the pattern above to include an operator attribute then we could make `"stage"` and the number optional and match just "`ckd`":

```python
pattern = [
    {"LOWER": "ckd"}, # Token 1
    {"LOWER": "stage", "OP": "?"}, # Token 2
    {"LIKE_NUM": True, "OP": "?"} # Token 3
]
```
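
One way to reason about the `"?"` operator is to list every fixed-length pattern it implies, since each optional token can be either present or absent. The helper below, `expand_optional`, is a hypothetical illustration written for this notebook and is not part of spaCy or medspaCy. It handles only `"?"`, because `"*"` and `"+"` allow an unbounded number of repetitions.

```python
from itertools import product

def expand_optional(pattern):
    """List every fixed-length pattern implied by the "?" operators."""
    choices = []
    for spec in pattern:
        required = {k: v for k, v in spec.items() if k != "OP"}
        if spec.get("OP") == "?":
            choices.append([[required], []])  # token present, or token absent
        else:
            choices.append([[required]])      # token must be present
    # Combine one choice per position and flatten into a single pattern.
    return [
        [spec for part in combo for spec in part]
        for combo in product(*choices)
    ]

optional_pattern = [
    {"LOWER": "ckd"},
    {"LOWER": "stage", "OP": "?"},
    {"LIKE_NUM": True, "OP": "?"},
]

variants = expand_optional(optional_pattern)
for variant in variants:
    print(variant)
```

The four variants show that the pattern matches not only "ckd stage 3" and "ckd" but also "ckd stage" and "ckd 3", because the two optional tokens are independent of each other.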

#### TODO
Finish the code below to create a rule which will match all of the CKD examples. Then add it to the pipeline and test your model. You can test it on the four examples below.

In [None]:
texts = [
    "76 year old man with CKD Stage 3.",
    "relevant diagnoses: ckd stage 4",
    "The patient has progressed to ckd stage 5",
    "She was dx'd with CKD in January."
]

In [None]:
rule = TargetRule("CKD Stage X", "DIAGNOSIS", 
                  ____=____
)

In [None]:
target_matcher = nlp.get_pipe("medspacy_target_matcher")
target_matcher.add(rule)

In [None]:
for text in texts:
    visualize_ent(nlp(text))

In [None]:
# RUN CELL TO TEST VALUE
test_ckd_stage_x.test(nlp)

## Concept extraction practice
Let's return to the example discharge summary we looked at in a previous notebook. Add rules to `target_matcher` that will extract the following concepts from the text:
- `"DIAGNOSIS"`
- `"MEDICATION"`
- `"SIGN/SYMPTOM"`
- `"SOCIAL_DETERMINANT"`
- `"PROCEDURE"`

It might be useful to work in teams with clinicians or people familiar with these concepts so you can identify and define them. You don't need to extract every concept from the text (there are a lot!), so just go through the note and add a few examples of each concept. If you'd like to write more sophisticated rules, it may be helpful to review spaCy's [rule-based NLP documentation](https://spacy.io/usage/rule-based-matching#matcher) (look at the documentation under `Matcher`).

In [None]:
# RUN CELL TO SEE HINT
hint_discharge_summ_target_rules

In [None]:
# Load a fresh NLP model
nlp = medspacy.load(enable=["medspacy_target_matcher"])
target_matcher = nlp.get_pipe("medspacy_target_matcher")

In [None]:
rules = [
    TargetRule("SOB", "SIGN/SYMPTOM"),
    TargetRule("DOE", "SIGN/SYMPTOM"),
    
    TargetRule("metastatic carcinoid tumor", "DIAGNOSIS"),
    TargetRule("HTN", "DIAGNOSIS",
              pattern=[{"LOWER": {"IN": ["htn", "hypertension"]}}]),
    TargetRule("hyperlipidemia", "DIAGNOSIS"),
    
    TargetRule(literal="congestive heart failure", category="DIAGNOSIS"),
    TargetRule(literal="diabetes mellitus type 2", category="DIAGNOSIS"),
    TargetRule(literal="basal cell carcinoma", category="DIAGNOSIS"),
    TargetRule(literal="atelectasis", category="DIAGNOSIS"),
    TargetRule(literal="pneumonia", category="DIAGNOSIS"),
    
    TargetRule(literal="homeless", category="SOCIAL_DETERMINANT"),
    TargetRule(literal="employed", category="SOCIAL_DETERMINANT"),
    TargetRule(literal="lives with family", category="SOCIAL_DETERMINANT",
              pattern=[
                  {"LOWER": "lives"},
                  {"LOWER": "with"},
                  {"LIKE_NUM": True, "OP": "?"}, 
                  {"LOWER": {"IN": [
                      "daughter",
                      "daughters",
                      "son",
                      "sons",
                      "family",
                  ]}}
              ]),
    
    TargetRule("CXR", "PROCEDURE")
    
]

target_matcher.add(rules)
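
The `HTN` rule above uses a set-membership value, `{"LOWER": {"IN": [...]}}`, which matches a token whose lower-cased text equals any item in the list. The check can be sketched in plain Python as follows; the token list is made up for illustration, and this is only a simplified picture of what the matcher does.

```python
htn_spec = {"LOWER": {"IN": ["htn", "hypertension"]}}

# Pull the allowed values out of the pattern dictionary.
allowed = htn_spec["LOWER"]["IN"]

# Made-up tokens: two should match, the rest should not.
tokens = ["Hx", "of", "HTN", "and", "Hypertension", ",", "hypotension"]

# A token matches when its lower-cased text is one of the allowed values.
hits = [t for t in tokens if t.lower() in allowed]
print(hits)
```

This lets a single rule cover an abbreviation and its expansion without writing two separate rules.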

In [None]:
doc = nlp(disch_summ)
visualize_ent(doc)