# Student Name:
### TODO
Edit this cell and add your name to the top of the page.

In [None]:
import spacy
from spacy import displacy

In [None]:
# Values to assign for highlighting later
DISPLAY_COLORS = {
    "PROBLEM": "#1f77b4",
    "TREATMENT": "#ff7f0e",
    "TEST": "#2ca02c",

}

# Pattern Matching
It's clear that spaCy's out-of-the-box NER is not going to fit our needs. In that case, we need to take matters into our own hands. SpaCy has several methods which enable us to do rule-based matching, while still having access to the many linguistic attributes which are classified by spaCy's statistical models. 

One such method is called the [EntityRuler](https://spacy.io/api/entityruler). This is a class which allows us to extract entities by writing rules which will match tokens based on various attributes. The matches are then added to `doc.ents`.

Let's load our model. Since we know the NER model won't be useful for our task, we can leave that component out of our pipeline by passing **"ner"** to the `disable` argument of `spacy.load()`.

### TODO
Create `nlp` without an NER component by calling `spacy.load()` and loading the **"en_core_web_sm"** model.

In [None]:
nlp = ____.load(____, disable=["ner"])

Next, let's create an object called `ruler`. It takes one argument: the `nlp` model which we loaded earlier. This will be our rule-based NER matcher. When we first create it, it's blank: there are no labels or patterns included.

In [None]:
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp)

In [None]:
ruler.labels

In [None]:
ruler.patterns

## Adding to our pipeline
In the last notebook, we saw what a spaCy **pipeline** looked like. One of the most powerful features of spaCy is the ability to add to that pipeline. We'll add our `ruler` object to the pipeline so that our rule-based system is applied to texts when we call `nlp(text)`:

### TODO
Add the EntityRuler to the processing pipeline by calling `nlp.add_pipe()` and passing in the ruler as the argument.

In [None]:
____.add_pipe(____)

In [None]:
nlp.pipe_names

# Basic pattern matching

A **pattern** for the `EntityRuler` takes the form of a Python **dictionary** with two keys:
- `"label"`: The class of the entity we want to extract
- `"pattern"`: The pattern we will match on in the text

---

```python
{"label": "LABEL", "pattern": ...}
```

---


The simplest form of this is matching the **exact string**. For example, to match the exact strings **"hypotension"** and **"CKD Stage 3"**, we include patterns with those strings:

In [None]:
patterns = [
    {"label": "PROBLEM", "pattern": "hypotension"},
    {"label": "PROBLEM", "pattern": "CKD Stage 3"}
]

Next, we **add** these patterns to our ruler by calling `ruler.add_patterns()`.

In [None]:
ruler.add_patterns(patterns)

### Extracting matches
Now, let's process that same clinical text we saw in the last notebook and see what entities are extracted.

### TODO
Create a new `doc` by calling `nlp()` on the `text` variable. Then print out the entities.

In [None]:
text = "76 year old man with hypotension, CKD Stage 3, previously ckd stage two, status post RIJ line placement and Swan."
doc = ____

In [None]:
print(doc.ents)

As we can see, we've now extracted the two entities matched by the patterns we defined in our **ruler**. Let's visualize this using `displacy.render()` from spaCy's `displacy` module, which offers visualizations for spaCy output:

In [None]:
displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS})

# Advanced pattern matching
We could pass in simple strings to our `ruler` to extract exact matches. However, there may be lots of small variations in the text we want to extract, and it will grow cumbersome to type out every single possible string. Instead, we'll do some more advanced matching by using **token attribute matching**.

SpaCy allows us to write patterns based on not only the exact text, but other linguistic attributes such as **part-of-speech tag**, **numerical properties**, **regular expressions**, and much more. 

## Example: Chronic Kidney Disease
In the above text, we extracted two entities, including **"CKD Stage 3"**. However, there's a very similar span of text we want to extract: **"ckd stage two"**. We could write a new pattern to match this, but we would also want to match **"CKD Stage 2"**, **"ckd Stage 4"**, **"CKD Stage 5"**, etc. Instead of trying to think of the near-infinite number of variations, let's write one pattern which will match all of these clinical problems.

An advanced pattern in spaCy is a Python **list**. Each element in that list is a **dictionary** representing each of the **tokens** (individual words) in a span of text. The **keys** of the dictionary represent the token attributes to look at and the **values** represent the values which should trigger a match:

---
```python
[
    {"ATTRIBUTE": value}, # First token
    {"ATTRIBUTE": value}, # Second token
    {"ATTRIBUTE": value} # Third token
]
```

---

Let's now write a pattern which will match both **"CKD Stage 3"** and **"ckd stage two"**. What attributes are similar between these two spans of text? What is a general pattern that you could match?

Both spans of text start out with the text **"CKD"**, although one is upper-case and one is lower-case. To match either, we will match on the **"LOWER"** attribute of the token:

```python
{"LOWER": "ckd"}
```

The second token is **"Stage"**, but again there's a difference in case. So let's use the **"LOWER"** attribute again:

```python
{"LOWER": "stage"}
```

Finally, the last token is a number. In this text there are **"3"** and **"two"**, but there could potentially be any number **1-5**. So let's just match any number. SpaCy can also recognize that the word **"two"** is a number by using the **"LIKE_NUM"** attribute, which is a boolean:

```python
{"LIKE_NUM": True}
```
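
To make the mechanics concrete, here is a toy pure-Python sketch of token-by-token attribute matching. It is not spaCy's implementation: `toy_like_num`, `token_matches`, and `find_matches` are hypothetical helpers, tokens come from a plain `str.split()`, and the number-word list is a small assumed subset. It uses a different phrase (**"grade 2"**) so the CKD pattern is still yours to write.

```python
# Toy illustration only: spaCy's matcher is far more capable than this.
NUMBER_WORDS = {"one", "two", "three", "four", "five"}  # assumed subset


def toy_like_num(token):
    """Rough stand-in for LIKE_NUM: digits or a known number word."""
    return token.isdigit() or token.lower() in NUMBER_WORDS


def token_matches(token, spec):
    """Check one token against one pattern dictionary."""
    for attribute, expected in spec.items():
        if attribute == "LOWER":
            actual = token.lower()
        elif attribute == "LIKE_NUM":
            actual = toy_like_num(token)
        else:
            raise ValueError(f"Unsupported attribute: {attribute}")
        if actual != expected:
            return False
    return True


def find_matches(tokens, pattern):
    """Return every run of tokens that satisfies the pattern in order."""
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            matches.append(" ".join(window))
    return matches


grade_pattern = [{"LOWER": "grade"}, {"LIKE_NUM": True}]
tokens = "Tumor was Grade 2 , previously grade two".split()
grade_matches = find_matches(tokens, grade_pattern)
print(grade_matches)
```

Each dictionary is checked against exactly one token, and a span only matches when every token in the window passes its own check, in order.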

When we put it all together, here is our pattern.

### TODO
Create a new pattern to match multiple CKD stages. Give the pattern a label of **"PROBLEM"**. Add the three dictionaries shown above in the **"pattern"** slot.

In [None]:
ckd_pattern = {
    "label": ____,
    "pattern": [
        ____, # Token 1
        ____, # Token 2
        ____ # Token 3
    ]
}


Let's add this to our **ruler** and see if we can match both spans of text. In `doc.ents`, we should expect to see both **"CKD Stage 3"** and **"ckd stage two"**:

In [None]:
ruler.add_patterns([ckd_pattern])

In [None]:
doc = nlp(text)

In [None]:
doc.ents

In [None]:
displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS})

It worked! Our pattern will also match other variations of chronic kidney disease. Feel free to try it out yourself.

## Treatment entities
We've now extracted all of the **"PROBLEM"** entities from our text. The other class we're interested in now is **"TREATMENT"**, which could include medication, procedures, or therapies. In our text, the two treatments are **"RIJ line placement"** and **"Swan"**. 

### TODO
Add two new patterns to match these treatments. You could either match on exact strings or more complex attributes (like lower-casing) as seen in the examples above.

In [None]:
new_patterns = [
    {
        ____: "TREATMENT",
        "pattern": ____
    },
    {
        "label": ____,
        ____: ____
    }
]


In [None]:
ruler.add_patterns(new_patterns)

In [None]:
ruler(doc)

Now check which entities are extracted. Did you get all of the PROBLEM and TREATMENT entities?

In [None]:
doc.ents

In [None]:
displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS})

# Other pattern attributes
In these last two examples, we saw some very simple ways to match patterns. But you can use many more attributes to match complicated patterns in text.

Here are a few other useful pattern attributes you can use:
- `{"LOWER": ...}`: Match on the lower-case string of a token. In clinical text, we'll usually use lower-case matching since capitalization in clinical text is very inconsistent
- `{"LEMMA": ...}`: Match on the **lemma** or **root word** of a token
- `{"POS": ...}`: Match a word with a certain **part of speech tag**, like **"ADJ"**, **"VERB"**, or **"NOUN"**

- `{"LOWER": {"IN": [...]}}`: Match any word which is in a list of strings
- `{"LOWER": {"NOT_IN": [...]}}`: Match any word which is **not** in a list of strings
- `{"LOWER": {"REGEX": ...}}`: Match a regular expression
- `{"IS_TITLE": True}`: Match any word in title case, that is, one which starts with a capital letter

See [spaCy's rule-based matching documentation](https://spacy.io/usage/rule-based-matching) and the [Matcher demo](https://explosion.ai/demos/matcher) for more explanation and examples.
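
The dictionary-valued forms (`IN`, `NOT_IN`, `REGEX`) can be read as small predicates applied to a token attribute. The sketch below mimics that reading in plain Python for the `LOWER` attribute. `check_lower` is a hypothetical helper, not a spaCy function, and it assumes `REGEX` behaves like Python's `re.search`.

```python
import re


def check_lower(token, condition):
    """Toy check of a token's lower-cased text against one condition."""
    value = token.lower()
    if isinstance(condition, str):  # plain string: exact match
        return value == condition
    results = []
    for operator, operand in condition.items():
        if operator == "IN":
            results.append(value in operand)
        elif operator == "NOT_IN":
            results.append(value not in operand)
        elif operator == "REGEX":
            results.append(re.search(operand, value) is not None)
        else:
            raise ValueError(f"Unsupported operator: {operator}")
    return all(results)


in_result = check_lower("HTN", {"IN": ["htn", "hypertension"]})
not_in_result = check_lower("fever", {"NOT_IN": ["cough", "fever"]})
regex_result = check_lower("Antibiotics", {"REGEX": r"^antibiotic"})
print(in_result, not_in_result, regex_result)
```

Because the comparison is made on the lower-cased text, the strings inside `IN`, `NOT_IN`, and `REGEX` should themselves be written in lower case.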

# TODO: Write your own rule-based matcher

Use the `EntityRuler` class to extract the following concepts from these texts:
- "PROBLEM"
- "TREATMENT"

First, identify all of the **problems** and **treatments** in the texts below. Then write patterns and add them to `ruler` to extract them from the text. You can do either simple string matching (where the **"pattern"** value is a string and will match a string exactly) or more complex patterns (where the **"pattern"** value is a list of dicts).

The number of patterns you write might vary, but I wrote **12 patterns** to match **15 entities**.

In [None]:
texts = [
    "87-year-old man with htn and end-stage renal disease.",
    "His wife recently died from end stage renal disease.",
    "The patient was started on abx for his infection",
    "There is continued mild-to-moderate congestive heart failure. ",
    "The patient is s/p median sternotomy and right thoracotomy.",
    "The pt presents for ckd stage 4",
    "He previously had CKD stage 3.",
    "The patient presented to the emergency room with cough and fever, concern for infections.",
    "Patient prescribed coumadin for her atrial fibrillation",
    "Patient prescribed coumadin for her AF",
]

# Let's join them together to make one long text so we can see all of the examples at the same time
long_text = "\n".join(texts)

In [None]:
patterns = [
    {"label": ____, "pattern": ____},
    # etc...

]

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc = nlp(long_text)
doc.ents

In [None]:
# Look at the highlighted text and try to identify any additional 
# concepts which you should extract
displacy.render(doc, style="ent", options={"colors": DISPLAY_COLORS})

In [None]:
print(len(doc.ents))
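
One way to track progress toward the target count is to compare the extracted entity texts with a hand-written list of the ones you expect. The sketch below is only a suggestion: `missing_entities` is a hypothetical helper, and the two lists are illustrative values taken from the earlier worked example rather than from the exercise texts.

```python
from collections import Counter


def missing_entities(expected, extracted):
    """Return expected entity texts not yet extracted (duplicates counted)."""
    remaining = Counter(expected) - Counter(extracted)
    return sorted(remaining.elements())


# Hand-labelled entities you expect to find (illustrative values)
expected = ["hypotension", "CKD Stage 3", "ckd stage two"]
# In the notebook this could be: [ent.text for ent in doc.ents]
extracted = ["hypotension", "CKD Stage 3"]

still_missing = missing_entities(expected, extracted)
print(still_missing)
```

Anything left in `still_missing` points to a span that still needs a pattern.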

# Next Steps
Rule-based systems can be very effective at extracting specific, targeted information from text. But they have disadvantages: developing a comprehensive set of rules to extract concepts takes a great deal of manual effort.

In the next notebook we'll see how a **statistical model** can be used to extract information without writing specific rules.

[03-statistical-nlp.ipynb](03-statistical-nlp.ipynb)

## Week 11 Attendance
Save this notebook as an HTML file and submit it on Canvas for credit for Week 11.