
<img src="https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/blob/main/media/DELPHI-long.png?raw=true" size="20%">
</br>

<h1 valign="center" align="center"><font size="+150">Introduction to NLP in Python</br>Spring 2024</font></h1>

In [None]:
!pip install https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/releases/download/v0.1/delphi_nlp_2024-0.1.tar.gz

In [2]:
!pip install pandas



In [None]:
!pip install scikit-learn

In [4]:
from delphi_nlp_2024 import *
from delphi_nlp_2024.quizzes.quizzes import *
from delphi_nlp_2024.helpers import *

In [5]:
import medspacy
from IPython.display import Image

In [6]:
from medspacy.visualization import visualize_dep, visualize_ent, MedspaCyVisualizerWidget

In the last few notebooks we learned the tools for building an NLP system. In this notebook, we'll put all those tools together and build a full classification system using real-world data.

## The task
We want to identify patients who have radiographic evidence of pneumonia. 

This task

## 0. Load the data
The dataset in these examples is a set of MIMIC-II radiology reports. The annotations were created by University of Utah physician-scientist and pneumonia extraordinaire [Dr. Barbara Jones](https://healthcare.utah.edu/fad/mddetail.php?physicianID=u0102859&name=barbara-e-jones). As baseline to compare our system against we will use a system recently developed by her team for identifying misdiagnosis of pneumonia in clinical notes: [`medspacy_pna`](https://github.com/abchapman93/medspacy_pneumonia). This was system was designed for VA and University of Utah data, so it might not achieve as high of performance on MIMIC data as what is reported in the paper. Let's see if we can beat its performance!

The data is split into two sets: the **training set** and **testing set**. We'll start by developing our system with the training set before doing a final evaluation on the testing set.

### 0.3 Read in the data
Run the code below to read in the training set. The resulting dataframe will have a column for:
- The document name
- The text
- The annotator's document classification (this is the **"truth"**), where 1 indicates there is *positive* evidence of pneumonia and 0 indicates there is *not* positive evidence of pneumonia
- The baseline NLP system's document classification (this is the **"prediction"**)

We'll eventually add another column with our own predictions.

In [7]:
train = load_pneumonia_data()

In [8]:
train.head()

Unnamed: 0,record_id,text,document_classification,split,baseline_document_classification,html
0,subject_id_157_hadm_id_26180,\n\n\n DATE: [**3128-5-28**] 10:42 AM\n ...,1,train,0,<h1>Document classification: 1</h1><div class=...
3,subject_id_7272_hadm_id_19098,\n\n\n DATE: [**2699-1-5**] 12:25 AM\n ...,1,train,1,<h1>Document classification: 1</h1><div class=...
5,subject_id_8156_hadm_id_23798,\n\n\n DATE: [**2533-6-14**] 9:28 PM\n ...,1,train,1,<h1>Document classification: 1</h1><div class=...
7,subject_id_4726_hadm_id_27535,\n\n\n DATE: [**2904-8-20**] 4:47 PM\n ...,0,train,0,<h1>Document classification: 0</h1><div class=...
8,subject_id_26_hadm_id_15067,\n\n\n DATE: [**3079-3-6**] 8:03 AM\n ...,0,train,0,<h1>Document classification: 0</h1><div class=...


In [9]:
len(train)

70

#### TODO
The cell below will display the note and annotations for the note with index `idx`. Scroll through the notes and look at the annotations; what phrases seem to be indicative of a positive document?

In [10]:
idx = 4

visualize_pneumonia_annotations(train.iloc[idx]["html"])


## 1. Document annotation
Before building an NLP system we need to define our concepts annotate a corpus of notes to use as a reference standard. We already have an annotated corpus, so we'll review a few short examples and see how we would annotate them and then look at the reference standard annotations that we already have.

### 1.1
For this task, we will define a **"POS"** note as: 

*A note which contains a positive **or** possible mention of a term referring to pneumonia.*

Consider the following terms to be pneumonia:

- Pneumonia
- Pna
- Opacity
- Infiltrate
- Consolidation

Review the following notes and annotate each as 1 for positive or 0 for negative.

In [11]:
# RUN CELL TO SEE QUIZ
quiz_pna_annotation1

VBox(children=(HTML(value='<p style="font-family:courier";>\n    REASON FOR THIS EXAMINATION:</br>\n      Plea…



In [12]:
# RUN CELL TO SEE QUIZ
quiz_pna_annotation2

VBox(children=(HTML(value='<p style="font-family:courier";>IMPRESSION:  Findings consistent with CHF, although…



In [13]:
# RUN CELL TO SEE QUIZ
quiz_pna_annotation3

VBox(children=(HTML(value='<p style="font-family:courier";>IMPRESSION:</br>\n     1. Mild CHF.</br>\n     2. L…



In [14]:
# RUN CELL TO SEE QUIZ
quiz_pna_annotation4

VBox(children=(HTML(value='<p style="font-family:courier";>IMPRESSION:</br>\n\n     1) Tubes and lines as desc…



### 1.2
The true annotations can be found in `df["document_classification"]`. 

In [15]:
train.groupby("document_classification").size()

document_classification
0    36
1    34
dtype: int64

## Evaluating performace
Once you have an annotated dataset, you can compare your NLP's classifications with the annotator's labels.

- **True positives** are notes that were annotated as positive by the annotator and also classified as positive by the NLP
- **False negatives** were annotated as positives by the annotator but negative by the NLP

In [16]:
quiz_pna_error_type

VBox(children=(HTML(value='Which error category is the classification below?\n</br><strong>Text:</strong> The …



After running the NLP on your dataset, you can compare your system's predicted labels .

We can sumarize the tendency for your model to make false posiitves and false negatives by using two statistics: **precision** and **recall** (also called **positive predictive value** and **sensitivity**).

**Precision** is the 

$$
\text{Precision} = P(\text{Truth}=1|\text{Predicted}=1)
$$

$$
\text{Recall}  = P(\text{Predicted}=1|\text{Truth}=1)
$$

Intuitively, **precision** answers the question: When my system calls a note positive, how likely is it to be correct? Whereas **sensitivity** answers the question: "When there is a positive note, how likely is it my system will call it positive"?

Another important metric is the **F1-score**, wich is the harmonic mean of precision and recall:

$$
\text{F1} = 2 * \frac{\text{Precision}*\text{Recall}}{\text{Precision}*\text{Recall}}
$$

We can estimate precision and recall by counting up true positives and true negatives:

$$
\hat{\text{P}}\text{recision} = \frac{\text{# TPs}}{\text{# TPs} + \text{# FPs}}
$$

$$
\hat{\text{S}}\text{ensitivity} = \frac{\text{# TPs}}{\text{# TPs} + \text{# FNs}}
$$



### Precision-recall tradeoff
There's a trade-off between precision and recall. To illustrate this, consider the following scenario:

---


In [18]:
quiz_precision_recall_all_pos

VBox(children=(HTML(value='I am a lazy NLP developer. I decide to simply predict that every single note is pos…



In [20]:
quiz_precision_recall_all_neg

VBox(children=(HTML(value='Feeling guilty about my job performance, I redesign my system so it calls everythin…



While we are typically not in the scenario of assigning every note the same prediction, we always have to deal with the trade-off between these two metrics. 

#### Discussion
Which is better - low precision or low recall?

Let's evaluate the performance of the baseline NLP classifier on the training data.

#### TODO
The cell below counts up the number of true positives, false positives, and false negatives for the baseline system. Use these counts to calculate precision, recall, and F1, then run the quiz to check your answers.

In [21]:
n_true_positives = train[train["document_classification"] == 1]["baseline_document_classification"].sum()

n_false_positives = train[train["document_classification"] == 0]["baseline_document_classification"].sum()

n_false_negatives = (1 - train[train["document_classification"] == 1]["baseline_document_classification"]).sum()

In [22]:
n_true_positives, n_false_negatives, n_false_positives

(27, 7, 2)

In [23]:
precision = n_true_positives / (n_true_positives + n_false_positives)
recall = n_true_positives / (n_true_positives + n_false_negatives)
f1 = 2*(precision*recall)/(precision+recall)

In [24]:
precision, recall, f1

(0.9310344827586207, 0.7941176470588235, 0.8571428571428571)

In [25]:
quiz_pre_recall_f1.test((precision, recall, f1))

Precision=0.93: Correct!
Recall=0.79: Correct!
F1=0.86: Correct!


#### Discussion
How would you interpret these scores? What does the baseline NLP system do well at/not well at?

# Build your NLP system and process texts
Now that we have some idea about what our dataset contains, let's starting building an NLP system and reviewing the output. First, build an empty NLP system. Then we'll process the notes in our dataset using our system as is (which doesn't have any rules). Go through the output and review the data. Find some examples of pneumonia that you should extract. Then go through and add rules for each of the following components as needed:

1. Add target concept rules to `target_matcher` to identify pneumonia in the text
2. Add ConText rules to `context` to improve attribute assertion
3. Optionally, add additional rules to `sectionizer` if the section logic is helpful for classifying the entities.
4. Build a document classifier which returns `0` or `1` for a doc. A simple version would just use the ConText attributes like `is_negated`, but a more complex version might also use information such as the section of the note.
5. Evaluate the system and review errors

After adding rules, reprocess your notes and review the output again. Since NLP is a computationally expensive procedure, you might want to work in batches of 10 or so before processing the whole corpus.

In [26]:
nlp = medspacy.load()

In [27]:
nlp.pipe_names

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context']

In [28]:
nlp.add_pipe("medspacy_sectionizer")

<medspacy.section_detection.sectionizer.Sectionizer at 0x7faeb1e69690>

### 2.1 Concept extraction
Add rules to the `target_matcher` component to extract mentions of pneumonia.

In [29]:
from medspacy.target_matcher import TargetRule
target_rules = [
    TargetRule("pneumonia", "PNEUMONIA"),
    TargetRule("pna", "PNEUMONIA"),
    TargetRule("opacity", "PNEUMONIA"),
     TargetRule("infiltrate", "PNEUMONIA"),
     TargetRule("consolidiation", "PNEUMONIA"),
]

nlp.get_pipe("medspacy_target_matcher").add(target_rules)

### 2.2 ConText
Add any modifiers that were not captured with the default rule set.

In [30]:
from medspacy.context import ConTextRule
context_rules = [

]

nlp.get_pipe("medspacy_context").add(context_rules)

### 2.3 Sections
Add any section titles which were not detected and led to errors.

In [31]:
from medspacy.section_detection import Sectionizer
section_rules = [

]

nlp.get_pipe("medspacy_sectionizer").add(section_rules)

### 2.4: Document Classification
The function below `classify_pna` takes a doc and returns a `1` if the document is positive for pneumonia and `0` if it is negative.

In [32]:
def classify_pna(doc):
    for ent in doc.ents:
        # Add additional logic as needed
        if ent.label_ == "PNEUMONIA" and not (ent._.is_negated or ent._.is_historical or ent._.is_hypothetical or ent._.is_family):
            return 1
    return 0

Check that your system is working by running the two cells below.

In [33]:
# Should be negative
classify_pna(nlp("There is no evidence of pna"))

0

In [34]:
# Should be positive
classify_pna(nlp("Impression: pneumonia"))

1

In [35]:
%%time
docs = list(nlp.pipe(train["text"]))
train["doc"] =  docs

CPU times: user 4.06 s, sys: 23.7 ms, total: 4.09 s
Wall time: 4.1 s


### 2.5: Evaluate your system on training data
After reprocessing your texts and creating `docs` with an updated NLP, run the code below to get performance metrics for your system. The function `evaluate_system` will return a DataFrame with performance characteristics for your system as well as the baseline system.

Look at the results and ask the following questions:
- What sorts of mistakes does my system appear to be making?
- Is precision or recall higher? What does that mean in the context of the research question?
- How is it comparing to the baseline NLP?

In [36]:
def add_document_classifications(df, clf):
    df["my_document_classification"] = [classify_pna(doc) for doc in df["doc"]]
    return df

In [37]:
# Add your predictions
train = add_document_classifications(train, classify_pna)

In [38]:
train.head()

Unnamed: 0,record_id,text,document_classification,split,baseline_document_classification,html,doc,my_document_classification
0,subject_id_157_hadm_id_26180,\n\n\n DATE: [**3128-5-28**] 10:42 AM\n ...,1,train,0,<h1>Document classification: 1</h1><div class=...,"(\n\n\n , DATE, :, [, *, *, 3128, -, 5, -,...",0
3,subject_id_7272_hadm_id_19098,\n\n\n DATE: [**2699-1-5**] 12:25 AM\n ...,1,train,1,<h1>Document classification: 1</h1><div class=...,"(\n\n\n , DATE, :, [, *, *, 2699, -, 1, -,...",1
5,subject_id_8156_hadm_id_23798,\n\n\n DATE: [**2533-6-14**] 9:28 PM\n ...,1,train,1,<h1>Document classification: 1</h1><div class=...,"(\n\n\n , DATE, :, [, *, *, 2533, -, 6, -,...",0
7,subject_id_4726_hadm_id_27535,\n\n\n DATE: [**2904-8-20**] 4:47 PM\n ...,0,train,0,<h1>Document classification: 0</h1><div class=...,"(\n\n\n , DATE, :, [, *, *, 2904, -, 8, -,...",1
8,subject_id_26_hadm_id_15067,\n\n\n DATE: [**3079-3-6**] 8:03 AM\n ...,0,train,0,<h1>Document classification: 0</h1><div class=...,"(\n\n\n , DATE, :, [, *, *, 3079, -, 3, -,...",0


In [39]:
from sklearn.metrics import classification_report

In [40]:
print(classification_report(train["document_classification"], train["my_document_classification"], labels=[1]))

              precision    recall  f1-score   support

           1       0.65      0.71      0.68        34

   micro avg       0.65      0.71      0.68        34
   macro avg       0.65      0.71      0.68        34
weighted avg       0.65      0.71      0.68        34



### 2.6: Error Analysis
Review examples of mistakes your NLP system made. We'll subset the dataframe to look at **false positives** and **false negatives**.

In [42]:
fps = train.query("document_classification == 0 & my_document_classification == 1")

In [43]:
fns = train.query("document_classification == 1 & my_document_classification == 0")

In [44]:
w_fps = MedspaCyVisualizerWidget(list(fps["doc"]))

Box(children=(HBox(children=(RadioButtons(options=('Ent', 'Dep', 'Both'), value='Ent'), Button(description='Pr…

In [45]:
w_fns = MedspaCyVisualizerWidget(list(fns["doc"]))

Box(children=(HBox(children=(RadioButtons(options=('Ent', 'Dep', 'Both'), value='Ent'), Button(description='Pr…

## 4. Final evaluation
Once you feel like you're ready, read in the testing data, run your NLP on it, and evaluate it. You should do this **one time** so that it is an honest evaluation of how your system will perform on new, unseen data. Once you see the final results, go through the steps we did above with the training data to understand our performance on the testing set and what sorts of errors happened. How did your final system do?

In [46]:
test = load_pneumonia_data("test",)
test["doc"] = list(nlp.pipe(test["text"]))
test = add_document_classifications(test, classify_pna)

In [47]:
len(test)

30

In [48]:
print(classification_report(test["document_classification"], test["my_document_classification"], labels=[1]))

              precision    recall  f1-score   support

           1       0.87      0.93      0.90        14

   micro avg       0.87      0.93      0.90        14
   macro avg       0.87      0.93      0.90        14
weighted avg       0.87      0.93      0.90        14

