<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [2]:
from helpers import *

# Natural Language Processing
When people talk about data science, they are often talking about **Artificial Intelligence (AI)**. Broadly speaking, AI is a set of techniques for generating insights or imitating human decision-making, often using large datasets ("big data"). Most AI systems use **machine learning**: algorithms that learn directly from data with minimal manual input. Some AI systems also use human knowledge in the form of complex rules.

The first part of this module will focus on **Natural Language Processing (NLP)**, a form of AI which deals with text data. We'll learn what goes into an NLP system and build a model which classifies radiology reports for pneumonia. Then we'll build a small **supervised machine learning** system to predict whether a patient has diabetes.

Let's begin by learning about some new types of data in the EHR.

## Unstructured Data in the EHR
When you see a doctor, they enter your information into the EHR in a few different ways. We've already seen some examples like:
- ICD-9/10 codes
- Numeric vital measurements
- Flags for abnormal tests

These are all **structured** data elements: the values are either numeric or discrete elements with a distinct, concrete meaning. Importantly, these values are *computable*: we can take the average of numeric vital measurements or count how often each ICD-10 code appears.
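To make "computable" concrete, here is a minimal sketch using only the Python standard library. The blood pressure values and ICD-10 codes are made up for illustration and do not come from the course data.

```python
from collections import Counter
from statistics import mean

# Hypothetical structured values for one patient (illustrative only).
systolic_bp = [142, 138, 150, 146]
icd10_codes = ["I10", "E78.5", "I10", "R06.02"]

mean_bp = mean(systolic_bp)         # average of numeric vital measurements
code_counts = Counter(icd10_codes)  # how often each ICD-10 code appears

print(mean_bp)
print(code_counts.most_common())
```

Neither operation needs any interpretation of the data first, which is exactly what unstructured data lacks.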

However, some forms of documentation are **unstructured**. Some examples are:
- Videos
- Radiology imaging
- Full-text narratives

Data forms like this are great for humans: they are easy to interpret and can include much more context and nuance than rigid, standardized data elements. However, a computer can't work with them directly. While a collection of pixels can be very meaningful to a radiologist, machines don't inherently have the ability to make sense of them.

This presents a challenge to researchers since unstructured data accounts for a huge amount of the information stored in the EHR. While it would be great to utilize this information, we have to do a little extra work to make sense of it.

## Clinical Narratives
#### TODO
Read the following excerpt of a discharge summary and then complete the quizzes that follow.

In [3]:
print(disch_summ)


Service: MEDICINE

Chief Complaint:
5 days worsening SOB, DOE

History of Present Illness:
Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN, 
hyperlipidemia who reports increasing SOB and DOE starting about 
a month ago but worsening significantly within the last 5 days. 
It has recently gotten so bad he can barely get up out of a 
chair without getting short of breath. He reports orthopnea but no PND. 

He reports no fever or chills, no URI symptoms, no recent travel, no changes 
in his medications.

Pt also reports ~5 episodes of chest pain in the last few weeks 
which he describes as pressure on his mid-sternum and usually 
occurs during exertion.

Past Medical History:
1. metastatic carcinoid tumor, Dx'ed 2002
2. hypertension
3. hyperlipidemia
4. carotid endarterectomy 1999
5. depression/anxiety

Social History:
Previously homeless, now lives with two daughters. Currently employed full-time.

Family History:
early CAD

Brief Hospital Course:
1. SOB: likely from CHF
The patient w

In [4]:
# RUN CELL TO SEE QUIZ
quiz_disch_summ1

VBox(children=(HTML(value='What is the main reason the patient came to the hospital?'), RadioButtons(layout=La…



In [5]:
# RUN CELL TO SEE QUIZ
quiz_disch_summ2

VBox(children=(HTML(value='Which of the following conditions does the patient have?.'), SelectMultiple(options…



In [6]:
# RUN CELL TO SEE QUIZ
quiz_disch_summ3

VBox(children=(HTML(value="The patient doesn't have any living relatives."), RadioButtons(layout=Layout(width=…



In [7]:
# RUN CELL TO SEE QUIZ
quiz_disch_summ4

VBox(children=(HTML(value='How many episodes of chest pain has the patient had in the last few weeks?'), Texta…



### Discussion
As you can see, there's a lot of really useful information in clinical notes. What is the advantage of documenting it using free text? What are some challenges you see with this?

## NLP
NLP systems aim to extract information from unstructured notes like the ones above and transform that information into structured data. For example, given the sentence:

--- 
Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN, hyperlipidemia

---

We might want to create a table of diagnoses that the patient has:

| patient_id | diagnosis                  |
|------------|----------------------------|
| 1          | metastatic carcinoid tumor |
| 1          | HTN                        |
| 1          | hyperlipidemia             |

Or, given a set of chest imaging reports, we might want to classify each one as **positive**, **possible**, or **negative** for pneumonia:

| patient_id | note_id | document_classification |
|------------|---------|-------------------------|
| 1          | 1       | POSSIBLE                |
| 1          | 2       | POSITIVE                |
| 2          | 3       | NEGATIVE                |
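As a rough sketch of how the diagnosis table above could be produced, the snippet below looks up a short list of terms in the example sentence. The term list and the `extract_diagnoses` function are hypothetical and exist only for this illustration. A real system would need a much larger vocabulary and would also have to handle negation (for example "no hyperlipidemia"), which this sketch ignores.

```python
import re

# Hypothetical term list; a real system would use a much larger vocabulary.
DIAGNOSIS_TERMS = ["metastatic carcinoid tumor", "HTN", "hyperlipidemia", "diabetes"]

def extract_diagnoses(patient_id, text):
    """Return one (patient_id, diagnosis) row per term found in the text."""
    rows = []
    for term in DIAGNOSIS_TERMS:
        # \b keeps a short term like "HTN" from matching inside a longer token.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            rows.append((patient_id, term))
    return rows

sentence = "Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN, hyperlipidemia"
rows = extract_diagnoses(1, sentence)
for row in rows:
    print(row)
```

Each printed tuple corresponds to one row of the diagnosis table.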

## Design of NLP Systems
There are two main types of NLP systems: **rule-based** and **machine learning/statistical**. We'll focus mainly on rule-based systems in this class but will briefly go over some of the features of both.

**Rule-based NLP** uses manually defined logic to extract information from text. For example, if you are classifying pneumonia from radiology reports, you could list the terms that clinicians use to describe pneumonia and write code to identify mentions of pneumonia and other contextual information in the text. This works well when a task is highly specific (such as identifying pneumonia in text), so it's often used in applications such as clinical research projects that need information from text. But it doesn't generalize well when the task is very broad (like identifying any clinical concept in a text) or when the language is too complex.
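To give a feel for what such rules can look like, here is a deliberately tiny sketch of a rule-based pneumonia classifier. The term lists, the sentence splitting, and the "strongest assertion wins" document rule are all assumptions made for this illustration; they are not the system built later in this module, and they would misclassify many real reports.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration.
TARGET = re.compile(r"\b(pneumonia|consolidation|infiltrate)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|without|negative for)\b", re.IGNORECASE)
UNCERTAINTY = re.compile(
    r"\b(possible|may represent|cannot exclude|cannot rule out)\b", re.IGNORECASE
)

def classify_report(text):
    """Classify a report as POSITIVE, POSSIBLE, or NEGATIVE for pneumonia."""
    labels = set()
    # Naive sentence split on periods and newlines.
    for sentence in re.split(r"[.\n]+", text):
        if not TARGET.search(sentence):
            continue  # sentence does not mention pneumonia at all
        if NEGATION.search(sentence):
            labels.add("NEGATIVE")
        elif UNCERTAINTY.search(sentence):
            labels.add("POSSIBLE")
        else:
            labels.add("POSITIVE")
    # Document-level rule (an assumption): the strongest assertion wins.
    for label in ("POSITIVE", "POSSIBLE"):
        if label in labels:
            return label
    return "NEGATIVE"

print(classify_report("Right lower lobe consolidation consistent with pneumonia."))
print(classify_report("Opacity may represent pneumonia. Clinical correlation advised."))
print(classify_report("Lungs are clear. No evidence of pneumonia."))
```

Every decision here can be traced back to a specific rule, which makes errors easy to diagnose but also means each new phrasing has to be added by hand.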

Using **statistical NLP** avoids having to write out specific logic in your code. Instead, you annotate large amounts of data and train a **machine learning model** to learn statistical patterns in the data and make predictions. Most NLP research focuses on statistical approaches, particularly [transformers](https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0) and [large language models](https://hai.stanford.edu/news/how-large-language-models-will-transform-science-society-and-ai). These can learn very sophisticated patterns and often achieve higher performance than rule-based models. But they are also computationally expensive, difficult to interpret and understand, and require large amounts of annotated training data, which can be more difficult to produce than writing out rules.
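For contrast with hand-written rules, the sketch below trains a very small naive Bayes classifier, a classic statistical method that is far simpler than the transformers mentioned above. The four training notes and their labels are invented for illustration, and a model trained on so little data would not be usable in practice. The point is only that the code contains no pneumonia-specific logic; everything is learned from the annotated examples.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a very crude tokenizer)."""
    return text.lower().split()

def train_naive_bayes(labeled_notes):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_notes:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_notes = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_notes)  # prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen (label, word) pairs are not zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical annotated training data.
training_notes = [
    ("right lower lobe consolidation consistent with pneumonia", "POSITIVE"),
    ("new infiltrate consistent with pneumonia", "POSITIVE"),
    ("lungs are clear", "NEGATIVE"),
    ("no acute cardiopulmonary process lungs are clear", "NEGATIVE"),
]
model = train_naive_bayes(training_notes)
print(predict(model, "left lower lobe infiltrate"))
print(predict(model, "lungs clear"))
```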

Both rule-based and machine learning approaches have their advantages and disadvantages. In this class we will focus on rule-based NLP, but if you're interested in NLP beyond this course there is lots of exciting work being done in the field of statistical NLP.

#### TODO
Decide whether the scenarios below describe a rule-based or statistical NLP approach.

In [69]:
# RUN CELL TO SEE QUIZ
quiz_rule_based_v_statistical1

VBox(children=(HTML(value='To identify patients with cancer, you review notes and annotate cases of cancer in …



In [73]:
# RUN CELL TO SEE QUIZ
quiz_rule_based_v_statistical2

VBox(children=(HTML(value="You build a Covid-19 surveillance system with NLP which identifies patients who are…



## Measuring System Performance
No NLP system is perfect. Natural language is complex and no system will ever be able to perfectly understand what is in clinical notes. So if you're using NLP, it's important to measure how well it performs and what impact the errors would have on any analyses on your data. **False positives** occur when your system classifies a negative case as positive. **False negatives** are when your system misses a positive case and classifies it as negative.

To do this, when we develop NLP systems we also perform **validation** to understand what types of errors we make. This typically involves the following steps:
1. **Annotate** a set of notes for the concept you're interested in extracting. In this step, human reviewers define the clinical concepts they're interested in and agree upon how to identify them in texts.
2. **Develop** your NLP system by making predictions on your annotated dataset, reviewing errors, and making improvements.
3. **Evaluate** your system by running it on a held-out subset of notes called the *testing set*, which neither the model nor the developer has seen before. This gives you an indication of how well the system will perform on brand new data.
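One simple way to set aside a testing set before development starts is a random split of the annotated notes, sketched below. The function name `split_notes`, the 20% test fraction, and the note IDs are assumptions for illustration. In practice you might split by patient rather than by note so that the same patient never appears in both sets; that refinement is not shown here.

```python
import random

def split_notes(note_ids, test_fraction=0.2, seed=42):
    """Shuffle note IDs and hold out a fraction of them as the testing set."""
    shuffled = list(note_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

annotated_note_ids = list(range(1, 101))  # hypothetical IDs of 100 annotated notes
development_set, testing_set = split_notes(annotated_note_ids)
print(len(development_set), len(testing_set))
```

The development set is used for step 2, and the testing set is only touched once, in step 3.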


When we evaluate our system, there are a few standard quantitative metrics we typically report:
- **Precision**/**Positive Predictive Value**: This tells you how likely it is that a document classified as positive is truly positive. It is calculated as (# of true positives) / (# all predicted positives). It is equivalent to the conditional probability of a note being positive given that it was classified as positive: $$P(Y=1|X=1)$$
where `X` is the note classification and `Y` is the true value.

- **Recall**/**Sensitivity**: This tells you how well your system identifies positive cases. It is calculated as (# of true positives) / (# all actual positives). It is also the conditional probability of a note being classified as positive given that it is actually positive: $$P(X=1|Y=1)$$

- **F1-score**: This is the harmonic mean of precision and recall and is a common summary score for system performance: $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

We'll see an example in this module of developing and validating an NLP system. 
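The three metrics above can be computed directly from counts of true positives, false positives, and false negatives. The helper below is a minimal sketch; the function name and the example counts are hypothetical and are intentionally different from the exercises that follow.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Guard against division by zero when there are no positives.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(30, 10, 20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that true negatives do not appear in any of the three formulas, so these metrics say nothing about how well the system handles negative cases.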

#### TODO
The 2x2 table shows the predicted and true values for an annotated corpus of notes. Rows are the NLP system's predictions and columns are the true (annotated) values.

| NLP      |   Positive |   Negative |
|:---------|-----------:|-----------:|
| Positive |         40 |         15 |
| Negative |         10 |         35 |

In [59]:
# RUN CELL TO SEE QUIZ
quiz_precision

VBox(children=(HTML(value='What is the precision/PPV of the system?'), RadioButtons(layout=Layout(width='auto'…



In [60]:
# RUN CELL TO SEE QUIZ
quiz_recall

VBox(children=(HTML(value='What is the recall/sensitivity of the system?'), RadioButtons(layout=Layout(width='…



#### TODO
Let's say you've developed an NLP system for identifying Covid-19 patients from clinical texts.<br>
Your system processes notes from 1,000 patients, of which 100 are positive and 900 are negative (i.e., prevalence is 0.1).<br>
Your system achieves a perfect precision of 1.0 and a recall of 0.75.

According to your system, what is the prevalence of Covid-19?

In [62]:
# RUN CELL TO SEE HINT
hint_covid_performance

VBox(children=(HTML(value='This hint is for the following quiz.</br><strong>Displaying hint 0/2</strong>'), Ou…



In [65]:
# RUN CELL TO SEE QUIZ
quiz_covid_performance

VBox(children=(HTML(value='What is the estimated prevalence of Covid?'), RadioButtons(layout=Layout(width='aut…

