<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [1]:
from helpers import *

# Textual Data in Medicine
As we've seen throughout this workshop, text is represented in Python with **strings**. So far, we've mainly limited our use of strings to short descriptions which represent structured elements. In this module, we'll see how text data plays a large role in medicine and why it's important to be able to analyze text data.

In this notebook, we'll start by learning about what information is stored in text form in the EHR and then dive deeper into working with strings in Python. Then we'll learn a set of tools for extracting information from clinical text using Natural Language Processing.

## Unstructured Data in the EHR
When you see a doctor, they enter your information into the EHR in a few different ways. We've already seen some examples like:
- ICD-9/10 codes
- Numeric vital measurements
- Flags for abnormal tests

These are all **structured** data elements: the values are either numeric or discrete elements with a distinct, concrete meaning. Importantly, these values are *computable*: we can take the average of numeric vital measurements or count how often each ICD-10 code occurs.
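
To make "computable" concrete, here is a minimal sketch. The heart rates and ICD-10 codes below are invented for illustration and do not come from any patient record:

```python
from collections import Counter

# Hypothetical structured data, invented for illustration
heart_rates = [72, 88, 95, 101]
icd10_codes = ["I50.9", "E11.9", "I50.9", "I10"]

# Numeric values can be averaged directly
mean_heart_rate = sum(heart_rates) / len(heart_rates)
print(mean_heart_rate)

# Discrete codes can be counted directly
code_counts = Counter(icd10_codes)
print(code_counts["I50.9"])
```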

However, some forms of documentation are **unstructured**. Some examples are:
- Videos
- Radiology imaging
- Full-text narratives

Data forms like this are great for humans: they are easy to interpret and can include much more context and nuance than rigid, standardized data elements. However, they can't be computed on directly. While a collection of pixels can be very meaningful to a radiologist, machines don't inherently have the ability to make sense of it.

This presents a challenge to researchers since unstructured data accounts for a huge amount of the information stored in the EHR. While it would be great to utilize this information, we have to do a little extra work to make sense of it.

## Clinical Narratives
#### TODO
Read the following excerpt of a discharge summary and then complete the quizzes that follow.

In [2]:
disch_summ = """
Service: MEDICINE

Chief Complaint:
5 days worsening SOB, DOE
 
History of Present Illness:
Pt is a 63M w/ h/o metastatic carcinoid tumor, HTN, 
hyperlipidemia who reports increasing SOB and DOE starting about 
a month ago but worsening significantly within the last 5 days. 
It has recently gotten so bad he can barely get up out of a 
chair without getting short of breath. He reports orthopnea but no PND. 

He reports no fever or chills, no URI symptoms, no recent travel, no changes 
in his medications.

Pt also reports ~5 episodes of chest pain in the last few weeks 
which he describes as pressure on his mid-sternum and usually 
occurs during exertion.
 
Past Medical History:
1. metastatic carcinoid tumor, Dx'ed 2002
2. hypertension
3. hyperlipidemia
4. carotid endarterectomy 1999
5. depression/anxiety
 
Social History:
Lives alone, has two daughters
 
Family History:
early CAD

Brief Hospital Course:
1. SOB: likely from CHF
The patient was initially diuresed for mild pulmonary edema: he 
received 20 IV Lasix on night of admission and 40mg [**9-10**], with 
good UOP. On [**9-10**], pt was reporting improvement of symptoms and 
able to walk around his room with 4L O2 NC. The following day he 
reported feeling worse, with increasing SOB, and was found to 
now be in oliguric renal failure. CXR [**9-11**] 8am showed showed 
atelectasis with possible superimposed pneumonia. Emergent TTE 
showed decreased EF (30%), anteroapical infarct with 
moderate-to-severe overall left ventricular contractile 
dysfunction; bicusapid aortic valve with at least mild aortic 
stenosis. He was sent to the MICU.
 
Medications on Admission:
ASA 81mg po qd
Lipitor 20mg po qpm

Discharge Disposition:
Extended Care
Discharge Diagnosis:
Primary: congestive heart failure
Secondary: metastatic carcinoid tumor, hypertension, 
hyperlipidemia, diabetes mellitus type 2, basal cell carcinoma
 
Discharge Condition:
good, stable
"""

In [3]:
MultipleChoiceQuiz("What is the main reason the patient came to the hospital?",
                  answer="He was experiencing shortness of breath.",
                  options=[
                      "He was referred by his oncologist.",
                      "He had a fever."
                  ])




In [4]:
SelectMultipleQuiz("Which of the following conditions does the patient have?",
                  answer=["Congestive Heart Failure", "Diabetes", "Cancer"],
                  options=["Pneumonia", "Coronary Artery Disease"]
                  )



In [5]:
MultipleChoiceQuiz("The patient doesn't have any living relatives.", answer="False", shuffle_answer=False)




In [6]:
FreeTextTest("How many episodes of chest pain has the patient had in the last few weeks?", answer=["5", "five"])




### Discussion
As you can see, there's a lot of really useful information in clinical notes. What is the advantage of documenting it using free text? What are some challenges you see with this?

## String methods
- Slicing
- Upper/lower
- Replacing
- Splitting
- Joining
- Formatting

In [7]:
text = """Chief Complaint:
5 days worsening SOB, DOE"""

In [8]:
print(text)

Chief Complaint:
5 days worsening SOB, DOE


In [9]:
text

'Chief Complaint:\n5 days worsening SOB, DOE'

### Slicing
Under the hood, strings act similarly to lists, so we can slice and index them in the same way.

To get the number of characters in a string we use the `len` function:

In [10]:
len("Chief Complaint:")

16

We can access individual characters by index (0-indexed, just like lists):

In [11]:
text[0]

'C'

In [12]:
text[:16]

'Chief Complaint:'

In [13]:
text[-3:-1]

'DO'
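
The end index of a slice is exclusive, and negative indices count back from the end of the string. A small sketch on a different string (the variable name `section` is just for illustration):

```python
section = "Past Medical History:"

# Leaving the start blank begins at the first character
first_word = section[:4]

# A negative start counts from the end; the end index (-1) is excluded
last_word = section[-8:-1]

# Index -1 is the final character
last_char = section[-1]

print(first_word)
print(last_word)
print(last_char)
```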

In [14]:
# RUN CELL TO SEE QUIZ
quiz_text_3

VBox(children=(HTML(value='<h4>TODO</h4>Using the variable `text` that we defined above, what would be the val…



### Substrings
A **substring** is a smaller (or equally-sized) string consisting of consecutive characters within a larger string (or **superstring**). When we slice a string, we get a substring:

In [15]:
sub_text = text[:5]
sub_text

'Chief'

We can check whether a string is contained in a superstring by using the `in` operator:

In [16]:
sub_text in text

True

Strings are case-sensitive, though, so the characters need to match exactly:

In [17]:
"chief" in text

False

### Upper and lower-case
Sometimes we want to change the case of a string, or ignore the case altogether. A few methods return a new string with different capitalization.

The `upper()` method returns an all upper-case version of the string, and `lower()` returns an all lower-case one:

In [18]:
text.upper()

'CHIEF COMPLAINT:\n5 DAYS WORSENING SOB, DOE'

In [19]:
text.lower()

'chief complaint:\n5 days worsening sob, doe'

Converting the text to a single case can make it easier to search for a string:

In [20]:
"chief" in text.lower()

True

The `title()` method capitalizes the first letter of each word but lower-cases the rest:

In [21]:
text.title()

'Chief Complaint:\n5 Days Worsening Sob, Doe'

### Splitting strings
We can split a string into smaller strings wherever a particular character (or sequence of characters) occurs with the `split()` method. This is how **comma-separated values (CSV) files** distinguish between columns:

In [22]:
"name,age,city,state".split(",")

['name', 'age', 'city', 'state']

In [23]:
"alec chapman,29,salt lake city".split(",")

['alec chapman', '29', 'salt lake city']

After splitting a string, we can **unpack it** into distinct values:

In [24]:
name, age, city = "alec chapman,29,salt lake city".split(",")
print(name)
print(age)
print(city)

alec chapman
29
salt lake city


But be careful - if one of the elements actually contains a comma, it will split in a way you might not expect:

In [25]:
"alec chapman,29,salt lake city, utah".split(",")

['alec chapman', '29', 'salt lake city', ' utah']

In [26]:
# Throws an error
name, age, city = "alec chapman,29,salt lake city, ut".split(",")

ValueError: too many values to unpack (expected 3)
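
One common way around this, sketched here under the assumption that the field containing a comma is wrapped in double quotes, is Python's built-in `csv` module, which keeps quoted fields whole:

```python
import csv

# A field containing a comma must be wrapped in quotes for csv to keep it whole
line = 'alec chapman,29,"salt lake city, utah"'

# csv.reader accepts any iterable of lines, so we wrap the single line in a list
row = next(csv.reader([line]))

name, age, city = row
print(city)
```

Without the quotes, `csv.reader` would split the line into four fields, just as `split(",")` does.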

You can split on any character. Splitting on *whitespace* is a simple way to break a string up into individual words (but do you see any problems with this?):

In [27]:
"This is a sentence.".split(" ")

['This', 'is', 'a', 'sentence.']
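
As a rough illustration, compare `split(" ")` with `split()` called without an argument. The sentence below is made up, with a double space and a line break added on purpose:

```python
sentence = "Pt reports  SOB,\nDOE and orthopnea."

# Splitting on a single space leaves an empty string and an embedded newline
by_space = sentence.split(" ")
print(by_space)

# With no argument, split() treats any run of whitespace as one separator
by_whitespace = sentence.split()
print(by_whitespace)
```

In both cases the punctuation stays attached to the words, so `'SOB,'` and `'orthopnea.'` are not yet clean tokens.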

#### TODO
An **empty string** is a string without any characters:

`""`

In [28]:
# RUN CELL TO SEE QUIZ
quiz_len_empty

VBox(children=(HTML(value='What value would be generated by the following code:</br><p style="font-family:cour…



#### TODO


In [29]:
# RUN CELL TO SEE QUIZ
quiz_split_pna_empty

VBox(children=(HTML(value='What would happen if you split the string `"pna"` on an empty string?'), RadioButto…



### Joining strings
The inverse of *splitting* strings is *joining* strings. We saw in the first notebook how to **concatenate** two strings together to create a larger string:

In [30]:
"Chief" + " " + "complaint"

'Chief complaint'

We can take a list of strings and create a string with all of them joined by some character (or multiple characters):

In [31]:
" ".join(["Chief", "complaint:", "5", "days", "worsening", "SOB", "DOE"])

'Chief complaint: 5 days worsening SOB DOE'

In [32]:
"...".join(["This", "feels", "very", "passive", "aggressive", ""])

'This...feels...very...passive...aggressive...'
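
A short sketch of two related points: joining on the same separator undoes a split, and `join()` only accepts strings, so other values need converting first. The numbers in `vitals` are invented for illustration:

```python
original = "5 days worsening SOB"

# Splitting and re-joining on the same separator gives back the original string
words = original.split(" ")
rejoined = " ".join(words)
print(rejoined == original)

# join() raises a TypeError on non-strings, so convert each value with str()
vitals = [98, 72, 120]
vitals_text = ", ".join(str(v) for v in vitals)
print(vitals_text)
```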

### String formatting
Sometimes we want to use "template" texts and fill in values based on variables. For example, back in the first notebook we had a function called `print_name` which would create and print a string based on the function arguments. **String formatting** is a nice way to do this. One option is so-called `f-strings`, which are marked with an `f` before the opening quote and contain variable names in curly brackets `{}`:

In [33]:
first = "Alec"
last = "Chapman"
print(f"My name is {first} {last}")

My name is Alec Chapman


You can also use the string method `.format()`:

In [34]:
print("My name is {} {}".format("Alec", "Chapman"))

My name is Alec Chapman
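
Both styles also accept a format specification after a colon, and `.format()` can take named placeholders. A brief sketch; the weight value is invented, and the variable names are only for illustration:

```python
weight_kg = 81.456
ef_percent = 30

# ":.1f" rounds the number to one decimal place when it is inserted
weight_line = f"Weight: {weight_kg:.1f} kg"
print(weight_line)

# Named placeholders make longer templates easier to read
ef_line = "EF: {ef}%".format(ef=ef_percent)
print(ef_line)
```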


## Practice with strings

### 1.
In NLP, **tokenization** is the process of splitting a text into individual words. It can be informative to see which unique words appear in a document. Split `disch_summ` into tokens and then create an object containing unique tokens. As an optional next step, count how many times each token appears.

In [35]:
# RUN CELL TO SEE HINT
hint_tokenize_disch_summ 




In [36]:
tokens = disch_summ.split(" ")
set(tokens)

{'\n\nHe',
 '\nDischarge',
 '\nFamily',
 '\nHistory',
 '\nIt',
 '\nMedications',
 '\nPast',
 '\nService:',
 '\nSocial',
 '\na',
 '\nable',
 '\natelectasis',
 '\nchair',
 '\ndysfunction;',
 '\ngood',
 '\nhyperlipidemia',
 '\nhyperlipidemia,',
 '\nin',
 '\nmoderate-to-severe',
 '\nnow',
 '\noccurs',
 '\nreceived',
 '\nreported',
 '\nshowed',
 '\nstenosis.',
 '\nwhich',
 '(30%),',
 '1999\n5.',
 '2,',
 '20',
 '2002\n2.',
 '20mg',
 '40mg',
 '4L',
 '5',
 '63M',
 '81mg',
 '8am',
 'Admission:\nASA',
 'CAD\n\nBrief',
 'CHF\nThe',
 'CXR',
 'Care\nDischarge',
 'Complaint:\n5',
 'Condition:\ngood,',
 'Course:\n1.',
 'DOE',
 'DOE\n',
 'Diagnosis:\nPrimary:',
 'Disposition:\nExtended',
 "Dx'ed",
 'EF',
 'Emergent',
 'HTN,',
 'He',
 'History:\n1.',
 'History:\nLives',
 'History:\nearly',
 'Hospital',
 'IV',
 'Illness:\nPt',
 'Lasix',
 'MEDICINE\n\nChief',
 'MICU.\n',
 'Medical',
 'NC.',
 'O2',
 'On',
 'PND.',
 'Present',
 'SOB',
 'SOB,',
 'SOB:',
 'TTE',
 'The',
 'UOP.',
 'URI',
 '[**9-10**],',
 '[**

In [37]:
from collections import Counter
Counter(tokens)

Counter({'\nService:': 1,
         'MEDICINE\n\nChief': 1,
         'Complaint:\n5': 1,
         'days': 1,
         'worsening': 2,
         'SOB,': 2,
         'DOE\n': 1,
         '\nHistory': 1,
         'of': 6,
         'Present': 1,
         'Illness:\nPt': 1,
         'is': 1,
         'a': 2,
         '63M': 1,
         'w/': 1,
         'h/o': 1,
         'metastatic': 3,
         'carcinoid': 3,
         'tumor,': 3,
         'HTN,': 1,
         '\nhyperlipidemia': 1,
         'who': 1,
         'reports': 4,
         'increasing': 2,
         'SOB': 1,
         'and': 5,
         'DOE': 1,
         'starting': 1,
         'about': 1,
         '\na': 1,
         'month': 1,
         'ago': 1,
         'but': 2,
         'significantly': 1,
         'within': 1,
         'the': 3,
         'last': 2,
         '5': 1,
         'days.': 1,
         '\nIt': 1,
         'has': 2,
         'recently': 1,
         'gotten': 1,
         'so': 1,
         'bad': 1,
         'he': 4,


### 2.
Clinical notes are often structured with different *sections* which describe different parts of a patient's care. You can often recognize a new section by a title followed by a colon and a new line:

```
History of Present Illness:
The pt is a 63M w...
```

Each of the texts below comes from a different section of a clinical note. Write a function called `get_section_name` which takes a text and returns the name of the section. The expected values are written as comments. Test it on the three strings below. Note that the character `"\n"` indicates a new line.

In [38]:
texts = [
    "Chief Complaint:\n5 days worsening SOB, DOE", # "Chief Complaint"
    "History of Present Illness:\nPt is a 63M w/ h/o metastatic carcinoid tumor.", # "History of Present Illness"
    "Social History:\nLives alone with two daughters." # "Social History"
]

In [39]:
def get_section_name(text):
    return text.split(":\n")[0]

In [40]:
# RUN CELL TO TEST FUNCTION
test_get_section_name.test(get_section_name)

That is correct!
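
As an aside, a variation on the same idea is `str.partition`, which splits at the first occurrence of the separator only and also hands back the rest of the text. The function name `get_section_name_and_body` and the choice to return `None` when no header is found are assumptions made for this sketch, not part of the exercise:

```python
def get_section_name_and_body(text):
    # partition() always returns three parts: before, separator, after
    name, separator, body = text.partition(":\n")
    if not separator:
        # No section header found; returning None for the name is an arbitrary choice
        return None, text
    return name, body

result = get_section_name_and_body("Social History:\nLives alone, has two daughters")
print(result)
```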


### 3.
Write a function called `pneumonia_in_text` which checks whether a string contains *"pneumonia"* or the abbreviation *"pna"*. Test it on the following strings:


In [41]:
pna_strings = [
    "The patient has pneumonia.",
    "INDICATION: EVALUATE FOR PNEUMONIA",
    "Patient shows symptoms concerning for pna.",
    "The chest image found no evidence of pna",
]

In [42]:
def pneumonia_in_text(text):
    # Lower-case once so the search ignores capitalization
    lowered = text.lower()
    return "pneumonia" in lowered or "pna" in lowered

In [43]:
# RUN CELL TO TEST FUNCTION
test_pneumonia_in_text.test(pneumonia_in_text)

That is correct!


In [44]:
# RUN CELL TO SEE QUIZ
quiz_mc_pneumonia_in_text

VBox(children=(HTML(value='If the function above returns True, that means the note indicates the patient has p…
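
Plain substring matching also fires when `"pna"` happens to appear inside a longer word. A possible refinement, sketched here with the standard `re` module, is to require word boundaries. The names `PNEUMONIA_PATTERN` and `pneumonia_word_in_text` are made up for this example:

```python
import re

# \b marks a word boundary, so "pna" must stand alone as a word
PNEUMONIA_PATTERN = re.compile(r"\b(pneumonia|pna)\b", re.IGNORECASE)

def pneumonia_word_in_text(text):
    return PNEUMONIA_PATTERN.search(text) is not None

print(pneumonia_word_in_text("Patient shows symptoms concerning for pna."))
print(pneumonia_word_in_text("Pt reports hypnagogic hallucinations."))
print(pneumonia_word_in_text("The chest image found no evidence of pna"))
```

The second sentence contains the letters "pna" inside "hypnagogic", so a plain `in` check would match it while the pattern does not. The third sentence still matches: neither approach looks at the words around the mention.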



### 4.
Edit the code below so that the function generates a string giving the patient's name, age, and chief complaint based on the argument values.

In [45]:
# RUN CELL TO SEE HINT
hint_generate_chief_complaint




In [46]:
def generate_chief_complaint(name, age, chief_complaint):
    text = "Alec Chapman is a 29-year-old patient who presents today with a broken arm."
    return text

In [47]:
# Each line should print a different value
print(generate_chief_complaint("Alec Chapman", "29", "a broken arm"))
print(generate_chief_complaint("John Doe", "41", "cough and fever"))
print(generate_chief_complaint("Jane Doe", "61", "symptoms concerning for pneumonia"))

Alec Chapman is a 29-year-old patient who presents today with a broken arm.
Alec Chapman is a 29-year-old patient who presents today with a broken arm.
Alec Chapman is a 29-year-old patient who presents today with a broken arm.


In [48]:
def generate_chief_complaint(name, age, chief_complaint):
    text = f"{name} is a {age}-year-old patient who presents today with {chief_complaint}."
    return text

In [49]:
# Each line should print a different value
print(generate_chief_complaint("Alec Chapman", "29", "a broken arm"))
print(generate_chief_complaint("John Doe", "41", "cough and fever"))
print(generate_chief_complaint("Jane Doe", "61", "symptoms concerning for pneumonia"))

Alec Chapman is a 29-year-old patient who presents today with a broken arm.
John Doe is a 41-year-old patient who presents today with cough and fever.
Jane Doe is a 61-year-old patient who presents today with symptoms concerning for pneumonia.
