<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [1]:
from helpers import *

# Text Data in Python
As we've seen throughout this workshop, text is represented in Python with **strings**. So far, we've mainly limited our use of strings to short descriptions which represent structured elements. In this module, we'll see how text data plays a large role in medicine and why it's important to be able to analyze text data.

In this notebook, we'll dive deeper into working with strings in Python. Then the next few notebooks will demonstrate a set of tools for extracting information from clinical text using Natural Language Processing.


## String methods
Strings are defined using quotation marks (single, double, or triple). For example, these are all valid ways of defining strings:

In [17]:
string1 = 'This is a string'
string2 = "This is also a string."

text = """Chief Complaint:
5 days worsening SOB, DOE"""

In [16]:
text

'Chief Complaint:\n5 days worsening SOB, DOE'

Let's learn a few methods for working with string data.

### Slicing
Strings are sequences of characters, so we can index and slice them the same way we do lists.

To get the number of characters in a string we use the `len` function:

In [3]:
len("Chief Complaint:")

16

We can access individual characters by index (strings are 0-indexed, like lists and other Python sequences):

In [4]:
text[0]

'C'

In [5]:
text[:16]

'Chief Complaint:'

In [6]:
text[-3:-1]

'DO'
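
Two more details are worth keeping in mind when slicing. The sketch below re-defines `text` so it runs on its own, then shows that leaving off the end index slices through the end of the string, and that (unlike lists) strings cannot be changed in place:

```python
text = "Chief Complaint:\n5 days worsening SOB, DOE"

# Leaving off the end index slices through the end of the string
last_three = text[-3:]
print(last_three)

# Unlike lists, strings are immutable: assigning to an index raises a TypeError
try:
    text[0] = "c"
    changed = True
except TypeError:
    changed = False
print(changed)
```

To "change" a string, we build a new one instead, which is what the string methods later in this notebook do.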

In [7]:
# RUN CELL TO SEE QUIZ
quiz_text_3

VBox(children=(HTML(value='<h4>TODO</h4>Using the variable `text` that we defined above, what would be the val…



### Substrings
A **substring** is a smaller (or equally-sized) string consisting of consecutive characters within a larger string (or **superstring**). When we slice a string, we get a substring:

In [10]:
sub_text = text[:5]
sub_text

'Chief'

We can check whether a string is contained in a superstring by using the `in` keyword:

In [11]:
sub_text in text

True

Strings are case-sensitive, though, so the characters need to match exactly:

In [12]:
"chief" in text

False

### Upper and lower-case
Sometimes we want to change the case of a string, or ignore the case altogether. A few methods return a new string with different capitalization.

The `upper()` method returns an all upper-case version of the string, and `lower()` returns an all lower-case one:

In [21]:
text.upper()

'CHIEF COMPLAINT:\n5 DAYS WORSENING SOB, DOE'

In [22]:
text.lower()

'chief complaint:\n5 days worsening sob, doe'

Switching to all one case can make it easier to search a text for a string:

In [23]:
"chief" in text.lower()

True
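
The `in` check only tells us *whether* a substring occurs. If we also want to know *where*, the built-in `find()` method returns the index of the first match, or `-1` when there is none. A minimal sketch, re-defining `text` so the block is self-contained:

```python
text = "Chief Complaint:\n5 days worsening SOB, DOE"

# find() returns the index where the first match starts
start = text.find("SOB")
print(text[start:start + 3])

# find() returns -1 when the substring does not occur
missing = text.find("pneumonia")
print(missing)
```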

The `title()` method capitalizes the first letter of each word but lower-cases the rest:

In [24]:
text.title()

'Chief Complaint:\n5 Days Worsening Sob, Doe'
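
One caveat with `title()`: it treats any run of letters as a word, so a letter that follows an apostrophe is capitalized too. The short sketch below uses a made-up phrase to show this, alongside `capitalize()`, which only upper-cases the first character of the whole string:

```python
note = "patient's chief complaint"

# title() capitalizes the letter after the apostrophe as well
titled = note.title()
print(titled)       # Patient'S Chief Complaint

# capitalize() upper-cases only the first character of the string
capitalized = note.capitalize()
print(capitalized)  # Patient's chief complaint
```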

### Splitting strings
We can split a string into smaller strings wherever a particular character (or sequence of characters) occurs with the `split()` method. This is how **comma-separated values (CSV) files** separate the columns in each row:

In [25]:
"name,age,city,state".split(",")

['name', 'age', 'city', 'state']

In [26]:
"alec chapman,29,salt lake city".split(",")

['alec chapman', '29', 'salt lake city']

After splitting a string, we can **unpack it** into distinct values:

In [27]:
name, age, city = "alec chapman,29,salt lake city".split(",")
print(name)
print(age)
print(city)

alec chapman
29
salt lake city


But be careful - if one of the elements actually contains a comma, it will split in a way you might not expect:

In [28]:
"alec chapman,29,salt lake city, utah".split(",")

['alec chapman', '29', 'salt lake city', ' utah']

In [29]:
# Throws an error
name, age, city = "alec chapman,29,salt lake city, ut".split(",")

ValueError: too many values to unpack (expected 3)
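
There are a couple of ways to work around this. The sketch below shows two options, neither of which is necessarily how you would handle real data: limiting the number of splits with the `maxsplit` argument, and using Python's built-in `csv` module, which understands quoted fields. The quoted example line is made up for illustration.

```python
import csv

line = "alec chapman,29,salt lake city, ut"

# Split at most twice, so everything after the second comma stays together
name, age, city = line.split(",", maxsplit=2)
print(city)

# If the field is quoted, the csv module keeps the embedded comma intact
quoted_line = 'alec chapman,29,"salt lake city, ut"'
row = next(csv.reader([quoted_line]))
print(row)
```

The `maxsplit` approach only helps when the extra commas are in the last field, which is why real CSV files rely on quoting instead.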

You can split by any character. Splitting on *whitespace* is a simple way to break a string up into individual words (but do you see any problems with this?):

In [30]:
"This is a sentence.".split(" ")

['This', 'is', 'a', 'sentence.']

In [33]:
"name\tage".split("\t")

['name', 'age']
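
Calling `split()` with no argument behaves a little differently from `split(" ")`: it splits on any run of whitespace, including tabs and new lines, and it drops empty pieces. A small sketch with a made-up string containing a double space and a new line:

```python
messy = "5 days  worsening\nSOB"

# Splitting on a single space leaves an empty string and keeps the "\n"
by_space = messy.split(" ")
print(by_space)

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = messy.split()
print(by_whitespace)
```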

#### TODO
An **empty string** is a string without any characters:

`""`

In [31]:
# RUN CELL TO SEE QUIZ
quiz_len_empty

VBox(children=(HTML(value='What value would be generated by the following code:</br><p style="font-family:cour…



In [32]:
# RUN CELL TO SEE QUIZ
quiz_split_pna_empty

VBox(children=(HTML(value='What would happen if you split the string `"pna"` on an empty string?'), RadioButto…



### Joining strings
The inverse of *splitting* strings is *joining* strings. We saw in the first notebook how to **concatenate** two strings together to create a larger string:

In [34]:
"Chief" + " " + "complaint"

'Chief complaint'

We can take a list of strings and create a string with all of them joined by some character (or multiple characters):

In [35]:
" ".join(["Chief", "complaint:", "5", "days", "worsening", "SOB", "DOE"])

'Chief complaint: 5 days worsening SOB DOE'

In [36]:
"...".join(["This", "feels", "very", "passive", "aggressive", ""])

'This...feels...very...passive...aggressive...'
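
Note that `join()` only works on strings. If the list contains other types, such as an integer age, each value has to be converted with `str()` first. A minimal sketch using an illustrative row:

```python
row = ["alec chapman", 29, "salt lake city"]

# Joining a list that contains a non-string raises a TypeError
try:
    ",".join(row)
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)

# Converting every value to a string first makes the join work
line = ",".join(str(value) for value in row)
print(line)
```

Splitting `line` on `","` gives back a list of strings, so the age comes back as `'29'` rather than the integer `29`.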

### String formatting
Sometimes we want to use "template" texts and fill in values based on variables. For example, way back in the first notebook we had a function called `print_name` which would create and print a string based on the function arguments. **String formatting** is a nice way to do this. One approach is to use so-called **f-strings**, which are denoted with an `f` before the opening quotation mark and contain variable names in curly brackets `{}`:

In [42]:
first = "Alec"
last = "Chapman"
print(f"My name is {first} {last}")

My name is Alec Chapman


You can also use the string method `.format()`:

In [40]:
print("My name is {} {}".format("Alec", "Chapman"))

My name is Alec Chapman
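
Both styles also accept a *format specification* after a colon inside the curly brackets, which is handy for rounding numbers. The sketch below uses a made-up temperature value; `.1f` means "one digit after the decimal point":

```python
name = "Alec"
temp_c = 38.456  # hypothetical value, for illustration only

# f-string with a format specification
with_fstring = f"{name} has a temperature of {temp_c:.1f} C"
print(with_fstring)

# The same template with .format() and named fields
with_format = "{name} has a temperature of {temp:.1f} C".format(name=name, temp=temp_c)
print(with_format)
```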


## Practice with strings

### 1.
In NLP, **tokenization** is the process of splitting a text into individual words. It can be informative to see which unique words appear in a document. Split `disch_summ` into tokens and then create an object containing unique tokens. As an optional next step, count how many times each token appears.

In [43]:
# RUN CELL TO SEE HINT
hint_tokenize_disch_summ 

VBox(children=(HTML(value='</br><strong>Displaying hint 0/2</strong>'), Output(), Button(description='Get hint…



In [53]:
from collections import Counter

Counter(disch_summ.lower().split(" "))

Counter({'\nservice:': 1,
         'medicine\n\nchief': 1,
         'complaint:\n5': 1,
         'days': 1,
         'worsening': 2,
         'sob,': 2,
         'doe\n\nhistory': 1,
         'of': 6,
         'present': 1,
         'illness:\npt': 1,
         'is': 1,
         'a': 2,
         '63m': 1,
         'w/': 1,
         'h/o': 1,
         'metastatic': 3,
         'carcinoid': 3,
         'tumor,': 3,
         'htn,': 1,
         '\nhyperlipidemia': 1,
         'who': 1,
         'reports': 4,
         'increasing': 2,
         'sob': 1,
         'and': 5,
         'doe': 1,
         'starting': 1,
         'about': 1,
         '\na': 1,
         'month': 1,
         'ago': 1,
         'but': 2,
         'significantly': 1,
         'within': 1,
         'the': 4,
         'last': 2,
         '5': 1,
         'days.': 1,
         '\nit': 1,
         'has': 1,
         'recently': 1,
         'gotten': 1,
         'so': 1,
         'bad': 1,
         'he': 6,
         'can': 

In [52]:
set(disch_summ.lower().split(" "))

{'\n\nhe',
 '\na',
 '\nable',
 '\natelectasis',
 '\nchair',
 '\ndysfunction;',
 '\ngood',
 '\nhyperlipidemia',
 '\nhyperlipidemia,',
 '\nin',
 '\nit',
 '\nmoderate-to-severe',
 '\nnow',
 '\noccurs',
 '\nreceived',
 '\nreported',
 '\nservice:',
 '\nshowed',
 '\nstenosis.',
 '\nwhich',
 '(30%),',
 '1999\n5.',
 '2,',
 '20',
 '2002\n2.',
 '20mg',
 '40mg',
 '4l',
 '5',
 '63m',
 '81mg',
 '8am',
 '[**9-10**],',
 '[**9-11**]',
 'a',
 'about',
 'admission',
 'admission:\nasa',
 'ago',
 'also',
 'and',
 'anteroapical',
 'aortic',
 'around',
 'as',
 'at',
 'bad',
 'barely',
 'basal',
 'be',
 'bicusapid',
 'breath.',
 'but',
 'cad\n\nbrief',
 'can',
 'carcinoid',
 'carcinoma\n\ndischarge',
 'care\ndischarge',
 'carotid',
 'cell',
 'changes',
 'chest',
 'chf\nthe',
 'chills,',
 'complaint:\n5',
 'condition:\ngood,',
 'congestive',
 'contractile',
 'course:\n1.',
 'currently',
 'cxr',
 'daughters.',
 'day',
 'days',
 'days.',
 'decreased',
 'depression/anxiety\n\nsocial',
 'describes',
 'diabetes',


### 2.
Clinical notes are often structured with different *sections* which describe different parts of a patient's care. You can often recognize a new section by a title followed by a new line:

```
History of Present Illness:
The pt is a 63M w...
```

Each of the texts below comes from a different section of a clinical note. Write a function called `get_section_name` which takes a text and returns the name of the section. The expected values are written as comments. Test it on the three strings below. Note that the character `"\n"` indicates a new line.

In [54]:
texts = [
    "Chief Complaint:\n5 days worsening SOB, DOE", # "Chief Complaint"
    "History of Present Illness:\nPt is a 63M w/ h/o metastatic carcinoid tumor.", # "History of Present Illness"
    "Social History:\nLives alone with two daughters." # "Social History"
]

In [63]:
# Define a function called get_section_name
def get_section_name(text):
    return text.split(":")[0]


In [64]:
# RUN CELL TO TEST FUNCTION
test_get_section_name.test(get_section_name)

That is correct!


In [65]:
get_section_name(texts[2])

'Social History'
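
Splitting on `":"` works for these three texts, but it depends on the title ending in a colon. An alternative sketch, under the assumption that the title is always the first line, is to split on the new line and then remove any trailing colon. The function name `get_section_name_first_line` and the second example string are hypothetical, used only for illustration:

```python
def get_section_name_first_line(text):
    # Take everything before the first new line, then drop a trailing colon
    first_line = text.split("\n")[0]
    return first_line.strip().rstrip(":")

with_colon = get_section_name_first_line("Social History:\nLives alone with two daughters.")
print(with_colon)

# A title without a colon, followed by a body that happens to contain one
without_colon = get_section_name_first_line("Social History\nWakes at 8:00 daily.")
print(without_colon)
```

Which version is better depends on how consistently the notes are formatted.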

### 3.
Write a function called `pneumonia_in_text` which checks whether a string contains *"pneumonia"* or the abbreviation *"pna"*. Test it on the following strings:


In [66]:
pna_strings = [
    "The patient has pneumonia.",
    "INDICATION: EVALUATE FOR PNEUMONIA",
    "Patient shows symptoms concerning for pna.",
    "The chest image found no evidence of pna",
]

In [77]:
def pneumonia_in_text(text):
    text = text.lower()
    return (("pneumonia" in text) or ("pna" in text))

In [74]:
# RUN CELL TO TEST FUNCTION
test_pneumonia_in_text.test(pneumonia_in_text)

That is correct!


In [75]:
# RUN CELL TO SEE QUIZ
quiz_mc_pneumonia_in_text

VBox(children=(HTML(value='If the function above returns True, that means the note indicates the patient has p…
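
A separate limitation of substring matching is that short abbreviations can appear *inside* other words. As a rough sketch (not part of the exercise), Python's built-in `re` module can require word boundaries around the match. The function name `pneumonia_word_in_text` and the example sentence are hypothetical:

```python
import re

# \b marks a word boundary, so "pna" must stand alone as a word
PNA_PATTERN = re.compile(r"\b(?:pneumonia|pna)\b", flags=re.IGNORECASE)

def pneumonia_word_in_text(text):
    return PNA_PATTERN.search(text) is not None

unrelated = "Patient reports hypnagogic hallucinations."

# The plain substring check matches the "pna" inside "hypnagogic"
substring_match = "pna" in unrelated.lower()
print(substring_match)

# The word-boundary version does not
word_match = pneumonia_word_in_text(unrelated)
print(word_match)
```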



### 4.
Edit the code below so that the function generates a string giving the patient's name, age, and chief complaint based on the argument values.

In [76]:
# RUN CELL TO SEE HINT
hint_generate_chief_complaint

VBox(children=(HTML(value='</br><strong>Displaying hint 0/1</strong>'), Output(), Button(description='Get hint…



In [None]:
def generate_chief_complaint(name, age, chief_complaint):
    text = "Alec Chapman is a 29-year-old patient who presents today with a broken arm."
    return text

In [None]:
# Each line should print a different value
print(generate_chief_complaint("Alec Chapman", "29", "a broken arm"))
print(generate_chief_complaint("John Doe", "41", "cough and fever."))
print(generate_chief_complaint("Jane Doe", "61", "symptoms concerning for pneumonia."))
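
One possible solution, sketched with an f-string, is shown below. Because some of the example complaints already end in a period and others do not, this version assumes we want exactly one period at the end and strips any trailing period before adding its own:

```python
def generate_chief_complaint(name, age, chief_complaint):
    # Remove a trailing period, if any, so the sentence ends with exactly one
    complaint = chief_complaint.rstrip(".")
    return f"{name} is a {age}-year-old patient who presents today with {complaint}."

first = generate_chief_complaint("Alec Chapman", "29", "a broken arm")
second = generate_chief_complaint("John Doe", "41", "cough and fever.")
print(first)
print(second)
```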