<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="45%"><img src="../media/Univ-Utah.jpeg"><br>
</td>
    <td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">University of Utah<br>Population Health Sciences<br>Data Science Workshop</font></h1></td>
<td valign="center" align="center" width="45%"><img
src="../media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

In [None]:
from helpers import *

# Text Data in Python
As we've seen throughout this workshop, text is represented in Python with **strings**. So far, we've mainly limited our use of strings to short descriptions which represent structured elements. In this module, we'll see how text data plays a large role in medicine and why it's important to be able to analyze text data.

In this notebook, we'll dive deeper into working with strings in Python. Then the next few notebooks will demonstrate a set of tools for extracting information from clinical text using Natural Language Processing.


## String methods
Strings are defined using quotation marks (single, double, or triple). For example, these are all valid ways of defining strings:

In [None]:
string1 = 'This is a string'
string2 = "This is also a string."
text = """Chief Complaint:
5 days worsening SOB, DOE"""

Let's learn a few methods for dealing with string data.

### Slicing
Under the hood, strings act similarly to lists, so we can slice and index them in the same way.

To get the number of characters in a string we use the `len` function:

In [None]:
len("Chief Complaint:")

We can access individual characters by index and ranges of characters by slice. Strings are 0-indexed, like lists:

In [None]:
text[0]

In [None]:
text[:16]

In [None]:
text[-3:-1]
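
The slices above follow the `[start:stop]` pattern, where the character at `stop` is excluded and negative numbers count back from the end of the string. Here is a minimal sketch of that behavior, using a hypothetical variable `note` that mirrors the `text` string defined earlier:

```python
# Hypothetical example string, mirroring `text` defined earlier
note = "Chief Complaint:\n5 days worsening SOB, DOE"

first_char = note[0]     # index 0 is the first character
header = note[:16]       # characters 0 through 15 (the stop index is excluded)
last_char = note[-1]     # negative indices count from the end
near_end = note[-3:-1]   # third-from-last up to, but not including, the last

# repr() makes whitespace and quotes visible in the output
print(repr(first_char))
print(repr(header))
print(repr(last_char))
print(repr(near_end))
```

Because the stop index is excluded, `note[-3:-1]` leaves out the final character; `note[-3:]` would include it.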

In [None]:
# RUN CELL TO SEE QUIZ
quiz_text_3

### Substrings
A **substring** is a smaller (or equally-sized) string consisting of consecutive characters within a larger string (or **superstring**). When we slice a string, we get a substring:

In [None]:
sub_text = text[:5]
sub_text

We can check whether a string is contained in a superstring by using the `in` keyword:

In [None]:
sub_text in text

Strings are case-sensitive, though, so the characters need to match exactly:

In [None]:
"chief" in text

### Upper and lower-case
Sometimes we want to change the case of a string, or ignore the case altogether. A few methods return a new string with different capitalization.

`text.upper()` returns an all upper-case version of the string, and `text.lower()` returns an all lower-case one:

In [None]:
text.upper()

In [None]:
text.lower()

Switching to all one case can make it easier to search a text for a string:

In [None]:
"chief" in text.lower()

The `title()` method capitalizes the first letter of each word but lower-cases the rest:

In [None]:
text.title()

### Splitting strings
We can split a string into smaller strings wherever a particular character (or sequence of characters) occurs by using the `split()` method. This is how **comma-separated files** mark the boundaries between columns:

In [None]:
"name,age,city,state".split(",")

In [None]:
"alec chapman,29,salt lake city".split(",")

After splitting a string, we can **unpack it** into distinct values:

In [None]:
name, age, city = "alec chapman,29,salt lake city".split(",")
print(name)
print(age)
print(city)

But be careful - if one of the elements actually contains a comma, it will split in a way you might not expect:

In [None]:
"alec chapman,29,salt lake city, utah".split(",")

In [None]:
# Throws an error
name, age, city = "alec chapman,29,salt lake city, ut".split(",")
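
One way to guard against this is the optional `maxsplit` argument of `split()`. The sketch below assumes the record always has exactly three fields and that only the last one may contain a comma. For files where a field is wrapped in quotes, the standard library `csv` module is usually the more robust choice. The variable names `record`, `quoted_record`, and `row` are made up for illustration:

```python
import csv
import io

record = "alec chapman,29,salt lake city, ut"

# maxsplit=2 stops after two splits, so the rest of the string stays together
name, age, city = record.split(",", 2)
print(name)
print(age)
print(city)

# csv.reader keeps a quoted field intact even when it contains a comma
quoted_record = 'alec chapman,29,"salt lake city, ut"'
row = next(csv.reader(io.StringIO(quoted_record)))
print(row)
```

The `maxsplit` approach only works when you know which field can contain the separator. Quoting the field, as CSV files typically do, avoids that assumption.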

You can split by any character. Splitting on *whitespace* is a simple way to break a string up into individual words (but do you see any problems with this?):

In [None]:
"This is a sentence.".split(" ")
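
As a side note, calling `split()` with no argument behaves differently from `split(" ")`: it splits on any run of whitespace (spaces, tabs, newlines) and drops the empty pieces. A small sketch, using a made-up string `messy`:

```python
# Hypothetical string with a double space and a newline
messy = "This  is a\nsentence."

by_space = messy.split(" ")    # splits only on single space characters
by_whitespace = messy.split()  # splits on any run of whitespace

print(by_space)
print(by_whitespace)
```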

#### Empty strings
An **empty string** is a string without any characters:

`""`

In [None]:
# RUN CELL TO SEE QUIZ
quiz_len_empty

In [None]:
# RUN CELL TO SEE QUIZ
quiz_split_pna_empty

### Joining strings
The inverse of *splitting* strings is *joining* strings. We saw in the first notebook how to **concatenate** strings to create a larger string:

In [None]:
"Chief" + " " + "complaint"

We can take a list of strings and create a string with all of them joined by some character (or multiple characters):

In [None]:
" ".join(["Chief", "complaint:", "5", "days", "worsening", "SOB", "DOE"])

In [None]:
"...".join(["This", "feels", "very", "passive", "aggressive", ""])
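
Since `join()` undoes `split()` when both use the same separator (and no element contains that separator), a round trip gives back the original list. The sketch below illustrates this with made-up variable names, and also shows that `join()` accepts only strings, so other types need converting first:

```python
fields = ["name", "age", "city", "state"]

header_line = ",".join(fields)       # list -> single string
round_trip = header_line.split(",")  # string -> list again

print(header_line)
print(round_trip == fields)

# join() raises a TypeError on non-strings, so convert each value with str()
values = ["alec chapman", 29, "salt lake city"]
row_line = ",".join(str(v) for v in values)
print(row_line)
```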

### String formatting
Sometimes we want to use "template" texts and fill in values based on variables. For example, in the first notebook we had a function called `print_name` which created and printed a string based on the function arguments. **String formatting** is a nice way to do this. One approach is so-called `f-strings`, which are marked with an `f` before the opening quote and contain variable names in curly brackets `{}`:

In [None]:
first = "Alec"
last = "Chapman"
print(f"My name is {first} {last}")

You can also use the string method `.format()`:

In [None]:
print("My name is {} {}".format("Alec", "Chapman"))
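
Both styles can do a little more than substitute plain variables. The sketch below shows an expression inside the braces, a format specifier after a colon, and keyword placeholders with `.format()`. The variable `temp_f` and its value are hypothetical, chosen only for illustration:

```python
first = "Alec"
last = "Chapman"
temp_f = 101.2567  # hypothetical temperature reading

# Expressions are evaluated inside the braces
greeting = f"My name is {first.upper()} {last}"

# A format specifier follows a colon; .1f rounds to one decimal place
temp_line = f"Temperature: {temp_f:.1f} F"

# .format() can also fill placeholders by keyword
named = "My name is {first} {last}".format(first=first, last=last)

print(greeting)
print(temp_line)
print(named)
```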

## Practice with strings

### 1.
In NLP, **tokenization** is the process of splitting a text into individual words. It can be informative to see which unique words appear in a document. Split `disch_summ` into tokens and then create an object containing unique tokens. As an optional next step, count how many times each token appears.

In [None]:
# RUN CELL TO SEE HINT
hint_tokenize_disch_summ 

### 2.
Clinical notes are often structured with different *sections* which describe different parts of a patient's care. You can often recognize a new section by a title followed by a new line:

```
History of Present Illness:
The pt is a 63M w...
```

Each of the texts below comes from a different section of a clinical note. Write a function called `get_section_name` which takes a text and returns the name of the section. Test it on the three strings below; the expected values are written as comments. Note that the character `"\n"` indicates a new line.

In [None]:
texts = [
    "Chief Complaint:\n5 days worsening SOB, DOE", # "Chief Complaint"
    "History of Present Illness:\nPt is a 63M w/ h/o metastatic carcinoid tumor.", # "History of Present Illness"
    "Social History:\nLives alone with two daughters." # "Social History"
]

In [None]:
# Define a function called get_section_name

In [None]:
# RUN CELL TO TEST FUNCTION
test_get_section_name.test(get_section_name)

### 3.
Write a function called `pneumonia_in_text` which checks whether a string contains *"pneumonia"* or the abbreviation *"pna"*. Test it on the following strings:


In [None]:
pna_strings = [
    "The patient has pneumonia.",
    "INDICATION: EVALUATE FOR PNEUMONIA",
    "Patient shows symptoms concerning for pna.",
    "The chest image found no evidence of pna",
]

In [None]:
def pneumonia_in_text(text):
    ____

In [None]:
# RUN CELL TO TEST FUNCTION
test_pneumonia_in_text.test(pneumonia_in_text)

In [None]:
# RUN CELL TO SEE QUIZ
quiz_mc_pneumonia_in_text

### 4.
Edit the code below so that the function generates a string giving the patient's name, age, and chief complaint based on the argument values.

In [None]:
# RUN CELL TO SEE HINT
hint_generate_chief_complaint

In [None]:
def generate_chief_complaint(name, age, chief_complaint):
    text = "Alec Chapman is a 29-year-old patient who presents today with a broken arm."
    return text

In [None]:
# Each line should print a different value
print(generate_chief_complaint("Alec Chapman", "29", "a broken arm"))
print(generate_chief_complaint("John Doe", "41", "cough and fever"))
print(generate_chief_complaint("Jane Doe", "61", "symptoms concerning for pneumonia"))