
<img src="https://github.com/abchapman93/DELPHI_Intro_to_NLP_Spring_2024/blob/main/media/DELPHI-long.png?raw=true" size="20%">
</br>

<h1 valign="center" align="center"><font size="+150">Introduction to NLP in Python</br>Spring 2024</font></h1>

In [1]:
%load_ext autoreload
%autoreload 2

In [28]:
import sys
sys.path.insert(0, "..")

from delphi_nlp_2024 import *
from delphi_nlp_2024.quizzes.quizzes import *
from delphi_nlp_2024.helpers import *

# Text Data in Python
In the last notebook we saw how important clinical information is recorded using text. In this notebook we'll grow more comfortable working with strings.

## String methods
Strings are defined using quotation marks (single, double, or triple). For example, these are all valid ways of defining strings:

In [5]:
string1 = 'This is a string'
string2 = "This is also a string."
text = """Chief Complaint:
5 days worsening SOB, DOE"""

Let's learn a few methods for working with string data.

### Slicing
Under the hood, strings act similarly to lists, so we can slice and index them in the same way.

To get the number of characters in a string we use the `len` function:

In [6]:
len("Chief Complaint:")

16

We can access individual characters by index. Strings are 0-indexed, like lists and other Python sequences:

In [7]:
text[0]

'C'

In [8]:
text[:16]

'Chief Complaint:'

In [9]:
text[-3:-1]

'DO'
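
A slice `text[start:stop]` includes the character at `start` and excludes the one at `stop`, which is why `text[-3:-1]` drops the final `E`. The short sketch below redefines `text` so that it can run on its own and shows a few more slice forms:

```python
# Same note text as above, redefined so this cell runs on its own
text = """Chief Complaint:
5 days worsening SOB, DOE"""

last_three = text[-3:]          # omit the stop index to run through the end
first_word = text[:5]           # omit the start index to begin at 0
every_other = text[:5:2]        # an optional third value sets the step size
reversed_word = text[:5][::-1]  # a step of -1 reverses the string

print(last_three)     # DOE
print(first_word)     # Chief
print(every_other)    # Cif
print(reversed_word)  # feihC
```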

In [13]:
# RUN CELL TO SEE QUIZ
quiz_text_3

VBox(children=(HTML(value='<h4>TODO</h4>Using the variable `text` that we defined above, what would be the val…



### Substrings
A **substring** is a smaller (or equally-sized) string consisting of consecutive characters within a larger string (or **superstring**). When we slice a string, we get a substring:

In [14]:
sub_text = text[:5]
sub_text

'Chief'

We can check whether a string is contained in a superstring by using the `in` keyword:

In [15]:
sub_text in text

True

Strings are case-sensitive, though, so the characters need to match exactly:

In [16]:
"chief" in text

False

### Upper and lower-case
Sometimes we want to change the case of a string, or ignore the case altogether. A few methods return a new string with different capitalization.

`text.upper()` returns an all upper-case version of the string, and `text.lower()` returns an all lower-case one:

In [17]:
text.upper()

'CHIEF COMPLAINT:\n5 DAYS WORSENING SOB, DOE'

In [18]:
text.lower()

'chief complaint:\n5 days worsening sob, doe'

Switching to all one case can make it easier to search a text for a string:

In [19]:
"chief" in text.lower()

True
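
If we run this kind of search often, it can help to wrap it in a small function. The sketch below uses a hypothetical helper named `contains_ignore_case` (it is not part of the course package) and also shows `str.find`, which reports where a match starts:

```python
# Same note text as above, redefined so this cell runs on its own
text = """Chief Complaint:
5 days worsening SOB, DOE"""

def contains_ignore_case(term, document):
    """Return True if term occurs in document, ignoring case."""
    return term.lower() in document.lower()

print(contains_ignore_case("sob", text))    # True
print(contains_ignore_case("fever", text))  # False

# str.find returns the index of the first match, or -1 if there is none
print(text.lower().find("sob"))    # 34
print(text.lower().find("fever"))  # -1
```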

The `title()` method capitalizes the first letter of each word but lower-cases the rest:

In [20]:
text.title()

'Chief Complaint:\n5 Days Worsening Sob, Doe'
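
One thing to watch for: `title()` treats every run of letters as a separate word, so it also capitalizes the letter after an apostrophe. The example phrase below is made up for illustration, and it compares `title()` with `capitalize()` and the standard library's `string.capwords`:

```python
import string

# Made-up phrase, used only to show the differences
phrase = "patient's o2 sat"

titled = phrase.title()             # capitalizes after the apostrophe too
capitalized = phrase.capitalize()   # only the first character of the string
capworded = string.capwords(phrase) # splits on whitespace before capitalizing

print(titled)       # Patient'S O2 Sat
print(capitalized)  # Patient's o2 sat
print(capworded)    # Patient's O2 Sat
```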

### Splitting strings
The `split()` method breaks a string into a list of smaller strings wherever a given separator (one or more characters) occurs. This is how **comma-separated files** mark the boundaries between columns:

In [21]:
"name,age,city,state".split(",")

['name', 'age', 'city', 'state']

In [22]:
"alec chapman,29,salt lake city".split(",")

['alec chapman', '29', 'salt lake city']

After splitting a string, we can **unpack it** into distinct values:

In [23]:
name, age, city = "alec chapman,29,salt lake city".split(",")
print(name)
print(age)
print(city)

alec chapman
29
salt lake city


But be careful: if one of the elements itself contains a comma, the string will split in a way you might not expect:

In [24]:
"alec chapman,29,salt lake city, utah".split(",")

['alec chapman', '29', 'salt lake city', ' utah']

In [25]:
# Throws an error
name, age, city = "alec chapman,29,salt lake city, ut".split(",")

ValueError: too many values to unpack (expected 3)
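
There are a couple of ways around this. As a minimal sketch, `split()` accepts an optional second argument limiting the number of splits, and Python's built-in `csv` module handles fields wrapped in quotes. The quoted row below is an assumed example of how such data might be stored:

```python
import csv

row = "alec chapman,29,salt lake city, ut"

# A maxsplit of 2 stops after two splits, so the remainder stays in one piece
name, age, city = row.split(",", 2)
print(city)  # salt lake city, ut

# The csv module respects quotes around fields that contain commas
quoted_row = 'alec chapman,29,"salt lake city, ut"'
fields = next(csv.reader([quoted_row]))
print(fields)  # ['alec chapman', '29', 'salt lake city, ut']
```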

You can split on any character. Splitting on *whitespace* is a simple way to break a string up into individual words (but do you see any problems with this?):

In [26]:
"This is a sentence.".split(" ")

['This', 'is', 'a', 'sentence.']
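
One issue is that real notes often contain double spaces and newlines. As a small sketch with a made-up string, compare splitting on a single space with calling `split()` with no argument. Note that punctuation stays attached to the words either way:

```python
# Made-up text with a double space and a newline
note = "5 days  worsening SOB,\nDOE"

# Splitting on a single space leaves an empty string and keeps the newline
by_space = note.split(" ")
print(by_space)  # ['5', 'days', '', 'worsening', 'SOB,\nDOE']

# With no argument, split() treats any run of whitespace as one separator
by_whitespace = note.split()
print(by_whitespace)  # ['5', 'days', 'worsening', 'SOB,', 'DOE']
```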

#### TODO
An **empty string** is a string without any characters:

`""`

In [27]:
# RUN CELL TO SEE QUIZ
quiz_len_empty

VBox(children=(HTML(value='What value would be generated by the following code:</br><p style="font-family:cour…



In [30]:
# RUN CELL TO SEE QUIZ
quiz_split_pna_empty

VBox(children=(HTML(value='What would happen if you split the string `"pna"` on an empty string?'), RadioButto…



### Joining strings
The inverse of *splitting* strings is *joining* strings. We saw in the first notebook how to **concatenate** two strings together to create a larger string:

In [31]:
"Chief" + " " + "complaint"

'Chief complaint'

We can take a list of strings and create a string with all of them joined by some character (or multiple characters):

In [32]:
" ".join(["Chief", "complaint:", "5", "days", "worsening", "SOB", "DOE"])

'Chief complaint: 5 days worsening SOB DOE'

In [33]:
"...".join(["This", "feels", "very", "passive", "aggressive", ""])

'This...feels...very...passive...aggressive...'
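
`join()` only works on strings, so a list that mixes in numbers needs converting first. A short sketch, using the same made-up record as the splitting examples:

```python
fields = ["alec chapman", 29, "salt lake city"]

# join() raises a TypeError on non-string items, so convert each one first
row = ",".join(str(field) for field in fields)
print(row)  # alec chapman,29,salt lake city

# Splitting on the same separator recovers the pieces (now all strings)
pieces = row.split(",")
print(pieces)  # ['alec chapman', '29', 'salt lake city']
```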

### String formatting
Sometimes we want to use "template" texts and fill in values from variables. For example, way back in the first notebook we had a function called `print_name` which created and printed a string based on the function arguments. **String formatting** is a nice way to do this. One option is so-called `f-strings`, which are marked with an `f` before the opening quote and contain variable names inside curly brackets `{}`.

In [34]:
first = "Alec"
last = "Chapman"
print(f"My name is {first} {last}")

My name is Alec Chapman


You can also use the string method `.format()`:

In [35]:
print("My name is {} {}".format("Alec", "Chapman"))

My name is Alec Chapman
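
Both approaches support a bit more than plain substitution. In the sketch below, the temperature value is made up for illustration; it shows named placeholders with `.format()` and a format spec (the part after the colon) in an f-string:

```python
name = "Alec Chapman"
temperature = 38.456  # made-up value for illustration

# Named placeholders make longer templates easier to read
template = "{name} has a temperature of {temp} C"
with_format = template.format(name=name, temp=temperature)
print(with_format)  # Alec Chapman has a temperature of 38.456 C

# A format spec after the colon controls how the value is rendered
with_fstring = f"{name} has a temperature of {temperature:.1f} C"
print(with_fstring)  # Alec Chapman has a temperature of 38.5 C
```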


## Practice with strings

### 1.
In NLP, **tokenization** is the process of splitting a text into individual words. It can be informative to see which unique words appear in a document. Split `disch_summ` into tokens and then create an object containing unique tokens. As an optional next step, count how many times each token appears.

In [22]:
# RUN CELL TO SEE HINT
hint_tokenize_disch_summ 

VBox(children=(HTML(value='</br><strong>Displaying hint 0/2</strong>'), Output(), Button(description='Get hint…
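
If you get stuck, here is one possible approach, shown on a made-up stand-in string rather than on `disch_summ` itself. It is only a sketch: it tokenizes on whitespace, so punctuation stays attached to words, and it lower-cases the text first so that "Cough" and "cough" count as the same token:

```python
from collections import Counter

# Hypothetical stand-in text; in the notebook you would use disch_summ instead
sample_note = "Pt reports cough and fever. Cough worse at night."

tokens = sample_note.lower().split()  # whitespace tokenization
unique_tokens = set(tokens)           # a set keeps one copy of each token
token_counts = Counter(tokens)        # maps each token to its frequency

print(len(tokens))            # 9
print(len(unique_tokens))     # 8
print(token_counts["cough"])  # 2
```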



### 2.
Clinical notes are often structured with different *sections* which describe different parts of a patient's care. You can often recognize a new section by a title followed by a new line:

```
History of Present Illness:
The pt is a 63M w...
```

Each of the texts below comes from a different section of a clinical note. Write a function called `get_section_name` which takes a text and returns the name of the section. The expected values are written as comments. Test it on the three strings below. Note that the character `"\n"` indicates a new line.

In [37]:
texts = [
    "Chief Complaint:\n5 days worsening SOB, DOE", # "Chief Complaint"
    "History of Present Illness:\nPt is a 63M w/ h/o metastatic carcinoid tumor.", # "History of Present Illness"
    "Social History:\nLives alone with two daughters." # "Social History"
]

In [38]:
# Define a function called get_section_name
def get_section_name(text):
    # Write code to extract the section name
    # ...
    return section_name

In [5]:
# Define a function called get_section_name
def get_section_name(text):
    section_name = text.split(":\n")[0]
    return section_name

In [6]:
# RUN CELL TO TEST FUNCTION
test_get_section_name.test(get_section_name)

That is correct!
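
The solution above works for texts that begin with a section header, but when `":\n"` is missing, `split()` returns a one-element list and the function returns the entire text. As a hedged alternative, the sketch below uses `str.partition` and returns `None` in that case. The name `get_section_name_safe` is hypothetical and is not what the test cell checks:

```python
def get_section_name_safe(text):
    """Return the section title, or None if no header is found."""
    section_name, separator, _ = text.partition(":\n")
    if not separator:  # partition gives an empty separator when not found
        return None
    return section_name

with_header = get_section_name_safe("Social History:\nLives alone with two daughters.")
without_header = get_section_name_safe("No header in this text")

print(with_header)     # Social History
print(without_header)  # None
```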


### 3.
Write a function called `pneumonia_in_text` which checks whether a string contains *"pneumonia"* or the abbreviation *"pna"*. Test it on the following strings:


In [None]:
pna_strings = [
    "The patient has pneumonia.",
    "INDICATION: EVALUATE FOR PNEUMONIA",
    "Patient shows symptoms concerning for pna.",
    "The chest image found no evidence of pna",
]

In [None]:
def pneumonia_in_text(text):
    if ___:
        answer = True
    else:
        answer = False
    return answer

In [11]:
def pneumonia_in_text(text):
    if "pneumonia" in text.lower() or "pna" in text.lower():
        answer = True
    else:
        answer = False
    return answer

In [12]:
# RUN CELL TO TEST FUNCTION
test_pneumonia_in_text.test(pneumonia_in_text)

That is correct!
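
Because `in` matches substrings, the short abbreviation `"pna"` can also match inside unrelated words such as "hypnagogic". One way to guard against this, sketched below with Python's built-in `re` module, is to require a word boundary on each side of the term. The function name `pneumonia_in_text_regex` and the second example sentence are made up for illustration:

```python
import re

# \b marks a word boundary, so "pna" must appear as its own word
PNEUMONIA_PATTERN = re.compile(r"\b(pneumonia|pna)\b", re.IGNORECASE)

def pneumonia_in_text_regex(text):
    """Return True if 'pneumonia' or 'pna' appears as a whole word."""
    return PNEUMONIA_PATTERN.search(text) is not None

unrelated = "Reports hypnagogic hallucinations."

print(pneumonia_in_text_regex("INDICATION: EVALUATE FOR PNEUMONIA"))  # True
print(pneumonia_in_text_regex(unrelated))                             # False
print("pna" in unrelated.lower())                                     # True
```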


In [15]:
# RUN CELL TO SEE QUIZ
quiz_mc_pneumonia_in_text

VBox(children=(HTML(value='If the function above returns True, that means the note indicates the patient has p…



### 4.
Edit the code below so that the function generates a string giving the patient's name, age, and chief complaint based on the argument values.

In [29]:
# RUN CELL TO SEE HINT
hint_generate_chief_complaint

VBox(children=(HTML(value='</br><strong>Displaying hint 0/1</strong>'), Output(), Button(description='Get hint…



In [23]:
def generate_chief_complaint(name, age, chief_complaint):
    text = "Alec Chapman is a 30-year-old patient who presents today with a broken arm."
    return text

In [24]:
# Each line should print a different value
print(generate_chief_complaint("Alec Chapman", "30", "a broken arm"))
print(generate_chief_complaint("John Doe", "41", "cough and fever."))
print(generate_chief_complaint("Jane Doe", "61", "symptoms concerning for pneumonia."))

Alec Chapman is a 30-year-old patient who presents today with a broken arm.
Alec Chapman is a 30-year-old patient who presents today with a broken arm.
Alec Chapman is a 30-year-old patient who presents today with a broken arm.
