
As was mentioned in the lecture, one of the great strengths of Python is the ease with which it allows you to assemble tools to solve a problem instead of having to reinvent all the techniques yourself.

For example, consider the following doctor's note:

In [1]:
note = """
John is 32 years old. Sarah is 29 years old. Their child, Sam is 2 years old.
All members of the family were healthy at Sam's last checkup. Today, however,
while the mother appeared healthy -- Sarah's temperature was 37 degrees, she
said -- she also said that the father couldn't make Sam's appointment today
because John's temperature was 40 degrees, so he went to go see his physician
instead. During the visit, we measured that Sam's temperature was 38 degrees,
a low grade fever. Sarah mentioned her mother, Jean was diagnosed with mprbble
recently, but after further discussion they hadn't seen Jean in over two weeks
and given the usual incubation time, it was improbable that this was also
afflicting little Sam. Several tests were ordered. When the results were in,
Sam was diagnosed with strep.
"""

We know how to do a little bit of analysis on the `note` already. For example, as a very crude approximation of the level of concern, one might count the words in various doctors' notes and see whether, e.g., notes about pneumonia are longer than those about the common cold.
Using what you've already learned, how many words are in this `note`?

In [4]:
len(note.split())

137

Does your count include numbers (e.g. "32")? How can you prove this to yourself? (It's always important to test things to make sure that they work the way you think they do.)

In [8]:
#look specifically for numbers and count how many numbers are in there
words = note.split()
for w in words:
  if w.isdigit():
    print(w)
#test for nums
sum(w.isdigit() for w in note.split())

32
29
2
37
40
38


6

Let's say that we wanted to answer a more complicated question: **how many people are mentioned in this note?** (Maybe you're testing a hypothesis that some diseases can be better understood in the context of the community.)

You could count, and it wouldn't take that long, but that is impractical if you have a million notes to analyze.

Go back and look at this note. See if you notice any recurring patterns that could indicate a name.

Looking through the note, I noticed that every person's name is capitalized. Names appear as single capitalized words, sometimes alongside a description such as "the mother" or "the father".

Pattern recognition is a pretty basic concept. You might think to yourself that people have probably tried to do pattern recognition in strings on the computer before, and you'd be right. Try a quick search for e.g. `python pattern recognition string` to see if you can find a module (library) that can help.

I actually didn't know Python has a built-in module called `re` that allows you to find specific patterns in a text. Regular expressions let us describe string patterns easily, like capitalized words (names!). This saves me time because now I don't have to manually check every word. My computer can find all the matches for me.

Good job. You probably found that there's a concept called regular expressions that provides a formalized way of describing patterns in strings and that there's a module `re` for working with regular expressions. Go ahead and `import` the `re` module:

In [9]:
import re

Let's find out what functions are in the module by doing a `dir`:

In [10]:
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'NOFLAG', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '_MAXCACHE2', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_cache', '_cache2', '_casefix', '_compile', '_compile_template', '_compiler', '_constants', '_parser', '_pickle', '_special_chars_map', '_sre', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sub', 'subn', 'template']


`findall` sounds promising. Ask for `help` on `findall`:

In [11]:
help(re.findall)

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.



So we've learned `findall` takes a pattern and a string and returns a list of matches. The simplest pattern is of course just a string:

In [None]:
re.findall("years", note)

['years', 'years', 'years']

So "years" occurs three times. That's good to know, but we'd really like to know how many years.

The goal of this exercise is to get experience with calling library functions, not with regular expressions themselves, but here is an introduction to a tiny but useful subset of regular expression pattern specification:

    [] indicate options, e.g. [ae] means a or e, ranges are allowed, e.g. [a-z] or even [a-zA-Z]
    + means the previous item repeats 1 or more times; e.g. [0-9]+ means one or more repeating digits
    () indicates grouping; if no group is specified, the whole match is returned
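As a quick illustration of these three building blocks, the sketch below applies each one to a made-up sentence. The `sample` string is an assumption invented for this example and is not part of the `note`.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "The cat sat on the mat at 10 past 7."

# [] lists options: "c" or "m", followed by "at".
options = re.findall(r"[cm]at", sample)

# + repeats the previous item one or more times: runs of digits.
digits = re.findall(r"[0-9]+", sample)

# () selects the part of the match to return: the word after "the ".
after_the = re.findall(r"the ([a-z]+)", sample)

print(options)    # ['cat', 'mat']
print(digits)     # ['10', '7']
print(after_the)  # ['mat']
```

Note that the capitalized "The" at the start is not matched by the lowercase pattern `the`, which is why only one word is returned in the last case.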

For reasons that don't matter with this subset, you'll often find regular expression patterns written using a raw string, which is like a regular string but with the letter `r` written in front, e.g. `r"the"`. We'll use this for all of our examples even though it's not strictly necessary.
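To make the difference concrete, here is a small sketch of how a raw string treats backslashes. It uses `\d`, a standard regular expression escape meaning "any digit" that is outside the subset listed above, and a made-up input string.

```python
import re

# In a regular string, "\n" is a single newline character.
regular = "\n"
# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw = r"\n"

print(len(regular), len(raw))  # 1 2

# Raw strings keep regex escapes such as \d readable.
# Both spellings below produce the same three-character pattern.
same_pattern = r"\d+" == "\\d+"

# Hypothetical input text, invented for this illustration.
matches = re.findall(r"\d+", "Room 12, floor 3")

print(same_pattern, matches)  # True ['12', '3']
```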

For example, suppose we want to find all the words that occur after "the". That's the letters "the", a space, and then one or more letters. If we do:

In [None]:
re.findall(r"the [a-zA-Z]+", note)

['the family',
 'the mother',
 'the father',
 'the visit',
 'the usual',
 'the results']

Compare to the text of the `note` to convince yourself this result is correct.

If we want to get only specific data, we can use `()` around the piece or pieces that we're interested in:

In [None]:
re.findall(r"the ([a-zA-Z]+)", note)

['family', 'mother', 'father', 'visit', 'usual', 'results']

Adapt this to find the various ages. (Check the `note` to see the pattern for mentioning ages used by this author. You can imagine that there are other ways of indicating age, but we'll ignore that for now.)

In [13]:
age = re.findall(r"(\d+)\s+years\s+old",note)
print(age)

['32', '29', '2']


Now modify your expression (you'll need to add a second set of parentheses) to find out who is each age:

In [14]:
allage = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old",note)
print(allage)

[('John', '32'), ('Sarah', '29'), ('Sam', '2')]


Extend this slightly by writing a function that takes a string (e.g. `note`) and indicates in a human-readable way who is each age. Make your own test and then run it with `note`:

In [15]:
def exactage(text):
  agematch = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old",text)

  results=[]
  for name, age in agematch:
    results.append(f"{name} is {age} years old")

  return results
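One way to "make your own test" is to call the function on a short string whose answer is known in advance, and only then run it on `note`. The sketch below assumes a made-up test sentence and repeats the logic under the hypothetical name `describe_ages` so that it runs on its own.

```python
import re

def describe_ages(text):
    """Return one readable sentence per 'Name is N years old' match."""
    matches = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old", text)
    return [f"{name} is {age} years old" for name, age in matches]

# Hypothetical test text with a known answer.
test_text = "Ana is 41 years old. Her son Ben is 7 years old."
known_answer = describe_ages(test_text)
print(known_answer)  # ['Ana is 41 years old', 'Ben is 7 years old']

# Text with no ages should give an empty list rather than an error.
no_ages = describe_ages("Nobody's age is mentioned here.")
print(no_ages)  # []
```

With the notebook's earlier cells already run, `exactage(note)` can be called the same way.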

Modify your function to report whether or not each person is an adult (age 18 or older) in addition to their age. You will want to use the `int` function to convert a string into a number. Run it with the `note`.

In [19]:
def exactage(text):
  agematch = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old",text)

  results=[]
  for name, age_str in agematch:
    age = int(age_str)  # findall returns strings; convert before comparing
    if age >= 18:
      status = "an adult"
    else:
      status = "not an adult"
    results.append(f"{name} is {age} years old and is {status}.")

  return results
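The conversion with `int` matters because `findall` returns ages as strings, and strings compare character by character rather than by numeric value. A minimal sketch, using Sam's age from the note:

```python
# Ages captured by a regular expression come back as strings.
age_str = "2"

# String comparison goes character by character, so "2" sorts after "18".
string_check = age_str >= "18"
print(string_check)   # True, which is the wrong answer for a 2-year-old

# Converting to int first compares the numeric values instead.
numeric_check = int(age_str) >= 18
print(numeric_check)  # False
```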

Find a similar pattern for indicating who has been diagnosed with what, and use that to get a *set* of the people who have diagnoses mentioned in the note:

In [25]:
diagnoses = re.findall(r"([A-Z][a-z]+)\s+was\s+diagnosed\s+with\s+([A-Za-z-]+)", note)
peepswdiagnosis= {name for name, dx in diagnoses}
diagnoses, peepswdiagnosis

([('Jean', 'mprbble'), ('Sam', 'strep')], {'Jean', 'Sam'})

Assume that everyone who is mentioned in a `note` either has an age or a diagnosis. Using your two regular expression patterns, get a *set* of all the people mentioned in the note. (Hint: you might want to use a set's `union` method to combine two sets.)

In [27]:
def allpeople(note):
  agematch = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old",note)
  peepages = {name for name, age in agematch}
  diagnoses = re.findall(r"([A-Z][a-z]+)\s+was\s+diagnosed\s+with\s+([A-Za-z-]+)", note)
  peepswdiagnosis= {name for name, dx in diagnoses}

  allpeople = peepages.union(peepswdiagnosis)

  return allpeople

Think about why we'd want to use a *set* instead of a *list*.
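One way to explore this is to combine the matches both ways and compare the counts. The sketch below assumes a made-up two-sentence text in which the same person matches both patterns, as Sam does in the `note`.

```python
import re

# Hypothetical text in which one person matches both patterns.
text = "Sam is 2 years old. Sam was diagnosed with strep."

age_names = re.findall(r"([A-Z][a-z]+)\s+is\s+\d+\s+years\s+old", text)
dx_names = re.findall(r"([A-Z][a-z]+)\s+was\s+diagnosed\s+with", text)

as_list = age_names + dx_names            # keeps the duplicate "Sam"
as_set = set(age_names).union(dx_names)   # duplicates collapse into one entry

print(len(as_list), len(as_set))  # 2 1
```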

Finally, have the computer answer the original question: how many people are mentioned in this note?

In [32]:
#lists can have duplicates, which is bad when we are counting for unique values
def allpeople(note):
  agematch = re.findall(r"([A-Z][a-z]+)\s+is\s+(\d+)\s+years\s+old",note)
  peepages = {name for name, age in agematch}
  diagnoses = re.findall(r"([A-Z][a-z]+)\s+was\s+diagnosed\s+with\s+([A-Za-z-]+)", note)
  peepswdiagnosis= {name for name, dx in diagnoses}

  allpeople = peepages.union(peepswdiagnosis)

  return allpeople

people = allpeople(note)  # a separate name, so the function is not overwritten
print("People mentioned:", people)
print("Number of unique people mentioned:", len(people))

People mentioned: {'Sam', 'John', 'Sarah', 'Jean'}
Number of unique people mentioned: 4
