# Regular Expression for Concept Extraction

## Language

With language, we use a large but finite vocabulary to describe the world. These descriptions are composed of words and phrases that can be predicted with some regularity. For instance, when describing infiltration, we can use words like 'infiltrate', 'infiltrated', and 'infiltration', which are written using letters placed together with some regularity. We can readily identify all three expressions by identifying the pattern used to construct these lexical variants. What do these three words have in common?


$${\bf infiltrat}e$$
$${\bf infiltrat}ed$$
$${\bf infiltrat}ion$$

## Representing patterns

Regular expressions (also known as regex) are strings that represent the regularity in the text that you would like to leverage to **match** or **extract** a concept. Regex is a powerful and commonly used tool for natural language processing and has been used for several tasks:

* identify and classify a patient's smoking status
* extract family history information for genetic studies
* redact protected health information from text for research
* parse an Excel spreadsheet to generate population statistics about childhood asthma

## For this lesson, we will complete a high-level overview of regexes, primarily for the purpose of developing with pyConText.

For more detailed documentation and explanations, please see the <a href="https://docs.python.org/3/library/re.html">Python regex documentation</a>.

Let's get started. We will begin by importing the re module from the Python standard library.

In [1]:
import re
from IPython.display import display, Math, Latex
from IPython.display import YouTubeVideo

# let's also do a quiz
from regular_expression_quiz import which_a_string_not_matched
from regular_expression_quiz import test_infiltrates_expression

Now, before we begin, we will review the various ways regular expressions can be applied. The general approach to using a regex is 1) to compile the search pattern and then 2) to apply the compiled regex to a segment of text (here with match, which only checks the beginning of the string):

$$compiled\_expression = re.compile(pattern)$$
$$extracted\_text = compiled\_expression.match(string)$$

However, if we want to return all instances of matched text found in our target string, we can simplify this two-step process into one step as follows:

$$matches = re.findall(pattern, string)$$
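The two approaches above can be sketched side by side as follows. This is a minimal illustration; the variable names and the sample sentence are chosen only for this example.

```python
import re

text = "cardiovascular: patient has cardiovascular disease."

# Step 1: compile the pattern once so it can be reused.
compiled_expression = re.compile("cardiovascular")

# Step 2: apply it. match() only looks at the beginning of the string.
first_match = compiled_expression.match(text)
print(first_match.group())   # cardiovascular

# One-step alternative: findall() returns every non-overlapping match.
matches = re.findall("cardiovascular", text)
print(matches)               # ['cardiovascular', 'cardiovascular']
```

Compiling first is mostly useful when the same pattern is applied to many strings.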


## Applying regexs

There are several ways to apply a regex pattern, so we'll show the most common one here; the rest are listed near the end of this notebook:

- findall: returns all matches of the expression from the string

In [2]:
re.findall('cardiovascular', "cardiovascular: patient has cardiovascular disease.")

['cardiovascular', 'cardiovascular']

## Simple expressions
The simplest regular expressions are very easy to understand.  They will just match any sequence of characters that match the pattern that you provide.  Let's do a few examples of that:

In [3]:
re.findall('and', "We walked and walked and walked.")

['and', 'and']

In [4]:
re.findall('walked', "We walked and walked and walked.")

['walked', 'walked', 'walked']

## Case sensitivity
Note that regular expressions are very explicit in what they match.  This includes character case sensitivity (i.e. lowercase and uppercase characters).

Given this, will the expression below find a match?

In [5]:
re.findall('pneumonia', "Pneumonia reports were inconclusive")

[]

## NOTE : pyConText does ignore case by default, but it's important to be aware of case sensitivity for many text applications
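Outside of pyConText, Python's re module offers the re.IGNORECASE flag for the same purpose. The sketch below only demonstrates that standard flag; it makes no claim about how pyConText implements case handling internally.

```python
import re

report = "Pneumonia reports were inconclusive"

# Default behaviour: matching is case sensitive, so nothing is found.
case_sensitive = re.findall("pneumonia", report)

# With the re.IGNORECASE flag, 'p' in the pattern also matches 'P'.
case_insensitive = re.findall("pneumonia", report, flags=re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['Pneumonia']
```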

## Sequencing ranges
How do we get around case sensitivity?  One way is to use character ranges.  Characters can be listed individually inside square brackets like this:

* [pP] = Lowercase p or Uppercase P
* [abcde]  = Lowercase a, b, c, d, or e

Alternatively, a range of characters can be specified like this:

* [a-e] = Any character between lowercase a and lowercase e (this is equivalent to the example above)

Additional *sequencing range* examples include:
- [0-9] = any digit between 0 and 9, e.g., 0, 1, 2, ..., 9
- [a-z] = any lowercase letter between a and z, e.g., a, b, c, ..., z
- [A-Z] = any uppercase letter between A and Z, e.g., A, B, C, ..., Z
- [A-Za-z0-9] = any alphanumeric character regardless of case, e.g., A, a, B, b, ..., 0, 1, ...
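As a quick illustration of the ranges listed above, the sketch below applies each one to a short made-up string (the sample text is an assumption for this example only).

```python
import re

sample = "Room 12B, bed 3a"   # hypothetical text for illustration

digits = re.findall("[0-9]", sample)
lowercase = re.findall("[a-z]", sample)
uppercase = re.findall("[A-Z]", sample)
alphanumeric = re.findall("[A-Za-z0-9]", sample)

print(digits)                  # ['1', '2', '3']
print(lowercase)               # ['o', 'o', 'm', 'b', 'e', 'd', 'a']
print(uppercase)               # ['R', 'B']
print("".join(alphanumeric))   # Room12Bbed3a
```

Each range matches exactly one character, which is why the results are lists of single characters.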

Let's look at our pneumonia capitalization issue again

In [6]:
# : Will this work?
re.findall('pneumonia', "Pneumonia reports were inconclusive")

[]

In [7]:
# : What about this one
re.findall('[pP]neumonia', "Pneumonia reports were inconclusive")

['Pneumonia']

## Let's look at one more example.  We'll scan for a range of uppercase letters.
Will we match the letter F in this sentence?

In [8]:
grades_text = 'Tom got an A.  Sally got a B.  Ralph got a D.  Kelly got an F.'
re.findall('[A-E]', grades_text)

['A', 'B', 'D']

In [9]:
# What about this one?
re.findall('[A-F]', grades_text)

['A', 'B', 'D', 'F']

In [10]:
# And this one
re.findall('[ABCDEF]', grades_text)

['A', 'B', 'D', 'F']

## Character classes

Regular expressions describe the text that you would like to include and exclude when extracting a string that represents a concept. For instance, in the sentence below, we can 1) make a regular expression that only identifies numbers using the *regular expression character class* "\d", which means one digit, and 2) apply the expression to the sentence text (re.findall compiles the expression for us).

There are a handful of *regular expression character classes*, including:

* \d   =  One digit -- equivalent to the set [0-9]
* \D   =  One non-digit 
* \s   =  One whitespace -- equivalent to the set [ \t\n\r\f\v] (i.e. space, tab, newline, carriage return, form feed, vertical tab)
* \S   =  One non-white space
* \w   =  One word character -- equivalent to [a-zA-Z0-9_]
* \W   =  One non-word character

Note: the equivalences above hold for ASCII text. With ordinary (Unicode) string patterns in Python 3, \d, \s, and \w also match non-ASCII digits, whitespace, and word characters unless the re.ASCII flag is used.



In [11]:
# The r prefix marks a raw string (explained later in this notebook)
re.findall(r'\d', "patient is a 91-year old male with 2 broken fingers.")

['9', '1', '2']

Conversely, we can apply a *regular expression character class* that identifies everything but the numbers by using the string "\D".
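A minimal sketch of \D applied to the same sentence is shown here. The variable names are chosen for this example; the + quantifier (one or more) is used in the second call to group consecutive non-digits together.

```python
import re

sentence = "patient is a 91-year old male with 2 broken fingers."

# \D matches one non-digit character at a time.
non_digits = re.findall(r"\D", sentence)
print(len(non_digits))   # every character except the three digits

# \D+ groups runs of consecutive non-digits into single matches.
non_digit_runs = re.findall(r"\D+", sentence)
print(non_digit_runs)
# ['patient is a ', '-year old male with ', ' broken fingers.']
```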

## Coding exercise

Create and apply the \w+ and \W+ expressions to the text above. What do you see?

In [12]:
re.findall(r'\w+', "patient is a 9-year old male with 2 broken fingers.")

['patient',
 'is',
 'a',
 '9',
 'year',
 'old',
 'male',
 'with',
 '2',
 'broken',
 'fingers']

## Useful character class : Word boundary
Sometimes when working with text it is very useful to match certain characters but no more.  For example, if a regular expression was written as:

* 'cat'

This would match the substring 'cat' when found in all of the following words:
* cat
* category
* categories
* scatter
* etc...

For this purpose, the **word boundary** character class exists:

* \b

From the Python documentation:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

## Let's say that we want to match only 'cat' and none of the other examples above.  Let's see what would happen without word boundaries:

In [13]:
# let's try an experiment:
cat_text = 'My cat likes to put each category into multiple categories until everyone scatters!'

cat_expression_1 = r'cat'
cat_matches_1 = re.findall(cat_expression_1, cat_text)
print('Here are all of the matches on the first expression:')
print(cat_matches_1)

Here are all of the matches on the first expression:
['cat', 'cat', 'cat', 'cat']


## Pro Tip : Python raw strings and character conflicts in Expressions
As we move on, we are about to use the **word boundary** character class \b.  This means something specific in regular expressions, but this sequence also represents the backspace character in Python string literals.  There are many special characters (i.e. escape sequences) like this in Python, so one way to avoid conflicts is to use a "raw" string like this:

r'I am a raw string'

This way, if you use regular expression syntax that happens to be the same as an escape sequence (such as \b), it is passed to the regular expression engine literally instead of being converted to a backspace.  See how we do this below with:

r'\bcat\b'

To see a full list of Python escape sequences, you can <a href="https://docs.python.org/3.5/reference/lexical_analysis.html#string-and-bytes-literals">check them here</a>
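The difference between the two kinds of string literal can be checked directly. This short sketch compares a normal and a raw literal containing \b; the variable names are chosen for this example.

```python
# In a normal string literal, Python turns \b into one backspace character.
normal = '\b'

# In a raw string literal, the backslash and the 'b' stay as two characters.
raw = r'\b'

print(len(normal))        # 1
print(len(raw))           # 2
print(normal == '\x08')   # True: \x08 is the backspace character
print(raw == '\\b')       # True: a backslash followed by 'b'
```

Only the two-character version reaches the regular expression engine as the word boundary class.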

## We got 4 matches using the expression above and we only wanted one.  Let's see if we can fix that with the word boundary class:

In [14]:
cat_expression_2 = r'\bcat\b'
cat_matches_2 = re.findall(cat_expression_2, cat_text)
print('Here are all of the matches on the second updated expression:')
print(cat_matches_2)

Here are all of the matches on the second updated expression:
['cat']


## What would happen without the "raw" string r''?

In [15]:
cat_expression_3 = '\bcat\b'
cat_matches_3 = re.findall(cat_expression_3, cat_text)
print('Here are all of the matches on the third expression (no raw string):')
print(cat_matches_3)

Here are all of the matches on the third expression (no raw string):
[]


# "OR" matching
Regular expressions are powerful because special characters allow for matching multiple sequences or regular expressions at once:

* | = match either regular expression on either side of the | symbol  

From the Python documentation linked above : 'A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way'.

In [16]:
color_text = 'i love the colors blue, green, red and purple but I do not like orange.'

color_expression = '(blue|green|red|purple)'
color_matches = re.findall(color_expression, color_text)

print(color_matches)

if 'yellow' in color_matches:
    print('yellow was matched in this case')
else:
    print('yellow was not matched in this case')

['blue', 'green', 'red', 'purple']
yellow was not matched in this case


# Quiz : "OR" matching

Change the regular expression below so that 'hair' is also part of the match

In [17]:
anatomy_text = 'Patient physical showed that they had heart, head, nose, mouth and hair'

# update the regular expression below so that 'hair' is matched by the code below
anatomy_expression = '(heart|head|nose|mouth|UPDATE_ME)'

anatomy_matches = re.findall(anatomy_expression, anatomy_text)

print('Anatomy matches : ' + str(anatomy_matches))

if len(set(['heart', 'head', 'nose', 'mouth']).intersection(set(anatomy_matches))) != 4:
    print('INCORRECT! Some of the strings previously matched are no longer matching.  Make sure that heart, head, nose and mouth can all still match!')
elif 'hair' not in anatomy_matches:
    print('INCORRECT! hair was not matched by your regular expression.  Please try again.')
else:
    print('CORRECT! hair was matched.  Great job!')
    

Anatomy matches : ['heart', 'head', 'nose', 'mouth']
INCORRECT! hair was not matched by your regular expression.  Please try again.


## Sequencing range and quantifier patterns 

A *quantifier* indicates how many times the preceding character, range, or character class may repeat. 

Simple *quantifier* examples include:
- \*	 = 0 or more
- \+	 = 1 or more
- ?	 = 0 or 1

Let's create a regular expression using a character class and one of the quantifiers above.

In [18]:
txt="Patient states fevers she felt feverish before developing many fevers, but no longer has a fever."

# this will return a list of matches
fever_matches = re.findall(r'fever\w*', txt)

# let's print out all the matches
print(fever_matches)

['fevers', 'feverish', 'fevers', 'fever']


In [19]:
re.findall(r'\w+', "patient is a 9-year old male with 2 broken fingers.")

['patient',
 'is',
 'a',
 '9',
 'year',
 'old',
 'male',
 'with',
 '2',
 'broken',
 'fingers']
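To compare the three simple quantifiers directly, the sketch below applies each one to the same letter in a made-up string of spelling variants (the sample text and variable names are assumptions for this example).

```python
import re

spelling_text = "color colour colouur"   # hypothetical text for illustration

# ? : the 'u' may appear zero or one time
zero_or_one = re.findall(r"\bcolou?r\b", spelling_text)
# * : the 'u' may appear zero or more times
zero_or_more = re.findall(r"\bcolou*r\b", spelling_text)
# + : the 'u' must appear one or more times
one_or_more = re.findall(r"\bcolou+r\b", spelling_text)

print(zero_or_one)    # ['color', 'colour']
print(zero_or_more)   # ['color', 'colour', 'colouur']
print(one_or_more)    # ['colour', 'colouur']
```

Note that each quantifier applies only to the item immediately before it, here the single letter u.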

## Quiz : Quantifiers

## Update the function below to pass in the string which would NOT be matched by this regular expression:
## 'ah+'

* a
* ah
* ahh
* ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

In [20]:
which_a_string_not_matched('PUT_YOUR_ANSWER_HERE')

'There should be at least one character "a" in your answer since the regular expression expects one "a"'

## Quiz : Quantifiers and Plurals

## Update the function below to pass in a regular expression that will match the following strings:

* infiltrate
* infiltrates

## But not match:
* infiltratess

## HINT : You may want to use the "?" special character and also consider the "\b" character class to enforce word boundaries on a match

In [21]:
test_infiltrates_expression(r'infiltrate[s]?')

'INCORRECT.  Your expression matched an unexpected string : [infiltratess].  Please try again.'

## Back to our initial motivating example...  Above we listed this example which is relevant to our pneumonia work:

$${\bf infiltrat}e$$
$${\bf infiltrat}ed$$
$${\bf infiltrat}ion$$

## So how can we match this?  There are a number of ways:

In [22]:
infiltrate_text = 'The infiltrate then infiltrated the thing resulting infiltration'

In [23]:
infiltrate_expression_1 = r'\b(infiltrate|infiltrated|infiltration)\b'

In [24]:
infiltrate_expression_2 = r'\binfiltrat(e|ed|ion)\b'

In [25]:
re.findall(infiltrate_expression_1, infiltrate_text)

['infiltrate', 'infiltrated', 'infiltration']

## Since the second regular expression has regular expression "groups" with parentheses, a call to findall() would show us only each captured group. Instead, let's iterate over the matches and print each full match.  This is actually the regular expression interface that pyConText uses:

In [26]:
infiltrate_compiled = re.compile(infiltrate_expression_2)
infiltrate_iter = infiltrate_compiled.finditer(infiltrate_text)

for i in infiltrate_iter:
    print(i.group())

infiltrate
infiltrated
infiltration
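To see the findall() behaviour mentioned above, the sketch below compares the capturing group with a non-capturing group, written (?:...). The non-capturing variant is an alternative shown for illustration and is not one of the two expressions defined in this notebook.

```python
import re

infiltrate_text = 'The infiltrate then infiltrated the thing resulting infiltration'

# Capturing group: findall() returns only what the parentheses captured.
captured = re.findall(r'\binfiltrat(e|ed|ion)\b', infiltrate_text)
print(captured)      # ['e', 'ed', 'ion']

# Non-capturing group (?:...): findall() returns the whole match instead.
whole_words = re.findall(r'\binfiltrat(?:e|ed|ion)\b', infiltrate_text)
print(whole_words)   # ['infiltrate', 'infiltrated', 'infiltration']
```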


## (OPTIONAL) What are the pros and cons to the two regular expressions above?  Which one do you find to be more readable?  Which would be easier to maintain?

<img src="images/stopsign.png">

# STOP!  Regular expressions are a complex and deep topic and we don't have time to cover everything.  

## For now we've covered most of what will be needed for this course and most cases in pyConText

## Let's stop here for now but please come back to this notebook later in the course or sometime after to learn more about more advanced topics in regular expressions

## Additional ways to apply Regular Expressions in Python

- search: checks for a match anywhere in the string

In [None]:
re.search("and", "head and eyes and ears and nose and throat")

- match: checks for a match only at the beginning of the string

In [None]:
re.match('cardiovascular', "patient has cardiovascular disease.")

- finditer: returns all matches of the expression from the string as an iterator of match objects.

In [None]:
for extracted in re.finditer('pain(?= in)', "pain on the neck; pain in the jaw"):
    print(extracted)
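The cells above are left unexecuted, so here is a hedged sketch of what the three calls return, using the same patterns and strings. The variable names are chosen for this example.

```python
import re

# search(): returns the first match found anywhere in the string.
found = re.search("and", "head and eyes and ears and nose and throat")
print(found.span())    # (5, 8)

# match(): only succeeds at the beginning of the string, so this is None.
not_at_start = re.match("cardiovascular", "patient has cardiovascular disease.")
print(not_at_start)    # None

# finditer(): yields one match object per hit. The lookahead (?= in)
# requires " in" to follow "pain" without including it in the match.
pain_spans = [m.span() for m in
              re.finditer("pain(?= in)", "pain on the neck; pain in the jaw")]
print(pain_spans)      # [(18, 22)]
```

Because match() can return None, it is safer to check the result before calling methods such as group() on it.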

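## Advanced Quantifiers
*Quantifier* examples also include the following (written without spaces inside the braces; Python treats a form such as {1, } as literal text rather than a quantifier):
- {5} = exactly 5
- {5,7} = between 5 and 7 (inclusive)
- {1,} = 1 or more
- {,4} = up to 4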
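The brace quantifiers can be tried on a small made-up example. In this sketch the sample strings and variable names are assumptions chosen for illustration, and word boundaries keep each match to a whole token.

```python
import re

numbers_text = "ids: 7, 42, 123, 12345, 1234567"   # hypothetical text
laugh_text = "ll lol lool loool"                   # hypothetical text

exactly_five = re.findall(r"\b\d{5}\b", numbers_text)
two_to_three = re.findall(r"\b\d{2,3}\b", numbers_text)
three_or_more = re.findall(r"\b\d{3,}\b", numbers_text)
up_to_two_os = re.findall(r"\blo{,2}l\b", laugh_text)

print(exactly_five)    # ['12345']
print(two_to_three)    # ['42', '123']
print(three_or_more)   # ['123', '12345', '1234567']
print(up_to_two_os)    # ['ll', 'lol', 'lool']
```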

## Positional patterns

For some expressions the position of the string is an important aspect to capture. 

*Positional* examples:
- ^ Match the start of the string
- $ Match the end of the string

For example, in the sentence below, we may only be interested in identifying the headers from the text strings. How can we apply the *positional expression* to capture the headers?

In [3]:
txt="cardiovascular: patient has cardiovascular"
re.search('cardiovascular$', txt)

<re.Match object; span=(28, 42), match='cardiovascular'>

## Obtaining span offsets

The expression with $ above found the string 'cardiovascular' at the end of the sentence (span 28 to 42) rather than the header at the start. To capture the header instead, we can anchor the pattern to the start of the string with ^. Beyond finding strings, we can also obtain information about the string identified: the match object gives us the span and the matched string, as shown below.

In [4]:
import re
matched=re.match('^cardiovascular', txt)
print(matched)
print("match:", matched.group())
print("span:", matched.start(),matched.end())

<re.Match object; span=(0, 14), match='cardiovascular'>
match: cardiovascular
span: 0 14


Identifying the start and end of a string can be important for many reasons:
* marking up text to denote information of interest using HTML or XML
* redacting sensitive information from text

Below we demonstrate how to map the matched string back onto the original text by slicing the string according to the start and end of the matched pattern.

In [None]:
txt[matched.start():matched.end()]
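The two use cases listed above can be sketched with the same slicing idea. The tag name header and the placeholder [REDACTED] are made up for this illustration.

```python
import re

txt = "cardiovascular: patient has cardiovascular"
matched = re.match(r"^cardiovascular", txt)
start, end = matched.start(), matched.end()

# Mark-up: wrap the matched span in a (hypothetical) XML-style tag.
marked_up = txt[:start] + "<header>" + txt[start:end] + "</header>" + txt[end:]
print(marked_up)   # <header>cardiovascular</header>: patient has cardiovascular

# Redaction: replace the matched span with a placeholder.
redacted = txt[:start] + "[REDACTED]" + txt[end:]
print(redacted)    # [REDACTED]: patient has cardiovascular
```

Only the span returned by the match is altered; the second occurrence of the word is untouched because the pattern is anchored to the start.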

## Extracting a set of strings

In some cases, we may want to extract a group of strings that match a more complex pattern.
To do this we will use a pattern with the following syntax: $$(?P<name>...)$$ 

Note you can retrieve the individual span parts using the 'name' or position in the retrieved tuple. 

In [5]:
txt="admission date:6/15-2015."
fullSpan=re.match(r"admission date:\s*(?P<month>\d{1,2})[-/](?P<day>\d{1,2})-(?P<year>\d{2,4})", txt)
print(fullSpan, "\n")

print("retrieved in as a tuple/ full span")
print("all parts:", fullSpan.groups(), "=", fullSpan.group(0), "\n")

print("retrieved as a dictionary")
print(fullSpan.groupdict(),"\n")

print("retrieved in parts")
print("month:",fullSpan.group('month'), "=", fullSpan.group(1) )
print("day:",fullSpan.group('day'), "=", fullSpan.group(2))
print("year:",fullSpan.group('year'), "=", fullSpan.group(3))
print(fullSpan.groupdict())


<re.Match object; span=(0, 24), match='admission date:6/15-2015'> 

retrieved in as a tuple/ full span
all parts: ('6', '15', '2015') = admission date:6/15-2015 

retrieved as a dictionary
{'month': '6', 'day': '15', 'year': '2015'} 

retrieved in parts
month: 6 = 6
day: 15 = 15
year: 2015 = 2015
{'month': '6', 'day': '15', 'year': '2015'}


## Coding exercise

Create an expression using the group syntax above to identify the patient name (Jane Doe) of the following text. Then replace it with the new name (Mary Lamb).
    

In [None]:
txt="Patient Name: Jane Doe"
# (?<=...) is a lookbehind: the label must precede the name but is not part of the match
re.subn(r"(?<=Patient Name: )\w+ \w+", "Mary Lamb", txt)
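The same exercise can also be approached with the named group syntax, (?P<name>...). The sketch below is one possible version; the group names label and name are chosen for this example.

```python
import re

txt = "Patient Name: Jane Doe"

# Two named groups: one for the label, one for the patient name.
name_expression = r"(?P<label>Patient Name: )(?P<name>\w+ \w+)"

found = re.match(name_expression, txt)
print(found.group("name"))   # Jane Doe

# \g<label> re-inserts the text captured by the 'label' group,
# so only the patient name is replaced.
renamed = re.sub(name_expression, r"\g<label>Mary Lamb", txt)
print(renamed)               # Patient Name: Mary Lamb
```

Compared with the lookbehind version, the named groups make it possible to retrieve the original name before replacing it.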


## For more detailed information about regexs

Please view the following video by Professors Dan Jurafsky & Chris Manning


In [None]:
YouTubeVideo("hwDhO1GLb_4?list=PL6397E4B26D00A269", width = 800, height=600)

## Want to try some expressions without coding?
Try [pythex](http://pythex.org/) 

<br/><br/>This material presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2019.<br/>
Presenters : Dr. Wendy Chapman, Kelly Peterson, Alec Chapman, Jianlin Shi <br> Acknowledgement: Many thanks to Olga Patterson, from whose previous work parts of these materials are adapted.