# Regular Expression for Concept Extraction

## Language

With language, we use a large but finite vocabulary to describe the world. These descriptions are composed of words and phrases that follow predictable patterns. For instance, a mention of infiltration may be written as 'infiltration', 'infiltrate', or 'infiltrated': lexical variants spelled with the same letters in the same order. If we can identify the pattern that generates these variants, we can readily find all three in text. What do these three words have in common?


$${\bf infiltrat}e$$
$${\bf infiltrat}ed$$
$${\bf infiltrat}ion$$

## Representing patterns

Regular expressions (aka *regex*) are string statements that represent the patterns you would like to **match** or **extract**. Regex is a powerful and commonly used tool for natural language processing and has been used for several tasks:

* identifying and classifying a patient's smoking status
* extracting family history information for genetic studies
* redacting patient protected health information from text for research
* parsing an Excel spreadsheet to generate population statistics about childhood asthma

For this lesson, we will complete a high-level overview of regexes to see how we can match and extract strings of text. Next week we'll apply these methods to identify mentions of pneumonia in clinical text.

For more detailed documentation and explanations, please see the <a href="https://docs.python.org/3/library/re.html">Python regex documentation</a>.

Let's get started - we will begin by importing the `re` module from the Python standard library.

In [None]:
import re
from IPython.display import display, Math, Latex

## Applying regexes

There are several ways to apply a regex pattern. We'll mostly stick with the function `findall`:

`matches = re.findall(pattern, string)`

This function takes two arguments:
- `pattern`: The regular expression pattern
- `string`: The string to search through

This function searches through the entire string and returns a list of every non-overlapping match, in the order the matches appear. If nothing matches, the list is empty.

For example, let's look for **"cardiovascular"** in this string:

In [None]:
text = "cardiovascular: patient has cardiovascular disease. Will work with his cardiologist to establish a regimen."
re.findall('cardiovascular', text)

Let's look for the shorter string **"cardio"**:

In [None]:
re.findall('cardio', text)
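As a quick check on what `findall` hands back, the sketch below shows that the result is an ordinary Python list, so it can be counted with `len`, and that a pattern with no match yields an empty list rather than an error. The variable names and the search term 'renal' are illustrative and not part of the original notebook.

```python
import re

# Illustrative sentence and names, chosen for this sketch only.
sample = "cardiovascular: patient has cardiovascular disease. Will work with his cardiologist."

cardio_matches = re.findall('cardio', sample)
print(cardio_matches)       # one list entry per occurrence
print(len(cardio_matches))  # number of occurrences found

# A pattern that never occurs gives an empty list, not an exception.
missing = re.findall('renal', sample)
print(missing)
```

Note that 'cardio' is found inside both "cardiovascular" and "cardiologist": a plain pattern matches anywhere in the string, not only at whole words.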

## Simple expressions
The simplest regular expressions are very easy to understand.  They will just match any sequence of characters that match the pattern that you provide.  Let's do a few examples of that:

In [None]:
re.findall('and', "We walked and walked and walked.")

In [None]:
re.findall('walked', "We walked and walked and walked.")

## Case sensitivity
Note that regular expressions are very explicit in what they match. This includes case sensitivity (i.e., lowercase versus uppercase characters).

Given this, will the expression below find a match?

In [None]:
re.findall('pneumonia', "Pneumonia reports were inconclusive")

One way to get around this is to use the IGNORECASE flag, which makes the match ignore the difference between lower- and upper-case letters:

In [None]:
re.findall('pneumonia', "Pneumonia reports were inconclusive", flags=re.IGNORECASE)
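`re.IGNORECASE` also has the short alias `re.I`, and the same behavior can be requested inside the pattern itself with the inline flag `(?i)`. A minimal sketch comparing the three spellings (the variable names are illustrative):

```python
import re

sentence = "Pneumonia reports were inconclusive"

# Three equivalent ways to ask for a case-insensitive match.
with_flag = re.findall('pneumonia', sentence, flags=re.IGNORECASE)
with_alias = re.findall('pneumonia', sentence, flags=re.I)
with_inline = re.findall('(?i)pneumonia', sentence)

print(with_flag, with_alias, with_inline)
```

In each case the returned text keeps the capitalization found in the string, not the capitalization used in the pattern.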

## Character Classes
Often, we won't necessarily want to match one exact character, but instead match any of a set of characters. We can do that in **character classes**, where we enclose a number of characters in square brackets. Python will then match any character within those brackets. 

* [pP] = Lowercase p or Uppercase P
* [abcde]  = Lowercase a, b, c, d, or e

In [None]:
re.findall('[pP]neumonia', "Pneumonia reports were inconclusive")

In [None]:
re.findall('[abcde]', 'abcdefghijklmnop')

Alternatively, a range of characters can be specified like this:

* [a-e] = Any character between lowercase a and lowercase e (this is equivalent to the example above)

In [None]:
text = "Which of these 46 characters will this return?"

In [None]:
re.findall("[a-l]", text)

In [None]:
re.findall("[m-z]", text)

Additional examples of ranges include:
- [0-9] = any digit from 0 to 9, e.g., 0, 1, 2, 3, 4, 5, ..., 9
- [A-Z] = any upper-case letter
- [a-zA-Z0-9] = all of the above

In [None]:
re.findall("[0-9]", text)

In [None]:
re.findall("[0-9a-zA-Z]", text)
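To see what a combined class leaves out, here is a small sketch on a made-up vitals string (the string and variable names are assumptions for illustration). Spaces and punctuation are not in the class, so they are skipped, and each match is still a single character.

```python
import re

# Hypothetical snippet of clinical shorthand.
note = "BP 120/80, HR 72."

alnum = re.findall('[0-9a-zA-Z]', note)
digits = re.findall('[0-9]', note)

print(alnum)           # letters and digits only, one character per match
print(''.join(alnum))  # the same characters glued back together
print(digits)          # digits only
```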

## "OR" matching and groups
Regular expressions are powerful because special characters allow us to match multiple sequences or regular expressions at once:

* | = match either regular expression on either side of the | symbol  

From the Python documentation linked above : 'A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way'.

If we put a list of different options in parentheses **()** and separate the list by "|", then we will match any of the strings within the parentheses.

In [None]:
color_text = 'i love the colors blue, green, red and purple but I do not like orange.'

color_expression = '(blue|green|red|purple)'
color_matches = re.findall(color_expression, color_text)

print(color_matches)
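Two details of "OR" matching are worth checking by hand. The sketch below (variable names and the second sentence are illustrative) suggests that when the whole pattern is one parenthesized group, `findall` returns the same list as the ungrouped pattern, and that alternatives match anywhere in the text, including inside longer words.

```python
import re

colors = 'i love the colors blue, green, red and purple but I do not like orange.'

grouped = re.findall('(blue|green|red|purple)', colors)
ungrouped = re.findall('blue|green|red|purple', colors)
print(grouped)
print(grouped == ungrouped)

# Alternatives are not limited to whole words:
# 'red' is found inside 'bored' and 'blue' inside 'blueprint'.
inside_words = re.findall('red|blue', 'I was bored by the blueprint')
print(inside_words)
```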

## Quiz : "OR" matching

Change the regular expression below so that 'hair' is also part of the match.

In [None]:
anatomy_text = 'Patient physical showed that they had heart, head, nose, mouth and hair'

# update the regular expression below so that 'hair' is matched by the code below
anatomy_expression = '(heart|head|nose|mouth)'

anatomy_matches = re.findall(anatomy_expression, anatomy_text)

print('Anatomy matches : ' + str(anatomy_matches))

## Quantifiers
A character class on its own matches one character at a time, which often isn't what we want. For example, if we're searching for numbers in a text and see **"46"**, we probably don't want to get **"4"** and **"6"** separately: we want **"46"**. To do this, we can add a quantifier after a character or class to specify how many times it may repeat within a single match:

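- **\*** : match 0 or more of the preceding character or class
- **?** : match 0 or 1 of the preceding character or class
- **+** : match 1 or more of the preceding character or class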

In [None]:
text = "patient is a 91-year old male with 2 broken fingers."

In [None]:
re.findall('[0-9]', text)

In [None]:
re.findall('[0-9]+', text)

In [None]:
re.findall('[a-z]+', text)

In [None]:
re.findall('[a-z0-9]+', text)
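The cells above only exercise `+`. As a side-by-side comparison of all three quantifiers, here is a minimal sketch on an invented string of spellings (the string is an assumption for illustration; "colouur" is a deliberate misspelling). Each quantifier applies only to the character immediately before it, here the letter u.

```python
import re

spellings = "color colour colouur"

# ? : the 'u' may appear 0 or 1 times
optional_u = re.findall('colou?r', spellings)
# * : the 'u' may appear 0 or more times
any_u = re.findall('colou*r', spellings)
# + : the 'u' must appear at least once
some_u = re.findall('colou+r', spellings)

print(optional_u)
print(any_u)
print(some_u)
```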

To practice this, let's look at these three texts:

In [None]:
texts = [
    "sheep say 'b'",
    "sheep say 'ba'",
    "sheep say 'baaaaaaaaaa'",
]

First, write a pattern to match "b" and "ba", but not "baaaaaaaaaa"

In [None]:
pattern = ___
for text in texts:
    print(re.findall(pattern, text))

Now, write a pattern which will match "ba" and "baaaaaaaaaa", but not "b"

In [None]:
pattern = ___
for text in texts:
    print(re.findall(pattern, text))

Finally, write a pattern which will match all 3:

In [None]:
pattern = ___
for text in texts:
    print(re.findall(pattern, text))

## Back to our initial motivating example... 
At the beginning of this notebook we listed this example which is relevant to our pneumonia work:

$${\bf infiltrat}e$$
$${\bf infiltrat}ed$$
$${\bf infiltrat}ion$$

In [None]:
infiltrate_text = 'The infiltrate then infiltrated the thing resulting infiltration'

So how can we match these 3 words?  There are a number of ways:

*Option 1:* Include all 3 words in a single pattern separated by "|"

In [None]:
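# List 'infiltrated' before 'infiltrate': alternatives are tried left to right,
# so the shorter word would otherwise match the start of 'infiltrated'.
infiltrate_expression_1 = r'infiltrated|infiltrate|infiltration'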

In [None]:
re.findall(infiltrate_expression_1, infiltrate_text)

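*Option 2:* Factor out the shared stem "infiltrat" and list only the endings in a group

In [None]:
# 'ed' comes before 'e' so that 'infiltrated' is matched in full.
infiltrate_expression_2 = r'infiltrat(ed|e|ion)'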

In [None]:
re.findall(infiltrate_expression_2, infiltrate_text)

**Wait** - that doesn't look right.

Because the second expression contains a "group" in parentheses, `findall()` returns only the text captured by that group (the word endings) instead of the whole match. To see the full matches, we can iterate over the match objects from `finditer()` and print each one with `group()`:

In [None]:
infiltrate_iter = re.finditer(infiltrate_expression_2, infiltrate_text)

for i in infiltrate_iter:
    print(i.group())
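Each object yielded by `finditer` carries more than the matched text. The sketch below shows `group(0)` (the whole match), `group(1)` (the first captured group), and `span()` (start and end positions). It also shows one possible alternative that the original notebook does not use: a non-capturing group `(?:...)`, which lets `findall` return whole matches directly. The variable names are illustrative, and the ending 'ed' is listed before 'e' so that "infiltrated" is matched in full.

```python
import re

sentence = 'The infiltrate then infiltrated the thing resulting infiltration'

# Capturing group: inspect each match object.
capturing = r'infiltrat(ed|e|ion)'
endings = []
for m in re.finditer(capturing, sentence):
    print(m.group(0), m.group(1), m.span())
    endings.append(m.group(1))

# Non-capturing group: findall now returns the full matched words.
non_capturing = r'infiltrat(?:ed|e|ion)'
whole_words = re.findall(non_capturing, sentence)
print(whole_words)
```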

<br/><br/>This material was originally presented as part of the DeCART Data Science for the Health Science Summer Program at the University of Utah in 2017.<br/>
Presenters : Dr. Wendy Chapman, Jianlin Shi and Kelly Peterson