# Regular Expression for Concept Extraction

## 1. Language

With language, we use a large but finite vocabulary to describe the world. These descriptions are composed of words and phrases that can be predicted with some regularity. For instance, when describing 'infiltration', we can use words like 'infiltration', 'infiltrate', and 'infiltrated', which are spelled with letters placed together with some regularity. We can readily identify all three lexical variants by identifying the pattern used to construct them. What do these three words have in common?


$${\bf infiltrat}e$$
$${\bf infiltrat}ed$$
$${\bf infiltrat}ion$$

## 2. Representing patterns

Regular expressions (also known as regex) are strings that represent the regularity in the text that you would like to leverage to extract a concept. Regex is a powerful and commonly used tool for natural language processing and has been used for several tasks:

* redacting patients' protected health information from text for research
* extracting family history information for genetic studies
* identifying and classifying a patient's smoking status
* parsing an Excel spreadsheet to generate population statistics about childhood asthma

For this lesson, we will complete a high-level overview of regular expressions. For more detailed documentation and explanations, please see the <a href="https://docs.python.org/3/library/re.html">Python regex documentation</a>.

Let's get started. We will begin by importing the re module from the Python standard library.

In [None]:
import re
from IPython.display import display, Math, Latex
from IPython.display import YouTubeVideo

Before we begin, we will review how regular expressions are applied. The general approach is 1) to compile the search pattern and then 2) to apply the compiled regex to a segment of text:

$$compiled\_expression = re.compile(pattern)$$
$$extracted\_text = compiled\_expression.match(string)$$

However, we can simplify this two-step process into one step as follows:

$$result = re.match(pattern, string)$$
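
As a minimal sketch of the two approaches above, the cell below compiles a pattern and applies it, then performs the same match in one step. The pattern and sample sentence are assumptions chosen only for illustration.

```python
import re

# Illustrative pattern and text (assumptions chosen for this sketch)
pattern = "fever"
string = "fever and chills for two days"

# Two-step approach: compile the pattern, then apply the compiled regex
compiled_expression = re.compile(pattern)
extracted_text = compiled_expression.match(string)

# One-step approach: pass the pattern and the string together
result = re.match(pattern, string)

# Both approaches return a match object covering the same text
print(extracted_text.group(), extracted_text.span())
print(result.group(), result.span())
```

Compiling first is mostly a convenience when the same pattern will be applied to many strings.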


## 3. Applying regexs

There are several ways a regex pattern can be applied to text (below are just a few):
- search: checks for a match anywhere in the string

In [None]:
re.search("and", "head and eyes and ears and nose and throat")

- match: checks for a match only at the beginning of the string

In [None]:
re.match('cardiovascular', "patient has cardiovascular disease.")

- findall: returns all matches of the expression from the string

In [None]:
re.findall('cardiovascular', "cardiovascular: patient has cardiovascular disease.")

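- finditer: returns an iterator that yields a match object for each match of the expression in the string. The example below also uses a lookahead, (?= in), so that only a 'pain' followed by ' in' is matched.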

In [None]:
for extracted in re.finditer('pain(?= in)', "pain on the neck; pain in the jaw"):
    print(extracted)
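
To compare the four functions side by side, here is a small sketch that applies each of them to the same sentence. The sentence is made up for illustration.

```python
import re

text = "pain in the neck; no pain in the jaw"  # made-up sentence

at_start = re.match("pain", text)   # match object: the text begins with 'pain'
not_start = re.match("jaw", text)   # None: 'jaw' is not at the beginning
anywhere = re.search("jaw", text)   # match object: 'jaw' occurs later on
all_hits = re.findall("pain", text) # list of the matched strings
spans = [m.span() for m in re.finditer("pain", text)]  # offsets of each match

print(at_start)
print(not_start)
print(anywhere)
print(all_hits)
print(spans)
```

Note that findall returns plain strings, while finditer yields match objects, which also carry the position of each match.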


## 4. Acting like string functions we already know

Regular expressions also share some common functions with strings (below are just a few):
- split: splits the text string into smaller parts wherever the pattern matches, returning a list of the remaining text pieces
        

In [None]:
output = re.split("and", "pain  and extreme discomfort")
for item in output:
    # strip the surrounding whitespace and mark the end of each piece with "<"
    print(item.strip() + "<")


- sub: substitutes the string extracted by the pattern with a new string, returning a string

In [None]:
re.sub("and", "or", "head and eyes and ears and nose and throat") 

- subn: substitutes the string extracted by the pattern with a new string, returning a tuple with the replaced string and the number of times the substitution was made


In [None]:
re.subn("and", "or", "head and eyes and ears and nose and throat") 

## 5. Character classes

Regular expressions describe the text that you would like to include or exclude when extracting a string that represents a concept. For instance, in the sentence below, we can 1) write a regular expression that identifies only numbers using the *regular expression character class* string "\d", which matches one digit, and 2) apply the expression to the sentence text with findall, which compiles the pattern for us.

In [None]:
# a raw string (r'...') keeps the backslash in \d from being treated as a Python string escape
re.findall(r'\d', "patient is a 91-year old male with 2 broken fingers.")

Conversely, we can write and apply a *regular expression character class* string that identifies everything but the numbers by using the string "\D".
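
A short sketch of this, reusing the sentence from the cell above. The `+` after `\D` is an addition for readability: it groups each run of non-digits into one string instead of returning one character at a time.

```python
import re

sentence = "patient is a 91-year old male with 2 broken fingers."

# \d matches one digit at a time
digits = re.findall(r"\d", sentence)

# \D matches one non-digit; \D+ collects each run of non-digits
non_digits = re.findall(r"\D+", sentence)

print(digits)
print(non_digits)
```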

## 6. Coding exercise

Create and apply the expressions \w+ and \W+ to the text above. What do you see?

In [None]:
re.findall(r'\w+', "patient is a 9-year old male with 2 broken fingers.")

There are a handful of *regular expression character classes*, including:

* \d   =  One digit
* \D   =  One non-digit
* \s   =  One whitespace character
* \S   =  One non-whitespace character
* \w   =  One word character
* \W   =  One non-word character
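
To see all six classes in action, the sketch below applies each one (with `+`, meaning one or more) to the same short string. The string is a made-up example.

```python
import re

text = "BP 120/80, HR 72"  # made-up string, used only for illustration

classes = {
    r"\d+": "digits",
    r"\D+": "non-digits",
    r"\s+": "whitespace",
    r"\S+": "non-whitespace",
    r"\w+": "word characters",
    r"\W+": "non-word characters",
}

results = {}
for pattern, meaning in classes.items():
    # findall returns every non-overlapping run matched by the class
    results[pattern] = re.findall(pattern, text)
    print(pattern, "(" + meaning + "):", results[pattern])
```

Each upper-case class matches exactly the characters that its lower-case counterpart does not.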


## 7. Sequencing range and quantifier patterns 

For some string extractions, we may be interested in restricting the sequence of characters that can and cannot occur before or after a set of characters. For instance, in the example below, we are interested in the characters f-e-v-e-r followed by zero or more alphabetical letters. We can write a regular expression to represent this constraint using two components: a *sequencing range* and a *quantifier*. A *sequencing range*, denoted with [], indicates the range of characters (e.g., alphabetical letters or numbers) that can occur at a position. A *quantifier* indicates how many characters from the range can occur.

*Sequencing range* examples include:
- [0-9] = any digit between 0 and 9, e.g., 0, 1, 2, 3, 4, 5, ..., 9
- [a-z] = any lowercase letter between a and z, e.g., a, b, c, d, ..., z
- [A-Z] = any capital letter between A and Z, e.g., A, B, C, ..., Z
- [A-Za-z0-9] = any alphanumeric character regardless of case, e.g., A, a, B, b, ..., 0, 1, ...


*Quantifier* examples include:
- \*	 = 0 or more
- \+	 = 1 or more
- ?	 = 0 or 1
- {5} = 5 exactly
- {5,7} = between 5 and 7
- {1,} = 1 or more (written without spaces; Python treats "{1, }" as literal text)
- {,4} = up to 4
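
The sketch below checks each quantifier from the list against a few candidate strings of digits. It uses `re.fullmatch`, which is not covered in this lesson and is used here only because it requires the whole string to match, making the counts easy to see. The candidate strings are made up.

```python
import re

candidates = ["", "7", "42", "12345", "1234567"]  # made-up test strings

quantified = {
    "*": r"\d*",
    "+": r"\d+",
    "?": r"\d?",
    "{5}": r"\d{5}",
    "{5,7}": r"\d{5,7}",
    "{1,}": r"\d{1,}",
    "{,4}": r"\d{,4}",
}

accepted = {}
for label, pattern in quantified.items():
    # keep the candidates that the pattern matches from start to end
    accepted[label] = [c for c in candidates if re.fullmatch(pattern, c)]
    print(label, accepted[label])
```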

Let's create a regular expression using a sequencing range and a quantifier above.

In [None]:
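txt="Patient states fevers she felt feverish before developing many fevers, but no longer has a fever."
fevers={}
# 'fever' followed by zero or more lowercase letters: range [a-z], quantifier *
for fever in re.findall(r'fever[a-z]*', txt):
    # count how many times each lexical variant appears
    if fever in fevers:
        fevers[fever]+=1
    else:
        fevers[fever]=1
print(fevers)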


## 8. Positional patterns

For some expressions the position of the string is an important aspect to capture. 

*Positional* examples:
- ^ Match the start of the string
- $ Match the end of the string
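
As a quick sketch of the two anchors, the cell below searches a made-up line for the same word at the start and at the end, and shows that an anchored pattern fails when the word is elsewhere.

```python
import re

line = "fever: patient denies fever"  # made-up line for illustration

at_start = re.search(r"^fever", line)      # only the 'fever' that begins the string
at_end = re.search(r"fever$", line)        # only the 'fever' that ends the string
not_at_start = re.search(r"^patient", line)  # None: 'patient' is not at the start

print(at_start.span())
print(at_end.span())
print(not_at_start)
```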

For example, in the sentence below, we may only be interested in identifying the headers from the text strings. How can we apply the *positional expression* to capture the headers?

In [None]:
txt="cardiovascular: patient has cardiovascular"
# ^ anchors the pattern to the start of the string, where the header is
re.search('^cardiovascular', txt)

## 9. Obtaining span offsets

We can do more than find strings: we can also obtain information about what was matched and where. Below, match with the pattern '^cardiovascular' identifies the header at the start of the sentence rather than the occurrence near the end, and the returned match object gives us the matched string and its span.

In [None]:
matched=re.match('^cardiovascular', txt)
print(matched)
print("match:", matched.group())
print("span:", matched.start(),matched.end())

Identifying the start and end of a string can be important for many reasons:
* marking up information of interest in the text using HTML or XML
* redacting sensitive information from text

Below we demonstrate how to map the matched string back onto the original text by slicing the text with the start and end of the matched pattern.

In [None]:
txt[matched.start():matched.end()]
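
The same offsets support the redaction use case listed above. The following is a minimal sketch, assuming a made-up note and a simple date pattern. The `[DATE]` placeholder is an arbitrary choice for illustration, not a standard.

```python
import re

note = "Patient Name: Jane Doe was admitted on 6/15/2015."  # made-up text

# find a date written as month/day/four-digit year
found = re.search(r"\d{1,2}/\d{1,2}/\d{4}", note)

# rebuild the text around the matched span, replacing it with a placeholder
redacted = note[:found.start()] + "[DATE]" + note[found.end():]
print(redacted)
```

Everything before `found.start()` and after `found.end()` is kept unchanged, so only the matched span is hidden.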

## 10. Extracting a set of strings

In some cases, we may want to extract a group of strings that match a more complex pattern.
To do this we will use a pattern with the following syntax: $$(?P<name>...)$$ 

Note you can retrieve the individual span parts using the 'name' or position in the retrieved tuple. 

In [None]:
txt="admission date:6/15-2015."
# [-/] accepts either a hyphen or a slash between the month and the day
fullSpan=re.match(r"admission date:\s*(?P<month>\d{1,2})[-/](?P<day>\d{1,2})-(?P<year>\d{2,4})", txt)
print(fullSpan, "\n")

print("retrieved as a tuple / full span")
print("all parts:", fullSpan.groups(), "=", fullSpan.group(0), "\n")

print("retrieved as a dictionary")
print(fullSpan.groupdict(),"\n")

print("retrieved in parts")
print("month:",fullSpan.group('month'), "=", fullSpan.group(1) )
print("day:",fullSpan.group('day'), "=", fullSpan.group(2))
print("year:",fullSpan.group('year'), "=", fullSpan.group(3))


## 11. Coding exercise

Create an expression using the group syntax above to identify the patient name (Jane Doe) of the following text. Then replace it with the new name (Mary Lamb).
    

In [None]:
txt="Patient Name: Jane Doe"
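# one possible solution: the lookbehind (?<=...) requires the label before the name without replacing it
re.subn(r"(?<=Patient Name: )\w+ \w+", "Mary Lamb", txt)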
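
An alternative sketch that uses the named-group syntax from the previous section instead of a lookbehind. The group name `name` is an arbitrary choice. The span of the group alone is used to slice the new name into the text.

```python
import re

txt = "Patient Name: Jane Doe"

# capture only the two words after the label in a group called 'name'
found = re.search(r"Patient Name: (?P<name>\w+ \w+)", txt)
print(found.group("name"))

# start('name') and end('name') give the span of the named group only
renamed = txt[:found.start("name")] + "Mary Lamb" + txt[found.end("name"):]
print(renamed)
```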


## 12. For more detailed information about regexs

Please view the following video by Professors Dan Jurafsky and Chris Manning.


In [None]:
YouTubeVideo("hwDhO1GLb_4?list=PL6397E4B26D00A269", width = 800, height=600)

## 13. Want to try some expressions without coding?
Try [pythex](http://pythex.org/) 