# Regular Expressions

* So far we have been reading through files, looking for patterns and extracting various bits of lines that we find interesting. We have been using string methods like `split` and `find` and using lists and string slicing to extract portions of the lines.
* This task of searching and extracting is so common that Python has a very powerful library called *regular expressions* that handles many of these tasks quite elegantly.
* We will only cover the basics of regular expressions. For more detail, see:
  * http://en.wikipedia.org/wiki/Regular_expression
  * https://docs.python.org/3.6/library/re.html
* The regular expression library `re` must be imported into your program before you can use it. 
* The simplest use of the regular expression library is the `search()` function.

In [2]:
import re

In [3]:
# Search for lines that contain 'From:'
with open('data/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


* We open the file, loop through each line, and use the regular expression `search()` to only print out lines that contain the string "From:". 
* This program does not use the real power of regular expressions, since we could have just as easily used `line.find()` to accomplish the same result.
* The power of regular expressions comes when we add special characters to the search string, which let us control more precisely which lines match.
* Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.
* For example, the caret character is used in regular expressions to match "the beginning" of a line. We could change our program to only match lines where "From:" was at the beginning of the line as follows:

In [4]:
# Search for lines that start with 'From:'
with open('data/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


* This is still a very simple example that we could have done equivalently with the `startswith()` method from the string library. 
* But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.
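
The equivalence with the string methods can be checked directly. The following is a minimal sketch: the three sample lines are hypothetical (modeled on the mbox output above), and the variable names are chosen only for illustration.

```python
import re

# Hypothetical sample lines, modeled on the mbox output shown above
lines = [
    "From: stephen.marquard@uct.ac.za",
    "X-Note: Reply-To and From: headers were rewritten",
    "Subject: hello",
]

# re.search() returns a match object or None, so bool() turns it into True/False
contains_re = [bool(re.search("From:", s)) for s in lines]
contains_str = [s.find("From:") != -1 for s in lines]

# The caret anchors the pattern to the start of the string
starts_re = [bool(re.search("^From:", s)) for s in lines]
starts_str = [s.startswith("From:") for s in lines]

print(contains_re, contains_str)
print(starts_re, starts_str)
```

For these simple patterns the regular expression and the string method agree line by line; the two approaches only diverge once special characters beyond `^` are used.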

## Character Matching in Regular Expressions

* `.` matches any single character
  * E.g., the regular expression `"F..m:"` would match any of the strings `"From:"`, `"Fxxm:"`, `"F12m:"`, or `"F!@m:"`
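
As a quick check of the claim above, this small sketch tries `"F..m:"` against the four listed strings plus two hypothetical counterexamples (`"Fm:"` and `"Frooom:"`) that do not have exactly two characters between `F` and `m:`.

```python
import re

# The first four strings come from the text; the last two are made-up counterexamples
candidates = ["From:", "Fxxm:", "F12m:", "F!@m:", "Fm:", "Frooom:"]

# Each '.' must consume exactly one character
matched = [s for s in candidates if re.search("F..m:", s)]
print(matched)
```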

In [5]:
# Match lines that start with 'F', any two characters, then 'm'
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        if re.search('^F..m', line):
            print(line)

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008
From: louis@media.berkeley.edu
From zqian@umich.edu Fri Jan  4 16:10:39 2008
From: zqian@umich.edu
From rjlowe@iupui.edu Fri Jan  4 15:46:24 2008
From: rjlowe@iupui.edu
From zqian@umich.edu Fri Jan  4 15:03:18 2008
From: zqian@umich.edu
From rjlowe@iupui.edu Fri Jan  4 14:50:18 2008
From: rjlowe@iupui.edu
From cwen@iupui.edu Fri Jan  4 11:37:30 2008
From: cwen@iupui.edu
From cwen@iupui.edu Fri Jan  4 11:35:08 2008
From: cwen@iupui.edu
From gsilver@umich.edu Fri Jan  4 11:12:37 2008
From: gsilver@umich.edu
From gsilver@umich.edu Fri Jan  4 11:11:52 2008
From: gsilver@umich.edu
From zqian@umich.edu Fri Jan  4 11:11:03 2008
From: zqian@umich.edu
From gsilver@umich.edu Fri Jan  4 11:10:22 2008
From: gsilver@umich.edu
From wagnermr@iupui.edu Fri Jan  4 10:38:42 2008
From: wagnermr@iupui.edu
From zqian@umich.edu Fri Jan  4 10:17:43 2008
From: zqian@

* `*` matches the previous character zero or more times
* `+` matches the previous character one or more times
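
The difference between `*` and `+` only shows up when the repeated character is absent. The sketch below uses three hypothetical strings; the first has nothing between `From:` and `@`.

```python
import re

# Hypothetical inputs; only the first has nothing between ':' and '@'
samples = ["From:@", "From: a@b", "From:x@"]

# '.*' accepts zero characters before '@', '.+' requires at least one
star = [bool(re.search("^From:.*@", s)) for s in samples]
plus = [bool(re.search("^From:.+@", s)) for s in samples]

print(star)
print(plus)
```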

In [6]:
# Match 'From:' lines that have at least one character before an '@'
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        if re.search('^From:.+@', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


## Extracting data using regular expressions

* `findall()` returns a list of all the non-overlapping substrings that match a regular expression.

In [8]:
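# Read the whole file into one string and extract every token containing '@'
with open('data/mbox-short.txt') as fh:
    lines = fh.read()
# Raw string so that \S reaches the regex engine unchanged
lst = re.findall(r'\S+@\S+', lines)
print(lst)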

['stephen.marquard@uct.ac.za', '<postmaster@collab.sakaiproject.org>', '<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>', '<source@collab.sakaiproject.org>;', '<source@collab.sakaiproject.org>;', '<source@collab.sakaiproject.org>;', 'apache@localhost)', 'source@collab.sakaiproject.org;', 'stephen.marquard@uct.ac.za', 'source@collab.sakaiproject.org', 'stephen.marquard@uct.ac.za', 'stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', '<postmaster@collab.sakaiproject.org>', '<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>', '<source@collab.sakaiproject.org>;', '<source@collab.sakaiproject.org>;', '<source@collab.sakaiproject.org>;', 'apache@localhost)', 'source@collab.sakaiproject.org;', 'louis@media.berkeley.edu', 'source@collab.sakaiproject.org', 'louis@media.berkeley.edu', 'louis@media.berkeley.edu', 'zqian@umich.edu', '<postmaster@collab.sakaiproject.org>', '<200801042109.m04L92hb007923@nakamura.uits.iupui.edu>', '<source@collab.sakaiproject.org>;', '<source@collab.sa

In [9]:
# Same extraction, but the match must start with a letter or digit and end with a letter
with open('data/mbox-short.txt') as fh:
    lines = fh.read()
lst = re.findall(r'[a-zA-Z0-9]\S+@\S+[a-zA-Z]', lines)
print(lst)

['stephen.marquard@uct.ac.za', 'postmaster@collab.sakaiproject.org', '200801051412.m05ECIaH010327@nakamura.uits.iupui.edu', 'source@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'apache@localhost', 'source@collab.sakaiproject.org', 'stephen.marquard@uct.ac.za', 'source@collab.sakaiproject.org', 'stephen.marquard@uct.ac.za', 'stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', 'postmaster@collab.sakaiproject.org', '200801042308.m04N8v6O008125@nakamura.uits.iupui.edu', 'source@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'apache@localhost', 'source@collab.sakaiproject.org', 'louis@media.berkeley.edu', 'source@collab.sakaiproject.org', 'louis@media.berkeley.edu', 'louis@media.berkeley.edu', 'zqian@umich.edu', 'postmaster@collab.sakaiproject.org', '200801042109.m04L92hb007923@nakamura.uits.iupui.edu', 'source@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'source@collab.sakaip
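
To see why the second pattern produces cleaner addresses, it helps to run both on one short string. This is a sketch on a hypothetical line assembled from tokens that appear in the output above; `loose` and `tight` are illustrative names.

```python
import re

# Hypothetical line built from tokens that appear in the output above
text = ("Received: from <postmaster@collab.sakaiproject.org>; "
        "for apache@localhost) by source@collab.sakaiproject.org;")

# \S+ on both sides grabs any surrounding punctuation as well
loose = re.findall(r"\S+@\S+", text)

# Requiring a letter or digit first and a letter last trims '<', '>', ';' and ')'
tight = re.findall(r"[a-zA-Z0-9]\S+@\S+[a-zA-Z]", text)

print(loose)
print(tight)
```

The second pattern works because `\S+` backtracks until the final character of the match is a letter, which drops the trailing punctuation.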

## Combining searching & extracting

If we want to find numbers on lines that start with the string `"X-"` such as:
> `X-DSPAM-Confidence: 0.8475` 

> `X-DSPAM-Probability: 0.0000`

We don’t just want any floating-point numbers from any lines. We only want to extract numbers from lines that have the above syntax.

In [11]:
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        # Raw string so that \S reaches the regex engine unchanged
        if re.search(r'^X\S*: [0-9]+', line):
            print(line)

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co

* Parentheses are another special character in regular expressions. 
  * When you add parentheses to a regular expression, they do not change which strings match.
  * But when you are using `findall()`, parentheses indicate that while you want the whole expression to match, you are only interested in extracting the portion of the match inside the parentheses.
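
The effect of the parentheses is easiest to see on a single line. This sketch applies the same pattern with and without a group to one of the `X-DSPAM` lines quoted above; `whole` and `number` are illustrative names.

```python
import re

line = "X-DSPAM-Confidence: 0.8475"

# Without parentheses, findall() returns the entire matched text
whole = re.findall(r"^X\S*: [0-9.]+", line)

# With parentheses, findall() returns only the captured part
number = re.findall(r"^X\S*: ([0-9.]+)", line)

print(whole)
print(number)
```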

In [13]:
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        # The parentheses make findall() return only the number
        x = re.findall(r'^X\S*: ([0-9.]+)', line)
        if len(x) > 0:
            print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
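Note that `findall()` returns strings, not numbers. If the values are needed for arithmetic, they have to be converted first. The sketch below is one way to do that, assuming we only care about the `X-DSPAM-Confidence` lines; the sample lines are hypothetical stand-ins for the file, and `values` and `average` are illustrative names.

```python
import re

# Hypothetical stand-in for lines read from the file
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-DSPAM-Confidence: 0.6178",
    "Subject: not a number line",
]

values = []
for line in lines:
    found = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    if found:
        # found[0] is a string such as '0.8475'; convert it before doing math
        values.append(float(found[0]))

average = sum(values) / len(values)
print(values, average)
```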


Search for lines that start with `'Details:'` and contain `'rev='` followed by one or more digits or `'.'` characters, then print the extracted revision number:

In [14]:
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        # Capture the digits (and dots) that follow 'rev='
        x = re.findall('^Details:.*rev=([0-9.]+)', line)
        if len(x) > 0:
            print(x)

['39772']
['39771']
['39770']
['39769']
['39766']
['39765']
['39764']
['39763']
['39762']
['39761']
['39760']
['39759']
['39758']
['39757']
['39756']
['39755']
['39754']
['39753']
['39752']
['39751']
['39750']
['39749']
['39746']
['39745']
['39744']
['39743']
['39742']


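Search for lines that start with `'From '` and capture the two characters (digits or `'.'`) that come immediately before a `':'`. Because `.*` is greedy, it consumes as much of the line as it can, so the captured pair is the one before the last `':'` on the line. That is the minutes field of the timestamp (`14` in `09:14:16`), not the hour.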

In [15]:
with open('data/mbox-short.txt') as fh:
    for line in fh:
        line = line.rstrip()
        # Greedy '.*' pushes the captured pair to the last ':' on the line
        x = re.findall('^From .*([0-9.]{2}):', line)
        if len(x) > 0:
            print(x)

['14']
['10']
['10']
['46']
['03']
['50']
['37']
['35']
['12']
['11']
['11']
['10']
['38']
['17']
['04']
['05']
['02']
['08']
['49']
['33']
['07']
['51']
['18']
['07']
['34']
['29']
['23']
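
The output above shows minutes rather than hours. The following sketch runs the pattern on the first `From ` line of the file and compares it with one possible alternative that captures the hour instead. The alternative pattern (a space, exactly two digits, then `:`) is an assumption about what one might want, not part of the original cell.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# Greedy '.*' backtracks only as far as the LAST ':' on the line
minutes = re.findall(r"^From .*([0-9.]{2}):", line)

# Requiring a space before the two digits pins the match to the hour field
hour = re.findall(r"^From .* ([0-9][0-9]):", line)

print(minutes)
print(hour)
```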


## Summary

`^` - Matches the beginning of the line.

`$` - Matches the end of the line.

`.` - Matches any single character except a newline (a wildcard).

`\s` - Matches a whitespace character.

`\S` - Matches a non-whitespace character (the opposite of `\s`).

`*` - Applies to the immediately preceding character (or group) and matches it zero or more times.

`*?` - Applies to the immediately preceding character (or group) and matches it zero or more times in "non-greedy mode", that is, consuming as few characters as possible.

`+` - Applies to the immediately preceding character (or group) and matches it one or more times.

`+?` - Applies to the immediately preceding character (or group) and matches it one or more times in "non-greedy mode".

`[aeiou]` - Matches a single character as long as that character is in the specified set. In this example, it would match "a", "e", "i", "o", or "u", but no other characters.

`[a-z0-9]` - You can specify ranges of characters using a hyphen. This example matches a single character that must be a lowercase letter or a digit.

`[^A-Za-z]` - When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.

`( )` - When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using `findall()`.

`\b` - Matches the empty string, but only at the start or end of a word.

`\B` - Matches the empty string, but not at the start or end of a word.

`\d` - Matches any decimal digit; for ASCII text this is equivalent to the set `[0-9]`.

`\D` - Matches any non-digit character; for ASCII text this is equivalent to the set `[^0-9]`.
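
Several of the entries above were not used in the earlier cells. The sketch below exercises them on small hypothetical strings so the behavior can be checked by hand; all inputs and variable names are made up for illustration.

```python
import re

text = "cat concat 42 cats"   # hypothetical sample string
markup = "<b>bold</b>"        # hypothetical sample string

# Greedy '+' takes as much as it can; non-greedy '+?' stops at the first '>'
greedy = re.findall(r"<.+>", markup)
lazy = re.findall(r"<.+?>", markup)

# \b requires a word boundary on both sides, so only the standalone word matches
whole_words = re.findall(r"\bcat\b", text)

# \d+ matches a run of digits; [^A-Za-z] matches any single non-letter
digits = re.findall(r"\d+", text)
non_letters = re.findall(r"[^A-Za-z]", "a1 b")

print(greedy, lazy)
print(whole_words, digits, non_letters)
```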