# Regular Expressions

* So far we have been reading through files, looking for patterns and extracting various bits of lines that we find interesting. We have been using string methods like `split` and `find` and using lists and string slicing to extract portions of the lines.
* This task of searching and extracting is so common that Python includes a powerful library for *regular expressions* that handles many of these tasks quite elegantly.
* We will only cover the basics of regular expressions. For more detail on regular expressions, see:
 * http://en.wikipedia.org/wiki/Regular_expression
 * https://docs.python.org/3.6/library/re.html
* The regular expression library `re` must be imported into your program before you can use it. 
* The simplest use of the regular expression library is the `search()` function.
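
As a minimal sketch of what `search()` gives back, the snippet below runs it on two made-up strings rather than on a file (the sample lines are illustrative assumptions, not taken from `mbox-short.txt`). `re.search()` returns a match object when the pattern is found anywhere in the string and `None` otherwise, which is why it can be used directly as the condition of an `if` statement.

```python
import re

# Two hypothetical sample lines, used only for illustration
hit = re.search('From:', 'From: someone@example.com')
miss = re.search('From:', 'Subject: hello')

print(hit is not None)   # a match object was returned
print(miss is None)      # no match, so search() returned None

# A match object is truthy and None is falsy, so this works as a test
if hit:
    print(hit.group())   # the text that matched the pattern
```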

In [1]:
# Search for lines that contain 'From:'
import re
with open('data/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


* We open the file, loop through each line, and use the regular expression `search()` to only print out lines that contain the string "From:". 
* This program does not use the real power of regular expressions, since we could have just as easily used `line.find()` to accomplish the same result.
* The power of regular expressions comes when we add special characters to the search string that allow us to control more precisely which lines match the string.
* Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.
* For example, the caret character is used in regular expressions to match "the beginning" of a line. We could change our program to only match lines where "From:" was at the beginning of the line as follows:

In [2]:
# Search for lines that start with 'From:'
import re
with open('data/mbox-short.txt') as hand:
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


* This is still a very simple example that we could have done equivalently with the string method `startswith()`.
* But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.
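
Since both runs above print the same lines, the effect of the caret is easier to see on a few hand-written strings. The following sketch (the sample lines and variable names are made up for illustration, not taken from `mbox-short.txt`) compares each regular expression with its string-method counterpart:

```python
import re

# Hypothetical sample lines; the second contains 'From:' but not at the start
lines = [
    'From: someone@example.com',
    'X-Note: copied From: an older message',
    'Subject: hello',
]

# Lines that contain 'From:' anywhere
contains_re = [line for line in lines if re.search('From:', line)]
contains_str = [line for line in lines if line.find('From:') != -1]

# Lines that begin with 'From:'
starts_re = [line for line in lines if re.search('^From:', line)]
starts_str = [line for line in lines if line.startswith('From:')]

print(contains_re == contains_str)  # both approaches select the same lines
print(starts_re == starts_str)
print(len(contains_re), len(starts_re))
```

With these sample lines, the pattern without the caret selects the first two lines, while the pattern with the caret selects only the first.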