In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../Data/www/styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

# Synopsis

Frequently we are simply looking for specific words or phrases in a block of text and do not care about the rest of the text. However, sometimes we are interested in a pattern of text (such as a phone number), where the format is consistent but the actual text itself changes. In this unit, we will learn:

1. What a regular expression is
2. Available functions in the `re` package
3. How to identify and extract a text pattern in a large block of text.
4. How to develop and test regular expressions

# Regular Expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (or regexes in shorthand) are essentially a language of their own and are not unique to Python. What they do is allow for complicated searches through text according to various criteria. If you're looking at a large document of text it's easy enough to search for the word "Northwestern". But what if I want to search for a pattern rather than a particular word, such as (xxx)xxx-xxxx where all the x's are digits? This would be a great way to find a phone number, but I'd have to do a lot of Cmd+F (Ctrl+F) searches if I searched for every possible 10-digit phone number.

Enter regular expressions. Regexes allow us to construct a generic text pattern that will then be matched through the entire body of text. There is a specific language that is used to build a regex and this language is both extremely powerful and complicated. 

You have the ability to construct very complicated and detailed regular expressions. However, as with any tool that is extremely powerful, obtuse, and difficult to debug, it is easy to construct a regular expression that does far more (or less) than you expect and have it generate incorrect answers. Constructing regular expressions at a master-level is an entire course in its own right, so keep that in mind and remember that the best way to build complex regular expressions is to **test, test, and test some more**.


# Regular expressions in Python

Regular expressions in Python are implemented in the `re` package. There are a few basic functions in the package that we will use:


* `re.match()` : Determine if the RE matches at the beginning of the string.
* `re.search()` : Scan through a string, looking for any location where this RE matches.
* `re.findall()` : Find all substrings where the RE matches, and return them as a list.
* `re.finditer()` : Find all substrings where the RE matches, and return them as an iterator object.


Now, let's go over an example so this is less abstract. We'll start with something easy: making a direct match to an explicit string (my name) using all of the different `re` methods.

In [1]:
import re

In [2]:
text_sample  = "Hi my name is Adam!"
print(re.match('Adam', text_sample))
print(re.search('Adam', text_sample))
print(re.findall('Adam', text_sample))
print(re.finditer('Adam', text_sample))

None
<_sre.SRE_Match object; span=(14, 18), match='Adam'>
['Adam']
<callable_iterator object at 0x106372a58>


`re.match` did what we would expect, returning `None` since the string didn't start with 'Adam'.

`re.findall` gave us an answer that we would mostly expect (i.e. a list of all occurrences it could find). 

`re.search` and `re.finditer` did something weird so let's investigate that a little bit further:

In [51]:
search_results = re.search('Adam', text_sample)

In [52]:
help(search_results)

Help on SRE_Match object:

class SRE_Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(...)
 |      end([group=0]) -> int.
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(...)
 |      expand(template) -> str.
 |      Return the string obtained by doing backslash substitution
 |      on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(...)
 |      groupdict([default=None]) -> dict.
 |      Return a dictionary containing all the named subgroups of the match,
 |      keyed by the subgroup name. The default argument is used for groups
 |      that did no

In [53]:
print(search_results.start())
print(search_results.end())
print(search_results.span())
print(search_results.group())

14
18
(14, 18)
Adam
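
As a small sketch of how these positions can be used, the span of a match can slice the original string directly. The name `missing` and the search for 'Eve' are only illustrative assumptions; the point is that a failed search returns `None`, so it is worth checking before calling any of these methods.

```python
import re

text_sample = "Hi my name is Adam!"
search_results = re.search('Adam', text_sample)

# span() gives (start, end), which can be used as slice indices
start, end = search_results.span()
print(text_sample[start:end])                            # Adam
print(text_sample[start:end] == search_results.group())  # True

# A failed search returns None, so check before calling .group()
missing = re.search('Eve', text_sample)
if missing is None:
    print("No match found")
```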


and what about `re.finditer()`?

In [9]:
for found_item in re.finditer('Adam', text_sample):
    print(found_item)

<_sre.SRE_Match object; span=(14, 18), match='Adam'>


`re.finditer` is doing essentially the same thing as `re.search` but it's wrapping the results in an iterator. (Reminder question: why would an iterator be used?)

What if the pattern occurs twice?

In [11]:
text_sample  = "Hi my name is Adam Pah! Not to be confused with Adam Hockenberry"
print(re.match('Adam', text_sample))
print(re.search('Adam', text_sample))
print(re.findall('Adam', text_sample))
print(re.finditer('Adam', text_sample))

None
<_sre.SRE_Match object; span=(14, 18), match='Adam'>
['Adam', 'Adam']
<callable_iterator object at 0x1065190f0>


`re.findall` looks like it found both occurrences, but `re.search` only found the first item. And `re.search` had all that cool stuff in it which might be useful, like where our substring occurred. Now let's look inside `re.finditer` again.

In [12]:
finditer_result = re.finditer('Adam', text_sample)
for i in finditer_result:
    print(i, i.span(), i.group())

<_sre.SRE_Match object; span=(14, 18), match='Adam'> (14, 18) Adam
<_sre.SRE_Match object; span=(48, 52), match='Adam'> (48, 52) Adam


Excellent! `re.finditer` handles multiple occurrences (as you might expect since it is an iterator) and it includes all of the annotation data about where the substring occurred.
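
If you want to keep those positions around, one option (a minimal sketch, with `occurrences`, `first_pass`, and `second_pass` as illustrative names) is to collect them into a list. It also shows a property of iterators worth remembering: once you have looped over one, it is used up.

```python
import re

text_sample = "Hi my name is Adam Pah! Not to be confused with Adam Hockenberry"

# Store (start, end, text) for every occurrence
occurrences = [(m.start(), m.end(), m.group())
               for m in re.finditer('Adam', text_sample)]
print(occurrences)  # [(14, 18, 'Adam'), (48, 52, 'Adam')]

# An iterator can only be consumed once
matches = re.finditer('Adam', text_sample)
first_pass = [m.group() for m in matches]
second_pass = [m.group() for m in matches]
print(first_pass)   # ['Adam', 'Adam']
print(second_pass)  # []
```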

# Creating regular expressions

So far we have only covered the basic methods to use in the `re` package to find strings. However, this could be accomplished with just regular string matching in Python.

Now let's move towards creating a regular expression. To do that we'll work with time. 

Time is an excellent example of a very regularly formatted object that couldn't be matched easily with regular string matching. As a reminder, a time looks like:

HH:MM

Hours can only be in the set [01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12] while minutes can range from [00 .. 59]

So 3:26 is a time, while 3:62 is not a time.

So to start off creating a regular expression to pull out times, we're going to create two lists. One list is times (all positive results) and the other is not_times (all negative results). So whatever we create should find every value in the `times` variable and nothing in `not_times`. The approach of creating one list of all positives and another of all negatives is one of the best ways to create and test a regular expression to make **sure** that it is doing what you want it to do.

In [13]:
times = ['03:43', '01:00', '12:59']
not_times = ['orange', '03:60', '26:14', '0155', '13:00']

So let's think about approaching how to write this, by tackling the hours first. 

If an hour starts with `0` then the second digit can be any number from 1 to 9. If the hour starts with `1` though, the second digit can only be 0, 1, or 2. We need to write a regular expression that treats those conditions separately, because one condition has a different range of available second digits than the other.

To write one condition with multiple values, we put the multiple values inside brackets `[]`. So to match 01 - 09, we would write:

`0[1-9]`

Which means that the first digit is always 0 and the second digit is any number from 1 to 9. We can test that first.

In [15]:
early_hours_expression = '0[1-9]'
for dataset in [times, not_times]:
    for i in dataset:
        print(i)
        print(re.match(early_hours_expression, i))
        print('---')

03:43
<_sre.SRE_Match object; span=(0, 2), match='03'>
---
01:00
<_sre.SRE_Match object; span=(0, 2), match='01'>
---
12:59
None
---
orange
None
---
03:60
<_sre.SRE_Match object; span=(0, 2), match='03'>
---
26:14
None
---
0155
<_sre.SRE_Match object; span=(0, 2), match='01'>
---
13:00
None
---


Great! We can see that we matched all of the values that start with an early hour.

Now let's add in the double-digit hours. We would construct a regular expression for that similarly:

`1[0-2]`

and we can see that it would work similarly.

In [17]:
late_hours_expression = '1[0-2]'
for dataset in [times, not_times]:
    for i in dataset:
        print(i)
        print(re.match(late_hours_expression, i))
        print('---')

03:43
None
---
01:00
None
---
12:59
<_sre.SRE_Match object; span=(0, 2), match='12'>
---
orange
None
---
03:60
None
---
26:14
None
---
0155
None
---
13:00
None
---


Now we need to put the two together. Since there isn't any overlap between the two conditions, we are really just looking to combine them with an `OR` statement. So we want `re` to match one or the other conditions. 

We write the logic of `OR` using the `|` symbol. The two regexes should be put inside parentheses and combined with the `|` symbol so the code knows that a match to either regex is acceptable.
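
The parentheses start to matter as soon as more pattern follows the alternation. The sketch below uses a trailing `:` purely for illustration (the names `grouped` and `ungrouped` are assumptions, not part of the original notebook): without parentheses, `|` splits the entire expression in two, so the colon only belongs to the second alternative.

```python
import re

grouped = '(0[1-9]|1[0-2]):'    # (01-09 OR 10-12), then a colon
ungrouped = '0[1-9]|1[0-2]:'    # 01-09, OR (10-12 then a colon)

# '0155' has no colon, so the grouped version rejects it
print(re.match(grouped, '0155'))            # None

# The ungrouped version accepts it through the first alternative
print(re.match(ungrouped, '0155').group())  # 01
```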

In [19]:
hours_expression = '(0[1-9]|1[0-2])'
for dataset in [times, not_times]:
    for i in dataset:
        print(i)
        print(re.match(hours_expression, i))

03:43
<_sre.SRE_Match object; span=(0, 2), match='03'>
01:00
<_sre.SRE_Match object; span=(0, 2), match='01'>
12:59
<_sre.SRE_Match object; span=(0, 2), match='12'>
orange
None
03:60
<_sre.SRE_Match object; span=(0, 2), match='03'>
26:14
None
0155
<_sre.SRE_Match object; span=(0, 2), match='01'>
13:00
None


Excellent! Now we just need to add in the minutes. Minutes are relatively simple: they can be from 00 to 59, and it doesn't matter if it's 1:00 or 12:00, the same range of minutes is possible.

In [20]:
time_expression = '(0[1-9]|1[0-2]):[0-5][0-9]'
for i in times:
    print(re.match(time_expression, i))

<_sre.SRE_Match object; span=(0, 5), match='03:43'>
<_sre.SRE_Match object; span=(0, 5), match='01:00'>
<_sre.SRE_Match object; span=(0, 5), match='12:59'>


Awesome! We found matches to all three times. What about the things that we don't want to match?

In [177]:
for i in not_times:
    print(re.match(time_expression, i))

None
None
None
None
None


Perfect! You can see how we've slowly built up a regular expression and tested against a positive and negative dataset to make sure that it works properly.
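
Reading through a column of match objects and `None` values gets tedious as the test lists grow. One way to automate the check is sketched below; `check_pattern` is a hypothetical helper written for this example, not something provided by the `re` package. It returns the positives that were missed and the negatives that were wrongly matched.

```python
import re

def check_pattern(pattern, positives, negatives):
    """Return (missed positives, wrongly matched negatives)."""
    missed = [s for s in positives if re.match(pattern, s) is None]
    false_hits = [s for s in negatives if re.match(pattern, s) is not None]
    return missed, false_hits

times = ['03:43', '01:00', '12:59']
not_times = ['orange', '03:60', '26:14', '0155', '13:00']

# The finished expression gets everything right
print(check_pattern('(0[1-9]|1[0-2]):[0-5][0-9]', times, not_times))
# ([], [])

# The early-hours expression alone misses one time and over-matches twice
print(check_pattern('0[1-9]', times, not_times))
# (['12:59'], ['03:60', '0155'])
```

Two empty lists mean the expression passes every test case in both lists.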

However, we've coded our regular expression pretty narrowly. Typically only computers (or humans writing for computers) put a `0` before a single digit time (i.e. 1-9). Can our regular expression handle that?

In [21]:
print(re.match(time_expression, '3:07'))

None


Nope! The problem is that our test cases weren't expansive enough to include how humans naturally write time. 

If time is between 1-9 we usually would just write 9:32 or 4:54 rather than 09:32 or 04:54. We can change our expression to accept this pretty easily by adding another '|' case in the first parentheses. Now we can either have 1-9, 0 followed by 1-9, or 1 followed by 0-2. Do we feel pretty good about this?

In [22]:
times = ['03:43', '01:00', '12:59', '1:06', '10:43:16']
not_times = ['orange', '03:60', '26:14', '0155', '13:00']
time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'
print( "Valid times:")
for i in times:
    print(re.match(time_expression, i))
print( "Invalid times:")
for i in not_times:
    print(re.match(time_expression, i))

Valid times:
<_sre.SRE_Match object; span=(0, 5), match='03:43'>
<_sre.SRE_Match object; span=(0, 5), match='01:00'>
<_sre.SRE_Match object; span=(0, 5), match='12:59'>
<_sre.SRE_Match object; span=(0, 4), match='1:06'>
<_sre.SRE_Match object; span=(0, 5), match='10:43'>
Invalid times:
None
None
None
None
None


Excellent! We're finding times without a leading `0` now, while still passing all of the test cases.
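
Notice that `'10:43:16'` was matched only as `'10:43'`, because `re.match` anchors the start of the string but ignores whatever comes after the match. If you would rather require that the whole string is a time in HH:MM form, `re.fullmatch` (available in Python 3.4 and later) is one option. This is a sketch of the difference, not a change to the test cases above.

```python
import re

time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'

# re.match only cares that the string starts with a time
prefix_match = re.match(time_expression, '10:43:16')
print(prefix_match.group())                            # 10:43

# re.fullmatch requires the pattern to cover the entire string
print(re.fullmatch(time_expression, '10:43:16'))       # None
print(re.fullmatch(time_expression, '1:06').group())   # 1:06
```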

Now if we had a whole sentence and we thought there might be a time in it, _any_ time, we could check quite easily!

In [23]:
sentence = "I was born at 1:42 PM all the way back in February 25, 1984"
print(re.search(time_expression, sentence))

<_sre.SRE_Match object; span=(14, 18), match='1:42'>
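
Two things can surprise you when searching inside a longer string, and the sketch below illustrates both with a made-up sentence. First, when the pattern contains parentheses, `re.findall` returns only the text captured by the group (here, the hour). Second, scanning can start in the middle of a number, so `13:00` yields `3:00`. Writing the group as `(?:...)` (non-capturing) and adding `\b` word boundaries is one way to address both; `bounded_expression` is an illustrative name.

```python
import re

time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'
sentence = "The train left at 13:00 and arrived at 2:15"

# With a capturing group, findall returns just the group (the hour),
# and the search happily starts inside '13:00'
hours_only = re.findall(time_expression, sentence)
print(hours_only)    # ['3', '2']

# Non-capturing group plus word boundaries: whole times only
bounded_expression = r'\b(?:[1-9]|0[1-9]|1[0-2]):[0-5][0-9]\b'
whole_times = re.findall(bounded_expression, sentence)
print(whole_times)   # ['2:15']
```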


**Exercise:** Edit the `time_expression` above to (only) find military times!

In [24]:
military_times = ['00:45', '1:43', '23:59', '10:00', '0:00']
not_military_times = ['24:00', '0:-1', 'Northwestern', '06;17', '9:60']
####Edit this expression
military_expression = ''
print( "Valid times:")
for i in military_times:
    print(re.match(military_expression, i))
print( "Invalid times:")
for i in not_military_times:
    print(re.match(military_expression, i))

Valid times:
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
Invalid times:
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>


Email addresses are another common format that you can think about. So what constitutes an email address? Well we have some letters or numbers, followed by an @ followed by more letters or numbers, a period, and yet more letters/numbers.  

So let's try something that I've just found online (which is a great way to find regexes btw):

`^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$`

Whoa! that's a lot of new symbols. Let's see if it even works.

In [25]:
email = ['a@b.c', 'something@somethingelse.org', '89@42.info', 'something@something.else.com']
not_email = ['@b.c', 'a@b.', 'something@somethingelse.']

email_expression = '^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'

print( "Valid emails:")
for i in email:
    print(re.match(email_expression, i))
print( "Invalid emails:")
for i in not_email:
    print(re.match(email_expression, i))



Valid emails:
<_sre.SRE_Match object; span=(0, 5), match='a@b.c'>
<_sre.SRE_Match object; span=(0, 27), match='something@somethingelse.org'>
<_sre.SRE_Match object; span=(0, 10), match='89@42.info'>
<_sre.SRE_Match object; span=(0, 28), match='something@something.else.com'>
Invalid emails:
None
None
None


Well it works! Or at least it fits all our test cases... who knows if that really covers all possible emails.

So let's pick apart what is going on in this regular expression.

* The '^' symbol at the beginning means the match must begin at the start of the string. This is a very handy character when you only care about text at the very start.

* The '$' means the match must finish at the end of the string.

* '[a-zA-Z0-9.]' means any single letter, digit, or period.

* The '+' symbol that follows means that the preceding set must occur one or more times before the @ symbol, with no upper limit. Literally, thousands (so maybe that's not too realistic).

* Then we have the @ symbol which we need, followed again by one or more letters, digits, or periods.

* Then a '.' and again one or more letters or digits. Be aware that outside of brackets an unescaped '.' matches any single character, not only a period; to require a literal period you would write '\.'.

Seems reasonable to me? It looks confusing, but regular expressions always do, so don't worry. 
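
One detail of this expression is worth testing directly. Inside brackets a `.` is just a period, but the `.` that sits outside the brackets matches any single character. The sketch below compares the expression as found online with a variant where that dot is escaped (`escaped_dot` is a name introduced here, and `'a@bXc'` is a made-up test string).

```python
import re

found_online = '^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'
escaped_dot = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$'

# No period after the @, yet the unescaped '.' accepts the 'X'
candidate = 'a@bXc'
print(re.match(found_online, candidate) is not None)   # True
print(re.match(escaped_dot, candidate) is not None)    # False

# Real addresses still match the escaped version
print(re.match(escaped_dot, 'something@something.else.com') is not None)  # True
```

Both versions give the same answers on the test lists above, which is a reminder that passing your test cases only tells you about the cases you thought to include.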

### Making our regular expression more reasonable

Maybe we wanted to say that there is a limit to how many characters can be before the '@' symbol. Let's say between 1-256. We'll just replace the '+' with {1,256}:
    
    ^[a-zA-Z0-9.]{1,256}@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$
    
If we thought that we should allow %, _, + and -  before the @ symbol we'd just add them into the brackets:

    ^[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$

We can keep going, matching more and more cases and making longer and more hideous expressions. 

In [26]:
email = ['a@b.c', 'something@somethingelse.org', '89@42.info', 'something@something.else.com']
not_email = ['@b.c', 'a@b.', 'something@somethingelse.']

email_expression = '^[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'

print( "Valid emails:")
for i in email:
    print(re.match(email_expression, i))
print( "Invalid emails:")
for i in not_email:
    print(re.match(email_expression, i))

Valid emails:
<_sre.SRE_Match object; span=(0, 5), match='a@b.c'>
<_sre.SRE_Match object; span=(0, 27), match='something@somethingelse.org'>
<_sre.SRE_Match object; span=(0, 10), match='89@42.info'>
<_sre.SRE_Match object; span=(0, 28), match='something@something.else.com'>
Invalid emails:
None
None
None


Excellent! It still works in all of our test cases!
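
None of our test cases actually exercise the two changes we just made, so here is a quick sketch that does. The test strings are made up for this example: one address has exactly 256 characters before the @, one has 257, and one uses the newly allowed `+` character.

```python
import re

email_expression = '^[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'

longest_allowed = 'a' * 256 + '@b.c'   # 256 characters before the @
too_long = 'a' * 257 + '@b.c'          # one character too many

print(re.match(email_expression, longest_allowed) is not None)  # True
print(re.match(email_expression, too_long) is not None)         # False

# '+' before the @ is accepted now that it is inside the brackets
print(re.match(email_expression, 'first.last+tag@b.c') is not None)  # True
```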

We could continue to improve this regex (say by limiting the ending domain to only known domains), but that should be left as an exercise for you. What you will notice, is that as you match for more and more cases the regular expression has a tendency to keep getting uglier and more difficult to write and read. But difficult though they may be to look at, there is no better way to identify patterns in text. 

This regex is actually extremely useful too. In 2015, before his ill-fated primary run for the Republican Party Presidential Nomination, Jeb Bush released a number of his e-mails in a bid for transparency. Unfortunately, this release wasn't vetted very well and some constituents' Social Security numbers were exposed.

Use the regex we've created to try and pull out all of the e-mails in the e-mail data dump (the e-mail files are in `../Data/Emails/`).

Hint: these files were encoded in the ISO-8859-1 standard

In [None]:
###Place your code here


Looks like it failed quite miserably! Wonder why that is?

Let's work with a smaller chunk so that it's more manageable.

In [3]:
email_chunk = open('../Data/Day5-Text-Analysis/Emails/2001-01Jan.txt', encoding = 'ISO-8859-1').read()[:300]
print(email_chunk)

From:	Bill and Carol Steele <scl@uslink.net>
Sent:	Wednesday, January 31, 2001 11:19 PM
To:	Governor Bush
Subject:	Homestead AFB

31 January 2000

Dear Governor Bush:

I am writing to urge you to support the Air Force in its decision to give 
Miami-Dade County 700 acres of surplus property at Homest


Now the first thing we should remember is that the `^` symbol means at the start of the line. But in our example the e-mail could occur anywhere, so we should remove that. Let's see if that makes a difference.

In [48]:
email_expression = '[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'

print( re.search(email_expression, email_chunk) )

None


Huh? It's still not working! This is infuriating!!!!!!!!


Welcome to the real world of using regular expressions! Now... why isn't it working????

In [49]:
#Place your code here


Excellent! Now we've got it working! Now go back and try on the entire corpus. How many unique email addresses are there?

In [55]:
###Place your code here


Wow! That's a lot of potentially compromised e-mail addresses. 

How many unique ending domains are there though? Maybe it'll be easier to just identify blocks of users that we don't need to worry about contacting (say, government employees whose personal data was already compromised by something other than the Jeb Bush e-mails).

In [57]:
###Place your code here


That's a pretty nice reduction! Now how could we profile these e-mails to improve our regular expression?

In [59]:
###Improve our e-mail regular expression


# Fore!

There are so many more complicated things you can do with regex, and there is even a game called [regex golf](http://regex.alf.nu) that the nerdiest of all nerds play from time to time where the object is to come up with the shortest way to match certain patterns while [avoiding others](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb). This game can serve as good practice to improve your regular expression skills.

As a test, let's play a game of regex golf. Let's try to match Star Wars movie titles, but not Star Trek movie titles.

In [183]:
#Match the Star wars movie titles but not the Star Trek titles
starwars = ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith',\
            'A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']

startrek = ['The Wrath of Khan', 'The Search for Spock', 'The Voyage Home',\
            'The Final Frontier', 'The Undiscovered Country', 'Generations',\
            'First Contact', 'Insurrection', 'Nemesis']
###Your code here


Star Wars titles:
<_sre.SRE_Match object; span=(10, 12), match='m '>
<_sre.SRE_Match object; span=(9, 11), match=' t'>
<_sre.SRE_Match object; span=(10, 12), match=' t'>
<_sre.SRE_Match object; span=(1, 3), match=' N'>
<_sre.SRE_Match object; span=(19, 20), match='B'>
<_sre.SRE_Match object; span=(9, 11), match=' t'>
Star Trek titles:
None
None
None
None
None
None
None
None
None


And one more example. Let's say that you are walking a filesystem looking for an image (that one photo of your vacation where you were looking totally dead-on at the camera and not blinking or making a weird face). Let's write a regular expression to identify images and not other file types.

In [185]:
images = ['test.gif', 
            'image.jpeg', 
            'image.jpg',
            'image.TIF'
            ]

non_images = ['test.pdf',
             'test.gif.pdf'
             ]

###Place your code here
image_expression = ''

# Additional Resources

If you're interested in learning more about using and writing regular expressions, you can continue with this documentation.

* [More Python documentation](https://docs.python.org/3/howto/regex.html#regex-howto)
* [A great little notebook](http://nbviewer.ipython.org/github/sampathweb/python_reference/blob/master/tutorials/useful_regex.ipynb)

