
# `re` - Regular expressions
[Package documentation](https://docs.python.org/3/library/re.html)

Since we're working with a lot of text today, we'd be remiss not to bring up the topic of ["regular expressions"](https://en.wikipedia.org/wiki/Regular_expression) (or regex for short).

Regular expressions are essentially a language of their own and they aren't unique to Python. What they do is allow for complicated searches through text according to various criteria. If you're looking at a large document of text it's easy enough to search for the word "Northwestern". But what if I want to search for a pattern rather than a particular word, such as (xxx)xxx-xxxx where I want all the x's to be digits? This would be a great way to find a phone number, but I'd have to do a lot of Cmd+F (Ctrl+F) searches if I searched for every possible 10-digit phone number.

Enter regex. 

There are a few basic functions that we'll use:


* `re.match()` : Determine if the RE matches at the beginning of the string.
* `re.search()` : Scan through a string, looking for the first location where this RE matches.
* `re.findall()` : Find all substrings where the RE matches, and return them as a list.
* `re.finditer()` : Find all substrings where the RE matches, and return them as an iterator of match objects.



Let's see an example:

In [13]:
import re

In [49]:
text_sample  = "Hi my name is Adam!"
print(re.match('Adam', text_sample))
print(re.search('Adam', text_sample))
print(re.findall('Adam', text_sample))
print(re.finditer('Adam', text_sample))

None
<_sre.SRE_Match object; span=(14, 18), match='Adam'>
['Adam']
<callable_iterator object at 0x104e36358>


`re.search` and `re.finditer` returned objects rather than plain text, so let's investigate that a little bit further:

In [51]:
search_results = re.search('Adam', text_sample)

In [52]:
help(search_results)

Help on SRE_Match object:

class SRE_Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(...)
 |      end([group=0]) -> int.
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(...)
 |      expand(template) -> str.
 |      Return the string obtained by doing backslash substitution
 |      on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(...)
 |      groupdict([default=None]) -> dict.
 |      Return a dictionary containing all the named subgroups of the match,
 |      keyed by the subgroup name. The default argument is used for groups
 |      that did no

In [53]:
print(search_results.start())
print(search_results.end())
print(search_results.span())
print(search_results.group())

14
18
(14, 18)
Adam
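
The indices reported by `start()`, `end()`, and `span()` are ordinary string positions, so slicing the original string with them should give back the matched text. A minimal sketch (the names `start`, `end`, and `matched_text` are only for illustration):

```python
import re

text_sample = "Hi my name is Adam!"
search_results = re.search('Adam', text_sample)

# span() returns (start, end); end is exclusive, just like a normal slice
start, end = search_results.span()
matched_text = text_sample[start:end]

print(matched_text)
print(matched_text == search_results.group())
```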


What if the pattern occurs twice?

In [173]:
text_sample  = "Hi my name is Adam Hockenberry! Not to be confused with Adam Pah"
print(re.match('Adam', text_sample))
print(re.search('Adam', text_sample))
print(re.findall('Adam', text_sample))
print(re.finditer('Adam', text_sample))

None
<_sre.SRE_Match object; span=(14, 18), match='Adam'>
['Adam', 'Adam']
<callable_iterator object at 0x104ea6eb8>


`re.findall` looks like it found both occurrences, but `re.search` only found the first one. And `re.search` had all that cool stuff in it which might be useful, like where our substring occurred. Now we can finally look at `re.finditer`. We haven't been exposed to an `iterator` object, but don't get too confused. For now we'll treat it kind of like a list:

In [174]:
finditer_result = re.finditer('Adam', text_sample)
for i in finditer_result:
    print(i, i.span(), i.group())

<_sre.SRE_Match object; span=(14, 18), match='Adam'> (14, 18) Adam
<_sre.SRE_Match object; span=(56, 60), match='Adam'> (56, 60) Adam
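
One place where the "kind of like a list" picture breaks down: an iterator can only be walked through once. The sketch below (variable names are just for illustration) loops over the same `re.finditer` result twice and then shows that wrapping it in `list()` is one way to keep the match objects around:

```python
import re

text_sample = "Hi my name is Adam Hockenberry! Not to be confused with Adam Pah"

finditer_result = re.finditer('Adam', text_sample)
first_pass = [m.span() for m in finditer_result]
second_pass = [m.span() for m in finditer_result]  # iterator is already used up

print(first_pass)
print(second_pass)

# Converting to a list stores the match objects so they can be reused
matches = list(re.finditer('Adam', text_sample))
print(len(matches))
```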


Okay, but this is all pretty useless still. We've learned some of the functions in the `re` library but haven't actually used a regular expression yet, so we'll have to introduce something more complicated.

In [175]:
times = ['03:43', '01:00', '12:59']
not_times = ['orange', '03:60', '26:14', '0155', '13:00']

Okay, we have some examples of times and some examples of non-times. How could we tell the difference?

In [176]:
time_expression = '(0[1-9]|1[0-2]):[0-5][0-9]'
for i in times:
    print(re.match(time_expression, i))

<_sre.SRE_Match object; span=(0, 5), match='03:43'>
<_sre.SRE_Match object; span=(0, 5), match='01:00'>
<_sre.SRE_Match object; span=(0, 5), match='12:59'>


Awesome! We found matches to all three times. What about the things that we don't want to match?

In [177]:
for i in not_times:
    print(re.match(time_expression, i))

None
None
None
None
None


Perfect. So what on earth is in `time_expression`? Let's look at it piece by piece:

* First, we have a group in parentheses, `(0[1-9]|1[0-2])`. To the left of the '|' we're saying that we want a zero followed by the bracketed expression `[1-9]`, which is the regex way of saying any digit between 1 and 9. The '|' signifies OR. So we can _either_ have a 0 followed by 1-9 _or_ we can have a 1 followed by a digit between 0 and 2.
* If that all works out, now we'll check for a ':'
* And finally we want a number between 0 and 5, followed by another number between 0 and 9.

If all goes according to plan, we would have matched all of our times perfectly! You'll see that at least some of the `not_times` test cases I came up with were really close to fitting these criteria. But they didn't match all three of the above things that we were looking for, so `re.match` returned `None`.
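
The three pieces described above can also be written out one per line. This is only a sketch: it uses the `re.VERBOSE` flag, which makes the regex engine ignore whitespace and `#` comments inside the pattern, and the name `time_expression_verbose` is made up for this example.

```python
import re

# Same pattern as time_expression, spread out so each piece gets a comment
time_expression_verbose = re.compile(r"""
    (0[1-9]|1[0-2])   # hours: 0 followed by 1-9, or 1 followed by 0-2
    :                 # a literal colon
    [0-5][0-9]        # minutes: a digit 0-5 followed by a digit 0-9
""", re.VERBOSE)

for candidate in ['03:43', '12:59', '03:60', '13:00']:
    print(candidate, bool(time_expression_verbose.match(candidate)))
```

The behavior should be the same as the one-line version; the only difference is readability.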

And that's it, you've now seen a regular expression!

Let's check something though:

In [178]:
print(re.match(time_expression, '3:07'))

None


Our time expression worked perfectly for our initial test cases, but this looks like a pretty valid time and it doesn't match. Why not?

Well, all of our initial test cases had a leading zero, and that's not how we always write time: if the hour is between 1 and 9 we would usually just write 9:32 or 4:54 rather than 09:32 or 04:54. We can change our expression to accept this pretty easily by adding another '|' case in the first parentheses. Now the hour can be 1-9, 0 followed by 1-9, or 1 followed by 0-2. Do we feel pretty good about this?

In [179]:
times = ['03:43', '01:00', '12:59', '1:06', '10:43:16']
not_times = ['orange', '03:60', '26:14', '0155', '13:00']
time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'
print( "Valid times:")
for i in times:
    print(re.match(time_expression, i))
print( "Invalid times:")
for i in not_times:
    print(re.match(time_expression, i))

Valid times:
<_sre.SRE_Match object; span=(0, 5), match='03:43'>
<_sre.SRE_Match object; span=(0, 5), match='01:00'>
<_sre.SRE_Match object; span=(0, 5), match='12:59'>
<_sre.SRE_Match object; span=(0, 4), match='1:06'>
<_sre.SRE_Match object; span=(0, 5), match='10:43'>
Invalid times:
None
None
None
None
None
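
Notice that `'10:43:16'` matched only as `'10:43'`: `re.match` anchors the pattern at the start of the string but doesn't require it to reach the end. If the whole string is supposed to be a time, `re.fullmatch` (or a trailing `$` in the pattern) is one option. A small sketch, assuming we would want to reject trailing characters (the extra test string `'12:59pm'` is made up here):

```python
import re

time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'

for candidate in ['1:06', '10:43:16', '12:59pm']:
    prefix_match = re.match(time_expression, candidate)      # start of string only
    whole_match = re.fullmatch(time_expression, candidate)   # entire string
    print(candidate, bool(prefix_match), bool(whole_match))
```

Whether `'10:43:16'` should count as a valid time depends on what you're looking for, so treat this as a design choice rather than a correction.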


Now if we had a whole sentence and we thought there might be a time in it, _any_ time, we could check quite easily!

In [180]:
sentence = "I was born at 1:42 PM all the way back in February 25, 1984"
print(re.search(time_expression, sentence))

<_sre.SRE_Match object; span=(14, 18), match='1:42'>
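
One thing to watch for: `re.search` will happily start in the middle of a longer run of digits, so something like `13:00` inside a sentence gets found as `3:00`. One possible guard, which is not part of the original expression, is to wrap the pattern in `\b` word boundaries. The sentence and the name `bounded_expression` below are made up for illustration:

```python
import re

time_expression = '([1-9]|0[1-9]|1[0-2]):[0-5][0-9]'
# Assumption: a time should not be glued to other digits or letters
bounded_expression = r'\b(?:' + time_expression + r')\b'

sentence = "The meeting moved from 13:00 to 2:30"
plain_hit = re.search(time_expression, sentence).group()
bounded_hit = re.search(bounded_expression, sentence).group()

print(plain_hit)
print(bounded_hit)
```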


**Exercise:** Edit the `time_expression` above to (only) find military times!

In [181]:
military_times = ['00:45', '1:43', '23:59', '10:00', '0:00']
not_military_times = ['24:00', '0:-1', 'Northwestern', '06;17', '9:60']
####Edit this expression
military_expression = ''
print( "Valid times:")
for i in military_times:
    print(re.match(military_expression, i))
print( "Invalid times:")
for i in not_military_times:
    print(re.match(military_expression, i))

Valid times:
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
Invalid times:
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>


Email addresses are another common format that you can think about. So what constitutes an email address? Well, we have some letters or numbers, followed by an @, followed by more letters or numbers, a period, and yet more letters/numbers.

In [182]:
email = ['a@b.c', 'something@somethingelse.org', '89@42.info', 'something@something.else.com']
not_email = ['@b.c', 'a@b.', 'something@somethingelse.']

# The period before the final part is escaped (\.) so that it only matches a literal '.'
email_expression = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$'

print( "Valid emails:")
for i in email:
    print(re.match(email_expression, i))
print( "Invalid emails:")
for i in not_email:
    print(re.match(email_expression, i))



Valid emails:
<_sre.SRE_Match object; span=(0, 5), match='a@b.c'>
<_sre.SRE_Match object; span=(0, 27), match='something@somethingelse.org'>
<_sre.SRE_Match object; span=(0, 10), match='89@42.info'>
<_sre.SRE_Match object; span=(0, 28), match='something@something.else.com'>
Invalid emails:
None
None
None


Okay, so I don't actually know what characters are allowed in email. But this fits all of our test cases and none of our failures at least. What's going on?

* First off, the '^' symbol at the beginning means the string must start with the first expression, and the '$' means the string has to end with last expression.

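* `[a-zA-Z0-9.]` means any letter, any digit, or a period.

* The '+' symbol that follows means one or more of those characters, with no upper limit, so there can be any number of them before the @ symbol. Literally, thousands (so maybe that's not too realistic).

* Then we have the @ symbol which we need, followed again by one or more letters, digits, or periods.

* Then a '.' and again one or more letters/digits. One catch: outside of square brackets a bare `.` matches _any_ character, so a literal period needs to be written as `\.`.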

Seems reasonable to me. It looks confusing, and it should, so don't worry.
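
To see why the period needs care, here is a small comparison. Outside of square brackets a bare `.` matches any character, so a pattern with an unescaped `.` before the final part will accept strings that have no period after the '@' at all. The names `unescaped` and `escaped`, and the test string `'a@bcd'`, are made up for this sketch:

```python
import re

unescaped = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'
escaped = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$'

# 'a@bcd' has no period after the @, so it should not look like an email
for candidate in ['a@b.c', 'a@bcd']:
    print(candidate,
          bool(re.match(unescaped, candidate)),
          bool(re.match(escaped, candidate)))
```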

Maybe we wanted to say that there is a limit to how many characters can be before the '@' symbol. Let's say between 1 and 256. We'll just replace the '+' with {1,256}:

    ^[a-zA-Z0-9.]{1,256}@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$
    
If we thought that we should allow %, _, + and -  before the @ symbol we'd just add them into the brackets:

    ^[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$
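
A quick way to convince yourself that `{1,256}` does what we want is to test right at the boundary. This is only a sketch: the name `bounded_email` and the test addresses are made up, and the period before the final part is written as `\.` so that it matches a literal period.

```python
import re

bounded_email = r'^[a-zA-Z0-9.%_+-]{1,256}@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$'

ok_local = 'a' * 256   # exactly at the limit
too_long = 'a' * 257   # one character over the limit

print(bool(re.match(bounded_email, ok_local + '@b.c')))
print(bool(re.match(bounded_email, too_long + '@b.c')))
print(bool(re.match(bounded_email, 'first.last+tag@example.org')))
```

Inside the square brackets the `-` is placed last so that it is read as a literal hyphen rather than as part of a range like `a-z`.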

We can keep going, matching more and more cases and making longer and more hideous expressions. But difficult though they may be to look at, there is no better way to identify patterns in text. There are so many more complicated things you can do with regex, and there is even a game called [regex golf](http://regex.alf.nu) that the nerdiest of all nerds play from time to time where the object is to come up with the shortest way to match certain patterns while [avoiding others](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb):

In [183]:
starwars = ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith',\
            'A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']

startrek = ['The Wrath of Khan', 'The Search for Spock', 'The Voyage Home',\
            'The Final Frontier', 'The Undiscovered Country', 'Generations',\
            'First Contact', 'Insurrection', 'Nemesis']

print("Star Wars titles:")
for i in starwars:
    print(re.search('M | [TN]|B', i, re.IGNORECASE))
print("Star Trek titles:")
for i in startrek:
    print(re.search('M | [TN]|B', i, re.IGNORECASE))

Star Wars titles:
<_sre.SRE_Match object; span=(10, 12), match='m '>
<_sre.SRE_Match object; span=(9, 11), match=' t'>
<_sre.SRE_Match object; span=(10, 12), match=' t'>
<_sre.SRE_Match object; span=(1, 3), match=' N'>
<_sre.SRE_Match object; span=(19, 20), match='B'>
<_sre.SRE_Match object; span=(9, 11), match=' t'>
Star Trek titles:
None
None
None
None
None
None
None
None
None
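
Rather than reading through the printed matches by eye, a golf attempt can be checked automatically. The helper below, `verify_golf`, is a hypothetical function written for this notebook (it is not part of `re`); it returns `True` only when the pattern hits every title in the first list and none in the second.

```python
import re

def verify_golf(pattern, winners, losers, flags=0):
    """Return True if pattern matches every winner and no loser."""
    hits_all_winners = all(re.search(pattern, w, flags) for w in winners)
    hits_no_losers = not any(re.search(pattern, l, flags) for l in losers)
    return hits_all_winners and hits_no_losers

starwars = ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith',
            'A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']
startrek = ['The Wrath of Khan', 'The Search for Spock', 'The Voyage Home',
            'The Final Frontier', 'The Undiscovered Country', 'Generations',
            'First Contact', 'Insurrection', 'Nemesis']

print(verify_golf('M | [TN]|B', starwars, startrek, re.IGNORECASE))
print(verify_golf('The', starwars, startrek))  # misses some Star Wars titles
```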


There are lots of great resources, so if you find yourself in need of more understanding Google will help you out, but here are a few:
* [More Python documentation](https://docs.python.org/3/howto/regex.html#regex-howto)
* [A great little notebook](http://nbviewer.ipython.org/github/sampathweb/python_reference/blob/master/tutorials/useful_regex.ipynb)


# Advanced Exercises

Write a regular expression to find image files while ignoring non-image files:

In [185]:
images = ['test.gif', 
            'image.jpeg', 
            'image.jpg',
            'image.TIF'
            ]

non_images = ['test.pdf',
             'test.gif.pdf'
             ]

###Place your code here
image_expression = ''

What are other interesting patterns that you can think of? Dates? Phone numbers? Websites? Give them a try!