# Synopsis

Frequently we are simply looking for specific words or phrases in a block of text and do not care about the rest of the text. However, sometimes we are interested in a pattern of text (such as a phone number), where the format is consistent but the actual text itself changes. In this unit, we will learn:

1. What a regular expression is
2. Available functions in the `re` package
3. How to identify and extract a text pattern in a large block of text
4. How to develop and test regular expressions

# Read libraries

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path
from sys import path

path.append( str(Path.cwd().parent) )

In [7]:
import re

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd

from collections import Counter
from random import random
from string import punctuation, whitespace

from Amaral_libraries.my_nlp_library import read_complete_works
from Amaral_libraries.my_stats import place_commas

In [3]:
my_fontsize = 15

# Regular Expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (or **regexes** for short) are essentially a small, standalone language for describing patterns of text.

This is good because it means you will be able to use what you learn in other contexts.


Regexes allow for searches of patterns that are not fixed but instead follow a particular set of rules.

Imagine that you are looking for Northwestern's Helpline in a document.  Then you would search for `847-491-4357` or `(847) 491-4357`, or maybe even for `1-4357`. Yes, it is getting complicated...

But what if you actually wanted to find **any** phone number in a document?

Let's say that we use `d` to represent any digit 0-9.  Then, we are looking for patterns of the form `ddd-ddd-dddd` or `(ddd) ddd-dddd`

Amazingly, regexes allow us to construct a generic compact text pattern that will then be matched through the entire text.  
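
As a preview, and only as a sketch: with the `re` syntax covered later in this unit (`\d` for any digit, `{3}` for exactly three repetitions, `|` for or), the two formats above could be written as a single pattern. The sample sentence and the variable names are made up for illustration.

```python
import re

# Two alternatives: (ddd) ddd-dddd  or  ddd-ddd-dddd
phone_re = r'\(\d{3}\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}'

# Hypothetical sample text
text = "Call 847-491-4357 or (847) 491-4357; on campus dial 1-4357."

phone_numbers = re.findall(phone_re, text)
print(phone_numbers)
```

The short on-campus form `1-4357` is not picked up, since it fits neither of the two formats.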


## Regular expressions in Python

Methods for regular expressions in Python are implemented in the [`re` package](https://docs.python.org/3/library/re.html). There are many functions, flags, and conventions.

A few basic functions that we will use are:

> `re.match()` : Determine if the RE matches at the beginning of the string.
>
> `re.search()` : Scan through a string, looking for any location where this RE matches.
>
> `re.findall()` : Find all substrings where the RE matches, and return them as a list.
>
> `re.finditer()` : Find all substrings where the RE matches, and return them as an iterator of match objects.


Some important conventions are:

> `|` stands for `or`
>
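> there is no `&` (`and`) operator in `re`; writing one pattern right after another means that they must match in sequence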
>
> `.` stands for any character except a new line
>
> `^` stands for beginning of the string being searched
>
> `$` stands for end of the string or just before new line
>
> `\` allows for escaping special characters, i.e., searching for the characters that are themselves used in these conventions
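
A minimal sketch of these conventions in action. The sample strings and variable names are made up for illustration.

```python
import re

# | : either alternative
either = re.findall('cat|dog', 'cat dog bird')

# . : any character except a new line
any_char = re.findall('c.t', 'cat cot c\nt cut')

# ^ and $ : anchors at the start and at the end of the string
starts_with_cat = re.search('^cat', 'cat dog') is not None
starts_with_dog = re.search('^dog', 'cat dog') is not None
ends_with_dog = re.search('dog$', 'cat dog') is not None

# \ : escape a special character to search for it literally
literal_dots = re.findall(r'\.', 'a.b.c')

print(either, any_char)
print(starts_with_cat, starts_with_dog, ends_with_dog)
print(literal_dots)
```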


This [site](https://www.rexegg.com/regex-quickstart.html#chars) is a great cheat sheet to keep at hand when building regexes.

## Let's check back on Othello

Now, let's go over an example so this is less abstract. 

In [4]:
# Copied from previous notebook

complete_works, plays = read_complete_works()
    
print(len(complete_works))

plays

124787


{'THE SONNETS': {'year': 1609, 'first_line': 175, 'last_line': 2798},
 'ALLS WELL THAT ENDS WELL': {'year': 1603,
  'first_line': 2817,
  'last_line': 6045},
 'THE TRAGEDY OF ANTONY AND CLEOPATRA': {'year': 1607,
  'first_line': 6064,
  'last_line': 10236},
 'AS YOU LIKE IT': {'year': 1601, 'first_line': 10255, 'last_line': 13226},
 'THE TRAGEDY OF CORIOLANUS': {'year': 1608,
  'first_line': 15370,
  'last_line': 19644},
 'CYMBELINE': {'year': 1609, 'first_line': 19663, 'last_line': 23824},
 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK': {'year': 1604,
  'first_line': 23844,
  'last_line': 28393},
 'THE FIRST PART OF KING HENRY THE FOURTH': {'year': 1598,
  'first_line': 28412,
  'last_line': 31764},
 'SECOND PART OF KING HENRY IV': {'year': 1598,
  'first_line': 31784,
  'last_line': 35367},
 'THE LIFE OF KING HENRY THE FIFTH': {'year': 1599,
  'first_line': 35386,
  'last_line': 39026},
 'THE FIRST PART OF HENRY THE SIXTH': {'year': 1592,
  'first_line': 39045,
  'last_line': 42451},
 '

In [9]:
title = 'THE TRAGEDY OF OTHELLO, MOOR OF VENICE'
start_line = plays[title]['first_line']
end_line = plays[title]['last_line']

# Slice the play out of complete works
#
the_play = complete_works[start_line: end_line]

print(f"The play Othello has {place_commas(len(the_play))} lines.\n")

# Put it all into a single string
#
the_play = ' '.join(the_play)

print(f"The play Othello has {place_commas(len(the_play))} characters.\n")

The play Othello has 3,893 lines.

The play Othello has 179,024 characters.



In [10]:
print(the_play[:500])

THE TRAGEDY OF OTHELLO, MOOR OF VENICE
 
 by William Shakespeare
 
 
 
 Dramatis Personae
 
   OTHELLO, the Moor, general of the Venetian forces
   DESDEMONA, his wife
   IAGO, ensign to Othello
   EMILIA, his wife, lady-in-waiting to Desdemona
   CASSIO, lieutenant to Othello
   THE DUKE OF VENICE
   BRABANTIO, Venetian Senator, father of Desdemona
   GRATIANO, nobleman of Venice, brother of Brabantio
   LODOVICO, nobleman of Venice, kinsman of Brabantio
   RODERIGO, rejected suitor of Desdemon


Ok, let's search for Othello

In [11]:
print(re.match('Othello', the_play))
print()

print(re.match('OTHELLO', the_play))
print()


None

None



`re.match` finds no matches.  This is not surprising if you recall that it only attempts to match at the start of the string.



In [21]:
matches = re.findall('Othello', the_play)
print(f"There are {len(matches)} occurrences of 'Othello'.\n")

print(matches)

matches = re.findall('othello', the_play.lower())
print(f"\nThere are {len(matches)} occurrences of othello (any case).\n")


There are 59 occurrences of 'Othello'.

['Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello', 'Othello']

There are 335 occurrences of othello (any case).



`re.findall` finds numerous matches.  This is not surprising since we expect the string Othello to appear frequently in the play.
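
Lower-casing the text works, but `re` functions also accept the `re.IGNORECASE` flag, which leaves the text itself untouched. A small sketch on a made-up sentence (the sample text and variable names are assumptions, not part of the play):

```python
import re

# Hypothetical sample sentence with three different capitalizations
sample = "OTHELLO, the Moor. Enter Othello and Iago. Who is othello?"

# Case-sensitive search finds only the exact capitalization
exact = re.findall('Othello', sample)

# Case-insensitive search returns each match as it appears in the text
any_case = re.findall('othello', sample, flags=re.IGNORECASE)

print(exact)
print(any_case)
```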


In [22]:
print(re.search('Othello', the_play))
print()

print(re.search('OTHELLO', the_play))
print()

<re.Match object; span=(187, 194), match='Othello'>

<re.Match object; span=(15, 22), match='OTHELLO'>



`re.search` finds a match for both patterns. Because the search is case-sensitive, the all-caps `OTHELLO` is first found at a different position than `Othello`.

Each returned a single match as a `re.Match` object.  This is actually quite cool because a match object has all sorts of attributes!

Let's look at them in detail.


In [23]:
othello_match = re.search('Othello', the_play)
help(othello_match)


Help on Match in module re object:

class Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(self, /)
 |  
 |  __deepcopy__(self, memo, /)
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(self, group=0, /)
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(self, /, template)
 |      Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(self, /, default=None)
 |      Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name.
 |      
 |      default
 |        Is used for groups tha

Great! `re.Match` objects always have a Boolean value of `True`.

They also have methods such as `.start()`, `.end()`, or `.span()`. 

Let's see what they are

In [27]:
# Notice the print of a re.Match object already contains 
# important information
print(f"Match object is: {othello_match}")
print()

print(f"Span is: {othello_match.span()}")
print()

print(f"Match groups are: {othello_match.group()}")
print()


Match object is: <re.Match object; span=(187, 194), match='Othello'>

Span is: (187, 194)

Match groups are: Othello



The meaning of `.group()` will become clear later.
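
As a brief, hedged preview: parentheses in a pattern create numbered groups, and `.group(n)` returns the text captured by group `n`, while `.group()` (or `.group(0)`) returns the whole match. The sample sentence is made up, and `\d` (any digit) is introduced later in this unit.

```python
import re

# Two groups: the digits before the colon and the digits after it
time_match = re.search(r'(\d\d):(\d\d)', 'The meeting starts at 14:30 sharp')

print(time_match.group())    # the whole match, same as group(0)
print(time_match.group(1))   # text captured by the first pair of parentheses
print(time_match.group(2))   # text captured by the second pair
print(time_match.groups())   # all captured groups as a tuple
```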

In [28]:
print(othello_match.start())
print()

print(othello_match.end())
print()

187

194



Those numbers are positions (indices) in the string, which we can use to look at the surrounding text.

In [29]:
the_play[187: 194]

'Othello'

In [30]:
the_play[othello_match.start()-20: othello_match.end()+20]

'\n   IAGO, ensign to Othello\n   EMILIA, his wife'






What about `re.finditer()`?

In [31]:
print(re.finditer('Othello', the_play))
print()

<callable_iterator object at 0x151495d80>



Interesting! Notice the word `iterator` in there.  This suggests that `Python` is offering us a way to iterate through the results.

As you will recall, we can do this using a `for` loop.


In [32]:
for i, match_item in enumerate(re.finditer('Othello', the_play)):
    print(f"{i:>3} -- {match_item}")

  0 -- <re.Match object; span=(187, 194), match='Othello'>
  1 -- <re.Match object; span=(270, 277), match='Othello'>
  2 -- <re.Match object; span=(588, 595), match='Othello'>
  3 -- <re.Match object; span=(10949, 10956), match='Othello'>
  4 -- <re.Match object; span=(13977, 13984), match='Othello'>
  5 -- <re.Match object; span=(19484, 19491), match='Othello'>
  6 -- <re.Match object; span=(19542, 19549), match='Othello'>
  7 -- <re.Match object; span=(21028, 21035), match='Othello'>
  8 -- <re.Match object; span=(22902, 22909), match='Othello'>
  9 -- <re.Match object; span=(23805, 23812), match='Othello'>
 10 -- <re.Match object; span=(28540, 28547), match='Othello'>
 11 -- <re.Match object; span=(30025, 30032), match='Othello'>
 12 -- <re.Match object; span=(31436, 31443), match='Othello'>
 13 -- <re.Match object; span=(32613, 32620), match='Othello'>
 14 -- <re.Match object; span=(37241, 37248), match='Othello'>
 15 -- <re.Match object; span=(39715, 39722), match='Othello'>
 16 



Nice! 

**`re.finditer` gives us the full detail of the `re.search` output, but for every match.**
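
One way to put this to use, sketched on a made-up two-sentence sample rather than the full play: collect the start position of every match together with a few characters of preceding context. The names `sample`, `window`, and `hits` are illustrative assumptions.

```python
import re

# Hypothetical sample in the style of the Dramatis Personae
sample = "IAGO, ensign to Othello. CASSIO, lieutenant to Othello."

window = 10  # number of characters of context to keep before each match
hits = []
for m in re.finditer('Othello', sample):
    # max(..., 0) keeps the slice from wrapping around near the start
    context = sample[max(m.start() - window, 0): m.end()]
    hits.append((m.start(), context))

for start, context in hits:
    print(f"{start:>3} -- {context!r}")
```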

# Creating regular expressions

Now that we know what some `re` functions do, we are ready to start exploring the real power of the package.

Above, we were interested in finding **rigid patterns**. Let us now search for **flexible patterns**.



## Testing, testing, testing

Even though we have brought the matter up before, we have not emphasized enough the need to create tests for your code.

We were trying to cover the basics of the language and did not want to add another moving part to the learning process.

The analogy I may use is that it is much easier to learn to drive with an automatic transmission than with a manual.

Now we are switching to *manual transmission* because it is so crucial for writing robust code.



Above, we discussed phone numbers. However, those are too complicated as a starting point. Instead, we will start with searching for times.

Imagine you are fed up with G..gle and A..le and want to write code to search your emails for times of appointments to add to your calendar. What would you do?

Times come in two major formats: **good** (`hh:mm`, the 24-hour format, also called military time) and **bad** (`+h:mm*`, the 12-hour format).

Consider the good system: 

> * The first `h` can take the values 0, 1, or 2 
> * The second `h` can take values 0-9
> * The first `m` can take values 0-5
> * The second `m` can take values 0-9

However, not all combinations are possible. For example, 27 is not acceptable for `hh`.

Consider the bad system:

> * The `+` can be absent or take the values 0 or 1
> * The `h` can take values 0-9
> * The first `m` can take values 0-5
> * The second `m` can take values 0-9
> * The `*` stands for zero or more spaces followed by pm/PM or am/AM

To ensure the correctness of our code, we should start by creating two lists of examples: one with strings that look superficially correct but are not valid matches, and another with strings that are valid matches.

Both lists should cover a broad range of possibilities.


In [34]:
positive_good_times = [' 03:43 ', ' 01:00 ', ' 12:59 ', 
                       ' 13:00 ', ' 21:35 ']
negative_good_times = [' orange', ' 03:60 ', ' 26:14 ', ' 0155 ', 
                       ' 21:355 ']

And it is helpful to have a testing function...

**Please read the function, then write a docstring and add any comments necessary to explain what is going on.**

In [38]:
def test_pattern(pattern, text, positive_matches = True):
    """
    
    """
    count = 0
    for item in text:
        match = re.search(pattern, item)
        if match:
            print(f"{match} -- {item}")
            count += 1
        else:
            print(item)
            
    if positive_matches:
        print( f"\n----Correctly matched {count} out of {len(text)}"
               f" positives matches.\n" )
    else:
        print(f"\n----Correctly failed to match {len(text) - count} out of {len(text)}"
              f" negative matches.\n")

    return

We will address this problem in a modular manner, so that we can clearly see what the granular operations are.

First, I define `hours_re` and `minutes_re` to store the expressions for matching hours and minutes respectively.

The simplest case for both is 00-09, which means that the first digit is always 0 and the second digit is any number from 0 to 9.

This is easily defined as `0[0-9]`


In [39]:
hours_re = '0[0-9]'

re_times = hours_re + ':' 
print(f"Current re_string is:\n\t\t'{re_times}'\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		'0[0-9]:'

<re.Match object; span=(1, 4), match='03:'> --  03:43 
<re.Match object; span=(1, 4), match='01:'> --  01:00 
 12:59 
 13:00 
 21:35 

----Correctly matched 2 out of 5 positives matches.

 orange
<re.Match object; span=(1, 4), match='03:'> --  03:60 
 26:14 
 0155 
 21:355 

----Correctly failed to match 4 out of 5 negative matches.



Not impressive, huh!

Let's include other possibilities such as 10 to 19


In [41]:
hours_re = '(0[0-9]|1[0-9])'

re_times = hours_re + ':'
print(f"Current re_string is:\n\t\t'{re_times}'\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		'(0[0-9]|1[0-9]):'

<re.Match object; span=(1, 4), match='03:'> --  03:43 
<re.Match object; span=(1, 4), match='01:'> --  01:00 
<re.Match object; span=(1, 4), match='12:'> --  12:59 
<re.Match object; span=(1, 4), match='13:'> --  13:00 
 21:35 

----Correctly matched 4 out of 5 positives matches.

 orange
<re.Match object; span=(1, 4), match='03:'> --  03:60 
 26:14 
 0155 
 21:355 

----Correctly failed to match 4 out of 5 negative matches.



Almost there for hours. Let's add 20 to 23...

In [42]:
hours_re = '(0[0-9]|1[0-9]|2[0-3])'

re_times = hours_re + ':'
print(f"Current re_string is:\n\t\t'{re_times}'\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		'(0[0-9]|1[0-9]|2[0-3]):'

<re.Match object; span=(1, 4), match='03:'> --  03:43 
<re.Match object; span=(1, 4), match='01:'> --  01:00 
<re.Match object; span=(1, 4), match='12:'> --  12:59 
<re.Match object; span=(1, 4), match='13:'> --  13:00 
<re.Match object; span=(1, 4), match='21:'> --  21:35 

----Correctly matched 5 out of 5 positives matches.

 orange
<re.Match object; span=(1, 4), match='03:'> --  03:60 
 26:14 
 0155 
<re.Match object; span=(1, 4), match='21:'> --  21:355 

----Correctly failed to match 3 out of 5 negative matches.



The hours are looking pretty good. 

Let's add the minutes term. Any value from 00 to 59 works, so we use `[0-5][0-9]`.

In [43]:
hours_re = '(0[0-9]|1[0-9]|2[0-3])'
minutes_re = '[0-5][0-9]'

re_times = hours_re + ':' + minutes_re
print(f"Current re_string is:\n\t\t'{re_times}'\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		'(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]'

<re.Match object; span=(1, 6), match='03:43'> --  03:43 
<re.Match object; span=(1, 6), match='01:00'> --  01:00 
<re.Match object; span=(1, 6), match='12:59'> --  12:59 
<re.Match object; span=(1, 6), match='13:00'> --  13:00 
<re.Match object; span=(1, 6), match='21:35'> --  21:35 

----Correctly matched 5 out of 5 positives matches.

 orange
 03:60 
 26:14 
 0155 
<re.Match object; span=(1, 6), match='21:35'> --  21:355 

----Correctly failed to match 4 out of 5 negative matches.



We are almost there. The problem is that while **21:355** is not a time, it contains **21:35** which is a time.

How to solve this? For it to be a time, it needs to have something that is not a digit after the first **5**...

**We will take this opportunity to make use of the useful codes (accessible with `\`) that can be found at this [site](https://www.rexegg.com/regex-quickstart.html#chars).**

If you search the cases shown there, you find that `\d` can be used to indicate any single digit (for plain ASCII text, equivalent to `[0-9]`) and that `\D` means any single character that is not a digit.

Using these conventions, we can write

In [44]:
# Raw strings (r'...') keep Python from interpreting the backslashes itself
hours_re = r'(0\d|1\d|2[0-3])'
minutes_re = r'[0-5]\d\D'

re_times = hours_re + ':' + minutes_re
print(f"Current re_string is:\n\t\t'{re_times}'\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		'(0\d|1\d|2[0-3]):[0-5]\d\D'

<re.Match object; span=(1, 7), match='03:43 '> --  03:43 
<re.Match object; span=(1, 7), match='01:00 '> --  01:00 
<re.Match object; span=(1, 7), match='12:59 '> --  12:59 
<re.Match object; span=(1, 7), match='13:00 '> --  13:00 
<re.Match object; span=(1, 7), match='21:35 '> --  21:35 

----Correctly matched 5 out of 5 positives matches.

 orange
 03:60 
 26:14 
 0155 
 21:355 

----Correctly failed to match 5 out of 5 negative matches.
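
Note that `\D` consumes the character after the minutes (each match above ends with a space), so a time sitting at the very end of a string would be missed. A hedged alternative, not the approach used in this notebook, is a negative lookahead `(?!\d)`, which only checks that no digit follows without including anything in the match. The variable names below are illustrative.

```python
import re

hours_re = r'(0\d|1\d|2[0-3])'

# Version from the text: \D consumes the character after the minutes
re_with_D = hours_re + r':[0-5]\d\D'

# Sketch of an alternative: (?!\d) only checks that no digit follows
re_with_lookahead = hours_re + r':[0-5]\d(?!\d)'

for item in [' 21:35 ', ' 21:355 ', '21:35']:
    m1 = re.search(re_with_D, item)
    m2 = re.search(re_with_lookahead, item)
    print(repr(item), m1 and m1.group(), m2 and m2.group())
```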



This looks great. However, humans do not always follow rules precisely.  

For example, **Yoweri Nseko** pointed out that strings such as `e3:37` could be incorrectly identified as matches, and one would probably still consider a time without the initial 0 to be correct.

**Let's modify our test lists!**

In [45]:
negative_good_times.extend([' e3:26 ', ' f23:18 '])
positive_good_times.extend([' 3:25 ', ' 13:45 '])

When modifying the pattern string below, you might find it useful to know that `\s` means any whitespace character.

In [46]:
hours_re = r'(0\d|1\d|2[0-3])'
minutes_re = r'[0-5]\d\D'

re_times = hours_re + ':' + minutes_re
print(f"Current re_string is:\n\t\t{re_times}\n")

test_pattern(re_times, positive_good_times)            
test_pattern(re_times, negative_good_times, False)

Current re_string is:
		(0\d|1\d|2[0-3]):[0-5]\d\D

<re.Match object; span=(1, 7), match='03:43 '> --  03:43 
<re.Match object; span=(1, 7), match='01:00 '> --  01:00 
<re.Match object; span=(1, 7), match='12:59 '> --  12:59 
<re.Match object; span=(1, 7), match='13:00 '> --  13:00 
<re.Match object; span=(1, 7), match='21:35 '> --  21:35 
 3:25 
<re.Match object; span=(1, 7), match='13:45 '> --  13:45 

----Correctly matched 6 out of 7 positives matches.

 orange
 03:60 
 26:14 
 0155 
 21:355 
 e3:26 
<re.Match object; span=(2, 8), match='23:18 '> --  f23:18 

----Correctly failed to match 6 out of 7 negative matches.
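
One possible way to handle both new cases, offered as a sketch rather than the intended solution: make the leading hour digit optional, and use the lookarounds `(?<!\S)` and `(?!\S)` to require that the time is not glued to any non-whitespace character on either side. The lists are repeated here so the block runs on its own.

```python
import re

positive_good_times = [' 03:43 ', ' 01:00 ', ' 12:59 ', ' 13:00 ',
                       ' 21:35 ', ' 3:25 ', ' 13:45 ']
negative_good_times = [' orange', ' 03:60 ', ' 26:14 ', ' 0155 ',
                       ' 21:355 ', ' e3:26 ', ' f23:18 ']

hours_re = r'([01]?\d|2[0-3])'   # 0-9, 00-19 or 20-23
minutes_re = r'[0-5]\d'          # 00-59

# (?<!\S): not preceded by a non-whitespace character
# (?!\S) : not followed by a non-whitespace character
re_times = r'(?<!\S)' + hours_re + ':' + minutes_re + r'(?!\S)'

matched_positives = [t for t in positive_good_times if re.search(re_times, t)]
matched_negatives = [t for t in negative_good_times if re.search(re_times, t)]

print(f"Matched {len(matched_positives)} of {len(positive_good_times)} positives")
print(f"Matched {len(matched_negatives)} of {len(negative_good_times)} negatives")
```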



Awesome! You are now an expert and ready to move on to the case of bad_times.

I will start your list of examples, but you do the rest.

In [47]:
positive_bad_times = [' 03:43pm ', ' 1:00 AM ', ]
negative_bad_times = ['orange', ' 03:50 XM ', ]


And you build the `re_string`.

## Getting help, but testing the help you get

Times, dates and email addresses are common types of information that one wants to extract from documents.  Above we saw how to handle times.  You can look into dates as an exercise.

Let us consider email addresses now.  **[Wikipedia](https://en.wikipedia.org/wiki/Email_address) has a detailed explanation of the rules governing the construction of valid email addresses, which we transcribe here.**

### The `local-part`

How are email addresses constructed? The standard formulation is `local-part@domain`.  The `local-part`, if unquoted, may use any of these ASCII characters:

* uppercase and lowercase Latin letters A to Z and a to z
* digits 0 to 9
* printable characters !#$%&'*+-/=?^_`{|}~
* dot ., provided that it is not the first or last character and provided also that it does not appear consecutively (e.g., John..Doe@example.com is not allowed)

The maximum total length of the local-part of an email address is 64 octets, so in principle it can amount to 64 characters.
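
The rules for an unquoted `local-part` can be sketched as a small check. This is an illustration under the rules listed above only; the helper name `is_valid_local_part` and the constant names are hypothetical.

```python
import re

# Characters allowed in an unquoted local-part (besides the dot)
ALLOWED = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"

# One or more allowed characters, optionally followed by ".more" groups:
# this rules out a leading dot, a trailing dot, and consecutive dots
LOCAL_PART_RE = re.compile(rf"[{ALLOWED}]+(\.[{ALLOWED}]+)*")


def is_valid_local_part(local_part):
    """Hypothetical helper: True if local_part follows the unquoted rules."""
    if len(local_part) > 64:
        return False
    return LOCAL_PART_RE.fullmatch(local_part) is not None


for candidate in ['John.Doe', 'John..Doe', '.John', 'John.', 'user+tag']:
    print(f"{candidate:>10} -- {is_valid_local_part(candidate)}")
```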


### The `domain`

The `domain` part of an email address has to conform to strict guidelines: it must match the requirements for a `hostname`, a list of dot-separated `DNS` labels, each label being limited to a length of 63 characters and consisting of:

* uppercase and lowercase Latin letters A to Z and a to z
* digits 0 to 9, provided that top-level domain names are not all-numeric
* hyphen -, provided that it is not the first or last character

Certain domains, for example those intended for documentation and testing, should not be resolvable, and as a result mail addressed to mailboxes in them and their sub-domains should be non-deliverable. Of note for e-mail are `example`, `invalid`, `example.com`, `example.net`, and `example.org`. 
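
The label rules for the `domain` can be sketched in the same spirit: split on dots and check each label separately. The helper name `is_valid_domain` is hypothetical, and the check covers only the rules listed above.

```python
import re

# A label is 1-63 characters: letters, digits and hyphens,
# with no hyphen in the first or last position
LABEL_RE = re.compile(r'[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?')


def is_valid_domain(domain):
    """Hypothetical helper: True if every dot-separated label is valid."""
    labels = domain.split('.')
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return False
    # The top-level domain name must not be all-numeric
    return not labels[-1].isdigit()


for candidate in ['example.com', 'eog.state.fl.us', '-bad.com', 'example.123', 'b.']:
    print(f"{candidate:>16} -- {is_valid_domain(candidate)}")
```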


Seems complicated. So, why don't we check whether someone has already built a regex for email addresses?

If you search online, you may be able to find the following solution

> `^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$`

It appears appropriately complicated!  But how do we know whether it works?

Let's test it!

In [52]:
emails = ['a@b.co', 
          'something@somethingelse.org', 
          '89@42.info', 
          'something@something.else.com', ]
not_emails = ['@b.c', 
              'a@b.', 
              'something@somethingelse.', ]

In [53]:
re_emails = '^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'
print(f"Current re_string is:\n\t\t{re_emails}\n")

test_pattern(re_emails, emails)            
test_pattern(re_emails, not_emails, False)

Current re_string is:
		^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$

<re.Match object; span=(0, 6), match='a@b.co'> -- a@b.co
<re.Match object; span=(0, 27), match='something@somethingelse.org'> -- something@somethingelse.org
<re.Match object; span=(0, 10), match='89@42.info'> -- 89@42.info
<re.Match object; span=(0, 28), match='something@something.else.com'> -- something@something.else.com

----Correctly matched 4 out of 4 positives matches.

@b.c
a@b.
something@somethingelse.

----Correctly failed to match 3 out of 3 negative matches.



Well it appears to work! 

At least it fits all our test cases... Should we create more examples to be sure?





Good decision.

For now, let's see if we can make sense of `re_emails` by breaking it down into pieces, as we did with `re_times`.



In [78]:
username_re = '[a-zA-Z0-9.]+'
domain_re = '[a-zA-Z0-9.]+.[a-zA-Z0-9]+'

re_emails = username_re + '@' + domain_re
print(f"Current re_string is:\n\t\t{re_emails}\n")

Current re_string is:
		[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+



The username portion of the original pattern has three parts:

> * `^` means that what comes next has to be at the start of the string. It only makes sense to add this if the string contains only the email address and nothing else, which is why it is left out of `username_re`. This is a very handy character when you care only about text at the start of a string or line.
> * `[a-zA-Z0-9.]` means that the characters allowed include lower case letters, upper case letters, digits, and periods.
> * `+` means that the previous element must appear at least once.

The domain portion of the original pattern has four parts:

> * `[a-zA-Z0-9.]` means that the characters allowed include lower case letters, upper case letters, digits, and periods.
> * `+` means that the previous element must appear at least once.
> * `.[a-zA-Z0-9]+` is meant to require a period followed by at least one character that is a lower case letter, an upper case letter, or a digit. Note, however, that an unescaped `.` matches any character except a new line, not only a period; a literal period would be written `\.`.
> * `$` means that the string ends here or has at most a new line left. This also only makes sense if the string contains only the email address and nothing else, so it is left out of `domain_re`.

Breaking it down like this makes some issues with `re_emails` apparent
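
One of those issues can be seen directly: the `.` before the final piece is not escaped, so it matches any character, not just a period. A minimal sketch (the test string `a@bXc` and the variable names are made up for illustration):

```python
import re

# Pattern as found online: the dot before the last piece is unescaped
found_re = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+$'

# Same pattern with the dot escaped, so that it must be a literal period
escaped_re = r'^[a-zA-Z0-9.]+@[a-zA-Z0-9.]+\.[a-zA-Z0-9]+$'

not_an_email = 'a@bXc'

print(re.search(found_re, not_an_email))     # matches: 'X' plays the role of the dot
print(re.search(escaped_re, not_an_email))   # no match
print(re.search(escaped_re, 'a@b.co'))       # still matches a proper address
```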


In [80]:
print(f"Current re_string is:\n\t\t'{re_emails}'\n")

test_pattern(re_emails, emails)            
test_pattern(re_emails, not_emails, False)

Current re_string is:
		'[a-zA-Z0-9.]+@[a-zA-Z0-9.]+.[a-zA-Z0-9]+'

<re.Match object; span=(1, 7), match='a@b.co'> --  a@b.co 
<re.Match object; span=(1, 28), match='something@somethingelse.org'> --  something@somethingelse.org 
<re.Match object; span=(1, 11), match='89@42.info'> --  89@42.info 
<re.Match object; span=(1, 29), match='something@something.else.com'> --  something@something.else.com 
<re.Match object; span=(1, 14), match='g.o@gmail.com'> --  g.o@gmail.com 
<re.Match object; span=(1, 7), match='a@b.co'> -- <a@b.co>

----Correctly matched 6 out of 6 positives matches.

 @b.c 
 a@b. 
<re.Match object; span=(1, 24), match='something@somethingelse'> --  something@somethingelse. 
<re.Match object; span=(1, 75), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa> --  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa@b.c 
<re.Match object; span=(1, 6), match='a@b.c'> --  a@b.c 
<re.Match object; span=(1, 40), match='a@b.ccccccccccccccccccccccccccccccccc

Our string is likely to contain more than just an email address.

The username cannot be of arbitrary length. 

There likely cannot be more than 4 or 5 intermediate levels before the final period.

The final string of the server cannot be a single character and cannot be longer than a few characters.

In [99]:
emails = [' a@b.co ', 
          ' something@somethingelse.org ', 
          ' 89@42.info ', 
          ' something@something.else.com ', 
          ' g.o@gmail.com ',
          '<a@b.com>',
         ]
not_emails = [' @b.c ', 
              ' a@b. ', 
              ' something@somethingelse. ', 
              ' '+'a'*70+'@b.c ', ' a@b.c ', 
              ' a@b.ccccccccccccccccccccccccccccccccccc ',
              ' a@' + 'b.'*10 + 'com ', 
              ' .a@gmail.com ',
              ' aa.@gmail.com ',
              ' a @gmail.com '
             ]

In [103]:
# username is no longer than 64 characters
# \w covers digits, upper and lower case letters and _
# first and last character should not be periods (the optional final
# [^. ]{0,1} does not fully enforce this for the last character)
#
username_re = r'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})'
# username_re = r'([<\s][^. ]{1}[\w.]{0,62}[^. ]{1})'

# final domain piece (xxx.xx): the last portion must be 2-8 letters,
# after which there must be whitespace or a closing >
#
domain_re = r'(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'

# inner domain pieces (up to 4), each must end with .
inner_domain_re = r'((\w{1,64}\.){0,4})'

re_emails = ( username_re + '@' + inner_domain_re + domain_re )

print(f"Current re_string is:\n\t\t'{re_emails}'\n")

test_pattern(re_emails, emails)            
test_pattern(re_emails, not_emails, False)

Current re_string is:
		'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})@((\w{1,64}\.){0,4})(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'

<re.Match object; span=(0, 8), match=' a@b.co '> --  a@b.co 
<re.Match object; span=(0, 29), match=' something@somethingelse.org '> --  something@somethingelse.org 
<re.Match object; span=(0, 12), match=' 89@42.info '> --  89@42.info 
<re.Match object; span=(0, 30), match=' something@something.else.com '> --  something@something.else.com 
<re.Match object; span=(0, 15), match=' g.o@gmail.com '> --  g.o@gmail.com 
<re.Match object; span=(0, 9), match='<a@b.com>'> -- <a@b.com>

----Correctly matched 6 out of 6 positives matches.

 @b.c 
 a@b. 
 something@somethingelse. 
 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa@b.c 
 a@b.c 
 a@b.ccccccccccccccccccccccccccccccccccc 
 a@b.b.b.b.b.b.b.b.b.b.com 
 .a@gmail.com 
<re.Match object; span=(0, 15), match=' aa.@gmail.com '> --  aa.@gmail.com 
 a @gmail.com 

----Correctly failed to match 9 out of 10 negativ

Excellent! It works in **almost** all of our new test cases!

We could continue to improve this `regex` (say by limiting the ending domain to only known domains).

**In order to keep readability, it is important that you try to break the `re string` into pieces that are individually easier to validate**.
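
A hedged sketch of what validating the pieces individually could look like: `re.fullmatch` checks that a piece matches an entire test string, so each piece can be probed on its own. The test strings and the `checks` dictionary are made up for illustration.

```python
import re

inner_domain_re = r'((\w{1,64}\.){0,4})'
domain_re = r'(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'

# Each entry records whether a piece matches the whole test string
checks = {
    'two inner levels':    re.fullmatch(inner_domain_re, 'mail.example.') is not None,
    'five inner levels':   re.fullmatch(inner_domain_re, 'a.b.c.d.e.') is not None,
    'domain + space':      re.fullmatch(domain_re, 'gmail.com ') is not None,
    'domain + >':          re.fullmatch(domain_re, 'jeb.org>') is not None,
    'one-letter ending':   re.fullmatch(domain_re, 'b.c ') is not None,
}

for label, result in checks.items():
    print(f"{label:>20} -- {result}")
```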


# Playing with Jeb Bush's emails 

For those of you too young to know anything, this was a thing before the 2016 elections.

In 2015, before his ill-fated primary run for the Republican Party Presidential Nomination, Jeb Bush released a number of his e-mails in a bid for transparency. 

As usually happens, the release wasn't vetted very well and some constituents' Social Security numbers were exposed. 

But let's ignore that and instead focus on finding out who then-Florida Governor Jeb Bush corresponded with. The data is located in the folder `Data/Emails/`.

Let's see what data we have there...

In [104]:
emails_folder = Path.cwd() / 'Data' / 'Emails'

filenames = list( emails_folder.glob('*.txt') )
for i, filename in enumerate( filenames ):
    print(f"{i:>2} -- {filename}")


 0 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-06Jun.txt
 1 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-10Oct.txt
 2 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-03Mar.txt
 3 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-08Aug.txt
 4 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-07Jul.txt
 5 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Processing/Data/Emails/2001-12Dec.txt
 6 -- /Users/amaral/Dropbox/Code_Development/COURSES/Amaral_Lab_Intro_to_Data_Science/Module_Natural_Language_Pr

Let's pick the 8th file, from `01 Jan`.

Note that these files were encoded in the `ISO-8859-1` standard.


In [109]:
with open( filenames[7], 'r', encoding = 'ISO-8859-1') as file_in:
    emails_string = file_in.read()
    
print(f"This email file has {place_commas(len(emails_string))} "
      f"characters.\n\n-----------------------")

print(emails_string[:300])


This email file has 4,608,586 characters.

-----------------------
From:	Bill and Carol Steele <scl@uslink.net>
Sent:	Wednesday, January 31, 2001 11:19 PM
To:	Governor Bush
Subject:	Homestead AFB

31 January 2000

Dear Governor Bush:

I am writing to urge you to support the Air Force in its decision to give 
Miami-Dade County 700 acres of surplus property at Homest




Clearly there are some differences from what we were looking at.

In particular, it seems that email addresses may be enclosed by `<...>`.

That is handy and easy enough to handle with our `re_string`, which already accepts `<` and `>` as delimiters.


In [110]:
# Raw strings (r'...') keep \s and \w from being read as string escapes
username_re = r'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})'
domain_re = r'(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'
inner_domain_re = r'((\w{1,64}\.){0,4})'

re_emails = ( username_re + '@' + inner_domain_re + domain_re )

print(f"Current re_string is:\n\t\t'{re_emails}'\n")

for match in re.finditer(re_emails, emails_string[:5000]):
    print(match)

Current re_string is:
		'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})@((\w{1,64}\.){0,4})(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'

<re.Match object; span=(27, 44), match=' <scl@uslink.net>'>
<re.Match object; span=(882, 899), match=' <scl@uslink.net>'>
<re.Match object; span=(1715, 1732), match='\tROMARC7@aol.com\n'>
<re.Match object; span=(1778, 1791), match='\tjeb@jeb.org\n'>
<re.Match object; span=(1794, 1818), match='\tungerk@eog.state.fl.us\n'>
<re.Match object; span=(3188, 3211), match=' <ellenwhitmer@msn.com>'>


Not bad. All the matches are good email addresses.  

Notice that some are enclosed in **<...>**, while others are instead surrounded by whitespace.


Do you need to buy an email list to sell your miracle COVID cure? I can offer you a very good deal!

Let's check how many there are in the entire file.


In [114]:
# Raw strings (r'...') keep \s and \w from being read as string escapes
username_re = r'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})'
domain_re = r'(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'
inner_domain_re = r'((\w{1,64}\.){0,4})'

re_emails = ( username_re + '@' + inner_domain_re + domain_re )

print(f"Current re_string is:\n\t\t'{re_emails}'\n")

for i, match in enumerate( re.finditer(re_emails, emails_string) ):
    if i % 50 == 0:
        print(f"{i:>5}--{match.group().strip(whitespace+'<>'):>50} -- {match.start()}")

Current re_string is:
		'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})@((\w{1,64}\.){0,4})(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'

    0--                                    scl@uslink.net -- 27
   50--                                       jeb@jeb.org -- 45946
  100--                                       jeb@jeb.org -- 78716
  150--                           yourvoice@myflorida.com -- 143564
  200--                      jimmy.watson@fl.ngb.army.mil -- 197964
  250--                     secretary@mail.dc.state.fl.us -- 234197
  300--                                       jeb@jeb.org -- 273683
  350--                                       jeb@jeb.org -- 317773
  400--                                       jeb@jeb.org -- 359912
  450--                                       jeb@jeb.org -- 400180
  500--                            wgilbert@earthlink.net -- 454037
  550--                       BBevis@mail.dos.state.fl.us -- 486109
  600--                              holman@blankrome.com -- 514989
  650-- 

That's really a lot of compromised e-mail addresses. 

There are so many that we can reproduce some of the analyses we did on text, using email addresses as tokens instead of words.

Change the code so that you store all the email addresses you found, and repeat some of the analyses that you conducted earlier.
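As a possible starting point, here is a minimal sketch that stores every match in a list and counts repeats with `Counter`. The `sample_text` string is a made-up stand-in for `emails_string`, so the cell runs on its own. The names `found_emails` and `email_counts` are just illustrative choices.

```python
import re
from collections import Counter
from string import whitespace

# Same pattern as above, written here as raw strings
username_re = r'([<\s][^. ]{1}[\w.]{0,62}[^. ]{0,1})'
domain_re = r'(\w{1,64}\.[a-zA-Z]{2,8}[>\s])'
inner_domain_re = r'((\w{1,64}\.){0,4})'
re_emails = username_re + '@' + inner_domain_re + domain_re

# Hypothetical stand-in for emails_string (not real data)
sample_text = (
    "From:\tAnn Example <ann@example.com>\n"
    "To:\tjeb@jeb.org\n"
    "From:\tBob Example <bob@mail.example.org>\n"
    "To:\tjeb@jeb.org\n"
)

# Keep every address, stripped of the enclosing whitespace or <...>
found_emails = [ match.group().strip(whitespace + '<>')
                 for match in re.finditer(re_emails, sample_text) ]

# Count how often each address appears, as we did for words
email_counts = Counter(found_emails)

for email, count in email_counts.most_common(3):
    print(f"{email:>30} -- {count}")
```

Once the addresses are in a `Counter`, the frequency analyses used earlier for words should carry over with little change.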



# Additional Resources

If you're interested in learning more about using and writing regular expressions, you can continue with the following documentation.

* [More Python documentation](https://docs.python.org/3/howto/regex.html#regex-howto)
* [A great little notebook](http://nbviewer.ipython.org/github/sampathweb/python_reference/blob/master/tutorials/useful_regex.ipynb)



# Exercises


If you look carefully, you will see that email addresses enclosed in `<...>` are preceded by the name of the owner of the email account.

Can you associate names to emails?

How would you do it? One name to one email? One name to many emails (Jeb clearly used a bunch of different email addresses)?

If you use a `dictionary` to store the data, what would be the key? 
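One possible way to start, sketched on made-up header lines: use named groups to capture the name and the address together, and store them in a dictionary keyed by name, with a set of addresses as the value. Both the pattern `name_email_re` and the choice of key are assumptions made for illustration. Keying on the address instead is just as defensible.

```python
import re
from collections import defaultdict

# Hypothetical header lines, modeled on the format seen in the file
sample_text = (
    "From:\tAnn Example <ann@example.com>\n"
    "From:\tAnn Example <ann@work.example.org>\n"
    "From:\tBob Example <bob@example.com>\n"
)

# Assumed pattern: a name made of letters, spaces, dots, apostrophes
# or hyphens, followed by an address enclosed in <...>
name_email_re = r"(?P<name>[A-Za-z][\w .'-]*?)\s*<(?P<email>[^<>\s]+@[^<>\s]+)>"

# One name can map to many addresses, so each value is a set
emails_by_name = defaultdict(set)
for match in re.finditer(name_email_re, sample_text):
    emails_by_name[match.group('name')].add(match.group('email'))

for name, emails in sorted(emails_by_name.items()):
    print(f"{name:>15} -- {sorted(emails)}")
```

On real data the same person will likely appear under several spellings, so some normalization of the names is probably needed before using them as keys.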

Besides names that appear close to the email address, you can also find names in the signature of the email. You can even find other information such as addresses and such.

Can you scrape that information?

People are clearly an important type of entity in a corpus of emails.  However, the emails themselves are also important entities.

Can you isolate each email?

What **metadata** can you extract concerning a given email?
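A minimal sketch of one approach, assuming that every message starts with a line beginning with `From:` followed by a tab, as in the excerpt printed earlier. The sample text is made up, and the names `messages`, `header_re` and `metadata` are illustrative.

```python
import re

# Hypothetical sample with two messages, modeled on the file format
sample_text = (
    "From:\tAnn Example <ann@example.com>\n"
    "Sent:\tWednesday, January 31, 2001 11:19 PM\n"
    "To:\tGovernor Bush\n"
    "Subject:\tHomestead AFB\n"
    "\n"
    "Dear Governor Bush:\n"
    "From:\tBob Example <bob@example.com>\n"
    "Sent:\tThursday, February 01, 2001 9:05 AM\n"
    "To:\tGovernor Bush\n"
    "Subject:\tSchools\n"
    "\n"
    "Thank you.\n"
)

# Assumption: a line starting with "From:<tab>" marks a new message
starts = [ m.start() for m in
           re.finditer(r'^From:\t', sample_text, flags=re.MULTILINE) ]
ends = starts[1:] + [len(sample_text)]
messages = [ sample_text[s:e] for s, e in zip(starts, ends) ]

# Header lines have the form "Field:<tab>value"
header_re = re.compile(r'^(From|Sent|To|Subject):\t(.*)$', flags=re.MULTILINE)

# findall returns (field, value) pairs, which dict() turns into a mapping
metadata = [ dict(header_re.findall(message)) for message in messages ]

for i, fields in enumerate(metadata):
    print(f"{i:>2} -- {fields['Sent']} -- {fields['Subject']}")
```

Replies and forwarded messages may quote earlier headers inside the body, so on the real file this split may cut some messages into several pieces. Checking a few of the results by eye is a good idea.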

## Searching filesystems

Let's say that you want to find all `PDF` files in your home directory on your computer.

How do you obtain the path of every single file? Hint: `glob.glob`

How do you search each file path for whether the file is a `PDF` or not?

How can you confirm that the file is indeed a `PDF` file?
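The three questions above can be sketched as follows, under a few assumptions. The sketch walks a small throwaway folder instead of your home directory, it uses `Path.rglob` from `pathlib` (calling `glob.glob` with `recursive=True` is an alternative), and it treats a file as a real `PDF` if its content starts with the bytes `%PDF-`. The helper names `is_pdf_name` and `has_pdf_signature` are made up for this example.

```python
import re
import tempfile
from pathlib import Path

def is_pdf_name(path):
    """True if the file name ends in .pdf, in any capitalization."""
    return re.search(r'\.pdf$', path.name, flags=re.IGNORECASE) is not None

def has_pdf_signature(path):
    """True if the file content starts with the bytes '%PDF-'."""
    with open(path, 'rb') as file_in:
        return file_in.read(5) == b'%PDF-'

# Build a tiny throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / 'sub').mkdir()
    (root / 'real.PDF').write_bytes(b'%PDF-1.4\n...')
    (root / 'sub' / 'fake.pdf').write_text('just text')
    (root / 'notes.txt').write_text('hello')

    # Every file below root, including those in subfolders
    all_files = [ p for p in root.rglob('*') if p.is_file() ]

    # Files whose name looks like a PDF
    pdf_named = sorted( p.name for p in all_files if is_pdf_name(p) )

    # Files whose name and content both look like a PDF
    pdf_confirmed = sorted( p.name for p in all_files
                            if is_pdf_name(p) and has_pdf_signature(p) )

print(pdf_named)
print(pdf_confirmed)
```

To search your own files, replace `root` with `Path.home()`. Expect it to take a while, and expect some folders to raise permission errors.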



## Regex Golf

There are many more complicated things you can do with regexes. There is even a game called [regex golf](http://regex.alf.nu) that the nerdiest of all nerds play from time to time, where the objective is to come up with the shortest regex that matches certain strings while [avoiding others](http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb). The game is good practice for improving your regular expression skills.

As a test, let's play a game of regex golf. Let's try to match Star Wars movie titles, but not Star Trek movie titles.

In [None]:
# This was MUCH simpler then
#
starwars = [ 'The Phantom Menace', 'Attack of the Clones', 
             'Revenge of the Sith', 'A New Hope', 
             'The Empire Strikes Back', 'Return of the Jedi' ]

startrek = [ 'The Wrath of Khan', 'The Search for Spock', 
             'The Voyage Home', 'The Final Frontier', 
             'The Undiscovered Country', 'Generations',
             'First Contact', 'Insurrection', 'Nemesis']