# Regular Expressions

Regular expressions are a small language of their own: a sequence of characters that describes a pattern to match in text. They are largely language agnostic and are supported by most programming languages, including Python, R, JavaScript, and Java, although the exact syntax varies slightly between implementations. Regular expressions can look complicated because they are so versatile: you can describe almost any pattern of text.

## re Module
Today we will use Python's re module to find patterns in text. The module is part of the Python standard library, so it is available as soon as Python is installed; no separate installation is needed.

<b>Step 1</b>

In [1]:
# even though re is a part of base python, you still need to import it
import re

In [2]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

sentence = 'Start a sentence and then bring it to an end'

<b>Step 2</b>

### Raw Strings (python)
In Python, prefixing a string literal with the letter 'r' makes it a raw string. Unlike a regular string, a raw string treats a backslash (\) as a literal character rather than as the start of an escape sequence such as \t.

In [3]:
print('\tTab')

	Tab


In [4]:
#raw string
print(r'\tTab')

\tTab
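
Raw strings matter for regular expressions because patterns such as \d or \b only work if the backslash reaches the re module untouched. As a minimal sketch (the variable names `regular` and `raw` are just illustrative), comparing the lengths of the two strings shows the difference:

```python
# A regular string turns \t into a single tab character;
# a raw string keeps the backslash and the 't' as two separate characters.
regular = '\tTab'
raw = r'\tTab'

print(len(regular))  # 4
print(len(raw))      # 5

# Doubling the backslash in a regular string produces the same text as the raw string.
print(raw == '\\tTab')  # True
```

Using raw strings for every pattern avoids having to remember which backslash sequences Python would otherwise interpret first.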


### Let's start writing regular expressions

We will start with some patterns that match literal characters

<b>Step 3</b>

In [5]:
pattern = re.compile(r'abc')

#we are searching our 'text_to_search' variable for literally the string 'abc'
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>


The .finditer method returns an iterator that yields one Match object for each non-overlapping match it finds. Each Match object carries information about that match.

<b>span:</b> The beginning and end index of the match in the searched string. This is useful because it lets us use string slicing to access our match.

In [6]:
#the same result using string indexing
print(text_to_search[1:4])

abc


Notice that our pattern found exactly one match of 'abc'. It did not match 'ABC', 'bca', or any other mix of those characters. This regular expression looks for LITERALLY the string 'abc' (lower case).
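
Besides printing a Match object, we can ask it for its parts directly. The sketch below uses a short made-up string (`text` is a hypothetical sample, not our text_to_search) and `re.search`, which returns only the first match or None:

```python
import re

text = 'xyz abc ABC bca'   # hypothetical sample text

match = re.search(r'abc', text)   # first match only, or None if there is none

print(match.group())   # abc      (the matched text)
print(match.start())   # 4        (index where the match begins)
print(match.end())     # 7        (index just past the end of the match)
print(match.span())    # (4, 7)   (start and end as a tuple)

# Slicing the original string with the span gives back the matched text.
start, end = match.span()
print(text[start:end])  # abc
```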

## Meta Characters
Meta characters are characters that have a special meaning in a regular expression instead of standing for themselves. Unless they are escaped, they match other characters or positions.

These include ., \\, ?, $, etc.

Let's start with a period (.).

<b>Step 4</b>

In [7]:
# What will happen in this regular expression?
pattern = re.compile(r'.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
<re.M

## Escape Character
We need to use the escape character (\\, backslash) to make the pattern match a literal period.

<b>Step 5</b>

In [8]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(113, 114), match='.'>
<re.Match object; span=(149, 150), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(175, 176), match='.'>
<re.Match object; span=(223, 224), match='.'>
<re.Match object; span=(254, 255), match='.'>
<re.Match object; span=(267, 268), match='.'>


A practical example of this escape character is a URL.

In [9]:
pattern = re.compile(r'coreyms\.com')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(142, 153), match='coreyms.com'>
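
When the text to match comes from a variable, escaping every meta character by hand is error-prone. The standard library's `re.escape` does it for us. A small sketch, where the names `loose` and `strict` are just illustrative:

```python
import re

url = 'coreyms.com'

# Unescaped, the '.' matches any character, so this pattern also accepts 'coreymsXcom'.
loose = re.compile(r'coreyms.com')
print(loose.search('coreymsXcom') is not None)   # True

# re.escape inserts the backslash in front of the period.
escaped = re.escape(url)
print(escaped)   # coreyms\.com

strict = re.compile(escaped)
print(strict.search('coreymsXcom') is None)                   # True
print(strict.search('visit coreyms.com today') is not None)   # True
```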


Literal searches aren't too exciting because we are already used to doing them. Now we can start using <b>Meta Characters</b> the way they are intended to be used.


<b>Step 6</b>

In [10]:
# \d matches any digit 0-9
pattern = re.compile(r'\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(155, 156), match='3'>
<re.Match object; span=(156, 157), match='2'>
<re.Match object; span=(157, 158), match='1'>
<re.Match object; span=(159, 160), match='5'>
<re.Match object; span=(160, 161), match='5'>
<re.Match object; span=(161, 162), match='5'>
<re.Match object; span=(163, 164), match='4'>
<re.Match object; span=(164, 165), match='3'>
<re.Match object; span=(165, 166), match='2'>
<re.Match object; span=(166, 167), match='1'>
<re.Match object; span=(168, 169), match='1'>
<re.Match object; span=(169, 170), match='2'>
<re.Matc

In [None]:
# \w matches any word character (a-z, A-Z, 0-9, _)
pattern = re.compile(r'\w')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
# \s matches white space (space, tab, newline)
pattern = re.compile(r'\s')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
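
The output for \w and \s on text_to_search is long, so here is a sketch on a tiny made-up string instead (`sample` is hypothetical). It uses `re.findall`, which returns the matched text as a list of strings rather than Match objects:

```python
import re

sample = 'Ha 42_x\n'   # hypothetical sample: letters, a space, digits, underscore, newline

digits = re.findall(r'\d', sample)
word_chars = re.findall(r'\w', sample)
whitespace = re.findall(r'\s', sample)

print(digits)       # ['4', '2']
print(word_chars)   # ['H', 'a', '4', '2', '_', 'x']
print(whitespace)   # [' ', '\n']
```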

## Anchors
Anchors do not match any characters themselves; they match invisible positions before or after characters. They are typically used in conjunction with other patterns.

Anchor characters include \b, \B, ^, $

<b>Step 7</b>

In [11]:
# \b matches a word boundary: the position between a word character (\w) and a non-word character, or the start/end of the string
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(67, 69), match='Ha'>
<re.Match object; span=(70, 72), match='Ha'>


In [12]:
# \B matches NOT a word boundary. This is the opposite of the above example
pattern = re.compile(r'\BHa')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(72, 74), match='Ha'>
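
To see the boundaries more clearly, the sketch below runs the same patterns on just the 'Ha HaHa' line (stored in a hypothetical variable `laugh`) and adds a third pattern with the boundary after 'Ha':

```python
import re

laugh = 'Ha HaHa'

# 'Ha' with a word boundary in front of it: the first two 'Ha's.
at_start = [m.span() for m in re.finditer(r'\bHa', laugh)]
print(at_start)      # [(0, 2), (3, 5)]

# 'Ha' with NO word boundary in front of it: only the last 'Ha'.
not_at_start = [m.span() for m in re.finditer(r'\BHa', laugh)]
print(not_at_start)  # [(5, 7)]

# 'Ha' with a word boundary after it: the first and the last 'Ha'.
at_end = [m.span() for m in re.finditer(r'Ha\b', laugh)]
print(at_end)        # [(0, 2), (5, 7)]
```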


<b>Step 8</b>

In [16]:
sentence = 'Start a sentence and then bring it to an end'

# ^ matches something at the beginning of a string.
pattern = re.compile(r'^Start')

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>


In [14]:
# $ matches something at the end of a string.
pattern = re.compile(r'end$')

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

<re.Match object; span=(41, 44), match='end'>
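
It can help to see an anchor reject a match. In the sketch below, 'then' occurs in the sentence, but neither at the beginning nor at the end, so the anchored patterns find nothing. `re.search` is used here because it returns None when there is no match:

```python
import re

sentence = 'Start a sentence and then bring it to an end'

print(re.search(r'then', sentence) is not None)   # True: 'then' occurs somewhere
print(re.search(r'^then', sentence) is None)      # True: but not at the beginning
print(re.search(r'then$', sentence) is None)      # True: and not at the end

# Using both anchors requires the pattern to cover the entire string.
print(re.search(r'^end$', 'end') is not None)     # True
print(re.search(r'^end$', sentence) is None)      # True
```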


## Practical Examples

Up to this point, our examples are not "real". Let's do some more practical exercises. 

### Matching Phone Numbers

<b>Step 9</b>

In [17]:
# look for any three digits in a row
pattern = re.compile(r'\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(55, 58), match='123'>
<re.Match object; span=(58, 61), match='456'>
<re.Match object; span=(61, 64), match='789'>
<re.Match object; span=(155, 158), match='321'>
<re.Match object; span=(159, 162), match='555'>
<re.Match object; span=(163, 166), match='432'>
<re.Match object; span=(168, 171), match='123'>
<re.Match object; span=(172, 175), match='555'>
<re.Match object; span=(176, 179), match='123'>
<re.Match object; span=(181, 184), match='123'>
<re.Match object; span=(185, 188), match='555'>
<re.Match object; span=(189, 192), match='123'>
<re.Match object; span=(194, 197), match='800'>
<re.Match object; span=(198, 201), match='555'>
<re.Match object; span=(202, 205), match='123'>
<re.Match object; span=(207, 210), match='900'>
<re.Match object; span=(211, 214), match='555'>
<re.Match object; span=(215, 218), match='123'>


In [18]:
#matches any 3 digits, followed by any character, another 3 digits, any character, followed by 4 digits
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


Now we will use our data.txt file, which contains fake names, addresses, phone numbers, and email addresses. Let's read that file into Python and pull some information out of it.

<b>Step 10</b>

In [19]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

#open this file as a file object
#if you are unfamiliar with this, don't think too hard. Just remember, we are opening the file. 
with open('data.txt', 'r') as f:
    contents = f.read()

    matches = pattern.finditer(contents)

    for match in matches:
        print(match)

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
<re.Match object; spa
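
Printing Match objects is handy for exploring, but usually we want the matched text itself. The sketch below collects the phone numbers into a list with `match.group()`. Since data.txt is not reproduced here, it uses a short made-up string (`contents` below is a hypothetical stand-in, not the real file):

```python
import re

# Hypothetical stand-in for the text read from data.txt.
contents = (
    'Jane Doe\n'
    '615-555-7164\n'
    'jane@example.com\n'
    '\n'
    'John Roe\n'
    '800-555-5669\n'
    'john@example.com\n'
)

pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

# group() with no arguments returns the full text of each match.
phone_numbers = [match.group() for match in pattern.finditer(contents)]
print(phone_numbers)   # ['615-555-7164', '800-555-5669']
```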

### Character Sets

Instead of matching any phone number, let's be more specific. We will now match only phone numbers that use dashes or dots as separators.

<b>Character Sets</b> use [ ] characters to define a set of allowed characters. Even though a character set can list multiple characters, it still matches only one character in the text.

<b>Step 11</b>

In [22]:
# look for any phone number using just a - or . to separate numbers
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>


Another example, what if we only want to match '800' or '900' phone numbers?

In [23]:
# each character set only matches ONE character
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>
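
The same pattern can be checked against a few made-up numbers to confirm what it rejects. In this sketch, `numbers` is a hypothetical test string:

```python
import re

# Hypothetical test string: two numbers that should match and two that should not.
numbers = '800-555-1234 900.555.1234 700-555-1234 800*555*1234'

pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

found = pattern.findall(numbers)
print(found)   # ['800-555-1234', '900.555.1234']

# A character set consumes exactly one character per use.
single = re.findall(r'[89]', '8989')
print(single)  # ['8', '9', '8', '9']
```

The '700' number fails at the first character set, and the '*' separated number fails at [-.].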


Now let's do the same with our data.txt file (which is much bigger).

<b>Step 12</b>

In [None]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

with open('data.txt', 'r') as f:
    contents = f.read()

    matches = pattern.finditer(contents)

    for match in matches:
        print(match)

### Dash (-) in a Character Set

The dash character is itself special inside a character set. At the beginning or end of a character set, it matches a literal dash (-). Placed between two characters, it specifies a range of values.

<b>Step 13</b>

In [24]:
#look for digits 1-5
pattern = re.compile(r'[1-5]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(155, 156), match='3'>
<re.Match object; span=(156, 157), match='2'>
<re.Match object; span=(157, 158), match='1'>
<re.Match object; span=(159, 160), match='5'>
<re.Match object; span=(160, 161), match='5'>
<re.Match object; span=(161, 162), match='5'>
<re.Match object; span=(163, 164), match='4'>
<re.Match object; span=(164, 165), match='3'>
<re.Match object; span=(165, 166), match='2'>
<re.Match object; span=(166, 167), match='1'>
<re.Match object; span=(168, 169), match='1'>
<re.Match object; span=(169, 170), match='2'>
<re.Match object; span=(170, 171), match='3'>
<re.Match object; span=(172, 173), match='5'>
<re.Match object; span=(173, 174), match='5'>
<re.Match object; span=(174, 175), match='5'>
<re.Match object; span=(176, 177), match='1'

In [25]:
#match lower case letters a-e
pattern = re.compile(r'[a-e]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(68, 69), match='a'>
<re.Match object; span=(71, 72), match='a'>
<re.Match object; span=(73, 74), match='a'>
<re.Match object; span=(77, 78), match='e'>
<re.Match object; span=(79, 80), match='a'>
<re.Match object; span=(82, 83), match='a'>
<re.Match object; span=(84, 85), match='a'>
<re.Match object; span=(85, 86), match='c'>
<re.Match object; span=(87, 88), match='e'>
<re.Match object; span=(93, 94), match='e'>
<re.Match object; span=(94, 95), match='e'>
<re.Match object; span=(95, 96), match='d'>
<re.Match object; span=(100, 101), match='b'>
<re.Match object; span=(101, 102), match='e'>
<re.Match object; span=(103, 104), match='e'>
<re.Match object; span=(105, 106), match='c'>
<re.Match object; span=(106, 107), match='a'>
<re.Match object; span=(108, 109

In [26]:
#match lower case OR upper case letters a - e. Just put them back to back.
pattern = re.compile(r'[a-eA-E]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(28, 29), match='A'>
<re.Match object; span=(29, 30), match='B'>
<re.Match object; span=(30, 31), match='C'>
<re.Match object; span=(31, 32), match='D'>
<re.Match object; span=(32, 33), match='E'>
<re.Match object; span=(68, 69), match='a'>
<re.Match object; span=(71, 72), match='a'>
<re.Match object; span=(73, 74), match='a'>
<re.Match object; span=(77, 78), match='e'>
<re.Match object; span=(79, 80), match='a'>
<re.Match object; span=(80, 81), match='C'>
<re.Match object; span=(82, 83), match='a'>
<re.Match object; span=(84, 85), match='a'>
<re.Match object; span=(85, 86), match='c'>
<re.Match object; span=(87, 88), match='e'>
<re.Match object; span=(93, 94), match='e'>
<re.Match object; span=(94, 95), match='e'>
<re.Match object; span=(95, 96), match='d'
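
The position of the dash decides how it is read. A compact sketch on a made-up string (`sample` is hypothetical) shows the literal and range readings side by side, plus an escaped dash, which is always literal:

```python
import re

sample = 'a-c b.d 3-5'   # hypothetical sample text

dash_first = re.findall(r'[-.]', sample)     # dash at the start: literal dash or dot
print(dash_first)    # ['-', '.', '-']

dash_between = re.findall(r'[a-c]', sample)  # dash between characters: range a to c
print(dash_between)  # ['a', 'c', 'b']

dash_last = re.findall(r'[ac-]', sample)     # dash at the end: 'a', 'c', or a literal dash
print(dash_last)     # ['a', '-', 'c', '-']

dash_escaped = re.findall(r'[a\-c]', sample) # escaped dash: literal even in the middle
print(dash_escaped)  # ['a', '-', 'c', '-']
```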

### Caret (^) in character set

When the caret is the first character in a character set, it negates the set: the set then matches any one character that is NOT listed.

<b>Step 14</b>

In [27]:
#find any character that is NOT a-z or A-Z
pattern = re.compile(r'[^a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(27, 28), match='\n'>
<re.Match object; span=(54, 55), match='\n'>
<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(65, 66), match='\n'>
<re.Match object; span=(66, 67), match='\n'>
<re.Match object; span=(69, 70), match=' '>
<re.Match object; span=(74, 75), match='\n'>
<re.Match object; span=(75, 76), match='\n'>
<re.Match object; span=(90, 91), match=' '>
<re.Match object; span=(91, 92), match='('>
<re.Match object; span=(96, 97), match=' '>
<re.Match object; span=(99, 100), match=' '>
<re.Match object; span=(10
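
The caret only negates when it comes first inside the brackets; anywhere else in the set it is a literal caret. A short sketch on a made-up string (`sample` is hypothetical):

```python
import re

sample = 'a1 B2^'   # hypothetical sample text

# Caret first: match anything that is NOT a letter.
not_letters = re.findall(r'[^a-zA-Z]', sample)
print(not_letters)    # ['1', ' ', '2', '^']

# Caret not first: match a literal 'a' or a literal '^'.
a_or_caret = re.findall(r'[a^]', sample)
print(a_or_caret)     # ['a', '^']
```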

## Quantifiers

Typing out repeated tokens one at a time (e.g. \d\d\d) is prone to errors. We can use <b>quantifiers</b> to match a specific number of repetitions or a range of repetitions.

<b>Step 15</b>

In [29]:
#finding a phone number just like above using quantifiers. Specify the number of digits you are looking for.
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(155, 167), match='321-555-4321'>
<re.Match object; span=(168, 180), match='123.555.1234'>
<re.Match object; span=(181, 193), match='123*555*1234'>
<re.Match object; span=(194, 206), match='800-555-1234'>
<re.Match object; span=(207, 219), match='900-555-1234'>
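
The curly-brace form is one of several quantifiers; the later cells also rely on ?, * and +. As a rough summary sketch (the sample strings are made up), each quantifier applies to the token immediately before it:

```python
import re

sample = '1 12 123 1234'   # hypothetical sample text

exactly_three = re.findall(r'\d{3}', sample)     # {3}: exactly 3
print(exactly_three)   # ['123', '123']

two_to_three = re.findall(r'\d{2,3}', sample)    # {2,3}: between 2 and 3
print(two_to_three)    # ['12', '123', '123']

one_or_more = re.findall(r'\d+', sample)         # +: 1 or more
print(one_or_more)     # ['1', '12', '123', '1234']

zero_or_more = re.findall(r'ab*', 'a ab abbb')   # *: 0 or more (of 'b' here)
print(zero_or_more)    # ['a', 'ab', 'abbb']

zero_or_one = re.findall(r'Mr\.?', 'Mr Mr.')     # ?: 0 or 1 (of the period here)
print(zero_or_one)     # ['Mr', 'Mr.']
```

Note that '1234' contributes only '123' to the first two results: the pattern takes as many digits as it is allowed, and the leftover '4' is too short to match on its own.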


### Matching Names

Sometimes we don't know the exact number of characters we are looking for. Let's use names from our text_to_search as an example

<b>Step 16</b>

In [28]:
#we will start with the variations of Mr
pattern = re.compile(r'Mr\.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(221, 224), match='Mr.'>
<re.Match object; span=(265, 268), match='Mr.'>


Some of the 'misters' have a period after Mr and some don't, so we need to make the period optional.

In [30]:
#using the optional character '?'
pattern = re.compile(r'Mr\.?')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(221, 224), match='Mr.'>
<re.Match object; span=(233, 235), match='Mr'>
<re.Match object; span=(251, 253), match='Mr'>
<re.Match object; span=(265, 268), match='Mr.'>


<b>Step 17</b>

Now we will match the whole name for all the Misters

In [32]:
#\s is whitespace.  \w is word character.  * is quantifier for 0 or more of the preceding token
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(221, 232), match='Mr. Schafer'>
<re.Match object; span=(233, 241), match='Mr Smith'>
<re.Match object; span=(265, 270), match='Mr. T'>


Now let's include all the titles: Mr, Ms, and Mrs.

## Groups

Groups, created using ( ), let us treat part of a pattern as a unit. Combined with |, a group lets us match any one of several alternatives.

<b>Step 18</b>

In [34]:
# | (vertical bar or pipe) is an 'or' character
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(221, 225), match='Mr. '>
<re.Match object; span=(233, 236), match='Mr '>
<re.Match object; span=(242, 245), match='Ms '>
<re.Match object; span=(251, 256), match='Mrs. '>
<re.Match object; span=(265, 269), match='Mr. '>


Now put it all together to get all the names

In [35]:
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(221, 232), match='Mr. Schafer'>
<re.Match object; span=(233, 241), match='Mr Smith'>
<re.Match object; span=(242, 250), match='Ms Davis'>
<re.Match object; span=(251, 264), match='Mrs. Robinson'>
<re.Match object; span=(265, 270), match='Mr. T'>
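
Groups also capture the text they match, so we can pull the pieces apart afterwards. The sketch below is a variation on the pattern above, not the original: it wraps the name in a second pair of parentheses (my addition) so that `group(1)` is the title and `group(2)` is the name. The `names` string repeats the name lines from text_to_search:

```python
import re

names = 'Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T\n'

# Second group added for illustration: it captures the name after the title.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s([A-Z]\w*)')

# group(0) is the whole match; group(1) and group(2) are the captured parts.
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(names)]
print(pairs)
# [('Mr', 'Schafer'), ('Mr', 'Smith'), ('Ms', 'Davis'), ('Mrs', 'Robinson'), ('Mr', 'T')]
```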


### Matching Emails


<b>Step 19</b>

In [36]:
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

#we will start by matching just the first address
#before the @ it is one or more uppercase or lowercase letters (+ means one or more)
#after the @ it is one or more letters followed by .com
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>


In [37]:
#now we can get the other emails
#the second email includes a '.' and ends with '.edu'
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>


In [None]:
#now for the final address which has numbers and '-' and ends with .net

pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)
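
This cell has no saved output, so here is a self-contained sketch of what it should produce. It also shows a common surprise: because the pattern contains a group, `findall` returns only the group's text, while `group(0)` returns the whole address.

```python
import re

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

# group(0) is the full match for each address.
addresses = [m.group(0) for m in pattern.finditer(emails)]
print(addresses)
# ['CoreyMSchafer@gmail.com', 'corey.schafer@university.edu', 'corey-321-schafer@my-work.net']

# With one capturing group in the pattern, findall returns just that group.
endings = pattern.findall(emails)
print(endings)   # ['com', 'edu', 'net']
```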

## Reading Regular Expressions

I find it easier to write my own regular expressions than to read other people's. Reading them is not impossible, though, and it is a useful skill: you don't always want to write a complicated regular expression yourself if you can reuse an existing one. Let's practice with a regular expression meant to match email addresses.

<b>Step 20</b>

In [38]:
#first, make sure to run it and see that it works
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(25, 53), match='corey.schafer@university.edu'>
<re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>
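
One way to read a pattern like this is to split it into its parts and annotate each one. Python's `re.VERBOSE` flag allows exactly that: whitespace in the pattern is ignored and # starts a comment (outside of character sets). The sketch below is the same pattern spread over several lines; the comments are my reading of it:

```python
import re

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'''
    [a-zA-Z0-9_.+-]+    # before the @: one or more letters, digits, _ . + or -
    @                   # a literal @
    [a-zA-Z0-9-]+       # domain name: one or more letters, digits or -
    \.                  # a literal period
    [a-zA-Z0-9-.]+      # ending: one or more letters, digits, - or .
''', re.VERBOSE)

found = [m.group() for m in pattern.finditer(emails)]
print(found)
# ['CoreyMSchafer@gmail.com', 'corey.schafer@university.edu', 'corey-321-schafer@my-work.net']
```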
