# Regular Expressions

Regular expressions are a small language of their own: a sequence of characters that describes a pattern to match in text. They are largely language agnostic and are supported in most programming languages, including Python, R, JavaScript, and Java, although the exact syntax varies slightly between implementations. Regular expressions can look complicated because they are so versatile; you can match almost any pattern of text.

## re Module
Today we will use Python's re module to find patterns in text. The module is part of Python's standard library, so it is available in any normal Python installation without installing anything extra.

<b>Step 1</b>

In [2]:
# even though re is a part of base python, you still need to import it
import re

In [None]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

sentence = 'Start a sentence and then bring it to an end'

<b>Step 2</b>

### Raw Strings (python)
In Python, prefixing a string with the letter 'r' makes it a raw string. Unlike a regular string, a raw string treats a backslash (\) as a literal character instead of the start of an escape sequence such as \t.

In [None]:
print('\tTab')

In [None]:
#raw string
print(r'\tTab')
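
Raw strings matter for regular expressions because a sequence such as \b means one thing to Python and another to the regex engine. The following is a minimal sketch of that difference; the sample string and variable names are made up for illustration.

```python
import re

# In a normal string, '\b' is a single backspace character.
# In a raw string, r'\b' is two characters: a backslash and the letter b.
normal = '\bHa'
raw = r'\bHa'

print(len(normal))  # 3
print(len(raw))     # 4

sample = 'Ha HaHa'  # made-up sample text

# The raw pattern reaches the regex engine as \b (a word boundary).
print(re.findall(raw, sample))     # ['Ha', 'Ha']

# The normal string asks for a literal backspace character,
# which does not occur in the sample.
print(re.findall(normal, sample))  # []
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.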

### Let's start writing regular expressions

We will start with some patterns that match literal characters

<b>Step 3</b>

In [None]:
pattern = re.compile(r'abc')

#we are searching our 'text_to_search' variable for literally the string 'abc'
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

The .finditer method returns an iterator that yields one match object for every non-overlapping match. Each match object carries information about the match.

<b>span:</b> The beginning and end index of the match. The end index is exclusive, so we can use the span directly in string slicing to access our match.

In [None]:
#the same result using string slicing
print(text_to_search[1:4])
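
Rather than typing the indices by hand, the span can be read from each match object. A small sketch, assuming a made-up sample string; `results` is only a helper list for this illustration.

```python
import re

text = 'xyz abc 123 abc'  # made-up sample text
pattern = re.compile(r'abc')

results = []
for match in pattern.finditer(text):
    # span() returns (start, end); the end index is exclusive
    start, end = match.span()
    # group() returns the matched text, the same as slicing with the span
    results.append((start, end, match.group(), text[start:end]))

print(results)
# [(4, 7, 'abc', 'abc'), (12, 15, 'abc', 'abc')]
```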

Notice that our pattern found exactly one match of 'abc'. It did not match 'ABC', 'bca', or any other mix of those characters. This regular expression looks for the literal, lowercase string 'abc'.

## Meta Characters
Meta characters are characters that have a special meaning in a regular expression: unless they are escaped, they are not matched literally but instead match other characters or patterns.

These include ., \\, ?, $, etc.

Let's start with a period (.).

<b>Step 4</b>

In [None]:
# What will happen in this regular expression?
pattern = re.compile(r'.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

## Escape Character
To match a literal period, we escape it with the escape character (\\, backslash).

<b>Step 5</b>

In [None]:
pattern = re.compile(r'\.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

A practical example of this escape character is a URL.

In [None]:
pattern = re.compile(r'coreyms\.com')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
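
When the literal text comes from a variable, escaping by hand is easy to forget. The standard library offers `re.escape` for this; it is not used elsewhere in this workshop, so treat the following as an optional sketch with a made-up sample string.

```python
import re

url = 'coreyms.com'

unescaped = re.compile(url)           # the '.' still means "any character"
escaped = re.compile(re.escape(url))  # re.escape adds the backslash for us

sample = 'coreyms.com coreyms-com coreymsXcom'  # made-up sample text

print(re.escape(url))             # coreyms\.com
print(unescaped.findall(sample))  # ['coreyms.com', 'coreyms-com', 'coreymsXcom']
print(escaped.findall(sample))    # ['coreyms.com']
```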

Literal searches aren't too exciting because we are already used to doing them. Now we can start using <b>Meta Characters</b> the way they are intended to be used.


<b>Step 6</b>

In [None]:
# \d matches any digit 0-9
pattern = re.compile(r'\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
# \w matches any word character (a-z, A-Z, 0-9, _)
pattern = re.compile(r'\w')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
# \s matches white space (space, tab, newline)
pattern = re.compile(r'\s')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
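
To compare the three character classes side by side, here is a compact sketch on a short made-up string. It uses `re.findall`, which returns the matched text as a list instead of match objects.

```python
import re

# made-up sample containing letters, an underscore, a digit, a space, and a tab
sample = 'Room_7 is\topen'

digits = re.findall(r'\d', sample)
word_chars = re.findall(r'\w', sample)
whitespace = re.findall(r'\s', sample)

print(digits)               # ['7']
print(''.join(word_chars))  # Room_7isopen
print(whitespace)           # [' ', '\t']
```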

## Anchors
Anchors do not match characters specifically, but match invisible positions before or after characters. These are typically used in conjunction with other patterns. 

Anchor characters include \b, \B, ^, $

<b>Step 7</b>

In [None]:
# \b matches a word boundary: a position with a word character (\w) on one side and a non-word character or the edge of the text on the other
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
# \B matches NOT a word boundary. This is the opposite of the above example
pattern = re.compile(r'\BHa')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
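
The two anchors split the three occurrences of 'Ha' between them. A minimal sketch on the same 'Ha HaHa' text, collecting only the spans:

```python
import re

sample = 'Ha HaHa'

# 'Ha' that starts at a word boundary (start of text, or right after the space)
boundary = [m.span() for m in re.finditer(r'\bHa', sample)]

# 'Ha' that does NOT start at a word boundary (it directly follows another letter)
not_boundary = [m.span() for m in re.finditer(r'\BHa', sample)]

print(boundary)      # [(0, 2), (3, 5)]
print(not_boundary)  # [(5, 7)]
```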

<b>Step 8</b>

In [None]:
sentence = 'Start a sentence and then bring it to an end'

# ^ matches something at the beginning of a string.
pattern = re.compile(r'^Start')

matches = pattern.finditer(sentence)

for match in matches:
    print(match)

In [None]:
# $ matches something at the end of a string.
pattern = re.compile(r'end$')

matches = pattern.finditer(sentence)

for match in matches:
    print(match)
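
By default, ^ and $ refer to the start and end of the whole string, not of each line. If line-by-line anchoring is wanted, the `re.MULTILINE` flag changes that. This flag is not covered in the rest of the workshop, so the following is only a hedged sketch on a made-up three-line string.

```python
import re

lines = 'Start here\nStart again\nthe end'  # made-up three-line sample

start_default = re.findall(r'^Start', lines)
start_multiline = re.findall(r'^Start', lines, re.MULTILINE)
again_default = re.findall(r'again$', lines)
again_multiline = re.findall(r'again$', lines, re.MULTILINE)

print(start_default)    # ['Start']           only the start of the string
print(start_multiline)  # ['Start', 'Start']  the start of every line
print(again_default)    # []                  'again' is not at the end of the string
print(again_multiline)  # ['again']           but it is at the end of a line
```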

## Practical Examples

Up to this point, our examples are not "real". Let's do some more practical exercises. 

### Matching Phone Numbers

<b>Step 9</b>

In [None]:
# look for any three digits in a row
pattern = re.compile(r'\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
#matches any 3 digits, followed by any character, another 3 digits, any character, followed by 4 digits
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Now we will use our data.txt file, which contains a collection of fake names, addresses, phone numbers, and email addresses. Let's read that file into Python and parse out some information from it.

<b>Step 10</b>

In [None]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

#open this file as a file object
#if you are unfamiliar with this, don't think too hard. Just remember, we are opening the file. 
with open('data.txt', 'r') as f:
    contents = f.read()

matches = pattern.finditer(contents)

for match in matches:
    print(match)

### Character Sets

Instead of matching any phone number, let's be more specific. We will now match only phone numbers that use dashes or dots as separators.

<b>Character Sets</b> use [ ] characters to define specific characters. Even though a character set can have multiple characters in it, it still matches only one character in the text.

<b>Step 11</b>

In [None]:
# look for any phone number using just a - or . to separate numbers
pattern = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
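
The "one character only" rule is easy to check on tiny inputs. A small sketch with made-up strings; note that the period needs no backslash inside a character set.

```python
import re

# one digit, exactly one separator from the set, one digit
separator = re.compile(r'\d[-.]\d')

dash = bool(separator.search('5-4'))         # '-' is in the set
dot = bool(separator.search('5.4'))          # '.' is literal inside a set
star = bool(separator.search('5*4'))         # '*' is not in the set
double_dash = bool(separator.search('5--4')) # the set matches ONE character, not two

print(dash, dot, star, double_dash)  # True True False False
```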

Another example, what if we only want to match '800' or '900' phone numbers?

In [None]:
# each character set only matches ONE character
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

And now I will do this in our data.txt file (which is much bigger)

<b>Step 12</b>

In [None]:
pattern = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')

with open('data.txt', 'r') as f:
    contents = f.read()

matches = pattern.finditer(contents)

for match in matches:
    print(match)

### Dash (-) in a Character Set

The dash is itself a special character inside a character set. At the beginning or end of the set, it matches a literal dash (-). Placed between two characters, it specifies a range of values.

<b>Step 13</b>

In [None]:
#look for digits 1-5
pattern = re.compile(r'[1-5]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
#match lower case letters a-e
pattern = re.compile(r'[a-e]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

In [None]:
#match lower case OR upper case letters a - e. Just put them back to back.
pattern = re.compile(r'[a-eA-E]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
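
The position of the dash decides whether it is a range or a literal. A minimal sketch on a made-up string:

```python
import re

sample = 'a-c b 1-5 3'  # made-up sample text

as_range = re.findall(r'[a-c]', sample)     # dash between characters: range a to c
dash_first = re.findall(r'[-ac]', sample)   # dash first: literal '-', 'a', or 'c'
dash_last = re.findall(r'[ac-]', sample)    # dash last: same meaning as dash first

print(as_range)    # ['a', 'c', 'b']
print(dash_first)  # ['a', '-', 'c', '-']
print(dash_last)   # ['a', '-', 'c', '-']
```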

### Caret (^) in character set

The caret character, when it is the first character in a character set, negates the set: it matches any character that is NOT listed.

<b>Step 14</b>

In [None]:
#find any character that is NOT a-z or A-Z
pattern = re.compile(r'[^a-zA-Z]')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
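
The ^ only negates when it comes first inside the brackets; anywhere else in the set it is an ordinary character. A short sketch on a made-up string:

```python
import re

sample = 'a^b1'  # made-up sample text

negated = re.findall(r'[^a-z]', sample)        # ^ first: anything that is NOT a-z
literal_caret = re.findall(r'[a-z^]', sample)  # ^ last: a-z OR a literal '^'

print(negated)        # ['^', '1']
print(literal_caret)  # ['a', '^', 'b']
```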

## Quantifiers

Typing the same token repeatedly (e.g. \d\d\d) is error-prone. We can use <b>quantifiers</b> to match a specific number of characters or a range of repetitions.

<b>Step 15</b>

In [None]:
#finding a phone number just like above using quantifiers. Specify the number of digits you are looking for.
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
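
Curly braces accept either an exact count, {3}, or a range, {2,3}. The sketch below uses a made-up string and adds \b on both sides so that a longer run of digits is not cut into pieces.

```python
import re

sample = '1 22 333 4444'  # made-up sample text

exactly_three = re.findall(r'\b\d{3}\b', sample)   # runs of exactly 3 digits
two_to_three = re.findall(r'\b\d{2,3}\b', sample)  # runs of 2 or 3 digits
no_boundary = re.findall(r'\d{3}', sample)         # without \b, part of '4444' matches too

print(exactly_three)  # ['333']
print(two_to_three)   # ['22', '333']
print(no_boundary)    # ['333', '444']
```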

### Matching Names

Sometimes we don't know the exact number of characters we are looking for. Let's use names from our text_to_search as an example

<b>Step 16</b>

In [None]:
#we will start with the variations of Mr
pattern = re.compile(r'Mr\.')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Some of the 'misters' have a period after Mr and some don't, so we need an optional character.

In [None]:
#using the optional character '?'
pattern = re.compile(r'Mr\.?')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<b>Step 17</b>

Now we will match the whole name for all the Misters

In [None]:
#\s is whitespace.  \w is word character.  * is the quantifier for 0 or more of the preceding token
pattern = re.compile(r'Mr\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
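
The choice of * (zero or more) matters for a one-letter name like 'Mr. T'. For comparison, the sketch below swaps in +, the quantifier for one or more; the sample string is made up.

```python
import re

sample = 'Mr. T, Mr Smith, Mr. S'  # made-up sample text

# * allows zero word characters after the capital letter
zero_or_more = re.findall(r'Mr\.?\s[A-Z]\w*', sample)

# + requires at least one word character after the capital letter
one_or_more = re.findall(r'Mr\.?\s[A-Z]\w+', sample)

print(zero_or_more)  # ['Mr. T', 'Mr Smith', 'Mr. S']
print(one_or_more)   # ['Mr Smith']
```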

Now let's include every Mr, Ms, and Mrs.

## Groups

Groups let us treat several tokens as one unit and, combined with | (or), match one of several alternative patterns. Groups are created using ( ).

<b>Step 18</b>

In [None]:
# | (vertical bar or pipe) is an 'or' character
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Now put it all together to get all the names

In [None]:
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
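
A group also captures the part of the text it matched, which can be read back from the match object. A minimal sketch, assuming a made-up three-line sample; `titles` is only a helper list for this illustration.

```python
import re

pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
sample = 'Mr. Schafer\nMs Davis\nMrs. Robinson'  # made-up sample text

titles = []
for match in pattern.finditer(sample):
    # group(0) is the whole match, group(1) is what the first ( ) captured
    titles.append((match.group(0), match.group(1)))

print(titles)
# [('Mr. Schafer', 'Mr'), ('Ms Davis', 'Ms'), ('Mrs. Robinson', 'Mrs')]
```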

### Matching Emails


<b>Step 19</b>

In [None]:
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

#we will start by matching just the first address
#before the @ it is one or more uppercase or lowercase letters (+ means 1 or more)
#after the @ it is one or more letters followed by .com
pattern = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

In [None]:
#now we can get the other emails
#the second email includes a '.' and ends with '.edu'
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)

In [None]:
#now for the final address which has numbers and '-' and ends with .net

pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = pattern.finditer(emails)

for match in matches:
    print(match)
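
To see how each refinement widens the pattern, the three patterns can be checked against the three addresses in one pass. This is a sketch, not a general email validator; it uses `re.fullmatch`, which requires the whole string to match, and `coverage` is a helper name chosen for this illustration.

```python
import re

emails = [
    'CoreyMSchafer@gmail.com',
    'corey.schafer@university.edu',
    'corey-321-schafer@my-work.net',
]

patterns = [
    r'[a-zA-Z]+@[a-zA-Z]+\.com',
    r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)',
    r'[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)',
]

# For each pattern, list the addresses it matches in full.
coverage = [
    [email for email in emails if re.fullmatch(p, email)]
    for p in patterns
]

for p, matched in zip(patterns, coverage):
    print(len(matched), p)
# 1 [a-zA-Z]+@[a-zA-Z]+\.com
# 2 [a-zA-Z.]+@[a-zA-Z]+\.(com|edu)
# 3 [a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)
```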

## Reading Regular Expressions

I find that it is easier to write my own regular expressions than to read other people's. There are techniques to decipher them, but to be honest, it is far easier to ask an AI tool like [**ChatGPT**](https://chat.openai.com/) to explain them. Of ChatGPT's many uses, I find it particularly useful for picking apart regular expressions. In my experience, it is less useful for writing them. To conclude the workshop, I will use ChatGPT to explain the regular expression below.
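
One manual technique for deciphering a pattern is to rewrite it with the `re.VERBOSE` flag, which ignores whitespace inside the pattern and allows comments. The sketch below applies it to the dash-or-dot phone number pattern from earlier, not to the pattern in the final step; the sample string is made up.

```python
import re

# The same phone pattern, spread over several lines with comments.
phone = re.compile(r'''
    \d{3}    # three digits
    [-.]     # one separator: a dash or a dot
    \d{3}    # three digits
    [-.]     # one separator: a dash or a dot
    \d{4}    # four digits
''', re.VERBOSE)

# The compact form, for comparison.
compact = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')

sample = '321-555-4321 123.555.1234 123*555*1234'  # made-up sample text

print(phone.findall(sample))    # ['321-555-4321', '123.555.1234']
print(compact.findall(sample))  # ['321-555-4321', '123.555.1234']
```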

<b>Step 20</b>

In [45]:

# what do you think this regular expression matches from the data.txt file?
pattern = re.compile(r'\d{3} \w+ \w{2}., \w+ \w{2} \d+')

with open('data.txt', 'r') as f:
    contents = f.read()

matches = pattern.finditer(contents)

for match in matches:
    print(match)


<re.Match object; span=(25, 59), match='173 Main St., Springfield RI 55924'>
<re.Match object; span=(115, 146), match='969 High St., Atlantis VA 34075'>
<re.Match object; span=(204, 234), match='806 1st St., Faketown AK 86847'>
<re.Match object; span=(294, 324), match='826 Elm St., Epicburg NE 10671'>
<re.Match object; span=(391, 424), match='212 Cedar St., Sunnydale CT 74983'>
<re.Match object; span=(480, 516), match='519 Washington St., Olympus TN 32425'>
<re.Match object; span=(570, 600), match='625 Oak St., Dawnstar IL 61914'>
<re.Match object; span=(660, 694), match='890 Main St., Pythonville LA 29947'>
<re.Match object; span=(842, 870), match='249 Elm St., Quahog OR 90938'>
<re.Match object; span=(928, 961), match='619 Park St., Winterfell VA 99000'>
<re.Match object; span=(1016, 1048), match='220 Cedar St., Lakeview NY 87282'>
<re.Match object; span=(1104, 1136), match='391 High St., Smalltown WY 28362'>
<re.Match object; span=(1282, 1313), match='433 Elm St., Westworld TX 61967