# Welcome to the Dark Art of Coding:
## Introduction to Python
regex - Regular Expressions

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

In this session, students should expect to:

* Understand the types of problems regular expressions are meant to solve
* ...and not solve
* Understand some of the most common regular expressions

# What are *Regular Expressions*?
A regular expression is a special string used for pattern recognition and matching.

**PLEASE NOTE:** 

Regular expressions have their own syntax: they form a small language that is separate from Python. 

Some of the syntax used may look similar to Python code but BE CAREFUL. 

Similarities match the **FORMAT**, but NOT the **FUNCTIONALITY** of Python syntax. 

`Regex` is different from Python and is used by multitudes of languages and programs.

*"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."*

\- [Jamie Zawinski](http://regex.info/blog/2006-09-15/247)

Imagine a typical phone number...

**808-123-9876**

I could make a function to look at a piece of text and tell me if it matches a phone number pattern.

```python
def isPhoneNumber(text):
    if len(text) != 12:
        return False
```

```python
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
```

```python
    if text[3] != '-':
        return False
```

```python
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
```

```python
    if text[7] != '-':
        return False
```

```python
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
```

```python
    return True
```

In [None]:
# If we pull this all together, the function looks 
#     something like this...
#     fairly unwieldy

def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

In [None]:
# and when we test it, we find that it is...
#     also not robust... it fails at 
#     detecting alternate versions
#     without significant modifications

print('Checking against: 465-814-0978')
print(isPhoneNumber('465-814-0978'))

print('-' * 40)

print('Checking against: (808)814-0978')
print(isPhoneNumber('(808)814-0978'))

print('-' * 40)

print('Checking against: not_a_number')
print(isPhoneNumber('not_a_number'))


In [None]:
# In addition, if we want to find variants within a larger string
#     we have to jump through some hoops to parse the
#     string manually in sequential fashion

message = 'text me at 123-456-7890. call me at 098-765-4321 OR (808)814-0978'
for i in range(len(message)):
    # This for loop takes slices of the longer string
    #     that start at index 'i' and end 12 characters
    #     later and compares the pattern to those chunks
    #     one at a time.
    
    chunk = message[i:i+12]
    
    if isPhoneNumber(chunk):
        print('Found number: ' + chunk)
    else:
        print('No number found: ' + chunk)

But wait, what about all these other formats?

```
456.789.0123

(443)554-6655

(443) 554-6655

098 629 7452

1-432-629-7451
```

In [None]:
# Let's start by importing the regular expression module: 're'

import re

# Next, let's compile a pattern that we can use to find
#     very generic phone numbers

phonePattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
             # We use a raw string because typing: r'\d\d\d'
             # is easier than typing: '\\d\\d\\d'

In [None]:
# Let's test phonePattern to see what we got:

print(type(phonePattern), phonePattern)

In [None]:
# Using that pattern, we can compare that pattern
#     against a string to see if the pattern appears 
#     anywhere in the string.

matchObj = phonePattern.search('My number is 786-234-6273')

# If a match is found, Python creates a Match Object
#     which we will label matchObj.
#     The Match Object stores attributes about the match
#     that re found.
#     The easiest way to see what was found is to 
#     call the .group() method.

In [None]:
print(type(matchObj), matchObj)

In [None]:
print('found numbers:', matchObj.group())

# What is a Match Object?

**Match Objects** maintain a record of any matches that were made, as well as some of the details associated with the match. We will explore these in more depth as we go.

Match Objects are only returned if a match is found.

If no match is found, then the value `None` is returned. 

For simple matches, a Match Object has a `.group()` method associated with the object.

In [None]:
# Calling the .group() method will return the entire match

matchObj.group()
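
Because `.search()` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. Here is a minimal sketch of guarding against that. The helper name `find_phone` and the second test string are made up for this illustration.

```python
import re

phonePattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone-like match in text, or None if there is none."""
    matchObj = phonePattern.search(text)
    if matchObj is None:      # no match: there is no Match Object to query
        return None
    return matchObj.group()

found = find_phone('My number is 786-234-6273')
missing = find_phone('I have no phone')

print(found)
print(missing)
```

Checking for `None` first is what lets the same code handle both the "match" and "no match" cases.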

# Capture Groups
---

In [None]:
# Sometimes it is not enough to find a match
#     we may want to break up or segregate matches
#     into submatches
#     This is done using parentheses to create 'capture groups'

#                            |<cg 1>| |<capture grp 2>|

phonePattern2 = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

matchObj = phonePattern2.search('My number is 786-234-6273')

In [None]:
# We can access both the entire match and/or any of the
#     submatches / capture groups
# To see the entire match, we use .group()

print(matchObj.group())

In [None]:
# By default, the .group() method stores the match as element 
#     zero, so .group(0) returns the same response.

print(matchObj.group(0))

In [None]:
# .group(1), on the other hand, returns the first capture group
#     in this case, the area code.

print(matchObj.group(1))

In [None]:
# .group(2), returns the second capture group
#     in this case, the subscriber number.

print(matchObj.group(2))

The matchObj also retains the index positions (start and stop)
    of the match in the original string:

```
'My number is 786-234-6273'
 |    |       |          |
 0    5      13         24
```

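These index positions are available via the `.span()` method, which returns a `(start, stop)` tuple.

Remember, Python slices go up to but **do not include** the last index, so the stop value here is 25: one past the final matched character at index 24.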

In [None]:
matchObj.span()

# Blame it all on Dijkstra.
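
As a quick check of the start/stop behavior described above, the sketch below slices the original string with the values from `.span()`. The `.start()` and `.end()` methods return the same two numbers one at a time. The label `text` is only used for this illustration.

```python
import re

text = 'My number is 786-234-6273'
matchObj = re.search(r'(\d\d\d)-(\d\d\d-\d\d\d\d)', text)

# .span() gives (start, stop); stop is one past the last matched character
start, stop = matchObj.span()
print(start, stop)

# Slicing with those values reproduces the match exactly
sliced = text[start:stop]
print(sliced)
print(sliced == matchObj.group())

# .start() and .end() return the same positions individually
print(matchObj.start(), matchObj.end())
```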

In [None]:
# If you want to get back all the capture groups
#     you can call the .groups() method 

print(matchObj.groups())

In [None]:
# You can use tuple unpacking to assign
#     labels to each capture group, provided you create a
#     label for each capture group.


areaCode, subscriberNumber = matchObj.groups()

print(areaCode)
print(subscriberNumber)

# Experience Points!
---

**Part 0**

In **Jupyter** do each of the following:

Task | Sample Object(s)
:---|:---
`import` the regex module `re`|
`re.compile()` a regex pattern that matches:|
.|`12-34-56` OR any similar
.|`nn-nn-nn` pattern
`pattern.search()` the following string against your pattern|`'Can you find the number 99-66-33?'`
Label the result as `matchObj`|
`print()` the content of `.group()`|

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

**Part 1**

In **Jupyter** do each of the following:

Task | Sample Object(s)
:---|:---
`re.compile()` a regex pattern that matches:|
.|`123-34-5678` OR any similar
.|social security number
.|AND
.|captures the first three digits
.|AND
.|the last three digits
`pattern.search()` the following string against your pattern|`'which number 23-222-3333 OR 234-33-4455 is an SSN?'`
Label the result as `matchObj`|
`print()` the content of `.group()`|
`print()` the content of `.group(0)`|
`print()` the content of `.group(1)`|
`print()` the content of `.groups()`|
`print()` the content of `.span()`|

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
# What if I want to match literal parentheses?
#     escape them using the '\' character in a raw string
#     the re module will then treat them as literal 
#     parens, NOT capture group parens

phoneRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

# There is nothing magical about the label matchObj:
#     it is often used, but you can use any term you want
#     for the Match Object

phoneNum = phoneRegex.search('(808) 872-8204')
print(phoneNum.groups())
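
When the literal text comes from somewhere else (user input, a file), escaping every special character by hand gets error-prone. One option, sketched here, is `re.escape()`, which returns a copy of a string with regex special characters backslash-escaped. The sample strings and the labels `literal`, `escaped` and `areaRegex` are made up for this illustration.

```python
import re

literal = '(808)'

# re.escape() backslash-escapes the parentheses for us
escaped = re.escape(literal)
print(escaped)

# The escaped text can be combined with ordinary regex syntax
areaRegex = re.compile(escaped + r' \d\d\d-\d\d\d\d')
found = areaRegex.search('call (808) 872-8204 today').group()
print(found)
```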

In [None]:
# It is possible to match multiple patterns:
#     Here, the pipe (|) character means OR.

multiRegex = re.compile(r'hat|cat')

# Note... search looks for only the FIRST match in a string...

mObj1 = multiRegex.search('cat in the hat')
print(mObj1.group())

In [None]:
# Let's do the same search on a different string to match
#     hat instead. 

mObj2 = multiRegex.search('hat on a cat')
print(mObj2.group())

In [None]:
# Parentheses are often used to group patterns you would like to 
#     OR together where they are just a part of a longer pattern
#     here, we are looking for steel sword OR steel armor OR steel shield
#     As before, those parens produce a capture group

endRegex = re.compile(r'steel (shield|sword|armor)')
mo = endRegex.search('grab the steel armor from the altar')
print(mo.group())

In [None]:
# Let's look at the capture group produced by the parens

print(mo.group(1))

# `.findall()`
---

In [None]:
# For those times when you want to find all instances of 
#     a pattern, the findall() method is used
#     NOTE: findall returns a list NOT a matchObj
#     so your options are potentially more limited

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneRegex.findall('Home: 726-282-0186, Cell: 873-193-8264')

In [None]:
# In this case, capture groups still have some functionality...
#    They break the match into submatches that are parsed out
#    into tuples within the list
#    One tuple per match

phoneRegex = re.compile(r'((\d\d\d)-(\d\d\d)-(\d\d\d\d))')
phoneRegex.findall('Home: 726-282-0186, Cell: 873-193-8264')
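
What `.findall()` returns depends on how many capture groups the pattern has. If you need Match Object features such as `.span()` for every match, `re.finditer()` is an alternative that yields one Match Object per match. The sketch below reuses the sample string from above; the labels `whole`, `areaCodes` and `spans` are only for illustration.

```python
import re

text = 'Home: 726-282-0186, Cell: 873-193-8264'

# No capture groups: a list of whole matches
whole = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', text)
print(whole)

# Exactly one capture group: a list containing only that group
areaCodes = re.findall(r'(\d\d\d)-\d\d\d-\d\d\d\d', text)
print(areaCodes)

# finditer() yields a Match Object for each match,
#     so .group() and .span() are available again
spans = []
for m in re.finditer(r'(\d\d\d)-(\d\d\d-\d\d\d\d)', text):
    print(m.group(), m.group(1), m.span())
    spans.append(m.span())
```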

# Repetition and method chaining
---

In [None]:
# If you need to repeat a character OR string of characters
#     OR a character class, you can tell the re module
#     how many times to repeat using a number in {}

haRegex = re.compile(r'(ha){3}')

In [None]:
# Also, if you simply need to get the string back, you can
#     chain methods:

result = haRegex.search('hehahahahahehe').group()
print(result)

# Instead of:

result = haRegex.search('hehahahahahehe')
print(result.group(), result.span())

In [None]:
# If we want to find any sequence between 3 and 5 units long:
#     we can use a range-style notation
# WARNING: regex rules ARE NOT Python rules
#          within this regex pattern, this will look for
#          3 up to AND including 5 repetitions

haRegex = re.compile(r'(ha){3,5}')

In [None]:
# By default, regex is greedy, meaning if it can find
# 3, 4 OR 5 sequence repetitions, it will default to matching
# the most it can...

haRegex.search('hahahahaha').group()

In [None]:
# To alter this behavior, you can tell the regex module
#     to be lazy, using the ?
#     which will default to the shortest
#     string that matches the pattern

haRegex = re.compile(r'(ha){3,5}?')
haRegex.search('hahahahaha').group()

# Special sequences and symbols
---

|Special Sequences |Represents                                      |
|:--------------:|:-----------------------------------------------|
|\d              |numeric digits 0-9                              |
|\D              |everything BUT digits 0-9                       |
|\w              |any letter, numeric or underscore character     |
|\W              |everything BUT letters, numerics, or underscores|
|\s              |spaces, tabs, and newline characters            |
|\S              |everything BUT spaces, tabs, and newlines       |

|Regex symbols    |Their function                                         |
|:---------------:|:------------------------------------------------------|
|?                |matches zero or one (also drives lazy matching, see below)|
|\*               |matches zero or more                                   |
|+                |matches one or more                                    |
|{n}              |matches exactly n                                      |
|{n,}             |matches n or more                                      |
|{,m}             |matches 0 to m                                         |
|{n,m}            |matches at least n and at most m                       |
|{n,m}?, \*?, +?  |performs a non-greedy (lazy) match                     |
|^spam            |the string must begin with spam                        |
|spam$            |the string must end with spam                          |
|.                |matches any character except newlines                  |
|[abc]            |matches any character between the brackets             |
|[^abc]           |matches any character but the ones between the brackets|

In [None]:
mo = re.search(r'\d', 'A1B2C3')
mo.group()

In [None]:
re.findall(r'\d', 'A1B2C3')
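
A few of the symbols from the table, tried out on small made-up strings. This is only a sketch to show the symbols in action; the sample strings and labels are not part of the lesson data.

```python
import re

# ? : the 'u' is optional (zero or one)
optional = re.findall(r'colou?r', 'color and colour')
print(optional)

# + : one or more digits in a row
runs = re.findall(r'\d+', 'A1B22C333')
print(runs)

# ^ and $ : the string must consist of digits from start to end
allDigits = bool(re.search(r'^\d+$', '12345'))
notAllDigits = bool(re.search(r'^\d+$', '123a5'))
print(allDigits, notAllDigits)

# [^abc] : any single character EXCEPT a, b or c
others = re.findall(r'[^abc]', 'aXbYcZ')
print(others)
```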

# Character classes
---

`\d` and `\w` are shortcuts for character classes.

Character classes are patterns composed of specific characters or ranges of characters.


Pattern | What it matches
:---|:---
`[0-9]` |matches any **single** numeric character
`[0-9][0-9]` |matches any **two** numeric characters
`[0-9][0-9][a-z]` |matches any **two** numeric characters followed by one lowercase alpha character
`[0-9]{3}` |matches any **three** numeric characters
`[0-9]{2,8}` |matches any **two** to **eight** numeric characters
`[a-zA-Z]` |matches any lowercase OR uppercase alpha character
`[a-zA-Z_.#]` |matches any lowercase OR uppercase alpha character OR underscore OR literal period or hashtag
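
Some of the character classes from the table can be tried against a short string, as in the sketch below. The string `sample` is made up for this illustration. The last pattern adds a `+` so that runs of matching characters come back as one item.

```python
import re

sample = 'Room 42b, code 007, tag #x_1'

# Two digits followed by one lowercase letter
twoDigitsLetter = re.findall(r'[0-9][0-9][a-z]', sample)
print(twoDigitsLetter)

# Exactly three digits in a row
threeDigits = re.findall(r'[0-9]{3}', sample)
print(threeDigits)

# One or more letters, underscores, periods or hash signs
wordish = re.findall(r'[a-zA-Z_.#]+', sample)
print(wordish)

# For plain ASCII digits like these, [0-9] and \d find the same characters
sameDigits = re.findall(r'[0-9]', 'A1B2C3') == re.findall(r'\d', 'A1B2C3')
print(sameDigits)
```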



# Experience Points!
---

In **Jupyter** do each of the following:

Task | Sample Object(s)
:---|:---
`import` the regex module `re`|
`re.compile()` a Character Class regex pattern that matches:|
.|uppercase letters AND these symbols: @, &, \*
.|repeats the Character Class three to five times
`.findall()` the following string against your pattern|`'Can you find &ME* or @YOU but not )WE'`
Label the result as `matchObj`|


When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
import re
patt = re.compile(r'[A-Z*&@]{3,5}')

matchObj = patt.findall('Can you find &ME* or @YOU but not )WE')

print(matchObj)

# Flags
---

In [None]:
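# Flags (each also has a one-letter alias)

re.IGNORECASE   # re.I: match letters regardless of case
re.DOTALL       # re.S: let dot (.) match newlines too
re.VERBOSE      # re.X: allow whitespace and comments inside the pattern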

In [None]:
helloRegex = re.compile(r'hello', re.IGNORECASE)
print(helloRegex.findall('I said "HELLO!" to the man after he said hello to me'))

In [None]:
# re.I is the equivalent of re.IGNORECASE

helloRegex = re.compile(r'hello', re.I)

In [None]:
# dot (.) will match all characters except a newline
# * will repeat the previous pattern zero or more times

dotallRegex = re.compile(r'.*')
print(dotallRegex.search('Batman is love\nBatman is life').group())

In [None]:
# re.DOTALL overrides the except newline clause and 
#     allows dot (.)  to match even newlines

dotallRegex = re.compile(r'.*', re.DOTALL)
print(dotallRegex.search('Batman is love\nBatman is life').group())

# Verbose mode
---

In [None]:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext\.?|x)\s*\d{2,5})?)')

In [None]:
# The same pattern in verbose mode: whitespace inside the pattern is
#     ignored and '#' starts a comment

phoneRegex = re.compile(r'''(
                            (\d{3}|\(\d{3}\))?  # area code, with or without parens (optional)
                            (\s|-|\.)?          # space OR hyphen OR literal dot (optional)
                            \d{3}
                            (\s|-|\.)           # space OR hyphen OR literal dot 
                            \d{4}
                            (\s*(ext\.?|x)\s*\d{2,5})?     # variations on extensions
                            )''', re.VERBOSE)

In [None]:
phoneRegex.findall('(808) 234-1234 OR 808.234.1234 OR 808-234-1234 ext 42')
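
Because this pattern contains several capture groups, `.findall()` returns one tuple per match, and the outermost group (the full number) is the first element of each tuple. The sketch below pulls out just the full numbers. The pattern is re-declared so the block runs on its own, with the extension alternatives written as `ext\.?|x`; the labels `results` and `numbers` are only for illustration.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, with or without parens
    (\s|-|\.)?                      # optional separator
    \d{3}
    (\s|-|\.)                       # separator
    \d{4}
    (\s*(ext\.?|x)\s*\d{2,5})?      # optional extension
    )''', re.VERBOSE)

text = '(808) 234-1234 OR 808.234.1234 OR 808-234-1234 ext 42'

# One tuple per match, one tuple element per capture group
results = phoneRegex.findall(text)

# Element zero of each tuple is the outermost group: the whole number
numbers = [groups[0] for groups in results]
print(numbers)
```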

In [None]:
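# It is possible to include multiple flags:
# Also... be aware that by default, the regex module
#     includes the re.UNICODE flag for str patterns
# NOTE: with re.VERBOSE, the unescaped space in this pattern
#     is ignored, so it matches 'examplestring'

re.compile(r'example string', re.VERBOSE | re.IGNORECASE)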

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_regex_01.py
```

Execute your script in **Jupyter** using the command:

```bash
run my_regex_01.py
```

I suggest that as you add each feature to your script that you run it right away to test it incrementally. 

1. Assign the label `test_one` to this string: `this_address@email.net`
1. Assign the label `test_two` to this string: `USERname@aol.org`
1. Assign the label `test_three` to this string: `my_thing@com`
1. Assign the label `test_four` to this string: `domain.org`
1. Assign the label `test_five` to this string: `words.42@website.ly`     
1. `.compile()` a regex pattern that will match typical emails:
    * text possibly containing letters, numbers, underscores and periods
    * followed by an 'at' sign (@)
    * followed by generic text 
    * followed by a period (.)
    * followed by two or more letters

1. Compare your pattern against each of the test phrases, one by one.
1. `print()` the match group()

1. Next, go back and edit your pattern to capture:
    * the username (username)
    * the domain (website.com)
    * the top level domain (com)

1. `print()` each member of the capture groups (`.group(1)`, `.group(2)`, etc)
1. Lastly, create a long test string with multiple emails of your choice and use `.findall()` to find all the example emails.

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>