# Regular Expressions

I think regular expressions are perfectly suited for people who like puzzles such as _crosswords_ and _Sudoku_. Naturally, people have created games out of them (more info on that later). They can be long, mind-boggling, and indecipherable at times. But we'll start at the very beginning to keep it simple.

There are multiple versions of Regex Engines out there for different programming languages and software (analogous to SQL flavors for databases). The differences usually lie in the more advanced features. The basic nomenclature should hold across most Regex Engines.

A regular expression is a sequence of characters that can be used as a search pattern for a string. Regular expressions are commonly used for __find__ and __find and replace__ string operations and for __validation of user input__ (we'll see examples of this later).

Regular expressions can be used to evaluate Unicode strings (`str`) or 8-bit strings (`bytes`, e.g. `b'00001110'`). We'll just worry about the former in this workshop. Python's `re` module follows the Perl-style syntax rather than the POSIX syntax.

Let's get started and import the regular expression module in Python.

In [3]:
import re

To set up a regular expression (regex), we use the `compile` method.

In [4]:
regex = re.compile('regular expression')
regex

re.compile(r'regular expression', re.UNICODE)

To look for a match, we use the `search` method.

In [5]:
re.search(regex, 'This string should match the above regular expression')

<_sre.SRE_Match object; span=(35, 53), match='regular expression'>

The `search` method returns a match object that contains the span of characters that matched along with the characters that matched. When there is no match, it returns `None`. A match object is always truthy (it evaluates to `True` in a boolean context), while `None` is falsy. We'll use that fact to check whether we have a match or not.
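As a minimal sketch of the match/no-match behavior, the snippet below searches one string that contains the pattern and one that does not. The second sample string and the variable names `hit` and `miss` are made up for illustration.

```python
import re

regex = re.compile('regular expression')

hit = regex.search('This string should match the above regular expression')
miss = regex.search('This string has no match')  # illustrative non-matching string

# A match object is truthy and exposes the span and the matched text.
print(bool(hit), hit.span(), hit.group())  # True (35, 53) regular expression

# When nothing matches, search returns None, which is falsy.
print(miss)  # None
```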

Here is a custom function we'll be using to return `True`/`False` about whether our regex matches a list of strings.

In [6]:
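def Go_Fish(strings, regex, flag=0):
    # Return a list of True/False values, one per string,
    # indicating whether the regex matches that string.
    # Pass a truthy flag to also print each match object.
    TF = []
    for string in strings:
        match = re.search(regex, string)  # search once and reuse the result
        if match and flag:
            print(match)
        TF.append(bool(match))
    return TF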

**Note**: None of the regex solutions will require knowledge of anything not covered before it.

## Literal String Matching

The simplest way to match is to provide the actual text that you want to literally (exactly) match.  
For example, if you want to find all strings that contain the characters "dog", your regex would just be `dog`.
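Here is a small sketch of literal matching with the `dog` regex. The sample strings are invented for illustration. Note that literal matching is case-sensitive by default.

```python
import re

regex = re.compile('dog')
samples = ['dog', 'hot dogs', 'Dog', 'cat']  # illustrative strings

# 'dog' may appear anywhere in the string, but the case must match.
results = [bool(regex.search(s)) for s in samples]
print(results)  # [True, True, False, False]
```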

In [None]:
strings = ['Cat', 'Cats', 'Cathy', 'cathy', 'Caller id']
# [True, True, True, False, False]

**Note**: The list of True and False elements below the `strings` variable is the answer I want you to match using a regex.

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

The same holds true for numbers represented as a string.

In [None]:
strings = ['734-764-7828','734 764 4615', '007', 'seven']
# [True, True, True, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

**Regular expressions can only work with strings**. That is, numbers (e.g. integers and floats) have to be converted to a string type. This holds for both the regular expression and the item you are searching or matching against. Here is a counterexample with two issues, each of which will produce an error.

In [None]:
strings = [1,2,3] # not allowed

In [None]:
regex = re.compile(3) # not allowed
Go_Fish(strings, regex)
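The sketch below shows both failures and one possible fix, assuming you simply want to treat the numbers as text. The variable names are illustrative, and the exact error messages may vary between Python versions, so they are printed rather than shown.

```python
import re

# Compiling a non-string pattern raises a TypeError.
try:
    re.compile(3)
except TypeError as err:
    compile_error = err
    print('compile failed:', err)

# Searching a non-string value raises a TypeError too.
try:
    re.search('3', 3)
except TypeError as err:
    search_error = err
    print('search failed:', err)

# Converting both sides to str first makes the search valid.
numbers = [1, 2, 3]
results = [bool(re.search(str(3), str(n))) for n in numbers]
print(results)  # [False, False, True]
```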

## Leading Characters `^` and Trailing Characters `$`

`^` indicates the start of the string. An alternative notation for the start of the string is `\A`.

For example, `^al` would match a string like `aluminum` but not `tally`.  
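A quick sketch of the `^al` example, run with both notations. The list and variable names are illustrative.

```python
import re

samples = ['aluminum', 'tally']

# Both patterns anchor 'al' to the start of the string.
caret = [bool(re.search(r'^al', s)) for s in samples]
slash_a = [bool(re.search(r'\Aal', s)) for s in samples]

print(caret)    # [True, False]
print(slash_a)  # [True, False]
```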

In [None]:
strings = ['Beaver', 'Beard', 'Beat It', 'Betty', 'Big Beaver Rd']
# [True, True, True, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

`$` indicates the end of the string. An alternative notation for the end of the string is `\Z`. In Python, `$` also matches just before a trailing newline, while `\Z` matches only at the very end.

For example, `al$` would match a string like `get real` but not `tally`.  
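A quick sketch of the `al$` example with both notations. The last two lines show one difference in Python that is easy to miss: `$` also matches just before a trailing newline, while `\Z` does not. The variable names are illustrative.

```python
import re

samples = ['get real', 'tally']

dollar = [bool(re.search(r'al$', s)) for s in samples]
slash_z = [bool(re.search(r'al\Z', s)) for s in samples]
print(dollar)   # [True, False]
print(slash_z)  # [True, False]

# With a trailing newline, the two anchors behave differently.
with_newline = 'get real\n'
print(bool(re.search(r'al$', with_newline)))   # True
print(bool(re.search(r'al\Z', with_newline)))  # False
```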

In [None]:
strings = ['kissing', 'loving', 'song', 'kings']
# [True, True, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

## Matching `[]` and Omitting `^` Specific Characters

You can allow any one of several characters at a given position in a regex. Just list the characters you are willing to match inside square brackets `[]`.

For example, `[abc]$` will match any word ending with a, b, or c.
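A small sketch of the `[abc]$` example. The sample words are invented for illustration.

```python
import re

regex = re.compile(r'[abc]$')
samples = ['pizza', 'club', 'music', 'word']  # illustrative words

# Only the final character has to be one of a, b, or c.
results = [bool(regex.search(s)) for s in samples]
print(results)  # [True, True, True, False]
```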

In [None]:
strings = ['fast', 'mast', 'last', 'cast', 'vast', 'flash']
# [True, True, True, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

You can also choose NOT to match certain characters in a regex. Just specify the characters you want to NOT match in square brackets with the caret symbol `[^]`. 

**Note:** As you might have noticed, `^` already has two different meanings, which can cause some confusion when reading regular expressions.

For example, `^[^XYZ]` will match any word not starting with X, Y, or Z.
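A small sketch of the `^[^XYZ]` example. The first `^` anchors the match to the start of the string and the second negates the character set. The sample words are invented for illustration.

```python
import re

regex = re.compile(r'^[^XYZ]')
samples = ['Xylophone', 'Yak', 'Zebra', 'Apple', 'xylophone']  # illustrative words

# Lowercase 'x' is allowed because the set only excludes uppercase X, Y, and Z.
results = [bool(regex.search(s)) for s in samples]
print(results)  # [False, False, False, True, True]
```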

In [None]:
strings = ['fast', 'mast', 'last', 'cast', 'vast', 'flash']
# [True, True, True, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

## Range of Characters

Sometimes, you just want to match a range of characters instead of specifying specific ones. 

For example, if you want to match all alphanumeric characters, your regex pattern would be `[A-Za-z0-9]`.  
Order doesn't matter. This is the same as `[a-z0-9A-Z]`.
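The sketch below checks a few single characters against both orderings of the alphanumeric range. The sample characters are chosen for illustration.

```python
import re

samples = ['a', 'Z', '7', '-', ' ']  # illustrative single characters

first = [bool(re.search(r'[A-Za-z0-9]', s)) for s in samples]
second = [bool(re.search(r'[a-z0-9A-Z]', s)) for s in samples]

print(first)            # [True, True, True, False, False]
print(first == second)  # True
```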

In [125]:
strings = ['World War Z','X-Ray Glasses','Alf','Foxtrot','Money','Yo-Yo']
# [True, True, True, True, False, False]

In [126]:
regex = re.compile('')
Go_Fish(strings, regex)

[True, True, True, True, False, False]

In [7]:
strings = ['123', '456', '789', '88', '420', '007']
# [True, True, True, False, False, False]

In [8]:
regex = re.compile('')
Go_Fish(strings, regex)

[True, True, True, False, False, False]

## Logical OR `|` Operator

The logical OR operator allows us to match one regex pattern or another. For example, to match "cat" or "dog", use the regex `cat|dog`. The OR operator is also commonly used with parentheses `()` for grouping, for clarity, and for another purpose that is explained later on.

For example, `I love my (cat|dog)` matches either `I love my cat` or `I love my dog`. Without the parentheses, the regex will try to match either the entire phrase `I love my cat` or just the word `dog`.
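To see the effect of the parentheses, this sketch runs the grouped and ungrouped patterns over the same strings. The third sample string is invented for illustration.

```python
import re

grouped = re.compile(r'I love my (cat|dog)')
ungrouped = re.compile(r'I love my cat|dog')

samples = ['I love my cat', 'I love my dog', 'hot dog']

# The grouped pattern needs the whole phrase.
grouped_results = [bool(grouped.search(s)) for s in samples]
# The ungrouped pattern accepts any string that contains 'dog'.
ungrouped_results = [bool(ungrouped.search(s)) for s in samples]

print(grouped_results)    # [True, True, False]
print(ungrouped_results)  # [True, True, True]
```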

In [18]:
strings = ['123', '456', '789', '88', '420', '007']
# [True, False, True, False, True, False]

In [19]:
regex = re.compile('')
Go_Fish(strings, regex,1)

<_sre.SRE_Match object; span=(1, 2), match='2'>
<_sre.SRE_Match object; span=(0, 2), match='78'>
<_sre.SRE_Match object; span=(1, 2), match='2'>


[True, False, True, False, True, False]

In [None]:
strings = ['keep calm and cook bacon', 
           'keep calm and cook hash browns',
           'keep calm and cook quinoa',
           'bacon and hash browns are yummy']
# [True, True, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

**Note**: There is no logical AND operator, as it does not make sense to match two different characters at the same position.

## Characters, Numbers, Word Characters and Boundaries, and Whitespace

### <font color=blue>any character `.`</font>

The dot matches any single character except a newline. It also cannot match an empty string, because it needs a character to consume. This metacharacter is pretty powerful: a pattern built mostly from dots will match almost any string, which usually produces false positives, so use it sparingly. If you want to literally match a period, use `\.`

**Tip:** The backslash character is used to escape special characters like `., ^, $, *, +, ?, {, }, [, ], \, |, (, )`.

For example, `c.t` will match `cat` or `cot` but not `ca`. 
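A short sketch contrasting the dot with an escaped period. The sample strings beyond `cat`, `cot`, and `ca` are invented for illustration.

```python
import re

samples = ['cat', 'cot', 'ca', 'c.t', 'ct']

# The dot stands for exactly one character of any kind (except a newline).
any_char = [bool(re.search(r'c.t', s)) for s in samples]
# The escaped dot matches only a literal period.
literal_dot = [bool(re.search(r'c\.t', s)) for s in samples]

print(any_char)     # [True, True, False, True, False]
print(literal_dot)  # [False, False, False, True, False]
```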

In [None]:
strings = ['Will.I.Am', 'The End.', 'etc...', 'LEGO', '???', '', '\n']
# [True, True, True, True, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

Now try to match a literal period or two or three.

In [None]:
strings = ['Will.I.Am', 'The End.', 'etc...', 'LEGO', '???', '', '\n']
# [True, False, True, False, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex)

### <font color=blue>numbers `\d \D`</font>

- `\d` matches any Unicode decimal digit. This includes [0-9] and also many other digit characters. 
- `\D` matches non-decimal digit characters. 

This url has a good list of other decimal digit characters (which I never knew about until now). http://www.fileformat.info/info/unicode/category/Nd/list.htm  

For example, `\d` will match `0` but not `?`, while `\D` will not match `0` but will match `?`.
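The sketch below tries `\d` and `\D` on an ASCII digit, a punctuation mark, and a non-ASCII digit (the Tamil digit six). The last line uses the `re.ASCII` flag, which this workshop has not introduced, to show how `\d` can be restricted to `[0-9]`.

```python
import re

samples = ['0', '?', '௬']  # ASCII digit, punctuation, Tamil digit six

digit = [bool(re.search(r'\d', s)) for s in samples]
non_digit = [bool(re.search(r'\D', s)) for s in samples]
# With re.ASCII, \d only matches the characters 0 through 9.
ascii_digit = [bool(re.search(r'\d', s, re.ASCII)) for s in samples]

print(digit)        # [True, False, True]
print(non_digit)    # [False, True, False]
print(ascii_digit)  # [True, False, False]
```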

In [85]:
strings = ['48109', '2001: A Space Odyssey', 'Beverly Hills 90210', 'BB-8', '734-764-7828']
# [True, True, True, False, False]

In [89]:
regex = re.compile('')
Go_Fish(strings, regex)

[True, True, True, False, False]

The first string has the digits 6 and 7 in Tamil and is equivalent to 67==67.

In [95]:
strings = ['௬௭==67','8eight','9?','2 - 2 = 0','2+2=4']
# [False, False, False, True, True]

In [96]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(3, 6), match=' 2 '>
<_sre.SRE_Match object; span=(1, 4), match='+2='>


[False, False, False, True, True]

### <font color=blue>whitespace `\s \S`</font>

- `\s` matches Unicode whitespace characters (which includes a space, horizontal tab, vertical tab, line break, carriage return, form feed `[ \t\v\n\r\f]` and other characters). 
- `\S` matches non-whitespace characters. 

This URL has a list of Unicode space separator characters, which are among the other characters that `\s` matches. http://www.fileformat.info/info/unicode/category/Zs/list.htm  

For example, `\s` will match `\t` while `\S` will match `t`.
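A small sketch of `\s` and `\S` on a tab, a letter, a space, and an empty string. The sample values are chosen for illustration.

```python
import re

samples = ['\t', 't', ' ', '']  # tab, letter t, space, empty string

whitespace = [bool(re.search(r'\s', s)) for s in samples]
non_whitespace = [bool(re.search(r'\S', s)) for s in samples]

# Neither class can match the empty string because each needs one character.
print(whitespace)      # [True, False, True, False]
print(non_whitespace)  # [False, True, False, False]
```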

In [None]:
strings = ['To Whom It May Concern','N I-75','915\tE\tWashington\tSt','13\n30','The\n\nEnd']
# [True, False, True, True, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

Try to match using the non-whitespace character.

In [None]:
strings = ['To Whom It May Concern','N I-75','915\tE\tWashington\tSt','13\n30','The\n\nEnd']
# [True, False, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### <font color=blue>word characters `\w \W`</font>

- `\w` matches Unicode word characters. This includes most characters that can be part of a word in any language, as well as numbers and the underscore. For ASCII text, this is equivalent to `[A-Za-z0-9_]`.
- `\W` matches non-word characters. 

For example, `\w` will match `a`, `A`, or `0` but not `!`, but `\W` will not match `a`, `A`, or `0` but will match `!`.
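This sketch runs `\w` and `\W` over a few single characters, including an accented letter. The last line uses the `re.ASCII` flag, which this workshop has not introduced, to show `\w` restricted to `[A-Za-z0-9_]`.

```python
import re

samples = ['a', 'A', '0', '_', 'ñ', '!']

word = [bool(re.search(r'\w', s)) for s in samples]
non_word = [bool(re.search(r'\W', s)) for s in samples]
# With re.ASCII, the accented letter no longer counts as a word character.
ascii_word = [bool(re.search(r'\w', s, re.ASCII)) for s in samples]

print(word)        # [True, True, True, True, True, False]
print(non_word)    # [False, False, False, False, False, True]
print(ascii_word)  # [True, True, True, True, False, False]
```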

In [None]:
strings = ['Häagen-Dazs','El Niño', 'Peanut Butter & Jelly', 'I-94', 'eh', '!!!']
# [True, True, True, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

Try to match using the non-word character.

In [None]:
strings = ['Häagen-Dazs','El Niño', 'Peanut Butter & Jelly', 'I-94', 'eh', '!!!']
# [False, False, True, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### <font color=blue>word boundaries `\b \B`</font>

- `\b` matches the boundary between a `\w` and `\W` character (or vice versa) or between `\w` and the beginning/end of a string.
- `\B` matches non-word boundaries. 

Like `^` and `$`, word boundaries do not consume a character. They match a position in the string rather than a character.

For example, `regex\b` will match `regex` but not `regexpressions`, while `regex\B` will not match `regex` and will match `regexpressions`.
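A small sketch of the `regex\b` and `regex\B` examples. The third sample string is invented for illustration.

```python
import re

samples = ['regex', 'regexpressions', 'use a regex here']

# \b requires a word boundary right after 'regex'.
boundary = [bool(re.search(r'regex\b', s)) for s in samples]
# \B requires that 'regex' is followed by another word character.
non_boundary = [bool(re.search(r'regex\B', s)) for s in samples]

print(boundary)      # [True, False, True]
print(non_boundary)  # [False, True, False]
```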

In [37]:
strings = ['Python is a snake',
           'Learn Python and make $$$', 
           'IPython Notebook is a good tool to know',
           'You should be using Python3 and not Python2',
           "That's not very Pythonic"]
# [True, True, False, False, False]

In [38]:
regex = re.compile(r'')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(0, 6), match='Python'>
<_sre.SRE_Match object; span=(6, 12), match='Python'>
<_sre.SRE_Match object; span=(20, 26), match='Python'>
<_sre.SRE_Match object; span=(16, 22), match='Python'>


[True, True, False, True, True]

**Tip**: In Python, you have to use the raw string notation `r` when using word boundaries. This is because `\b` in an ordinary Python string literal means backspace. The `r` prefix tells Python to keep the backslash as-is instead of treating it as the start of an escape sequence. Some people always include it at the start of a Python regex.
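The difference is easy to check directly. In this sketch, the pattern `\bcat\b` and the sample sentence are invented for illustration.

```python
import re

plain = '\bcat\b'   # \b becomes a single backspace character
raw = r'\bcat\b'    # backslash and 'b' are kept as two characters

print(len(plain), len(raw))  # 5 7
print(plain[0] == '\x08')    # True

text = 'a cat sat'
# The plain string looks for backspace characters, so nothing matches.
print(re.search(plain, text))        # None
# The raw string reaches the regex engine as a word boundary.
print(re.search(raw, text).group())  # cat
```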

Try to match using a non-word boundary character.

In [28]:
strings = ['Python is a snake',
           'Learn Python and make $$$', 
           'IPython Notebook is good to know',
           'You should be using Python3 and not Python2',
           "That's not very Pythonic"]
# [False, False, True, False, False]

In [29]:
regex = re.compile(r'')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(1, 7), match='Python'>


[False, False, True, False, False]

## Repetition `{} + * ?`

### Character{min, max}

Use the `{}` quantifier to specify how many times a character can be repeated.

1. `{N}` Exactly N times where N >= 0
2. `{N,}` N or more times where N >= 0
3. `{,N}` N or fewer times where N >= 0
4. `{N,P}` Between N and P times where N <= P

For example, `[a-z]{3,5}` will match `bird` but not `id`. What about `donuts`?
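The sketch below checks the `[a-z]{3,5}` example and then tries each of the four forms on runs of the letter `a`. It uses `re.fullmatch`, which this workshop has not covered, so that the whole string has to satisfy the quantifier. The helper `full` is hypothetical.

```python
import re

# The example from the text: at least three lowercase letters in a row.
example = [bool(re.search(r'[a-z]{3,5}', s)) for s in ['bird', 'id']]
print(example)  # [True, False]

runs = ['', 'a', 'aa', 'aaa', 'aaaa']

def full(pattern):
    # Hypothetical helper: True where the entire string matches the pattern.
    return [bool(re.fullmatch(pattern, s)) for s in runs]

print(full(r'a{2}'))    # [False, False, True, False, False]
print(full(r'a{2,}'))   # [False, False, True, True, True]
print(full(r'a{,2}'))   # [True, True, True, False, False]
print(full(r'a{1,3}'))  # [False, True, True, True, False]
```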

In [None]:
strings = ['Doh!','Dooh!','Doooh!','Dooooh!','Doooooh!']
# [False, False, True, True, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### `+` 1 or more repetitions

The `+` quantifier is equivalent to `{1,}`.

For example `e+` will match `red` or `queen` but not `crown`.
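A quick sketch of the `e+` example. The last line prints the matched text to show that `+` takes both `e` characters in `queen`.

```python
import re

samples = ['red', 'queen', 'crown']

results = [bool(re.search(r'e+', s)) for s in samples]
print(results)  # [True, True, False]

# The quantifier grabs the whole run of consecutive e characters.
print(re.search(r'e+', 'queen').group())  # ee
```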

In [None]:
strings = ['IMG01.png','IMG0011.png','IMG0111.png','IMG010.png','IMG2.png']
# [True, True, True, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### `*` 0 or more repetitions

The `*` quantifier is equivalent to `{0,}`.

For example `Ji*f` will match `Jf` or `Jif` but not `Jef`.
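A quick sketch of the `Ji*f` example. The string `Jiiif` is added for illustration.

```python
import re

samples = ['Jf', 'Jif', 'Jiiif', 'Jef']

# Zero, one, or many i characters are fine; any other letter is not.
results = [bool(re.search(r'Ji*f', s)) for s in samples]
print(results)  # [True, True, True, False]
```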

In [None]:
strings = ['IMG01.png','IMG001.png','IMG010.png','IMG100.png','IMG2.png']
# [True, True, True, True, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### `?` 0 or 1 repetition

The `?` quantifier is equivalent to `{0,1}`.

For example `10?1` will match `11`, or `101` but not `1001`.
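A quick sketch of the `10?1` example, using the three strings from the text.

```python
import re

samples = ['11', '101', '1001']

# At most one 0 may sit between the two 1 characters.
results = [bool(re.search(r'10?1', s)) for s in samples]
print(results)  # [True, True, False]
```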

In [None]:
strings = ['dean', 'dead', 'den', 'deen', 'deaan']
# [True, False, True, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

### Greediness and `?` Laziness

All of the above quantifiers are greedy, meaning they want to match as many characters as possible. This is not always desirable. For example, suppose we are doing some web scraping and are interested in determining what tags are in the web page.

Below are some html tags. I want to extract the info in the opening html tag.

In [None]:
strings = ['<font color="blue">Eastern Conference</font>',
           '<a href="http://www.espn.com">ESPN</a>']

This regex looks for the opening and closing brackets <> with any character(s) in between.

In [None]:
regex = re.compile('<.+>')
Go_Fish(strings, regex, 1)

The regex matched everything from the opening HTML tag through the end of the closing HTML tag because the `+` quantifier is greedy.

A solution to this issue is to use the `?` modifier, which opts for laziness instead of greediness. Laziness (a.k.a. "ungreedy" or "reluctant") is defined as matching as few characters as possible.

In [None]:
regex = re.compile('<.+?>')
Go_Fish(strings, regex, 1)

Notice above how the `?` modifier makes the `+` lazy instead of greedy. It matches on the fewest characters possible.
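To compare the two side by side, this sketch prints just the matched text for the first HTML string. The variable names are illustrative.

```python
import re

html = '<font color="blue">Eastern Conference</font>'

greedy = re.search(r'<.+>', html).group()
lazy = re.search(r'<.+?>', html).group()

# Greedy runs to the last '>' in the string.
print(greedy)  # <font color="blue">Eastern Conference</font>
# Lazy stops at the first '>' it can.
print(lazy)    # <font color="blue">
```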

**Note:** Once again, another character has a dual meaning in regular expressions. This is part of the reason why reading them can be so difficult for people with limited knowledge of regex.

Try to extract the first sentence only.

In [116]:
strings = ['Olaf is a snowman. Who likes warm hugs.',
          'Elsa and Anna are BFF. My daughter likes Elsa. I am Anna.']

In [117]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>


[True, True]

Sometimes, there is an alternative to making the `+` lazy with the `?` modifier. You can use the `[^]` notation that we learned earlier on.

In [106]:
strings = ['<font color="blue">Eastern Conference</font>',
           '<a href="http://www.espn.com">ESPN</a>']

In [107]:
regex = re.compile('<[^>]+>')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(0, 19), match='<font color="blue">'>
<_sre.SRE_Match object; span=(0, 30), match='<a href="http://www.espn.com">'>


[True, True]

Try the same problem using the `[^]` notation. 

In [113]:
strings = ['Olaf is a snowman. Who likes warm hugs.',
          'Elsa and Anna are BFF. My daughter likes Elsa. I am Anna.']

In [115]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

<_sre.SRE_Match object; span=(0, 18), match='Olaf is a snowman.'>
<_sre.SRE_Match object; span=(0, 22), match='Elsa and Anna are BFF.'>


[True, True]

And that's it for Part I of this Two-Part workshop series for learning regular expressions. 

I would like to conclude the teaching section with the most famous quote I could find regarding regular expressions.

> ## Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

# <font color='darkgreen'>Practical Applications</font>

This section puts what you have learned to use in some practical applications, with no hints or constraints on which tools to use.

## <font color="#FFCB0B">Umich Uniqname Compliance</font>

Regular expressions can be used to check if usernames and passwords meet the (sometimes annoying) requirements set by the developer.

*Your umich uniqname must be three to eight characters (lowercase alphabetical characters only) with no spaces or punctuation*.

Match the valid uniqnames. You should also add your uniqname to the list to make sure your regex is valid.

In [None]:
strings = ['caoa', 'kaitcorn', 'ac', 'alexander', 'alex2017', 'Harbaugh', 'go blue']
# [True, True, False, False, False, False, False]

In [None]:
regex = re.compile('')
Go_Fish(strings, regex, 1)

## <font color='royalblue'>Standardization of Phone Numbers using String Substitution</font>

Regex substitution with `re.sub` is an alternative to the literal `str.replace` method in Python.
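As a minimal sketch of the difference, the snippet below cleans up repeated spaces in a made-up string. `str.replace` only swaps one fixed piece of text, while `re.sub` can replace anything the pattern matches.

```python
import re

text = 'too    many   spaces'  # illustrative string

# A literal replacement of two spaces with one still leaves double spaces.
literal = text.replace('  ', ' ')
print(repr(literal))  # 'too  many  spaces'

# A regex replacement collapses each run of whitespace in a single pass.
pattern = re.sub(r'\s+', ' ', text)
print(repr(pattern))  # 'too many spaces'
```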

Standardize the CSCAR front desk phone number (call to register for a paid workshop or 1 hr free consultation) as best as you can using a single regex and a single substitution. The correct solution should print the same string on each line.

In [12]:
strings = ['734-764-7828',
           '734 764 7828',
           '734.764.7828',
           '(734)764 7828',
           '(734) 764-STAT']

In [13]:
regex = re.compile('') # put regex here
replacement_string = '' # put substitution string here
for string in strings:
    print(re.sub(regex, replacement_string, string))

734-764-7828
734 764 7828
734.764.7828
(734)764 7828
(734) 764-STAT


## <font color='darkblue'>Extracting a City/Township Name from an Address</font>

Extract the city name from these geocoded addresses. Run the existing code below once to load the dataset into Jupyter Notebook. You might want to open the text file in notepad to see some more data.

In [20]:
import pandas as pd
df = pd.read_csv('Detroit.txt', header=None, sep='|')
df.columns = ['address']

def get_city(address, regex):
    match = re.search(regex, address)
    return match.group(0) if match else "no match"
    
df.head(10)

index|address
---|---
0|17021 Schoolcraft Ave, Detroit, MI 48227, USA
1|Fisher Fwy, Detroit, MI 48216, USA
2|1611 Alter Rd, Detroit, MI 48215, USA
3|10899 Gratiot Ave, Detroit, MI 48213, USA
4|4099 Emery St, Detroit, MI 48234, USA
5|2509 Ferris St, Detroit, MI 48209, USA
6|19616-19624 Hoover St, Detroit, MI 48205, USA
7|Chrysler Fwy, Detroit, MI 48211, USA
8|536-598 W Hancock St, Detroit, MI 48201, USA
9|Chrysler Fwy, Detroit, MI, USA


Write a regex to capture the city name as best as you can. The last line will provide your solution.

In [21]:
regex = re.compile('')
cities = df['address'].apply(get_city, args=(regex,))
cities.value_counts()

    2467
Name: address, dtype: int64

ANSWER: Here is a table with city and frequency you should match. The entire city/township name should be somewhere in your answer but it does not have to be an exact match. It will probably have extra characters in the match. I have not shown you how to extract just the city name for this scenario. That's in Part II of this workshop.

city|count
---|---
Detroit|                2292
Hamtramck|                98
Highland Park|            46
Redford Charter Twp|       5
Warren|                    4
Dearborn|                  4
Hazel Park|                4
Eastpointe|                4
Lincoln Park|              3
Harper Woods|              3
Grosse Pointe|             2
Dearborn Heights|          1
Melvindale|                1

# <font color='orange'>Regex Games</font>

Regex Crossword  
https://regexcrossword.com/

Regex Golf (I don't know why it refers to Golf)  
https://alf.nu/RegexGolf

# References

Online regex tester including real-time matching and explanations. Can only test one string at a time.  
https://regex101.com

Online tutorial with real-time matching  
https://regexone.com/

Regex Cheat Sheet  
https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

Language-agnostic website about regular expressions  
http://www.regular-expressions.info/

Official Python3 `re` documentation  
https://docs.python.org/3/library/re.html

Python's alternative regular expression module `regex` to replace `re`   
https://pypi.python.org/pypi/regex  
This module allows you to do more fancy things like nested sets and set operations.  
For example, the regex `[[a-z]--[aeiou]]` specifies all lowercase non-vowels.