# Regular Expressions and Patterns

* Regular expressions (regex) are short statements which describe patterns of text for searching within text
* A regular expression is interpreted by a regex processor, which can be used to search in or split up text into "chunks"
* A regex follows a sort of "mini-language" of programming to define patterns of interest

* In my not so humble view, regexes are:
  * Pretty awesome
  * Pretty efficient
  * Compact
  * Painful to read if complex
  * Hard as hell to debug
  * Super fragile, and should rarely be used in production programming

* However, in data science, regexes should be used, and fairly often. This is especially true because:
  * A lot of data cleaning code is "one off" or "few off" -- it just needs to work on a limited set of data, not be robust across all kinds of input data
  * The data science process involves a lot of "understanding of your data", where the data is poorly described, and thus
    * You will write a lot of throw away code
    * You will want to rapidly investigate your code

* Good uses of regex:
  * Validating input data ("Hey, make sure all phone numbers are in the format (###) ###-####")
  * Quick and dirty cleaning of data when you can verify the results easily
* Questionable uses of regex:
  * If someone needs to be able to understand what you wrote
  * If there are a lot of edge cases (in which case you might want regex + more error handling)

* One more reason to learn regex: they're supported nearly everywhere, in languages (Java, Python, C#) as well as tools (grep, text editors, etc.)!

* Regexes in Python are done through the `re` module (and the docs are your friend!):

In [None]:
# display is imported explicitly so the cell does not rely on notebook built-ins
from IPython.display import IFrame, display
display(IFrame("https://docs.python.org/3/library/re.html", width="100%", height=700))

* The most important operations are:
  * `re.search()` which returns a `Match` object for the first item which can be found
  * `re.finditer()` which returns an iterator over `Match` objects for items found
  * `re.findall()` which returns a list of strings; `re.finditer()` is generally preferred, since each `Match` also tells you where the item was found
  * `re.split()` which uses a pattern to break up a string
  * `re.sub()` which replaces substrings through substitution
* But! Lots of other modules will take in a regex as well, and we'll touch on that when we get to pandas
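
* As a rough sketch of how these five calls differ (the sample text and pattern are made up for illustration):

```python
import re

# Sample text and pattern are made up for illustration
text = "cat bat rat"
pattern = r"[cbr]at"

first = re.search(pattern, text)                         # Match for the first hit, or None
spans = [m.span() for m in re.finditer(pattern, text)]   # one Match object per hit
words = re.findall(pattern, text)                        # plain strings, no positions
parts = re.split(r"\s+", text)                           # break on runs of whitespace
swapped = re.sub(pattern, "dog", text)                   # replace every hit

print(first.group())
print(spans)
print(words)
print(parts)
print(swapped)
```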

* The Match object is key to understand.

In [1]:
import re
# look the class up from an actual match, since older Python versions don't expose re.Match
print(type(re.search("a", "a")).__doc__)

* If nothing is found the `Match` object doesn't exist - it's `None`.
* There is some important subtlety here!

In [2]:
# Quick example
strng = "I absolutely love Christopher Brooks and everything \
 he does in class, he is amazing."
pattern = "Chris"
result = re.search(pattern, strng)

In [3]:
result

<_sre.SRE_Match object; span=(18, 23), match='Chris'>

In [4]:
if result:
    print("I knew it was about Chris!")

I knew it was about Chris!


In [None]:
if result == True:
    print("I knew it was about Chris!")

* Wait, wtf? How does that work? Why does `if result:` succeed when neither `result == True` nor `result is True` evaluates to `True` for a `Match` object?

* This is important python object understanding:
  * `==` checks for equality of **value** between the left hand side and the right hand side. A `Match` object is not equal to `True`, so `result == True` is `False`
  * `is` checks for identity, that the left hand side and right hand side point to the **same** object. A `Match` object is not the `True` object either, so `result is True` is also `False`
  * `if result:` asks for the truth value of the object, i.e. `bool(result)`. The authors of the `re` module made `Match` objects always truthy to make life easy for us, and a failed search gives `None`, which is falsy
* Don't use `==` with `Match` objects. In truth, never use `== True` when checking a condition; write `if result:` or `if result is not None:`
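
* A quick way to see the difference for yourself (a small sketch reusing a shortened version of the sentence above):

```python
import re

result = re.search("Chris", "I absolutely love Christopher Brooks")

truthy = bool(result)            # what `if result:` actually checks
equals_true = (result == True)   # value comparison against True
is_true = (result is True)       # identity comparison against True
found = (result is not None)     # explicit "did it match?" test

print(truthy, equals_true, is_true, found)
```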

* Ok, that was a bit of a digression, let's get back on track
* Match objects also have some helpful information inside of them, such as what was matched (shown as `match` in the output above, available through `.group()`) and where it was matched in the string (shown as `span`, available through `.span()`)
* This can be helpful when your patterns can match many different substrings
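
* A minimal sketch of pulling that information out of a `Match` (same search as above, on a shortened sentence):

```python
import re

m = re.search(r"Chris", "I absolutely love Christopher Brooks")
if m:
    matched_text = m.group()      # the substring that matched
    where = m.span()              # (start, end) indexes into the string
    start, end = m.start(), m.end()
    print(matched_text, where, start, end)
```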

## Patterns
* We've already seen the most basic pattern, just a list of ordinary alphanumeric characters
* But there are a lot of special characters. Let's start with `.`
* `.` will match any single character except for newline characters (which we represent with the escaped `\n`)

In [None]:
pattern='G..d'
re.search(pattern, 'Good')

In [None]:
re.search(pattern, 'Gawd')

In [None]:
pattern='G..d'
re.search(pattern, 'Goodness!')

In [None]:
re.search(pattern, 'God')

* The next patterns to be aware of are
  * `\s` which matches whitespace, this will match odd unicode whitespaces, tabs, spaces, etc.
  * `\S` which matches non-whitespace
  * `\d` which matches digits
  * `\D` which matches non-digits

In [None]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
re.search(pattern,"(306) 262-2905")

In [None]:
re.search(pattern,"306-262-2905")

In [None]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
# But we see it's not an ideal pattern...
re.search(pattern,":306p 262-2905")
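
* One possible tightening, as a sketch: spell out the parentheses as escaped literals instead of `\D`, and use `re.fullmatch()` so nothing extra is allowed before or after the number (the `r` prefix keeps the backslashes intact; the variable names are just for illustration):

```python
import re

# \( and \) match literal parentheses, so ":" and "p" no longer sneak through
phone = r"\(\d\d\d\)\s\d\d\d-\d\d\d\d"

candidates = ["(306) 262-2905", ":306p 262-2905", "306-262-2905"]
# fullmatch only succeeds when the whole string fits the pattern
results = [bool(re.fullmatch(phone, c)) for c in candidates]

for candidate, ok in zip(candidates, results):
    print(candidate, ok)
```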

* In addition to characters to match, we can match on positions (boundaries)
  * `^` matches at the beginning of the string (or of each line, if you pass `re.MULTILINE`)
  * `$` matches at the end of the string (or of each line, if you pass `re.MULTILINE`)
  * `\b` which matches at the beginning or end of a **word**
  * `\B` which matches anywhere that is not the beginning or end of a word
* And two more character classes that go along with word boundaries
  * `\w` matches a word character (defined as letter, number or... underscore?)
  * `\W` matches a non-word character
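
* A small sketch of how `^` and `$` behave with and without the `re.MULTILINE` flag (the two-line sample text is made up; raw strings are used so the backslashes reach the regex processor untouched):

```python
import re

# Two-line sample text, made up for illustration
text = "first line\nsecond line"

starts_default = re.findall(r"^\w+", text)                 # only the start of the string
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)     # the start of every line
ends_default = re.findall(r"\w+$", text)                   # only the end of the string
l_words = re.findall(r"\bl\w*", text)                      # words beginning with "l"

print(starts_default, starts_multi, ends_default, l_words)
```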

In [None]:
strng="My goodness, have you heard that Chris \
Brooks teach? He's amazing!"
re.search('^Brooks', strng)

In [None]:
re.search('^My', strng)

In [None]:
# words that start with good (but not good itself)
re.search('\bgood\B', strng)

* Wait, WTF? Isn't that supposed to work? What is happening here?
  * Gah. Strings are a mess in Python. There are 3 ways of writing string literals:
    * Just as per normal: `strng="No thank you"`, in Python 3 this is unicode data
    * As an array of bytes: `strng=b"No thank you"`, which was the default kind of string in Python 2 and reminds Chris of the simpler days. Who needs more than 255 characters?
    * As a raw string: `strng=r"No thank you"`. In this case, the backslash characters are left in and not treated as escapes by the string processing

In [None]:
print('No thank you chris brooks')
print('No thank you chris \brooks')
print(r'No thank you chris \brooks')

* Goodness! The `\b` that we were putting in the string was being mistaken for a backspace character!
* Wait, why didn't this happen with the `\d` before?
* Because `\d` isn't an escape sequence that Python strings recognize, so the backslash was left alone (newer Python versions warn about this)...

* Moral of the story: Always prefix your regex strings with `r`
* Seriously. Always. Make your life easier.

In [None]:
# words that start with good (but not good itself)
re.search(r'\bgood\B', strng)

In [None]:
# what's going to happen!?
print(len('a'))
print(len('a\b'))
print('a\b')
print(len(r'a\b'))
print(r'a\b')

In [None]:
strng="Dang I love this class! It was worth every $"
re.search(r'worth every $', strng)

In [None]:
# escape the $ so it means a literal dollar sign (and use a raw string!)
re.search(r'worth every \$', strng)
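
* If you'd rather not escape special characters by hand, `re.escape()` can do it for you. A minimal sketch (the `literal` variable is just for illustration):

```python
import re

strng = "Dang I love this class! It was worth every $"
literal = "worth every $"

# re.escape backslash-escapes anything that has a special meaning in a regex
escaped = re.escape(literal)
found = re.search(escaped, strng)

print(escaped)
print(found.group())
```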

## Quantifiers
* Quantifiers say how many times the previous character (or group) may repeat
  * `*` zero or more of the previous character
  * `+` one or more of the previous character
  * `?` zero or one of the previous character
  * `{m,n}` between `m` and `n` of the previous character, where `n` is optional and if left out it means either exactly `m` (`{m}`) or `m` or more (`{m,}`)
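
* A few tiny illustrations of each quantifier (the sample strings are made up):

```python
import re

# Sample strings are made up for illustration
star = re.findall(r"ab*", "a ab abbb")                        # "a" followed by zero or more "b"
optional = re.findall(r"colou?r", "color and colour")         # the "u" is optional
exactly_four = re.findall(r"\d{4}", "2905 and 12 and 123456") # exactly four digits at a time
three_plus = re.findall(r"\d{3,}", "2905 and 12 and 123456")  # three or more digits

print(star, optional, exactly_four, three_plus)
```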

In [5]:
strng='`My phone number is (306) 373-2905'
re.search(r'\d*', strng)

<_sre.SRE_Match object; span=(0, 0), match=''>

In [8]:
# ok, seems like that wasn't the aim
strng='`My phone number is (306) 373-2905'
string = re.search(r'\d+', strng)
print(string)

<_sre.SRE_Match object; span=(21, 24), match='306'>


In [None]:
# can we find all number fragments in the string?
re.findall(r'\d+', strng)

In [None]:
# what do you think this will do?
re.findall(r'\d{1,3}', strng)

In [None]:
# imagine these are your grades over time
grades='ACCAAAABCBCBAA'
# how do you find your longest A streak?
re.findall(r'A+', grades)
# it's in there somewhere....
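
* One way to actually pull the longest streak out of that list, as a sketch (using the grade sequence from the cell above):

```python
import re

grades = "ACCAAAABCBCBAA"

streaks = re.findall(r"A+", grades)   # every run of consecutive A grades
longest = max(streaks, key=len)       # pick the run with the most characters

print(streaks)
print(longest, len(longest))
```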

In [None]:
# What do you figure this does?
grades='ACCAAAABCBCBAA'
re.findall(r'CA+', grades)

## Sets of Characters
* We can wrap a set of characters we want to match inside of `[]`
* `[aeiou]` means match any vowel
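
* For instance, a quick sketch of the vowel set on its own and combined with a `+` quantifier:

```python
import re

sentence = "The quick brown fox jumped over the..."

singles = re.findall(r"[aeiou]", sentence)   # one vowel at a time
runs = re.findall(r"[aeiou]+", sentence)     # runs of adjacent vowels

print(singles)
print(runs)
```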

In [10]:
re.findall(r'\d+','The qui23456789ck brown fox jumped over the...')

['23456789']

In [None]:
# we can negate THE WHOLE SET with a caret `^`
re.findall(r'[^aeiou]{1}','The quick brown fox jumped over the...')

In [None]:
re.findall(r'dog[s]{1}','The dogs ran after the big dog')

* We can also define a range inside of a character set. This is still used, but meta characters are often more appropriate.
  * `[A-Z]` all upper case Latin letters
  * `[a-zA-Z]` all upper case or lower case Latin letters
  * `[a-zA-Z0-9_]` the same as `\w` for ASCII text (on regular Python 3 strings `\w` also matches letters and digits from other alphabets)
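
* A short sketch of where the range and `\w` agree and where they differ (the sample text is made up; `\u00e9` is an accented e):

```python
import re

# Sample text is made up for illustration; \u00e9 is the letter e with an acute accent
text = "Ann_3 caf\u00e9"

uppers = re.findall(r"[A-Z]", text)                  # upper case Latin letters only
ascii_range = re.findall(r"[a-zA-Z0-9_]+", text)     # stops at the accented letter
word_chars = re.findall(r"\w+", text)                # unicode aware by default
word_ascii = re.findall(r"\w+", text, re.ASCII)      # re.ASCII narrows \w back to the range

print(uppers, ascii_range, word_chars, word_ascii)
```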

In [None]:
# unicode ranges work too
re.findall(r'[α-ω]+','Someone once said, "I am the α". Does this mean there is a γ?')

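* "And" logic is implicit (characters simply follow one another), but if you want to specify an "or" you use a pipe `|`

In [None]:
line="POST /incentivize HTTP/1.1"
# the pipe goes between the alternatives, not inside [] (where | is just a literal character)
re.findall(r'HTTP/1\.1|HTTP/1\.2',line)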
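
* Careful: inside `[]` a pipe is just another character in the set, not an "or". A small sketch with a made-up log line shows the difference:

```python
import re

# Made-up log line; the last entry deliberately ends in a literal pipe
line = "GET /a HTTP/1.1, GET /b HTTP/1.2, GET /c HTTP/1.|"

in_set = re.findall(r"HTTP/1\.[1|2]", line)         # set of three characters: 1, | and 2
with_pipe = re.findall(r"HTTP/1\.1|HTTP/1\.2", line)  # true alternation between two patterns

print(in_set)
print(with_pipe)
```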

# Play time
* Let's play! Someone throw out a Canadian hockey player name?

In [None]:
# Insert fun 🇨🇦🏒 activity here.
# save data as datasets/wiki.txt

## Capture Groups
* Up until this point it probably seems really laborious. It is.
* Capture groups let us match and/or extract subpatterns so we can build many regexes up together
* To indicate a capture group we use parentheses `()`
* The canonical example? An email address

In [None]:
strng="Wow, the course taught by brooksch@umich.edu is fire!"
re.search(r'[\w.-]+@[\w.-]+',strng)

* But, there are actually a few different parts of an email address, including the username and the hostname

In [None]:
strng="Wow, the course taught by brooksch@umich.edu is fire!"
match=re.search(r'([\w.-]+)@([\w.-]+)',strng)
if match:
    print(match.group()) # the whole match
    print(match.group(1))# the first group
    print(match.group(2))# the second group

* Capture groups get even cooler though: you can label them like a variable
* Uses the syntax `(?P<name>...)`, where
  * the `()` denotes a capture group
  * the `?P` indicates this is an extension to standard regex
  * the `<name>` means that matches for that group are labeled with the dictionary key `name`
  * the `...` is the pattern the group should match
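
* As a sketch, here is the email example again with labeled groups (the group names `user` and `host` are my own choice for illustration):

```python
import re

strng = "Wow, the course taught by brooksch@umich.edu is fire!"

# the group names "user" and "host" are arbitrary labels chosen for this example
match = re.search(r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+)", strng)
if match:
    print(match.group("user"))   # look a group up by its label
    print(match.group("host"))
    print(match.groupdict())     # all labeled groups as a dictionary
```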

In [None]:
# read in the wiki text
with open("datasets/wiki.txt","r") as file:
    wiki=file.read()

# can you write a better regex to pull out titles from that datafile?
for item in re.finditer("???",wiki):
    print(item.groupdict())

* Last topic I'll touch on in capture groups: thus far the focus has been on returning and labeling the capture groups
* What if we want to match on the group, but don't want to see it come back (like the `[edit]` link after a Wikipedia heading)?
* We can use non capturing groups
  * `(?:...)` Match but don't return the group
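
* A minimal sketch of the difference, on made-up text in the style of Wikipedia section headings:

```python
import re

# Made-up text in the style of Wikipedia section headings
text = "History[edit]\nCampus[edit]"

capturing = re.findall(r"(\w+)(\[edit\])", text)        # both groups come back as tuples
non_capturing = re.findall(r"(\w+)(?:\[edit\])", text)  # [edit] must match but is not returned

print(capturing)
print(non_capturing)
```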

* Here's an example using New York Times health tweets about news items

In [227]:
# get a list of all of the hashtags that are included in this data
with open("datasets/nytimeshealth.txt","r") as file:
    health=file.read()


* Let's see an example using data from Wikipedia on US universities which are Buddhist-based

In [228]:
# Get a list of dicts where each university 'name', 'city', and 'state' are labeled as such
with open("datasets/buddhist.txt","r") as file:
    wiki=file.read()


![](https://imgs.xkcd.com/comics/regular_expressions.png)