# Regular Expressions and Patterns

* Regular expressions (regex) are short statements which describe patterns of text for searching within text
* A regular expression is interpreted by a regex processor, which can be used to search in or split up text into "chunks"
* A regex follows a sort of "mini-language" of programming to define patterns of interest

* In short: a pattern searches text for matches, and the matches can then be used to split the text up, all written in something like a mini language
* pros - ubiquitous, efficient, compact
* cons - difficult to understand, hard to debug

* Good uses of regex:
  * Validating input data ("Hey, make sure all phone numbers are in the format (###) ###-####")
  * Quick and dirty cleaning of data when you can verify the results easily
* Questionable uses of regex:
  * If someone needs to be able to understand what you wrote
  * If there are a lot of edge cases (in which case you might want regex + more error handling)
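
* As a rough sketch of the phone-number validation idea above: the helper name `is_valid_phone` and the exact pattern are assumptions made for illustration, not part of the course material. `re.fullmatch()` only succeeds when the pattern covers the whole string.

```python
import re

# Hypothetical pattern for the "(###) ###-####" format.
# The r prefix (raw string) is explained later in these notes.
PHONE_PATTERN = r"\(\d{3}\) \d{3}-\d{4}"

def is_valid_phone(text):
    # fullmatch returns a Match only if the entire string fits the pattern
    return re.fullmatch(PHONE_PATTERN, text) is not None

print(is_valid_phone("(306) 262-2905"))   # True
print(is_valid_phone("306-262-2905"))     # False, wrong format
print(is_valid_phone("(306) 262-29055"))  # False, one digit too many
```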

* One more reason to learn regex: they're nearly ubiquitously supported in tools and languages (Java, Python, C#, as well as grep, text editors etc)!

* Regexes in Python are done through the `re` module (and the docs are your friend!):

In [2]:
from IPython.display import IFrame, display
display(IFrame("https://docs.python.org/3/library/re.html", width="100%", height=700))

* The most important operations are:
  * `re.search()` which returns a `Match` object for the first item which can be found
  * `re.finditer()` which returns an iterator over `Match` objects for items found
  * `re.findall()` which returns a list of plain strings; `re.finditer()` is generally preferred
  * `re.split()` which uses a pattern to break up a string
  * `re.sub()` which replaces substrings through substitution
* But! Lots of other modules will take in a regex as well, and we'll touch on them in pandas
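
* A quick side-by-side of the five operations listed above, on a made-up string (the variable names are just for illustration; `\w` and `\s` are explained further down):

```python
import re

text = "cat bat rat"

first = re.search(r"\wat", text)                        # Match for the first hit only
spans = [m.span() for m in re.finditer(r"\wat", text)]  # one Match object per hit
words = re.findall(r"\wat", text)                       # plain strings, no positions
pieces = re.split(r"\s", text)                          # break the string on whitespace
swapped = re.sub(r"\wat", "hat", text)                  # replace every hit

print(first.group())  # cat
print(spans)          # [(0, 3), (4, 7), (8, 11)]
print(words)          # ['cat', 'bat', 'rat']
print(pieces)         # ['cat', 'bat', 'rat']
print(swapped)        # hat hat hat
```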

* The Match object is key to understand.

In [3]:
import re
print(re.Match.__doc__)

The result of re.match() and re.search().
Match objects always have a boolean value of True.


* If nothing is found the `Match` object doesn't exist - it's `None`.
* There is some important subtlety here!

In [4]:
# Quick example
strng = "I absolutely love SI330 and everything \
 we do in class is amazing."
pattern = "SI330"
result = re.search(pattern, strng)

In [5]:
result

<re.Match object; span=(18, 23), match='SI330'>

In [6]:
if result: #works
    print("I knew it was about SI330!")

I knew it was about SI330!


In [7]:
if result == True: #will not print: result is a Match, not the boolean True. DO NOT USE ==
    print("I knew it was about SI330!")

In [8]:
start = result.start(0)
end = result.end(0)
strng[start:end] #returns 'SI330'

'SI330'

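* Wait, wtf? Why does `if result:` succeed while `result == True` evaluates to `False`, when both are given the same `Match` object?

* This is important python object understanding:
  * `if result:` asks for the *truthiness* of the object (`bool(result)`), and a `Match` object is always truthy
  * `==` asks whether two objects are equal. `Match` doesn't define its own notion of equality, so the comparison falls back to checking whether both sides are the **same** object, and `True` is not the same object as a given `Match`
  * `result is True` is `False` for the same reason: `is` only checks identity

* Don't use `==` with `Match` objects. In truth, never compare against `True` or `False` with `==`; test the value directly (`if result:`) or check `result is not None`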
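
* A small sketch of the safe ways to test a search result; the strings and variable names here are made up for illustration:

```python
import re

found = re.search(r"SI330", "I love SI330")
missing = re.search(r"SI330", "I love cooking")

print(bool(found))      # True: a Match object is always truthy
print(found == True)    # False: a Match is not the object True
print(missing is None)  # True: no match means None, not a Match

for result in (found, missing):
    if result:  # idiomatic check, works for both Match and None
        print("matched:", result.group())
    else:
        print("no match")
```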

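* Match objects also have some helpful information inside of them, such as what was matched (shown as `match` in the output above, available through `.group()`) and where it was matched in the string (shown as `span`, available through `.span()`, `.start()` and `.end()`)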
* This can be helpful when your patterns can match many different substrings
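
* For instance, when a pattern can match many different substrings, the `Match` tells you which one it actually found. A minimal sketch on a shortened version of the earlier sentence (the pattern `SI\d+` is an assumption chosen for illustration):

```python
import re

strng = "I absolutely love SI330"
result = re.search(r"SI\d+", strng)  # "SI" followed by one or more digits

print(result.group())  # SI330, the text that was matched
print(result.span())   # (18, 23), where it was matched
print(result.start())  # 18
print(result.end())    # 23
```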

## Patterns
* We've already seen the most basic pattern, just a list of ordinary alphanumeric characters
* But there are a lot of special characters. Let's start with `.`
* `.` will match any single character except for newline characters (which we represent with the escaped `\n`)

In [9]:
pattern='G..d'
re.search(pattern, 'Good') #looks for a capital G, any two characters, then a lower case d. 'Good' works

<re.Match object; span=(0, 4), match='Good'>

In [10]:
re.search(pattern, 'Grid') #works

<re.Match object; span=(0, 4), match='Grid'>

In [11]:
pattern='G..d'
re.search(pattern, 'Graduation!') #works because finds a match in the first 4 characters

<re.Match object; span=(0, 4), match='Grad'>

In [12]:
re.search(pattern, 'God') #will not work because there need to be exactly two characters between the G and the d

* The next patterns to be aware of are
  * `\s` which matches whitespace, this will match odd unicode whitespaces, tabs, spaces, etc.
  * `\S` which matches non-whitespace
  * `\d` which matches digits
  * `\D` which matches non-digits

In [13]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
re.search(pattern,"(306) 262-2905") #works, ( = \D, 306 = \d\d\d, ) = \D, " " = \s, 262 = \d\d\d, - = -, 2905 = \d\d\d\d

<re.Match object; span=(0, 14), match='(306) 262-2905'>

In [14]:
re.search(pattern,"306-262-2905") #won't work, expecting a non digit first

In [13]:
pattern="\D\d\d\d\D\s\d\d\d-\d\d\d\d"
# But we see it's not an ideal pattern...
re.search(pattern, "-123- 456-7890") #works, - = \D, 123 = \d\d\d, - = \D, " " = \s, 456 = \d\d\d, - = -, 7890 = \d\d\d\d

<re.Match object; span=(0, 14), match='-123- 456-7890'>

In [14]:
re.search(pattern, "p123x 456-7890") #works because p and x are non digits

<re.Match object; span=(0, 14), match='p123x 456-7890'>

* In addition to characters to match, we can match positions (boundaries) in the text
  * `^` matches at the beginning of the string (or of each line, with the `re.MULTILINE` flag)
  * `$` matches at the end of the string (or of each line, with `re.MULTILINE`)
  * `\b` matches at the beginning or end of a **word**
  * `\B` matches anywhere that is not the beginning or end of a word
* Words are defined in terms of two more character classes
  * `\w` matches a word character (a letter, a digit or... an underscore)
  * `\W` matches a non-word character
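
* A small sketch of why the underscore matters. The sample text is made up, `+` means "one or more" (covered under Quantifiers), and the `r` prefix on the patterns is explained a little further down:

```python
import re

text = "snake_case, CamelCase & kebab-case!"

word_runs = re.findall(r"\w+", text)         # runs of word characters
other_runs = re.findall(r"\W+", text)        # runs of everything else
whole_word = re.findall(r"\bcase\b", text)   # "case" standing alone as a word

print(word_runs)   # ['snake_case', 'CamelCase', 'kebab', 'case']
print(other_runs)  # [', ', ' & ', '-', '!']
print(whole_word)  # ['case'], only the one after the dash
```

* `snake_case` stays in one piece because `_` counts as a word character, so there is no word boundary in front of its `case`; the dash in `kebab-case` is a non-word character, so there is one.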

In [15]:
strng="My goodness, have you heard that Li person \
is teaching? He's not even a Chris!"
re.search('^Li', strng) #won't work, Li is not at the beginning of the line

In [17]:
re.search('^My', strng) #works, My is at the beginning

<re.Match object; span=(0, 2), match='My'>

In [18]:
# words that start with good (but not good itself)
re.search('\bgood\B', strng) #finds nothing: without an r prefix Python turns \b into a backspace character before re ever sees it
# the r prefix marks a raw string

* Wait, WTF? Isn't that supposed to work? What is happening here?
  * There are different ways of representing strings:
    * Just as per normal: `strng="No thank you"`, in Python 3 this is Unicode data
    * As a raw string: `strng=r"No thank you"`. In this case, the backslash characters are left in and not treated as escape sequences by the string processing

In [19]:
print('No thank you Chris Teplovs') #prints out sentence
print('No thank you Chris \brooks') #\b is read as a backspace character, which erases the space before it when printed
print(r'No thank you chris \brooks') #the r prefix keeps \b as a literal backslash followed by b
print('No thank you Chris \quarles') #unchanged because \q isn't a recognized escape (newer Pythons warn about this)

No thank you Chris Teplovs
No thank you Chrisrooks
No thank you chris \brooks
No thank you Chris \quarles


* Goodness! The `\b` that we were putting in the string was being mistaken for a backspace character!
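* Wait, why didn't this happen with the `\d` before?
* Because `\d` isn't an escape sequence that Python strings recognize, so the backslash was left alone (recent Python versions warn about such unrecognized escapes)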

* Moral of the story: Always prefix your regex strings with `r`
* Seriously. Always. Make your life easier.
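
* One way to convince yourself: compare what actually ends up inside the string with and without the `r` (variable names here are just for illustration):

```python
plain = '\bgood'   # \b becomes a single backspace character
raw = r'\bgood'    # backslash and b are kept as two separate characters

print(len(plain))          # 5
print(len(raw))            # 6
print(plain[0] == '\x08')  # True, \x08 is the backspace character
print(raw[:2] == '\\b')    # True, a literal backslash followed by b
```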

In [20]:
# words that start with good (but not good itself)
re.search(r'\bgood\B', strng) #now works

<re.Match object; span=(3, 7), match='good'>

In [23]:
strng="Dang I love this class! It was worth every $"
re.search(r'worth every $', strng) #won't work because $ is a special character (it means "end of line", not a literal dollar sign)

In [22]:
re.search(r'worth every \$', strng) # now will work, the backslash escapes the $

<re.Match object; span=(31, 44), match='worth every $'>

## Quantifiers
* Quantifiers say how many times the preceding item (a character, set or group) may repeat
  * `*` zero or more of the previous item
  * `+` one or more of the previous item
  * `?` zero or one of the previous item
  * `{m,n}` between `m` and `n` of the previous item; `{m}` means exactly `m` and `{m,}` means `m` or more
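
* A few made-up strings to see each quantifier in isolation (a sketch; the sample text and variable names are assumptions for illustration):

```python
import re

optional_u = re.findall(r"colou?r", "color colour colouur")  # ? = zero or one u
any_os = re.findall(r"go*d", "gd god good")                  # * = zero or more o
exactly_4 = re.findall(r"\d{4}", "7 42 2905 123456")         # {4} = exactly four digits
two_plus = re.findall(r"\d{2,}", "7 42 2905 123456")         # {2,} = two or more digits

print(optional_u)  # ['color', 'colour']
print(any_os)      # ['gd', 'god', 'good']
print(exactly_4)   # ['2905', '1234']
print(two_plus)    # ['42', '2905', '123456']
```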

In [24]:
strng='`My phone number is (306) 373-2905'
re.search(r'\d*', strng) #"works", but matches the empty string at position 0 because * allows zero digits

<re.Match object; span=(0, 0), match=''>

In [25]:
# ok, seems like that wasn't the aim
strng='`My phone number is (306) 373-2905'
re.search(r'\d+', strng) #works, \d+ wants one or more digits and stops at the first non-digit, hence only the first 3 digits

<re.Match object; span=(21, 24), match='306'>

In [28]:
# can we find all number fragments in the string?
[m.group() for m in re.finditer(r'\d+', strng)] # iterates over every match of \d+ and pulls out the matched text

['306', '373', '2905']

In [27]:
# what do you think this will do?
re.findall(r'\d{1,3}', strng) # works, splits the digits into chunks of between 1 and 3 characters

['306', '373', '290', '5']

## Sets of Characters
* We can wrap a set of characters we want to match inside of `[]`
* `[aeiou]` means match any vowel

In [29]:
re.findall(r'[aeiou]+','The quick brown fox jumped over the...') #works, pulls single or groups like ui

['e', 'ui', 'o', 'o', 'u', 'e', 'o', 'e', 'e']

In [30]:
# we can negate THE WHOLE SET with a caret `^`
re.findall(r'[^aeiou]{1}','The quick brown fox jumped over the...') #works, matches every single character that is not a lower case vowel

['T',
 'h',
 ' ',
 'q',
 'c',
 'k',
 ' ',
 'b',
 'r',
 'w',
 'n',
 ' ',
 'f',
 'x',
 ' ',
 'j',
 'm',
 'p',
 'd',
 ' ',
 'v',
 'r',
 ' ',
 't',
 'h',
 '.',
 '.',
 '.']

In [31]:
re.findall(r'dog[s]{1}','The dogs ran after the big dog')

['dogs']

* We can also define a range inside of a character set. This is still used, but meta characters are often more appropriate.
  * `[A-Z]` all upper case roman characters
  * `[a-zA-Z]` all upper case or lower case roman characters
  * `[a-zA-Z0-9_]` the same as `\w` for ASCII text (in Python 3, `\w` also matches Unicode letters and digits unless you pass `re.ASCII`)
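
* A short sketch of where an explicit range and `\w` part ways, using a made-up string with accented letters (written with `\u` escapes so the example does not depend on how your editor stores them):

```python
import re

text = "na\u00efve caf\u00e9_42"  # "naïve café_42"

by_range = re.findall(r"[a-zA-Z0-9_]+", text)        # ASCII letters, digits, underscore
by_w = re.findall(r"\w+", text)                      # Unicode-aware by default
by_w_ascii = re.findall(r"\w+", text, flags=re.ASCII)  # \w restricted to ASCII

print(by_range)    # ['na', 've', 'caf', '_42']
print(by_w)        # ['naïve', 'café_42']
print(by_w_ascii)  # ['na', 've', 'caf', '_42']
```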

In [32]:
# unicode ranges work too
re.findall(r'[α-ω]+','Someone once said, "I am the α". Does this mean there is a γ?')

['α', 'γ']

* Characters in a pattern are implicitly joined with "AND" (each must match, in order), but if you want to specify an "OR" you use a pipe `|`

In [None]:
line="POST /incentivize HTTP/1.1"
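re.findall(r'HTTP/1\.1|HTTP/1\.2',line) #matches either version; inside [] a | would just be a literal pipe, and an unescaped . matches any character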
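
* One thing worth checking for yourself: a pipe only means "OR" outside of square brackets. Inside `[]` it is just another character in the set, and an unescaped `.` will happily match anything. A sketch with a deliberately odd, made-up test string:

```python
import re

line = "POST /incentivize HTTP/1.1"

methods = re.findall(r"GET|POST", line)         # alternation: either word
version = re.findall(r"HTTP/1\.[12]", line)     # set: 1 or 2, with the dot escaped
loose = re.findall(r"HTTP/1.[1|2]", "HTTP/1.| HTTP/1x2")  # matches more than intended

print(methods)  # ['POST']
print(version)  # ['HTTP/1.1']
print(loose)    # ['HTTP/1.|', 'HTTP/1x2']
```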

## Capture Groups
* Up until this point it probably seems really laborious. It is.
* Capture groups let us match and/or extract subpatterns so we can build many regexes up together
* To indicate a capture group we use parentheses `()`
* The canonical example? An email address

In [None]:
# [\w.-]+ = one or more word characters, periods or dashes; @ = a literal @; [\w.-]+ = another such chunk
strng="The instructor is liwarren@umich.edu"
re.search(r'[\w.-]+@[\w.-]+',strng)

* But, there are actually a few different parts of an email address, including the username and the hostname

In [33]:
strng="The instructor is liwarren@umich.edu"
match=re.search(r'([\w.-]+)@([\w.-]+)',strng)
if match:
    print(match.group()) # the whole match
    print(match.group(1))# the first group, prints liwarren, which was in the first parenthesis
    print(match.group(2))# the second group, prints umich.edu, which was in the second parenthesis

liwarren@umich.edu
liwarren
umich.edu


* Capture groups get even cooler though: you can label them like a variable
* Uses the syntax `(?P<name>)`, where 
  * the `()` denotes a capture group 
  * the `?P` indicates this is an extension to standard regex
  * the `<name>` means that matches for that group are labeled with the dictionary key `name`

In [36]:
import re
result = re.search(r"(?P<month>\w*) (?P<day>\d{1,2}), (?P<year>\d\d\d\d)",
          "Gordie Howe Chex card.jpg Born	March 31, 1928 Floral, Saskatchewan, Canada")
result.groupdict() #puts month day year into dictionary

{'month': 'March', 'day': '31', 'year': '1928'}

* Last topic I'll touch on in capture groups: thus far the focus has been on returning and labeling the capture groups
* What if we want to match on the group, but don't want to see it come back?
* (like \[edit\])
* We can use non capturing groups
  * `(?:...)` Match but don't return the group
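
* A quick sketch of the difference, on a made-up sentence: with ordinary parentheses `re.findall()` hands back every group, while `(?:...)` still has to match but is left out of the result.

```python
import re

text = "Mr. Smith met Dr. Jones and Ms. Lee"

with_title = re.findall(r"(Mr|Ms|Dr)\. (\w+)", text)    # both groups are returned
names_only = re.findall(r"(?:Mr|Ms|Dr)\. (\w+)", text)  # the title is matched but dropped

print(with_title)  # [('Mr', 'Smith'), ('Dr', 'Jones'), ('Ms', 'Lee')]
print(names_only)  # ['Smith', 'Jones', 'Lee']
```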

* Let's see an example using data from Wikipedia on US universities which are Buddhist-based

In [38]:
# Extract each university's 'name', 'city', and 'state' using named groups
with open("datasets/buddhist.txt","r") as file:
    wiki=file.read()
# the (?:...) groups must match but are left out of what is returned, so the – and ", " are dropped
pattern = r"(?P<name>.*)(?:[–])(?: located in )(?P<city>\w*)(?:, )(?P<state>\w*)"
re.findall(pattern, wiki) # returns a list of (name, city, state) tuples

[('Dhammakaya Open University ', 'Azusa', 'California'),
 ('Dharmakirti College ', 'Tucson', 'Arizona'),
 ('Dharma Realm Buddhist University ', 'Ukiah', 'California'),
 ('Ewam Buddhist Institute ', 'Arlee', 'Montana'),
 ('Institute of Buddhist Studies ', 'Berkeley', 'California'),
 ('Maitripa College ', 'Portland', 'Oregon'),
 ('University of the West ', 'Rosemead', 'California'),
 ('Won Institute of Graduate Studies ', 'Glenside', 'Pennsylvania')]
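
* `re.findall()` gives back tuples, so the group names are lost. If you want labeled dicts instead, one option is `re.finditer()` plus `groupdict()`. This is a sketch: the two sample lines are an assumption about what the file looks like, written to fit the pattern, rather than the real contents of `datasets/buddhist.txt`.

```python
import re

# Hypothetical sample in the shape the pattern expects
sample = (
    "Maitripa College – located in Portland, Oregon\n"
    "Dharmakirti College – located in Tucson, Arizona\n"
)
pattern = r"(?P<name>.*)(?:[–])(?: located in )(?P<city>\w*)(?:, )(?P<state>\w*)"

# One dict per match; strip() removes the trailing space left on each name
records = [
    {key: value.strip() for key, value in m.groupdict().items()}
    for m in re.finditer(pattern, sample)
]

for record in records:
    print(record)
# {'name': 'Maitripa College', 'city': 'Portland', 'state': 'Oregon'}
# {'name': 'Dharmakirti College', 'city': 'Tucson', 'state': 'Arizona'}
```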

![](https://imgs.xkcd.com/comics/regular_expressions.png)