# Lesson Regex

# Meet Regular Expressions

## What?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns.
- With RegEx patterns we can:
    - Check whether a string matches a pattern
    - Check whether a pattern matches anywhere in the string
    - Modify and split strings in various ways
    
Python `re` library functions:
- `re.search` scans through a string, looking for the first location where the RE matches.
- `re.findall` finds all non-overlapping substrings where the RE matches; returns a list.
- `re.split` splits a string on a given regex pattern, removing that pattern. The result is a list of strings.
- `re.sub` allows us to match a regex and substitute in a new substring for the match.
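
Here is a minimal sketch of the four functions side by side. The sample text and variable names are made up for illustration.

```python
import re

text = "plums, pears; apples"

# re.search returns a Match object for the first match, or None
first = re.search(r"p\w+", text)
first_word = first.group() if first else None

# re.findall returns every non-overlapping match as a list of strings
all_words = re.findall(r"[a-z]+", text)

# re.split cuts the string wherever the pattern matches and drops the match
pieces = re.split(r"[,;] ", text)

# re.sub replaces each match with the replacement string
dashed = re.sub(r"[,;] ", " - ", text)

print(first_word, all_words, pieces, dashed, sep="\n")
```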


## So What?
- Power + precision
    - Cost is learning something new and potentially unfamiliar.
    - Payoff is a language that works with any other programming language to operate on text and character patterns.
- Regular Expressions are cross platform and available in many programming languages and environments:
    - Command line tools (Linux, Windows, Mac, etc...)
    - Python
    - SQL flavors offer RegEx
    - Java (Scala/Clojure)
    - Other languages like Julia, Ruby, PHP, C#, etc...
    - Like SQL, there are differences between some of the different RegEx implementations, but if you know your RegEx, you can bring value in many environments.

## When is RegEx the right tool or wrong tool?
- If you can solve the problem with built-in string methods in your language, do so.
- If you need more capability than the built-in string methods offer, reach for regex.
- If you're parsing HTML, JSON, or XML, use a tool built for those formats. Regex + HTML/JSON = don't.
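
A rough sketch of that decision, assuming a made-up filename and a made-up JSON record: a string method for the fixed case, a regex for the variable case, and a real parser for structured data.

```python
import json
import re

filename = "report_2021.csv"

# A built-in string method is enough for a fixed suffix
is_csv = filename.endswith(".csv")

# A regex earns its keep when the text varies, e.g. "any 4-digit year"
year_match = re.search(r"\d{4}", filename)
year = year_match.group() if year_match else None

# Structured formats get a purpose-built parser instead of a regex
record = json.loads('{"name": "Ada", "langs": ["Python", "SQL"]}')
langs = record["langs"]

print(is_csv, year, langs)
```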

## Now What?
- We'll start simple by writing regex patterns to match literal characters.
- Then we will introduce metacharacters, which have special meaning and functionality.

## Key Concepts
- The RegEx metacharacters `. ^ $ * + ? { } [ ] \ | ( )` have special meanings. 
- Square brackets create a "character class". 
    - Character classes allow us to specify many OR operations
    - For example, `r"[aeiou]"` matches any lowercase vowel character. Identical to `r"a|e|i|o|u"`
    - `r"[a-z]"` matches lowercase a through z.
- Most metacharacters are not active inside the character class square brackets `[]`. The exceptions are `^` (negates the class when it comes first), `-` (forms a range between two characters), `]`, and `\`.
- Outside of a character class `[]`, if you need to match a metacharacter literally, put a `\` in front of that character. `r"\+"` will match the literal `+` character.
- RegEx has special sequences and characters:
    - `.` matches any character except a newline
    - `\d` matches any digit. For ASCII text, equivalent to `[0-9]`
    - `\D` matches any non-digit character. For ASCII text, equivalent to `[^0-9]`
    - `\s` matches any whitespace character such as a space, tab, carriage return, or newline
    - `\w` matches any alphanumeric character or underscore. For ASCII text, equivalent to `[0-9a-zA-Z_]`
    - `\W` matches any character that is not alphanumeric or underscore. For ASCII text, equivalent to `[^a-zA-Z0-9_]`
- Repetition:
    - `*` matches zero or more of the previous pattern
    - `+` matches 1 or more of the previous pattern
- `?` after a pattern means that pattern is optional
- Negation: `[^abc]` matches any single character except "a", "b", or "c"
- Anchors
    - `^` start of the string
    - `$` end of the string
    - `\b` word boundary
- Groups
    - `(a)` captures the text matched inside the parentheses
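
To make the escaping rules concrete, here is a small sketch. The `price` string is invented for the example.

```python
import re

price = "Total: $3.50 (+tax)"

# Unescaped, "." matches any character, so "3.50" also matches "3x50"
loose = re.search(r"3.50", "3x50") is not None

# Escaped metacharacters match themselves
dollars = re.search(r"\$\d+\.\d+", price).group()
plus = re.search(r"\+\w+", price).group()

# Inside a character class these metacharacters are already literal
symbols = re.findall(r"[$+.()]", price)

print(loose, dollars, plus, symbols)
```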

## How Deep Does RegEx go?
![Rabbit Hole](http://www.quickmeme.com/img/cf/cf803d01bef8c3ed6ec0cd879d59ae25ea311d67e1794d11ecf2db41209095a1.jpg)
- For strings that are challenging to match, like email addresses, we recommend using pre-built RegEx patterns such as the one in the HTML specification at https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
- With known, good, and proven RegEx patterns like these, you don't need to reinvent things.
- ```r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"```
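
A minimal sketch of putting that pattern to work, assuming all we need is a True/False answer. `looks_like_email` is a hypothetical helper name, and the pattern is copied from the bullet above and only split across lines for readability.

```python
import re

# Pattern copied from the lesson text (HTML spec "valid e-mail address")
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def looks_like_email(candidate):
    """Return True when the whole string fits the pattern."""
    return EMAIL_RE.search(candidate) is not None


results = {s: looks_like_email(s) for s in ["ada@example.com", "ada@", "not an email"]}
print(results)
```

Compiling the pattern once with `re.compile` keeps the long expression in one place when it is reused.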


In [1]:
import re
import pandas as pd

### Patterns to Match Literals 
> Crawl before you walk

In [2]:
string = 'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [3]:
# We can search for a literal match of the string Verona
# re.search(r"pattern", "our subject")
re.search(r"Verona", string)

<re.Match object; span=(47, 53), match='Verona'>

In [4]:
re.search(r"ancient",string)

<re.Match object; span=(84, 91), match='ancient'>

In [5]:
# the span returned holds the start and end indexes of the match.
# Consider what we get if we slice the string using the span bounds
string[47:53]

'Verona'

In [8]:
string[84:91]

'ancient'

In [9]:
re.search(r"In fair Verona", string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [10]:
# The string "Leonardo DiCaprio" is not here, so re.search returns None
re.search(r"Leonardo DiCaprio", string)
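
Because `re.search` returns `None` when nothing matches, calling `.group()` on the result directly raises an `AttributeError`. A minimal sketch of a guard follows; `first_match` is a hypothetical helper name.

```python
import re

line = "In fair Verona, where we lay our scene"


def first_match(pattern, subject):
    """Return the matched text, or None when the pattern is absent."""
    match = re.search(pattern, subject)
    if match is None:
        return None
    return match.group()


found = first_match(r"Verona", line)
missing = first_match(r"Leonardo DiCaprio", line)
print(found, missing)
```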

In [11]:
re.search(r"me",string)

In [12]:
# re.search returns the first match
re.search(r"civil", string)

<re.Match object; span=(126, 131), match='civil'>

In [14]:
string[126:131]

'civil'

In [18]:
string[144:149]

'civil'

In [13]:
# .findall returns all matches
re.findall(r"civil", string)

['civil', 'civil']

In [26]:
re.search(r"in",string)

<re.Match object; span=(27, 29), match='in'>

In [20]:
re.findall(r"in",string)

['in', 'in']

In [28]:
string[20:29]

' alike in'

In [29]:
re.findall(r"in",string,re.IGNORECASE)

['in', 'In', 'in']

In [21]:
# .findall returns an empty list when there are no matches
re.findall(r'Leonardo Dicaprio', string)

[]

In [22]:
re.search(r'Two', string)

<re.Match object; span=(0, 3), match='Two'>

In [23]:
# Are computers particular on specifics?
re.search(r'two', string)

In [24]:
re.search(r'In', string)

<re.Match object; span=(39, 41), match='In'>

In [25]:
string[39:41]

'In'

In [14]:
# The re.IGNORECASE flag does exactly that
re.search(r'two', string, re.IGNORECASE)

<re.Match object; span=(0, 3), match='Two'>

In [15]:
re.search(r'A', "aaaaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [16]:
re.search(r'Aaaaa', "aaaaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 5), match='aaaaa'>

In [17]:
re.search(r'aa', 'a')

## Using `|` for a logical OR to open opportunities
- We can use `|` with literal characters or other regular expression patterns

In [18]:
# OR
# Findall returns all matches 
re.findall(r'gray|grey', "I can't remember if you spell grey like gray or like grey")

['grey', 'gray', 'grey']

In [19]:
# The .search method returns only the first match
re.search(r"orange|apple", "I like both apples and oranges")

<re.Match object; span=(12, 17), match='apple'>

In [20]:
re.findall(r'this|that', "this, that, this, that")

['this', 'that', 'this', 'that']

In [21]:
# has a vowel, anywhere
re.search(r'a|e|i|o|u', 'banana')

<re.Match object; span=(1, 2), match='a'>

In [22]:
re.search(r'[aeiou]', 'banana')

<re.Match object; span=(1, 2), match='a'>

In [23]:
re.findall(r'[aeiou]', 'banana')

['a', 'a', 'a']

# REGEX symbols ^ . *

In [24]:
# ^ caret is starts-with
# . is any character
# * is zero or more
re.search(r'^b.*', "bananarama")

<re.Match object; span=(0, 10), match='bananarama'>

In [25]:
re.search(r'[^b.*]', "bananarama")

<re.Match object; span=(1, 2), match='a'>

In [26]:
re.search(r'^b.*', "bananarama is my jam")

<re.Match object; span=(0, 20), match='bananarama is my jam'>

In [27]:
re.search(r'^b.*', "my jam is bananarama")

In [28]:
# + means 1 or more times
re.search(r'^b.+', 'b')

In [29]:
# * means 0 or more times, so a lone 'b' still matches
re.search(r'^b.*', 'b')

<re.Match object; span=(0, 1), match='b'>

In [30]:
# .* finds the largest possible match
# technical term is greedy
re.search(r'^b.*', 'bananarama pajama')

<re.Match object; span=(0, 17), match='bananarama pajama'>
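
For contrast, a quick sketch of greedy versus non-greedy matching. The lazy quantifier `*?` is not covered elsewhere in this lesson; it is shown here only to make "greedy" easier to see.

```python
import re

subject = "bananarama pajama"

# Greedy: .* grabs as much as it can, then backs off to the last "a"
greedy = re.search(r"b.*a", subject).group()

# Lazy: .*? grabs as little as it can, so it stops at the first "a"
lazy = re.search(r"b.*?a", subject).group()

print(greedy)
print(lazy)
```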

In [40]:
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [51]:
re.search(r'^Two.*',string)

<re.Match object; span=(0, 164), match='Two households, both alike in dignity, In fair Ve>

In [52]:
string[0:164]

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'


# Letters alphanumerics  \w +

In [71]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# + means 1 or more of the preceding \w
# when the pattern hits the " " before pajama, we're done
re.search(r'^b\w+', 'bananarama pajama')

<re.Match object; span=(0, 10), match='bananarama'>

In [70]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# + means 1 or more of the preceding \w
re.search(r'^b\w+', 'be cool')

<re.Match object; span=(0, 2), match='be'>

In [72]:
re.search(r'c\w+', 'be cool')

<re.Match object; span=(3, 7), match='cool'>

In [81]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# + means 1 or more of the preceding \w, so a lone "b" does not match
re.search(r'^b\w+', 'b cool')

In [82]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# * means 0 or more of the preceding \w, so a lone "b" matches
re.search(r'^b\w*', 'b cool')

<re.Match object; span=(0, 1), match='b'>

In [83]:
re.search(r'^b.', 'b cool')

<re.Match object; span=(0, 2), match='b '>

In [36]:
# \w* is greedy, but it can only consume word characters (a-zA-Z0-9_)
# \b is a word boundary, so the match stops before the "-"
re.search(r'^b\w*\b', 'bananarama1-pajama')

<re.Match object; span=(0, 11), match='bananarama1'>

In [37]:
# match the character b then any other character
re.search(r'b.', 'bananarama pajama')

<re.Match object; span=(0, 2), match='ba'>

In [38]:
# match the character b then any alphanumeric character
re.search(r'b\w', 'bananarama pajama')

<re.Match object; span=(0, 2), match='ba'>

In [39]:
# match b followed by 3 of any character
re.search(r'b...', 'bananarama pajama')

<re.Match object; span=(0, 4), match='bana'>

In [40]:
# match b followed by 3 of any alphanumeric character
re.search(r'b\w\w\w', 'bananarama pajama')

<re.Match object; span=(0, 4), match='bana'>

In [41]:
# match b followed by 3 of any character
re.search(r'b.{3}', 'bananarama pajama')

<re.Match object; span=(0, 4), match='bana'>

In [42]:
# match b followed by 3 of any alphanumeric character
re.search(r'b\w{3}', 'bananarama pajama')

<re.Match object; span=(0, 4), match='bana'>

In [43]:
# [^b] matches any character that ain't b (just as [^abc] is "anything that ain't a or b or c")
re.search(r'[^b]', 'bananarama pajama')

<re.Match object; span=(1, 2), match='a'>

In [44]:
# let's find something that starts with a then has any number of other characters
re.search(r'^a.*', 'apple bananarama pajama')

<re.Match object; span=(0, 23), match='apple bananarama pajama'>

In [45]:
# starts with b
# anything
# ends with a
# no match here: the $ anchor sits before the a, and nothing can follow the end of the string
re.search(r'^b.*$a', 'bananarama pajama')

In [46]:
re.search(r'^b.+a$', 'bananarama pajama')

<re.Match object; span=(0, 17), match='bananarama pajama'>

In [47]:
re.search(r'^b.*a$', 'bananarama pajama')

<re.Match object; span=(0, 17), match='bananarama pajama'>

In [48]:
# starts with
# anything
# ends with 


# Quantifiers { }

In [49]:
# \w matches [a-zA-Z0-9_]
re.search(r'\w{4}', 'abc123')

<re.Match object; span=(0, 4), match='abc1'>

In [50]:
# \w matches [a-zA-Z0-9_]
re.search(r'\w{1,4}', 'abc 123')

<re.Match object; span=(0, 3), match='abc'>

In [51]:
# + metacharacter matches 1 or more of the pattern to the left of that + character
# + is greedy like *
re.search(r'f|F+', 'Fred asked a good question')

<re.Match object; span=(0, 1), match='F'>

In [52]:
# + metacharacter matches 1 or more of the pattern to the left of that + character
# + is greedy like *
re.search(r'f|F\w*', 'Fred asked a good question')

<re.Match object; span=(0, 4), match='Fred'>
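
The two cells above hide a subtlety: `|` splits the whole pattern, so `f|F\w*` means "a lone `f`" OR "`F` followed by word characters". A small sketch of the difference, using a made-up subject string:

```python
import re

subject = "fred and Fred"

# "|" applies to everything on each side: "f" alone, or "F" plus word characters
ungrouped = re.findall(r"f|F\w*", subject)

# A non-capturing group limits the OR to the first letter
grouped = re.findall(r"(?:f|F)\w*", subject)

# A character class says the same thing more compactly
classed = re.findall(r"[fF]\w*", subject)

print(ungrouped, grouped, classed)
```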

In [53]:
# what if we want only letters and not letters + numbers + _ character
re.search(r'[a-zA-Z]*', 'stuff and things and 123')

<re.Match object; span=(0, 5), match='stuff'>

In [54]:
# what if we want only letters and not letters + numbers + _ character
# the [a-zA-Z]+ is finding any and all sequences that are only [a-zA-Z]
re.findall(r'[a-zA-Z]+', '42 $stuff a****nd things and 123')

['stuff', 'a', 'nd', 'things', 'and']

In [84]:
# what if we want only letters and not letters + numbers + _ character
re.findall(r'[a-zA-Z]*', '42 $stuff a****nd things and 123')

['',
 '',
 '',
 '',
 'stuff',
 '',
 'a',
 '',
 '',
 '',
 '',
 'nd',
 '',
 'things',
 '',
 'and',
 '',
 '',
 '',
 '',
 '']

In [85]:
re.search(r'^.\w\D', ';ijlasdfj;j32rjup83jrpoiawhefposdih')

<re.Match object; span=(0, 3), match=';ij'>

## Difference between `*` and `+`
- `+` needs at least one occurrence of the preceding pattern to make a match
- `*` matches zero or more occurrences of the preceding pattern

In [56]:
# Match F followed by one or more word characters (a-zA-Z0-9_)
# Does not match F on its own
re.search(r'F\w+', 'F money asked a great question. Great job Fred')

<re.Match object; span=(42, 46), match='Fred'>

In [87]:
# Match F then zero or more word characters (a-zA-Z0-9_)
re.search(r'F\w*', 'F money asked a great question. Great job Fred')

<re.Match object; span=(0, 1), match='F'>

## Range { }

In [58]:
# {n,} matches n or more times
re.findall(r'[a-zA-Z]{1,}', 'abc123 is the place to be')

['abc', 'is', 'the', 'place', 'to', 'be']

In [89]:
# {n,} matches n or more times
re.findall(r'[a-zA-Z]{2,}', 'abc123 is the place to be')

['abc', 'is', 'the', 'place', 'to', 'be']

In [90]:
# {n,} matches n or more times
re.findall(r'[a-zA-Z]{3,}', 'abc123 is the place to be')

['abc', 'the', 'place']

In [91]:
# {n,m} matches between n and m times; with n = 0 we also get empty matches
re.findall(r'[a-zA-Z]{0,2}', 'abc123 is the place to be')

['ab',
 'c',
 '',
 '',
 '',
 '',
 'is',
 '',
 'th',
 'e',
 '',
 'pl',
 'ac',
 'e',
 '',
 'to',
 '',
 'be',
 '']

In [59]:
# {n,} matches n or more times
# [a-zA-Z]+ means one or more letters
# [a-zA-Z]{1,} means the same thing
re.findall(r'[a-zA-Z]+', 'abc123 is the place to be')

['abc', 'is', 'the', 'place', 'to', 'be']

# `.` any character (must be there) and `.?` optional character

In [60]:
# 3 digits then a single character of any then 4 digits
re.search(r'\d{3}.\d{4}', '574-5860')

<re.Match object; span=(0, 8), match='574-5860'>

In [61]:
# What if the delimiter is optional?
# question mark metacharacter means the thing to the left of the ? is optional
re.search(r'\d{3}.?\d{4}', '57405860')

<re.Match object; span=(0, 8), match='57405860'>
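
Note that `.?` accepts any character as the delimiter. If only certain delimiters should count, a character class narrows it down. A sketch with made-up sample numbers:

```python
import re

numbers = ["574-5860", "5745860", "574 5860", "574x5860"]

# .? allows any single character (or nothing) between the digit groups
any_delim = re.compile(r"^\d{3}.?\d{4}$")

# [-. ]? allows only a hyphen, a period, or a space (or nothing)
strict_delim = re.compile(r"^\d{3}[-. ]?\d{4}$")

loose_hits = [n for n in numbers if any_delim.search(n)]
strict_hits = [n for n in numbers if strict_delim.search(n)]

print(loose_hits)
print(strict_hits)
```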

## Using a RegEx pattern to split a string
- The `re.split` method returns a list of strings
- The matching substring is removed
- We can split on any regex pattern, not only character literals

In [62]:
"555-555-5555".split('-')

['555', '555', '5555']

In [63]:
# Split the phone number on the hyphen
re.split(r'-', '555-555-5555')

['555', '555', '5555']

In [64]:
# Split the phone number on a hyphen or a space character
re.split(r'-| ', '555 555 5555')

['555', '555', '5555']
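
Where `re.split` pulls ahead of `str.split` is inconsistent delimiters. A short sketch, assuming a made-up string with mixed separators:

```python
import re

messy = "alpha, beta;gamma   delta"

# Split on a comma or semicolon with optional surrounding whitespace,
# or on a run of whitespace by itself
parts = re.split(r"\s*[,;]\s*|\s+", messy)

print(parts)
```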

In [65]:
# Splits the string on the space character


In [66]:
# Parse these songs into a dataframe containing 2 columns: artist_name and song_name
# Hint: break the string into an array of strings that hold each song/artist record


In [67]:
# re.method(pattern, subject_string, re.IGNORECASE)

# re.split
# re.search
# re.findall

## [Character Classes]
- Square brackets make character classes 
- Character classes provide OR behavior
- In a character class, a leading `^` works as a "none of" operator
- Most metacharacters match their literal character when inside the square brackets of a character class
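
A quick sketch of those rules on a made-up sample string:

```python
import re

sample = "a.b*c^d-e"

# A leading ^ negates the class: everything except lowercase letters
non_letters = re.findall(r"[^a-z]", sample)

# . and * are literal inside a class; ^ is literal when it is not first,
# and - is literal when it comes last
literal_symbols = re.findall(r"[.*^-]", sample)

# Between two characters, - forms a range
a_to_c = re.findall(r"[a-c]", sample)

print(non_letters, literal_symbols, a_to_c)
```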

In [68]:
# has a vowel, anywhere
re.search(r'[aeiou]', 'banana')

<re.Match object; span=(1, 2), match='a'>

In [69]:
# gray or grey
re.findall(r'gr[ae]y', 'some people spell it gray, some spell it grey')

['gray', 'grey']

In [70]:
# Is only a single vowel: the anchors keep anything else out
re.search(r'^[aeiou]$', 'a')

<re.Match object; span=(0, 1), match='a'>

In [71]:
# is only vowels
re.search(r'^[aeiou]*$', 'aeioai')

<re.Match object; span=(0, 6), match='aeioai'>

In [104]:
# has a p or q, anywhere
re.search(r'[pq]', 'psq6')

<re.Match object; span=(0, 1), match='p'>

In [73]:
# has a p or q, anywhere


In [74]:
# is p or q


In [75]:
# is only Ps and Qs


In [76]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

# find all the occurrences of civil followed by the word immediately after "civil"



'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [115]:
re.findall(r'civil\s[a-z]*',string)

['civil blood', 'civil hands']

## Repetition characters and Special Sequences
> Walk before you run

- `.` matches any single character (except a newline)
- `*` means zero or more of the preceding pattern
- `+` means one or more of the preceding pattern
- `\b` is a word boundary anchor
- `\d` matches any digit. For ASCII text, equivalent to `[0-9]`
- `\D` matches any non-digit character. For ASCII text, equivalent to `[^0-9]`
- `\s` matches any whitespace character such as a space, tab, carriage return, or newline
- `\w` matches any alphanumeric character or underscore. For ASCII text, equivalent to `[0-9a-zA-Z_]`
- `\W` matches any character that is not alphanumeric or underscore. For ASCII text, equivalent to `[^a-zA-Z0-9_]`
- `{n}` exactly n times
- `{n,}` n or more times
- `{n,m}` n to m times (written without a space after the comma)
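
A small sketch of what the `\b` anchor buys us, using a fragment of the sample text from earlier in the lesson:

```python
import re

phrase = "civil blood makes civil hands unclean"

# Without \b the pattern can start mid-word, so "unclean" contributes "clean"
no_boundary = re.findall(r"c\w+", phrase)

# With \b the match must begin at the start of a word
with_boundary = re.findall(r"\bc\w+", phrase)

print(no_boundary)
print(with_boundary)
```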

In [77]:
# world without \b word boundary


In [78]:
# \b means word boundary
# any word that starts with o


In [79]:
# \b means word boundary
# any word that starts with o


In [80]:
# \b means word boundary
# any word that starts with o
# without the word boundary, we get the "ou" from "you"


In [81]:
# \b means word boundary
# any word that starts with o


## Groups

In [82]:
sentence = 'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [83]:
sentence

'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [84]:
url_re = r'(https?)://(\w+)\.(\w+)'

re.search(url_re, sentence).groups()

('https', 'codeup', 'com')

In [85]:
# simplified for demonstration, a real regex to parse urls would be much more
# complex
url_re = r'(https?)://(\w+)\.(\w+)'

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain:   {domain}
tld:      {tld}
''')


protocol: https
domain:   codeup
tld:      com



## A Reflection on Captured Groups
- After matching the first group or two, the need to be _highly_ specific with subsequent groups likely starts to decrease (unless the forms in the source text are ambiguous).
- For example, matching an arbitrary user agent string on its own might prove challenging, and matching specific user agents even more so: https://jonlabelle.com/snippets/view/yaml/browser-user-agent-regular-expressions. But suppose that user agent string lives in a log line where, to its left, we've already matched groups for the method type (GET|POST) and the timestamp. Then writing the cleanest user agent regex ever matters less than simply matching any string up to, but not including, the bytes-transmitted group to its right.
- We can sometimes rely on the regularity of forms, especially with multiple captured groups, to capture much more challenging patterns that sit between other, more easily discerned pattern groups.
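
A sketch of that idea follows. The log line layout here is hypothetical, invented for this example, and real logs will differ. The user agent group is deliberately loose (`.+`) because the method and timestamp groups on its left and the bytes group on its right pin it in place.

```python
import re

# Hypothetical log line: method, timestamp, quoted user agent, bytes transmitted
log_line = 'GET 2019-04-16T19:34:42 "Mozilla/5.0 (X11; Linux x86_64) Firefox/66.0" 3561'

# \S+ is a run of non-whitespace characters (the complement of \s)
log_re = r'^(GET|POST) (\S+) "(.+)" (\d+)$'

method, timestamp, user_agent, bytes_sent = re.search(log_re, log_line).groups()

print(f'''
method:     {method}
timestamp:  {timestamp}
user agent: {user_agent}
bytes:      {bytes_sent}
''')
```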