# Meet Regular Expressions

## What?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns.
- With RegEx patterns we can:
    - Test whether a string matches a pattern
    - Find out whether the pattern matches anywhere in the string
    - Modify and split strings in various ways
    
### `re` library functions
- `re.search` scans through a string, looking for any location where the RE matches.
- `re.findall` finds all substrings where the RE matches and returns them as a list.
- `re.split` splits a string on a given regex pattern, removing that pattern. The result is a list of strings.
- `re.sub` allows us to match a regex and substitute in a new substring for the match.
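
The four functions above can be sketched on one short string. This is a minimal illustration; the sample text and variable names are assumptions made up for this example and are not part of the lesson's data.

```python
import re

# Hypothetical sample text for illustration
text = "red apple, green apple, yellow banana"

# re.search: the first location where the pattern matches (or None)
first = re.search(r"apple", text)

# re.findall: every non-overlapping match, returned as a list of strings
all_apples = re.findall(r"apple", text)

# re.split: break the string wherever the pattern matches
parts = re.split(r", ", text)

# re.sub: replace every match with a new substring
swapped = re.sub(r"apple", "pear", text)

print(first.span())
print(all_apples)
print(parts)
print(swapped)
```

Note that `re.sub` returns a new string; the original `text` is left unchanged.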


## So What?
- Power + precision
    - Cost is learning something new and potentially unfamiliar.
    - Payoff is a language that works with any other programming language to operate on text and character patterns.
- Regular Expressions are cross platform and available in many programming languages and environments:
    - Command line tools (Linux, Windows, Mac, etc...)
    - Python
    - SQL flavors offer RegEx
    - Java (Scala/Clojure)
    - Other languages like Julia, Ruby, PHP, C#, etc...
    - Like SQL, there are differences between some of the different RegEx implementations, but if you know your RegEx, you can bring value in many environments.

## When is Regex the right tool?
- If you can solve the problem with built-in string methods, use those instead
- If you need more capability than the built-in methods offer, reach for regex
- If you're parsing HTML, JSON, or XML, use a tool built for those formats
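
As a rough sketch of that decision, compare a fixed-substring check (no regex needed) with a variable-length pattern (where regex earns its keep). The sample `line` and the variable names are hypothetical.

```python
import re

# Hypothetical log line for illustration
line = "order-1042 shipped; order-77 pending"

# A fixed substring check needs no regex at all
has_order = "order-" in line

# A fixed prefix followed by a variable number of digits is awkward
# with string methods alone, but short with a regex
order_ids = re.findall(r"order-\d+", line)

print(has_order)
print(order_ids)
```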

## Now What?
- We'll start simple by writing regex patterns to match literal characters.
- Then we will introduce metacharacters, which have special meaning and functionality.

## Key Concepts
- The RegEx metacharacters `. ^ $ * + ? { } [ ] \ | ( )` have special meanings. 
- Square brackets create a "character class". 
    - Character classes allow us to specify many OR operations
    - For example, `r"[aeiou]"` matches any lowercase vowel character. Identical to `r"a|e|i|o|u"`
    - `r"[a-z]"` matches lowercase a through z.
- Most metacharacters lose their special meaning inside the character class square brackets `[]`. The exceptions are `^` (negates the class when it comes first), `-` (forms a range between two characters), `]`, and `\`.
- Outside of a character class, if you need to match a metacharacter literally, put a `\` in front of it. For example, `r"\+"` matches the literal `+` character.
- RegEx has characters for special sequences:
    - `\d` matches any decimal digit. For ASCII text, equivalent to `[0-9]`
    - `\D` matches any non-digit character. For ASCII text, equivalent to `[^0-9]`
    - `\s` matches any whitespace character, such as a space, tab, carriage return, or newline
    - `\w` matches any alphanumeric character or underscore. For ASCII text, equivalent to `[0-9a-zA-Z_]`
    - `\W` matches any character that is not alphanumeric or an underscore. For ASCII text, equivalent to `[^a-zA-Z0-9_]`
- `.` matches any character except a newline
- Repetition:
    - `*` matches zero or more of the previous pattern
    - `+` matches 1 or more of the previous pattern
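- `?` after a pattern means that pattern is optional (zero or one occurrence)
- Negation: `[^abc]` matches any single character except `a`, `b`, or `c`
- Anchors
    - `^` matches at the start of the string
    - `$` matches at the end of the string
    - `\b` matches at a word boundary
- Groups
    - `(a)` captures whatever the pattern inside the parentheses matches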
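
A few of these concepts in action, as a minimal sketch. The sample strings and variable names are made up for this illustration.

```python
import re

# Escaping: \+ matches a literal plus sign
sums = re.findall(r"\d\+\d", "1+2 and 3+4")

# \D+ matches runs of non-digit characters
words = re.findall(r"\D+", "abc123def")

# \s+ matches runs of whitespace (spaces, tabs, newlines)
fields = re.split(r"\s+", "a  b\tc\nd")

# ? makes the preceding 'u' optional
colors = re.findall(r"colou?r", "color or colour")

# ^ and $ anchor the pattern to the start and end of the string
all_digits = bool(re.search(r"^\d+$", "12345"))

print(sums, words, fields, colors, all_digits)
```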

## How Deep Does RegEx go?
- For challenging strings to match, like email addresses, we recommend using pre-built RegEx specifications like the one in the HTML specification at https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
- With known, good, and proven RegEx patterns like these, you don't need to reinvent things.
- ```r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"```
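
A proven pattern like this one can be wrapped in a small helper and reused. The sketch below copies the pattern from the bullet above; the helper name `looks_like_email` and the sample addresses are hypothetical, chosen only for illustration.

```python
import re

# Pattern copied from the bullet above, split across two raw strings for readability
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

def looks_like_email(candidate):
    """Return True when the candidate string matches the pattern."""
    return EMAIL_RE.search(candidate) is not None

print(looks_like_email("user@example.com"))
print(looks_like_email("not an email"))
```

A match only says the string has the right shape; it does not prove that the address exists or can receive mail.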

In [1]:
import re

# Type "re." and press Tab to see all your options

### Patterns to Match Literals 
> Crawl before you walk

In [3]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [5]:
# We can search for a literal match of the string Verona
# re.search(r"pattern","our subject")
x = re.search(r"Verona", string)
x
# returns a 'match' object

<re.Match object; span=(47, 53), match='Verona'>

In [6]:
# the span returned gives the start and end indexes of the match.
# Consider slicing the string using the span bounds
string[47:53]

'Verona'
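
Rather than copying the span numbers by hand, the match object can supply them. A minimal sketch, using a shortened version of the quote as a hypothetical subject string:

```python
import re

# Shortened subject string for illustration
line = "In fair Verona, where we lay our scene"
m = re.search(r"Verona", line)

# .span() returns the (start, end) pair shown in the match object's repr
start, end = m.span()

# Slicing with those bounds gives the same text as .group()
sliced = line[start:end]
matched = m.group()

print(start, end, sliced, matched)
```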

In [7]:
re.search(r"In fair Verona", string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [8]:
# The string "Leonardo DiCaprio" is not here, so re.search returns None
re.search(r"Leonardo DiCaprio", string)

In [10]:
# re.search returns the first match and only the first match
re.search(r"civil", string)

<re.Match object; span=(126, 131), match='civil'>

In [11]:
# .findall returns all matches
re.findall(r"civil", string)

['civil', 'civil']

In [12]:
# empty list for no matches with .findall
re.findall(r"Claire Danes", string)

[]
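
Because `re.search` returns `None` and `re.findall` returns an empty list when nothing matches, code that uses the result usually needs a guard. A minimal sketch; the helper name `first_match` is hypothetical.

```python
import re

def first_match(pattern, subject):
    """Return the matched text, or None when the pattern is absent."""
    m = re.search(pattern, subject)
    if m is None:
        # Calling m.group() here would raise AttributeError
        return None
    return m.group()

found = first_match(r"Verona", "In fair Verona")
missing = first_match(r"Leonardo DiCaprio", "In fair Verona")

# An empty list is falsy, so findall results can be tested directly
has_claire = bool(re.findall(r"Claire Danes", "In fair Verona"))

print(found, missing, has_claire)
```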

In [13]:
re.search(r"Two", string)

<re.Match object; span=(0, 3), match='Two'>

In [15]:
# Are computers particular on specifics?
re.search(r"two", string)
# lower case 'two' isn't found

In [17]:
# The re.IGNORECASE flag does exactly that
re.search(r"two", string, re.IGNORECASE)
# plan on using IGNORECASE frequently

<re.Match object; span=(0, 3), match='Two'>

In [18]:
re.search(r"A","aaaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [19]:
re.search(r"Aaaaa","aaaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 5), match='aaaaa'>

## Using `|` for a logical OR to open opportunities
- We can use `|` with literal characters or other regular expression patterns

In [20]:
# OR
# Findall returns all matches 
re.findall(r"gray|grey", "I can't remember if you spell grey gray or gray like grey!")

['grey', 'gray', 'gray', 'grey']

In [21]:
# re.search returns only the first match
re.search(r"orange|apple", "I like both apples and oranges")

<re.Match object; span=(12, 17), match='apple'>

In [22]:
re.findall(r"this|that", "this that and the other")

['this', 'that']

In [23]:
# has a vowel, anywhere
re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [76]:
re.findall(r"[aeiou]", "banana", re.IGNORECASE)

['a', 'a', 'a']

In [24]:
# has a vowel, anywhere
re.findall(r"a|e|i|o|u", "banana", re.IGNORECASE)

['a', 'a', 'a']

In [29]:
# caret is 'starts-with'
# . is any character
# * is zero or more
re.search(r"^b.*", "bananarama")

<re.Match object; span=(0, 10), match='bananarama'>

In [30]:
# the * is "greedy"--finds the largest possible match
re.search(r"^b.*", "bananarama pajama")

<re.Match object; span=(0, 17), match='bananarama pajama'>
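
To see what "greedy" means, it helps to contrast it with the non-greedy form `*?`, which stops at the earliest point that still lets the pattern match. The `*?` quantifier is not used elsewhere in this lesson; it appears here only for contrast.

```python
import re

subject = "bananarama pajama"

# Greedy: .* grabs as much as possible, so the match runs to the last 'a'
greedy = re.search(r"b.*a", subject).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the first 'a'
lazy = re.search(r"b.*?a", subject).group()

print(greedy)
print(lazy)
```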

In [78]:
# Match 'b' then one or more word characters
# \w means alphanumeric or underscore
# + means 1 or more of the preceding pattern
# when the pattern hits the " " before pajama, we're done
re.search(r"^b\w+", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [80]:
# match b followed by 3 of any character
re.search(r"b.{3}", 'hello bananarama pajama')

<re.Match object; span=(6, 10), match='bana'>

In [31]:
# without the *, the '.' matches exactly one character
re.search(r"^b.", "bananarama pajama")
# here we only got 'ba'

<re.Match object; span=(0, 2), match='ba'>

In [87]:
# only letters not letters/numbers/_
re.search(r"[a-zA-Z]*", 'stuff and things and 123')

<re.Match object; span=(0, 5), match='stuff'>

In [84]:
re.findall(r"[a-zA-Z]*", '42 stuff and things and 123')

['', '', '', 'stuff', '', 'and', '', 'things', '', 'and', '', '', '', '', '']

In [86]:
# [a-zA-Z]+ finds every sequence made up only of letters
re.findall(r"[a-zA-Z]+", '42 stuff a*****nd things and 123')

['stuff', 'a', 'nd', 'things', 'and']

In [36]:
# no '^' anchor here, so the match can start anywhere in the string
re.search(r"b.", "hello bananarama pajama")
# with the '^' we would get nothing, because the string starts with 'h'
# without the '^', we got 'ba'

<re.Match object; span=(6, 8), match='ba'>

In [42]:
# [^b] matches any single character except 'b'
# re.findall(r"[^b]", "hello bananarama pajama")
# returns every character in the string except the 'b'

In [44]:
# starts with
# anything
# ends with
re.search(r"^b.*rama", 'bananarama pajama')

<re.Match object; span=(0, 10), match='bananarama'>

In [47]:
# starts with
# anything
# ends with
re.search(r".*jama$", 'bananarama pajama')

<re.Match object; span=(0, 17), match='bananarama pajama'>

In [49]:
# anything
# followed by 'rama'
re.search(r".*rama", 'bananarama pajama')

<re.Match object; span=(0, 10), match='bananarama'>

In [50]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w", "abc123")

<re.Match object; span=(0, 1), match='a'>

In [51]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w\w\w", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [52]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w\w\w\w\w\w", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [53]:
# seven \w will match seven of any [a-zA-Z0-9_]
re.search(r"\w\w\w\w\w\w\w", "abc123")
# returns None: the string has only six characters

In [54]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w*", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [55]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w{3}", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [90]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w{2,6}", "abc123def456")
# {2,6} is greedy, so it takes as many as it can: the first six characters

<re.Match object; span=(0, 6), match='abc123'>

In [58]:
# {n,}
re.search(r"\w{1,}", "abc123 is thee place to be")

<re.Match object; span=(0, 6), match='abc123'>

In [64]:
# {n,m} matches between n and m times
re.findall(r"\w{1,6}", "abc123 is the place to be bananaramapajama")

['abc123', 'is', 'the', 'place', 'to', 'be', 'banana', 'ramapa', 'jama']

In [65]:
# {n,m} matches between n and m times
# the trailing space means each run of 1-6 \w characters must be followed by a space
re.findall(r"\w{1,6} ", "abc123 is the place to be bananaramapajama")

['abc123 ', 'is ', 'the ', 'place ', 'to ', 'be ']

In [62]:
# {n,}
re.findall(r"\w{1,}", "abc123 is the place to be")

['abc123', 'is', 'the', 'place', 'to', 'be']

In [61]:
# r"\w+" is the same as r"\w{1,}"
re.findall(r"\w+", "abc123 is thee place to be")

['abc123', 'is', 'thee', 'place', 'to', 'be']

In [69]:
# the '.' says to find one of anything
# 3 digits then a single any character and then 4 digits
re.search(r"[0-9]{3}.[0-9]{4}", "226-3232")

<re.Match object; span=(0, 8), match='226-3232'>

In [70]:
# the '.' also matches a literal period
re.search(r"[0-9]{3}.[0-9]{4}", "226.3232")

<re.Match object; span=(0, 8), match='226.3232'>

In [93]:
# With a required delimiter, the search fails when the delimiter is missing:
# re.search(r"[0-9]{3}.[0-9]{4}", "2263232")
# But what if the delimiter is optional?
# The question mark metacharacter makes the thing to its left optional:
re.search(r"[0-9]{3}.?[0-9]{4}", "2263232")

<re.Match object; span=(0, 7), match='2263232'>

In [75]:
# What if the delimiter is repeated?
# The star allows zero or more of the thing to its left
re.search(r"[0-9]{3}.*[0-9]{4}", "226-----3232")

<re.Match object; span=(0, 12), match='226-----3232'>
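
One caveat with the patterns above: `.` accepts any character as the delimiter, including letters. A character class narrows that down. The sketch below compares the two; the candidate strings and variable names are made up for illustration.

```python
import re

loose = r"[0-9]{3}.?[0-9]{4}"       # '.' accepts any character as the delimiter
strict = r"[0-9]{3}[-.]?[0-9]{4}"   # only '-', '.', or no delimiter at all

candidates = ["226-3232", "226.3232", "2263232", "226x3232"]

# Map each candidate to (matches loose pattern, matches strict pattern)
results = {
    c: (bool(re.search(loose, c)), bool(re.search(strict, c)))
    for c in candidates
}

for candidate, outcome in results.items():
    print(candidate, outcome)
```

Only the last candidate differs: the loose pattern accepts the `x`, while the strict pattern rejects it.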

## Using a RegEx pattern to split a string
- The `re.split` method returns a list of strings
- The matching substring is removed
- We can split on any regex pattern, not only character literals

In [94]:
# base python:
"210-226-3232".split('-')

['210', '226', '3232']

In [95]:
# Split the phone number on the hyphen
re.split(r"-", "210-226-3232")

['210', '226', '3232']

In [97]:
# re.split accepts alternatives, which str.split cannot: here we split on a hyphen or a space
re.split(r"-| ", "210 226-3232")

['210', '226', '3232']

In [98]:
# Splits the string on the space character
# A literal space needs no escaping
re.split(r" ", "this that and the other")

['this', 'that', 'and', 'the', 'other']
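
Splitting on a pattern rather than a literal is where `re.split` goes beyond `str.split`. A minimal sketch with a hypothetical, inconsistently delimited string:

```python
import re

# Hypothetical input with mixed delimiters and stray spaces
messy = "apples, pears;plums ,  figs"

# Split on a comma or semicolon, absorbing any whitespace around it
fruits = re.split(r"\s*[,;]\s*", messy)

print(fruits)
```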

In [99]:
# Parse these songs into a dataframe containing 2 columns: artist_name and song_name
# Hint: break the string into an array of strings that hold each song/artist record
songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"
songs
# probably split on a comma first, then on _-_ (hint for exercise)
# re.method(pattern, subject_string, re.IGNORECASE or other)

"Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"

## [Character Classes]
- Square brackets make character classes 
- Character classes provide OR behavior
- In a character class, a leading `^` works as a "none of" operator
- Most metacharacters match their literal character when inside of square brackets for a character class
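
The points above can be checked with a short sketch. The sample strings and variable names are made up for this illustration.

```python
import re

expression = "3.14 + 2 * 5"

# Inside a class, '.', '+', and '*' are ordinary characters
ops = re.findall(r"[.+*]", expression)

# A leading ^ negates the class: anything that is not a digit or a space
non_digits = re.findall(r"[^0-9 ]", expression)

# A ^ that is not first in the class is just a literal caret
carets = re.findall(r"[a^]", "a^b")

print(ops)
print(non_digits)
print(carets)
```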

In [100]:
# has a vowel, anywhere

re.search(r"[aeiou]", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [101]:
# The square brackets match either 'a' or 'e' in that position
re.findall(r"gr[ae]y", "Some people spell gray like grey")

['gray', 'grey']

In [102]:
# has a vowel, anywhere

re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [104]:
# search a single vowel
re.search(r"^[aeiou]{1}$", "a", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [105]:
# not multiple
re.search(r"^[aeiou]{1}$", "ae", re.IGNORECASE)

In [143]:
# Is a vowel

assert bool(re.search(r"^[aeiou]{1}$", "a", re.IGNORECASE)) == True
assert bool(re.search(r"^[aeiou]{1}$", "aaaa", re.IGNORECASE)) == False

In [106]:
# is only vowels

re.search(r"^[aeiou]*$", "aaeeeaa")

<re.Match object; span=(0, 7), match='aaeeeaa'>

In [107]:
# has a p or q, anywhere
re.search(r"p|q", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [108]:
# has a p or q, anywhere
re.search(r"[pq]", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [109]:
# is p or q
re.search(r"^[pq]{1}$", "q", re.IGNORECASE)

<re.Match object; span=(0, 1), match='q'>

In [113]:
# is only ps and qs
re.search(r"^[pqPQ]*$", "pqpqpqpPQQQQQQQQp")

<re.Match object; span=(0, 17), match='pqpqpqpPQQQQQQQQp'>

In [114]:
re.search(r"^[pq]*$", "b3qwpeop")

In [115]:
# is only Ps and Qs
assert bool(re.search(r"^[pqPQ]*$", "pqpqpqpPQQQQQQQQp")) == True
assert bool(re.search(r"^[pq]*$", "b3qwpeop")) == False

In [116]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

re.findall(r"civil\s.{5}", string)


['civil blood', 'civil hands']

In [117]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

re.findall(r"civil\s[a-z]+", string)


['civil blood', 'civil hands']

## Repetition characters and Special Sequences
> Walk before you run

- `.` matches any single character except a newline
- `*` means zero or more of the preceding pattern
- `+` means one or more of the preceding pattern
- `\d` matches any decimal digit. For ASCII text, equivalent to `[0-9]`
- `\D` matches any non-digit character. For ASCII text, equivalent to `[^0-9]`
- `\s` matches any whitespace character, such as a space, tab, carriage return, or newline
- `\w` matches any alphanumeric character or underscore. For ASCII text, equivalent to `[0-9a-zA-Z_]`
- `\W` matches any character that is not alphanumeric or an underscore. For ASCII text, equivalent to `[^a-zA-Z0-9_]`
- `{n}` exactly n repetitions of the preceding pattern
- `{n,}` n or more repetitions of the preceding pattern
- `{n,m}` between n and m repetitions of the preceding pattern (no space after the comma)
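
The three brace quantifiers can be compared side by side. This is a minimal sketch; the sample string and variable names are made up, and `\b` (word boundary) is used so that each number is considered as a whole.

```python
import re

# Hypothetical sample: numbers of different lengths
digits = "7 42 365 2024 86400"

exactly_three = re.findall(r"\b\d{3}\b", digits)    # {n}: exactly 3 digits
three_or_more = re.findall(r"\b\d{3,}\b", digits)   # {n,}: 3 or more digits
two_to_four = re.findall(r"\b\d{2,4}\b", digits)    # {n,m}: 2 to 4 digits

# With a space after the comma, Python treats the braces as literal text,
# so this does not behave as a quantifier
spaced = re.search(r"\d{2, 4}", "2024")

print(exactly_three)
print(three_or_more)
print(two_to_four)
print(spaced)
```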

In [119]:
# without the \b word boundary
re.search(r"o\w+", "do you like apples or oranges?")

<re.Match object; span=(4, 6), match='ou'>

In [120]:
# \b means word boundary
# any word that starts with o
re.search(r"\bo\w+", "do you like apples or oranges?")

<re.Match object; span=(19, 21), match='or'>

In [121]:
# \b means word boundary
# any word that starts with o
re.findall(r"\bo\w+", "do you like apples or oranges?")

['or', 'oranges']

In [122]:
# \b means word boundary
# any word that starts with o
# without the word boundary, we get the "ou" from "you"
re.findall(r"o\w+", "do you like apples or oranges?")

['ou', 'or', 'oranges']

In [123]:
# \b means word boundary
# any word that starts with o
re.findall(r"\bo\w+", "do you like apples or oranges?")

['or', 'oranges']
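
Placing `\b` on both sides of a pattern restricts it to whole words. A minimal sketch with a hypothetical sentence:

```python
import re

text = "The cat sat; concatenate the category"

# Without boundaries, 'cat' is also found inside longer words
loose = re.findall(r"cat", text)

# With \b on both sides, only the standalone word matches
whole = re.findall(r"\bcat\b", text)

print(len(loose), len(whole))
```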

## Groups

In [124]:
sentence = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''.strip()
sentence

'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [126]:
ip_re = r'\d+(\.\d+){3}'

match = re.search(ip_re, sentence)
match[0]

'123.123.123.123'
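
One thing to watch for: when a pattern contains a capturing group, `re.findall` returns the group's text instead of the whole match. The sketch below reuses the IP pattern from the cell above and compares it with a non-capturing `(?:...)` version; the shortened `sentence` and variable names are chosen for illustration.

```python
import re

sentence = "Our ip address is 123.123.123.123 (maybe)."

capturing = r"\d+(\.\d+){3}"        # same pattern as ip_re above
non_capturing = r"\d+(?:\.\d+){3}"  # (?:...) groups without capturing

# With a capturing group, findall returns only the last repeat of the group
group_only = re.findall(capturing, sentence)

# With a non-capturing group, findall returns the whole match
whole_match = re.findall(non_capturing, sentence)

print(group_only)
print(whole_match)
```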

In [127]:
# simplified for demonstration; a real regex for parsing URLs would be much more
# complex
url_re = r'(https?)://(\w+)\.(\w+)'

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain:   {domain}
tld:      {tld}
''')


protocol: https
domain:   codeup
tld:      com



In [129]:
# .groups() returns the groups
re.search(url_re, sentence).groups()

('https', 'codeup', 'com')

In [132]:
# here, we went out of our way to name the groups
# with (?P<name>...), so each group can be referenced by name
url_re = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(url_re, sentence)

In [133]:
match.groups()

('https', 'codeup', 'com')

In [135]:
# and then be called by the corresponding group name
match.group("domain")

'codeup'

In [136]:
match.group("tld")

'com'

In [138]:
# and also be returned as a dictionary
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

In [139]:
print(f'''
groups: {match.groups()}
referencing a group by name: {match.group('tld')}
group dictionary: {match.groupdict()}
''')


groups: ('https', 'codeup', 'com')
referencing a group by name: com
group dictionary: {'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}
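
Named groups pair well with `re.finditer`, which yields one match object per match, so every URL in a string can be turned into its own dictionary. A minimal sketch; the sample `text` and the second URL are made up for illustration.

```python
import re

url_re = r"(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)"

# Hypothetical text containing two URLs
text = "Visit https://codeup.com or http://example.org for more."

# One dictionary of named groups per match
records = [m.groupdict() for m in re.finditer(url_re, text)]

for record in records:
    print(record)
```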

