# Guide to Regex

from Chapter 8: Strings and Text Data
"Pandas for Everyone" by Daniel Y. Chen (2018)

* [Data types](#7.1)
* [Converting data types](#7.2)

In [1]:
%%html 
<!-- HTML CODE BLOCK TO SHIFT JUPYTER TABLES TO THE LEFT -->
<style>
table {float:left}
</style>

In [23]:
import pandas as pd
from pathlib import Path
import re
import seaborn as sns

path_data = Path.cwd() / 'data'
tips = sns.load_dataset('tips')

## Strings
In Python, a string (`str`) is a sequence of characters, created by enclosing text in a matching pair of single or double quotes.

In [2]:
word = 'grail'
sent = 'a scratch'

### Subsetting and slicing strings

A string behaves like a container of characters, so it can be subset and sliced like other Python containers (e.g., `list`, `Series`).

**Index positions for string 'grail'**

| index | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| string | g | r | a | i | l |
| negative index | -5 | -4 | -3 | -2 | -1 |

In [4]:
# get 1st character in string
word[0]

'g'

In [5]:
# slicing notation (get characters from 0 up to, but NOT including 3)
word[:3]

'gra'

In [6]:
# negative index starts count from end of container
word[-1]

'l'

In [7]:
# slicing notation with negative indices (get characters from -5 up to, but NOT including -4)
word[-5:-4]

'g'

In [11]:
# slicing notation to INCLUDE last character
word[2:]

'ail'

In [16]:
# specify slicing interval (start from 0, get every 2nd character)
word[::2]

'gal'

In [22]:
# join() takes a container (e.g., a list) and returns a new string, joining all elements with the calling string as the separator
print(' '.join(['40°',  '46\'', '52.837"', 'N', '73°', '58\'', '26.302"', 'W']))

40° 46' 52.837" N 73° 58' 26.302" W


## Regular expressions (regex)

Regular expressions allow you to search for a *pattern* of text, e.g., email addresses have an @ symbol in the middle, and US social security numbers have 9 digits and 2 hyphens.

The typical workflow: pass a pattern to `re.compile()`, which returns a `Regex` object (shown as `re.Pattern` in recent Python versions). Call a method such as `search()` on the `Regex` object and pass it the string to check. The result is a `Match` object, or `None` if the pattern is not found.

This section is adapted from *Automate the Boring Stuff with Python*: https://automatetheboringstuff.com/2e/chapter7/

In [117]:
# searches string for 3 consecutive digits followed by hyphen, 3 more consecutive digits, another hyphen, and 4 more consecutive digits
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phone_regex.search('My number is 415-555-4242.')
print(f'Phone number found: {mo.group()}')

Phone number found: 415-555-4242
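
Because `search()` returns `None` when the pattern is not found, calling `.group()` on the result without checking raises an `AttributeError`. A minimal sketch of a guard, where `find_phone` is a hypothetical helper name introduced only for this illustration:

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Hypothetical helper: return the first phone number in text, or None."""
    mo = phone_regex.search(text)
    # search() returns None when nothing matches, so check before calling .group()
    return mo.group() if mo is not None else None

found = find_phone('My number is 415-555-4242.')
missing = find_phone('No number here.')
print(found)
print(missing)
```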


### Grouping with Parentheses

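Groups are used to separate parts of the matched text. Each set of parentheses in the pattern creates a group, which can be retrieved with the `.group()` method of the `Match` object:
* Passing 0 or no argument returns the entire matched text
* Passing 1 returns the text matched by the 1st set of parentheses, passing 2 returns the text matched by the 2nd set, etc.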

In [63]:
# create Regex object and search string for pattern
phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

# .group() returns entire matched text
mo.group()

'415-555-4242'

In [64]:
# passing 1 to .group() returns string in 1st set of parentheses
mo.group(1)

'415'

In [65]:
# passing 2 to .group() returns string in 2nd set of parentheses
mo.group(2)

'555-4242'

In [66]:
# returns all groups as tuple
mo.groups()

('415', '555-4242')
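
Since `groups()` returns a tuple, its items can be unpacked into separate variables in one assignment. A short sketch, where the variable names `area_code` and `main_number` are chosen only for illustration:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

# groups() returns one item per set of parentheses, in order
area_code, main_number = mo.groups()
print(area_code)
print(main_number)
```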

### Special characters

The following characters have special meanings:  
. ^ $ * + ? { } [ ] \ | ( )  
*To use these characters as part of text pattern, escape with backslash, e.g., `\.`*
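
As a sketch of escaping in practice, the pattern below matches literal parentheses around an area code while still using unescaped parentheses for grouping. The sample text and variable names are assumptions for illustration. The standard library function `re.escape()` is also shown as one way to escape every special character in a literal string.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups
paren_regex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = paren_regex.search('My phone number is (415) 555-4242.')
area = mo.group(1)
number = mo.group(2)
print(area, number)

# re.escape() returns a copy of the string with special characters escaped
escaped = re.escape('1+1=2?')
literal_match = re.search(escaped, 'Is 1+1=2? Yes.')
print(literal_match.group())
```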

#### | character (pipe): matches any one of several given patterns

The | character separates alternative patterns. When more than one alternative occurs in the searched text, `search()` returns whichever occurs first in the text.

In [164]:
# matches 'Batman' OR 'Tina Fey' (1st match returned)
hero_regex = re.compile(r'Batman|Tina Fey')
hero_regex.search('Batman and Tina Fey')

<re.Match object; span=(0, 6), match='Batman'>

In [165]:
# matches 'Batman' OR 'Tina Fey' (1st match returned)
hero_regex.search('Tina Fey and Batman')

<re.Match object; span=(0, 8), match='Tina Fey'>

In [167]:
# matches 'Batman' OR 'Batmobile' OR 'Batcopter' OR 'Batbat'
bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
bat_regex.search('Batmobile lost a wheel')

<re.Match object; span=(0, 9), match='Batmobile'>

In [168]:
# returns only matched text from 1st parentheses group
bat_regex.search('Batmobile lost a wheel').group(1)

'mobile'

#### ? character: matches if part is absent or present once

Whatever precedes the ? character is optional when matching: the match succeeds whether that part is absent (0 times) or present exactly once.

In [169]:
# matches 'Batman' OR 'Batwoman' (the 'wo' group is optional)
bat_regex = re.compile(r'Bat(wo)?man')
bat_regex.search('The Adventures of Batman')

<re.Match object; span=(18, 24), match='Batman'>

In [171]:
# matches 'Batman' OR 'Batwoman' (the 'wo' group is optional)
bat_regex.search('The Adventures of Batwoman')

<re.Match object; span=(18, 26), match='Batwoman'>
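
The same idea applies to the earlier phone number pattern. As a sketch under the assumption that the area code may be missing, the group `(\d\d\d-)` can be made optional so that numbers with or without an area code both match:

```python
import re

# the area code and its trailing hyphen are optional
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phone_regex.search('My number is 415-555-4242').group()
without_area = phone_regex.search('My number is 555-4242').group()
print(with_area)
print(without_area)
```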

#### * character: matches if part is present 0 or more times

Similar to the ? character, but the part preceding the * character can be absent, present once, or repeated any number of times

In [124]:
# matches 'Batman' OR 'Batwoman' OR 'Batwowoman', etc.
bat_regex = re.compile(r'Bat(wo)*man')
text = 'The Adventures of Batman'
mo = bat_regex.search(text)
mo.group()

'Batman'

In [126]:
# matches 'Batman' OR 'Batwoman' OR 'Batwowoman', etc.
text = 'The Adventures of Batwoman'
mo = bat_regex.search(text)
mo.group()

'Batwoman'

In [128]:
# matches 'Batman' OR 'Batwoman' OR 'Batwowoman', etc.
text = 'The Adventures of Batwowowowowowoman'
mo = bat_regex.search(text)
mo.group()

'Batwowowowowowoman'

#### + character: matches if part is present 1 or more times

Similar to * character, but the part preceding + character MUST be present at least once

In [129]:
# matches 'Batwoman' OR 'Batwowoman', etc.
bat_regex = re.compile(r'Bat(wo)+man')
text = 'The Adventures of Batman'
mo = bat_regex.search(text)
mo is None

True

In [130]:
# matches 'Batwoman' OR 'Batwowoman', etc.
bat_regex = re.compile(r'Bat(wo)+man')
text = 'The Adventures of Batwoman'
mo = bat_regex.search(text)
mo.group()

'Batwoman'

In [131]:
# matches 'Batwoman' OR 'Batwowoman', etc.
bat_regex = re.compile(r'Bat(wo)+man')
text = 'The Adventures of Batwowowowoman'
mo = bat_regex.search(text)
mo.group()

'Batwowowowoman'

#### ^ character (caret): matches ONLY IF match occurs at BEGINNING of searched text

The caret forces the match to occur at the beginning of the searched string

In [155]:
# match occurs because searched string begins with 'Hello'
hello_regex = re.compile(r'^Hello')
text = 'Hello, world!'
mo = hello_regex.search(text)
mo.group()

'Hello'

In [157]:
# match DOES NOT occur because searched string DOES NOT begin with 'Hello'
hello_regex = re.compile(r'^Hello')
text = 'hello, world!'
mo = hello_regex.search(text)
mo is None

True

#### $ character: matches ONLY IF match occurs at END of searched text

The dollar sign forces the match to occur at the end of the searched string

In [158]:
# match occurs because searched string ends with digit
end_regex = re.compile(r'\d$')
text = 'Your number is 42'
mo = end_regex.search(text)
mo.group()

'2'

Combine ^ and $ to require that the entire searched string matches the pattern, not just part of it

In [159]:
# match occurs because the entire string consists of digits
whole_regex = re.compile(r'^\d+$')
whole_regex.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [160]:
# match DOES NOT occur because letters in string
whole_regex.search('12345xyz67890') is None

True

In [161]:
# match DOES NOT occur because spacing in string
whole_regex.search('12  34567890') is None

True

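### Matching specific repetitions with braces

A number in braces after a group matches that group repeated exactly that many times, e.g., `(Ha){3}` matches `'HaHaHa'`.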

In [136]:
# matches exactly 3 consecutive repetitions of 'Ha'
ha_regex = re.compile(r'(Ha){3}')
mo = ha_regex.search('HaHaHa')
mo.group()

'HaHaHa'

In [137]:
# match DOES NOT occur because 'Ha' appears only once
mo = ha_regex.search('Ha')
mo is None

True

Can use a range with braces instead of single number:
* `(Ha){2,4}` will match `'HaHa'`, `'HaHaHa'`, or `'HaHaHaHa'`
* `(Ha){2,}` will match 2 or more instances of `(Ha)` group
* `(Ha){,5}` will match 0 to 5 instances of `(Ha)` group
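
A brief sketch of the three range forms above, run against an assumed sample string with six repetitions of `'Ha'`. Each pattern takes as many repetitions as its range allows, and `{,5}` can also succeed with zero repetitions (an empty match):

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

two_to_four = re.compile(r'(Ha){2,4}').search(text).group()
two_or_more = re.compile(r'(Ha){2,}').search(text).group()
up_to_five = re.compile(r'(Ha){,5}').search(text).group()

# number of 'Ha' repetitions in each match
print(len(two_to_four) // 2, len(two_or_more) // 2, len(up_to_five) // 2)

# zero repetitions is allowed by {,5}, so the match is an empty string
empty = re.compile(r'(Ha){,5}').search('xyz').group()
print(repr(empty))
```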

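By default, Python's regular expressions are *greedy*: in ambiguous situations, they match the longest string possible. Adding a question mark after the quantifier, e.g., `{2,4}?`, makes the match *non-greedy* (also called *lazy*), so it matches the shortest string possible.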

In [141]:
# will match 'HaHaHaHa'
greedy_regex = re.compile(r'(Ha){2,4}')
mo = greedy_regex.search('HaHaHaHa')
mo.group()

'HaHaHaHa'

In [142]:
# will match 'HaHa'
nongreedy_regex = re.compile(r'(Ha){2,4}?')
mo = nongreedy_regex.search('HaHaHaHa')
mo.group()

'HaHa'

The question mark has 2 unrelated meanings in regular expressions: declaring a non-greedy match (when it follows a quantifier) or flagging a group as optional (when it follows a group)
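
A small sketch contrasting the two uses of the question mark side by side. The sample strings and variable names are assumptions for illustration:

```python
import re

# ? after a group: the group is optional (present 0 or 1 times)
optional_regex = re.compile(r'Bat(wo)?man')
optional_match = optional_regex.search('Batwoman').group()

# ? after a quantifier: the quantifier becomes non-greedy
greedy_match = re.compile(r'(wo)+').search('Batwowowoman').group()
nongreedy_match = re.compile(r'(wo)+?').search('Batwowowoman').group()

print(optional_match)
print(greedy_match)
print(nongreedy_match)
```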

### findall() method

`Regex` objects also have a `findall()` method. `search()` returns a `Match` object for the *first* match in the searched string, while `findall()` returns a list containing *every* match in the searched string

In [143]:
# returns list of strings IF NO GROUPS in regular expression
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phone_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

In [145]:
# returns list of tuples (each represents found match, and its items are matched strings for each group in regex)
phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phone_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

### Character classes

Shorthand character classes are useful for shortening regexes.

List of shorthand character classes:
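* `\d`: matches any decimal digit (equivalent to `[0-9]`)
* `\D`: matches any non-digit character (equivalent to `[^0-9]`)
* `\s`: matches any whitespace character (equivalent to `[ \t\n\r\f\v]`)
* `\S`: matches any non-whitespace character (equivalent to `[^ \t\n\r\f\v]`)
* `\w`: matches any word character, i.e., a letter, digit, or underscore (equivalent to `[a-zA-Z0-9_]`)
* `\W`: matches any non-word character (equivalent to `[^a-zA-Z0-9_]`)

The bracketed equivalents are exact only for ASCII text or when the `re.ASCII` flag is used. For Python 3 string patterns, these classes also match Unicode digits, whitespace, and letters.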
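
A quick sketch of several shorthand classes applied to one assumed sample string. The last two lines illustrate that, for ordinary Python 3 strings, `\d` also matches non-ASCII digits unless the `re.ASCII` flag is passed. The character `'\u0663'` (ARABIC-INDIC DIGIT THREE) is used only as an example.

```python
import re

text = 'Room 42, Floor 7'

digits = re.findall(r'\d+', text)      # runs of digits
non_digits = re.findall(r'\D+', text)  # runs of non-digit characters
words = re.findall(r'\w+', text)       # runs of letters/digits/underscores
non_space = re.findall(r'\S+', text)   # runs of non-whitespace characters
print(digits, non_digits, words, non_space, sep='\n')

# Unicode vs. ASCII-only matching of \d
mixed = '5 and \u0663'
unicode_digits = re.findall(r'\d', mixed)
ascii_digits = re.findall(r'\d', mixed, flags=re.ASCII)
```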

In [147]:
# matches text with 1 or more digits, followed by whitespace character, followed by 1 or more letter/digit/underscore characters
xmas_regex = re.compile(r'\d+\s\w+')
text = '12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge'
xmas_regex.findall(text)

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

#### Custom character classes

Create custom character classes if the shorthand character classes are too broad.
* Define a character class by using square brackets, e.g., `[aeiouAEIOU]` will match any vowel, both lowercase and uppercase
* Can also include ranges of letters or numbers by using a hyphen, e.g., `[a-d]` will match the letters a, b, c, or d
* Most special characters DO NOT need to be escaped inside brackets, e.g., `[0-5.]` will match digits 0 to 5 and a period. The exceptions are `\`, `]`, `^` (when it comes first), and `-` (when it sits between two characters)
* A caret character (^) just after the character class's opening bracket makes a *negative character class*, which matches any character NOT in the class
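
The range and unescaped-period bullets above can be sketched as follows. The sample strings are assumptions for illustration:

```python
import re

# range: matches any single letter from a to d
range_matches = re.compile(r'[a-d]').findall('abcdefg')

# the period inside brackets is a literal period, not "any character"
digit_or_period = re.compile(r'[0-5.]').findall('v1.5 of 9')

# a character class can be combined with a quantifier such as +
version = re.compile(r'[0-5.]+').findall('v1.5 of 9')

print(range_matches)
print(digit_or_period)
print(version)
```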

In [151]:
# list of string matches for any vowels in string
vowel_regex = re.compile(r'[aeiouAEIOU]')
text = 'Robocop eats baby food. BABY FOOD'
vowel_regex.findall(text)

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

In [153]:
# negative character class (^), matches every non-vowel character
vowel_regex = re.compile(r'[^aeiouAEIOU]')
text = 'Robocop eats baby food. BABY FOOD'
vowel_regex.findall(text)

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D']