# Regular Expressions
- **Regular Expression (RegEx)**: a sequence of characters that forms a search pattern.  Used to check whether a string contains text matching that pattern.  Provides much more flexibility than string methods.  
- Regular expressions are used in many other programming languages and have been around for decades.  They are both powerful and not super intuitive.  
- Additional syntax and functions not shown below can be found online at websites like https://www.regular-expressions.info/index.html
- We can test expressions at https://pythex.org
- Regular expressions are processed in two steps:
    1. Search patterns are processed by the Python interpreter.  Any Python escape sequences are converted into the characters they represent, and the resulting string is sent to the RegEx parser. 
    1. Search patterns are processed by the RegEx parser.  Any RegEx special sequences are recognized. 
- If there are any backslashes in the search pattern it is best practice to write the search pattern as a raw string.  This prevents the Python interpreter from potentially converting them into escape characters before the RegEx parser gets a chance to read the search pattern.
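
The two processing steps can be seen by comparing a normal string with a raw string. The following is a minimal sketch; the variable names `normal_pattern` and `raw_pattern` are chosen here for illustration only.

```python
import re

text = 'The rain in Spain'

# Normal string: the Python interpreter turns \b into a single backspace
# character before the RegEx parser ever sees the pattern.
normal_pattern = 'ain\b'

# Raw string: the backslash and the 'b' both reach the RegEx parser,
# which reads them as the word-boundary special sequence.
raw_pattern = r'ain\b'

print(len(normal_pattern), len(raw_pattern))  # 4 5
print(re.findall(normal_pattern, text))       # [] (the text has no backspace)
print(re.findall(raw_pattern, text))          # ['ain', 'ain']
```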

Code | Use
--- | --- |
`re` | Module
`re.findall()` | Returns a list containing all matching strings
`re.split()` | Same as built-in `.split()` method, but can use metacharacters, special sequences, and sets
`re.sub()` | Same as `.replace()` method, but can use metacharacters, special sequences, and sets
`re.search()` | Returns a Match object if there is a match anywhere in the string; returns `None` if there is no match.  The Match object records the search pattern used, the matched text, and the index location (slice) of the matching characters.  It only describes the first match found.  Can be used in conditionals to test whether there is at least one match.  A few methods can be called on the Match object.
`.group()` | Match object method.  Returns string that matched.
`.start()` or `.end()` | Match object method.  Returns the start or end index of the match.  Multiple matches can be found by slicing off the text up to the end of a match and searching the remainder again.
`re.compile()` | Creates a RegEx object, which holds our search pattern.  Other functions like `.findall()`, `.search()`, `.split()`, and `.sub()` are then called as methods on the RegEx object.  `re.compile()` can be passed arguments (flags), which makes it more flexible than just saving our search pattern as a string.  However, the `re` functions accept either a normal string or a RegEx object as the search pattern.
`re.IGNORECASE` | Argument for `compile()`.  Ignores case in the string we are searching.  If we want to use multiple parameters we can use `re.IGNORECASE | re.DOTALL | re.VERBOSE` with the bitwise or (pipe) operator.
`re.DOTALL` | Argument for `compile()`.  Makes the dot character match all characters *including* newlines.
`re.VERBOSE` | Argument for `compile()`.  Allows us to format our search pattern over multiple lines (like a multiline string) while using comments and ignoring the whitespace created by those newlines and comments.

Metacharacter | Use.  Reminder: within sets, most metacharacters have no special meaning. | Example
--- | --- | ---
`\` | Used in special sequences.  See table with special sequences.  A single backslash placed before a metacharacter suppresses the unique action of that character.  Similar to placing a backslash before Python escape characters.  This allows us to search for a metacharacter instead of just using its unique action. | "\d"
`[]` | A set of characters. Different from the Python data type. See table for working with sets | "[ain]"
`.` | Any character (except newline) | "a.n"
`^` | String starts with | "^ain"
`$` | String ends with  | "ain\$"
`*` | Zero or more occurrences | "ain*"
`*?` | Zero or more occurrences (non "greedy").  Normally, RegEx is **greedy**, meaning that it tries to find the longest string that matches our pattern.  **Non greedy (or lazy)** tells it to find the shortest string that matches the pattern |"ain*?"
`+` | One or more occurrences | "ain+"
`+?` | One or more occurrences (non greedy) | "ain+?"
`{}` | Exactly the specified number of occurrences or a range of occurrences | "ain{1}", "ain{2,3}"
`\|` | Either or.  Must use escape in Markdown. | "rain\|Spain"
`()` | Extract.  Specifies what to extract.  Use more characters in search pattern, but only extract a subset of characters from match. | "a(i)n"
`(?:)` |  Group but not extract.  Parentheses can be used to group characters together so we can apply the pipe or metacharacters to the entire group.  However, this also means extract.  `(?:)` means group, but do not extract. | "(?:ai)+n"
`?` | Optional pattern.  Can occur 0 or 1 time.  Used after characters or groups of characters.  Takes on a different meaning in this scenario than when used to indicate non-greedy. | "r?ain"

Special Sequences | Use.  The first half match portions of words/string.  The second half match individual characters.  | Example
--- | --- | ---
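`\A` | Returns a match where the specified characters are at the START of the STRING.  Same as ^ unless `re.MULTILINE` is used. | "\Aain"
`\Z` | Returns a match where the specified characters are at the END of the STRING.  Similar to $, except that $ also matches just before a trailing newline. | "ain\Z"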
`\b` | Returns a match where specified characters at START or END of a WORD. `r` creates "raw string". | r"\bain", r"ain\b"
`\B` | Returns a match where the specified characters are present, but NOT at START or END of WORD. | "\Bain", "ain\B"
`\d` | Matches any single digit character (0-9) | "\d"
`\D` | Matches any single character that is NOT a digit | "\D"
`\s` | Matches any single whitespace character | "\s"
`\S` | Matches any single character that is NOT whitespace | "\S"
`\w` | Matches any single alphanumeric character or the underscore | "\w"
`\W` | Matches any single character that is NOT alphanumeric or the underscore | "\W"
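
The single-character sequences become more useful when combined with quantifiers and word boundaries. A small sketch follows, reusing the sample mail header line that appears in a later example; the variable names are illustrative.

```python
import re

text = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# Three pairs of digits separated by colons (a clock time)
times = re.findall(r'\d\d:\d\d:\d\d', text)
print(times)      # ['09:14:16']

# Exactly four digits standing alone as a word (a year)
years = re.findall(r'\b\d{4}\b', text)
print(years)      # ['2008']

# Non-whitespace characters on both sides of an @
addresses = re.findall(r'\S+@\S+', text)
print(addresses)  # ['stephen.marquard@uct.ac.za']
```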

Set Syntax | Use.  These match specified individual characters.
--- | ---
[] | When used within sets, within [], most of the metacharacters above have no special meaning.  Exceptions: `^` at the start, `-` between two characters, and `\`
[ain] | Returns a match where one of the specified characters is present
[^ain] | Returns a match for any character except those letters
[a-n] | Returns a match for any lower case character alphabetically between a and n
[1-5] | Returns a match for numbers in range (both numbers inclusive)
[0-5][0-9] | Returns a match for two-digit numbers.  Here this is 00-59
[a-zA-Z] | Returns a match for any character alphabetically between a and z, lower or upper case
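
Two rows of this table are easy to misread, so here is a short sketch of each. The sample strings are made up for illustration.

```python
import re

# Two sets side by side match two characters: a first digit 0-5, then any digit.
# '75' is skipped because 7 is outside 0-5.
numbers = re.findall(r'[0-5][0-9]', 'Values: 07, 42, 75')
print(numbers)      # ['07', '42']

# Inside a set, '.' and '?' are ordinary characters, not metacharacters.
punctuation = re.findall(r'[.?!]', 'Really? Yes. Go!')
print(punctuation)  # ['?', '.', '!']
```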

---

**EXAMPLES**

In [1]:
import re

## Findall Examples

Simple Example

In [2]:
text = 'The rain in Spain'

re.findall('ain', text)

['ain', 'ain']

### Metacharacters

- Remember that if we wanted to match a metacharacter, we must prefix that metacharacter with a backslash
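
As a quick illustration of that reminder, the sketch below searches for a literal period and literal parentheses. The sample text is invented for this example.

```python
import re

text = 'Price 5.00 or 5x00 (plus tax)'

# Unescaped dot: matches any character, so both prices are found
any_char = re.findall(r'5.00', text)
print(any_char)     # ['5.00', '5x00']

# Escaped dot: matches only a literal period
literal_dot = re.findall(r'5\.00', text)
print(literal_dot)  # ['5.00']

# Escaped parentheses: match literal ( and ) instead of creating a group
in_parens = re.findall(r'\(.+\)', text)
print(in_parens)    # ['(plus tax)']
```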

`^` String Starts With, `$` String Ends With

In [3]:
text = 'The rain in Spain'

# String starts with
var_matches = re.findall('^ain', text)  # No match.  Empty list.
print(var_matches)

# String ends with
var_matches = re.findall('ain$', text)
print(var_matches)

[]
['ain']


`.` Any Character

In [4]:
text = 'The rain in Spain'

re.findall('.in', text)

['ain', ' in', 'ain']

`*`  and `*?`.  Zero or More Occurrences Greedy and Non-greedy.

In [5]:
text = 'The rain in Spain'

# Greedy.  'a' found, greedy then matches any character until the string ends.
# Matches do NOT overlap so only one match.
var_matches = re.findall('a.*', text)
print(var_matches)

# Non-Greedy. a found, non-greedy then stops as 0 matches for any character is okay.
var_matches = re.findall('a.*?', text)
print(var_matches)

['ain in Spain']
['a', 'a']
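
The difference between greedy and non-greedy matters most when the pattern has something to stop at. A common illustration, sketched here with a made-up tagged string, is matching text between angle brackets.

```python
import re

text = '<b>rain</b> in <i>Spain</i>'

# Greedy: .* runs to the LAST '>' in the string, so there is one long match
greedy = re.findall(r'<.*>', text)
print(greedy)  # ['<b>rain</b> in <i>Spain</i>']

# Non-greedy: .*? stops at the FIRST '>' it can, so each tag is its own match
lazy = re.findall(r'<.*?>', text)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```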


`+` and `+?`.  One or More Occurrences Greedy and Non-greedy.

In [6]:
text = 'The rain in Spain'

# Greedy.  'a' found, greedy then matches any character until the string ends.
# Matches do NOT overlap so only one match.
var_matches = re.findall('a.+', text)
print(var_matches)

# Non-Greedy. a found, any character found once, non-greedy then stops as 1 match for any character is okay.
var_matches = re.findall('a.+?', text)
print(var_matches)

['ain in Spain']
['ai', 'ai']


`{}` Specify Number of Occurrences

In [7]:
text = 'The rain in Spain'

# a then any character 0 times
var_matches = re.findall('a.{0}', text)
print(var_matches)

# a then any character 1 time
var_matches = re.findall('a.{1}', text)
print(var_matches)

# a then any character 1 or more times
var_matches = re.findall('a.{1,}', text)  # Same as a.+
print(var_matches)

# Greedy. a then any character 0 to 2 times.  Because greedy, goes for 2 characters after a
var_matches = re.findall('a.{0,2}', text)
print(var_matches)

# Non-Greedy.  a then any character 0 to 2 times.  Because non-greedy, goes for 0 characters after a
var_matches = re.findall('a.{0,2}?', text)
print(var_matches)

['a', 'a']
['ai', 'ai']
['ain in Spain']
['ain', 'ain']
['a', 'a']


`|` Either

In [8]:
text = 'The rain in Spain'

var_matches = re.findall('rain|Spain', text)
print(var_matches)

['rain', 'Spain']


`()` Extract
- If a `.findall()` search pattern has 0 or 1 groups, then the result is a list of matches like we'd expect
- If a `.findall()` search pattern has 2 or more groups, then the result is a list of tuples.  Each tuple is a single match, with 2 or more groups per tuple.

In [9]:
text = 'The rain in Spain'

# Single group
var_matches = re.findall('a(in)', text)  # Two matches.  Returns list.
print(var_matches)

# Two groups
var_matches = re.findall('(a)(in)', text)  # Two matches.  Returns list of tuples, with 2 items each.
print(var_matches)

# Three groups
var_matches = re.findall('((a)(in))', text)  # Two matches.  Returns list of tuples, with 3 items each.
print(var_matches)

['in', 'in']
[('a', 'in'), ('a', 'in')]
[('ain', 'a', 'in'), ('ain', 'a', 'in')]


`(?:)` Group
- Parentheses can be used to group characters together so we can apply the pipe or metacharacters to the entire group.  However, this also means extract.  `(?:)` means group, but do not extract.

In [10]:
text = 'The rain in Spain'

# Unwanted results
var_matches = re.findall('(r|Sp)ain', text)
print(var_matches)

# (?:) Wanted Results
var_matches = re.findall('(?:r|Sp)ain', text)
print(var_matches)

['r', 'Sp']
['rain', 'Spain']


`?` Optional Pattern

In [11]:
text = 'The rain in Spain'

var_matches = re.findall('a?in', text)
print(var_matches)

['ain', 'in', 'ain']


### Special Sequences

`\A` STRING Starts With.  `\Z` STRING Ends With.

In [12]:
text = 'The rain in Spain'

# STRING starts with
var_matches = re.findall(r'\Aain', text)  # Raw string so Python leaves the backslash alone
print(var_matches)

# STRING ends with
var_matches = re.findall(r'ain\Z', text)
print(var_matches)

[]
['ain']


`\b` WORD Starts With.  `\b` WORD Ends With.

In [13]:
text = 'The rain in Spain'

# WORD starts with
var_matches = re.findall(r'\bain', text)  # Raw string needed here (or two backslashes)
print(var_matches)

# WORD ends with
var_matches = re.findall(r'ain\b', text)  # Raw string needed here (or two backslashes)
print(var_matches)

[]
['ain', 'ain']


`\B` Present, but WORD does NOT Start With.  `\B` Present, but WORD does NOT End With.

In [14]:
text = 'The rain in Spain'

# Present, but WORD does NOT start with
var_matches = re.findall(r'\Bain', text)
print(var_matches)

# Present, but WORD does NOT end with
var_matches = re.findall(r'ain\B', text)
print(var_matches)

['ain', 'ain']
[]


`\d` Number.  `\D` NOT Number.

In [15]:
text = 'The rain in Spain'

# Number
var_matches = re.findall(r'\d', text)
print(var_matches)

# NOT number
var_matches = re.findall(r'\D', text)
print(var_matches)

[]
['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']


`\s` Whitespace.  `\S` NOT Whitespace.

In [16]:
text = 'The rain in Spain'

# Whitespace
var_matches = re.findall(r'\s', text)
print(var_matches)

# NOT whitespace
var_matches = re.findall(r'\S', text)
print(var_matches)

[' ', ' ', ' ']
['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']


`\w` Alphanumeric or Underscore.  `\W` NOT Alphanumeric or Underscore.

In [17]:
text = 'The rain in Spain'

# Alphanumeric/underscore
var_matches = re.findall(r'\w', text)
print(var_matches)

# NOT alphanumeric/underscore
var_matches = re.findall(r'\W', text)
print(var_matches)

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
[' ', ' ', ' ']


### Sets

`[]`

In [18]:
text = 'The rain in Spain'

re.findall('[ain]', text)

['a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']

`[^]`

In [19]:
text = 'The rain in Spain'

re.findall('[^ain]', text)

['T', 'h', 'e', ' ', 'r', ' ', ' ', 'S', 'p']

`[-]`

In [20]:
text = 'The rain in Spain'

var_matches = re.findall('[a-n]', text)
print(var_matches)

var_matches = re.findall('[a-zA-Z]', text)
print(var_matches)

var_matches = re.findall('[0-9]', text)
print(var_matches)

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
[]


## Compile Examples

Simple Example

In [21]:
text = 'The rain in Spain'

search_pattern = re.compile(r'ain')
search_pattern.findall(text)

['ain', 'ain']

IGNORECASE

In [22]:
text = 'The rain in Spain'

search_pattern = re.compile(r'spain\b', re.IGNORECASE)
search_pattern.findall(text)

['Spain']

DOTALL

In [23]:
text = 'The rain\n in Spain'  # Included newline character

search_pattern = re.compile(r'^The.+(Spain)')
print(search_pattern.findall(text))

search_pattern = re.compile(r'^The.+(Spain)', re.DOTALL)
print(search_pattern.findall(text))

[]
['Spain']


VERBOSE

In [24]:
text = 'The rain in Spain'

search_pattern = re.compile(r"""
^The    # Starts with
\s      # Whitespace
(rain)  # Extract rain
.+      # Any character except newline repeated 1 or more times
Spain$  # Ends with
""", re.VERBOSE)
search_pattern.findall(text)

['rain']
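
The flags shown above can be combined with the bitwise or (pipe) operator. The following sketch reuses the newline-containing text from the DOTALL example; the pattern itself is an illustration, not one of the earlier cells.

```python
import re

text = 'The rain\n in Spain'  # Included newline character

search_pattern = re.compile(r"""
^the      # Starts with 'the'; IGNORECASE lets it match 'The'
.+        # Any character 1 or more times; DOTALL lets it cross the newline
(spain)$  # Extract 'spain' in any case at the end of the string
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

matches = search_pattern.findall(text)
print(matches)  # ['Spain']
```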

## Search Examples

**`search()`, `.group()`, `.start()`, `.end()`**

In [25]:
text = 'The rain in Spain'

match_object = re.search('ain', text)
print(match_object)
print(match_object.group())
print(match_object.start())
print(match_object.end())

<re.Match object; span=(5, 8), match='ain'>
ain
5
8


- `search()` can be used in conditional statements

In [26]:
text = 'The rain in Spain'

match_object = re.search('ain', text)

if match_object:  # if match object exists then print  
    print("There is a match.")

There is a match.
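
Because `search()` only reports the first match, finding every match means searching again in the text that follows each one. Below is a minimal sketch of that slice-and-search-again idea. The helper name `find_all_spans` is hypothetical, and slicing the text is a simplification: patterns that depend on what comes before the slice, such as `^` or `\b`, may behave differently.

```python
import re

def find_all_spans(pattern, text):
    """Collect (start, end) index pairs for every non-overlapping match."""
    spans = []
    offset = 0  # Index in the original text where the current slice begins
    while offset <= len(text):
        match_object = re.search(pattern, text[offset:])
        if match_object is None:
            break
        # Indices from the match are relative to the slice, so add the offset back
        spans.append((offset + match_object.start(), offset + match_object.end()))
        # Continue after this match; always advance at least one character
        offset += max(match_object.end(), 1)
    return spans

text = 'The rain in Spain'
spans = find_all_spans('ain', text)
print(spans)  # [(5, 8), (14, 17)]
```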


## Split Examples

Simple Example

In [27]:
text = 'The rain in Spain'

l_matches = re.split(' ', text)
print(l_matches)

['The', 'rain', 'in', 'Spain']
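
The advantage of `re.split()` over the built-in `.split()` shows up when the separator varies. In this sketch the sample text mixes commas, semicolons, and repeated spaces; it is invented for illustration.

```python
import re

text = 'The rain,in;  Spain'

# Built-in split handles only one fixed separator
plain_split = text.split(' ')
print(plain_split)  # ['The', 'rain,in;', '', 'Spain']

# RegEx split: one or more commas, semicolons, or whitespace characters
regex_split = re.split(r'[,;\s]+', text)
print(regex_split)  # ['The', 'rain', 'in', 'Spain']
```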


## Sub Examples

Simple Example

In [28]:
text = 'The rain in Spain'

sub_text = re.sub('ain','at', text)
print(sub_text)

The rat in Spat


Use Group from Match to Replace

In [29]:
text = 'Agent Kevin and Agent James.'

# 1st argument is search pattern
# 2nd argument is what to replace with
# The \1 means use the group from the match and substitute that back into the text at each match
# 3rd argument is text we are searching

sub_text = re.sub(r'Agent (\w)\w*', r'Agent \1', text)

print(sub_text)

Agent K and Agent J.
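
The same idea extends to more than one group: `\1` refers to the first group, `\2` to the second, and so on. A short sketch with a made-up name string:

```python
import re

text = 'Marquard, Stephen'

# Group 1 is the surname, group 2 is the first name; swap them in the replacement
sub_text = re.sub(r'(\w+), (\w+)', r'\2 \1', text)
print(sub_text)  # Stephen Marquard
```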


## Additional Examples

**Find Email Examples**

In [30]:
# Goals: return all emails

text = """
Reach Us by Email

    General inquiries: info@nostarch.com
    Media requests: media@nostarch.com
    Academic requests: academic@nostarch.com (Further information)
    Conference and Events: conferences@nostarch.com
    Help with your order: info@nostarch.com
"""
# Different characters are allowed in the username and domain name
# The dot something at the end can only be 2-4 letters
l_matches = re.findall(r'[a-zA-Z0-9.%+]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,4}', text)
print(l_matches)

['info@nostarch.com', 'media@nostarch.com', 'academic@nostarch.com', 'conferences@nostarch.com', 'info@nostarch.com']


In [31]:
# Goal: return the part of the email address after the @ (the domain)

text = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# Double find and slice
i_indexat = text.find('@')  # Return index position
i_indexspace = text.find(' ', i_indexat)  # Start searching at index position
s_handle = text[(i_indexat + 1) : i_indexspace]  # Slice
print(s_handle)

# Double split and slice
l_words = text.split(' ')
for s_word in l_words:
    if '@' in s_word:
        s_handle = (s_word.split('@'))[1]
print(s_handle) 

# RegEx
l_matches = re.findall(r'@(\S+)', text)  # After @, NON-whitespace character 1 or more times greedy
print(l_matches[0])

uct.ac.za
uct.ac.za
uct.ac.za


- **Find Phone Numbers Example**
- We provide an example text to search, but the RegEx search pattern will find phone numbers in multiple formats
- Note that the below code uses some advanced features of RegEx:
    - `re.compile(<SEARCH_PATTERN>,re.VERBOSE)` allows us to space out and comment our search pattern
    - `r` creates a raw string
    - `"""` allows us to use multiple lines
    - `()` parentheses for grouping  
    - `|` allows us to search for either pattern
    - `?`  allows us to include optional portions such as the area code and extension.  If they are found they are included, but if they are not found it will not interfere with the match.
    - `\(\)` we want to search for parentheses in the area code so we need to use backslashes before them
- We show two slightly different RegEx search patterns

1. The first uses parentheses to group characters and then applies `?` or `|` to each group.  The downside to this is that each group also tells the parser to extract.  This is OKAY if we then enclose and extract the entire search pattern and use the first item (index 0) in each tuple in the list of results returned.  This works because the group with the first opening parenthesis always comes first.  The RegEx pattern is simpler in this method.

In [32]:
# Goals: find all phone numbers in text

text = """
Contact Us

No Starch Press, Inc.
245 8th Street
San Francisco, CA 94103 USA
Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)
Fax: +1 415.863.9950
"""

# Periods are escaped as \. because an unescaped . would match any character
search_phone = re.compile(r"""(
(\d{3}|\(\d{3}\))?             # Area code.  In form: ### or (###).  Optional.
(-|\s|\.)?                     # Separator.  In form: space or dash or period.  Optional.
\d{3}                          # First 3 digits.  In form ###
(-|\s|\.)                      # Separator.  In form: space or dash or period
\d{4}                          # Last 4 digits.  In form: ####
(\s*(ext|x|ext\.)\s*\d{2,5})?  # Extension.  In form: space ext or x or ext. space 2-5 digits
)""", re.VERBOSE)

l_matches = search_phone.findall(text)
print(l_matches)
for t_matches in l_matches:
    print(t_matches[0])

[('800.420.7240', '800', '.', '.', '', ''), ('415.863.9900', '415', '.', '.', '', ''), ('415.863.9950', '415', '.', '.', '', '')]
800.420.7240
415.863.9900
415.863.9950


2. The second uses the `(?:)`.  `(?:)` tells the parser to use the parentheses to group, but not extract.  This makes the RegEx more complicated, but the results are simpler.

In [33]:
# Goals: find all phone numbers in text

text = """
Contact Us

No Starch Press, Inc.
245 8th Street
San Francisco, CA 94103 USA
Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)
Fax: +1 415.863.9950
"""

# Periods are escaped as \. because an unescaped . would match any character
search_phone = re.compile(r"""
(?:\d{3}|\(\d{3}\))?               # Area code.  In form: ### or (###).  Optional.
(?:-|\s|\.)?                       # Separator.  In form: space or dash or period.  Optional.
\d{3}                              # First 3 digits.  In form ###
(?:-|\s|\.)                        # Separator.  In form: space or dash or period
\d{4}                              # Last 4 digits.  In form: ####
(?:\s*(?:ext|x|ext\.)\s*\d{2,5})?  # Extension.  In form: space ext or x or ext. space 2-5 digits
""", re.VERBOSE)

l_matches = search_phone.findall(text)
print(l_matches)

['800.420.7240', '415.863.9900', '415.863.9950']
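
To check the claim that one pattern covers multiple formats, the sketch below tests a list of candidate strings with `fullmatch()`. The sample numbers are hypothetical, and the pattern is a variant of the non-extracting one with the periods escaped so that they only match a literal period.

```python
import re

# Hypothetical sample strings, not taken from the text above
samples = [
    '800.420.7240',
    '(415) 863-9900',
    '415-863-9900',
    '863 9900',
    '415.863.9900 x42',
    '415 863 9900 ext. 123',
    '4158639900',  # No separator before the last four digits
]

search_phone = re.compile(r"""
(?:\d{3}|\(\d{3}\))?               # Area code: three digits, with or without parentheses.  Optional.
(?:-|\s|\.)?                       # Separator: dash, whitespace, or literal period.  Optional.
\d{3}                              # First 3 digits
(?:-|\s|\.)                        # Separator: dash, whitespace, or literal period
\d{4}                              # Last 4 digits
(?:\s*(?:ext\.|ext|x)\s*\d{2,5})?  # Extension: ext., ext, or x, then 2-5 digits.  Optional.
""", re.VERBOSE)

# fullmatch() succeeds only if the entire string fits the pattern
accepted = [s for s in samples if search_phone.fullmatch(s)]
rejected = [s for s in samples if not search_phone.fullmatch(s)]
print(accepted)
print(rejected)  # ['4158639900']
```

The last sample is rejected because the pattern requires a separator between the first three and last four digits.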


**Backslash Cluster F\w\w\w and Raw Strings**
- `\b` is both a Python escape character (backspace) and a RegEx special sequence (word begins with or word ends with).  When `\b` is in the search pattern it is first passed into the Python interpreter and turned into a backspace.  Next, this backspace is sent to the RegEx parser, which then searches for a literal backspace character instead of a word boundary.  
- Therefore, we must either escape the escape character, or use a raw string.  Raw strings are easier to read and recommended.

In [34]:
text = 'The rain in Spain'

var_matches = re.findall('ain\\b', text)  # Escape the escape
print(var_matches)

var_matches = re.findall(r'ain\b', text)  # Raw string.  Easier to read.
print(var_matches)

['ain', 'ain']
['ain', 'ain']


- `\` is both a Python escape character and a RegEx special character.  The RegEx parser never treats a single backslash as a literal character; it always reads it as modifying whatever follows.  To get RegEx to search for a literal backslash, we have to give the parser two, `\\`.  But wait, the Python interpreter looks at the string first and converts `\\` to `\`.  So in a normal string we must type `\\\\` so that the Python interpreter sends the parser `\\`.  The parser reads this as one literal backslash and searches for it.
- To avoid this, we use a raw string and two backslashes.  The printed result looks odd: it appears to show the search pattern rather than the match we found in the text.  That is only because printing a list shows each string's `repr`, which displays a single backslash as `\\`.  The matched text itself contains one backslash.

In [35]:
text = r'The rain \ in Spain'  # Raw string so the backslash is kept as typed

var_matches = re.findall(r'\\ in', text)
print(var_matches)

['\\ in']
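
One way to confirm that the match holds a single backslash is to print the string itself rather than the list, and to check its length. A minimal sketch:

```python
import re

text = r'The rain \ in Spain'

matches = re.findall(r'\\ in', text)

print(matches)          # ['\\ in']  (list display shows the repr of each string)
print(matches[0])       # \ in       (the string itself has one backslash)
print(len(matches[0]))  # 4          (backslash, space, i, n)
```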


---