# Regular Expressions in Python

This notebook is a cheatsheet for using regular expressions (regex) in Python.

Source:
- https://realpython.com/regex-python/


The first examples are based on the search function:

*re.search(\<regex\>, \<string\>)*:
Scans a string for the first location where the regex matches. It returns a match object, or *None* if there is no match.

Let's import the *re* module:

In [2]:
import re
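
Because *re.search()* returns *None* when nothing matches, it is common to test the result before using it. The following is a minimal sketch of that pattern; the helper name `first_match` is hypothetical and only used for illustration.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not occur."""
    m = re.search(pattern, text)
    # A match object is truthy, so it can be tested directly
    return m.group() if m else None

print(first_match("bar", "foobarqux"))  # bar
print(first_match("xyz", "foobarqux"))  # None
```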

## Metacharacters

Metacharacters are special characters that have a unique meaning to the regex engine. The main ones are described below:

### The [] metacharacter
Specifies a character class. It matches any single character that is in the class.

In [8]:
# Matches 'ba' followed by a single character that is either 'a' or 'r'
re.search("ba[ar]", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>

In [4]:
# Matches any digit that is between 0 and 9 (0123456789)
re.search("[0-9]", "abc90")

<re.Match object; span=(3, 4), match='9'>

In [5]:
# Matches any lowercase letter between 'a' and 'z'.
re.search("[a-z]", "9abc")

<re.Match object; span=(1, 2), match='a'>

In [7]:
# Matches any uppercase letter between 'A' and 'Z'
re.search("[A-Z]", "9aABC")

<re.Match object; span=(2, 3), match='A'>

### The ^ metacharacter
Inside a character class, a leading ^ **negates** the class: it matches any character that **is not** in the set.

In [9]:
# Matches any character that is NOT between 0 and 9.
re.search("[^0-9]", "12345abc")

<re.Match object; span=(5, 6), match='a'>

*^* must appear as the first character in the class, otherwise it matches a literal '^'.

In [11]:
# Matches any digit between 0 and 9 or a literal '^'
re.search("[0-9^]", "Hello^^")

<re.Match object; span=(5, 6), match='^'>

### The . metacharacter
Matches any single character except newline.

In [13]:
# Matches a sequence of 'foo' + any character except newline + 'bar'
re.search("foo.bar", "fooxbar")

<re.Match object; span=(0, 7), match='fooxbar'>
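
The "except newline" behaviour can be changed with the *re.DOTALL* flag from the standard *re* module. This short sketch contrasts the two cases; the variable names are only illustrative.

```python
import re

# By default '.' does not match a newline, so this search fails
no_flag = re.search(r"foo.bar", "foo\nbar")
print(no_flag)  # None

# With re.DOTALL, '.' also matches '\n'
with_flag = re.search(r"foo.bar", "foo\nbar", re.DOTALL)
print(with_flag.span())  # (0, 7)
```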

### Shorthands

- **\\w**: Matches any word character (alphanumeric or underscore). Equivalent to **[a-zA-Z0-9_]** for ASCII text.
- **\\W**: The opposite of \\w. Equivalent to **[^a-zA-Z0-9_]** for ASCII text.

In [14]:
re.search(r"\w", "#(.a0_)")

<re.Match object; span=(3, 4), match='a'>

In [15]:
re.search(r"\W", "#(.a0_)")

<re.Match object; span=(0, 1), match='#'>

- **\\d**: Matches any decimal digit char. Equivalent to **[0-9]** for ASCII text.
- **\\D**: The opposite of \\d. Equivalent to **[^0-9]** for ASCII text.

In [16]:
re.search(r"\d", "abc012")

<re.Match object; span=(3, 4), match='0'>

In [17]:
re.search("\\D", "abc012")

<re.Match object; span=(0, 1), match='a'>

- **\\s**: Matches any whitespace char. Common whitespace characters are \\t, \\n, \\r, \\f and ' ' (the space character).
- **\\S**: The opposite of \\s.

In [20]:
re.search(r"\s", "foo\nbar")

<re.Match object; span=(3, 4), match='\n'>

In [21]:
re.search(r"\S", "foo\nbar")

<re.Match object; span=(0, 1), match='f'>
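
The bracketed equivalents listed above describe the ASCII behaviour. For *str* patterns in Python 3, the shorthands are Unicode-aware by default, and the *re.ASCII* flag restricts them to the ASCII ranges. A small sketch of the difference (the sample characters are arbitrary choices):

```python
import re

# Unicode-aware by default: accented letters and non-Latin digits match
unicode_word = re.search(r"\w", "é")
unicode_digit = re.search(r"\d", "٣")   # ARABIC-INDIC DIGIT THREE
print(unicode_word is not None, unicode_digit is not None)  # True True

# With re.ASCII, \w is limited to [a-zA-Z0-9_] and \d to [0-9]
ascii_word = re.search(r"\w", "é", re.ASCII)
ascii_digit = re.search(r"\d", "٣", re.ASCII)
print(ascii_word, ascii_digit)  # None None
```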

### Escaping metacharacters using backslash (\\)

You can **escape characters** if you want to include a metacharacter in your regex without its special meaning.

- **Backslash (\\)**: Removes the special meaning of a metacharacter.

In [23]:
# Matches a character that is a literal '[' or ']'
re.search(r"[\[\]]", "abc[abc]")

<re.Match object; span=(3, 4), match='['>

In [24]:
# Matches a literal dot (.)
re.search(r"\.", "foo.bar")

<re.Match object; span=(3, 4), match='.'>

The backslash is itself a special character in regex. To match it, one option is to escape it twice, once due to the Python interpreter and once due to the regex parser:

In [29]:
# Matches a literal '\'
s = r"abc\abc"
re.search("\\\\", s) 

<re.Match object; span=(3, 4), match='\\'>

Another, cleaner option is to specify the \<regex\> as a raw string, which suppresses the escaping at the interpreter level:

In [31]:
re.search(r"\\", s)

<re.Match object; span=(3, 4), match='\\'>

**It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.**
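
To illustrate why this matters, the sketch below compares plain and raw string literals. It uses \\b, which is a backspace character to the Python interpreter but a word-boundary anchor to the regex engine (anchors are covered in the next section).

```python
import re

# Both literals produce the same two-character regex: backslash, backslash
print("\\\\" == r"\\")  # True

# "\b" is a single backspace character; r"\b" is a backslash followed by 'b'
print(len("\b"), len(r"\b"))  # 1 2

plain = re.search("\bfoo", "foo")   # looks for backspace + 'foo'
raw = re.search(r"\bfoo", "foo")    # looks for 'foo' at a word boundary
print(plain)        # None
print(raw.group())  # foo
```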

### Anchors

An anchor dictates a particular location in the search string where a match must occur. They are zero-width matches, that is, they don't match any actual characters in the search string.

- **^** or **\\A**: Anchors a match to the **start** of the search string. ^ and \\A behave differently in MULTILINE mode.

In [32]:
# Matches 'foo' only if it's present at the beginning of the search string
re.search("^foo", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [33]:
re.search("^foo", "barfoo")

- **\$** or **\\Z**: Anchors a match to the **end** of the search string. \$ also matches just before a single **newline character** at the end of the string, whereas \\Z does not. \$ and \\Z behave differently in MULTILINE mode.

In [36]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar")

<re.Match object; span=(3, 6), match='bar'>

In [38]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar\n")

<re.Match object; span=(3, 6), match='bar'>

In [35]:
re.search("bar$", "barfoo")
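
The MULTILINE difference mentioned for the start and end anchors can be sketched as follows. The example uses *re.findall()*, which returns all non-overlapping matches as a list, and the *re.MULTILINE* flag; the sample strings are arbitrary.

```python
import re

starts = "foo\nfoo"
# In MULTILINE mode ^ matches at the start of every line, \A only at the very start
caret_hits = re.findall(r"^foo", starts, re.MULTILINE)
a_hits = re.findall(r"\Afoo", starts, re.MULTILINE)
print(caret_hits, a_hits)  # ['foo', 'foo'] ['foo']

ends = "bar\nbar"
# In MULTILINE mode $ matches at the end of every line, \Z only at the very end
dollar_hits = re.findall(r"bar$", ends, re.MULTILINE)
z_hits = re.findall(r"bar\Z", ends, re.MULTILINE)
print(dollar_hits, z_hits)  # ['bar', 'bar'] ['bar']

# Without MULTILINE, $ tolerates one trailing newline but \Z does not
dollar_trailing = re.search(r"bar$", "foobar\n")
z_trailing = re.search(r"bar\Z", "foobar\n")
print(dollar_trailing is not None, z_trailing is not None)  # True False
```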

- **\\b**: Anchors a match to a word boundary i.e. at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]).
- **\\B**: The opposite of \\b.

In [44]:
# Matches 'bar' only if it starts at a word boundary, i.e. at the beginning of a word.
re.search(r"\bbar", "foo bar qux")

<re.Match object; span=(4, 7), match='bar'>

In [45]:
re.search(r"\bbar", "foobarqux")

In [47]:
# Matches 'bar' only if it does NOT start at a word boundary, i.e. it is not at the beginning of a word.
re.search(r"\Bbar", "foo bar qux")

In [48]:
re.search(r"\Bbar", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>
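
A common use of \\b is to match a whole word by placing a boundary on both sides. The sketch below collects match positions with *re.finditer()*; the sample sentence is made up for illustration.

```python
import re

text = "cat concat cat's category"

# \bcat\b: 'cat' as a whole word (the apostrophe is not a word character)
whole_words = [m.span() for m in re.finditer(r"\bcat\b", text)]
print(whole_words)  # [(0, 3), (11, 14)]

# \Bcat\b: 'cat' only at the end of a longer word
word_endings = [m.span() for m in re.finditer(r"\Bcat\b", text)]
print(word_endings)  # [(7, 10)]
```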

### Quantifiers
A quantifier metacharacter immediately follows a portion of a \<regex\> and indicates how many times that portion must occur for the match to succeed.

#### \* quantifier
Matches **zero or more repetitions** of the preceding regex.

In [3]:
# Matches 'foo' followed by zero or more '-' and followed by 'bar'
re.search("foo-*bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [6]:
re.search("foo-*bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### + quantifier
Matches **one or more repetitions** of the preceding regex.

In [7]:
# Matches 'foo' followed by one or more '-' and followed by 'bar'
re.search("foo-+bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [8]:
re.search("foo-+bar", "foobar")

#### ? quantifier
Matches **zero or one repetitions** of the preceding regex.

In [9]:
# Matches 'foo' followed by an optional '-' and followed by 'bar'
re.search("foo-?bar", "foo-bar")

<re.Match object; span=(0, 7), match='foo-bar'>

In [10]:
re.search("foo-?bar", "foo---bar")

In [11]:
re.search("foo-?bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### Non-greedy quantifiers
The quantifier metacharacters \*, + and ? are all **greedy**, meaning they produce the **longest** possible match. If you want the **shortest** possible match instead, then use the **non-greedy** version of them:
- \*?: Non-greedy version of the metacharacter '*'.
- +?: Non-greedy version of the metacharacter '+'.
- ??: Non-greedy version of the metacharacter '?'.

In [13]:
# Greedy match
re.search("<.*>", "<foo><bar><qux>")

<re.Match object; span=(0, 15), match='<foo><bar><qux>'>

In [15]:
# Non-greedy match
re.search("<.*?>", "<foo><bar><qux>")

<re.Match object; span=(0, 5), match='<foo>'>
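
The difference is easier to see when collecting every match. This sketch uses *re.findall()* on the same string and also shows the non-greedy forms of + and ?; the variable names are illustrative.

```python
import re

tags = "<foo><bar><qux>"

greedy = re.findall(r"<.*>", tags)   # one match spanning the whole string
lazy = re.findall(r"<.*?>", tags)    # one match per tag
print(greedy)  # ['<foo><bar><qux>']
print(lazy)    # ['<foo>', '<bar>', '<qux>']

# +? takes as few repetitions as possible (at least one); ?? prefers zero
lazy_plus = re.search(r"ba+?", "baaa").group()
lazy_optional = re.search(r"ba??", "baaa").group()
print(lazy_plus, lazy_optional)  # ba b
```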

#### {m} quantifier
Matches exactly ***m* repetitions** of the preceding regex.

In [16]:
# Matches 'foo' followed by 3 '-' and followed by 'bar'
re.search("foo-{3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [17]:
re.search("foo-{3}bar", "foo--bar")

#### {m,n} quantifier
Matches any number of repetitions of the preceding regex **from *m* to *n***, inclusive.

In [18]:
# Matches 'foo' followed by 1, 2 or 3 '-' and followed by 'bar'
re.search("foo-{1,3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [19]:
re.search("foo-{1,3}bar", "foo--bar")

<re.Match object; span=(0, 8), match='foo--bar'>

In [21]:
re.search("foo-{1,3}bar", "foo-----bar")

- Omitting *m* implies m=0.
- Omitting *n* implies no upper bound, so {m,} matches *m* or more repetitions.

To keep its special meaning, a sequence with curly braces ({}) must fit one of the following patterns: {m}, {m,n}, {m,}, {,n}, where m>=0 and n>=0. Otherwise, the braces are matched as literal characters.
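
A short sketch of the open-ended forms, where {m,} means "at least *m*" and {,n} means "at most *n*". It also shows empty braces being matched literally. The patterns and strings are arbitrary examples.

```python
import re

# {2,}: two or more '-' characters
at_least_two = re.search(r"x-{2,}x", "x-----x")
too_few = re.search(r"x-{2,}x", "x-x")
print(at_least_two.group(), too_few)  # x-----x None

# {,2}: zero, one or two '-' characters
at_most_two = re.search(r"x-{,2}x", "xx")
too_many = re.search(r"x-{,2}x", "x---x")
print(at_most_two.group(), too_many)  # xx None

# '{}' fits none of the quantifier patterns, so it is matched literally
literal_braces = re.search(r"x{}y", "x{}y")
print(literal_braces.group())  # x{}y
```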

#### Non-greedy {m,n}: {m,n}?
{m,n}? is the non-greedy version of {m,n}, meaning it will match as few characters as possible.

In [22]:
# Matches 1, 2 or 3 'a' characters in a greedy manner
re.search("a{1,3}", "aaabbbccc")

<re.Match object; span=(0, 3), match='aaa'>

In [23]:
# Matches 1, 2 or 3 'a' characters in a non-greedy manner
re.search("a{1,3}?", "aaabbbccc")

<re.Match object; span=(0, 1), match='a'>

### Grouping & backreferences
Grouping constructs break up a regex into subexpressions or groups. This serves 2 purposes:
- **Grouping**: A group is a single entity. Additional metacharacters apply to the entire group as a unit.
- **Capturing**: Some grouping constructs can also capture the portion of the search string that matches the group regex. The captured matches can be retrieved later. 

To define a regex group we use parentheses: 
*(\<regex\>)*

#### 1st purpose: Treating the group as a unit

In [24]:
# Matches a group formed by the 'bar' sequence and repeated one or more times. 
re.search("(bar)+", "foo barbarbar baz")

<re.Match object; span=(4, 13), match='barbarbar'>

In [25]:
# Matches one or more repetitions of the group 'foo' optionally followed by 'bar'.
re.search("(foo(bar)?)+", "foofoobarfoobar")

<re.Match object; span=(0, 15), match='foofoobarfoobar'>

#### 2nd purpose: Capturing groups
To retrieve the captured part of the search string that matches the group we have two methods:
- **m.groups()**: returns a **tuple** containing all the captured groups from a regex match.
- **m.group(\<n\>)**: returns a **string** containing the *n*-th captured group. ***n* starts at 1**. If ***n=0* or no *n*** is provided, it returns **the entire match**. If **more than one *n*** is specified, it returns **the specified captured groups** as a **tuple**.

where *m* is the match object that *re.search()* returns. 

In [30]:
# Each (\w+) matches a sequence of word characters, 
# and the full regex is formed by 3 of these sequences 
# separated by a comma
m = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,qux")
m.groups()

('foo', 'bar', 'qux')

In [28]:
# Returns the 2nd captured group, which is 'bar'
m.group(2)

'bar'

In [29]:
# Returns the entire match
m.group()

'foo,bar,qux'

In [31]:
# Returns the 1st and 2nd captured groups
m.group(1,2)

('foo', 'bar')
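
Two details are worth a quick sketch: a group that does not take part in the match is reported as *None* (or as a default value passed to *m.groups()*), and *m.span(\<n\>)* gives the position of a single group. The patterns below are illustrative.

```python
import re

# The optional second group does not participate in this match
m_optional = re.search(r"(foo)(bar)?", "foo")
print(m_optional.groups())    # ('foo', None)
print(m_optional.groups(""))  # ('foo', '')  (default for missing groups)
print(m_optional.group(2))    # None

# span(n) returns the (start, end) position of the n-th group
m_csv = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,qux")
print(m_csv.span(2))  # (4, 7)
```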

#### Backreferences
You can match a previously captured group later within the same regex using a **backreference**:
- **\\\<n\>**: Matches the contents of the *n*-th captured group.

Since a backreference contains a **backslash**, it's a good practice to use **raw strings**.

In [34]:
# Matches a number followed by a comma and followed by the same number
# again
re.search(r"([0-9]),\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

In [37]:
# A more complex example
re.search(r"([0-9]),([0-9]),\2,\1", "0,1,2,2,1,3")

<re.Match object; span=(2, 9), match='1,2,2,1'>

In [43]:
# Without using raw strings (not recommended)
re.search("([0-9]),\\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>
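
One practical use of backreferences is spotting an accidentally repeated word. This is a minimal sketch that assumes the two words are separated by exactly one space; the sample sentence is made up.

```python
import re

# (\w+) captures a word; \1 requires the same word again after a space.
# The \b anchors prevent matching inside longer words.
m_repeat = re.search(r"\b(\w+) \1\b", "this is is a test")
print(m_repeat.group())   # is is
print(m_repeat.group(1))  # is
print(m_repeat.span())    # (5, 10)
```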

#### Naming groups
Until now, we have referred to the groups with an integer value *n*. You can name a group too:

- **(?P\<name\>\<regex\>)**: Creates a **named capturing group**. Each \<name\> can only appear **once** per regex.

In [39]:
# Let's give a name to the groups of the previous example: w1, w2 and w3.
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)", "foo,bar,qux")

# And let's access the 2nd group called 'w2'
print(m.group("w2"))

# You can still access the group by number
print(m.group(2))

bar
bar


- **(?P=\<name\>)**: Matches the contents of a **previously captured named group**.

In [41]:
# Again, let's give a name to the captured group of the previous example
re.search("(?P<number>[0-9]),(?P=number)", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

In [42]:
# Again, you can still use the numbers as well
re.search(r"(?P<number>[0-9]),\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

**The angle brackets (\< and \>) are required** around *name* **when creating a named group** but **not when referring to it later**, either by backreference or by *.group()*.
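
Named groups can also be retrieved all at once with *m.groupdict()*, or individually with subscript syntax on the match object. A small sketch with a hypothetical date string:

```python
import re

m_date = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    "Released on 2021-03-15.",
)

# groupdict() maps each group name to its captured text
print(m_date.groupdict())  # {'year': '2021', 'month': '03', 'day': '15'}

# m["name"] is equivalent to m.group("name")
print(m_date["month"])  # 03
```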

#### Non-capturing groups
Sometimes, there's no need to capture a group and do something with the value later. Capturing a group takes some time and memory, so avoiding unnecessary captures may give a slight performance advantage. 

- (?:\<regex\>): Creates a **non-capturing** group.

In [44]:
# Here the second group is not captured 
# (but it still has to appear in the search string)
m = re.search(r"(\w+),(?:\w+),(\w+)", "foo,bar,qux")
m.groups()

('foo', 'qux')
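
A non-capturing group still works as a unit for quantifiers. The sketch below compares it with a capturing group on the same input; both match the same text, but only the capturing version records a group.

```python
import re

capturing = re.search(r"(bar)+", "foo barbarbar baz")
non_capturing = re.search(r"(?:bar)+", "foo barbarbar baz")

# Same overall match in both cases
print(capturing.group(), non_capturing.group())  # barbarbar barbarbar

# Only the capturing group shows up in groups()
print(capturing.groups())      # ('bar',)
print(non_capturing.groups())  # ()
```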

#### Conditional matches
A conditional match matches against one of two specified regexes depending on whether the given group exists:
- **(?(\<n\>)\<yes-regex\>|\<no-regex\>)**: Matches against **\<yes-regex\> if the group numbered \<n\> exists**, that is, if it took part in the match. **Otherwise**, it matches against **\<no-regex\>**.
- **(?(\<name\>)\<yes-regex\>|\<no-regex\>)**: Same as above except the group is referred by its *name* instead of its number.

In [56]:
# Matches '###foobar' because the '###' group exists.
re.search("^(###)?foo(?(1)bar|baz)", "###foobar")

<re.Match object; span=(0, 9), match='###foobar'>

In [59]:
# The '###' group doesn't exist, so the regex to match is 'foobaz' and 'foobar' doesn't match.
re.search("^(###)?foo(?(1)bar|baz)", "foobar")

In [60]:
# Same as above but using a named group 'ch'
re.search("^(?P<ch>###)?foo(?(ch)bar|baz)", "###foobar")

<re.Match object; span=(0, 9), match='###foobar'>
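
A typical use of a conditional match is requiring a closing delimiter only when the opening one was present. The sketch below accepts a number with or without surrounding parentheses. It leaves out the \<no-regex\> part, which the *re* module allows; the pattern itself is just an illustration.

```python
import re

# Group 1 is an optional '('. If it matched, a ')' is required after the digits.
pattern = r"^(\()?\d+(?(1)\))$"

results = {s: bool(re.search(pattern, s)) for s in ["42", "(42)", "(42", "42)"]}
print(results)  # {'42': True, '(42)': True, '(42': False, '42)': False}
```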

### Lookahead & lookbehind assertions

Lookahead and lookbehind assertions determine a match based on what is **just behind (to the left)** or **ahead (to the right)** of the parser's current position in the search string. 

Lookahead and lookbehind assertions are **zero-width**, meaning they don't consume any of the search string. Also, **they don't capture** what they match. 

#### Lookahead
- **(?=\<lookahead_regex\>)**: **Positive lookahead**. Asserts that **what follows** the regex parser's current position **must match *\<lookahead_regex\>***. 
- **(?!\<lookahead_regex\>)**: **Negative lookahead**. Asserts that **what follows** the regex parser's current position **must NOT match *\<lookahead_regex\>***. 

In [66]:
# Matches 'foo' if it's followed by a lowercase alphabetic character.
# Note that the 'b' that follows 'foo' is not part of the match,
# nor is it captured.
re.search("foo(?=[a-z])", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [63]:
re.search("foo(?=[a-z])", "foo123")

In [62]:
# Matches 'foo' if it is not followed by a lowercase alphabetic character
re.search("foo(?![a-z])", "foobar")

In [64]:
re.search("foo(?![a-z])", "foo123")

<re.Match object; span=(0, 3), match='foo'>
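
Lookaheads are handy for extracting a value based on what follows it without including the suffix in the match. A minimal sketch with a made-up string of CSS-like lengths:

```python
import re

text = "10px 20em 30px"

# Numbers that are followed by 'px' (the unit is not part of the match)
px_values = re.findall(r"\d+(?=px)", text)
print(px_values)  # ['10', '30']

# Numbers NOT followed by 'px'. The extra \d alternative stops the engine
# from backtracking to a shorter number such as the '1' in '10px'.
other_values = re.findall(r"\d+(?!\d|px)", text)
print(other_values)  # ['20']
```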

#### Lookbehind

- **(?\<=\<lookbehind_regex\>)**: **Positive lookbehind**. Asserts that **what precedes** the regex parser's current position **must match *\<lookbehind_regex\>***. 
- **(?\<!\<lookbehind_regex\>)**: **Negative lookbehind**. Asserts that **what precedes** the regex parser's current position **must NOT match *\<lookbehind_regex\>***. 

In [67]:
# Matches 'bar' if it's preceded by 'foo'
re.search("(?<=foo)bar", "foobar")

<re.Match object; span=(3, 6), match='bar'>

In [68]:
re.search("(?<=foo)bar", "quxbar")

In [69]:
# Matches 'bar' if it is not preceded by 'foo'
re.search("(?<!foo)bar", "foobar")

In [70]:
re.search("(?<!foo)bar", "quxbar")

<re.Match object; span=(3, 6), match='bar'>

There’s a restriction on lookbehind assertions that doesn’t apply to lookahead assertions: **The \<lookbehind_regex\> in a lookbehind assertion must specify a match of fixed length.** For instance, quantifiers like '\*', '+' or '?' are not allowed in lookbehind assertions.
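
The fixed-length restriction can be checked directly: compiling a lookbehind with a variable-length quantifier raises *re.error*, while a fixed repetition count such as {3} is accepted. This sketch reflects the behaviour of the standard *re* module as described above; the patterns are arbitrary.

```python
import re

# A variable-length lookbehind is rejected at compile time
rejected = False
try:
    re.compile(r"(?<=a+)b")
except re.error as exc:
    rejected = True
    print("rejected:", exc)

# A fixed-length lookbehind compiles and matches as expected
fixed = re.search(r"(?<=a{3})b", "aaab")
print(fixed.group(), fixed.span())  # b (3, 4)
```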