# Regular Expressions in Python

This notebook is a cheatsheet for using regular expressions (regex) in Python.

Source:
- https://realpython.com/regex-python/


The first examples are based on the search function:

*re.search(\<regex\>, \<string\>)*:
Scans a string for a regex match.

Let's import the re module:

In [2]:
import re
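
As a minimal sketch of what re.search hands back: it returns a match object when the regex is found and None otherwise. The helper name describe below is hypothetical (it is not part of re) and only unpacks the match object's start, end and matched text.

```python
import re

def describe(pattern, text):
    """Return (start, end, matched_text) for the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return (m.start(), m.end(), m.group())

print(describe("bar", "foobarqux"))  # (3, 6, 'bar')
print(describe("xyz", "foobarqux"))  # None
```

Because a failed search returns None, it is usually worth checking the result before calling methods on it.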

## Metacharacters

Metacharacters are special characters that have a unique meaning to the regex matching engine. They are the following:

### The [] metacharacter
Specifies a character class. It matches any single character that is in the class.

In [8]:
# Matches 'ba' followed by a single character that is either 'a' or 'r'
re.search("ba[ar]", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>

In [4]:
# Matches any digit that is between 0 and 9 (0123456789)
re.search("[0-9]", "abc90")

<re.Match object; span=(3, 4), match='9'>

In [5]:
# Matches any lowercase letter between 'a' and 'z'
re.search("[a-z]", "9abc")

<re.Match object; span=(1, 2), match='a'>

In [7]:
# Matches any uppercase letter between 'A' and 'Z'
re.search("[A-Z]", "9aABC")

<re.Match object; span=(2, 3), match='A'>

### The ^ metacharacter
Inside a character class, the ^ metacharacter negates the class: it matches any character that **is not** in the set.

In [9]:
# Matches any character that is NOT between 0 and 9.
re.search("[^0-9]", "12345abc")

<re.Match object; span=(5, 6), match='a'>

*^* must appear as the first character in the class, otherwise it matches a literal '^'.

In [11]:
# Matches a single character that is a digit between 0 and 9 or a literal '^'
re.search("[0-9^]", "Hello^^")

<re.Match object; span=(5, 6), match='^'>
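
The class rules above can be combined. The following is a small sketch (the variable names are only illustrative) that puts several ranges in one class, negates it, and uses a '^' that is not in first position:

```python
import re

# Several ranges can share one class: any hexadecimal digit.
hex_digit = re.search("[0-9a-fA-F]", "xyz7")

# A leading ^ negates the whole class: first character that is NOT a hex digit.
not_hex = re.search("[^0-9a-fA-F]", "7fz")

# A ^ that is not the first character is an ordinary character.
literal_caret = re.search("[a^]", "x^a")

print(hex_digit.group())     # 7
print(not_hex.group())       # z
print(literal_caret.span())  # (1, 2)
```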

### The . metacharacter
Matches any single character except newline.

In [13]:
# Matches a sequence of 'foo' + any character except newline + 'bar'
re.search("foo.bar", "fooxbar")

<re.Match object; span=(0, 7), match='fooxbar'>
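
To make the "except newline" part concrete, here is a short sketch. It also uses re.DOTALL, a standard flag of the re module that is not covered elsewhere in this notebook and that makes '.' match newlines too.

```python
import re

# By default '.' refuses to match the newline between 'foo' and 'bar'.
no_match = re.search("foo.bar", "foo\nbar")

# With re.DOTALL, '.' matches any character, newline included.
with_dotall = re.search("foo.bar", "foo\nbar", flags=re.DOTALL)

print(no_match)            # None
print(with_dotall.span())  # (0, 7)
```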

### Shorthands

- **\\w**: Matches any word character: a letter, a digit or an underscore. For ASCII text it is equivalent to **[a-zA-Z0-9_]**.
- **\\W**: The opposite of \\w. For ASCII text it is equivalent to **[^a-zA-Z0-9_]**.

In [14]:
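# The r prefix marks a raw string (explained below); it keeps Python from interpreting the backslash
re.search(r"\w", "#(.a0_)")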

<re.Match object; span=(3, 4), match='a'>

In [15]:
re.search(r"\W", "#(.a0_)")

<re.Match object; span=(0, 1), match='#'>
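
One caveat worth a quick sketch: with str patterns in Python 3, \w is Unicode-aware by default, so it also matches letters outside a-z. Passing the re.ASCII flag restricts it to the ASCII set listed above.

```python
import re

# Default (Unicode) mode: an accented letter counts as a word character.
unicode_match = re.search(r"\w", "é")

# ASCII mode: \w is limited to [a-zA-Z0-9_], so there is nothing to match.
ascii_match = re.search(r"\w", "é", flags=re.ASCII)

print(unicode_match.group())  # é
print(ascii_match)            # None
```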

- **\\d**: Matches any decimal digit char. For ASCII text it is equivalent to **[0-9]**.
- **\\D**: The opposite of \\d. For ASCII text it is equivalent to **[^0-9]**.

In [16]:
re.search(r"\d", "abc012")

<re.Match object; span=(3, 4), match='0'>

In [17]:
re.search(r"\D", "abc012")

<re.Match object; span=(0, 1), match='a'>

- **\\s**: Matches any whitespace char. Common whitespace chars are \\t, \\n, \\r, \\f and ' ' (the space character).
- **\\S**: The opposite of \\s.

In [20]:
re.search(r"\s", "foo\nbar")

<re.Match object; span=(3, 4), match='\n'>

In [21]:
re.search(r"\S", "foo\nbar")

<re.Match object; span=(0, 1), match='f'>

### Escaping metacharacters using backslash (\\)

You can **escape characters** if you want to include a metacharacter in your regex but you don't want it to carry its special meaning.

- **Backslash (\\)**: Removes the special meaning of a metacharacter.

In [23]:
# Matches a character that is a literal '[' or ']'
re.search(r"[\[\]]", "abc[abc]")

<re.Match object; span=(3, 4), match='['>

In [24]:
# Matches a literal dot (.)
re.search(r"\.", "foo.bar")

<re.Match object; span=(3, 4), match='.'>

The backslash is itself a special character in regex. To match a literal backslash, one option is to escape it twice: once for the Python interpreter and once for the regex parser:

In [29]:
# Matches a literal '\'
s = r"abc\abc"
re.search("\\\\", s) 

<re.Match object; span=(3, 4), match='\\'>

A cleaner option is to write the \<regex\> as a raw string, which suppresses the escaping at the interpreter level:

In [31]:
re.search(r"\\", s)

<re.Match object; span=(3, 4), match='\\'>

**It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.**
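
A small sketch of why both spellings behave the same: after the Python interpreter has processed them, the regular string and the raw string hold exactly the same two characters, and that is all the regex parser ever sees.

```python
import re

s = r"abc\abc"     # search string containing a single backslash

escaped = "\\\\"   # regular string: Python turns it into two backslashes
raw = r"\\"        # raw string: the same two backslashes, typed as-is

print(escaped == raw)            # True
print(len(raw))                  # 2
print(re.search(raw, s).span())  # (3, 4)
```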

### Anchors

An anchor dictates a particular location in the search string where a match must occur. Anchors are zero-width matches, that is, they don't match any actual characters in the search string.

- **^** or **\\A**: Anchors a match to the **start** of the search string. ^ and \\A behave differently in MULTILINE mode: ^ also matches at the start of each line, while \\A still matches only at the start of the string.

In [32]:
# Matches 'foo' only if it's present at the beginning of the search string
re.search("^foo", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [33]:
re.search("^foo", "barfoo")
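
The MULTILINE difference between ^ and \A can be sketched as follows, using the standard re.MULTILINE flag (the variable names are only illustrative):

```python
import re

text = "bar\nfoo"  # 'foo' starts the second line, not the string

# Default mode: ^ only matches at the very start of the string.
default_caret = re.search("^foo", text)

# MULTILINE mode: ^ also matches right after each newline.
multiline_caret = re.search("^foo", text, flags=re.MULTILINE)

# \A ignores MULTILINE and still means "start of the string".
multiline_A = re.search(r"\Afoo", text, flags=re.MULTILINE)

print(default_caret)           # None
print(multiline_caret.span())  # (4, 7)
print(multiline_A)             # None
```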

- **\\$** or **\\Z**: Anchors a match to the **end** of the search string. \\$ also matches just before a single **newline character** at the end of the string. \\$ and \\Z behave differently in MULTILINE mode.

In [36]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar")

<re.Match object; span=(3, 6), match='bar'>

In [38]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar\n")

<re.Match object; span=(3, 6), match='bar'>

In [35]:
re.search("bar$", "barfoo")
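
Here is a short sketch of how the two end anchors differ, both with a trailing newline and in MULTILINE mode (again relying on the standard re.MULTILINE flag):

```python
import re

# A single trailing newline: $ still matches, \Z does not.
dollar = re.search("bar$", "foobar\n")
strict = re.search(r"bar\Z", "foobar\n")

# In MULTILINE mode, $ also matches at the end of every line.
default_line = re.search("foo$", "foo\nbar")
multi_line = re.search("foo$", "foo\nbar", flags=re.MULTILINE)

print(dollar.span())      # (3, 6)
print(strict)             # None
print(default_line)       # None
print(multi_line.span())  # (0, 3)
```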

- **\\b**: Anchors a match to a word boundary i.e. at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]).
- **\\B**: The opposite of \\b.

In [44]:
# Matches 'bar' only if it starts at a word boundary
re.search(r"\bbar", "foo bar qux")

<re.Match object; span=(4, 7), match='bar'>

In [45]:
re.search(r"\bbar", "foobarqux")

In [47]:
# Matches 'bar' only if it does NOT start at a word boundary
re.search(r"\Bbar", "foo bar qux")

In [48]:
re.search(r"\Bbar", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>
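
A common use of \b is wrapping a word on both sides to find it only as a whole word. The sketch below uses re.finditer, which yields every match instead of just the first one. The sample text is made up for illustration.

```python
import re

text = "bar foobar barqux bar"

# \b on both sides: 'bar' must be a complete word.
whole_words = [m.span() for m in re.finditer(r"\bbar\b", text)]

# \B in front: 'bar' must start in the middle of a word.
inside_words = [m.span() for m in re.finditer(r"\Bbar", text)]

print(whole_words)   # [(0, 3), (18, 21)]
print(inside_words)  # [(7, 10)]
```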

### Quantifiers
A quantifier metacharacter immediately follows a portion of a \<regex\> and indicates how many times that portion must occur for the match to succeed.

#### \* quantifier
Matches **zero or more repetitions** of the preceding regex.

In [3]:
# Matches 'foo' followed by zero or more '-' and followed by 'bar'
re.search("foo-*bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [6]:
re.search("foo-*bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### + quantifier
Matches **one or more repetitions** of the preceding regex.

In [7]:
# Matches 'foo' followed by one or more '-' and followed by 'bar'
re.search("foo-+bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [8]:
re.search("foo-+bar", "foobar")

#### ? quantifier
Matches **zero or one repetitions** of the preceding regex.

In [9]:
# Matches 'foo' followed by an optional '-' and followed by 'bar'
re.search("foo-?bar", "foo-bar")

<re.Match object; span=(0, 7), match='foo-bar'>

In [10]:
re.search("foo-?bar", "foo---bar")

In [11]:
re.search("foo-?bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### Non-greedy quantifiers
The quantifier metacharacters \*, + and ? are all **greedy**, meaning they produce the **longest** possible match. If you want the **shortest** possible match instead, then use the **non-greedy** version of them:
- \*?: Non-greedy version of the metacharacter '*'.
- +?: Non-greedy version of the metacharacter '+'.
- ??: Non-greedy version of the metacharacter '?'.

In [13]:
# Greedy match: .* consumes as much as possible
re.search("<.*>", "<foo><bar><qux>")

<re.Match object; span=(0, 15), match='<foo><bar><qux>'>

In [15]:
# Non-greedy match: .*? stops at the first '>'
re.search("<.*?>", "<foo><bar><qux>")

<re.Match object; span=(0, 5), match='<foo>'>
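
The difference is easier to see when collecting every match. As a sketch, re.findall returns all non-overlapping matches: the greedy pattern swallows the whole string in one match, while the non-greedy one yields each tag separately.

```python
import re

text = "<foo><bar><qux>"

greedy = re.findall("<.*>", text)   # longest possible match
lazy = re.findall("<.*?>", text)    # shortest possible matches

print(greedy)  # ['<foo><bar><qux>']
print(lazy)    # ['<foo>', '<bar>', '<qux>']
```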

#### {m} quantifier
Matches exactly ***m* repetitions** of the preceding regex.

In [16]:
# Matches 'foo' followed by 3 '-' and followed by 'bar'
re.search("foo-{3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [17]:
re.search("foo-{3}bar", "foo--bar")

#### {m,n} quantifier
Matches any number of repetitions of the preceding regex **from *m* to *n***, inclusive.

In [18]:
# Matches 'foo' followed by 1, 2 or 3 '-' and followed by 'bar'
re.search("foo-{1,3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [19]:
re.search("foo-{1,3}bar", "foo--bar")

<re.Match object; span=(0, 8), match='foo--bar'>

In [21]:
re.search("foo-{1,3}bar", "foo-----bar")

- Omitting *m* implies m=0.
- Omitting *n* implies no upper bound.

To keep its special meaning, a sequence with curly braces ({}) must fit one of the following patterns: {m}, {m,n}, {m,}, {,n}, where m>=0 and n>=0. Otherwise the braces are matched literally.
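
The open-ended forms can be sketched like this. Note that in Python's re module, {m,} means "m or more" and {,n} means "from zero up to n", and braces that do not fit any of the patterns are treated as literal characters.

```python
import re

# {2,}: two or more '-'
two_or_more = re.search("foo-{2,}bar", "foo-----bar")
too_few = re.search("foo-{2,}bar", "foo-bar")

# {,2}: from zero up to two '-'
up_to_two = re.search("foo-{,2}bar", "foobar")
too_many = re.search("foo-{,2}bar", "foo---bar")

# Braces that are not a valid quantifier are matched literally.
literal_braces = re.search("x{foo}", "x{foo}")

print(two_or_more.span())      # (0, 11)
print(too_few)                 # None
print(up_to_two.span())        # (0, 6)
print(too_many)                # None
print(literal_braces.group())  # x{foo}
```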

#### Non-greedy {m,n}: {m,n}?
{m,n}? is the non-greedy version of {m,n}, meaning it will match as few repetitions as possible.

In [22]:
# Matches 1, 2 or 3 'a' characters in a greedy manner
re.search("a{1,3}", "aaabbbccc")

<re.Match object; span=(0, 3), match='aaa'>

In [23]:
# Matches 1, 2 or 3 'a' characters in a non-greedy manner
re.search("a{1,3}?", "aaabbbccc")

<re.Match object; span=(0, 1), match='a'>
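
One detail that is easy to miss: non-greedy means "as few repetitions as the rest of the pattern allows", not "always the minimum". In the sketch below, the trailing 'b' forces a{1,3}? to expand until a 'b' can follow.

```python
import re

# Nothing follows the quantifier, so one 'a' is enough.
alone = re.search("a{1,3}?", "aaabbb")

# A 'b' must follow, so the quantifier grows to three 'a' characters.
followed = re.search("a{1,3}?b", "aaabbb")

print(alone.group())     # a
print(followed.group())  # aaab
```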