# Regular Expressions in Python

This notebook is a cheatsheet for using regular expressions (regex) in Python.

Source:
- https://realpython.com/regex-python/

Regex interactive helper:
- https://regex101.com/

The first examples use the **search function**:

***re.search(\<regex\>, \<string\>, \<flags\>)***:
Scans a string for the first regex match. The \<flags\> argument is optional.

Let's import regex:

In [2]:
import re

## Metacharacters

Metacharacters are special characters that have a unique meaning to the regex engine. They are the following:

### The [] metacharacter
Specifies a character class. It matches any single character that is in the class.

In [8]:
# Matches any letter that is 'a' or 'r'
re.search("ba[ar]", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>

In [4]:
# Matches any digit that is between 0 and 9 (0123456789)
re.search("[0-9]", "abc90")

<re.Match object; span=(3, 4), match='9'>

In [5]:
# Matches any lowercase letter between 'a' and 'z'.
re.search("[a-z]", "9abc")

<re.Match object; span=(1, 2), match='a'>

In [7]:
# Matches any uppercase letter between 'A' and 'Z'
re.search("[A-Z]", "9aABC")

<re.Match object; span=(2, 3), match='A'>

### The ^ metacharacter
The ^ metacharacter can be used to match any character that **is not** in the set.

In [9]:
# Matches any character that is NOT between 0 and 9.
re.search("[^0-9]", "12345abc")

<re.Match object; span=(5, 6), match='a'>

*^* must appear as the first character in the class, otherwise it matches a literal '^'.

In [11]:
# Matches any character between 0 and 9 or a literal ^
re.search("[0-9^]", "Hello^^")

<re.Match object; span=(5, 6), match='^'>

Other regex metacharacters lose their special meaning inside a character class:

In [107]:
# + and * are metacharacters but they lose their special meaning
# because they are contained inside a character class.
re.search("[+*]", "2+1*1+3")

<re.Match object; span=(1, 2), match='+'>

### The . metacharacter
Matches any single character except newline.

In [13]:
# Matches a sequence of 'foo' + any character except newline + 'bar'
re.search("foo.bar", "fooxbar")

<re.Match object; span=(0, 7), match='fooxbar'>

### Shorthands

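- **\\w**: Matches any word character (alphanumeric or underscore). Equivalent to **[a-zA-Z0-9_]** in ASCII mode; by default, for str patterns, it also matches Unicode letters and digits.
- **\\W**: The opposite of \\w. Equivalent to **[^a-zA-Z0-9_]** in ASCII mode.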

In [14]:
re.search(r"\w", "#(.a0_)")

<re.Match object; span=(3, 4), match='a'>

In [15]:
re.search(r"\W", "#(.a0_)")

<re.Match object; span=(0, 1), match='#'>

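- **\\d**: Matches any decimal digit char. Equivalent to **[0-9]** in ASCII mode; by default, for str patterns, it also matches other Unicode decimal digits.
- **\\D**: The opposite of \\d. Equivalent to **[^0-9]** in ASCII mode.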

In [16]:
re.search(r"\d", "abc012")

<re.Match object; span=(3, 4), match='0'>

In [17]:
re.search("\\D", "abc012")

<re.Match object; span=(0, 1), match='a'>

- **\\s**: Matches any whitespace char. Common whitespace characters are: \\t\\n\\r\\f and ' ' (the space character).
- **\\S**: The opposite of \\s.

In [20]:
re.search(r"\s", "foo\nbar")

<re.Match object; span=(3, 4), match='\n'>

In [21]:
re.search(r"\S", "foo\nbar")

<re.Match object; span=(0, 1), match='f'>

### Escaping metacharacters using backslash (\\)

You can **escape characters** if you want to include a metacharacter in your regex but you don't want it to carry its special meaning. 

- **Backslash (\\)**: Removes the special meaning of a metacharacter.

In [23]:
# Matches a character that is a literal '[' or ']'
re.search(r"[\[\]]", "abc[abc]")

<re.Match object; span=(3, 4), match='['>

In [24]:
# Matches a literal dot (.)
re.search(r"\.", "foo.bar")

<re.Match object; span=(3, 4), match='.'>

The backslash is itself a special character in regex. To match it, one option is to escape it twice, once due to the Python interpreter and once due to the regex parser:

In [29]:
# Matches a literal '\'
s = r"abc\abc"
re.search("\\\\", s) 

<re.Match object; span=(3, 4), match='\\'>

Another cleaner option is to specify the \<regex\> as a raw-string to suppress the escaping at the interpreter level:

In [31]:
re.search(r"\\", s)

<re.Match object; span=(3, 4), match='\\'>

**It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.**
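As a small illustration of why raw strings matter, the sketch below compares a regular and a raw string literal. The names `plain` and `raw` are only illustrative:

```python
import re

# A regular string literal needs doubled backslashes; a raw string does not.
same_pattern = "\\d+" == r"\d+"

# In a regular string, "\b" is the backspace character, not a word boundary.
plain = "\bbar"   # backspace + 'bar'
raw = r"\bbar"    # backslash + 'b' + 'bar'
print(len(plain), len(raw))

# Only the raw version is read by the regex engine as a word boundary.
plain_match = re.search(plain, "foo bar")
raw_match = re.search(raw, "foo bar")
print(plain_match)
print(raw_match)
```

The regular string silently produces a different pattern, which is the kind of bug raw strings help avoid.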

### Anchors

An anchor dictates a particular location in the search string where a match must occur. They are zero-width matches, that is, they don't match any actual characters in the search string.

- **^** or **\\A**: Anchors a match to the **start** of the search string. ^ and \\A behave differently in MULTILINE mode.

In [32]:
# Matches 'foo' only if it's present at the beginning of the search string
re.search("^foo", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [33]:
re.search("^foo", "barfoo")

- **\\$** or **\\Z**: Anchors a match to the **end** of the search string. \\$ also matches just before a **newline character** that ends the string. \\$ and \\Z behave differently in MULTILINE mode.

In [36]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar")

<re.Match object; span=(3, 6), match='bar'>

In [38]:
# Matches 'bar' only if it's present at the end of the search string or followed by a newline
re.search("bar$", "foobar\n")

<re.Match object; span=(3, 6), match='bar'>

In [35]:
re.search("bar$", "barfoo")

- **\\b**: Anchors a match to a word boundary i.e. at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]).
- **\\B**: The opposite of \\b.

In [44]:
# Matches 'bar' only if it starts at a word boundary, i.e. at the beginning of a word.
re.search(r"\bbar", "foo bar qux")

<re.Match object; span=(4, 7), match='bar'>

In [45]:
re.search(r"\bbar", "foobarqux")

In [47]:
# Matches 'bar' only if it does NOT start at a word boundary, i.e. it's preceded by a word character.
re.search(r"\Bbar", "foo bar qux")

In [48]:
re.search(r"\Bbar", "foobarqux")

<re.Match object; span=(3, 6), match='bar'>
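A common use of these anchors is whole-word search. The following sketch, using a made-up sample sentence, counts how often 'cat' appears as a whole word, at the start of a word, and inside a word:

```python
import re

text = "cat concat catalog cat."

# \b on both sides: 'cat' must be a complete word.
whole_words = re.findall(r"\bcat\b", text)

# \b only on the left: 'cat' must start a word ('cat', 'catalog', 'cat.').
word_starts = re.findall(r"\bcat", text)

# \B on the left: 'cat' must be preceded by a word character ('concat').
inside_words = re.findall(r"\Bcat", text)

print(len(whole_words), len(word_starts), len(inside_words))
```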

### Quantifiers
A quantifier metacharacter immediately follows a portion of a \<regex\> and indicates how many times that portion must occur for the match to succeed.

#### \* quantifier
Matches **zero or more repetitions** of the preceding regex.

In [3]:
# Matches 'foo' followed by zero or more '-' and followed by 'bar'
re.search("foo-*bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [6]:
re.search("foo-*bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### + quantifier
Matches **one or more repetitions** of the preceding regex.

In [7]:
# Matches 'foo' followed by one or more '-' and followed by 'bar'
re.search("foo-+bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [8]:
re.search("foo-+bar", "foobar")

#### ? quantifier
Matches **zero or one repetitions** of the preceding regex.

In [9]:
# Matches 'foo' followed by an optional '-' and followed by 'bar'
re.search("foo-?bar", "foo-bar")

<re.Match object; span=(0, 7), match='foo-bar'>

In [10]:
re.search("foo-?bar", "foo---bar")

In [11]:
re.search("foo-?bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### Non-greedy quantifiers
The quantifier metacharacters \*, + and ? are all **greedy**, meaning they produce the **longest** possible match. If you want the **shortest** possible match instead, then use the **non-greedy** version of them:
- \*?: Non-greedy version of the metacharacter '*'.
- +?: Non-greedy version of the metacharacter '+'.
- ??: Non-greedy version of the metacharacter '?'.

In [13]:
# Greedy match
re.search("<.*>", "<foo><bar><qux>")

<re.Match object; span=(0, 15), match='<foo><bar><qux>'>

In [15]:
# Non-greedy match
re.search("<.*?>", "<foo><bar><qux>")

<re.Match object; span=(0, 5), match='<foo>'>
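The difference is easier to see when collecting every match. This sketch uses *re.findall()* on the same string; the negated character class is shown as one possible alternative to the non-greedy form, not as the only way to write it:

```python
import re

html = "<foo><bar><qux>"

# Greedy: '.*' runs to the last '>' so there is a single long match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: '.*?' stops at the first '>' so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)

# Alternative: forbid '>' inside the tag instead of relying on laziness.
negated = re.findall(r"<[^>]*>", html)

print(greedy)
print(lazy)
print(negated)
```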

#### {m} quantifier
Matches exactly ***m* repetitions** of the preceding regex.

In [16]:
# Matches 'foo' followed by 3 '-' and followed by 'bar'
re.search("foo-{3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [17]:
re.search("foo-{3}bar", "foo--bar")

#### {m,n} quantifier
Matches any number of repetitions of the preceding regex **from *m* to *n***, inclusive.

In [18]:
# Matches 'foo' followed by 1, 2 or 3 '-' and followed by 'bar'
re.search("foo-{1,3}bar", "foo---bar")

<re.Match object; span=(0, 9), match='foo---bar'>

In [19]:
re.search("foo-{1,3}bar", "foo--bar")

<re.Match object; span=(0, 8), match='foo--bar'>

In [21]:
re.search("foo-{1,3}bar", "foo-----bar")

- Omitting *m* implies m=0.
- Omitting *n* implies no upper bound.

To preserve its special meaning, a sequence with curly braces ({}) must fit one of the following patterns: {m}, {m,n}, {m,}, {,n}, where m>=0 and n>=0. Otherwise, the braces are treated as literal characters.

#### Non-greedy {m,n}: {m,n}?
{m,n}? is the non-greedy version of {m,n}, meaning it will match as few characters as possible.

In [22]:
# Matches 1, 2 or 3 'a' characters in a greedy manner
re.search("a{1,3}", "aaabbbccc")

<re.Match object; span=(0, 3), match='aaa'>

In [23]:
# Matches 1, 2 or 3 'a' characters in a non-greedy manner
re.search("a{1,3}?", "aaabbbccc")

<re.Match object; span=(0, 1), match='a'>
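The open-ended forms of the curly-brace quantifier can be tried out the same way. A minimal sketch, with variable names chosen only for illustration:

```python
import re

s = "aaaaa"

# {2,}: two or more repetitions (no upper bound), greedy.
at_least_two = re.search(r"a{2,}", s).group()

# {,2}: between zero and two repetitions, greedy.
at_most_two = re.search(r"a{,2}", s).group()

# {2,}?: non-greedy version, stops as soon as the minimum is met.
lazy = re.search(r"a{2,}?", s).group()

# Braces that don't form a valid quantifier are matched literally.
literal = re.search(r"x{a}", "x{a}").group()

print(at_least_two, at_most_two, lazy, literal)
```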

### Grouping & backreferences
Grouping constructs break up a regex into subexpressions or groups. This serves 2 purposes:
- **Grouping**: A group is a single entity. Additional metacharacters apply to the entire group as a unit.
- **Capturing**: Some grouping constructs can also capture the portion of the search string that matches the group regex. The captured matches can be retrieved later. 

To define a regex group we use parentheses: 
*(\<regex\>)*

#### 1st purpose: Treating the group as a unit

In [24]:
# Matches a group formed by the 'bar' sequence and repeated one or more times. 
re.search("(bar)+", "foo barbarbar baz")

<re.Match object; span=(4, 13), match='barbarbar'>

In [25]:
# Matches a group formed by 'foo' and an optional 'bar' repeated one or more times.
re.search("(foo(bar)?)+", "foofoobarfoobar")

<re.Match object; span=(0, 15), match='foofoobarfoobar'>

#### 2nd purpose: Capturing groups
To retrieve the captured part of the search string that matches the group we have two methods:
- **m.groups()**: returns a **tuple** containing all the captured groups from a regex match.
- **m.group(\<n\>)**: returns a **string** containing the *n*-th captured group. ***n* starts at 1**. If ***n=0* or no *n*** is provided, it returns **the entire match**. If **more than one *n*** is specified, it returns **the specified captured groups** as a **tuple**.

where *m* is the match object that *re.search()* returns. 

In [30]:
# Each (\w+) matches a sequence of word characters, 
# and the full regex is formed by 3 of these sequences 
# separated by a comma
m = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,qux")
m.groups()

('foo', 'bar', 'qux')

In [28]:
# Returns the 2nd captured group, which is 'bar'
m.group(2)

'bar'

In [29]:
# Returns the entire match
m.group()

'foo,bar,qux'

In [31]:
# Returns the 1st and 2nd matches
m.group(1,2)

('foo', 'bar')

#### Backreferences
You can match a previously captured group later within the same regex using a **backreference**:
- **\\\<n\>**: Matches the contents of the *n*-th captured group.

Since a backreference contains a **backslash**, it's a good practice to use **raw strings**.

In [34]:
# Matches a number followed by a comma and followed by the same number
# again
re.search(r"([0-9]),\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

In [37]:
# A more complex example
re.search(r"([0-9]),([0-9]),\2,\1", "0,1,2,2,1,3")

<re.Match object; span=(2, 9), match='1,2,2,1'>

In [43]:
# Without using raw strings (not recommended)
re.search("([0-9]),\\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

#### Naming groups
Until now, we have referred to the groups with an integer value *n*. You can name a group too:

- **(?P\<name\>\<regex\>)**: Creates a **named capturing group**. Each \<name\> can only appear **once** per regex.

In [39]:
# Let's give a name to the groups of the previous example: w1, w2 and w3.
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)", "foo,bar,qux")

# And let's access the 2nd group called 'w2'
print(m.group("w2"))

# You can still access the group by number
print(m.group(2))

bar
bar


- **(?P=\<name\>)**: Matches the contents of a **previously captured named group**.

In [41]:
# Again, let's give a name to the captured group of the previous example
re.search("(?P<number>[0-9]),(?P=number)", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

In [42]:
# Again, you can still use the numbers as well
re.search(r"(?P<number>[0-9]),\1", "0,1,2,2,3,4")

<re.Match object; span=(4, 7), match='2,2'>

**The angle brackets (\< and \>) are required** around *name* **when creating a named group** but **not when referring to it later**, either by backreference or by *.group()*.

#### Non-capturing groups
Sometimes, there's no need to capture a group and do something with the value later. It takes some time and memory to capture a group, so if you don't capture unnecessary groups then you may see a slight performance advantage. 

- (?:\<regex\>): Creates a **non-capturing** group.

In [44]:
# Here the second group is not captured 
# (but it still has to appear in the search string)
m = re.search(r"(\w+),(?:\w+),(\w+)", "foo,bar,qux")
m.groups()

('foo', 'qux')

#### Conditional matches
A conditional match matches against one of two specified regexes depending on whether the given group exists:
- **(?(\<n\>)\<yes-regex\>|\<no-regex\>)**: Matches against **\<yes-regex\> if the group numbered \<n\> exists** (that is, it took part in the match). **Otherwise**, it matches against **\<no-regex\>**.
- **(?(\<name\>)\<yes-regex\>|\<no-regex\>)**: Same as above except the group is referred by its *name* instead of its number.

In [56]:
# Matches '###foobar' because the '###' group exists.
re.search("^(###)?foo(?(1)bar|baz)", "###foobar")

<re.Match object; span=(0, 9), match='###foobar'>

In [59]:
# The '###' group doesn't exist, so the regex to match is 'foobaz' and 'foobar' doesn't match.
re.search("^(###)?foo(?(1)bar|baz)", "foobar")

In [60]:
# Same as above but using a named group 'ch'
re.search("^(?P<ch>###)?foo(?(ch)bar|baz)", "###foobar")

<re.Match object; span=(0, 9), match='###foobar'>
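One situation where a conditional match can help is requiring a closing delimiter only when an opening one was present. The sketch below is an assumption-laden toy: it accepts a word that is either unquoted or wrapped in double quotes, and the \<no-regex\> branch is left out (which is allowed and means "match nothing"):

```python
import re

# Group 1 is an optional opening quote; a closing quote is required
# only if group 1 took part in the match.
quoted_word = re.compile(r'^(")?\w+(?(1)")$')

results = {s: quoted_word.search(s) is not None
           for s in ['"foo"', 'foo', '"foo', 'foo"']}
print(results)
```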

### Lookahead & lookbehind assertions

Lookahead and lookbehind assertions determine a match based on what is **just behind (to the left)** or **ahead (to the right)** of the parser's current position in the search string. 

Lookahead and lookbehind assertions are **zero-width**, meaning they don't consume any of the search string. Also, **they don't capture** what they match. 

#### Lookahead
- **(?=\<lookahead_regex\>)**: **Positive lookahead**. Asserts that **what follows** the regex parser's current position **must match *\<lookahead_regex\>***. 
- **(?!\<lookahead_regex\>)**: **Negative lookahead**. Asserts that **what follows** the regex parser's current position **must NOT match *\<lookahead_regex\>***. 

In [66]:
# Matches 'foo' if it's followed by a lowercase alphabetic character.
# Note that the 'b' that follows 'foo' is not part of the match,
# nor is it captured.
re.search("foo(?=[a-z])", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [63]:
re.search("foo(?=[a-z])", "foo123")

In [62]:
# Matches 'foo' if it is not followed by a lowercase alphabetic character
re.search("foo(?![a-z])", "foobar")

In [64]:
re.search("foo(?![a-z])", "foo123")

<re.Match object; span=(0, 3), match='foo'>

#### Lookbehind

- **(?\<=\<lookbehind_regex\>)**: **Positive lookbehind**. Asserts that **what precedes** the regex parser's current position **must match *\<lookbehind_regex\>***. 
- **(?\<!\<lookbehind_regex\>)**: **Negative lookbehind**. Asserts that **what precedes** the regex parser's current position **must NOT match *\<lookbehind_regex\>***. 

In [67]:
# Matches 'bar' if it's preceded by 'foo'
re.search("(?<=foo)bar", "foobar")

<re.Match object; span=(3, 6), match='bar'>

In [68]:
re.search("(?<=foo)bar", "quxbar")

In [69]:
# Matches 'bar' if it is not preceded by 'foo'
re.search("(?<!foo)bar", "foobar")

In [70]:
re.search("(?<!foo)bar", "quxbar")

<re.Match object; span=(3, 6), match='bar'>

There’s a restriction on lookbehind assertions that doesn’t apply to lookahead assertions: **The \<lookbehind_regex\> in a lookbehind assertion must specify a match of fixed length.** For instance, quantifiers like '\*', '+' or '?' are not allowed in lookbehind assertions.
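The fixed-length restriction can be checked directly. In this sketch a variable-width lookbehind is rejected at compile time, while a fixed-width one built with {m} is accepted; the flag name `variable_width_ok` is only illustrative:

```python
import re

# 'a+' has no fixed length, so compiling this lookbehind raises re.error.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# '\d{3}' always matches exactly three characters, so it is allowed.
m = re.search(r"(?<=\d{3})px", "100px")
print(m)
```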

### Miscellaneous metacharacters

There are some metacharacters that don't fall into any of the previous categories.

#### Writing comments
**(?#...)**: Specifies a **comment** inside the regex ('...' must be replaced by the comment).

In [71]:
re.search("foo(?# This is a comment)bar", "foobar")

<re.Match object; span=(0, 6), match='foobar'>

#### Vertical bar or pipe (|)

**|** specifies a set of alternatives to match.

An expression like *\<regex1\>|\<regex2\>|...|\<regexn\>* matches **at most one** of the specified regexes. 

In [72]:
# Matches 'foo' or 'bar' or 'baz'
re.search("foo|bar|baz", "foo")

<re.Match object; span=(0, 3), match='foo'>

This operation is **non-greedy**, meaning the regex parser looks at the alternatives in **left-to-right** order and returns the first match it finds:

In [74]:
# Matches 'foo' as it's the first match it finds
re.search("foo|grault", "foograult")

<re.Match object; span=(0, 3), match='foo'>
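The left-to-right rule matters most when one alternative is a prefix of another. A short sketch showing that the order of the alternatives changes the result:

```python
import re

# 'foo' is tried first and succeeds, so 'foobar' is never attempted.
short_first = re.search(r"foo|foobar", "foobar").group()

# Putting the longer alternative first lets it win.
long_first = re.search(r"foobar|foo", "foobar").group()

print(short_first, long_first)
```

When alternatives overlap like this, listing the longest one first is a common way to get the longest match.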

## Flags 

The *search* function as well as others support an optional \<flags\> argument. Flags modify regex parsing behavior.

### re.I or re.IGNORECASE
Makes matching **case insensitive**.

In [75]:
# Matches one or more 'a' in a case sensitive manner
re.search("a+", "aaaAAA")

<re.Match object; span=(0, 3), match='aaa'>

In [77]:
# Matches one or more 'a' in a case insensitive manner
re.search("a+", "aaaAAA", re.IGNORECASE)

<re.Match object; span=(0, 6), match='aaaAAA'>

### re.M or re.MULTILINE
Modifies the start-of-string (^) and end-of-string (\\$) anchors to **match at embedded newlines too**.
- ^: matches at the beginning of the string or immediately following a newline. 
- \\$: matches at the end of the string or immediately preceding a newline.

MULTILINE **doesn't have any effect on the \\A and \\Z anchors.**

In [78]:
# Matches 'foo' only if it's at the beginning of the string
re.search("^foo", "foobar")

<re.Match object; span=(0, 3), match='foo'>

In [79]:
re.search("^foo", "\nfoo\nbar")

In [80]:
# Matches 'foo' only if it's at the beginning of the string 
# or preceded by a newline
re.search("^foo", "\nfoo\nbar", re.MULTILINE)

<re.Match object; span=(1, 4), match='foo'>
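To see how MULTILINE affects ^ and \\$ but not \\A and \\Z, the following sketch runs all four anchors over the same three-line string:

```python
import re

text = "foo\nbar\nbaz"

# ^ and $ match at every line start / line end in MULTILINE mode.
caret = re.findall(r"^\w+", text, re.MULTILINE)
dollar = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z keep matching only at the very start / very end of the string.
start_only = re.findall(r"\A\w+", text, re.MULTILINE)
end_only = re.findall(r"\w+\Z", text, re.MULTILINE)

print(caret, dollar, start_only, end_only)
```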

### re.S or re.DOTALL
Causes the dot (.) metacharacter to match a newline too.

In [81]:
# Without the DOTALL flag
re.search("foo.bar", "foo\nbar")

In [82]:
# With the DOTALL flag
re.search("foo.bar", "foo\nbar", re.DOTALL)

<re.Match object; span=(0, 7), match='foo\nbar'>

### re.X or re.VERBOSE
Allows inclusion of **whitespace** and **comments** within a regex. 

Using VERBOSE:
- The regex parser **ignores all whitespace** unless it's inside a character class or escaped with a backslash (\\).
- The regex parser **ignores the '#' character and all characters to the right of it** unless it's contained in a character class or escaped with a backslash (\\).

The purpose of VERBOSE is to make regexes more readable.

In [83]:
# Let's try to find Gmail and Outlook email addresses in a text.
# The dot is escaped so that it matches a literal '.' rather than any character.
re.search(r"[a-zA-Z0-9]+@(?:gmail|outlook)\.com", "My email is example@gmail.com")

<re.Match object; span=(12, 29), match='example@gmail.com'>

In [91]:
# Let's use verbose to add comments
regex = r"""
          [a-zA-Z0-9]+       # username
          @                  # @ sign
          (?:gmail|outlook)  # supports 'gmail' or 'outlook' emails (could be extended in the future)
          \.com              # literal '.com' extension (could be extended in the future)
          """
re.search(regex, "My email is example@gmail.com", re.VERBOSE)

<re.Match object; span=(12, 29), match='example@gmail.com'>

Note that **triple quoting** makes it particularly convenient to include embedded newlines, which qualify as ignored whitespace in VERBOSE mode.
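Because VERBOSE drops unescaped whitespace and treats '#' as a comment marker, a literal space or '#' needs special care. A small sketch of the options described above:

```python
import re

# The space is ignored, so this pattern is effectively 'foobar'.
ignored = re.search(r"foo bar", "foo bar", re.VERBOSE)

# Escaping the space, or putting it in a character class, keeps it.
escaped = re.search(r"foo\ bar", "foo bar", re.VERBOSE)
in_class = re.search(r"foo[ ]bar", "foo bar", re.VERBOSE)

# Inside a character class, '#' is a literal character, not a comment marker.
hash_literal = re.search(r"[#]\d+  # an issue number", "see #42", re.VERBOSE)

print(ignored, escaped, in_class, hash_literal)
```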

### re.DEBUG
Displays debugging information.

In [92]:
# Interpreting '.' as "any"
re.search("foo.bar", "foo-bar", re.DEBUG)

LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114

 0. INFO 12 0b1 7 7 (to 13)
      prefix_skip 3
      prefix [0x66, 0x6f, 0x6f] ('foo')
      overlap [0, 0, 0]
13: LITERAL 0x66 ('f')
15. LITERAL 0x6f ('o')
17. LITERAL 0x6f ('o')
19. ANY
20. LITERAL 0x62 ('b')
22. LITERAL 0x61 ('a')
24. LITERAL 0x72 ('r')
26. SUCCESS


<re.Match object; span=(0, 7), match='foo-bar'>

In [96]:
# Interpreting {2, 4} as a quantifier
re.search("x[123]{2,4}y", "x222y", re.DEBUG)

LITERAL 120
MAX_REPEAT 2 4
  IN
    LITERAL 49
    LITERAL 50
    LITERAL 51
LITERAL 121

 0. INFO 8 0b1 4 6 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. REPEAT_ONE 10 2 4 (to 22)
15.   IN 5 (to 21)
17.     RANGE 0x31 0x33 ('1'-'3')
20.     FAILURE
21:   SUCCESS
22: LITERAL 0x79 ('y')
24. SUCCESS


<re.Match object; span=(0, 5), match='x222y'>

In [94]:
# Interpreting '{foo}' as literal "{", "f", "o", "o", "}"
re.search("x[123]{foo}y", "x222y", re.DEBUG)

LITERAL 120
IN
  LITERAL 49
  LITERAL 50
  LITERAL 51
LITERAL 123
LITERAL 102
LITERAL 111
LITERAL 111
LITERAL 125
LITERAL 121

 0. INFO 8 0b1 8 8 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. IN 5 (to 17)
13.   RANGE 0x31 0x33 ('1'-'3')
16.   FAILURE
17: LITERAL 0x7b ('{')
19. LITERAL 0x66 ('f')
21. LITERAL 0x6f ('o')
23. LITERAL 0x6f ('o')
25. LITERAL 0x7d ('}')
27. LITERAL 0x79 ('y')
29. SUCCESS


### re.A or re.ASCII, re.U or re.UNICODE, re.L or re.LOCALE

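Specify the character set used when classifying characters for \\w, \\W, \\b, \\B, \\d, \\D, \\s, \\S and case-insensitive matching:
- **re.U or re.UNICODE**: use Unicode rules (the default for str patterns).
- **re.A or re.ASCII**: restrict matching to ASCII characters.
- **re.L or re.LOCALE**: make the decision based on the current locale. This is not recommended since the locale mechanism is unreliable, and it only works with bytes patterns.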

The ASCII and LOCALE flags are available in case you need them for special circumstances. But in general, **the best strategy is to use the default Unicode encoding. This should handle any world language correctly.**
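The practical effect of these flags shows up in the shorthand classes. A brief sketch comparing the default Unicode behavior with re.ASCII; the sample strings are written with escape sequences only to keep the source plain ASCII:

```python
import re

word = "caf\u00e9"                # 'café'
digits = "\u0661\u0662\u0663"     # Arabic-Indic digits one, two, three

# Default (Unicode): accented letters count as word characters,
# and non-ASCII decimal digits count as digits.
unicode_words = re.findall(r"\w+", word)
unicode_digits = re.fullmatch(r"\d+", digits)

# re.ASCII: only [a-zA-Z0-9_] and [0-9] are recognized.
ascii_words = re.findall(r"\w+", word, re.ASCII)
ascii_digits = re.fullmatch(r"\d+", digits, re.ASCII)

print(unicode_words, ascii_words)
```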

## Combining \<flags\>

You can combine multiple flags using the '|' operator:

In [97]:
# Combine IGNORECASE and MULTILINE flags
re.search("^bar", "foo\nBaR\nbaz", re.IGNORECASE|re.MULTILINE)

<re.Match object; span=(4, 7), match='BaR'>

### Specifying flags within a regex
You can modify flag values within a regex too.
- **(?\<flags\>)**: Sets the flag values for the duration of a regex. It should be placed at the beginning of the regex.

\<flags\> must be one or more of these letters:
- a: Equals to re.A or re.ASCII.
- i: Equals to re.I or re.IGNORECASE.
- L: Equals to re.L or re.LOCALE.
- m: Equals to re.M or re.MULTILINE.
- s: Equals to re.S or re.DOTALL.
- u: Equals to re.U or re.UNICODE.
- x: Equals to re.X or re.VERBOSE.

In [98]:
# Setting IGNORECASE and MULTILINE within a regex
re.search("(?im)bar", "foo\nbAr\nqux")  # same as re.I|re.M

<re.Match object; span=(4, 7), match='bAr'>

- **(?\<set_flags\>-\<remove_flags\>:\<regex\>)**: **Sets or removes flag values** for the duration of a group. It defines a **non-capturing group** that matches against \<regex\>. For that group, the parser sets any flags specified in \<set_flags\> and clears any flags specified in \<remove_flags\>. 

In [104]:
# Matches 'foo' case insensitive followed by 'bar'
m = re.search("(?i:foo)bar", "fOObar")

In [105]:
# The group 'foo' is not captured.
m.groups()

()

In [100]:
re.search("(?i:foo)bar", "fOObAr")

In [101]:
# Turn off the IGNORECASE flag for 'foo'
re.search("(?-i:foo)bar", "fooBaR", re.IGNORECASE)

<re.Match object; span=(0, 6), match='fooBaR'>

In [102]:
re.search("(?-i:foo)bar", "fOOBaR", re.IGNORECASE)

## Re functions

The available regex functions fall into these categories:
- Searching functions
- Substitution functions
- Utility functions

### Searching functions
Scan a search string for matches. 

- **re.search(\<regex>, \<string>, flags=0)**

Scans a string for **1** match. 

- **re.match(\<regex>, \<string>, flags=0)**

Looks for a regex match **at the beginning of a string.**

Some considerations:
1. The caret '^' metacharacter is not necessary here. 
2. The re.MULTILINE flag doesn't affect re.match().

In [2]:
# Matches only if 'foo' is at the beginning
re.match("foo", "foo\nbar\nbaz")

<re.Match object; span=(0, 3), match='foo'>

In [4]:
#  Since 'bar' is not at the beginning, it doesn't match
re.match("bar", "foo\nbar\nbaz")

In [5]:
# Multiline mode doesn't affect here
re.match("bar", "foo\nbar\nbaz", re.MULTILINE)

- **re.fullmatch(\<regex>, \<string>, flags=0)**

Looks for a regex match on an entire string.

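This is roughly equivalent to using *re.search* with the regex anchored by '\\A' and '\\Z'. ('^' and '$' come close, except that the end anchor also matches just before a trailing newline.)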

In [6]:
# Matches the entire string
re.fullmatch(r"\d+", "123456789")

<re.Match object; span=(0, 9), match='123456789'>

In [7]:
# It doesn't match since the string is not entirely digits
re.fullmatch(r"\d+", "123456789abcdef")
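
One place where *re.fullmatch()* and an anchored *re.search()* can differ is a trailing newline. The sketch below is a minimal check of that edge case:

```python
import re

s = "123\n"

# The end-of-string anchor also matches just before a final newline.
anchored = re.search(r"^\d+$", s)

# fullmatch requires the pattern to cover the whole string, newline included.
full = re.fullmatch(r"\d+", s)

# \A and \Z are strict, so they agree with fullmatch here.
strict = re.search(r"\A\d+\Z", s)

print(anchored, full, strict)
```

This is worth keeping in mind when validating input that may end with a newline.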

- **re.findall(\<regex>, \<string>, flags=0)**

Returns a **list** of all matches of a regex in a string, from left to right.

In [3]:
re.findall(r"\w+", ",,foo,,bar:baz")

['foo', 'bar', 'baz']

If the \<regex\> contains a **capturing group**, the return list contains **only contents of the group**. If \<regex\> contains more than one capturing group, it returns **a list of tuples** with each captured group.

In [5]:
# Returns the contents of the capturing group
re.findall(r"#(\w+)#", "#foo##bar##baz#")

['foo', 'bar', 'baz']

In [7]:
# Returns a list of tuples with the contents of each captured groups
# The length of the tuple is the same as the number of capturing groups
re.findall(r"(\w+),(\w+),(\w+)", "foo,bar,baz,a,b,c")

[('foo', 'bar', 'baz'), ('a', 'b', 'c')]

- **re.finditer(\<regex>, \<string>, flags=0)**

Returns an **iterator** that yields the found regex **matches**, from left to right.

It's similar to *re.findall()* but:
1. *re.finditer()* returns an iterator.
2. Each item is a **match object**, not a str as with *re.findall()*.

In [13]:
it = re.finditer("[0-9]", "a1b2c3")

# You can access the results with 'next()'
print(next(it))
print(next(it))
print(next(it))
print(next(it))  # raises StopIteration

<re.Match object; span=(1, 2), match='1'>
<re.Match object; span=(3, 4), match='2'>
<re.Match object; span=(5, 6), match='3'>


StopIteration: 

In [14]:
# You can also loop over the matches
it = re.finditer("[0-9]", "a1b2c3")

for match in it:
    print(match)

<re.Match object; span=(1, 2), match='1'>
<re.Match object; span=(3, 4), match='2'>
<re.Match object; span=(5, 6), match='3'>
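Since *re.finditer()* yields match objects, named groups and positions are available for every match. A sketch that parses made-up `key=value` pairs; the pattern and variable names are illustrative assumptions:

```python
import re

pair_re = r"(?P<key>\w+)=(?P<value>\d+)"
text = "a=1, b=22, c=333"

# Build a dict from the named groups of each match.
pairs = {m.group("key"): int(m.group("value"))
         for m in re.finditer(pair_re, text)}

# Match objects also expose where each match was found.
spans = [m.span() for m in re.finditer(pair_re, text)]

print(pairs)
print(spans)
```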


### Substitution functions

Replace portions of a search string that match a specified regex.

- **re.sub(\<regex>, \<repl>, \<string>, count=0, flags=0)**

Returns **a new string** that results from performing replacement on a search string. **The original string remains unchanged**.

The argument \<repl> can be:
- **A string**
- **A function**

#### Substitution by string


In [4]:
# Replaces any digit with a '#'
re.sub(r"\d", "#", "foo123bar456")

'foo###bar###'

If there are captured groups on the \<regex\> expression, you can **backreference them in \<repl>** with the **number** (using the **\\\<n>** or **\\g\<n>** notation) or the **name** (using the **\\g\<name>** notation) of the group.

In [17]:
# Captures the first and the last words separated by a comma
# and changes their order
re.sub(r"(\w+),bar,baz,(\w+)", r"\2,bar,baz,\1", "foo,bar,baz,qux")

'qux,bar,baz,foo'

In [18]:
# Same as above but using group naming
re.sub(r"(?P<w1>\w+),bar,baz,(?P<w2>\w+)", r"\g<w2>,bar,baz,\g<w1>", "foo,bar,baz,qux")

'qux,bar,baz,foo'

In [20]:
# \g<n> can be used with numbered groups as well.
# This avoids ambiguity when a numbered backreference is
# followed by a literal digit.
re.sub(r"(\w+),bar,baz,(\w+)", r"\g<2>,bar,baz,\g<1>", "foo,bar,baz,qux")

'qux,bar,baz,foo'
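The ambiguity mentioned in the comment above can be reproduced with a single group followed by a literal '0'. In this sketch the flag `ambiguous_ok` is just an illustrative name:

```python
import re

# \g<1>0 means "group 1, then a literal 0".
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)

# \10 is read as a reference to group 10, which doesn't exist here,
# so the replacement template is rejected.
try:
    re.sub(r"(\d)", r"\10", "a1b2")
    ambiguous_ok = True
except re.error as exc:
    ambiguous_ok = False
    print("rejected:", exc)
```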

As with groups, **\\g<0> refers to the text of the entire match**, even when there are no groups in the \<regex>.

In [21]:
re.sub(r"\d+", r"#\g<0>#", "foo123bar")

'foo#123#bar'

If \<regex> can produce a **zero-length match**, \<repl> is inserted at every position where an empty match occurs.

In [22]:
re.sub("x?", "-", "foo")

'-f-o-o-'

The argument **count** is used to specify **how many replacements** you want to perform. 

In [5]:
# Replaces any digit with a '#' 
# with a max of 3 replacements
re.sub(r"\d", "#", "foo123bar456", count=3)

'foo###bar456'

#### Substitution by function

The function specified as \<repl> is called for each match found and the return value becomes the replacement.

In [2]:
# Define some function
def f(match):
    s = match.group(0)
    return s.upper()

# And use it in a regex
re.sub(r"\w+", f, "foo123bar")

'FOO123BAR'

In [3]:
# With lambdas
re.sub(r"\w+", lambda x: x.group().upper(), "foo123bar")

'FOO123BAR'
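A replacement function is most useful when the new text has to be computed from the match. As a sketch, the hypothetical helper `double` below doubles every number found in a sentence:

```python
import re

def double(match):
    # The match text is a string, so convert, compute, and convert back.
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)
```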

- **re.subn(\<regex>, \<repl>, \<string>, count=0, flags=0)**

Returns a **tuple** with the **new string** that results from performing the replacements on a search string and with **the number of substitutions made.**

In [6]:
# Performs 6 substitutions on digits
re.subn(r"\d", "#", "foo123bar456")

('foo###bar###', 6)

### Utility functions

There are two more utility functions in the re module.

- **re.split(\<regex>, \<string>, maxsplit=0, flags=0)**

Splits a string into a **list** of substrings **using \<regex> as the delimiter.**

In [7]:
# Splits words by ',', ';' or '/'
re.split("[,;/]", "foo,bar;qux/foo")

['foo', 'bar', 'qux', 'foo']

If \<regex> contains **capturing groups**, then the return list **includes the matching delimiters** as well:

In [8]:
re.split("([,;/])", "foo,bar;qux/foo")

['foo', ',', 'bar', ';', 'qux', '/', 'foo']

If you need groups but you don't want the delimiters to be returned, you can use **non-capturing groups**:

In [9]:
re.split("(?:[,;/])", "foo,bar;qux/foo")

['foo', 'bar', 'qux', 'foo']

The **\<maxsplit>** argument can be used to specify the maximum number of splits to do. The last element is the remainder of string.

In [10]:
re.split("[,;/]", "foo,bar;qux/foo", maxsplit=2)

['foo', 'bar', 'qux/foo']

If \<regex> matches **the start of the string**, then **an empty string is returned as the first element** of the list. The same happens when regex matches **the end of the string.**

In [13]:
re.split("/", "/foo/bar/qux/foo/")

['', 'foo', 'bar', 'qux', 'foo', '']

- **re.escape(\<regex>)**

Escapes characters in a regex. That means it returns a copy of \<regex> with each character that has a special meaning in a regex preceded by a backslash (\\).

In [14]:
# Without escaping metacharacters
re.match("foo(baz)|qux", "foo(baz)|qux")

In [15]:
# Escaping metacharacters manually
re.match(r"foo\(baz\)\|qux", "foo(baz)|qux")

<re.Match object; span=(0, 12), match='foo(baz)|qux'>

In [17]:
# Escaping metacharacters using 're.escape'
re.match(re.escape("foo(baz)|qux"), "foo(baz)|qux")

<re.Match object; span=(0, 12), match='foo(baz)|qux'>
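*re.escape()* is typically useful when the text to search for is not known in advance, for example when it comes from user input. A sketch under that assumption, with a made-up literal and a made-up list of terms:

```python
import re

literal = "1+1=2 (approx.)"
text = "so 1+1=2 (approx.) holds"

# Used as-is, '+', '(', ')' and '.' are read as metacharacters.
unescaped = re.search(literal, text)

# Escaped, the whole string is matched literally.
escaped = re.search(re.escape(literal), text)

# The same idea builds a safe alternation from a list of literal terms.
terms = ["a.b", "c+d"]
any_term = re.compile("|".join(re.escape(t) for t in terms))
found = any_term.findall("a.b axb c+d ccd")

print(unescaped, escaped, found)
```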

### Compiled Regex 

You can precompile a regex into a **regular expression object** that can be repeatedly used later.

**re.compile(\<regex>, flags=0)**

Compiles a regex into a regular expression object.

There are 2 ways of using a compiled regex:

In [18]:
# 1. Specify it as argument
re_obj = re.compile(r"\d+")
re.search(re_obj, "abc123")

<re.Match object; span=(3, 6), match='123'>

In [19]:
# 2. Invoke the method directly from it
re_obj = re.compile(r"\d+")
re_obj.search("abc123")

<re.Match object; span=(3, 6), match='123'>

What are the **advantages** of precompiling regexes?
- If you use a particular regex **frequently** in your code, defining it once and reusing it enhances **modularity** and **maintainability**.

However:
- You might expect precompilation to result in faster execution time as well. In practice, though, that isn’t the case because re **caches** a regex and it isn't recompiled if it's used subsequently. 
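One way to get the modularity benefit is to define the compiled regex once, as a named constant, and share it between functions. The constant `HEX_COLOR_RE` and both helper functions below are hypothetical names used only for this sketch:

```python
import re

# Compiled once and given a descriptive name; the flags live with the pattern.
HEX_COLOR_RE = re.compile(r"#[0-9a-f]{6}\b", re.IGNORECASE)

def find_colors(css):
    """Return every six-digit hex color found in a CSS snippet."""
    return HEX_COLOR_RE.findall(css)

def is_color(token):
    """Check whether a token is exactly one six-digit hex color."""
    return HEX_COLOR_RE.fullmatch(token) is not None

colors = find_colors("color: #FF00aa; background: #123456;")
print(colors, is_color("#ABCDEF"), is_color("#abc"))
```

If the pattern ever needs to change, there is a single place to edit.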

#### Functions using a compiled regex
You can use the previous functions with a compiled regex (referred to here as *re_obj*):

- **re_obj.search(\<string>[,\<pos>[,\<endpos>]])**
- **re_obj.match(\<string>[,\<pos>[,\<endpos>]])**
- **re_obj.fullmatch(\<string>[,\<pos>[,\<endpos>]])**
- **re_obj.findall(\<string>[,\<pos>[,\<endpos>]])**
- **re_obj.finditer(\<string>[,\<pos>[,\<endpos>]])**
- **re_obj.sub(\<repl>, \<string>, count=0)**
- **re_obj.subn(\<repl>, \<string>, count=0)**
- **re_obj.split(\<string>, maxsplit=0)**

They behave the same, but some of them also support the optional **\<pos> and \<endpos> parameters.** If they are present, then **the search only applies to that portion of the string.** They work like the indices in slice notation: the search covers the characters from \<pos> up to, but not including, \<endpos>.

In [22]:
re_obj = re.compile(r"\d+")

print(re_obj.search("abc123", pos=4, endpos=5))

# If 'endpos' is omitted, the search applies from
# 'pos' to the end of the string
print(re_obj.search("abc123", pos=4))

<re.Match object; span=(4, 5), match='2'>
<re.Match object; span=(4, 6), match='23'>


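The **^** metacharacter still refers to the real start of the string, not to \<pos>. By contrast, \<endpos> makes the search behave as if the string ended there, so **$** can match at \<endpos>.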
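
The difference between passing \<pos> and slicing the string first shows up with the *^* anchor. A short sketch (the string and pattern are illustrative):

```python
import re

s = "foobar"
re_obj = re.compile(r"^bar")

# '^' is anchored to the real start of the string, not to 'pos',
# so nothing is found even though "bar" starts at index 3.
with_pos = re_obj.search(s, 3)
print(with_pos)  # None

# Slicing creates a new string, so '^' matches at its start.
with_slice = re_obj.search(s[3:])
print(with_slice.span())  # (0, 3)
```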

#### Attributes of compiled regex

The compiled regex objects have some useful attributes:

In [31]:
# Let's define a compiled regex
re_obj = re.compile("^(?P<zero_to_five>[0-5]+)(?P<six_to_nine>[6-9]+)", re.MULTILINE)

In [32]:
# 1. Accessing the flags
re_obj.flags  # 40 = re.MULTILINE (8) | re.UNICODE (32, set by default for str patterns)

40
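
The integer is easier to read when it is converted back into named flags. A small sketch, assuming a Python version where `re.RegexFlag` is available (the pattern is illustrative):

```python
import re

re_obj = re.compile(r"^\d+", re.MULTILINE)

# Wrapping the integer in re.RegexFlag gives a readable view of it.
# The exact text printed depends on the Python version.
flags = re.RegexFlag(re_obj.flags)
print(flags)

# Individual flags can be tested with the bitwise AND operator.
has_multiline = bool(re_obj.flags & re.MULTILINE)
has_ignorecase = bool(re_obj.flags & re.IGNORECASE)
print(has_multiline)   # True
print(has_ignorecase)  # False
```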

In [33]:
# 2. Accessing the regex pattern
re_obj.pattern

'^(?P<zero_to_five>[0-5]+)(?P<six_to_nine>[6-9]+)'

In [34]:
# 3. Accessing the number of capturing groups
re_obj.groups

2

In [39]:
# 4. Get the mapping of each group name to its corresponding number
print(re_obj.groupindex)
print(type(re_obj.groupindex))

# 'mappingproxy' behaves like a read-only dictionary
print(re_obj.groupindex["zero_to_five"])

{'zero_to_five': 1, 'six_to_nine': 2}
<class 'mappingproxy'>
1
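
Since *groupindex* maps names to numbers, it can be inverted when the name of a given group number is needed. A minimal sketch (`names_by_number` is an illustrative name); it also shows that the mapping can't be modified:

```python
import re

re_obj = re.compile(r"^(?P<zero_to_five>[0-5]+)(?P<six_to_nine>[6-9]+)")

# Invert the name -> number mapping to look up a name by group number.
names_by_number = {num: name for name, num in re_obj.groupindex.items()}
print(names_by_number)  # {1: 'zero_to_five', 2: 'six_to_nine'}

# The mapping is read-only: item assignment raises TypeError.
read_only = False
try:
    re_obj.groupindex["other"] = 3
except TypeError:
    read_only = True
print(read_only)  # True
```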


### The Match object

#### Match object methods

There are some methods you can use with a *match* object.

- **match.group([\<group1>,...])**

Returns the specified captured groups from a match. If no arguments are provided or the argument is 0, then the entire match is returned.

In [42]:
# With numbered groups
m = re.search(r"(\w+),(\w+)", "foo,bar")
m.group(1)

'foo'

In [43]:
# With named groups
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")
m.group("w2")

'bar'

If there's a **nonparticipating group**, *match.group()* returns **None** for it:

In [45]:
m = re.search(r"(\w+),(\w+)?", "foo,")
m.group(1, 2)

('foo', None)

If a group participates **multiple times**, then **only the last match** is returned:

In [48]:
m = re.search(r"(\w{3},)+", "foo,bar,baz,")
m.group(1)

'baz,'
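
When every repetition is needed rather than only the last one, one possible approach is to match the repeated unit on its own with *re.findall()*. A sketch:

```python
import re

text = "foo,bar,baz,"

# A repeated group only remembers its last iteration...
last_only = re.search(r"(\w{3},)+", text).group(1)
print(last_only)  # baz,

# ...so match the repeated unit by itself to collect every occurrence.
all_items = re.findall(r"\w{3},", text)
print(all_items)  # ['foo,', 'bar,', 'baz,']
```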

- **match.__getitem__(\<grp>)**

Returns a captured group from a match.

This magic method behaves the same way **as *match.group()*** with a single argument. It is what lets you access groups **by index**, as in *m[1]*, which is just **syntactic sugar**.

In [51]:
m = re.search(r"(\w+),(\w+)", "foo,bar")
print(m.group(1))
print(m.__getitem__(1))
print(m[1])

foo
foo
foo


This works **with named groups too**:

In [52]:
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")
print(m.group("w1"))
print(m.__getitem__("w1"))
print(m["w1"])

foo
foo
foo
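
Two edge cases of the index syntax, sketched below: index 0 gives the entire match, and a group that the regex doesn't define raises *IndexError* (the name `w3` is deliberately one that doesn't exist in the pattern):

```python
import re

m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")

# m[0] is the whole match, just like m.group(0).
whole = m[0]
print(whole)  # foo,bar

# Asking for a group that the pattern doesn't define raises IndexError.
missing = False
try:
    m["w3"]
except IndexError:
    missing = True
print(missing)  # True
```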


- **match.groups(default=None)**

Returns **a tuple** with all captured groups from a match.

In [53]:
m = re.search(r"(\w+),(\w+)", "foo,bar")
m.groups()

('foo', 'bar')

Again, it returns **'None'** for **nonparticipating groups**. You can specify a **default value** in this situation using the ***default* argument.**

In [54]:
m = re.search(r"(\w+),(\w+)?", "foo,")
print(m.groups())
print(m.groups(default="---"))

('foo', None)
('foo', '---')


- **match.groupdict(default=None)**

Returns a dictionary of named captured groups.

In [55]:
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")
m.groupdict()

{'w1': 'foo', 'w2': 'bar'}

The default parameter works like the one seen previously for nonparticipating groups.
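
For instance, reusing the optional-group regex from the *groups()* example with named groups (the names `without_default` and `with_default` are illustrative):

```python
import re

# The second group is optional and doesn't participate here.
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)?", "foo,")

without_default = m.groupdict()
with_default = m.groupdict(default="---")

print(without_default)  # {'w1': 'foo', 'w2': None}
print(with_default)     # {'w1': 'foo', 'w2': '---'}
```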

- **match.expand(\<template>)**

Performs backreference substitutions from a match.

In [3]:
m = re.search(r"(\w+),(\w+)", "foo,bar")
m.expand(r"\1////\2")

'foo////bar'
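
The template can also refer to named groups with the *\g\<name>* syntax. A short sketch that swaps the two words:

```python
import re

m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")

# \g<name> refers to a named group; \g<1> would refer to group 1.
swapped = m.expand(r"\g<w2>,\g<w1>")
print(swapped)  # bar,foo
```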

- **match.start([\<grp>])**
- **match.end([\<grp>])**

Return the starting and ending indices of the match.

In [8]:
search_string = "---foo,bar---"
m = re.search(r"(\w+),(\w+)", search_string)

print(m)
print(m.start())
print(m.end())
print(search_string[m.start():m.end()])

<re.Match object; span=(3, 10), match='foo,bar'>
3
10
foo,bar


When used with the optional argument **\<grp>**, they return the starting and ending indices of **the substring matched by the group**.

In [9]:
search_string = "---foo,bar---"
m = re.search(r"(\w+),(\w+)", search_string)

print(m)
print(m.start(1))
print(m.end(1))
print(search_string[m.start(1):m.end(1)])

<re.Match object; span=(3, 10), match='foo,bar'>
3
6
foo


This works with **named groups** as well:

In [10]:
search_string = "---foo,bar---"
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", search_string)

print(m)
print(m.start("w1"))
print(m.end("w1"))
print(search_string[m.start("w1"):m.end("w1")])

<re.Match object; span=(3, 10), match='foo,bar'>
3
6
foo


When the specified group matches an **empty string**, *.start()* and *.end()* are **equal**.

In [13]:
m = re.search(r"foo(\d*)bar", "foobar")
print(m.group(1))
print(m.start(1))
print(m.end(1))


3
3


When the group **doesn't participate**, they return **-1**.

In [16]:
m = re.search(r"(\w+),(\w+),(\w+)?", "foo,bar,")

print(m.group(3))
print(m.start(3))
print(m.end(3))

None
-1
-1


- **match.span([\<grp>])**

Returns both the starting and ending indices of the match as a **tuple**. It's equivalent to *(m.start([\<grp>]), m.end([\<grp>]))*.


In [17]:
search_string = "---foo,bar---"
m = re.search(r"(\w+),(\w+)", search_string)

print(m)
print(m.span(1))

<re.Match object; span=(3, 10), match='foo,bar'>
(3, 6)
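
Because a span is a *(start, end)* pair, it can be used directly to slice the search string. A sketch that collects the span of the whole match and of every group (`spans` and `pieces` are illustrative names):

```python
import re

search_string = "---foo,bar---"
m = re.search(r"(\w+),(\w+)", search_string)

# Group 0 is the whole match; groups 1..N are the capturing groups.
spans = [m.span(i) for i in range(len(m.groups()) + 1)]
print(spans)  # [(3, 10), (3, 6), (7, 10)]

# Each span can be used to slice the original string.
pieces = [search_string[start:end] for start, end in spans]
print(pieces)  # ['foo,bar', 'foo', 'bar']
```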


#### Match object attributes

A match object also has some useful attributes.

- **match.pos**
- **match.endpos**

Contain the effective values of \<pos> and \<endpos> for the search.

In [18]:
re_obj = re.compile(r"\d+")
m = re_obj.search("foo123bar", pos=2, endpos=7)
print(m)

print(m.pos)
print(m.endpos)

<re.Match object; span=(3, 6), match='123'>
2
7


If omitted, they default to 0 and the length of the string:

In [20]:
m = re.search(r"\d+", "foo123bar")
print(m)

print(m.pos)
print(m.endpos)

<re.Match object; span=(3, 6), match='123'>
0
9


- **match.lastindex**

Contains the index of the last captured group, or None if no group was matched at all.

This allows you to determine **how many groups actually participated** in the match.

In [22]:
m = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,baz")
m.lastindex

3

It isn’t always the case that the last group to match is also the last group encountered syntactically. See: https://docs.python.org/3/library/re.html#re.Match.lastindex
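
Two cases where *lastindex* differs from the number of groups in the regex, sketched with illustrative patterns: an optional group that doesn't participate, and nested groups:

```python
import re

# The optional third group doesn't participate, so lastindex is 2.
m = re.search(r"(\w+),(\w+),(\w+)?", "foo,bar,")
print(m.lastindex)  # 2

# With nested groups, the outer group is the last one to close, so
# lastindex points to it rather than to the highest-numbered group.
nested = re.search(r"((a)(b))", "ab")
print(nested.lastindex)  # 1
print(nested.groups())   # ('ab', 'a', 'b')
```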

- **match.lastgroup**

Contains the name of the last captured group. If the last captured group isn't a named group, it is None.

In [23]:
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+)", "foo,bar")
m.lastgroup

'w2'

In [27]:
m = re.search(r"(\w+),(\w+)", "foo,bar")
print(m.lastgroup)

None
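
A common use of *lastgroup* is to find out which alternative of a regex matched, for example in a simple tokenizer. The following is only a sketch: the token names and patterns are made up for illustration.

```python
import re

# Each alternative is a named group, so lastgroup tells which one matched.
token_re = re.compile(r"(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[+\-*/=])")

# finditer skips the spaces because no alternative matches them.
tokens = [(m.lastgroup, m.group()) for m in token_re.finditer("x = 42 + y")]
print(tokens)
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]
```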


- **match.re**

Contains the regular expression object that produced the match. 

In [28]:
m = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,baz")
m.re

re.compile(r'(\w+),(\w+),(\w+)', re.UNICODE)

In [29]:
# It's the same object as the compiled regex.
# This is due to the regex cache mentioned above.
re_obj = re.compile(r"(\w+),(\w+),(\w+)")
m.re is re_obj

True

In [31]:
# You can access the attributes of the regular expression
# object as well
print(m.re.pattern)
print(m.re.groups)

(\w+),(\w+),(\w+)
3


- **match.string**

Contains the search string for a match.

In [32]:
m = re.search(r"(\w+),(\w+),(\w+)", "foo,bar,baz")
m.string

'foo,bar,baz'
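
Since the match object carries the search string, the text around a match can be rebuilt from the match alone. A minimal sketch (`before` and `after` are illustrative names):

```python
import re

m = re.search(r"\d+", "foo123bar")

# Rebuild the text around the match from m.string and the match indices.
before = m.string[:m.start()]
after = m.string[m.end():]
highlighted = f"{before}[{m.group()}]{after}"
print(highlighted)  # foo[123]bar
```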