* References:
  * https://realpython.com/regex-python/#lookahead-and-lookbehind-assertions
  * https://realpython.com/regex-python-part-2/

# Table of contents
* [1. Regexes in Python and Their Uses](#1)
* [2. Metacharacters Supported by the `re` Module](#2)
  * [2.1 Metacharacters That Match a Single Character](#2.1)
  * [2.2 Escaping Metacharacters](#2.2)
  * [2.3 Anchors](#2.3)
  * [2.4 Quantifiers](#2.4)
  * [2.5 Grouping Constructs and Backreferences](#2.5)
  * [2.6 Lookahead and Lookbehind Assertions](#2.6)
  * [2.7 Miscellaneous Metacharacters](#2.7)
* [3. Modified Regular Expression Matching With Flags](#3)
  * [3.1 Supported Regular Expression Flags](#3.1)
  * [3.2 Combining `<flags>` Arguments in a Function Call](#3.2)
  * [3.3 Setting and Clearing Flags Within a Regular Expression](#3.3)
* [4. `re` Module Functions](#4)
  * [4.1 Searching Functions](#4.1)
  * [4.2 Substitution Functions](#4.2)
  * [4.3 Utility Functions](#4.3)
* [5. Compiled Regex Objects](#5)
  * [5.1 Why Bother Compiling a Regex?](#5.1)
  * [5.2 Regular Expression Object Methods](#5.2)
  * [5.3 Regular Expression Object Attributes](#5.3)
* [6. Match Object Methods and Attributes](#6)
  * [6.1 Match Object Methods](#6.1)
  * [6.2 Match Object Attributes](#6.2)


# 1. Regexes in Python and Their Uses <a name="1"></a>
* With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the `in` operator or with `string` methods.

In [23]:
import re
# Check whether a substring exists in a given string
s = 'foo123bar_*_789'

print('123' in s)
print('456' in s)
print('\n')

print(s[3:6])
print('\n')

print(re.search('123', s))
# An unescaped '*' at the start of a pattern raises "re.error: nothing to repeat at position 0",
# because * and + are quantifiers in a regex. Escape them with a backslash to match them literally.
print(re.search(r'\*_78', s))
# The span=(m, n) shown in a match object covers indices m through n - 1 of the search string.
# Find the position of a substring within a given string
print(s.find('_'))     # index of the first occurrence, or -1 if it is not found
print(s.index('bar'))  # index where the first occurrence starts; raises ValueError if it is not found

# print(s.index('fol'))  # ValueError: substring not found

True
False


123


<re.Match object; span=(3, 6), match='123'>
<re.Match object; span=(10, 14), match='*_78'>
9
6


## `re.search()`
* `re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches. 
  * If a match is found, then `re.search()` returns a **match object**. 
  * Otherwise, it returns `None`.

* `re.search()` has an optional third `<flags>` argument.


In [7]:
import re

# The return value is always the leftmost possible match. re.search() scans the search string from left to right
#   and, as soon as it locates a match for <regex>, stops scanning and returns the match.

print(re.search('123', s))
# An unescaped '*' at the start of a pattern raises "re.error: nothing to repeat at position 0",
# because * and + are quantifiers in a regex. Escape them with a backslash to match them literally.
print(re.search(r'\*_78', s))
# span=(m, n): the match covers indices m through n - 1 of the search string

<re.Match object; span=(3, 6), match='123'>
<re.Match object; span=(10, 14), match='*_78'>
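
Because `re.search()` returns either a match object or `None`, code that uses the result usually checks for `None` first. The sketch below illustrates that pattern; `describe_match` is a hypothetical helper name introduced only for this example, not part of the `re` module.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    m = re.search(pattern, text)
    if m is None:
        # re.search() found nothing, so there is no match object to inspect
        return 'no match'
    # m.group() is the matched text; m.span() is its (start, end) index pair
    return f'{m.group()!r} at {m.span()}'

print(describe_match('123', 'foo123bar_*_789'))
print(describe_match('456', 'foo123bar_*_789'))
```

Calling a method such as `.group()` directly on the result of a failed search would raise `AttributeError`, which is why the `None` check comes first.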


## Python Regex Metacharacters
* A set of characters specified in square brackets `[ ]` makes up a character class.
  * This metacharacter sequence matches any single character that is in the class.

* The dot `.` metacharacter **matches any character** except a newline, so it functions like a wildcard.


In [18]:
# [0-9][0-9] matches any sequence of two consecutive decimal digit characters.
print(re.search('[0-9][0-9]', s))
print('\n')
print(re.search('[0-9][0-9][0-9]', '12foo34'))


<re.Match object; span=(3, 5), match='12'>


None


In [25]:
# s is still 'foo123bar_*_789' from the first cell, so the `.` matches the `2`
print(re.search('1.3', s))
print('\n')
# `foo13bar` does not contain a `1`, then any character (except a newline), then a `3`
print(re.search('1.3', 'foo13bar'))


<re.Match object; span=(3, 6), match='123'>


None


# 2. Metacharacters Supported by the `re` Module <a name="2"></a>

* `[ ]` : Specifies a specific set of characters to match
* `^` : Anchors a match at the start of a string; complements a character class
* `$` : Anchors a match at the end of a string
* `*` : Matches zero or more repetitions
* `+` : Matches one or more repetitions
* `?` :
  * Matches zero or one repetition
  * Specifies the non-greedy versions of `*`, `+`, and `?`
  * Introduces a lookahead or lookbehind assertion
  * Creates a named group
* `{ }` : Matches an explicitly specified number of repetitions
* `\` :
  * Removes the special meaning of a metacharacter
  * Introduces a special character class
  * Introduces a grouping backreference
* `|` : Designates alternation
* `( )` : Creates a group
* `: # = !` : Designate a specialized group
* `< >` : Create a named group



## 2.1 Metacharacters That Match a Single Character <a name="2.1"></a>


### `[ ]` : Specifies a specific set of characters to match.


In [28]:
# The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. 
print(re.search('ba[ztra]', 'foobarqux'))
print(re.search('ba[artz]', 'foobazqux'))
print('\n')

# [a-z] matches any lowercase alphabetic character between 'a' and 'z'
print(re.search('[a-z]', 'FOObar'))
print('\n')

# [0-9][0-9] matches a sequence of two digits
print(re.search('[0-9][0-9]', 'foo563bar'))
print('\n')

# [0-9a-fA-F] matches any hexadecimal digit character
print(re.search('[0-9a-fA-F]', '--- a0 ---'))
print(re.search('[0-9a-f]', '--- a0 ---'))
print(re.search('[0-9A-F]', '--- a0 ---'))
print('\n')

<re.Match object; span=(3, 6), match='bar'>
<re.Match object; span=(3, 6), match='baz'>


<re.Match object; span=(3, 4), match='b'>


<re.Match object; span=(3, 5), match='56'>


<re.Match object; span=(4, 5), match='a'>
<re.Match object; span=(4, 5), match='a'>
<re.Match object; span=(5, 6), match='0'>
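
As a possible use of the hexadecimal character class, the sketch below checks whether a whole string consists only of hex digits. It assumes `re.fullmatch()`, which succeeds only when the pattern matches the entire string, plus the `+` quantifier covered in section 2.4. The names `HEX_DIGITS` and `is_hex` are made up for this example.

```python
import re

# One or more characters from the hexadecimal digit class
HEX_DIGITS = re.compile(r'[0-9a-fA-F]+')

def is_hex(text):
    """Return True if text is a non-empty run of hexadecimal digits."""
    # fullmatch() requires the pattern to cover the entire string
    return HEX_DIGITS.fullmatch(text) is not None

print(is_hex('a0'))        # a hex string
print(is_hex('DEADbeef'))  # mixed case is accepted
print(is_hex('a0g'))       # 'g' is not a hex digit
```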




## ` ^ `: Anchors a match at the start of a string / Complements a character class 
* You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set.

In [36]:
# [^0-9] matches any character that isn’t a digit
print(re.search('[^0-9]', '12345foo'))
print('\n')

# If a ^ character appears in a character class but isn’t the first character, 
#   then it has no special meaning and matches a literal '^' character:
print(re.search('[#:^]', 'foo^bar:baz#qux'))
print('\n')

# A few characters keep a special meaning inside a character class: the hyphen (-), the closing bracket (]),
#   the caret (^), and the backslash (\).
# Solution: Place the character where it can’t be special (first or last for -, first for ]) or escape it with a backslash (\)
# Note: Other regex metacharacters, such as * and +, lose their special meaning inside a character class

# case: hyphen (-)
print(re.search('[-abc]', '123-456'))
print(re.search('[abc-]', '123-456'))
print(re.search(r'[ab\-c]', '123-456'))
print('\n')

# case: ']'
print(re.search('[]]', 'foo[1]'))
print(re.search(r'[ab\]cd]', 'foo[1]'))
print('\n')

# case: '*', '+'
print(re.search('[)*+|]', '123*456'))
print(re.search('[)*+|]', '123+456'))


<re.Match object; span=(5, 6), match='f'>


<re.Match object; span=(3, 4), match='^'>


<re.Match object; span=(3, 4), match='-'>
<re.Match object; span=(3, 4), match='-'>
<re.Match object; span=(3, 4), match='-'>


<re.Match object; span=(5, 6), match=']'>
<re.Match object; span=(5, 6), match=']'>


<re.Match object; span=(3, 4), match='*'>
<re.Match object; span=(3, 4), match='+'>


## ` dot (.) ` : Specifies the wildcard
## ` \w, \W ` : Match based on whether a character is a word character.
## `\d, \D ` : Match based on whether a character is a decimal digit.
## `\s, \S `: Match based on whether a character represents whitespace.

In [38]:
# The . metacharacter matches any single character except a newline:

# As a regex, foo.bar essentially means the characters 'foo', then any character except newline, then the characters 'bar'.
print(re.search('foo.bar', 'foo bar'))

# Nothing lies between 'foo' and 'bar'
print(re.search('foo.bar', 'foobar'))

#Exception: newline character ('\n')
print(re.search('foo.bar', 'foo\nbar'))
print('\n')

# \w matches any alphanumeric word character.
# Word characters are uppercase and lowercase letters, digits, and the underscore (_) character
# Conclusion: for ASCII text, '\w' is essentially shorthand for [a-zA-Z0-9_]
print(re.search('\w', '#(.a$@&'))

print(re.search('[a-zA-Z0-9_]', '#(.a$@&'))
print('\n')

# \W is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_]
print(re.search('\W', 'a_1*3Qb'))
print(re.search('[^a-zA-Z0-9_]', 'a_1*3Qb'))
print('\n')

# \d matches any decimal digit character. \D is the opposite. It matches any character that isn’t a decimal digit
# \d is essentially equivalent to [0-9], and \D is equivalent to [^0-9].
print(re.search('\d', 'abc4def'))

print(re.search('\D', '234Q678'))
print('\n')

# \s matches any whitespace character
# \S is the opposite of \s. It matches any character that isn’t whitespace
# \s and \S consider a newline to be whitespace
print(re.search('\s', 'foo\nbar baz'))
print(re.search('\S', '  \n foo  \n  '))
print('\n')

# The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well
# [\d\w\s] matches any digit, word, or whitespace character.
print(re.search('[\d\w\s]', '---3---'))
print(re.search('[\d\w\s]', '---a---'))
print(re.search('[\d\w\s]', '--- ---'))


<re.Match object; span=(0, 7), match='foo bar'>
None
None


<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(3, 4), match='a'>


<re.Match object; span=(3, 4), match='*'>
<re.Match object; span=(3, 4), match='*'>


<re.Match object; span=(3, 4), match='4'>
<re.Match object; span=(3, 4), match='Q'>


<re.Match object; span=(3, 4), match='\n'>
<re.Match object; span=(4, 5), match='f'>


<re.Match object; span=(3, 4), match='3'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(3, 4), match=' '>
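
The equivalence between `\w` and `[a-zA-Z0-9_]` holds for ASCII text. With `str` patterns in Python 3, `\w` also matches non-ASCII letters unless the `re.ASCII` flag is passed. The short sketch below shows the difference; the sample word is arbitrary.

```python
import re

text = 'caf\u00e9'  # 'café', ending in a non-ASCII letter

# By default, \w matches Unicode word characters in str patterns
unicode_word = re.search(r'\w+', text).group()

# With re.ASCII, \w is restricted to [a-zA-Z0-9_]
ascii_word = re.search(r'\w+', text, flags=re.ASCII).group()

print(unicode_word)
print(ascii_word)
```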


## 2.2 Escaping Metacharacters <a name="2.2"></a>
### Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

### ` \ ` : Removes the special meaning of a metacharacter. 

In [None]:
import re

In [None]:
# The backslash escapes metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead
# Without the backslash, it returns 'f'
print(re.search('.', 'foo.bar'))

# With backslash, it returns '.'
print(re.search('\.', 'foo.bar'))


In [None]:
# Caution: matching a literal backslash takes extra care.
# print(re.search('\\', 'foo\bar'))  # re.error: bad escape (end of pattern) at position 0
  # 1. The Python interpreter is the first to process the string literal '\\'. It interprets that as an escaped backslash and passes only a single backslash to re.search().
  # 2. The regex parser receives just a single backslash, which isn’t a meaningful regex, so it raises the error.

# Solution:
  # 1. re.search('\\\\', s) - The interpreter reduces this to \\, which the regex parser sees as one escaped backslash. As a <regex>, that matches a single backslash character
  # 2. re.search(r'\\', s) - Specify the <regex> using a raw string, so the interpreter leaves the backslashes alone

s = r'foo\bar'
print(s)
print(re.search(r'\\', s))

foo\bar
<_sre.SRE_Match object; span=(3, 4), match='\\'>
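
When the text to match literally comes from a variable, escaping each metacharacter by hand is error-prone. One option is `re.escape()`, which returns a copy of a string with regex metacharacters backslash-escaped. The sketch below is only an illustration; the sample strings are invented.

```python
import re

literal = '1.5*'

# Unescaped, '.' is a wildcard and '*' is a quantifier, so this also matches '125'
unescaped = re.search(literal, 'price 125 today')

# re.escape() turns the string into a pattern that matches it literally
pattern = re.escape(literal)
escaped_miss = re.search(pattern, 'price 125 today')
escaped_hit = re.search(pattern, 'price 1.5* today')

print(pattern)
print(unescaped)
print(escaped_miss)
print(escaped_hit)
```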


## 2.3 Anchors <a name="2.3"></a>
### Anchors are zero-width matches. An anchor dictates a particular location in the search string where a match must occur.

## ` ^, \A `: Anchor a match to the start of <string>.

In [None]:
# When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning of the search string for it to find a match.

#  Regex ^foo stipulates that 'foo' must be present not just any old place in the search string, but at the beginning
print(re.search('^foo', 'foobar'))
print(re.search('^foo', 'barfoo'))
print('\n')
# '\A' functions similarly
print(re.search('\Afoo', 'foobar'))
print(re.search('\Afoo', 'barfoo'))


<_sre.SRE_Match object; span=(0, 3), match='foo'>
None


<_sre.SRE_Match object; span=(0, 3), match='foo'>
None


## ` $, \Z ` : Anchor a match to the end of string.

In [None]:
# When the regex parser encounters $ or \Z, the parser’s current position must be at the end of the search string for it to find a match.
print(re.search('bar$', 'foobar'))

print(re.search('bar$', 'barfoo'))
# print('\n')

print(re.search('bar\Z', 'foobar'))
print(re.search('bar\Z', 'barfoo'))


# As a special case, $ (but not \Z) also matches just before a single newline at the end of the search string:
print(re.search('bar$', 'foobar\n'))

<_sre.SRE_Match object; span=(3, 6), match='bar'>
None
<_sre.SRE_Match object; span=(3, 6), match='bar'>
None
<_sre.SRE_Match object; span=(3, 6), match='bar'>
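
The difference between `$` and `\Z` only shows up when the search string ends with a newline. A minimal side-by-side sketch, using the same `'foobar\n'` string as above:

```python
import re

text = 'foobar\n'

# $ matches just before a single trailing newline
dollar = re.search(r'bar$', text)

# \Z matches only at the very end of the string, after the newline
slash_z = re.search(r'bar\Z', text)

print(dollar)
print(slash_z)
```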


### ` \b ` : Anchors a match to a word boundary.

In [None]:
# \b asserts that the regex parser’s current position must be at the beginning or end of a word (a word is a sequence of \w characters)

# The first two searches match because there’s a word boundary at the start of 'bar'. In 'foobar' there isn’t one, so the third search fails.
print(re.search(r'\bbar', 'foo bar'))
print(re.search(r'\bbar', 'foo.bar'))
print(re.search(r'\bbar', 'foobar'))
print('\n')

# Likewise, the first two searches below match because a word boundary exists at the end of 'foo', but the third does not.
print(re.search(r'foo\b', 'foo bar'))
print(re.search(r'foo\b', 'foo.bar'))
print(re.search(r'foo\b', 'foobar'))
print('\n')

# Using the \b anchor on both ends of the <regex> will cause it to match when it’s present in the search string as a whole word
print(re.search(r'\bbar\b', 'foo bar baz'))
print(re.search(r'\bbar\b', 'foo(bar)baz'))
print(re.search(r'\bbar\b', 'foobarbaz'))

# This is another instance in which it pays to specify the <regex> as a raw string
# Because '\b' is an escape sequence for both string literals and regexes in Python, each use above would need to be double escaped as '\\b' if you didn’t use raw strings

<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>
None


<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(0, 3), match='foo'>
None


<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>
None
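
A common use of `\b` on both ends is counting whole-word occurrences. The sketch below assumes `re.findall()`, which returns all non-overlapping matches as a list; the sample sentence is invented for illustration.

```python
import re

text = 'cat concatenate cat.'

# Without anchors, the 'cat' inside 'concatenate' also counts
loose = re.findall(r'cat', text)

# With \b on both sides, only standalone words match
whole = re.findall(r'\bcat\b', text)

print(len(loose), len(whole))
```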


### ` \B `: Anchors a match to a location that isn’t a word boundary.

In [None]:
# \B does the opposite of \b. It asserts that the regex parser’s current position must not be at the start or end of a word

print(re.search(r'\Bfoo\B', 'foo'))

print(re.search(r'\Bfoo\B', '.foo.'))

print(re.search(r'\Bfoo\B', 'barfoobaz'))


None
None
<_sre.SRE_Match object; span=(3, 6), match='foo'>


## 2.4 Quantifiers <a name="2.4"></a>
### A quantifier metacharacter immediately follows a portion of a <regex> and indicates how many times that portion must occur for the match to succeed.

## ` * ` : Matches **zero or more** repetitions of the preceding regex.

In [None]:
# For example, a* matches zero or more 'a' characters. That means it would match an empty string, 'a', 'aa', 'aaa', and so on.
print(re.search('foo-*bar', 'foobar'))                     # Zero dashes
print(re.search('foo-*bar', 'foo-bar'))                    # One dash
print(re.search('foo-*bar', 'foo--bar'))                   # Two dashes
print('\n')
# Combined with the dot wildcard, '.*' matches any character sequence up to a line break. (Remember that the . wildcard metacharacter doesn’t match a newline.)

#.* matches everything between 'foo' and 'bar'
print(re.search('foo.*bar', '# foo $qux@grault % bar #'))


<_sre.SRE_Match object; span=(0, 6), match='foobar'>
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>


<_sre.SRE_Match object; span=(2, 23), match='foo $qux@grault % bar'>


## ` + ` : Matches **one or more** repetitions of the preceding regex.

In [None]:
# '+' is similar to *, but the quantified regex must occur at least once:

print(re.search('foo-+bar', 'foobar'))                     # Zero dashes

print(re.search('foo-+bar', 'foo-bar'))                    # One dash

print(re.search('foo-+bar', 'foo--bar'))                   # Two dashes

None
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>


## ` ? ` : Matches __zero or one__ repetitions of the preceding regex.

In [None]:
import re
# Again, this is similar to * and +, but in this case there’s only a match if the preceding regex occurs once or not at all:

print(re.search('foo-?bar', 'foobar'))                     # Zero dashes
print(re.search('foo-?bar', 'foo-bar'))                    # One dash
print(re.search('foo-?bar', 'foo--bar'))                   # Two dashes



# This time, the quantified regex is the character class [1-9] instead of the simple character '-'.
# (re.match() works like re.search() but only matches at the start of the string.)
print(re.match('foo[1-9]*bar', 'foobar'))
print(re.match('foo[1-9]*bar', 'foo42bar'))

print(re.match('foo[1-9]+bar', 'foobar'))

print(re.match('foo[1-9]+bar', 'foo42bar'))

print(re.match('foo[1-9]?bar', 'foobar'))
print(re.match('foo[1-9]?bar', 'foo42bar'))



<_sre.SRE_Match object; span=(0, 6), match='foobar'>
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
None
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
<_sre.SRE_Match object; span=(0, 8), match='foo42bar'>
None
<_sre.SRE_Match object; span=(0, 8), match='foo42bar'>
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
None


## ` *?, +?, ?? ` : The non-greedy (or lazy) versions of the *, +, and ? quantifiers.

In [None]:
# When used alone, the quantifier metacharacters *, +, and ? are all greedy, meaning they produce the longest possible match

# Since the * metacharacter is greedy, it dictates the longest possible match, which includes everything up to and including the '>' character that follows 'baz'
print(re.search('<.*>', '%<foo> <bar> <baz>%'))

# If you want the shortest possible match instead, then use the non-greedy metacharacter sequence *?:
print(re.search('<.*?>', '%<foo> <bar> <baz>%'))

print(re.search('<.+>', '%<foo> <bar> <baz>%'))
print(re.search('<.+?>', '%<foo> <bar> <baz>%'))

# The greedy version, ?, matches one occurrence when it can, so ba? matches 'b' followed by a single 'a'. The non-greedy version, ??, prefers zero occurrences, so ba?? matches just 'b'.
print(re.search('ba?', 'baaaa'))
print(re.search('ba??', 'baaaa'))
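
Since the cell above is shown without output, the following sketch makes the greedy and lazy results explicit. It reuses the same search string and additionally assumes `re.findall()` to collect every lazy match.

```python
import re

text = '%<foo> <bar> <baz>%'

# Greedy: runs from the first '<' to the last '>'
greedy = re.search(r'<.*>', text).group()

# Lazy: stops at the first '>' it can
lazy = re.search(r'<.*?>', text).group()

# The lazy form is what you usually want when extracting each tag
all_tags = re.findall(r'<.*?>', text)

print(greedy)
print(lazy)
print(all_tags)
```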


## ` {m} `: Matches exactly m repetitions of the preceding regex.

In [None]:
# This is similar to * or +, but it specifies exactly how many times the preceding regex must occur for a match to succeed:
print(re.search('x-{3}x', 'x--x'))                # Two dashes


print(re.search('x-{3}x', 'x---x'))               # Three dashes

print(re.search('x-{3}x', 'x----x'))              # Four dashes


None
<_sre.SRE_Match object; span=(0, 5), match='x---x'>
None


## ` {m,n} ` : Matches any number of repetitions of the preceding regex from m to n, inclusive.
### ` <regex>{,n} = <regex>{0,n} ` :  Any number of repetitions of regex less than or equal to n
### ` <regex>{m,} ` : Any number of repetitions of regex greater than or equal to m
### ` <regex>{,} = <regex>*`  : Any number of repetitions of regex

In [None]:
# The match succeeds when there are two, three, or four dashes between the 'x' characters but fails otherwise
for i in range(1, 6):
  s = f"x{'-' * i}x"
  print(f'{i}  {s:10}', re.search('x-{2,4}x', s))

print('\n')
# Curly braces act as a quantifier only in the forms {m}, {m,n}, {m,}, {,n}, and {,}, where m and n are nonnegative integers.
# Anything else, including an empty {}, matches literally:
print(re.search('x{foo}y', 'x{foo}y'))
print(re.search('x{a:b}y', 'x{a:b}y'))
print(re.search('x{1,3,5}y', 'x{1,3,5}y'))
print(re.search('x{foo,bar}y', 'x{foo,bar}y'))


1  x-x        None
2  x--x       <_sre.SRE_Match object; span=(0, 4), match='x--x'>
3  x---x      <_sre.SRE_Match object; span=(0, 5), match='x---x'>
4  x----x     <_sre.SRE_Match object; span=(0, 6), match='x----x'>
5  x-----x    None


<_sre.SRE_Match object; span=(0, 7), match='x{foo}y'>
<_sre.SRE_Match object; span=(0, 7), match='x{a:b}y'>
<_sre.SRE_Match object; span=(0, 9), match='x{1,3,5}y'>
<_sre.SRE_Match object; span=(0, 11), match='x{foo,bar}y'>
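
Counted quantifiers are handy for validating fixed-format strings. The sketch below checks a made-up code format (two uppercase letters, a dash, then three to five digits). The format and the names `CODE` and `is_valid_code` are assumptions for illustration only.

```python
import re

# Hypothetical format: exactly 2 uppercase letters, '-', then 3 to 5 digits
CODE = re.compile(r'[A-Z]{2}-\d{3,5}')

def is_valid_code(text):
    """Return True if the whole string fits the hypothetical code format."""
    return CODE.fullmatch(text) is not None

for candidate in ['AB-123', 'AB-12', 'AB-123456', 'A-1234']:
    print(candidate, is_valid_code(candidate))
```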


## ` {m,n}? ` : The non-greedy (lazy) version of {m,n}.



In [None]:
""" {m,n} will match as many characters as possible, and {m,n}? will match as few as possible """

# a{3,5} produces the longest possible match, so it matches five 'a' characters. a{3,5}? produces the shortest match, so it matches three.

print(re.search('a{3,5}', 'aaaaaaaa'))

print(re.search('a{3,5}?', 'aaaaaaaa'))


<_sre.SRE_Match object; span=(0, 5), match='aaaaa'>
<_sre.SRE_Match object; span=(0, 3), match='aaa'>


## 2.5 Grouping Constructs and Backreferences <a name="2.5"></a>

### **Grouping**: A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.

### __Capturing__: Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.

## ` (<regex>) ` : Defines a subexpression or group.

In [None]:
import re
print(re.search('(bar)', 'foo bar baz'))

print(re.search('bar', 'foo bar baz'))


<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>


## Treating a Group as a Unit

In [None]:
# A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.
print(re.search('(bar)+', 'foo bar baz'))
print(re.search('(bar)+', 'foo barbar baz'))
print(re.search('(bar)+', 'foo barbarbarbar baz'))

# Note:
  # bar+:	The + metacharacter applies only to the character 'r'.
  # (bar)+	The + metacharacter applies to the entire string 'bar'.
print('\n')

# The regex (ba[rz]){2,4}(qux)? matches 2 to 4 occurrences of either 'bar' or 'baz', optionally followed by 'qux'

print(re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux'))
print(re.search('(ba[rz]){2,4}(qux)?', 'barbar'))
print('\n')

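# (qux)? and (qux)* can both match zero occurrences, so they succeed immediately with an empty match at position 0
print(re.search('(qux)?', 'bazbarbazqux'))
print(re.search('(qux)*', 'bazbarbazqux'))  # *: zero or more; ?: zero or one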

print(re.search('(qux)+', 'bazbarbazqux'))
print(re.search('(qux)?', 'barbar'))
print(re.search('(qux)+', 'barbar'))

<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 10), match='barbar'>
<_sre.SRE_Match object; span=(4, 16), match='barbarbarbar'>


<_sre.SRE_Match object; span=(0, 12), match='bazbarbazqux'>
<_sre.SRE_Match object; span=(0, 6), match='barbar'>


<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(0, 0), match=''>
<_sre.SRE_Match object; span=(9, 12), match='qux'>
<_sre.SRE_Match object; span=(0, 0), match=''>
None


In [None]:
print(re.search('(foo(bar)?)+(\d\d\d)?', 'foofoobar'))
print(re.search('(foo(bar)?)+(\d\d\d)?', 'foofoobar123'))

# Note:
  # foo(bar)?	'foo' optionally followed by 'bar'
  # (foo(bar)?)+	One or more occurrences of the above
  # \d\d\d	Three decimal digit characters
  # (\d\d\d)?	Zero or one occurrences of the above

<_sre.SRE_Match object; span=(0, 9), match='foofoobar'>
<_sre.SRE_Match object; span=(0, 12), match='foofoobar123'>


## Capturing Groups

## ` m.groups() ` : Returns a tuple containing all the captured groups from a regex match.



In [None]:
import re
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print(m)
print(m.groups())

# Each of the three (\w+) expressions matches a sequence of word characters. The full regex (\w+),(\w+),(\w+) breaks the search string into three comma-separated tokens.
# Because the (\w+) expressions use grouping parentheses, the corresponding matching tokens are captured.

<_sre.SRE_Match object; span=(0, 12), match='foo,quux,baz'>
('foo', 'quux', 'baz')


##  ` m.group(<n>) ` : Returns a string containing the `<n>`th captured match.

In [None]:
# With one argument, .group() returns a single captured match. Note that the arguments are one-based, not zero-based. 
# So, m.group(1) refers to the first captured match, m.group(2) to the second

m = re.search('(\w+),(\w+),(\w+)', 'foo,quux,baz')
print(m.groups(), type(m.groups()))
print(m.group(0), type(m.group(0)))

print(m.group(1))
print(m.group(2))

# Because captured groups are numbered from one, group 0 is reserved for the entire match:
# m.group(0) returns the whole matched string, as does m.group() with no arguments

('foo', 'quux', 'baz') <class 'tuple'>
foo,quux,baz <class 'str'>
foo
quux


## ` m.group(<n1>, <n2>, ...) ` : Returns a tuple containing the specified captured matches.



In [None]:
# With multiple arguments, .group() returns a tuple containing the specified captured matches in the given order

print(m.groups())

print(m.group(2, 3))

print(m.group(3, 2, 1))

# m.group(3, 2, 1) = (m.group(3), m.group(2), m.group(1))

('foo', 'quux', 'baz')
('quux', 'baz')
('baz', 'quux', 'foo')
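
Captured groups are often unpacked straight into variables. The sketch below pulls a `YYYY-MM-DD` date out of a sentence and converts the parts to integers. `parse_date` is a hypothetical helper written for this example, and it does not validate that the date is a real calendar date.

```python
import re

def parse_date(text):
    """Return (year, month, day) from the first YYYY-MM-DD found, or None."""
    m = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
    if m is None:
        return None
    # m.groups() holds the three captured strings in order
    year, month, day = (int(part) for part in m.groups())
    return year, month, day

print(parse_date('released on 2021-03-15'))
print(parse_date('no date here'))
```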


## Backreferences

## ` \<n> ` : Matches the contents of a previously captured group.



In [None]:
# The sequence \<n>, where <n> is an integer from 1 to 99, matches the contents of the <n>th captured group
regex = r'(\w+),\1'

# In the first example, (\w+) matches the first instance of the string 'foo' and saves it as the first captured group.
# The comma matches literally. Then \1 is a backreference to the first captured group and matches 'foo' again
m = re.search(regex, 'foo,foo')
print(m)
print(m.group(1))

m = re.search(regex, 'qux,qux')
print(m)
print(m.group(1))

# The last example doesn’t have a match because what comes before the comma isn’t the same as what comes after it, so the \1 backreference doesn’t match.
m = re.search(regex, 'foo,qux')
print(m)


<_sre.SRE_Match object; span=(0, 7), match='foo,foo'>
foo
<_sre.SRE_Match object; span=(0, 7), match='qux,qux'>
qux
None


In [None]:
# In a regular (non-raw) string literal, Python interprets '\1' as the character whose octal value is one,
# so the regex parser never sees a backreference
print(re.search('([a-z])#\1', 'd#d'))

# You get the correct match if you specify the regex as a raw string
print(re.search(r'([a-z])#\1', 'd#d'))


""" Numbered backreferences are one-based like the arguments to .group(). 
Only the first ninety-nine captured groups are accessible by backreference. 
The interpreter will regard \100 as the '@' character, whose octal value is 100."""

None
<_sre.SRE_Match object; span=(0, 3), match='d#d'>
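
A typical application of backreferences is spotting an accidentally doubled word. The sketch below combines `\1` with the `\b` anchor from section 2.3; the pattern name `DOUBLED` and the sample sentences are invented for this example, and the pattern only handles words separated by a single space.

```python
import re

# A word, one space, then the same word again as a whole word
DOUBLED = re.compile(r'\b(\w+) \1\b')

hit = DOUBLED.search('this is is a test')
miss = DOUBLED.search('this is a test')

print(hit)
print(miss)
```

The leading `\b` keeps the `is` inside `this` from pairing with the following `is`.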


## Other Grouping Constructs

## ` (?P<name><regex>) ` : Creates a named captured group.



In [None]:
# reference the matched group by its given symbolic <name> instead of by its number.
m = re.search('(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')

print(m.group('w1', 'w2', 'w3'))
print(m.group(1, 2, 3))

# each <name> can only appear once per regex.

('foo', 'quux', 'baz')
('foo', 'quux', 'baz')


## ` (?P=<name>) ` : Matches the contents of a previously captured named group.



In [None]:
m = re.search(r'(?P<word>\w+),(?P=word)', 'foo,foo')
print(m)
print(m.group('word'))
# (?P<word>\w+) matches 'foo' and saves it as a captured group named word. Again, the comma matches literally.
# Then (?P=word) is a backreference to the named capture and matches 'foo' again



In [None]:
# Note: The angle brackets (< and >) are required around name when creating a named group but not when referring to it later, either by backreference or by .group()
m = re.match(r'(?P<num>\d+)\.(?P=num)', '135.135')
print(m)
print(m.group('num'))

<_sre.SRE_Match object; span=(0, 7), match='135.135'>
135
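
Named groups can also be retrieved all at once as a dictionary with the match object's `.groupdict()` method. A small sketch, using an invented `user@host` string and group names chosen for illustration:

```python
import re

m = re.search(r'(?P<user>\w+)@(?P<host>\w+)', 'contact: alice@example')

# groupdict() maps each group name to the text it captured
fields = m.groupdict()

print(fields)
print(m.group('user'), m.group(1))  # access by name or by number
```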


## ` (?:<regex>) ` : Creates a non-capturing group.



In [None]:
""" (?:<regex>) is just like (<regex>) in that it matches the specified <regex>. But (?:<regex>) doesn’t capture the match for later retrieval """

m = re.search('(\w+),(?:\w+),(\w+)', 'foo,quux,baz')
print(m.groups())
# The middle word 'quux' sits inside non-capturing parentheses, so it’s missing from the tuple of captured groups. 
# It isn’t retrievable from the match object, nor would it be referable by backreference.

# Use case: If you use non-capturing grouping, then the tuple of captured groups won’t be cluttered with values you don’t actually need to keep.

## ` (?(<n>)<yes-regex>|<no-regex>) `
## ` (?(<name>)<yes-regex>|<no-regex>) `

## Specifies a conditional match.

In [None]:
import re
""" (?(<n>)<yes-regex>|<no-regex>) matches against <yes-regex> if the group numbered <n> exists, that is, if it took part in the match so far. Otherwise, it matches against <no-regex>.
(?(<name>)<yes-regex>|<no-regex>) does the same for the group named <name>. """

regex = r'^(###)?foo(?(1)bar|baz)'
# 1. ^(###)? indicates that the search string optionally begins with '###'. 
# If it does, then the grouping parentheses around ### will create a group numbered 1. Otherwise, no such group will exist.

# 2. The next portion, foo, literally matches the string 'foo'.

# 3. Lastly, (?(1)bar|baz) matches against 'bar' if group 1 exists and 'baz' if it doesn’t.

print(re.search(regex, '###foobar'))

print(re.search(regex, '###foobaz'))

print(re.search(regex, 'foobar'))

print(re.search(regex, 'foobaz'))

<_sre.SRE_Match object; span=(0, 9), match='###foobar'>
None
None
<_sre.SRE_Match object; span=(0, 6), match='foobaz'>


In [None]:
regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

"""
1. ^ :	The start of the string
2. (?P<ch>\W)	: A single non-word character, captured in a group named ch
3. (?P<ch>\W)? : Zero or one occurrences of the above 
4. foo :	The literal string 'foo'
5. (?(ch)(?P=ch)|)	The contents of the group named ch if it exists, or the empty string if it doesn’t
6. $	The end of the string
"""

print(re.search(regex, 'foo'))

print(re.search(regex, '#foo#'))

print(re.search(regex, '@foo@'))

print(re.search(regex, '#foo'))

print(re.search(regex, 'foo@'))

print(re.search(regex, '#foo@'))

print(re.search(regex, '@foo#'))

# If a non-word character precedes 'foo', then the parser creates a group named ch which contains that character. 
# The conditional match then matches against <yes-regex>, which is (?P=ch), the same character again. 
# That means the same character must also follow 'foo' for the entire match to succeed.

# If 'foo' isn’t preceded by a non-word character, then the parser doesn’t create group ch. 
# <no-regex> is the empty string, which means there must not be anything following 'foo' for the entire match to succeed. 
# Since ^ and $ anchor the whole regex, the string must equal 'foo' exactly.

<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(0, 5), match='#foo#'>
<_sre.SRE_Match object; span=(0, 5), match='@foo@'>
None
None
None
None
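
As one more illustration of a conditional match, the sketch below accepts a word either bare or wrapped in angle brackets, but rejects a lone bracket on one side. The pattern name `BRACKETED` and the sample strings are made up for this example.

```python
import re

# Group 1 captures an optional '<'; if it matched, a closing '>' is required
BRACKETED = re.compile(r'^(<)?\w+(?(1)>|)$')

for candidate in ['<user>', 'user', '<user', 'user>']:
    print(f'{candidate:8}', BRACKETED.search(candidate))
```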


## 2.6 Lookahead and Lookbehind Assertions <a name="2.6"></a>
### Lookahead and lookbehind assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the parser’s current position in the search string.

## ` (?=<lookahead_regex>) ` : Creates a positive lookahead assertion.



In [None]:
""" (?=<lookahead_regex>) asserts that what follows the regex parser’s current position must match <lookahead_regex> """

# The lookahead assertion (?=[a-z]) specifies that what follows 'foo' must be a lowercase alphabetic character. In this case, it’s the character 'b', so a match is found.

print(re.search('foo(?=[a-z])', 'foobar'))
# The regex parser looks ahead only to the 'b' that follows 'foo' but doesn’t pass over it yet. 
# You can tell that 'b' isn’t considered part of the match because the match object displays match='foo'

print(re.search('foo([a-z])', 'foobar'))
# This time, the regex consumes the 'b', and it becomes a part of the eventual match.

print(re.search('foo(?=[a-z])', 'foo123'))

In [None]:
m = re.search('foo(?=[a-z])(?P<ch>.)', 'foobar')
print(m.groups())
print(m.group('ch'))
# (?P<ch>.) matches the next single character available

m = re.search('foo([a-z])(?P<ch>.)', 'foobar')
print(m.groups())
print(m.group('ch'))

('b',)
b
('b', 'a')
a


## ` (?!<lookahead_regex>) ` : Creates a negative lookahead assertion.



In [None]:
""" (?!<lookahead_regex>) asserts that what follows the regex parser’s current position must not match <lookahead_regex> """

In [None]:
print(re.search('foo(?=[a-z])', 'foobar'))
print(re.search('foo(?![a-z])', 'foobar'))


print(re.search('foo(?=[a-z])', 'foo123'))

print(re.search('foo(?![a-z])', 'foo123'))

<_sre.SRE_Match object; span=(0, 3), match='foo'>
None
None
<_sre.SRE_Match object; span=(0, 3), match='foo'>
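
Because lookahead assertions consume none of the search string, you can stack several of them at the same position to check independent conditions. The following is a minimal sketch, assuming a hypothetical password rule (at least eight characters, at least one digit, at least one uppercase letter, no whitespace). The names `PASSWORD_RE` and `is_strong` are illustrative and not part of the `re` module.

```python
import re

# Hypothetical rule set, chosen only for illustration:
#   (?=.*\d)     a digit must appear somewhere ahead
#   (?=.*[A-Z])  an uppercase letter must appear somewhere ahead
#   (?!.*\s)     no whitespace may appear anywhere ahead
#   .{8,}        only this part actually consumes characters
PASSWORD_RE = re.compile(r'^(?=.*\d)(?=.*[A-Z])(?!.*\s).{8,}$')

def is_strong(password):
    """Return True if password satisfies every assertion above."""
    return PASSWORD_RE.search(password) is not None

print(is_strong('Passw0rdX'))   # True
print(is_strong('password'))    # False: no digit, no uppercase letter
print(is_strong('Pass w0rdX'))  # False: contains whitespace
```

All three assertions are evaluated at position 0, which is why their order in the regex doesn’t matter.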


## ` (?<=<lookbehind_regex>) ` : Creates a positive lookbehind assertion.



In [None]:
""" (?<=<lookbehind_regex>) asserts that what precedes the regex parser’s current position must match <lookbehind_regex>. """
print(re.search('(?<=foo)bar', 'foobar'))
print(re.search('(?<=qux)bar', 'foobar'))
# Note: the lookbehind regex must match a string of fixed width. A quantifier like {3} is allowed, but one like + or {3,} raises an error

print(re.search('(?<=a{3})def', 'aaadef'))

<_sre.SRE_Match object; span=(3, 6), match='bar'>
None
<_sre.SRE_Match object; span=(3, 6), match='def'>
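
The fixed-width restriction can be checked directly, since Python rejects a variable-width lookbehind when it compiles the regex. This is a small sketch; the helper name `compiles` is made up for the example.

```python
import re

def compiles(pattern):
    """Return True if pattern is a valid regex, False if re.compile() rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re.error is raised when a pattern fails to compile
        return False

print(compiles(r'(?<=a{3})def'))  # True: a{3} always matches exactly three characters
print(compiles(r'(?<=a+)def'))    # False: a+ can match a variable number of characters
```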


## ` (?<!<lookbehind_regex>) ` : Creates a negative lookbehind assertion.



In [None]:
""" (?<!<lookbehind_regex>) asserts that what precedes the regex parser’s current position must not match <lookbehind_regex> """

In [None]:
print(re.search('(?<!foo)bar', 'foobar'))

print(re.search('(?<!qux)bar', 'foobar'))

None
<_sre.SRE_Match object; span=(3, 6), match='bar'>


## 2.7 Miscellaneous Metacharacters <a name = "2.7"></a>

## ` (?#...) ` : Specifies a comment.



In [None]:
print(re.search('bar(?#This is a comment) *baz', 'foo bar  baz qux'))

<_sre.SRE_Match object; span=(4, 12), match='bar  baz'>


## ` Vertical bar, or pipe (|) ` : Specifies a set of alternatives on which to match.



In [None]:
print(re.search('foo|bar|baz', 'bar'))

print(re.search('foo|bar|baz', 'baz'))

print(re.search('foo|bar|baz', 'quux'))
print('\n')
# Note: the regex parser tries the expressions separated by | in left-to-right order and returns the first one that matches at the current position, so | works like a logical OR
print(re.search('foo', 'foograult'))
print(re.search('grault', 'foograult'))
print(re.search('foo|grault', 'foograult'))

# (foo|bar|baz)+ means a sequence of one or more of the strings 'foo', 'bar', or 'baz'
print('\n')
print(re.search('(foo|bar|baz)+', 'foofoofoo'))
print(re.search('(foo|bar|baz)+', 'bazbazbazbaz'))
print(re.search('(foo|bar|baz)+', 'barbazfoo'))

# ([0-9]+|[a-f]+) means a sequence of one or more decimal digit characters or a sequence of one or more of the characters 'a-f'
print('\n')
print(re.search('([0-9]+|[a-f]+)', '456'))
print(re.search('([0-9]+|[a-f]+)', 'ffda'))


<_sre.SRE_Match object; span=(0, 3), match='bar'>
<_sre.SRE_Match object; span=(0, 3), match='baz'>
None


<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(3, 9), match='grault'>
<_sre.SRE_Match object; span=(0, 3), match='foo'>


<_sre.SRE_Match object; span=(0, 9), match='foofoofoo'>
<_sre.SRE_Match object; span=(0, 12), match='bazbazbazbaz'>
<_sre.SRE_Match object; span=(0, 9), match='barbazfoo'>


<_sre.SRE_Match object; span=(0, 3), match='456'>
<_sre.SRE_Match object; span=(0, 4), match='ffda'>
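
One consequence of the left-to-right rule is that the order of the alternatives matters when one is a prefix of another. The sketch below uses made-up variable names to show that the first alternative that matches wins, even when a later one would match more text.

```python
import re

# 'foo' is tried first and succeeds, so 'foobar' is never attempted.
short_first = re.search(r'foo|foobar', 'foobar').group()

# Listing the longer alternative first lets it match.
long_first = re.search(r'foobar|foo', 'foobar').group()

# An optional non-capturing group avoids the ordering problem entirely.
optional_tail = re.search(r'foo(?:bar)?', 'foobar').group()

print(short_first)    # foo
print(long_first)     # foobar
print(optional_tail)  # foobar
```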


# 3. Modified Regular Expression Matching With Flags <a name="3"></a>
## Most of the functions in the `re` module take an optional `<flags>` argument

## 3.1 Supported Regular Expression Flags <a name = "3.1"></a>

## ` re.search(<regex>, <string>, <flags>) ` : Scans a string for a regex match, applying the specified modifier (flags).

### Flags modify regex parsing behavior, allowing you to refine your pattern matching even further.



## Supported Regular Expression Flags


In [None]:
""" 
Short name  Long name       Effect
re.I        re.IGNORECASE   Makes matching of alphabetic characters case-insensitive
re.M        re.MULTILINE    Causes start-of-string and end-of-string anchors to match at embedded newlines
re.S        re.DOTALL       Causes the dot metacharacter to match a newline
re.X        re.VERBOSE      Allows inclusion of whitespace and comments within a regular expression
----        re.DEBUG        Causes the regex parser to display debugging information to the console
re.A        re.ASCII        Specifies ASCII encoding for character classification
re.U        re.UNICODE      Specifies Unicode encoding for character classification
re.L        re.LOCALE       Specifies encoding for character classification based on the current locale
"""

### ` re.I / re.IGNORECASE ` : Makes matching case insensitive.



In [None]:
import re
print(re.search('a+', 'aaaAAA'))
print(re.search('A+', 'aaaAAA'))

# the parser ignores case, so both a+ and A+ match the entire string.
print(re.search('a+', 'aaaAAA', re.I))
print(re.search('A+', 'aaaAAA', re.IGNORECASE))


<_sre.SRE_Match object; span=(0, 3), match='aaa'>
<_sre.SRE_Match object; span=(3, 6), match='AAA'>
<_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>
<_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>


In [None]:
# Specifying re.I makes the search case insensitive, so [a-z]+ matches the entire string.
print(re.search('[a-z]+', 'aBcDeF'))
print(re.search('[a-z]+', 'aBcDeF', re.I))


<_sre.SRE_Match object; span=(0, 1), match='a'>
<_sre.SRE_Match object; span=(0, 6), match='aBcDeF'>


### ` re.M / re.MULTILINE ` : Causes start-of-string and end-of-string anchors to match at embedded newlines.



In [None]:
s = 'foo\nbar\nbaz'

# By default, the ^ (start-of-string) and $ (end-of-string) anchors match only at the beginning and end of the search string
print(re.search('^foo', s))
print(re.search('^bar', s))

print(re.search('^baz', s))

print(re.search('foo$', s))

print(re.search('bar$', s))

print(re.search('baz$', s))


<_sre.SRE_Match object; span=(0, 3), match='foo'>
None
None
None
None
<_sre.SRE_Match object; span=(8, 11), match='baz'>


In [None]:
"""
With the MULTILINE flag set, the ^ and $ anchors also match at internal line boundaries:
+ ^ matches at the beginning of the string or at the beginning of any line within the string (that is, immediately following a newline).
+ $ matches at the end of the string or at the end of any line within the string (immediately preceding a newline).
"""
s = 'foo\nbar\nbaz'
print(s)

print(re.search('^foo', s, re.MULTILINE))
print(re.search('^bar', s, re.MULTILINE))
print(re.search('^baz', s, re.MULTILINE))

print(re.search('foo$', s, re.M))
print(re.search('bar$', s, re.M))
print(re.search('baz$', s, re.M))

# The MULTILINE flag modifies only the ^ and $ anchors in this way. It has no effect on the \A and \Z anchors
print(re.search(r'\Abar', s, re.MULTILINE))
print(re.search(r'bar\Z', s, re.MULTILINE))


foo
bar
baz
<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(8, 11), match='baz'>
<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(8, 11), match='baz'>
None
None
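
A common use of MULTILINE is to pull something out of every line with a single call. The following sketch uses an invented three-line string and illustrative variable names.

```python
import re

log = 'foo 1\nbar 2\nbaz 3'

# Without MULTILINE, ^ matches only at the very start of the string.
first_words_default = re.findall(r'^\w+', log)

# With MULTILINE, ^ also matches right after each newline,
# and $ also matches right before each newline.
first_words_multiline = re.findall(r'^\w+', log, re.MULTILINE)
last_digits_multiline = re.findall(r'\d$', log, re.MULTILINE)

print(first_words_default)    # ['foo']
print(first_words_multiline)  # ['foo', 'bar', 'baz']
print(last_digits_multiline)  # ['1', '2', '3']
```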


### ` re.S / re.DOTALL ` : Causes the dot (.) metacharacter to match a newline.



In [None]:
# the dot metacharacter matches any character except the newline character. The DOTALL flag lifts this restriction

print(re.search('foo.bar', 'foo\nbar'))

print(re.search('foo.bar', 'foo\nbar', re.DOTALL))

print(re.search('foo.bar', 'foo\nbar', re.S))


None
<_sre.SRE_Match object; span=(0, 7), match='foo\nbar'>
<_sre.SRE_Match object; span=(0, 7), match='foo\nbar'>
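
DOTALL is most useful when the text you want to capture spans several lines. Here is a short sketch with a made-up `BEGIN`/`END` block; the marker names are only an example.

```python
import re

text = 'BEGIN\nline one\nline two\nEND'

# Without DOTALL, .* stops at the first newline, so END is never reached.
without_flag = re.search(r'BEGIN(.*)END', text)

# With DOTALL, the dot also matches newlines and the whole block is captured.
with_flag = re.search(r'BEGIN(.*)END', text, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group(1)))  # '\nline one\nline two\n'
```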


### ` re.X / re.VERBOSE ` : Allows inclusion of whitespace and comments within a regex.



In [None]:
import re
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

print(re.search(regex, '414.9229'))
print(re.search(regex, '414-9229'))
print(re.search(regex, '(712)414-9229'))
print(re.search(regex, '(712) 414-9229'))

<_sre.SRE_Match object; span=(0, 8), match='414.9229'>
<_sre.SRE_Match object; span=(0, 8), match='414-9229'>
<_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
<_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>


In [None]:
# Using the VERBOSE flag, you can write the same regex in Python like this instead:
regex = r'''^               # Start of string
            (\(\d{3}\))?    # Optional area code
            \s*             # Optional whitespace
            \d{3}           # Three-digit prefix
            [-.]            # Separator character
            \d{4}           # Four-digit line number
            $               # Anchor at end of string
            '''

print(re.search(regex, '414.9229', re.VERBOSE))
print(re.search(regex, '414-9229', re.VERBOSE))
print(re.search(regex, '(712)414-9229', re.X))
print(re.search(regex, '(712) 414-9229', re.X))
# Note that triple quoting makes it particularly convenient to include embedded newlines, which qualify as ignored whitespace in VERBOSE mode.

<_sre.SRE_Match object; span=(0, 8), match='414.9229'>
<_sre.SRE_Match object; span=(0, 8), match='414-9229'>
<_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
<_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>


In [None]:
print(re.search('foo bar', 'foo bar'))

print(re.search('foo bar', 'foo bar', re.VERBOSE))


# To match a literal space in VERBOSE mode, escape it with a backslash or include it in a character class
print(re.search(r'foo\ bar', 'foo bar', re.VERBOSE))
print(re.search('foo[ ]bar', 'foo bar', re.VERBOSE))

# Note:
  # the VERBOSE flag causes the parser to ignore unescaped whitespace in the regex, so 'foo bar' is treated as 'foobar' and the second search above fails.

<_sre.SRE_Match object; span=(0, 7), match='foo bar'>
None
<_sre.SRE_Match object; span=(0, 7), match='foo bar'>
<_sre.SRE_Match object; span=(0, 7), match='foo bar'>


### ` re.DEBUG ` : Displays debugging information.



In [None]:
# When the parser displays LITERAL nnn in the debugging output, it’s showing the character code of a literal character in the regex (102 is 'f', 111 is 'o').
print(re.search('foo.bar', 'fooxbar', re.DEBUG))

# This call omits re.DEBUG, so it prints only the match. With re.DEBUG, the output would include a MAX_REPEAT 2 4 line, confirming that the parser interprets {2,4} as a range quantifier.
print(re.search('x[123]{2,4}y', 'x222y'))

LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114
<_sre.SRE_Match object; span=(0, 7), match='fooxbar'>
<_sre.SRE_Match object; span=(0, 5), match='x222y'>


### ` re.A / re.ASCII `
### ` re.U / re.UNICODE `
### ` re.L / re.LOCALE `

### Specify the character encoding used for parsing of special regex character classes.

__re.U and re.UNICODE__: specify Unicode encoding. Unicode is the default for string patterns in Python 3, so these flags are superfluous. They’re mainly supported for backward compatibility.

__re.A and re.ASCII__: force a determination based on ASCII encoding, so that character classes such as `\w`, `\d`, and `\s` match only ASCII characters. If your text contains only ASCII characters, then the flag won’t affect whether or not a match is found.

__re.L and re.LOCALE__: make the determination based on the current locale. In Python 3, this flag can be used only with bytes patterns, and the locale mechanism isn’t considered reliable. Except in rare circumstances, you’re not likely to need it.
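
The difference between the default Unicode behavior and `re.ASCII` shows up as soon as the text contains a non-ASCII letter. This is a brief sketch with an invented sample string; the variable names are illustrative.

```python
import re

s = 'sch\u00f6n caf\u00e9'  # 'schön café'

# Default (Unicode): \w matches ö and é, so each word stays in one piece.
unicode_words = re.findall(r'\w+', s)

# re.ASCII: \w matches only [a-zA-Z0-9_], so the accented letters split the words.
ascii_words = re.findall(r'\w+', s, re.ASCII)

print(unicode_words)  # ['schön', 'café']
print(ascii_words)    # ['sch', 'n', 'caf']
```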

## 3.2 Combining **flags** Arguments in a Function Call <a name  = "3.2"></a>
### Flag values are defined so that you can combine them using the bitwise OR (|) operator. This allows you to specify several flags in a single function call

In [None]:
# This re.search() call uses bitwise OR to specify both the IGNORECASE and MULTILINE flags at once.
print(re.search('^bar', 'FOO\nBAR\nBAZ', re.I|re.M))

<_sre.SRE_Match object; span=(4, 7), match='BAR'>
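
Since flags are bit values, you can also test a combined value with bitwise AND. The sketch below relies on `re.I` and `re.M` having the integer values 2 and 8; the variable names are made up for the example.

```python
import re

combined = re.I | re.M
print(int(combined))  # 10, that is 2 | 8

pattern = re.compile(r'^bar', combined)

# Bitwise AND is nonzero only if the flag is set on the compiled object.
has_ignorecase = bool(pattern.flags & re.I)
has_dotall = bool(pattern.flags & re.S)

print(has_ignorecase)  # True
print(has_dotall)      # False
```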


## 3.3 Setting and Clearing Flags Within a Regular Expression <a name = "3.3"></a>

### ` (?<flags>) ` : Sets flag value(s) for the duration of a regex.

In [None]:
"""
Letter     Flags
a          re.A / re.ASCII
i          re.I / re.IGNORECASE
L          re.L / re.LOCALE
m          re.M / re.MULTILINE
s          re.S / re.DOTALL
u          re.U / re.UNICODE
x          re.X / re.VERBOSE
"""

In [None]:
# The (?<flags>) metacharacter sequence as a whole matches the empty string. It always matches successfully and doesn’t consume any of the search string.


print(re.search('^bar', 'FOO\nBAR\nBAZ\n', re.I|re.M))
print(re.search('(?im)^bar', 'FOO\nBAR\nBAZ\n'))


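# In the examples below, both dot metacharacters match newlines because the DOTALL flag applies to the whole regex, even when (?s) appears in the middle or at the end of the expression.
# Note: placing (?<flags>) anywhere but the start of the regex has been deprecated since Python 3.6 and raises an error as of Python 3.11.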
print(re.search('foo.bar(?s).baz', 'foo\nbar\nbaz'))
print(re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz'))


<_sre.SRE_Match object; span=(4, 7), match='BAR'>
<_sre.SRE_Match object; span=(4, 7), match='BAR'>
<_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>
<_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>


### ` (?<set_flags>-<remove_flags>:<regex>) ` : Sets or removes flag value(s) for the duration of a group.



In [None]:
# For the <regex> contained in the group, the regex parser sets any flags specified in <set_flags> and clears any flags specified in <remove_flags>

# (?i:foo) dictates that the match against 'FOO' is case insensitive.
print(re.search('(?i:foo)bar', 'FOObar'))

# the match against 'FOO' would succeed because it’s case insensitive. But once outside the group, IGNORECASE is no longer in effect, 
# so the match against 'BAR' is case sensitive and fails.
print(re.search('(?i:foo)bar', 'FOOBAR'))

# turning a flag off for a group
print(re.search('(?-i:foo)bar', 'FOOBAR', re.IGNORECASE))



# As of Python 3.7, you can specify u, a, or L as <set_flags> to override the default encoding for the specified group
s = 'sch\u00f6n'
print(s)

# Requires Python 3.7 or later; earlier versions raise an error here
print(re.search(r'(?a:\w+)', s))
print(re.search(r'(?u:\w+)', s))


# The a, u, and L flags cannot be turned off for a group. The following call raises
# "bad inline flags: cannot turn off flags 'a', 'u' and 'L'":
# re.search(r'(?-a:\w+)', s)

<_sre.SRE_Match object; span=(0, 6), match='FOObar'>
None
None
schön


error: ignored

# 4. re Module Functions <a name = "4"></a>

## 4.1 Searching Functions <a name = '4.1'></a>

## ` re.search(<regex>, <string>, flags=0) ` : Scans a string for a regex match

In [None]:
import re
# The function returns a match object if it finds a match and None otherwise.
print(re.search(r'(\d+)', 'foo123bar'))
print(re.search(r'[a-z]+', '123FOO456', flags=re.IGNORECASE))

print(re.search(r'\d+', 'foo.bar'))

<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='FOO'>
None


## ` re.match(<regex>, <string>, flags=0) ` : Looks for a regex match at the beginning of a string.

In [None]:
# returns a match only if <regex> matches at the beginning of <string>.
print(re.search(r'\d+', '123foobar'))
print(re.search(r'\d+', 'foo123bar'))

print(re.match(r'\d+', '123foobar'))
print(re.match(r'\d+', 'foo123bar'))


# Even with the MULTILINE flag, re.match() matches only at the very beginning of <string>, not at the beginning of each line contained within <string>:
s = 'foo\nbar\nbaz'
print(re.match('^foo', s))
print(re.match('^bar', s, re.MULTILINE))


<_sre.SRE_Match object; span=(0, 3), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(0, 3), match='123'>
None
<_sre.SRE_Match object; span=(0, 3), match='foo'>
None


## ` re.fullmatch(<regex>, <string>, flags=0) ` : Looks for a regex match on an entire string.

In [None]:
# re.fullmatch() returns a match only if <regex> matches <string> in its entirety
print(re.fullmatch(r'\d+', '123foo'))

print(re.fullmatch(r'\d+', 'foo123'))

print(re.fullmatch(r'\d+', 'foo123bar'))

print(re.fullmatch(r'\d+', '123'))

print(re.search(r'^\d+$', '123'))

None
None
None
<_sre.SRE_Match object; span=(0, 3), match='123'>
<_sre.SRE_Match object; span=(0, 3), match='123'>
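
`re.fullmatch()` is a natural fit for input validation. It also differs from `^...$` in one detail: `$` matches just before a trailing newline as well, whereas `re.fullmatch()` does not. A small sketch, with a hypothetical helper named `is_integer`:

```python
import re

def is_integer(text):
    """Return True if text is an optionally signed run of decimal digits."""
    return re.fullmatch(r'[+-]?\d+', text) is not None

print(is_integer('-42'))    # True
print(is_integer('42abc'))  # False

# $ also matches right before a newline at the end of the string,
# so the anchored search accepts '123\n' while fullmatch() rejects it.
anchored = re.search(r'^\d+$', '123\n')
full = re.fullmatch(r'\d+', '123\n')

print(anchored is not None)  # True
print(full is not None)      # False
```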


## ` re.findall(<regex>, <string>, flags=0) ` : Returns a list of all matches of a regex in a string.



In [None]:
# returns a list of all non-overlapping matches of <regex> in <string>
# scans the search string from left to right and returns all matches in the order found
print(re.findall(r'\w+', '...foo,,,,bar:%$baz//|'))

print(re.findall(r'#(\w+)#', '#foo#.#bar#.#baz#'))
# the hash (#) characters don’t appear in the return list because they’re outside the grouping parentheses.

print(re.findall(r'(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge'))
# the regex contains two capturing groups, so re.findall() returns a list of three two-tuples, each containing two captured matches

print(re.findall(r'(\w+),(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge'))


['foo', 'bar', 'baz']
['foo', 'bar', 'baz']
[('foo', 'bar'), ('baz', 'qux'), ('quux', 'corge')]
[('foo', 'bar', 'baz'), ('qux', 'quux', 'corge')]
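
Because a regex with two capturing groups makes `re.findall()` return two-tuples, the result can be passed straight to `dict()`. The sketch below assumes a made-up `key=value` string; the names `config`, `pairs`, and `settings` are illustrative.

```python
import re

config = 'host=localhost, port=8080, debug=true'

# Each match contributes one (key, value) tuple.
pairs = re.findall(r'(\w+)=(\w+)', config)
settings = dict(pairs)

print(pairs)             # [('host', 'localhost'), ('port', '8080'), ('debug', 'true')]
print(settings['port'])  # 8080
```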


## ` re.finditer(<regex>, <string>, flags=0) ` : Returns an iterator that yields regex matches.

In [None]:
# scans <string> for non-overlapping matches of <regex> and returns an iterator that yields the match objects from any it finds
"""
re.findall() and re.finditer() are very similar, but they differ in two respects:
  + re.findall() returns a list, whereas re.finditer() returns an iterator.
  + The items in the list that re.findall() returns are the actual matching strings, whereas the items yielded by the iterator that re.finditer() returns are match objects.
"""
it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')
print(next(it))
print(next(it))
print(next(it))
print(next(it))  # no matches remain, so this call raises StopIteration

<_sre.SRE_Match object; span=(3, 6), match='foo'>
<_sre.SRE_Match object; span=(10, 13), match='bar'>
<_sre.SRE_Match object; span=(16, 19), match='baz'>


StopIteration: ignored

In [None]:
for i in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
  print(i)

<_sre.SRE_Match object; span=(3, 6), match='foo'>
<_sre.SRE_Match object; span=(10, 13), match='bar'>
<_sre.SRE_Match object; span=(16, 19), match='baz'>
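
Since `re.finditer()` yields match objects, you can collect positions along with the matched text, which `re.findall()` can’t give you. A minimal sketch using the same search string as above; the name `positions` is made up.

```python
import re

text = '...foo,,,,bar:%$baz//|'

# Record each matched word together with its start and end indices.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\w+', text)]

print(positions)  # [('foo', 3, 6), ('bar', 10, 13), ('baz', 16, 19)]
```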


## 4.2 Substitution Functions <a name = '4.2'></a>

## ` re.sub(<regex>, <repl>, <string>, count=0, flags=0) ` 
## Returns a new string that results from performing replacements on a search string.

In [None]:
"""
re.sub(<regex>, <repl>, <string>) finds the leftmost non-overlapping occurrences of <regex> in <string>, replaces each match as indicated by <repl>, and returns the result. <string> remains unchanged.
<repl> can be either a string or a function
"""

## Substitution by String
### ` repl ` : is a string

In [None]:
# If <repl> is a string, then re.sub() inserts it into <string> in place of any sequences that match <regex>
s = 'foo.123.bar.789.baz'
print(re.sub(r'\d+', '#', s))
print(re.sub('[a-z]+', '(*)', s))

# captured groups 1 and 2 contain 'foo' and 'qux'. In the replacement string '\2,bar,baz,\1', 'foo' replaces \1 and 'qux' replaces \2.
print(re.sub(r'(\w+),bar,baz,(\w+)', r'\2,bar,baz,\1', 'foo,bar,baz,qux'))

# refer to named backreferences created with (?P<name><regex>) in the replacement string using the metacharacter sequence \g<name>
print(re.sub(r'foo,(?P<w1>\w+),(?P<w2>\w+),qux', r'foo,\g<w2>,\g<w1>,qux', 'foo,bar,baz,qux'))

# refer to numbered backreferences this way by specifying the group number inside the angled brackets
print(re.sub(r'foo,(\w+),(\w+),qux', r'foo,\g<2>,\g<1>,qux', 'foo,bar,baz,qux'))

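# Suppose you have a string like 'foo 123 bar' and want to add a '0' at the end of the digit sequence. Writing r'\10' would refer to group 10, so use \g<1> to separate the group number from the literal '0'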
print(re.sub(r'(\d+)', r'\g<1>0', 'foo 123 bar'))

# The backreference \g<0> refers to the text of the entire match. This is valid even when there are no grouping parentheses in <regex>
print(re.sub(r'\d+', r'/\g<0>/', 'foo 123 bar'))

# If <regex> specifies a zero-length match, then re.sub() will substitute <repl> into every character position in the string
print(re.sub('x*', '-', 'foo'))

foo.#.bar.#.baz
(*).123.(*).789.(*)
qux,bar,baz,foo
foo,baz,bar,qux
foo,baz,bar,qux
foo 1230 bar
foo /123/ bar
-f-o-o-


## Substitution by Function
### If you specify <repl> as a function, then re.sub() calls that function for each match found.

In [None]:
"""
Passes each corresponding match object as an argument to the function to provide information about the match. 
The function return value then becomes the replacement string
"""

In [None]:
def f(match_obj):
  s = match_obj.group(0)  # The matching string
  # s.isdigit() returns True if all characters in s are digits
  if s.isdigit():
    return str(int(s) * 10)
  else:
    return s.upper()

# f() gets called for each match. As a result, re.sub() converts each non-numeric word in <string> to all uppercase and multiplies each numeric word by 10.
re.sub(r'\w+', f, 'foo.10.bar.20.baz.30')

'FOO.100.BAR.200.BAZ.300'
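
A replacement function also makes it easy to swap several different words in one pass by looking each match up in a dictionary. This is a sketch under assumed data: the `table` mapping and the sample sentence are invented for illustration.

```python
import re

# Hypothetical lookup table: text to find -> text to substitute.
table = {'cat': 'dog', 'red': 'blue'}

# Build one alternation from the keys. re.escape() guards against keys that
# contain metacharacters, and \b keeps 'red' from matching inside 'redder'.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(key) for key in table) + r')\b')

result = pattern.sub(lambda m: table[m.group(0)], 'a red cat and a redder cat')
print(result)  # a blue dog and a redder dog
```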

## Limiting the Number of Replacements

In [None]:
# If you specify a positive integer for the optional count parameter, then re.sub() performs at most that many replacements

print(re.sub(r'\w+', 'xxx', 'foo.bar.baz.qux'))
print(re.sub(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2))


xxx.xxx.xxx.xxx
xxx.xxx.baz.qux


## ` re.subn(<regex>, <repl>, <string>, count=0, flags=0) `
## Returns a new string that results from performing replacements on a search string and also returns the number of substitutions made.

In [None]:
# re.subn() is identical to re.sub(), except that re.subn() returns a two-tuple consisting of the modified string and the number of substitutions made

print(re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux'))

print(re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2))

def f(match_obj):
  m = match_obj.group(0)  
  if m.isdigit():
    return str(int(m) * 10)
  else:
    return m.upper()

re.subn(r'\w+', f, 'foo.10.bar.20.baz.30')

('xxx.xxx.xxx.xxx', 4)
('xxx.xxx.baz.qux', 2)


('FOO.100.BAR.200.BAZ.300', 6)

## 4.3 Utility Functions <a name = '4.3'></a>

## ` re.split(<regex>, <string>, maxsplit=0, flags=0) ` : Splits a string into substrings.



In [None]:
""" re.split(<regex>, <string>) splits <string> into substrings using <regex> as the delimiter and returns the substrings as a list. """

# splits the specified string into substrings delimited by a comma (,), semicolon (;), or slash (/) character, surrounded by any amount of whitespace
print(re.split(r'\s*[,;/]\s*', 'foo,bar  ;  baz / qux'))

# If <regex> contains capturing groups, then the return list includes the matching delimiter strings as well:
print(re.split(r'(\s*[,;/]\s*)', 'foo,bar  ;  baz / qux'))


['foo', 'bar', 'baz', 'qux']
['foo', ',', 'bar', '  ;  ', 'baz', ' / ', 'qux']


In [None]:
string = 'foo,bar  ;  baz / qux'
regex = r'(\s*[,;/]\s*)'
a = re.split(regex, string)

# List of tokens and delimiters
print('List of tokens and delimiters: ', a)

# Enclose each token in <>'s
for i, s in enumerate(a):
  # This will be True for the tokens but not the delimiters
  if not re.fullmatch(regex, s):
    a[i] = f'<{s}>'

# Put the tokens back together using the same delimiters
print(''.join(a))

List of tokens and delimiters:  ['foo', ',', 'bar', '  ;  ', 'baz', ' / ', 'qux']
<foo>,<bar>  ;  <baz> / <qux>


In [None]:
# If you need to use groups but don’t want the delimiters included in the return list, then you can use noncapturing groups
string = 'foo,bar  ;  baz / qux'
regex = r'(?:\s*[,;/]\s*)'
print(re.split(regex, string))

# If the optional maxsplit argument is present and greater than zero, then re.split() performs at most that many splits. 
# The final element in the return list is the remainder of <string> after all the splits have occurred
s = 'foo, bar, baz, qux, quux, corge'

print(re.split(r',\s*', s))

print(re.split(r',\s*', s, maxsplit=3))
# Explicitly specifying maxsplit=0 is equivalent to omitting it entirely. If maxsplit is negative, then re.split() performs no splits and returns a one-element list containing <string> unchanged


# If <regex> matches at the start or end of <string>, then re.split() places an empty string as the first or last element in the return list.
print(re.split('(/)', '/foo/bar/baz/'))

['foo', 'bar', 'baz', 'qux']
['foo', 'bar', 'baz', 'qux', 'quux', 'corge']
['foo', 'bar', 'baz', 'qux, quux, corge']
['', '/', 'foo', '/', 'bar', '/', 'baz', '/', '']


## ` re.escape(<regex>) ` : Escapes characters in a regex.



In [None]:
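# re.escape(<regex>) returns a copy of <regex> with each special character preceded by a backslash. Before Python 3.7, it escaped every nonword character (anything other than a letter, digit, or underscore); since 3.7, it escapes only characters that can have special meaning in a regex.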
"""
Objective: This is useful if you’re calling one of the re module functions, and the <regex> you’re passing in has a lot of special characters
that you want the parser to take literally instead of as metacharacters
"""
print(re.match('foo^bar(baz)|qux', 'foo^bar(baz)|qux'))

print(re.match(r'foo\^bar\(baz\)\|qux', 'foo^bar(baz)|qux'))

print(re.escape('foo^bar(baz)|qux') == r'foo\^bar\(baz\)\|qux')

print(re.match(re.escape('foo^bar(baz)|qux'), 'foo^bar(baz)|qux'))


None
<_sre.SRE_Match object; span=(0, 16), match='foo^bar(baz)|qux'>
True
<_sre.SRE_Match object; span=(0, 16), match='foo^bar(baz)|qux'>


# 5. Compiled Regex Objects in Python <a name = '5'></a>

## ` re.compile(<regex>, flags=0) ` : Compiles a regex into a regular expression object.

In [None]:
import re
"""
re.compile(<regex>) compiles <regex> and returns the corresponding regular expression object. 
If you include a <flags> value, then the corresponding flags apply to any searches performed with the object.
------------------------------------------
re_obj = re.compile(<regex>, <flags>)
result = re.search(re_obj, <string>)
result = re_obj.search(<string>)
result = re.search(<regex>, <string>, <flags>)
"""
print(re.search(r'(\d+)', 'foo123bar'))

re_obj = re.compile(r'(\d+)')
print(re.search(re_obj, 'foo123bar'))
print(re_obj.search('foo123bar'))

# Using flag re.IGNORECASE
r1 = re.search('ba[rz]', 'FOOBARBAZ', flags=re.I)
re_obj = re.compile('ba[rz]', flags=re.I)
r2 = re.search(re_obj, 'FOOBARBAZ')
r3 = re_obj.search('FOOBARBAZ')

print(r1)
print(r2)
print(r3)

<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='BAR'>
<_sre.SRE_Match object; span=(3, 6), match='BAR'>
<_sre.SRE_Match object; span=(3, 6), match='BAR'>


## 5.1 Why Bother Compiling a Regex? <a name = '5.1'></a>

In [None]:
""" 
If you use a particular regex in your Python code frequently, 
then precompiling allows you to separate out the regex definition from its uses. This enhances modularity. 
"""
#  more modular and more maintainable:
s1, s2, s3, s4 = 'foo.bar', 'foo123bar', 'baz99', 'qux & grault'

re_obj = re.compile(r'\d+')
regex = r'\d+'

print(re_obj.search(s1))
print(re.search(regex, s1))

print(re_obj.search(s2))
print(re.search(regex, s2))

print(re_obj.search(s3))
print(re.search(regex, s3))

print(re_obj.search(s4))
print(re.search(regex, s4))

## 5.2 Regular Expression Object Methods <a name = '5.2'></a>

In [None]:
"""
A compiled regular expression object re_obj supports the following methods:
  + re_obj.search(<string>[, <pos>[, <endpos>]])
  + re_obj.match(<string>[, <pos>[, <endpos>]])
  + re_obj.fullmatch(<string>[, <pos>[, <endpos>]])
  + re_obj.findall(<string>[, <pos>[, <endpos>]])
  + re_obj.finditer(<string>[, <pos>[, <endpos>]])
  + re_obj.split(<string>, maxsplit=0)
  + re_obj.sub(<repl>, <string>, count=0)
  + re_obj.subn(<repl>, <string>, count=0)
"""
re_obj = re.compile(r'\d+')
s = 'foo123barbaz'

print(re_obj.search(s))
print(s[6:9])
print(re_obj.search(s, 6, 9))

# The caret (^) anchor still refers to the start of the entire string, not to the position given by <pos>. <endpos>, by contrast, makes the search behave as if the string ended there, so the dollar sign ($) can match at <endpos>
re_obj = re.compile('^bar')
s = 'foobarbaz'

print(s[3:])
print(re_obj.search(s, 3))

<_sre.SRE_Match object; span=(3, 6), match='123'>
bar
None
barbaz
None
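
The interaction between anchors and the `<pos>` and `<endpos>` arguments is easier to see side by side. The following sketch reuses the string from above; the variable names are made up for the example.

```python
import re

s = 'foobarbaz'

# ^ refers to the real start of the string, so starting the search at 3 doesn't help.
caret_with_pos = re.compile(r'^bar').search(s, 3)

# .match() anchors at <pos> itself, so this succeeds.
match_with_pos = re.compile(r'bar').match(s, 3)

# <endpos> makes the search act as if the string were only 6 characters long,
# so $ matches right after 'bar'.
dollar_with_endpos = re.compile(r'bar$').search(s, 0, 6)

# Slicing creates a new string, so ^ matches at the start of the slice.
caret_on_slice = re.compile(r'^bar').search(s[3:])

print(caret_with_pos)             # None
print(match_with_pos.span())      # (3, 6)
print(dollar_with_endpos.span())  # (3, 6)
print(caret_on_slice.span())      # (0, 3)
```

Note that spans from the `<pos>` and `<endpos>` calls are relative to the original string, while the span from the sliced search is relative to the slice.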


## 5.3 Regular Expression Object Attributes <a name = '5.3'></a>

In [None]:
"""
  + re_obj.flags:	Any <flags> that are in effect for the regex
  + re_obj.groups:	The number of capturing groups in the regex
  + re_obj.groupindex:	A dictionary mapping each symbolic group name defined by the (?P<name>) construct (if any) to the corresponding group number
  + re_obj.pattern:	The <regex> pattern that produced this object
"""
re_obj = re.compile(r'(?m)(\w+),(\w+)', re.I)
print(re_obj.flags)

# the value of re_obj.flags is the bitwise OR of these three values, which equals 42.
print(re.I|re.M|re.UNICODE)
# re.I: Specified as a <flags> value in the re.compile() call
# re.M: Specified as (?m) within the regex
# re.UNICODE: Enabled by default
print(re_obj.groups)
print(re_obj.pattern)

re_obj = re.compile(r'(?P<w1>),(?P<w2>)')
print(re_obj.groupindex)
print(re_obj.groupindex['w1'])

42
RegexFlag.UNICODE|MULTILINE|IGNORECASE
2
(?m)(\w+),(\w+)
{'w1': 1, 'w2': 2}
1


# 6. Match Object Methods and Attributes <a name = '6'></a>

In [None]:
"""
  + match.group(): The specified captured group or groups from match

  + match.__getitem__(): A captured group from match

  + match.groups(): All the captured groups from match

  + match.groupdict(): A dictionary of named captured groups from match

  + match.expand(): The result of performing backreference substitutions from match

  + match.start(): The starting index of match

  + match.end(): The ending index of match
  
  + match.span(): Both the starting and ending indices of match as a tuple
  """

## ` match.group([<group1>, ...]) ` : Returns the specified captured group(s) from a match.



In [None]:
# For numbered groups, match.group(n) returns the nth group
# Numbered captured groups are one-based, not zero-based.
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.group(1))

print(m.group(3))


foo
baz


In [None]:
# If you capture groups using (?P<name><regex>), then match.group(<name>) returns the corresponding named group
m = re.match(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'quux,corge,grault')
print(m.group('w1'))

m.group('w3')

quux


'grault'

In [None]:
# With more than one argument, .group() returns a tuple of all the groups specified. 
# A given group can appear multiple times, and you can specify any captured groups in any order
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.group(1, 3))

print(m.group(3, 3, 1, 1, 2, 2))

m = re.match(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'quux,corge,grault')
print(m.group('w3', 'w1', 'w1', 'w2'))


('foo', 'baz')
('baz', 'baz', 'foo', 'foo', 'bar', 'bar')
('grault', 'quux', 'quux', 'corge')


In [None]:
# If you specify a group that’s out of range or nonexistent, then .group() raises an IndexError exception
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
m.group(4)

IndexError: ignored

In [None]:
# It’s possible for a regex in Python to match as a whole but to contain a group that doesn’t participate in the match. In that case, .group() returns None 
# for the nonparticipating group
m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
print(m)
print(m.group(1, 2))
print(m.group(3))


<_sre.SRE_Match object; span=(0, 8), match='foo,bar,'>
('foo', 'bar')
None


In [None]:
# It can also happen that a group participates in the overall match multiple times. If you call .group() for that group number, 
# then it returns only the part of the search string that matched the last time
m = re.match(r'(\w{3},)+', 'foo,bar,baz,qux')
print(m)
m.group(1)

<_sre.SRE_Match object; span=(0, 12), match='foo,bar,baz,'>


'baz,'

In [None]:
# If you call .group() with an argument of 0 or no argument at all, then it returns the entire match
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m)

print(m.group(0))

m.group()

<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
foo,bar,baz


'foo,bar,baz'

## ` match.__getitem__(<grp>) ` : Returns a captured group from a match.



In [None]:
# match.__getitem__(<grp>) is identical to match.group(<grp>) and returns the single group specified by <grp>
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.group(2))

print(m.__getitem__(2))

"""
.__getitem__() is one of a collection of methods in Python called magic methods. 
These are special methods that the interpreter calls when a Python statement contains specific corresponding syntactical elements.

Note: Magic methods are also referred to as dunder methods because of the double underscore at the beginning and end of the method name.
Whenever you use the expression obj[n], behind the scenes Python quietly translates it to a call to .__getitem__().
The syntax obj[n] is only meaningful if a .__getitem__() method exists for the class or type to which obj belongs. 
Exactly how Python interprets obj[n] will then depend on the implementation of .__getitem__() for that class.
"""

bar
bar


'\n.__getitem__() is one of a collection of methods in Python called magic methods. \nThese are special methods that the interpreter calls when a Python statement contains specific corresponding syntactical elements.\n\nNote: Magic methods are also referred to as dunder methods because of the double underscore at the beginning and end of the method name.\nwhenever you use the expression obj[n], behind the scenes Python quietly translates it to a call to .__getitem__()\nThe syntax obj[n] is only meaningful if a .__getitem()__ method exists for the class or type to which obj belongs. \nExactly how Python interprets obj[n] will then depend on the implementation of .__getitem__() for that class.\n'

In [None]:
# The implementation is such that match.__getitem__(n) is the same as match.group(n).
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.group(2))

print(m.__getitem__(2))

print(m[2])

# Square-bracket indexing also works with named capturing groups
m = re.match(r'foo,(?P<w1>\w+),(?P<w2>\w+),qux', 'foo,bar,baz,qux')
print(m.group('w2'))

print(m['w2'])


# Note: Many objects in Python have a .__getitem__() method defined, allowing the use of square-bracket indexing syntax.
# However, this feature is only available for regex match objects in Python version 3.6 or later.

bar
bar
bar
baz
baz
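
To see how `obj[n]` maps onto `.__getitem__()` outside of regexes, here is a toy class. `Words` is a hypothetical example written for this note and has nothing to do with the `re` module:

```python
class Words:
    """Toy container: obj[n] returns the n-th whitespace-separated word."""

    def __init__(self, text):
        self._words = text.split()

    def __getitem__(self, n):
        # Called by the interpreter whenever obj[n] is evaluated
        return self._words[n]

w = Words('foo bar baz')
print(w[1])
print(w.__getitem__(1))
```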


## ` match.groups(default=None) ` : Returns all captured groups from a match.

In [None]:
# match.groups() returns a tuple of all captured groups
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.groups())

# By default, .groups() returns None for any group that didn’t participate in the match. To get a different placeholder, pass the default keyword argument
m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
print(m)

print(m.group(3))

print(m.groups())

m.groups(default='---')

('foo', 'bar', 'baz')
<_sre.SRE_Match object; span=(0, 8), match='foo,bar,'>
None
('foo', 'bar', None)


('foo', 'bar', '---')
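
Because `.groups()` returns a tuple, it can be unpacked straight into variables, and `default` then fills in any optional part that was absent. A small sketch, assuming a made-up date format in which the day is optional:

```python
import re

# The day part is optional; the pattern and input are illustrative only
date_re = re.compile(r'(\d{4})-(\d{2})(?:-(\d{2}))?')

m = date_re.search('released 2021-07')
year, month, day = m.groups(default='01')
print(year, month, day)
```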

## ` match.groupdict(default=None) ` : Returns a dictionary of named captured groups.

In [None]:
"""
match.groupdict() returns a dictionary of all named groups captured with the (?P<name><regex>) metacharacter sequence. 
The dictionary keys are the group names and the dictionary values are the corresponding group values
"""

m = re.match(r'foo,(?P<w1>\w+),(?P<w2>\w+),qux', 'foo,bar,baz,qux')
print(m.groupdict())

print(m.groupdict()['w2'])

# the default argument determines the return value for nonparticipating groups:
m = re.match(r'foo,(?P<w1>\w+),(?P<w2>\w+)?,qux', 'foo,bar,,qux')
print(m.groupdict())

print(m.groupdict(default='---'))

{'w1': 'bar', 'w2': 'baz'}
baz
{'w1': 'bar', 'w2': None}
{'w1': 'bar', 'w2': '---'}
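
Since the result is an ordinary dictionary, it can be passed on to anything that accepts keyword arguments. A sketch using `str.format()`; the pattern and the names `user` and `host` are assumptions made for the example:

```python
import re

addr_re = re.compile(r'(?P<user>\w+)@(?P<host>[\w.]+)')
m = addr_re.search('contact: alice@example.com')

fields = m.groupdict()
# Dictionary keys become keyword arguments for the format string
message = '{user} at {host}'.format(**fields)
print(message)
```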


## ` match.expand(<template>) ` : Performs backreference substitutions from a match.

In [None]:
""" match.expand(<template>) returns the string that results from performing backreference substitution on <template> exactly as re.sub() would do """
# This works for numeric backreferences and also for named backreferences
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m)
print(m.groups())

print(m.expand(r'\2'))

print(m.expand(r'[\3] -> [\1]'))

m = re.search(r'(?P<num>\d+)', 'foo123qux')
print(m)
print(m.group(1))
m.expand(r'--- \g<num> ---')

<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
('foo', 'bar', 'baz')
bar
[baz] -> [foo]
<_sre.SRE_Match object; span=(3, 6), match='123'>
123


'--- 123 ---'
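
The relationship to `re.sub()` can be checked directly. In this sketch the regex matches the entire search string, so substituting the template into the string and expanding it from the match should give the same text:

```python
import re

pattern = r'(\w+),(\w+),(\w+)'
template = r'\3-\1'
text = 'foo,bar,baz'

m = re.search(pattern, text)
expanded = m.expand(template)                  # built from the match object
substituted = re.sub(pattern, template, text)  # built by re.sub()

print(expanded)
print(substituted)
```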

## ` match.start([<grp>]) ` 
## ` match.end([<grp>]) `
## Return the starting and ending indices of the match.

In [None]:
"""
match.start() returns the index in the search string where the match begins
match.end() returns the index immediately after where the match ends
"""

s = 'foo123bar456baz'
m = re.search(r'\d+', s)  # raw string avoids the invalid-escape warning for \d
print(m)

print(m.start())
print(m.end())

print(s[m.start():m.end()])


<_sre.SRE_Match object; span=(3, 6), match='123'>
3
6
123
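
One use for these indices is to rebuild the string around the match, for example to mark the matched part. A minimal sketch; the bracket markers are an arbitrary choice:

```python
import re

s = 'foo123bar456baz'
m = re.search(r'\d+', s)

# Text before the match + marked match + text after the match
marked = s[:m.start()] + '[' + s[m.start():m.end()] + ']' + s[m.end():]
print(marked)
```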


In [None]:
# match.start(<grp>) and match.end(<grp>) return the starting and ending indices of the substring matched by <grp>, which may be a numbered or named group
s = 'foo123bar456baz'
m = re.search(r'(\d+)\D*(?P<num>\d+)', s)

print(m.group(1))

print(m.start(1), m.end(1))

print(s[m.start(1):m.end(1)])

print(m.group('num'))

print(m.start('num'), m.end('num'))

print(s[m.start('num'):m.end('num')])

123
3 6
123
456
9 12
456


In [None]:
# If the specified group matches a null string, then .start() and .end() are equal
m = re.search(r'foo(\d*)bar', 'foobar')
print(m[1])
print(m.start(1), m.end(1))


3 3


In [None]:
# A special case occurs when the regex contains a group that doesn’t participate in the match. For such a group, .start() and .end() both return -1
m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
print(m.group(3))
print(m.start(3), m.end(3))

None
-1 -1
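
The -1 values need care, because they are also valid (negative) slice indices: slicing with them silently yields an empty string instead of failing. A sketch of a guard, where `participating_span` is a hypothetical helper name:

```python
import re

def participating_span(match, group):
    """Return (start, end) for the group, or None if it did not participate."""
    span = match.span(group)
    return None if span == (-1, -1) else span

text = 'foo,bar,'
m = re.search(r'(\w+),(\w+),(\w+)?', text)

print(participating_span(m, 1))
print(participating_span(m, 3))

# Pitfall: slicing with -1, -1 gives '' rather than raising an error
silent = text[m.start(3):m.end(3)]
```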


## ` match.span([<grp>]) ` : Returns both the starting and ending indices of the match.

In [None]:
"""
match.span() returns both the starting and ending indices of the match as a tuple. If you specify <grp>, then the returned tuple applies to the given group
"""

s = 'foo123bar456baz'
m = re.search(r'(\d+)\D*(?P<num>\d+)', s)
print(m)

print(m[0])

print(m.span())

print(m[1])

print(m.span(1))

print(m['num'])

print(m.span('num'))

# match.span(<grp>) = (match.start(<grp>), match.end(<grp>))
# match.span() just provides a convenient way to obtain both match.start() and match.end() in one method call

<_sre.SRE_Match object; span=(3, 12), match='123bar456'>
123bar456
(3, 12)
123
(3, 6)
456
(9, 12)
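
`.span()` combines naturally with `re.finditer()`, which yields one match object per occurrence. A short sketch that collects the position of every run of digits:

```python
import re

s = 'foo123bar456baz'

# One (start, end) tuple per non-overlapping match
spans = [m.span() for m in re.finditer(r'\d+', s)]
print(spans)
```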


## 6.2 Match Object Attributes <a name = '6.2'></a>

In [None]:
"""
  + match.pos / match.endpos: The effective values of the <pos> and <endpos> arguments for the match

  + match.lastindex: The index of the last captured group

  + match.lastgroup: The name of the last captured group

  + match.re: The compiled regular expression object for the match
  
  + match.string: The search string for the match
"""

## ` match.pos/match.endpos ` : Contain the effective values of <pos> and <endpos> for the search.

In [None]:
import re
"""
Remember that some methods, when invoked on a compiled regex, accept optional <pos> and <endpos> arguments that limit the search to a portion of the specified
search string. These values are accessible from the match object with the .pos and .endpos attributes
"""
re_obj = re.compile(r'\d+')
m = re_obj.search('foo123bar', 2, 7)
print(m)
m.pos, m.endpos

<_sre.SRE_Match object; span=(3, 6), match='123'>


(2, 7)

In [None]:
# If the <pos> and <endpos> arguments aren’t included in the call, either because they were omitted or because the function in question doesn’t accept them,
# then the .pos and .endpos attributes effectively indicate the start and end of the string
re_obj = re.compile(r'\d+')
m = re_obj.search('foo123bar')
print(m)
m.pos, m.endpos

<_sre.SRE_Match object; span=(3, 6), match='123'>


(0, 9)
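
Together with `match.string`, these two attributes let you recover the window of text that was actually searched. Note that the indices reported by `.span()` still refer to the full string, not to the window. A sketch with an illustrative input:

```python
import re

re_obj = re.compile(r'\d+')
m = re_obj.search('foo123bar456', 2, 7)

# The slice of the search string that the regex engine was allowed to see
window = m.string[m.pos:m.endpos]

print(window)
print(m.span())
```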

## ` match.lastindex ` : Contains the index of the last captured group.

In [None]:
""" match.lastindex is equal to the integer index of the last captured group """

m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
print(m.lastindex, m[m.lastindex])


3 baz


In [None]:
# In cases where the regex contains potentially nonparticipating groups, this allows you to determine how many groups actually participated in the match:
m = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
print(m.groups())
print(m.lastindex, m[m.lastindex])

('foo', 'bar', None)
2 bar


In [None]:
# It isn’t always the case that the last group to match is also the last group encountered syntactically
m = re.match('((a)(b))', 'ab')
print(m.groups())
print(m.lastindex, m[m.lastindex])
# The outermost group is ((a)(b)), which matches 'ab'. This is the first group the parser encounters, so it becomes group 1. 
# But it’s also the last group to match, which is why m.lastindex is 1.

('ab', 'a', 'b')
1 ab
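
As the nested example suggests, `match.lastindex` is not in general a count of participating groups. If a count is what you need, one option is to inspect `.groups()` directly. A sketch, with `participating_groups` as a hypothetical helper name:

```python
import re

def participating_groups(match):
    """Count the groups that actually took part in the match."""
    return sum(g is not None for g in match.groups())

m1 = re.search(r'(\w+),(\w+),(\w+)?', 'foo,bar,')
m2 = re.match('((a)(b))', 'ab')

print(m1.lastindex, participating_groups(m1))
print(m2.lastindex, participating_groups(m2))
```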


## ` match.lastgroup ` : Contains the name of the last captured group.

In [None]:
"""
If the last captured group originates from the (?P<name><regex>) metacharacter sequence, then match.lastgroup returns the name of that group
"""
s = 'foo123bar456baz'
m = re.search(r'(?P<n1>\d+)\D*(?P<n2>\d+)', s)
print(m.lastgroup)


n2


In [None]:
# match.lastgroup is None if the last captured group isn’t a named group, or if the regex contains no capturing groups at all
s = 'foo123bar456baz'
m = re.search(r'(\d+)\D*(\d+)', s)
print(m.groups())
print(m.lastgroup)

# Here the regex contains no capturing groups at all
m = re.search(r'\d+\D*\d+', s)
m.groups()
print(m.lastgroup)

('123', '456')
None
None
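
A common use of `match.lastgroup` is a small tokenizer: each alternative in the regex is a named group, so the name of the group that matched tells you the token type. The sketch below assumes a made-up token grammar; the group names `NUMBER`, `NAME`, `OP`, and `SKIP` are illustrative:

```python
import re

token_re = re.compile(
    r'(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[+\-*/=])|(?P<SKIP>\s+)'
)

def tokenize(text):
    """Yield (kind, value) pairs, dropping whitespace."""
    for m in token_re.finditer(text):
        # Exactly one named alternative matched; lastgroup is its name
        if m.lastgroup != 'SKIP':
            yield m.lastgroup, m.group()

tokens = list(tokenize('x = 42 + y'))
print(tokens)
```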


## ` match.re ` : Contains the regular expression object for the match.

In [None]:
"""
match.re contains the regular expression object that produced the match. This is the same object you’d get if you passed the regex to re.compile()
"""
regex = r'(\w+),(\w+),(\w+)'
# re.search
m1 = re.search(regex, 'foo,bar,baz')
print(m1)
print(m1.re)

# re.compile
re_obj = re.compile(regex)
print(re_obj)
print(re_obj is m1.re)

# Once you have access to the regular expression object for the match, all of that object’s attributes are available as well
print(m1.re.groups, m1.re.pattern)
print(m1.re.pattern == regex)
m1.re.flags


<_sre.SRE_Match object; span=(0, 11), match='foo,bar,baz'>
re.compile('(\\w+),(\\w+),(\\w+)')
re.compile('(\\w+),(\\w+),(\\w+)')
True
3 (\w+),(\w+),(\w+)
True


32
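
The integer shown for `.flags` is a bit mask, so individual flags can be tested with `&`. For a `str` pattern the `re.UNICODE` bit is set even when no flags are passed, which is where the value 32 comes from. A sketch with an illustrative pattern:

```python
import re

m = re.search(r'(\w+),(\w+)', 'Foo,Bar', re.IGNORECASE)
flags = m.re.flags

# Test individual bits of the mask
has_ignorecase = bool(flags & re.IGNORECASE)
has_unicode = bool(flags & re.UNICODE)
has_multiline = bool(flags & re.MULTILINE)

print(has_ignorecase, has_unicode, has_multiline)
```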

In [None]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
m.re

# Here, .match() is invoked on m.re to perform another search using the same regex but on a different search string.
print(m.re.match('quux,corge,grault'))

<_sre.SRE_Match object; span=(0, 17), match='quux,corge,grault'>


## ` match.string ` : Contains the search string for a match.

In [None]:
"""
match.string contains the search string that is the target of the match
"""

m = re.search(r'(\w+),(\w+),(\w+)', 'foo,bar,baz')
m.string

re_obj = re.compile(r'(\w+),(\w+),(\w+)')
m = re_obj.search('foo,bar,baz')
print(m.string)


foo,bar,baz
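
Because the match object carries its own search string, a function that receives only the match can still show it in context. A sketch that underlines the matched part; `underline` is a hypothetical helper name:

```python
import re

def underline(match):
    """Return the search string with carets under the matched span."""
    start, end = match.span()
    return match.string + '\n' + ' ' * start + '^' * (end - start)

m = re.search(r'\d+', 'foo123bar')
print(underline(m))
```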
