# re — Regular Expressions

Purpose: Searching within and changing text using formal patterns.

Regular expressions are text matching patterns described with a formal syntax. The patterns are interpreted as a set of instructions, which are then executed with a string as input to produce a matching subset or modified version of the original. The term “regular expressions” is frequently shortened to “regex” or “regexp” in conversation. Expressions can include literal text matching, repetition, pattern composition, branching, and other sophisticated rules. A large number of parsing problems are easier to solve with a regular expression than by creating a special-purpose lexer and parser.

Regular expressions are typically used in applications that involve a lot of text processing. For example, they are commonly used as search patterns in text editing programs used by developers, including vi, emacs, and modern IDEs. They are also an integral part of Unix command-line utilities such as sed, grep, and awk. Many programming languages include support for regular expressions in the language syntax (Perl, Ruby, Awk, and Tcl). Other languages, such as C, C++, and Python, support regular expressions through extension libraries.

Multiple open source implementations of regular expressions exist, each sharing a common core syntax but with different extensions or modifications to their advanced features. The syntax used in Python’s re module is based on the syntax used for regular expressions in Perl, with a few Python-specific enhancements.

#### Note
Although the formal definition of “regular expression” is limited to expressions that describe regular languages, some of the extensions supported by re go beyond describing regular languages. The term “regular expression” is used here in a more general sense to mean any expression that can be evaluated by Python’s re module.

## Finding Patterns in Text

The most common use for re is to search for patterns in text. The search() function takes the pattern and text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.

The start() and end() methods give the indexes into the string showing where the text matched by the pattern occurs.

In [1]:
# re_simple_match.py
import re

pattern = 'this'
text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()
e = match.end()

print('Found "{}"\nin "{}"\nfrom {} to {} ("{}")'.format(
    match.re.pattern, match.string, s, e, text[s:e]))

Found "this"
in "Does this text match the pattern?"
from 5 to 9 ("this")
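
Because search() returns None when the pattern is absent, calling start() on the result without a check would raise an AttributeError. The following is a minimal sketch of guarding against that case. The helper name describe_match is hypothetical and is not part of re.

```python
import re


def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no Match object to inspect
        return 'No match for {!r}'.format(pattern)
    return 'Found {!r} from {} to {}'.format(
        match.group(0), match.start(), match.end())


text = 'Does this text match the pattern?'
print(describe_match('this', text))
print(describe_match('that', text))
```

Testing the result against None before using it keeps the success and failure paths explicit.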


## Compiling Expressions

Although re includes module-level functions for working with regular expressions as text strings, it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a compiled pattern object (re.Pattern in current Python 3 releases; older documentation calls it a RegexObject).

The module-level functions maintain a cache of compiled expressions, but the size of the cache is limited and using compiled expressions directly avoids the overhead associated with cache lookup. Another advantage of using compiled expressions is that by precompiling all of the expressions when the module is loaded, the compilation work is shifted to application start time, instead of occurring at a point where the program may be responding to a user action.

In [2]:
# re_simple_compiled.py

import re

# precompile the patterns

regexes = [
    re.compile(p)
    for p in ['this', 'that']
]

text = 'Does this text match the pattern?'

print('Text: {!r}\n'.format(text))

for regex in regexes:
    print('Seeking "{}" -> '.format(regex.pattern), end=' ')
    if regex.search(text):
        print('match!')
    else:
        print('no match!')

Text: 'Does this text match the pattern?'

Seeking "this" ->  match!
Seeking "that" ->  no match!
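
As a rough illustration of the compile-once approach, the sketch below builds a pattern at import time and compares it with the equivalent module-level call. The constant name WORD is an assumption made for this example. The optional starting position passed to the compiled search() method is a feature of compiled patterns that the text above does not cover.

```python
import re

# Compile once, when the module is loaded, and reuse the object afterwards
WORD = re.compile('this')

text = 'Does this text match the pattern?'

# The module-level function and the compiled method report the same span
module_level = re.search('this', text)
compiled = WORD.search(text)
same_span = module_level.span() == compiled.span()
print(same_span)

# A compiled pattern can also start scanning from a given position;
# here the scan resumes after the first match and finds nothing more
after_first = WORD.search(text, compiled.end())
print(after_first)
```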


## Multiple Matches

So far, the example patterns have all used search() to look for single instances of literal text strings. The findall() function returns all of the substrings of the input that match the pattern without overlapping.

This example input string includes two instances of ab.

In [3]:
# re_findall.py

import re

text = 'abbaaabbbaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):
    print('Found {!r}'.format(match))

Found 'ab'
Found 'ab'
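
The "without overlapping" rule is easier to see with a pattern that could match at overlapping positions. This short sketch uses an input string chosen only for illustration.

```python
import re

text = 'aaaa'

# The scan moves left to right and resumes after each match, so 'aa' is
# reported at 0:2 and 2:4. The overlapping candidate at 1:3 is skipped.
matches = re.findall('aa', text)
print(matches)
```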


The finditer() function returns an iterator that produces Match instances instead of the strings returned by findall().

This example finds the same two occurrences of ab, and the Match instance shows where they are found in the original input.

In [4]:
# re_finditer.py

import re

text = 'abbaaabbbaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found {!r} at {:d}:{:d}'.format(text[s:e], s, e))

Found 'ab' at 0:2
Found 'ab' at 5:7
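
When only the positions are needed, the start and end indexes can be collected in one step. This is a small sketch that assumes the span() method of Match, which returns both indexes as a tuple.

```python
import re

text = 'abbaaabbbaaaaa'

# span() returns (start, end) for each Match produced by finditer()
spans = [match.span() for match in re.finditer('ab', text)]
print(spans)
```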


## Pattern Syntax

Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with meta-characters that are part of the regular expression pattern syntax implemented by re.

The following examples will use test_patterns() to explore how variations in patterns change the way they match the same input text. The output shows the input text and the substring range from each portion of the input that matches the pattern.

In [5]:
# re_test_patterns.py

import re

def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print("'{} ({})'".format(pattern, desc))
        print("  '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print("  {}'{}'".format(prefix, substr))
        print()
    return

if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa',
                 [('ab',"'a' followed by 'b'"),
                 ('baa',"'b' followed by double 'a'")])

'ab ('a' followed by 'b')'
  'abbaaabbbbaaaaa'
  'ab'
  .....'ab'

'baa ('b' followed by double 'a')'
  'abbaaabbbbaaaaa'
  ..'baa'
  .........'baa'



## Repetition

There are five ways to express repetition in a pattern. 

* A pattern followed by the meta-character \* is repeated zero or more times (allowing a pattern to repeat zero times means it does not need to appear at all to match). 
* If the * is replaced with +, the pattern must appear at least once. 
* Using ? means the pattern appears zero or one time. 
* For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat. 
* Finally, to allow a variable but limited number of repetitions, use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value must appear at least m times, with no maximum.

There are more matches for ab* and ab? than ab+.

In [6]:
# re_repetition.py

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*', 'a followed by zero or more b'),
     ('ab+', 'a followed by one or more b'),
     ('ab?', 'a followed by zero or one b'),
     ('ab{3}', 'a followed by three b'),
     ('ab{2,3}', 'a followed by two to three b')],
)

'ab* (a followed by zero or more b)'
  'abbaabbba'
  'abb'
  ...'a'
  ....'abbb'
  ........'a'

'ab+ (a followed by one or more b)'
  'abbaabbba'
  'abb'
  ....'abbb'

'ab? (a followed by zero or one b)'
  'abbaabbba'
  'ab'
  ...'a'
  ....'ab'
  ........'a'

'ab{3} (a followed by three b)'
  'abbaabbba'
  ....'abbb'

'ab{2,3} (a followed by two to three b)'
  'abbaabbba'
  'abb'
  ....'abbb'
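
The open-ended brace forms are not covered by the patterns above, so here is a minimal sketch of them against the same input text. It uses findall() instead of test_patterns() to stay self-contained, and the variable names are chosen for this example only. The {,n} form, which re treats as {0,n}, is included for comparison.

```python
import re

text = 'abbaabbba'

# {2,} requires at least two b characters and sets no upper limit
at_least_two = re.findall('ab{2,}', text)
print(at_least_two)

# {,2} leaves out the minimum, which re treats as zero
at_most_two = re.findall('ab{,2}', text)
print(at_most_two)
```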



When processing a repetition instruction, re will usually consume as much of the input as possible while matching the pattern. This so-called greedy behavior may result in fewer individual matches, or the matches may include more of the input text than intended. Greediness can be turned off by following the repetition instruction with ?.

Disabling greedy consumption of the input for any of the patterns where zero occurrences of b are allowed means the matched substring does not include any b characters.

In [7]:
# re_repetition_non_greedy.py
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*?', 'a followed by zero or more b'),
     ('ab+?', 'a followed by one or more b'),
     ('ab??', 'a followed by zero or one b'),
     ('ab{3}?', 'a followed by three b'),
     ('ab{2,3}?', 'a followed by two to three b')],
)

'ab*? (a followed by zero or more b)'
  'abbaabbba'
  'a'
  ...'a'
  ....'a'
  ........'a'

'ab+? (a followed by one or more b)'
  'abbaabbba'
  'ab'
  ....'ab'

'ab?? (a followed by zero or one b)'
  'abbaabbba'
  'a'
  ...'a'
  ....'a'
  ........'a'

'ab{3}? (a followed by three b)'
  'abbaabbba'
  ....'abbb'

'ab{2,3}? (a followed by two to three b)'
  'abbaabbba'
  'abb'
  ....'abb'



## Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] would match either a or b.

The greedy form of the expression (a[ab]+) consumes the entire string because the first letter is a and every subsequent character is either a or b.

In [8]:
# re_charset.py
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy')],
)

'[ab] (either a or b)'
  'abbaabbba'
  'a'
  .'b'
  ..'b'
  ...'a'
  ....'a'
  .....'b'
  ......'b'
  .......'b'
  ........'a'

'a[ab]+ (a followed by 1 or more a or b)'
  'abbaabbba'
  'abbaabbba'

'a[ab]+? (a followed by 1 or more a or b, not greedy)'
  'abbaabbba'
  'ab'
  ...'aa'



A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set following the caret.

This pattern finds all of the substrings that do not contain the characters -, ., or a space.

In [9]:
# re_charset_exclude.py
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)

'[^-. ]+ (sequences without -, ., or space)'
  'This is some text -- with punctuation.'
  'This'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'
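
The position of a character inside the brackets affects how it is read. The sketch below, built on an input string invented for illustration, shows that a caret negates the set only when it comes first, and that a hyphen placed last is treated as a literal character.

```python
import re

text = 'a^b-c.d'

# A caret that is not the first character in the set is a literal caret
literal_caret = re.findall('[a^]', text)
print(literal_caret)

# A hyphen at the end of the set is a literal hyphen, and the dot
# has no special meaning inside a set
literal_hyphen = re.findall('[.-]', text)
print(literal_hyphen)

# A leading caret excludes everything listed after it
not_lowercase = re.findall('[^a-z]', text)
print(not_lowercase)
```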



As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format using character ranges can be used to define a character set to include all of the contiguous characters between the specified start and stop points.

Here the range a-z includes the lowercase ASCII letters, and the range A-Z includes the uppercase ASCII letters. The ranges can also be combined into a single character set.

In [10]:
# re_charset_ranges.py
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of letters of either case'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)

'[a-z]+ (sequences of lowercase letters)'
  'This is some text -- with punctuation.'
  .'his'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'

'[A-Z]+ (sequences of uppercase letters)'
  'This is some text -- with punctuation.'
  'T'

'[a-zA-Z]+ (sequences of letters of either case)'
  'This is some text -- with punctuation.'
  'This'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'

'[A-Z][a-z]+ (one uppercase followed by lowercase)'
  'This is some text -- with punctuation.'
  'This'
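
Ranges can be mixed freely within one set. As a hedged example, the sketch below combines three ranges with a fixed repetition count to pick out six-digit hexadecimal values. The sample text and the leading # convention are assumptions made for illustration.

```python
import re

text = 'Colors: #FF8800, #00ff7f, and #xyz123.'

# Digits plus the letters a-f in either case, repeated exactly six times
hex_colors = re.findall('#[0-9a-fA-F]{6}', text)
print(hex_colors)
```

The last candidate is rejected because x, y, and z fall outside all three ranges.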



As a special case of a character set, the meta-character dot, or period (.), indicates that the pattern should match any single character in that position. By default the dot does not match a newline; the DOTALL flag removes that restriction.

Combining the dot with repetition can result in very long matches, unless the non-greedy form is used.

In [11]:
# re_charset_dot.py
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b')],
)


'a. (a followed by any one character)'
  'abbaabbba'
  'ab'
  ...'aa'

'b. (b followed by any one character)'
  'abbaabbba'
  .'bb'
  .....'bb'
  .......'ba'

'a.*b (a followed by anything, ending in b)'
  'abbaabbba'
  'abbaabbb'

'a.*?b (a followed by anything, ending in b)'
  'abbaabbba'
  'ab'
  ...'aab'
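
A more realistic case where the greedy dot causes trouble is matching delimited text. The sketch below uses a made-up markup string to compare the two forms, and it also shows how the dot treats newlines with and without the re.DOTALL flag.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: runs from the first '<' to the last '>' in the input
greedy = re.findall('<.*>', text)
print(greedy)

# Non-greedy: stops at the first '>' that completes each match
non_greedy = re.findall('<.*?>', text)
print(non_greedy)

# By default the dot stops at a newline; re.DOTALL lets it continue
two_lines = 'first\nsecond'
default_dot = re.findall('.+', two_lines)
dotall = re.findall('.+', two_lines, re.DOTALL)
print(default_dot)
print(dotall)
```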



## Escape Codes

An even more compact representation uses escape codes for several predefined character sets. The escape codes recognized by re are listed in the table below.


| Code | Meaning                                |
|------|----------------------------------------|
| \d   | a digit                                |
| \D   | a non-digit                            |
| \s   | whitespace (tab, space, newline, etc.) |
| \S   | non-whitespace                         |
| \w   | alphanumeric or underscore             |
| \W   | neither alphanumeric nor underscore    |

#### Note
Escapes are indicated by prefixing the character with a backslash (\\). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in difficult-to-read expressions. Using raw strings, which are created by prefixing the literal value with r, eliminates this problem and maintains readability.

These sample expressions combine escape codes with repetition to find sequences of like characters in the input string.

In [12]:
# re_escape_codes.py
from re_test_patterns import test_patterns

test_patterns(
    'A prime #1 example!',
    [(r'\d+', 'sequence of digits'),
     (r'\D+', 'sequence of non-digits'),
     (r'\s+', 'sequence of whitespace'),
     (r'\S+', 'sequence of non-whitespace'),
     (r'\w+', 'alphanumeric characters'),
     (r'\W+', 'non-alphanumeric')],
)

'\d+ (sequence of digits)'
  'A prime #1 example!'
  .........'1'

'\D+ (sequence of non-digits)'
  'A prime #1 example!'
  'A prime #'
  ..........' example!'

'\s+ (sequence of whitespace)'
  'A prime #1 example!'
  .' '
  .......' '
  ..........' '

'\S+ (sequence of non-whitespace)'
  'A prime #1 example!'
  'A'
  ..'prime'
  ........'#1'
  ...........'example!'

'\w+ (alphanumeric characters)'
  'A prime #1 example!'
  'A'
  ..'prime'
  .........'1'
  ...........'example'

'\W+ (non-alphanumeric)'
  'A prime #1 example!'
  .' '
  .......' #'
  ..........' '
  ..................'!'
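
The effect of the r prefix can be checked directly. This sketch compares a raw string with its doubled-backslash equivalent, then shows what can go wrong without the prefix, using the \b code from the anchoring table later in this section.

```python
import re

# A raw string and a doubled backslash spell the same three characters
raw = r'\d+'
doubled = '\\d+'
same_pattern = raw == doubled
print(same_pattern)

# Without the r prefix, '\b' is a single backspace character,
# not the two-character word-boundary code that re expects
text = 'A prime #1 example!'
with_raw = re.findall(r'\bprime\b', text)
without_raw = re.findall('\bprime\b', text)
print(with_raw)
print(without_raw)
```

The second search fails silently because the pattern now looks for literal backspace characters around the word.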



To match the characters that are part of the regular expression syntax, escape the characters in the search pattern.

The pattern in this example escapes the backslash and plus characters, since both are meta-characters and have special meaning in a regular expression.

In [13]:
# re_escape_escapes.py
from re_test_patterns import test_patterns

test_patterns(
    r'\d+ \D+ \s+',
    [(r'\\.\+', 'escape code')],
)

'\\.\+ (escape code)'
  '\d+ \D+ \s+'
  '\d+'
  .....'\D+'
  ..........'\s+'
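
When the text to be matched literally is not known in advance, writing the escapes by hand is impractical. One option is the escape() function in re, which adds the backslashes automatically. The sketch below reuses the input from the example above, and the variable names are illustrative.

```python
import re

text = r'\d+ \D+ \s+'
literal = r'\d+'

# escape() protects the backslash and plus so they match themselves
escaped = re.escape(literal)
found = re.findall(escaped, text)
print(found)

# Used unescaped, the same string is a pattern that looks for digits,
# and the input contains none
unescaped = re.findall(literal, text)
print(unescaped)
```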



## Anchoring

In addition to describing the content of a pattern to match, anchoring instructions can specify the relative location in the input text where the pattern should appear. The table below lists valid anchoring codes.

| Code | Meaning                                            |
|------|----------------------------------------------------|
| ^    | start of string, or line                           |
| $    | end of string, or line                             |
| \A   | start of string                                    |
| \Z   | end of string                                      |
| \b   | empty string at the beginning or end of a word     |
| \B   | empty string not at the beginning or end of a word |

The patterns in the example for matching words at the beginning and the end of the string are different because the word at the end of the string is followed by punctuation to terminate the sentence. The pattern \w+$ would not match, since . is not considered an alphanumeric character.



In [14]:
# re_anchoring.py
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [(r'^\w+', 'word at start of string'),
     (r'\A\w+', 'word at start of string'),
     (r'\w+\S*$', 'word near end of string'),
     (r'\w+\S*\Z', 'word near end of string'),
     (r'\w*t\w*', 'word containing t'),
     (r'\bt\w+', 't at start of word'),
     (r'\w+t\b', 't at end of word'),
     (r'\Bt\B', 't, not start or end of word')],
)

'^\w+ (word at start of string)'
  'This is some text -- with punctuation.'
  'This'

'\A\w+ (word at start of string)'
  'This is some text -- with punctuation.'
  'This'

'\w+\S*$ (word near end of string)'
  'This is some text -- with punctuation.'
  ..........................'punctuation.'

'\w+\S*\Z (word near end of string)'
  'This is some text -- with punctuation.'
  ..........................'punctuation.'

'\w*t\w* (word containing t)'
  'This is some text -- with punctuation.'
  .............'text'
  .....................'with'
  ..........................'punctuation'

'\bt\w+ (t at start of word)'
  'This is some text -- with punctuation.'
  .............'text'

'\w+t\b (t at end of word)'
  'This is some text -- with punctuation.'
  .............'text'

'\Bt\B (t, not start or end of word)'
  'This is some text -- with punctuation.'
  .......................'t'
  ..............................'t'
  .................................'t'
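
The table lists ^ and $ as matching at the start or end of a string "or line". The line-based behavior applies only when the re.MULTILINE flag is set, which the example above does not use. The sketch below, on a two-line input invented for illustration, contrasts the flag with the \A and \Z codes, which always refer to the whole string.

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ matches only at the start of the string
start_default = re.findall(r'^\w+', text)
print(start_default)

# With MULTILINE, ^ and $ also match at the start and end of each line
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)
end_multiline = re.findall(r'\w+$', text, re.MULTILINE)
print(start_multiline)
print(end_multiline)

# \A and \Z ignore the flag and stay tied to the whole string
start_of_string = re.findall(r'\A\w+', text, re.MULTILINE)
end_of_string = re.findall(r'\w+\Z', text, re.MULTILINE)
print(start_of_string)
print(end_of_string)
```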

