#7.2. re — Regular expression operations

**Regular expression pattern strings may not contain null bytes, but can specify the null byte using the \number notation, e.g., '\x00'.**

In [1]:
import re
m = re.search('(?<=abc)def', 'abcdef')

In [4]:
m.group(0)

'def'

In [5]:
m = re.search('(?<=-)\w+', 'spam-egg')
m.group(0)

'egg'

**(?(id/name)yes-pattern|no-pattern)**  #Learn this one.
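
A minimal sketch of the conditional construct, assuming an illustrative e-mail-like pattern (the name `bracketed` and the sample strings are made up for this example). Group 1 optionally captures `<`; `(?(1)>|$)` then requires `>` if group 1 matched, and end-of-string otherwise.

```python
import re

# Hypothetical pattern: an address optionally wrapped in <...>.
# (?(1)>|$) means: if group 1 matched, expect '>'; otherwise expect the end.
bracketed = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

for candidate in ['<user@host.com>', 'user@host.com',
                  '<user@host.com', 'user@host.com>']:
    # The first two candidates match; the last two do not.
    print('%-17s -> %s' % (candidate, bool(bracketed.match(candidate))))
```

The yes-pattern and no-pattern are ordinary sub-patterns, so either branch can be as simple as a single character or an anchor, as here.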

#7.2.2. Module Contents

##re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
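
As a small illustration of combining flags with `|`, the sketch below ORs `re.IGNORECASE` and `re.MULTILINE` together. The sample text and variable names are assumptions made for this example.

```python
import re

# Combine two flags with bitwise OR: case-insensitive and multi-line.
combined = re.compile(r'^python$', re.IGNORECASE | re.MULTILINE)

sample = 'Ruby\nPYTHON\nperl'   # illustrative input
found = combined.search(sample)
print(found.group(0))           # the second line, 'PYTHON'

# With only IGNORECASE, '^' and '$' anchor to the whole string, so no match.
only_ignorecase = re.search(r'^python$', sample, re.IGNORECASE)
print(only_ignorecase)          # None
```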

In [9]:
pattern = 'Python'
string = 'People enjoy learning Python.'

In [10]:
prog = re.compile(pattern)
result = prog.match(string)

In [11]:
result = re.match(pattern, string)

The two forms above are equivalent, but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note: The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

##re.DEBUG
Display debug information about compiled expression.

In [18]:
re.DEBUG

128

In [21]:
re.match(pattern, string, re.DEBUG)

literal 80
literal 121
literal 116
literal 104
literal 111
literal 110


##re.I
##re.IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.

In [24]:
re.match(pattern, string, re.I)

In [28]:
re.match(pattern, string, re.IGNORECASE)

In [29]:
re.I

2

In [30]:
re.IGNORECASE

2

##re.L
##re.LOCALE
Make \w, \W, \b, \B, \s and \S dependent on the current locale.

In [31]:
re.L

4

In [32]:
re.match(pattern, string, re.L)

##re.M
##re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).

In [33]:
re.M

8

In [34]:
re.match(pattern, string, re.M)

##re.S
##re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

In [35]:
re.match(pattern, string, re.DOTALL)

In [36]:
re.S

16
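
The `pattern`/`string` pair used above contains no newline, so the flag has no visible effect there. A short sketch with an assumed two-line input shows the difference:

```python
import re

two_lines = 'first\nsecond'   # illustrative input containing a newline

# Without DOTALL, '.' stops at the newline.
plain = re.match(r'.+', two_lines)
# With DOTALL, '.' also consumes the newline.
dotall = re.match(r'.+', two_lines, re.DOTALL)

print(repr(plain.group(0)))    # 'first'
print(repr(dotall.group(0)))   # 'first\nsecond'
```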

##re.U
##re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

In [37]:
re.U

32

In [38]:
re.match(pattern, string, re.U)

##re.X
##re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

In [39]:
re.X

64

That means that the two following regular expression objects that match a decimal number are functionally equal:

In [41]:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

##re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

In [42]:
re.search(pattern,string)

<_sre.SRE_Match at 0x1037c88b8>
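
The distinction between "no match" and a zero-length match can be seen in a small sketch (the patterns and input are chosen only for illustration):

```python
import re

empty_match = re.search(r'x*', 'abc')   # matches the empty string at index 0
no_match = re.search(r'x+', 'abc')      # nothing matches, so None is returned

print(repr(empty_match.group(0)), empty_match.span())   # '' (0, 0)
print(no_match)                                         # None
```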

##re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

In [44]:
re.match(pattern,string)

In [45]:
string

'People enjoy learning Python.'

In [55]:
string = 'Python is great! We love Python!'

In [56]:
re.match(pattern, string)

<_sre.SRE_Match at 0x1037cf2a0>
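
To make the MULTILINE remark concrete, here is a sketch on an assumed three-line input. `re.match()` still anchors at index 0, while `re.search()` with `'^'` can find the start of a later line.

```python
import re

lines = 'A\nB\nX'   # illustrative three-line input

# match() only tries index 0, even with MULTILINE.
at_start = re.match(r'X', lines, re.MULTILINE)
# search() with '^' and MULTILINE finds 'X' at the start of the third line.
line_start = re.search(r'^X', lines, re.MULTILINE)

print(at_start)            # None
print(line_start.span())   # (4, 5)
```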

##re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

In [57]:
re.split(pattern, string)

['', ' is great! We love ', '!']

In [58]:
re.split('\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [59]:
re.split('(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [60]:
re.split('\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

In [61]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

In [62]:
re.split('(\W+)', '...words, words...')

['', '...', 'words', ', ', 'words', '...', '']

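Note that in Python 2, split will never split a string on an empty pattern match (Python 3.7 and later do split on empty matches), so the following calls return the string unsplit:

In [66]: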
re.split('x*', 'foo')

['foo']

In [65]:
re.split("(?m)^$", "foo\n\nbar\n")

['foo\n\nbar\n']

##re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

New in version 1.5.2.

Changed in version 2.4: Added the optional flags argument.

In [68]:
re.findall(pattern,string, re.DEBUG)

literal 80
literal 121
literal 116
literal 104
literal 111
literal 110


['Python', 'Python']
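
The example above has no groups, so it returns whole matches. The effect of groups on the return value can be sketched as follows (input text and names are illustrative):

```python
import re

pairs_text = 'width=80, height=24'   # illustrative input

no_group = re.findall(r'\w+=\d+', pairs_text)        # whole matches
one_group = re.findall(r'\w+=(\d+)', pairs_text)     # only the text of group 1
two_groups = re.findall(r'(\w+)=(\d+)', pairs_text)  # one tuple per match

print(no_group)     # ['width=80', 'height=24']
print(one_group)    # ['80', '24']
print(two_groups)   # [('width', '80'), ('height', '24')]
```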

##re.finditer(pattern, string, flags=0)
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

New in version 2.2.

Changed in version 2.4: Added the optional flags argument.

In [101]:
x = re.finditer(pattern, string, re.DOTALL)

In [103]:
y = iter(x)

In [105]:
next(y).group()

'Python'

In [106]:
next(y).group()

'Python'

In [107]:
next(y).group()

StopIteration: 

##re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

In [108]:
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       r'static PyObject*\npy_\1(void)\n{',
       'def myfunc():')

'static PyObject*\npy_myfunc(void)\n{'

In [113]:
def dashrepl(matchobj):
    if matchobj.group(0) == '-': return ' '
    else: return '-'
re.sub('-{1,2}', dashrepl, 'pro----gram-files')

'pro--gram files'

In [115]:
def dashrepl(matchobj):
    if matchobj.group(0) == '-': return ' '
    else: return ''
re.sub('-{1,4}', dashrepl, 'pro----gram-files')

'program files'

In [114]:
re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)

'Baked Beans & Spam'
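
Two features of sub() not exercised above are the `count` argument, which limits how many occurrences are replaced, and `\g<name>` references to named groups in the replacement string. A hedged sketch with made-up inputs:

```python
import re

# count=1 limits the substitution to the leftmost occurrence.
once = re.sub(r'-', ' ', 'pro-gram-files', count=1)

# \g<name> in the replacement refers to a named group of the pattern.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)', r'\g<last>, \g<first>',
                 'Malcolm Reynolds')

print(once)      # pro gram-files
print(swapped)   # Reynolds, Malcolm
```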

##re.subn(pattern, repl, string, count=0, flags=0)
Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

Changed in version 2.7: Added the optional flags argument.

In [119]:
re.subn(pattern, 'Learning Python', string)

('Learning Python is great! We love Learning Python!', 2)

##re.escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

In [121]:
string2 = '!@#$%^&*()saber'
re.escape(string2)

'\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)saber'
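
A typical use is searching for a literal string that happens to contain metacharacters. The sketch below uses an assumed literal, `'1+1=2'`, in which `+` would otherwise act as a quantifier.

```python
import re

literal = '1+1=2'              # illustrative literal; '+' is a metacharacter
haystack = 'so 1+1=2 holds'

# Unescaped, the pattern means "one or more '1', then '1=2'", which is absent.
unescaped = re.search(literal, haystack)
# Escaped, every character is matched literally.
escaped = re.search(re.escape(literal), haystack)

print(unescaped)          # None
print(escaped.group(0))   # 1+1=2
```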

##re.purge()
Clear the regular expression cache.

In [124]:
re.purge()

##exception re.error
Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.

**Allows Assignment using re.error=arg**
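
One common way to work with this exception is to catch it around re.compile(). The helper `try_compile` below is a hypothetical name introduced only for this sketch; it is not part of the re module.

```python
import re

def try_compile(pattern_text):
    """Return a compiled pattern, or None if pattern_text is not a valid regex."""
    try:
        return re.compile(pattern_text)
    except re.error as exc:
        # exc carries the compiler's description of the problem.
        print('invalid pattern %r: %s' % (pattern_text, exc))
        return None

ok = try_compile(r'(abc)')
bad = try_compile(r'(abc')   # unmatched parenthesis, so re.error is raised
```

A valid pattern that simply finds nothing never raises; search() and match() return None in that case.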

#7.2.3. Regular Expression Objects

##class re.RegexObject
The RegexObject class supports the following methods and attributes:

##search(string[, pos[, endpos]])
Scan through string looking for a location where this regular expression produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

In [3]:
pattern = re.compile("d")
pattern.search("dog")     # Match at index 0

<_sre.SRE_Match at 0x103788920>

In [4]:
pattern.search("dog", 1)  # No match; search doesn't include the "d"
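
The remark that pos is not equivalent to slicing shows up as soon as the pattern contains `'^'`. A minimal sketch, with names chosen for illustration:

```python
import re

caret_b = re.compile(r'^b')

# With pos=1 the search starts at 'b', but '^' still means the real start
# of the string, so nothing matches.
with_pos = caret_b.search('ab', 1)
# Slicing creates a new string that really does begin with 'b'.
with_slice = caret_b.search('ab'[1:])

print(with_pos)             # None
print(with_slice.group(0))  # b
```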

##match(string[, pos[, endpos]])
If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

In [5]:
pattern = re.compile("o")
pattern.match("dog")      # No match as "o" is not at the start of "dog"

In [6]:
pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".

<_sre.SRE_Match at 0x103788b28>

##split(string, maxsplit=0)
Identical to the split() function, using the compiled pattern.

##findall(string[, pos[, endpos]])
Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

##finditer(string[, pos[, endpos]])
Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().
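
As a sketch of the pos and endpos parameters on the compiled-pattern methods (the digit string is an assumed input):

```python
import re

digits = re.compile(r'\d')
sample = '0123456789'   # illustrative input

everything = digits.findall(sample)
window = digits.findall(sample, 2, 5)   # only indices 2, 3 and 4 are searched
spans = [m.span() for m in digits.finditer(sample, 8)]   # from index 8 onward

print(window)   # ['2', '3', '4']
print(spans)    # [(8, 9), (9, 10)]
```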

##sub(repl, string, count=0)
Identical to the sub() function, using the compiled pattern.

##subn(repl, string, count=0)
Identical to the subn() function, using the compiled pattern.

##flags
The regex matching flags. This is a combination of the flags given to compile() and any (?...) inline flags in the pattern.

##groups
The number of capturing groups in the pattern.

##groupindex
A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.
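
These three attributes can be inspected on any compiled pattern. The sketch below uses an assumed pattern with an inline `(?i)` flag and two named groups. The exact integer value of `flags` can differ between Python versions, so only the IGNORECASE bit is tested.

```python
import re

name_pattern = re.compile(r'(?i)(?P<first>\w+) (?P<last>\w+)')

print(name_pattern.groups)                       # 2
print(dict(name_pattern.groupindex))             # group name -> group number
print(bool(name_pattern.flags & re.IGNORECASE))  # True: the inline flag is included
```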

#7.2.4. Match Objects

##class re.MatchObject
Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

In [16]:
pattern = 'Python'
string = "Learning Python for fun and profit."
def process(word):
    print word[::-1]

In [17]:
match = re.search(pattern, string)
if match:
    process(match.group())

nohtyP


##expand(template)
Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.

In [39]:
match = re.compile(r"(Python)")

In [40]:
match_object = match.search(string)

In [41]:
print match_object.expand(r"Re-Learning \1")    # \1 expands to group 1, "Python"

Re-Learning Python


##group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

In [70]:
string = 'Learning Python for fun and profit. Learning Python for fun and profit.'
match = re.compile(r"(Python)")
match_object = match.search(string)

In [47]:
match_object.group(0)

'Python'

In [48]:
match_object.group(1)

'Python'

In [49]:
match_object.group(2)

IndexError: no such group

In [50]:
match_object.group(0,1)

('Python', 'Python')

In [51]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.group('first_name')

'Malcolm'

In [52]:
m.group('last_name')

'Reynolds'
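
Two cases from the description are not covered by the cells above: a group that matches several times, and a group that does not participate in the match. A short sketch with illustrative patterns:

```python
import re

m_repeat = re.match(r'(..)+', 'a1b2c3')   # group 1 matches three times
m_optional = re.match(r'(a)?b', 'b')      # group 1 does not participate

print(m_repeat.group(1))     # c3 (only the last repetition is kept)
print(m_optional.group(0))   # b
print(m_optional.group(1))   # None
```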

##groups([default])
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None. (Incompatibility note: in the original Python 1.5 release, if the tuple was one element long, a string would be returned instead. In later versions (from 1.5.1 on), a singleton tuple is returned in such cases.)

In [53]:
m = re.match(r"(\d+)\.(\d+)", "24.1632")
m.groups()

('24', '1632')

In [56]:
m = re.match(r"(\d+)\.?(\d+)?", "24")
m.groups()      # Second group defaults to None.

('24', None)

In [55]:
m.groups('0')   # Now, the second group defaults to '0'.

('24', '0')

##groupdict([default])
Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

In [57]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
m.groupdict()

{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
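
The default argument matters only when a named group did not participate. A sketch with an assumed pattern whose fractional part is optional:

```python
import re

m_partial = re.match(r'(?P<whole>\d+)(?:\.(?P<frac>\d+))?', '24')

without_default = m_partial.groupdict()      # missing group maps to None
with_default = m_partial.groupdict('0')      # missing group maps to '0'

print(without_default['frac'])   # None
print(with_default['frac'])      # 0
```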

##start([group])
##end([group])
Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:

In [59]:
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
email[:m.start()] + email[m.end():]

'tony@tiger.net'

##span([group])
For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

In [61]:
match_object.span()

(9, 15)

In [62]:
string[9:15]

'Python'

##pos
The value of pos which was passed to the search() or match() method of the RegexObject. This is the index into the string at which the RE engine started looking for a match.

In [66]:
match_object.pos

0

##endpos
The value of endpos which was passed to the search() or match() method of the RegexObject. This is the index into the string beyond which the RE engine will not go.

In [67]:
match_object.endpos

71

##lastindex
The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

In [68]:
match_object.lastindex

1
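
The four expressions named in the description can be checked directly. The loop below is only a sketch for verifying the stated values.

```python
import re

expressions = [r'(a)b', r'((a)(b))', r'((ab))', r'(a)(b)']
last_indexes = [re.match(expr, 'ab').lastindex for expr in expressions]

for expr, idx in zip(expressions, last_indexes):
    print('%-10s lastindex == %d' % (expr, idx))
# The first three report 1; (a)(b) reports 2.
```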

##lastgroup
The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.

In [69]:
match_object.lastgroup

In [71]:
string = 'Learning Python for fun and profit. Learning Python for fun and profit.'
match = re.compile("(?P<token>Python)")
match_object = match.search(string)

In [72]:
match_object.lastgroup

'token'

In [74]:
match_object.group('token')

'Python'

##re
The regular expression object whose match() or search() method produced this MatchObject instance.

In [75]:
match_object.re

re.compile(r'(?P<token>Python)')

##string
The string passed to match() or search().

In [76]:
match_object.string

'Learning Python for fun and profit. Learning Python for fun and profit.'

##7.2.5.4. Making a Phonebook

In [77]:
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

In [78]:
entries = re.split("\n+", text)

In [79]:
entries

['Ross McFluff: 834.345.1254 155 Elm Street',
 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
 'Frank Burger: 925.541.7625 662 South Dogwood Way',
 'Heather Albrecht: 548.326.4584 919 Park Place']

In [80]:
[re.split(":? ", entry, 3) for entry in entries]

[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

##7.2.5.5. Text Munging

In [83]:
from numpy import random

In [84]:
def repl(m):
  inner_word = list(m.group(2))
  random.shuffle(inner_word)
  return m.group(1) + "".join(inner_word) + m.group(3)
text = "Professor Abdolmalek, please report your absences promptly."
re.sub(r"(\w)(\w+)(\w)", repl, text)

'Poossferr Aemadblolk, pesale reprot yuor aensecbs ptrpomly.'

##7.2.5.6. Finding all Adverbs

In [86]:
text = "He was carefully disguised but captured quickly by police."
re.findall(r"\w+ly", text)

['carefully', 'quickly']

##7.2.5.7. Finding all Adverbs and their Positions

In [87]:
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))

07-16: carefully
40-47: quickly


##7.2.5.8. Raw String Notation

In [89]:
re.match(r"\W(.)\1\W", " ff ")

<_sre.SRE_Match at 0x103db16c0>

In [90]:
re.match("\\W(.)\\1\\W", " ff ")

<_sre.SRE_Match at 0x103db1198>

In [91]:
re.match(r"\\", r"\\")

<_sre.SRE_Match at 0x103db73d8>

In [92]:
re.match("\\\\", r"\\")

<_sre.SRE_Match at 0x103db74a8>