In [1]:
import re

## Finding Patterns of Text with Regular Expressions

In [73]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242
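
The example above assumes a match exists. When the pattern is not found, search() returns None, and calling group() on None raises an AttributeError. A minimal sketch of guarding against that follows; the helper name findPhoneNumber is hypothetical and used only for illustration.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def findPhoneNumber(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    # search() returns None on failure, so check before calling group().
    if mo is None:
        return None
    return mo.group()

print(findPhoneNumber('My number is 415-555-4242.'))  # 415-555-4242
print(findPhoneNumber('No number here.'))             # None
```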


## Grouping with Parentheses

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. 

In [19]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
mo.group(1)

'415'

In [20]:
mo.group(2)

'555-4242'

In [21]:
mo.group(0)

'415-555-4242'

In [22]:
mo.group()

'415-555-4242'

If you would like to retrieve all the groups at once, use the groups() method—note the plural form for the name.

In [23]:
mo.groups()

('415', '555-4242')

In [24]:
areaCode, mainNumber = mo.groups()
print(areaCode)

415


In [25]:
print(mainNumber)

555-4242

Parentheses have a special meaning in regular expressions. To match a literal parenthesis in the text, escape it with a backslash: \( and \).


In [26]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

In [27]:
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

In [28]:
mo.group(1)

'(415)'

In [29]:
mo.group(2)

'555-4242'

## Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object. Enter the following into the interactive shell:

In [30]:
heroRegex = re.compile(r'Batman|Tina Fey')

In [31]:
mo1 = heroRegex.search('Batman and Tina Fey.')

In [32]:
mo1.group()

'Batman'

In [33]:
mo2 = heroRegex.search('Tina Fey and Batman.')

In [34]:
mo2.group()

'Tina Fey'

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. Enter the following into the interactive shell:

In [35]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')

In [36]:
mo = batRegex.search('Batmobile lost a wheel')

In [37]:
mo.group()

'Batmobile'

In [38]:
mo.group(1)

'mobile'

The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

If you need to match an actual pipe character, escape it with a backslash, like \|.
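
As a small illustration of the escaped pipe, the sketch below looks for two words joined by a literal | character. The sample sentence is made up for this example.

```python
import re

# \| matches a literal pipe; an unescaped | would mean "or".
pipeRegex = re.compile(r'\w+\|\w+')
mo = pipeRegex.search('Use the filter grep|sort here.')
print(mo.group())  # grep|sort
```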

## Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern. For example, enter the following into the interactive shell:

In [39]:
batRegex = re.compile(r'Bat(wo)?man')

In [40]:
mo1 = batRegex.search('The Adventures of Batman')

In [41]:
mo1.group()

'Batman'

In [42]:
mo2 = batRegex.search('The Adventures of Batwoman')

In [43]:
mo2.group()

'Batwoman'

The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code. Enter the following into the interactive shell:

In [44]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

In [45]:
mo1 = phoneRegex.search('My number is 415-555-4242')

In [46]:
mo1.group()

'415-555-4242'

In [48]:
mo2 = phoneRegex.search('My number is 555-4242')

In [47]:
mo2.group()

'555-4242'

You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”

If you need to match an actual question mark character, escape it with \?.
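
One detail worth knowing: when an optional group does not take part in the match, group() returns None for that group rather than an empty string. The sketch below reuses the phone regex from this section; the variable names withArea and withoutArea are chosen for illustration.

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

withArea = phoneRegex.search('My number is 415-555-4242')
withoutArea = phoneRegex.search('My number is 555-4242')

print(withArea.group(1))     # 415-
print(withoutArea.group(1))  # None, because the optional group was skipped
print(withoutArea.groups())  # (None,)
```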

## Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”—the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. Let’s look at the Batman example again.

In [48]:
batRegex = re.compile(r'Bat(wo)*man')

In [49]:
mo1 = batRegex.search('The Adventures of Batman')

In [50]:
mo1.group()

'Batman'

In [51]:
mo2 = batRegex.search('The Adventures of Batwoman')

In [52]:
mo2.group()

'Batwoman'

In [53]:
mo3 = batRegex.search('The Adventures of Batwowowowoman')

In [54]:
mo3.group()

'Batwowowowoman'

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash: \*.

## Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

In [55]:
batRegex = re.compile(r'Bat(wo)+man')

In [56]:
mo1 = batRegex.search('The Adventures of Batwoman')

In [57]:
mo1.group()

'Batwoman'

In [58]:
mo2 = batRegex.search('The Adventures of Batwowowowoman')

In [59]:
mo2.group()

'Batwowowowoman'

In [60]:
mo3 = batRegex.search('The Adventures of Batman')

In [61]:
mo3 == None

True

The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one wo is required by the plus sign.

If you need to match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.
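
To compare the three quantifiers seen so far side by side, the following sketch builds the same Bat(wo)man pattern with ?, *, and + and records which sample strings each one matches. The results dictionary and sample list are illustrative only.

```python
import re

samples = ['Batman', 'Batwoman', 'Batwowoman']
results = {}
for quantifier in ['?', '*', '+']:
    regex = re.compile(r'Bat(wo)' + quantifier + 'man')
    # Keep only the samples in which the regex finds a match.
    results[quantifier] = [s for s in samples if regex.search(s) is not None]
    print(quantifier, results[quantifier])

# ? ['Batman', 'Batwoman']
# * ['Batman', 'Batwoman', 'Batwowoman']
# + ['Batwoman', 'Batwowoman']
```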

## Matching Specific Repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)

And these two regular expressions also match identical patterns:

(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

In [67]:
haRegex = re.compile(r'(Ha){3}')

In [68]:
mo1 = haRegex.search('HaHaHa')

In [69]:
mo1.group()

'HaHaHa'

In [70]:
mo2 = haRegex.search('Ha')

In [71]:
mo2 == None

True

Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha', search() returns None.
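
The open-ended forms described above can be tried the same way. This is a short sketch using a made-up string with six repeats of Ha; the variable names are chosen for illustration.

```python
import re

laugh = 'HaHaHaHaHaHa'  # six repeats of Ha

atLeastThree = re.search(r'(Ha){3,}', laugh)   # three or more
atMostFive = re.search(r'(Ha){,5}', laugh)     # zero to five
tooShort = re.search(r'(Ha){3,5}', 'HaHa')     # only two repeats available

print(atLeastThree.group())  # HaHaHaHaHaHa
print(atMostFive.group())    # HaHaHaHaHa
print(tooShort)              # None
```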

## Greedy and Nongreedy Matching

Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', you may wonder why the Match object’s call to group() in the example below returns 'HaHaHaHaHa' instead of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the regular expression (Ha){3,5}.

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

In [72]:
greedyHaRegex = re.compile(r'(Ha){3,5}')

In [73]:
mo1 = greedyHaRegex.search('HaHaHaHaHa')

In [74]:
mo1.group()

'HaHaHaHaHa'

In [75]:
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')

In [76]:
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')

In [77]:
mo2.group()

'HaHaHa'

Note that the question mark can have two meanings in regular expressions: declaring a nongreedy match or flagging an optional group. These meanings are entirely unrelated.
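
The nongreedy question mark is not limited to curly brackets; it can follow + and * as well. A brief sketch, using a made-up digit string:

```python
import re

digits = '12345'

greedyPlus = re.search(r'\d+', digits).group()      # as many digits as possible
nongreedyPlus = re.search(r'\d+?', digits).group()  # as few as possible, but at least one
nongreedyStar = re.search(r'\d*?', digits).group()  # as few as possible, which is zero

print(repr(greedyPlus))     # '12345'
print(repr(nongreedyPlus))  # '1'
print(repr(nongreedyStar))  # ''
```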

## The findall() Method

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. To see how search() returns a Match object only on the first instance of matching text, enter the following into the interactive shell:

In [78]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [79]:
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')

In [80]:
mo.group()

'415-555-9999'

On the other hand, findall() will not return a Match object but a list of strings—as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. Enter the following into the interactive shell:

In [81]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups

In [82]:
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. To see findall() in action, enter the following into the interactive shell (notice that the regular expression being compiled now has groups in parentheses):

In [83]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups

In [84]:
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

To summarize what the findall() method returns, remember the following:

1. When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].

2. When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].
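
One case falls between the two rules above: with exactly one group, findall() returns a list of strings holding just that group's text, not one-item tuples. If the full match is wanted despite the group, the finditer() method, which yields Match objects, is one option. The sketch below uses hypothetical variable names.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'
oneGroupRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')  # exactly one group

# With one group, findall() returns that group's text for each match.
areaCodes = oneGroupRegex.findall(text)
print(areaCodes)  # ['415', '212']

# finditer() yields Match objects, so group() still gives the full match.
fullMatches = [mo.group() for mo in oneGroupRegex.finditer(text)]
print(fullMatches)  # ['415-555-9999', '212-555-0000']
```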

## Character Classes

Shorthand character classes and what each one represents:

\d Any numeric digit from 0 to 9.

\D Any character that is not a numeric digit from 0 to 9.

\w Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)

\W Any character that is not a letter, numeric digit, or the underscore character.

\s Any space, tab, or newline character. (Think of this as matching “space” characters.)

\S Any character that is not a space, tab, or newline.

In [85]:
xmasRegex = re.compile(r'\d+\s\w+')

In [87]:
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']

The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.
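
The uppercase shorthand classes can be explored the same way. The sample strings in this sketch are made up for illustration.

```python
import re

sample = 'hi, there_1!'

wordRuns = re.findall(r'\w+', sample)    # runs of letters, digits, underscores
nonWordChars = re.findall(r'\W', sample) # everything else, one character at a time
nonDigitRuns = re.findall(r'\D+', 'a1b22c')
nonSpaceRuns = re.findall(r'\S+', 'one two\tthree\nfour')

print(wordRuns)      # ['hi', 'there_1']
print(nonWordChars)  # [',', ' ', '!']
print(nonDigitRuns)  # ['a', 'b', 'c']
print(nonSpaceRuns)  # ['one', 'two', 'three', 'four']
```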

## Making Your Own Character Classes

There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase. Enter the following into the interactive shell:

In [88]:
vowelRegex = re.compile(r'[aeiouAEIOU]')

In [89]:
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.

Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].
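
A quick sketch of the two character classes just mentioned, run against made-up strings:

```python
import re

# Inside square brackets the period is a literal period, not a wildcard.
lowDigitsAndDots = re.findall(r'[0-5.]', 'v3.7 and 9.1')
print(lowDigitsAndDots)  # ['3', '.', '.', '1']

# Ranges combine freely inside one class.
alphanumericRuns = re.findall(r'[a-zA-Z0-9]+', 'Room 42-B!')
print(alphanumericRuns)  # ['Room', '42', 'B']
```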

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example, enter the following into the interactive shell:

In [90]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')

In [91]:
consonantRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

## The Caret and Dollar Sign Characters

You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string.

For example, the r'^Hello' regular expression string matches strings that begin with 'Hello'. Enter the following into the interactive shell:


In [95]:
beginsWithHello = re.compile(r'^Hello')

In [96]:
beginsWithHello.search('Hello world!')

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [97]:
beginsWithHello.search('He said hello.') == None

True

The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9. 

In [98]:
endsWithNumber = re.compile(r'\d$')

In [99]:
endsWithNumber.search('Your number is 42')

<_sre.SRE_Match object; span=(16, 17), match='2'>

In [100]:
endsWithNumber.search('Your number is forty two.') == None

True

The r'^\d+$' regular expression string matches strings that both begin and end with one or more numeric characters.

In [101]:
wholeStringIsNum = re.compile(r'^\d+$')

In [102]:
wholeStringIsNum.search('1234567890')

<_sre.SRE_Match object; span=(0, 10), match='1234567890'>

In [103]:
wholeStringIsNum.search('12345xyz67890') == None

True

In [104]:
wholeStringIsNum.search('12 34567890') == None

True

The last two search() calls in the previous interactive shell example demonstrate how the entire string must match the regex if ^ and $ are used.
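
One subtlety: in Python, $ also matches just before a newline at the very end of the string, so r'^\d+$' accepts '1234\n'. When that matters, the fullmatch() method of Regex objects (available since Python 3.4) is one alternative. The sketch below illustrates the difference and is not the only way to handle it.

```python
import re

wholeStringIsNum = re.compile(r'^\d+$')

# $ matches before a trailing newline, so search() still succeeds here.
searchWithNewline = wholeStringIsNum.search('1234\n')
# fullmatch() requires the match to cover every character, newline included.
fullmatchWithNewline = wholeStringIsNum.fullmatch('1234\n')
fullmatchClean = wholeStringIsNum.fullmatch('1234')

print(searchWithNewline is not None)     # True
print(fullmatchWithNewline is not None)  # False
print(fullmatchClean is not None)        # True
```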

## The Wildcard Character

The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline.

In [105]:
atRegex = re.compile(r'.at')

In [106]:
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Remember that the dot character will match just one character, which is why the match for the text flat in the previous example matched only lat. To match an actual dot, escape the dot with a backslash: \..
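
The difference between the wildcard and an escaped dot shows up as soon as the text contains something other than a period in that position. A small sketch with a made-up string:

```python
import re

text = '3.5 and 3x5'

wildcardMatches = re.findall(r'\d.\d', text)   # dot matches any character
literalMatches = re.findall(r'\d\.\d', text)   # \. matches only a period

print(wildcardMatches)  # ['3.5', '3x5']
print(literalMatches)   # ['3.5']
```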

## Matching Everything with Dot-Star

Sometimes you will want to match everything and anything. For example, say you want to match the string 'First Name:', followed by any and all text, followed by 'Last Name:', and then followed by anything again. You can use the dot-star (.*) to stand in for that “anything.” Remember that the dot character means “any single character except the newline,” and the star character means “zero or more of the preceding character.”

In [107]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')

In [108]:
mo = nameRegex.search('First Name: Al Last Name: Sweigart')

In [109]:
mo.group(1)

'Al'

In [110]:
mo.group(2)

'Sweigart'

The dot-star uses greedy mode: It will always try to match as much text as possible. To match any and all text in a nongreedy fashion, use the dot, star, and question mark (.*?). Like with curly brackets, the question mark tells Python to match in a nongreedy way.

In [111]:
nongreedyRegex = re.compile(r'<.*?>')

In [112]:
mo = nongreedyRegex.search('<To serve man> for dinner.>')

In [113]:
mo.group()

'<To serve man>'

In [114]:
greedyRegex = re.compile(r'<.*>')

In [115]:
mo = greedyRegex.search('<To serve man> for dinner.>')

In [116]:
mo.group()

'<To serve man> for dinner.>'

Both regexes roughly translate to “Match an opening angle bracket, followed by anything, followed by a closing angle bracket.” But the string '<To serve man> for dinner.>' has two possible matches for the closing angle bracket. In the nongreedy version of the regex, Python matches the shortest possible string: '<To serve man>'. In the greedy version, Python matches the longest possible string: '<To serve man> for dinner.>'.
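
The same contrast appears with findall(). In this sketch the sample string is a variation on the one above, with two bracketed parts:

```python
import re

text = '<To serve man> for <dinner>'

nongreedyMatches = re.findall(r'<.*?>', text)  # stops at the first closing bracket
greedyMatches = re.findall(r'<.*>', text)      # runs to the last closing bracket

print(nongreedyMatches)  # ['<To serve man>', '<dinner>']
print(greedyMatches)     # ['<To serve man> for <dinner>']
```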

## Matching Newlines with the Dot Character

The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

In [117]:
noNewlineRegex = re.compile('.*')

In [119]:
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [120]:
newlineRegex = re.compile('.*', re.DOTALL)

In [121]:
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

The regex noNewlineRegex, which did not have re.DOTALL passed to the re.compile() call that created it, will match everything only up to the first newline character, whereas newlineRegex, which did have re.DOTALL passed to re.compile(), matches everything. This is why the newlineRegex.search() call matches the full string, including its newline characters.
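
Because the dot stops at newlines by default, findall() with r'.+' splits the text into its lines, while the re.DOTALL version returns the text as one piece. A sketch using the same three-line string:

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

perLine = re.compile(r'.+').findall(text)
wholeText = re.compile(r'.+', re.DOTALL).findall(text)

print(perLine)    # ['Serve the public trust.', 'Protect the innocent.', 'Uphold the law.']
print(wholeText)  # one item containing the full string, newlines included
```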

## Case-Insensitive Matching

Sometimes you care only about matching the letters without worrying whether they’re uppercase or lowercase. To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile().

In [122]:
robocop = re.compile(r'robocop', re.I)

In [123]:
robocop.search('Robocop is part man, part machine, all cop.').group()

'Robocop'

In [124]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'

In [125]:
robocop.search('Al, why does your programming book talk about robocop so much?').group()

'robocop'

## Substituting Strings with the sub() Method

Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects is passed two arguments. The first argument is a string to replace any matches. The second is the string to search, in which the substitutions are made. The sub() method returns a new string with the substitutions applied.

In [126]:
namesRegex = re.compile(r'Agent \w+')

In [127]:
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”

For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string will be replaced by whatever text was matched by group 1—that is, the (\w) group of the regular expression.

In [128]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')

In [129]:
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
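
Several group references can be combined in one replacement string. As an illustrative sketch (the reformatting task is an assumption, not part of the original example), this rewrites dash-separated phone numbers with the area code in parentheses:

```python
import re

phoneRegex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# \1, \2, and \3 insert the text matched by groups 1, 2, and 3.
reformatted = phoneRegex.sub(r'(\1) \2-\3', 'Cell: 415-555-9999 Work: 212-555-0000')
print(reformatted)  # Cell: (415) 555-9999 Work: (212) 555-0000
```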

## Managing Complex Regexes

Regular expressions are fine if the text pattern you need to match is simple. But matching complicated text patterns might require long, convoluted regular expressions. You can mitigate this by telling the re.compile() function to ignore whitespace and comments inside the regular expression string. This “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

Now instead of a hard-to-read regular expression like this:

In [139]:
phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments like this:

In [140]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
    )''', re.VERBOSE)
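
One consequence of verbose mode is that a space you actually want to match must be escaped (or written as \s or inside square brackets), because unescaped whitespace in the pattern is ignored. The sketch below shows the difference; the names strictRegex and looseRegex are hypothetical.

```python
import re

strictRegex = re.compile(r'''
    Agent\ (\w+)    # the escaped space is kept as a literal space
    ''', re.VERBOSE)

looseRegex = re.compile(r'''
    Agent (\w+)     # this unescaped space is ignored by verbose mode
    ''', re.VERBOSE)

print(strictRegex.search('Agent Alice').group(1))  # Alice
print(looseRegex.search('Agent Alice'))            # None
print(looseRegex.search('AgentAlice').group(1))    # Alice
```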

## Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

What if you want to use re.VERBOSE to write comments in your regular expression but also want to use re.IGNORECASE to ignore capitalization? Unfortunately, the re.compile() function takes only a single value as its second argument. You can get around this limitation by combining the re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character (|), which in this context is known as the bitwise or operator.

So if you want a regular expression that’s case-insensitive and lets the dot character match newlines, you would form your re.compile() call like this:

In [141]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

In [143]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
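
To see that both combined options take effect, the following sketch compares a regex compiled with two flags against one compiled with only re.IGNORECASE. The pattern and sample text are made up for illustration.

```python
import re

text = 'FIRST line\nLAST line'

bothFlags = re.compile(r'first.*last', re.IGNORECASE | re.DOTALL)
onlyIgnoreCase = re.compile(r'first.*last', re.IGNORECASE)

# With both flags, case is ignored and the dot can cross the newline.
print(bothFlags.search(text).group() == 'FIRST line\nLAST')  # True
# Without re.DOTALL, .* cannot get past the newline, so there is no match.
print(onlyIgnoreCase.search(text))                           # None
```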

## Project: Phone Number and Email Address Extractor

In [71]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[6] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])


# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
info1@nostarch.com
media1@nostarch.com
academic1@nostarch.com
info@nostarch.com
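
For trying the matching logic without the clipboard or the third-party pyperclip module, the same regexes can be wrapped in a function that takes the text as an argument. This is a sketch under that assumption: the function name extractContacts and the sample string are hypothetical, and the extension pattern uses an escaped dot (ext\.).

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

def extractContacts(text):
    """Return phone numbers, then email addresses, found in text."""
    matches = []
    for groups in phoneRegex.findall(text):
        # groups[1], [3], [5] are the area code and the two digit blocks.
        phoneNum = '-'.join([groups[1], groups[3], groups[5]])
        if groups[6] != '':
            # groups[8] holds just the extension digits.
            phoneNum += ' x' + groups[8]
        matches.append(phoneNum)
    for groups in emailRegex.findall(text):
        # groups[0] is the whole email address.
        matches.append(groups[0])
    return matches

sample = 'Call 800-420-7240 or 415.863.9900 x12. Write to info@nostarch.com.'
print(extractContacts(sample))
# ['800-420-7240', '415-863-9900 x12', 'info@nostarch.com']
```

Keeping the extraction in a function that returns a list makes it easy to check against known strings before wiring it back to pyperclip.paste() and pyperclip.copy().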
