In [1]:
import re

# Creating and Matching Regex Objects

The code in this file is from the book "Automate the Boring Stuff with Python" by Al Sweigart. The steps to follow are:
1. Import the regex module with `import re`.
2. Create a Regex object with the `re.compile()` function using a raw string.
3. Pass the string you want to search into the Regex object's `search()` method. This returns a Match object, or `None` if the pattern is not found.
4. Call the Match object's `group()` method to return a string of the actual matched text.

In [3]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
matching_object = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: {}'.format(matching_object.group()))

Phone number found: 415-555-4242
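
The steps above assume the pattern is found. Because `search()` returns `None` when there is no match, calling `group()` directly on the result can raise an `AttributeError`. A minimal sketch of a guarded lookup follows; the helper name `find_phone_number` is hypothetical and not from the book.

```python
import re

# Same pattern as above: three digits, three digits, four digits.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    match = phone_regex.search(text)
    if match is None:
        # search() found nothing, so there is no Match object to call group() on.
        return None
    return match.group()

print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('No number here.'))
```

Checking for `None` before calling `group()` keeps the later examples safe to adapt to input that may not contain a match.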


## Grouping with Parentheses

In [10]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
matching_object = phoneNumRegex.search('My number is 415-555-4242.')
print(matching_object.group(1))
print(matching_object.group(2))
print(matching_object.group(0))
print(matching_object.groups())
areaCode, mainNumber = matching_object.groups()
print('Area Code: {}'.format(areaCode))
print('Main Number: {}'.format(mainNumber))

415
555-4242
415-555-4242
('415', '555-4242')
Area Code: 415
Main Number: 555-4242


## Matching Multiple Groups with the Pipe

In [14]:
heroRegex = re.compile(r'Batman|Tina Fey')
matching_object1 = heroRegex.search("Batman and Tina Fey.")
print(matching_object1.group())
matching_object2 = heroRegex.search("Tina Fey and Batman")
print(matching_object2.group())

Batman
Tina Fey


In [23]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
matching_object = batRegex.search('The Adventures of Batman')
print(matching_object.group())
print(matching_object.group(1))

Batman
man


## Optional Matching with the Question Mark

In [24]:
batRegex = re.compile(r'Bat(wo)?man')
matching_object1 = batRegex.search('The Adventures of Batman')
print(matching_object1.group())

matching_object2 = batRegex.search('The Adventures of Batwoman')
print(matching_object2.group())

Batman
Batwoman


In [26]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
matching_object1 = phoneRegex.search('My number is 415-555-4242')
print(matching_object1.group())

matching_object2 = phoneRegex.search('My number is 555-4242')
print(matching_object2.group())

415-555-4242
555-4242


## Matching Zero or More with the Star

In [27]:
batRegex = re.compile(r'Bat(wo)*man')
matching_object1 = batRegex.search('The Adventures of Batman')
matching_object2 = batRegex.search('The Adventures of Batwoman')
matching_object3 = batRegex.search('The Adventures of Batwowowoman')
print(matching_object1.group())
print(matching_object2.group())
print(matching_object3.group())

Batman
Batwoman
Batwowowoman


## Matching One or More with the Plus

In [28]:
batRegex = re.compile(r'Bat(wo)+man')
matching_object1 = batRegex.search('The Adventures of Batman')
matching_object2 = batRegex.search('The Adventures of Batwoman')
matching_object3 = batRegex.search('The Adventures of Batwowowoman')
print(matching_object1 is None)
print(matching_object2.group())
print(matching_object3.group())

True
Batwoman
Batwowowoman


## Matching Specific Repetitions with Curly Brackets

In [29]:
haRegex = re.compile(r'(Ha){3}')
matching_object1 = haRegex.search("HaHaHa")
matching_object2 = haRegex.search("Ha")
print(matching_object1.group())
print(matching_object2 is None)

HaHaHa
True


## Greedy and Nongreedy Matching

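The question mark has two meanings in regular expressions: placed after a group, it makes that group optional; placed after a quantifier such as `{3,5}`, `*`, or `+`, it makes the match nongreedy, so the shortest possible string is matched.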

In [30]:
greedyHaRegex = re.compile(r'(Ha){3,5}')  # greedy: finds the longest match
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')  # nongreedy: finds the shortest match
matching_object1 = greedyHaRegex.search("HaHaHaHaHa")
matching_object2 = nongreedyHaRegex.search("HaHaHaHaHa")
print(matching_object1.group())
print(matching_object2.group())

HaHaHaHaHa
HaHaHa


## The `findall()` Method

When called on a regex with no groups, `findall()` returns a list of strings, one per match. When called on a regex with two or more groups, `findall()` returns a list of tuples of strings, one tuple per match and one string per group.

In [37]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') #no groups
matching_object = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(matching_object.group())
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

415-555-9999
['415-555-9999', '212-555-0000']


In [38]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') #has groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
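
When the pattern has groups, the tuples returned by `findall()` hold only the group text, so the hyphens between the groups are not included. A small sketch of one way to rebuild the full numbers from those tuples (the variable names here are illustrative, not from the book):

```python
import re

grouped_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Each element of parts is a tuple such as ('415', '555', '9999').
parts = grouped_regex.findall(text)

# Join the pieces of each tuple back together with hyphens.
full_numbers = ['-'.join(piece) for piece in parts]
print(full_numbers)
```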

In [2]:
%%html
<style>
table {float:left}
</style>

# Character Classes

| Regex Symbol | Definition |
| :--- | :--- |
| \d | any numeric digit from 0 to 9 |
| \D | any character that is not a numeric digit from 0 to 9 |
| \w | any letter, numeric digit, or the underscore character (match words) |
| \W | any character that is not a letter, numeric digit, or the underscore |
| \s | any space, tab, or newline character (match spaces) |
| \S | any character that is not a space, tab, or newline |
| ? | matches zero or one of the preceding group |
| * | matches zero or more of the preceding group |
| + | matches one or more of the preceding group |
| {n} | matches exactly $n$ of the preceding group |
| {n,} | matches $n$ or more of the preceding group |
| {,m} | matches 0 to $m$ of the preceding group |
| {n,m} | matches at least $n$ and at most $m$ of the preceding group |
| {n,m}? or *? or +? | performs a nongreedy match of the preceding group |
| ^spam | means the string must begin with spam |
| spam$ | means the string must end with spam |
| . | matches any character, except newline characters |
| \[abc\] | matches any one character listed between the brackets |
| \[^abc\] | matches any one character that is not listed between the brackets |
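
The shorthand classes and quantifiers in the table can be combined in a single pattern. The sketch below is only an illustration; the sample text and variable names are made up for this example.

```python
import re

text = '12 drummers, 11 pipers, 10 lords, 9 ladies'

# \d+ is one or more digits, \s is one whitespace character,
# and \w+ is one or more letters, digits, or underscores.
count_and_noun = re.compile(r'\d+\s\w+')
print(count_and_noun.findall(text))

# \d{2,} is a run of at least two digits, so the lone 9 is skipped.
two_or_more_digits = re.compile(r'\d{2,}')
print(two_or_more_digits.findall(text))

# \S+ is a run of non-whitespace characters, so commas stay attached.
non_space_runs = re.compile(r'\S+')
print(non_space_runs.findall(text))
```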

## Making Custom Character Classes

In [3]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('But Robocop last year was a shock')

['u', 'o', 'o', 'o', 'a', 'e', 'a', 'a', 'a', 'o']

In [4]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('But Robocop last year was a shock')

['B',
 't',
 ' ',
 'R',
 'b',
 'c',
 'p',
 ' ',
 'l',
 's',
 't',
 ' ',
 'y',
 'r',
 ' ',
 'w',
 's',
 ' ',
 ' ',
 's',
 'h',
 'c',
 'k']

## The Caret and Dollar Sign

In [5]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')

<re.Match object; span=(0, 5), match='Hello'>

In [6]:
beginsWithHello.search('He said hello.') is None

True

In [11]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [12]:
endsWithNumber.search('Your number is forty two') is None

True

In [15]:
wholeStringIsNum = re.compile(r'^\d+$')
wholeStringIsNum.search('1234567890')

<re.Match object; span=(0, 10), match='1234567890'>

In [16]:
print(wholeStringIsNum.search('12345xyz67890') is None)
print(wholeStringIsNum.search('12  34567890') is None)

True
True


## The Wildcard Character

In [32]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

## Matching Everything with Dot-Star

In [38]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
name_object = nameRegex.search('First Name: Cardy Last Name: Moten')
print(name_object.groups())
print(name_object.group(1))
print(name_object.group(2))

('Cardy', 'Moten')
Cardy
Moten


In [39]:
nongreedyRegex = re.compile(r'<.*?>')
matching_object = nongreedyRegex.search('<To serve man> for dinner.>')
matching_object.group()

'<To serve man>'

In [40]:
greedyRegex = re.compile(r'<.*>')
matching_object = greedyRegex.search('<To serve man> for dinner.>')
matching_object.group()

'<To serve man> for dinner.>'

## Matching Newlines With the Dot Character

To make the dot character match every character, including a newline, pass `re.DOTALL` as the second argument to `re.compile()`.

In [41]:
noNewlineRegex = re.compile(r'.*')
noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [42]:
newlineRegex = re.compile(r'.*', re.DOTALL)
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

## Case-Insensitive Matching
Use the `re.IGNORECASE` or `re.I` argument to `re.compile()` to make a match regardless of case.

In [43]:
robocop = re.compile(r'robocop', re.I)
robocop.search('RoboCop is part man, part machine, all cop.').group()

'RoboCop'

In [44]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'

In [45]:
robocop.search("What's up with all this talk about robocop?").group()

'robocop'

## Substituting Strings with the `sub()` Method

In [46]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

In [51]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
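
In the second call above, `\1` in the replacement string stands for the text captured by group 1 of the pattern, which is why each agent's first letter survives. The same idea extends to `\2`, `\3`, and so on. A minimal sketch, using a hypothetical date pattern that is not part of the original notebook:

```python
import re

# Three groups: year, month, day.
date_regex = re.compile(r'(\d\d\d\d)-(\d\d)-(\d\d)')

# \3, \2, and \1 insert the text of groups 3, 2, and 1 in that order.
reordered = date_regex.sub(r'\3/\2/\1', 'Released on 2015-04-14.')
print(reordered)
```

The replacement is written as a raw string so that the backslashes reach `sub()` unchanged.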