# Regular Expressions

### Using Regular Expressions to Find Patterns of Text

In [1]:
#create Regex object
import re
#pass a raw string to re.compile()
phoneNumRegex= re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
#matching Regex Object
mo=phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: '+mo.group())

Phone number found: 415-555-4242


In [2]:
#regex matching using grouping
phoneNumRegex=re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
#prints match from first group (first set of parentheses)
print(mo.group(1))
#prints match from second group
print(mo.group(2))
#prints match from whole regex
print(mo.group(0))
#also prints match from whole regex
print(mo.group())

415
555-4242
415-555-4242
415-555-4242


In [3]:
#prints all the groups
print(mo.groups())
#assign names to each group
areaCode, mainNumber=mo.groups()
print(areaCode)
print(mainNumber)

('415', '555-4242')
415
555-4242


In [4]:
#to match literal parentheses in the regex, escape them with a backslash
phoneNumRegex=re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo=phoneNumRegex.search('My phone number is (415) 555-4242.')
print(mo.group(1))
print(mo.group(2))

(415)
555-4242


### Matching Groups with the Pipe

The "|" character is called a pipe. It matches either the expression before it or the expression after it. If both appear in the searched string, the first occurrence is returned.

In [5]:
heroRegex=re.compile(r'Batman|Tina Fey')
mo1=heroRegex.search('Batman and Tina Fey.')
#matches the first occurrence of either Batman or Tina Fey, not both
#this will match Batman
print(mo1.group())
mo2=heroRegex.search('Tina Fey and Batman.')
#this will match Tina Fey
print(mo2.group())

Batman
Tina Fey


In [6]:
#matches Bat plus any of the following strings
batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo=batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Batmobile
mobile


### Optional Matching with Question Mark

In [7]:
# use ? to add optional matching
batRegex=re.compile(r'Bat(wo)?man')
mo1=batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2=batRegex.search('The Adventures of Batwoman')
print(mo2.group())

Batman
Batwoman


Note: to match a literal question mark, escape it as \?
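A minimal sketch of that escape. The pattern name and sample strings are made up for illustration.

```python
import re

# \? matches a literal question mark; an unescaped ? would make the 'y' optional instead
questionRegex = re.compile(r'Why\?')
literal_match = questionRegex.search('Why? Because.')
print(literal_match.group())   # Why?

# no literal question mark after 'Why', so nothing is found
no_match = questionRegex.search('Why not')
print(no_match is None)        # True
```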

### Matching One or More with the Plus

In [8]:
batRegex=re.compile(r'Bat(wo)+man')
mo1=batRegex.search('The Adventures of Batwoman')
print(mo1.group())
mo2=batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())
#no match because there is no occurrence of wo
mo3=batRegex.search('The Adventures of Batman')
print(mo3 is None)

Batwoman
Batwowowowoman
True


Note: to match a literal plus sign, escape it as \+
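A short sketch mixing the two meanings of the plus sign: as a quantifier after \d and as an escaped literal. The pattern and input string are assumptions chosen for illustration.

```python
import re

# \d+ means one or more digits; \+ means a literal plus sign
sumRegex = re.compile(r'\d+\+\d+')
sum_match = sumRegex.search('Compute 12+30 now')
print(sum_match.group())   # 12+30
```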

### Matching Specific Repetitions with Curly Brackets

In [9]:
#searches for HaHaHa
haRegex=re.compile(r'(Ha){3}')
mo1=haRegex.search('HaHaHa')
print(mo1.group())
mo2=haRegex.search('Ha')
#no match because there is only 1 repetition of Ha
print(mo2 is None)

HaHaHa
True


In [10]:
#searches for 1 to 3 repetitions of Ha
ha2Regex=re.compile(r'(Ha){1,3}')
mo1=ha2Regex.search('Ha')
print(mo1.group())
mo2=ha2Regex.search('HaHaHa')
print(mo2.group())

Ha
HaHaHa


In [11]:
#greedy and nongreedy
#greedy: matches the most repetitions possible
greedyHaRegex=re.compile(r'(Ha){3,5}')
mo1=greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())
#nongreedy: matches the fewest repetitions possible
nongreedyHaRegex=re.compile(r'(Ha){3,5}?')
mo2=nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())

HaHaHaHaHa
HaHaHa


### The findall() Method

While search() finds the first match, findall() finds all the matches.

In [12]:
phoneNumRegex=re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
#gets the first match
mo=phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
#gets all matches
#findall() returns a list
mo2=phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(mo2)

415-555-9999
['415-555-9999', '212-555-0000']


In [13]:
#findall() with groups
phoneNumRegex=re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
mo=phoneNumRegex.findall("Cell: 415-555-9999 Work: 212-555-0000")
print(mo)

[('415', '555', '9999'), ('212', '555', '0000')]


### Character Classes

Shorthand character classes:
- \d any numeric digit 0-9
- \D any character that is not a numeric digit from 0-9
- \w any letter, numeric digit, or underscore
- \W any character that is not a letter, numeric digit, or underscore
- \s any space, tab, or newline character
- \S any character that is not a space, tab, or newline
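The shorthand classes above can be combined in one pattern. A small sketch, with a made-up input string: one or more digits, a single whitespace character, then one or more word characters.

```python
import re

# \d+ = digits, \s = one whitespace character, \w+ = letters, digits, or underscores
xmasRegex = re.compile(r'\d+\s\w+')
gifts = xmasRegex.findall('12 drummers, 11 pipers, 10 lords')
print(gifts)   # ['12 drummers', '11 pipers', '10 lords']
```

The commas are not part of any match because \w does not match punctuation.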

### Custom Character Classes

In [14]:
#will match any of the characters found inside the square brackets
vowelRegex=re.compile(r'[aeiouAEIOU]')
mo=vowelRegex.findall('Robocop eats baby food. BABY FOOD')
print(mo)
#will match anything but the character in the square braces
novowelRegex=re.compile(r'[^aeiouAEIOU]')
mo=novowelRegex.findall('Robocop eats baby food. BABY FOOD')
print(mo)

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D']


Note: character classes can also contain ranges, such as [a-z0-9]
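A quick sketch of a range-based class. The sample string is an assumption; only runs of lowercase letters and digits are picked up.

```python
import re

# [a-z0-9]+ matches runs of lowercase letters and digits only
lowerAlnumRegex = re.compile(r'[a-z0-9]+')
runs = lowerAlnumRegex.findall('ABC abc123 XYZ')
print(runs)   # ['abc123']
```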

### Caret and Dollar Sign Characters

In [15]:
#use ^ at the start of a regex to require the match at the beginning of the string
beginsWithHello=re.compile(r'^Hello')
mo=beginsWithHello.search('Hello World')
print(mo.group())
# use $ at the end of a regex to require the match at the end of the string
endsWithWorld=re.compile(r'World$')
mo=endsWithWorld.search('Hello World')
print(mo.group())

Hello
World


In [16]:
#here are more examples
(r'\d$') #ends with a number
(r'^\d+$') #whole string is a number

'^\\d+$'
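The two patterns above can be tried out as follows. This is a sketch with made-up variable names and inputs.

```python
import re

# ^\d+$ only matches when the whole string consists of digits
wholeStringIsNum = re.compile(r'^\d+$')
all_digits = wholeStringIsNum.search('1234567890')
mixed = wholeStringIsNum.search('12345xyz67890')
print(all_digits.group())   # 1234567890
print(mixed is None)        # True

# \d$ only matches a digit that is the last character of the string
endsWithNumber = re.compile(r'\d$')
last_digit = endsWithNumber.search('Your number is 42')
print(last_digit.group())   # 2
```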

### Wildcard Character

The "." (or dot) character in a regular expression is called a wildcard that will match anything but a newline.

In [17]:
#matches any single character followed by at
atRegex=re.compile(r'.at')
mo=atRegex.findall('The cat in the hat sat on the flat mat.')
print(mo)

['cat', 'hat', 'sat', 'lat', 'mat']


Note: the dot matches exactly one character, so 'lat' is matched rather than 'flat'
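If the whole word is wanted, one option (an assumption, not the only way) is to replace the dot with \w*, which matches zero or more word characters before "at":

```python
import re

# \w* allows any number of word characters before 'at', so 'flat' is matched in full
wordAtRegex = re.compile(r'\w*at')
words = wordAtRegex.findall('The cat in the hat sat on the flat mat.')
print(words)   # ['cat', 'hat', 'sat', 'flat', 'mat']
```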

### Match everything with Dot-Star

In [18]:
#.* is greedy by default
nameRegex=re.compile(r'First Name: (.*) Last Name: (.*)')
mo=nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))
print(mo.group(2))

Al
Sweigart


In [19]:
#nongreedy
nongreedyRegex=re.compile(r'<.*?>')
mo=nongreedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

#greedy
greedyRegex=re.compile(r'<.*>')
mo=greedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

<To serve man>
<To serve man> for dinner.>


### Matching Newlines with Dot Character

In [20]:
#to match \n as well, pass re.DOTALL as the second argument to re.compile()
newlineRegex=re.compile('.*',re.DOTALL)
mo=newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.')
print(mo.group())

Serve the public trust.
Protect the innocent.
Uphold the law.
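For contrast, a sketch of the same kind of search without re.DOTALL. The dot stops at the first newline, so only the first line is matched.

```python
import re

# without re.DOTALL the dot does not match '\n'
noNewlineRegex = re.compile(r'.*')
first_line = noNewlineRegex.search('Serve the public trust.\nProtect the innocent.').group()
print(first_line)   # Serve the public trust.
```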


### Regex Symbols

In [21]:
# ? matches zero or 1 of preceding group
# * matches zero or more of preceding group
# + matches one or more of the preceding group
# {n} matches exactly n of the preceding group
# {n,} matches n or more of the preceding group
# {,m} matches 0 to m of the preceding group
# {n,m} matches at least n and at most m of the preceding group
# {n,m}? or *? or +? performs nongreedy match
# ^spam means string must begin with spam
# spam$ means string must end with spam
# The . matches any character except newline
# \d, \w, and \s match a digit, word, or space character, respectively
# \D, \W, and \S match anything except a digit, word, or space character, respectively
# [abc] matches any character between the brackets
# [^abc] matches any character that isn't between the brackets
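
The open-ended forms {n,} and {,m} are not shown in the earlier cells. A minimal sketch, reusing the Ha pattern with made-up variable names:

```python
import re

# {2,} means two or more repetitions
atLeastTwo = re.compile(r'(Ha){2,}')
many = atLeastTwo.search('HaHaHaHa')
single = atLeastTwo.search('Ha')
print(many.group())     # HaHaHaHa
print(single is None)   # True

# {,2} means zero to two repetitions; greedy, so it takes two when available
upToTwo = re.compile(r'(Ha){,2}')
capped = upToTwo.search('HaHaHa')
print(capped.group())   # HaHa
```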

### Case-Insensitive Matching

In [22]:
# for case-insensitive matching, pass re.IGNORECASE or re.I as the second argument to re.compile()
robocop=re.compile(r'robocop',re.I)
mo1=robocop.search('Robocop')
mo2=robocop.search('RoBoCoP')
print(mo1.group())
print(mo2.group())

Robocop
RoBoCoP


### Substituting Strings with the sub() Method

In [23]:
namesRegex=re.compile(r'Agent \w+')
mo=namesRegex.sub('CENSORED','Agent Alice gave the secret documents to Agent Bob.')
print(mo)

CENSORED gave the secret documents to CENSORED.


In [24]:
# \1 is replaced by the text of group 1 (here, the first letter of the name)
agentNamesRegex=re.compile(r'Agent (\w)\w*')
mo=agentNamesRegex.sub(r'\1****','Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
print(mo)

A**** told C**** that E**** knew B**** was a double agent.


### Managing Complex Regexes

In [26]:
# pass re.VERBOSE as the second argument to re.compile() to ignore whitespace and comments inside the regex
phoneRegex=re.compile(r'''(
(\d{3}|\(\d{3}\))?      #area code
(\s|-|\.)?              # separator
\d{3}                   # first 3 digits
(\s|-|\.)               # separator
\d{4}                   # last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})?   # extension (the dot is escaped to match a literal period)
)''',re.VERBOSE)

#to pass multiple flags to re.compile(), combine them with the bitwise OR operator |
someRegexValue=re.compile('foo',re.IGNORECASE | re.DOTALL)
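
A sketch that puts the verbose phone regex to work and combines two flags. Adding re.IGNORECASE here is an assumption made only to show flag combination, and the sample strings are invented.

```python
import re

# same structure as the verbose regex above, with IGNORECASE added for illustration
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE | re.IGNORECASE)

cell = phoneRegex.search('Cell: 415-555-1011 EXT. 12')
office = phoneRegex.search('Office: (212) 555-0000')
print(cell.group())     # 415-555-1011 EXT. 12
print(office.group())   # (212) 555-0000
```

The uppercase EXT. only matches because of re.IGNORECASE; with re.VERBOSE alone the first result would stop at 415-555-1011.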