# Regular Expressions

A regular expression is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for.

In [23]:
# Import re to use Regular Expressions in Python
import re

In [24]:
# The r prefix makes this a raw string, so Python does not interpret backslash escapes such as \t
print(r"\tTab")

\tTab
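
To see what the r prefix changes, the sketch below compares the same literal with and without it. The variable names `normal` and `raw` are made up for this illustration.

```python
# Without the r prefix, Python turns \t into a single tab character.
normal = "\tTab"
# With the r prefix, the backslash and the t stay as two separate characters.
raw = r"\tTab"

print(len(normal))      # 4
print(len(raw))         # 5
print(raw == "\\tTab")  # True: the raw form equals an escaped backslash
```

Raw strings are the usual choice for regular expressions because patterns such as `\d` or `\b` rely on the backslash reaching the `re` module untouched.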


In [25]:
# the string we'll run our expressions on
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ

1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

cat
bat
Fat
mat

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

In [26]:
# Let's search for the pattern 'abc'
pattern = re.compile(r"abc")

# More on finditer and other regular expression methods later
matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)
    print(match.span())

<re.Match object; span=(1, 4), match='abc'>
(1, 4)


Notice how the search is case sensitive and looks for the pattern in the exact order.

In [27]:
# The span shows us the start and end indices of the match found
print(text_to_search[1:4])

abc
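
Besides `span()`, a match object exposes the matched text and the individual indices. The short sketch below uses a made-up input string (`"xxabcxx"`) rather than `text_to_search`, so it can be run on its own.

```python
import re

match = re.search(r"abc", "xxabcxx")

print(match.group())  # the matched text: abc
print(match.start())  # 2
print(match.end())    # 5
print(match.span())   # (2, 5)
```

As with slicing, the end index is exclusive: `"xxabcxx"[2:5]` gives back `abc`.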


## MetaCharacters

MetaCharacters are used to form Regular Expressions that can match any text pattern we want.

### Searching for MetaCharacters

In [28]:
# Let's search for a dot
pattern = re.compile(".")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
...

When we searched for a dot, every character came back as a match. That is because `.` is a metacharacter that matches any character except a newline. To match a literal dot we put a backslash (`\`) in front of it, which is called **escaping the character**.

All of the following are MetaCharacters:
. ^ $ * + ? { } [ ] \ | ( )

MetaCharacters have special uses that we'll get to later.
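
When the text to match comes from somewhere else and may contain metacharacters, escaping by hand is easy to get wrong. A minimal sketch using `re.escape` from the standard library follows; the sample strings are invented for the example.

```python
import re

literal = "123.555.1234"

# re.escape returns a copy of the string with regex metacharacters escaped.
pattern = re.compile(re.escape(literal))

print(bool(pattern.search("call 123.555.1234 now")))  # True
print(bool(pattern.search("call 123x555y1234 now")))  # False

# For comparison, the unescaped string treats each dot as "any character".
print(bool(re.search(literal, "call 123x555y1234 now")))  # True
```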

In [29]:
# Escaping the .
pattern = re.compile(r"\.")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(114, 115), match='.'>
<re.Match object; span=(150, 151), match='.'>
<re.Match object; span=(172, 173), match='.'>
<re.Match object; span=(176, 177), match='.'>
<re.Match object; span=(241, 242), match='.'>
<re.Match object; span=(272, 273), match='.'>
<re.Match object; span=(285, 286), match='.'>


### Using MetaCharacters

[MetaCharacters.md](MetaCharacters.md) contains a list of all the metacharacters and what they represent.

`\d`, for example, represents a digit, so if we pass it in as a pattern we get all of the digits as matches.

In [30]:
pattern = re.compile(r"\d")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(56, 57), match='1'>
<re.Match object; span=(57, 58), match='2'>
<re.Match object; span=(58, 59), match='3'>
<re.Match object; span=(59, 60), match='4'>
<re.Match object; span=(60, 61), match='5'>
<re.Match object; span=(61, 62), match='6'>
<re.Match object; span=(62, 63), match='7'>
<re.Match object; span=(63, 64), match='8'>
<re.Match object; span=(64, 65), match='9'>
<re.Match object; span=(65, 66), match='0'>
<re.Match object; span=(156, 157), match='3'>
<re.Match object; span=(157, 158), match='2'>
<re.Match object; span=(158, 159), match='1'>
<re.Match object; span=(160, 161), match='5'>
<re.Match object; span=(161, 162), match='5'>
<re.Match object; span=(162, 163), match='5'>
<re.Match object; span=(164, 165), match='4'>
<re.Match object; span=(165, 166), match='3'>
<re.Match object; span=(166, 167), match='2'>
<re.Match object; span=(167, 168), match='1'>
<re.Match object; span=(169, 170), match='1'>
<re.Match object; span=(170, 171), match='2'>
...

Note that `\D` means not a digit. The other shorthand classes follow the same rule: the uppercase form (`\W`, `\S`) matches whatever the lowercase form (`\w`, `\s`) does not.

In [31]:
pattern = re.compile(r"\D")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
...
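
The lowercase and uppercase shorthand classes can be compared side by side on a short string. This is a small sketch on a made-up sample (`"Mr. T 42"`), using `findall` only to keep the output compact.

```python
import re

sample = "Mr. T 42"

print(re.findall(r"\d", sample))  # ['4', '2']
print(re.findall(r"\D", sample))  # ['M', 'r', '.', ' ', 'T', ' ']
print(re.findall(r"\w", sample))  # ['M', 'r', 'T', '4', '2']
print(re.findall(r"\W", sample))  # ['.', ' ', ' ']
print(re.findall(r"\s", sample))  # [' ', ' ']
print(re.findall(r"\S", sample))  # ['M', 'r', '.', 'T', '4', '2']
```

Each pair splits the characters of the string into two groups with nothing shared between them.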

In [32]:
# Any three consecutive word characters (a-z, A-Z, 0-9, _)
pattern = re.compile(r"\w\w\w")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>
<re.Match object; span=(4, 7), match='def'>
<re.Match object; span=(7, 10), match='ghi'>
<re.Match object; span=(10, 13), match='jkl'>
<re.Match object; span=(13, 16), match='mno'>
<re.Match object; span=(16, 19), match='pqu'>
<re.Match object; span=(19, 22), match='rtu'>
<re.Match object; span=(22, 25), match='vwx'>
<re.Match object; span=(28, 31), match='ABC'>
<re.Match object; span=(31, 34), match='DEF'>
<re.Match object; span=(34, 37), match='GHI'>
<re.Match object; span=(37, 40), match='JKL'>
<re.Match object; span=(40, 43), match='MNO'>
<re.Match object; span=(43, 46), match='PQR'>
<re.Match object; span=(46, 49), match='STU'>
<re.Match object; span=(49, 52), match='VWX'>
<re.Match object; span=(56, 59), match='123'>
<re.Match object; span=(59, 62), match='456'>
<re.Match object; span=(62, 65), match='789'>
<re.Match object; span=(71, 74), match='HaH'>
<re.Match object; span=(77, 80), match='Met'>
...

In [33]:
# Any two characters (. matches anything except a newline)
pattern = re.compile(r"..")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 3), match='ab'>
<re.Match object; span=(3, 5), match='cd'>
<re.Match object; span=(5, 7), match='ef'>
<re.Match object; span=(7, 9), match='gh'>
<re.Match object; span=(9, 11), match='ij'>
<re.Match object; span=(11, 13), match='kl'>
<re.Match object; span=(13, 15), match='mn'>
<re.Match object; span=(15, 17), match='op'>
<re.Match object; span=(17, 19), match='qu'>
<re.Match object; span=(19, 21), match='rt'>
<re.Match object; span=(21, 23), match='uv'>
<re.Match object; span=(23, 25), match='wx'>
<re.Match object; span=(25, 27), match='yz'>
<re.Match object; span=(28, 30), match='AB'>
<re.Match object; span=(30, 32), match='CD'>
<re.Match object; span=(32, 34), match='EF'>
<re.Match object; span=(34, 36), match='GH'>
<re.Match object; span=(36, 38), match='IJ'>
<re.Match object; span=(38, 40), match='KL'>
<re.Match object; span=(40, 42), match='MN'>
<re.Match object; span=(42, 44), match='OP'>
<re.Match object; span=(44, 46), match='QR'>
...

### Anchors

Anchors don't match any characters; instead they match positions before or after characters.

For example, of the three Ha in our text (`Ha HaHa`), we can search for only those that have a word boundary before them.

In [34]:
pattern = re.compile(r"\bHa")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(68, 70), match='Ha'>
<re.Match object; span=(71, 73), match='Ha'>


As we can see, the two matches we got are the Ha that do not have a word character before them (*Ha* *Ha*Ha).
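
The opposite anchor, `\B`, matches only where there is no word boundary. A short sketch on the same three Ha, with the string repeated here so the snippet runs on its own:

```python
import re

text = "Ha HaHa"

# \b requires a word boundary before "Ha".
print([m.span() for m in re.finditer(r"\bHa", text)])  # [(0, 2), (3, 5)]

# \B requires that there is NOT a word boundary before "Ha".
print([m.span() for m in re.finditer(r"\BHa", text)])  # [(5, 7)]
```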

- `^` indicates the start of the string
- `$` indicates the end of the string

In [35]:
sentence = 'Start a sentence and then bring it to an end\nStartend'

In [36]:
pattern = re.compile(r"^Start")

matches = pattern.finditer(sentence)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>


In [37]:
# Even though Start also appears on the second line, ^ only matches at the very start of the string
pattern = re.compile(r"^Start")

matches = pattern.finditer(sentence)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>


In [38]:
pattern = re.compile(r"end$")

matches = pattern.finditer(sentence)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(50, 53), match='end'>


In [39]:
# The string does not end with s, so there is no match and nothing is printed
pattern = re.compile(r"s$")

matches = pattern.finditer(sentence)

# Print out the matches
for match in matches:
    print(match)
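
By default `^` and `$` look only at the ends of the whole string. The sketch below shows how the standard `re.MULTILINE` flag changes that, using the same `sentence` value (redefined here so the snippet is self-contained).

```python
import re

sentence = 'Start a sentence and then bring it to an end\nStartend'

# Default: ^ matches only at the very beginning of the string.
print([m.span() for m in re.finditer(r"^Start", sentence)])  # [(0, 5)]

# With re.MULTILINE, ^ also matches right after each newline.
print([m.span() for m in re.finditer(r"^Start", sentence, re.MULTILINE)])  # [(0, 5), (45, 50)]

# Likewise, $ then also matches just before each newline.
print([m.span() for m in re.finditer(r"end$", sentence, re.MULTILINE)])  # [(41, 44), (50, 53)]
```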

## Character Set

If we want to find all of the phone numbers that use a . or - as the separator, we can use a character set.

In [40]:
pattern = re.compile(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(169, 181), match='123.555.1234'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>


To create a character set we put the characters we want inside []. `[.-]` means that this position can be either a . or a -. Note how we didn't need to escape either of those characters inside the character set: the . is literal there, and the - is literal because it comes last.
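
The position of the hyphen inside the brackets matters. The following sketch, on an invented sample string, contrasts a literal hyphen with a hyphen that defines a range.

```python
import re

# Inside [], the dot is literal, and a hyphen placed last is literal too.
print(re.findall(r"[.-]", "a.b-c*d"))    # ['.', '-']

# A hyphen between two characters defines a range instead.
print(re.findall(r"[a-c]", "a.b-c*d"))   # ['a', 'b', 'c']

# Escaping the hyphen keeps it literal wherever it appears.
print(re.findall(r"[a\-c]", "a.b-c*d"))  # ['a', '-', 'c']
```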

We can use - inside a character set to specify a range of characters.

In [41]:
pattern = re.compile(r"[a-zA-Z]")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
...

Using ^ at the start of a character set negates it: the set matches any character that is not listed.

In [42]:
pattern = re.compile(r"[^b]at")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(222, 225), match='cat'>
<re.Match object; span=(230, 233), match='Fat'>
<re.Match object; span=(234, 237), match='mat'>


## Quantifier

Quantifiers indicate how many times we want a character or pattern to repeat.

Using quantifiers, we can shorten the phone number pattern from before.

In [87]:
pattern = re.compile(r"\d{3}[.-]\d{3}[.-]\d{4}")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(156, 168), match='321-555-4321'>
<re.Match object; span=(169, 181), match='123.555.1234'>
<re.Match object; span=(195, 207), match='800-555-1234'>
<re.Match object; span=(208, 220), match='900-555-1234'>
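
`{3}` and `{4}` are only one form of quantifier. The sketch below runs the common ones (`*`, `+`, `?`, `{n}`, `{m,n}`) against a made-up sample so their differences are visible.

```python
import re

sample = "a ab abb abbb"

print(re.findall(r"ab*", sample))      # zero or more b: ['a', 'ab', 'abb', 'abbb']
print(re.findall(r"ab+", sample))      # one or more b:  ['ab', 'abb', 'abbb']
print(re.findall(r"ab?", sample))      # zero or one b:  ['a', 'ab', 'ab', 'ab']
print(re.findall(r"ab{2}", sample))    # exactly two b:  ['abb', 'abb']
print(re.findall(r"ab{2,3}", sample))  # two to three b: ['abb', 'abbb']
```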


We can also match the title followed by the name for each person in this list:

- Mr. Schafer
- Mr Smith
- Ms Davis
- Mrs. Robinson
- Mr. T

In [48]:
pattern = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")

matches = pattern.finditer(text_to_search)

# Print out the matches
for match in matches:
    print(match)

<re.Match object; span=(239, 250), match='Mr. Schafer'>
<re.Match object; span=(251, 259), match='Mr Smith'>
<re.Match object; span=(260, 268), match='Ms Davis'>
<re.Match object; span=(269, 282), match='Mrs. Robinson'>
<re.Match object; span=(283, 288), match='Mr. T'>


## Groups

We use groups to specify alternatives. `(r|s|rs)` means that after M there can be an r, an s, or rs.
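
A quick way to check which alternatives a group accepts is to test a few candidates with `fullmatch`. This is a sketch only; the candidate list (including the non-matching `Mx`) is invented for the example.

```python
import re

# (r|s|rs) tries each alternative in turn after the M.
title = re.compile(r"M(r|s|rs)\.?")

for candidate in ["Mr", "Ms.", "Mrs.", "Mx"]:
    print(candidate, bool(title.fullmatch(candidate)))

# The group also records which alternative was used.
print(title.fullmatch("Mrs.").group(1))  # rs
```

The first three candidates match and `Mx` does not, because x is not one of the listed alternatives.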

## Capturing information from our patterns

Let's take a look at an example where we capture information from our matches.

In [49]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

In [60]:
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

matches = pattern.finditer(urls)

# Print out the full text of each match
for match in matches:
    print(match.group(0))

https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov


`https?://(www\.)?(\w+)(\.\w+)`

In this regular expression there are several groups:

- `group()` or `group(0)`: the entire match
- `group(1)`: `(www\.)`
- `group(2)`: `(\w+)`
- `group(3)`: `(\.\w+)`

In [68]:
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

matches = pattern.finditer(urls)

# Print out the group we want
for match in matches:
    print(match.group(2)+match.group(3))

google.com
coreyms.com
youtube.com
nasa.gov
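
Groups can also be given names, which avoids counting parentheses. The sketch below is a variation on the pattern above; the group names `domain` and `tld` are assumptions chosen for this example, not part of the original pattern.

```python
import re

pattern = re.compile(r"https?://(www\.)?(?P<domain>\w+)(?P<tld>\.\w+)")

match = pattern.search("https://www.nasa.gov")

print(match.groups())         # ('www.', 'nasa', '.gov')
print(match.group("domain"))  # nasa
print(match.group("tld"))     # .gov

# An optional group that did not take part in the match is None.
print(pattern.search("http://coreyms.com").group(1))  # None
```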


### Back reference

A back reference is a shorthand for referring to a group, which we can use to perform a substitution.

The back reference for group `n` is `\n` or, equivalently, `\g<n>`.

In [82]:
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Replaces the matches with group 2, \2, followed by group 3, \3.
subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
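
The `\g<n>` form does the same job as `\n` and is useful when a digit follows the reference. A short sketch with the same URL pattern, applied to single URLs so the results are easy to read:

```python
import re

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# \g<2> and \g<3> refer to the same groups as \2 and \3.
print(pattern.sub(r"\g<2>\g<3>", "https://www.google.com"))  # google.com

# \g<n> is unambiguous when a digit follows the reference:
# r"\20" would mean group 20, while r"\g<2>0" means group 2 then "0".
print(pattern.sub(r"\g<2>0", "https://youtube.com"))  # youtube0
```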



## Additional methods

### Findall

`findall` returns the matches as a list of strings. If the pattern has one group, it returns only the text of that group; if it has several groups, it returns a tuple of the groups for each match.

In [83]:
pattern = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")

matches = pattern.findall(text_to_search)

# Print out the matches
for match in matches:
    print(match)

r
r
s
rs
r


In [84]:
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

matches = pattern.findall(urls)

# Print out the matches
for match in matches:
    print(match)

('www.', 'google', '.com')
('', 'coreyms', '.com')
('', 'youtube', '.com')
('www.', 'nasa', '.gov')


In [85]:
# No groups
pattern = re.compile(r"\d{3}[.-]\d{3}[.-]\d{4}")

matches = pattern.findall(text_to_search)

# Print out the matches
for match in matches:
    print(match)

321-555-4321
123.555.1234
800-555-1234
900-555-1234
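
If we want `findall` to return the whole match even though the pattern needs parentheses, one option is a non-capturing group, written `(?:...)`. A minimal sketch, with the names copied into a standalone string called `names` (a name made up for this example):

```python
import re

names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

# (?:...) groups without capturing, so findall returns whole matches.
pattern = re.compile(r"M(?:r|s|rs)\.?\s[A-Z]\w*")

print(pattern.findall(names))
# ['Mr. Schafer', 'Mr Smith', 'Ms Davis', 'Mrs. Robinson', 'Mr. T']
```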


### Match

`match` checks only the start of the string for a match and returns it if one exists; otherwise it returns None.

In [90]:
pattern = re.compile(r"Start")

match = pattern.match(sentence)

# Print out the matches
print(match)

<re.Match object; span=(0, 5), match='Start'>
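
The None case is worth seeing once. The sketch below also shows `fullmatch`, a related standard method that requires the whole string to match; the inputs are chosen just for illustration.

```python
import re

sentence = 'Start a sentence and then bring it to an end\nStartend'

pattern = re.compile(r"sentence")

# "sentence" is in the string but not at position 0, so match returns None.
print(pattern.match(sentence))  # None

# fullmatch succeeds only if the whole string matches the pattern.
print(re.fullmatch(r"\d{3}", "123"))   # a match object
print(re.fullmatch(r"\d{3}", "1234"))  # None
```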


### Search

`search` works like `match` but scans the entire string and returns the first match it finds.

In [92]:
pattern = re.compile(r"a")

match = pattern.search(sentence)

# Print out the matches
print(match)

<re.Match object; span=(2, 3), match='a'>
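
Because `search` returns None when nothing is found, it is common to test the result before using it. A small sketch, again with `sentence` redefined so it runs on its own:

```python
import re

sentence = 'Start a sentence and then bring it to an end\nStartend'

match = re.search(r"bring", sentence)

# search returns None when nothing matches, so test before using it.
if match:
    print(match.span())  # (26, 31)
else:
    print("no match")
```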


## Flags

Flags change how a pattern is matched, which can make our patterns simpler.

In [93]:
# Ignore case
pattern = re.compile(r"start", re.IGNORECASE)

match = pattern.match(sentence)

# Print out the matches
print(match)

<re.Match object; span=(0, 5), match='Start'>
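
Two other standard flags are often useful: `re.VERBOSE` lets a pattern be spread over several lines with comments, and flags can be combined with `|`. The sketch below is illustrative only; the names `phone` and `starts` and the sample strings are made up.

```python
import re

phone = re.compile(r"""
    \d{3}   # first three digits
    [.-]    # separator: dot or hyphen
    \d{3}   # next three digits
    [.-]    # separator: dot or hyphen
    \d{4}   # last four digits
""", re.VERBOSE)

print(phone.findall("321-555-4321 and 123*555*1234"))  # ['321-555-4321']

# Flags are combined with |, here ignoring case and letting ^ match per line.
starts = re.compile(r"^start", re.IGNORECASE | re.MULTILINE)
print(len(starts.findall("Start\nSTART\nrestart")))  # 2
```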
