# 2.1 Regular Expressions

The book aptly calls out regular expressions as one of the unsung successes in computer science :-). They are super handy for a lot of text-processing tasks, and are sometimes enough on their own to solve an NLP problem. They can also help establish baselines for many NLP problems.

In Python, `re` is the regular expression module in the standard library; we need to import it before use.

In [17]:
import re

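Below, we will use the variable `s` for the string that we are searching in, and a variable ending in `_re` for the regex we are searching for. We will mostly use the function `re.search`, since it looks for a match anywhere in the string rather than only at the start. We will also compile our regexes with `re.compile`, which is convenient when a pattern is reused. Note that `re` caches recently used patterns, so the speed gain from compiling is usually small.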
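
A compiled pattern can be used in two equivalent ways: passed to the module-level function, as the cells below do, or called through its own method. The following is a minimal sketch; the names `text`, `m1` and `m2` are illustrative and not part of the notebook.

```python
import re

text = "Do you know what a woodchuck is?"
woodchuck_re = re.compile(r"woodchuck")

# Module-level function: accepts a pattern string or a compiled pattern.
m1 = re.search(woodchuck_re, text)

# Method on the compiled pattern: finds the same match.
m2 = woodchuck_re.search(text)

print(m1.span() == m2.span())
```

Both calls find the same span, so the choice between them is mostly a matter of style.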

## 2.1.1 Basic Regular Expression Patterns

In [25]:
s = '''Do you know what a woodchuck is?'''

# compile a regex for matching the exact string 'woodchuck'
# we use a raw string so that Python does not treat backslashes
# in the pattern as string escape sequences
woodchuck_re = re.compile(r'woodchuck')

# search
m = re.search(woodchuck_re, s)
print(m)

<re.Match object; span=(19, 28), match='woodchuck'>


In [19]:
s[19:28]

'woodchuck'
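
Rather than copying the offsets out of the printed match by hand, we can read them from the match object itself. A small sketch of the accessors involved, with a guard for the no-match case:

```python
import re

s = "Do you know what a woodchuck is?"
m = re.search(r"woodchuck", s)

if m is not None:                # re.search returns None when nothing matches
    matched_text = m.group()     # the text that matched
    start, end = m.span()        # same values as m.start() and m.end()
    print(matched_text, start, end)
    print(s[start:end] == matched_text)
```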

In [20]:
# regexes are case-sensitive by default.

s = '''I'm called little Buttercup'''

my_re = re.compile(r'buttercup')

# this won't match since the case does not match.
m = re.search(my_re, s)
print(m)

None


In [21]:
# use square brackets for disjunction, so that the first
# character matches in either case

my_re = re.compile(r'[bB]uttercup')

m = re.search(my_re, s)
print(m)

<re.Match object; span=(18, 27), match='Buttercup'>


In [27]:
# match any vowel, case-insensitive

s = '''Hello there, Mr. E!'''

vowel_re = re.compile(r'[aeiouAEIOU]')

m = re.search(vowel_re, s)
print(m)

<re.Match object; span=(1, 2), match='e'>


Notice that `re.search` returns only the first match. If we want to find all the matches, we can use `re.finditer`, which yields one match object per match.

In [28]:
for m in re.finditer(vowel_re, s):
    print(m)

<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(8, 9), match='e'>
<re.Match object; span=(10, 11), match='e'>
<re.Match object; span=(17, 18), match='E'>
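
When only the matched text is needed and not the positions, `findall` returns the matches as a plain list of strings. A short sketch on the same string (the name `vowels` is illustrative):

```python
import re

s = "Hello there, Mr. E!"
vowel_re = re.compile(r"[aeiouAEIOU]")

# findall collects the matched substrings, left to right
vowels = vowel_re.findall(s)
print(vowels)
```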


In [35]:
# Set case-insensitive match with flag

s = '''Hello there, Mr. E!'''

vowel_re = re.compile(r'[aeiou]', flags=re.IGNORECASE)

m = re.search(vowel_re, s)
print(m)

<re.Match object; span=(1, 2), match='e'>
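
Since `re.search` stops at the first vowel, the output above looks the same with or without the flag. Collecting all matches makes the effect of `re.IGNORECASE` visible. A sketch, with illustrative variable names:

```python
import re

s = "Hello there, Mr. E!"

# without the flag, the upper-case 'E' is skipped
sensitive = re.findall(r"[aeiou]", s)

# with the flag, 'E' is matched as well
insensitive = re.findall(r"[aeiou]", s, flags=re.IGNORECASE)

print(sensitive)
print(insensitive)
```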


In [37]:
# match any uppercase letter

s = '''hello there, Mr. E!'''

upper_re = re.compile(r'[A-Z]')
m = re.search(upper_re, s)
print(m)

<re.Match object; span=(13, 14), match='M'>


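Similarly, `[a-z]` matches any lower-case letter and `[0-9]` matches any digit. In Python, `\d` also matches a digit. For `str` patterns it additionally matches non-ASCII Unicode decimal digits, unless the `re.ASCII` flag is set.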

In [41]:
s = '''He won $100 in the raffle.'''

# let's try a regex directly and not compile it
for m in re.finditer(r'\d', s):
    print(m)

<re.Match object; span=(8, 9), match='1'>
<re.Match object; span=(9, 10), match='0'>
<re.Match object; span=(10, 11), match='0'>
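
For plain ASCII text, `\d` and `[0-9]` behave the same, but they differ on other scripts. The sketch below uses Arabic-Indic digits (written as escapes) purely as an illustration; the variable names are not from the notebook.

```python
import re

# three Arabic-Indic digits, then three ASCII digits
s = "\u0661\u0662\u0663 and 123"

unicode_digits = re.findall(r"\d", s)               # Unicode-aware by default for str
ascii_digits = re.findall(r"\d", s, flags=re.ASCII)  # restrict \d to 0-9
range_digits = re.findall(r"[0-9]", s)              # explicit ASCII range

print(len(unicode_digits), ascii_digits, range_digits)
```

If the input may contain digits from other scripts and only `0` to `9` should count, `[0-9]` or the `re.ASCII` flag is the safer choice.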


To negate a character class, i.e. to match any single character that is not in it, put a caret right after the opening bracket: `[^...]`.

In [42]:
s = '''Oyfn PripeT'''

not_upper_case_re = re.compile(r'[^A-Z]')

for m in re.finditer(not_upper_case_re, s):
    print(m)

<re.Match object; span=(1, 2), match='y'>
<re.Match object; span=(2, 3), match='f'>
<re.Match object; span=(3, 4), match='n'>
<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(6, 7), match='r'>
<re.Match object; span=(7, 8), match='i'>
<re.Match object; span=(8, 9), match='p'>
<re.Match object; span=(9, 10), match='e'>
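
The caret only negates when it is the first character inside the brackets; anywhere else in the class it stands for a literal `^`. A small sketch on a made-up string:

```python
import re

s = "see a^b"

# caret first: any character that is NOT a lower-case letter
not_lower = re.findall(r"[^a-z]", s)

# caret not first: matches 'e' or a literal '^'
e_or_caret = re.findall(r"[e^]", s)

print(not_lower)
print(e_or_caret)
```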


To replace every substring that matches a regular expression with something else, we can use `re.sub`.

In [44]:
# remove all upper-case characters

s = '''Hello There, Doc!'''

upper_removed_s = re.sub(r'[A-Z]', '', s)
print(upper_removed_s)

ello here, oc!
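
The replacement does not have to be empty, and the `count` argument limits how many matches are replaced. A sketch using a compiled pattern; the underscore placeholder is an arbitrary choice for illustration.

```python
import re

s = "Hello There, Doc!"
upper_re = re.compile(r"[A-Z]")

# replace every upper-case letter with an underscore
masked = upper_re.sub("_", s)

# replace only the first match
first_only = upper_re.sub("_", s, count=1)

print(masked)
print(first_only)
```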


In [46]:
# `?` makes the preceding character optional, meaning 0 or 1 occurrence

# what they see in England
s = 'rainbow colours'

m = re.search(r'colou?r', s)
print(m)

<re.Match object; span=(8, 14), match='colour'>


In [47]:
# what they see in the US
s = 'rainbow colors'

m = re.search(r'colou?r', s)
print(m)

<re.Match object; span=(8, 13), match='color'>
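
The `?` applies only to the character immediately before it, so the pattern accepts zero or one `u` and nothing more. A sketch that checks both spellings and a misspelling in a single made-up string:

```python
import re

s = "colour, color, colouur"

# 'colouur' has two u's, so it is not matched
spellings = re.findall(r"colou?r", s)
print(spellings)
```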
