# Regular Expressions

We often need to test whether a piece of text has a particular form. Regular expressions let us describe that form as a pattern and check text against it.

In [1]:
import re

text = "cab"
print(text == "cab")
print(text == "ca")

True
False


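One limitation of the comparison above is that it must be exact. Could we instead describe the kind of text we are trying to match? With Python's re module we compile a pattern and test it with fullmatch, which returns a match object only when the entire string matches the pattern and None otherwise.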

In [2]:
pattern = re.compile("c")
print(pattern.fullmatch(text))

None


In [3]:
pattern = re.compile("cab")
print(pattern.fullmatch(text))

<re.Match object; span=(0, 3), match='cab'>
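
The None result for the pattern "c" comes from fullmatch requiring the whole string to match. As a minimal sketch for comparison, the standard re methods match and search are less strict; the variable names here are only illustrative.

```python
import re

text = "cab"
pattern = re.compile("c")

# fullmatch succeeds only if the whole string matches the pattern.
full = pattern.fullmatch(text)
# match anchors at the start of the string but allows trailing text.
start = pattern.match(text)
# search scans for the first occurrence anywhere in the string.
anywhere = re.compile("b").search(text)

print(full)             # None
print(start.group())    # c
print(anywhere.span())  # (2, 3)
```

Using fullmatch throughout keeps the examples strict: a pattern either describes the entire piece of text or it does not.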


Can we make our match criteria more flexible? Suppose we know that the text consists of a consonant from (c, d, f), then a vowel from (a, e, i), and then another consonant from (b, t, y, p).

We could exhaustively list every possible combination. Instead, we can describe each position with a character class in square brackets, which matches any one of the listed characters.

In [4]:
pattern = re.compile("[cdf][aei][btyp]")
print(pattern.fullmatch(text))

<re.Match object; span=(0, 3), match='cab'>
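
To make the contrast with exhaustive construction concrete, the following sketch builds every combination with itertools.product and checks that the single pattern accepts all of them. The names first, middle, last, and candidates are illustrative and not part of the original notebook.

```python
import re
from itertools import product

first, middle, last = "cdf", "aei", "btyp"

# Exhaustive approach: build every three-letter combination explicitly.
candidates = {"".join(letters) for letters in product(first, middle, last)}

# Descriptive approach: one character class per position.
pattern = re.compile("[cdf][aei][btyp]")

print(len(candidates))  # 36
print("cab" in candidates)
print(all(pattern.fullmatch(word) for word in candidates))
```

The set grows multiplicatively with each added position or letter, while the pattern stays one short string.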


What if we are told the vowels in the middle might repeat? Can we change the expression to handle this?

In [5]:
pattern = re.compile("[cdf][aei]*[btyp]")
print(pattern.fullmatch(text))
text2 = "caab"
print(pattern.fullmatch(text2))

<re.Match object; span=(0, 3), match='cab'>
<re.Match object; span=(0, 4), match='caab'>


But is this the best expression for these pieces of text? Imagine we encounter a word without a vowel.

In [6]:
pattern = re.compile("[cdf][aei]*[btyp]")
print(pattern.fullmatch(text))
print(pattern.fullmatch(text2))
text3 = "cb"
print(pattern.fullmatch(text3))

<re.Match object; span=(0, 3), match='cab'>
<re.Match object; span=(0, 4), match='caab'>
<re.Match object; span=(0, 2), match='cb'>


It matched even without a vowel, which could be a problem. We can replace the * with a +.
The * matches zero or more instances, while the + requires one or more.

In [7]:
pattern = re.compile("[cdf][aei]+[btyp]")
print(pattern.fullmatch(text))
print(pattern.fullmatch(text2))
print(pattern.fullmatch(text3))

<re.Match object; span=(0, 3), match='cab'>
<re.Match object; span=(0, 4), match='caab'>
None


What if the word may end with a consonant, but never with more than one? So far we have only used * and +. The ? quantifier matches zero or one instance, so it allows at most one final consonant.

In [8]:
pattern = re.compile("[cdf][aei]+[btyp]?")
print(pattern.fullmatch(text))
print(pattern.fullmatch(text2))
print(pattern.fullmatch(text3))
text4 = "cabb"
print(pattern.fullmatch(text4))

<re.Match object; span=(0, 3), match='cab'>
<re.Match object; span=(0, 4), match='caab'>
None
None
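
The sketch below compares three ways of constraining the final consonant, assuming the same character classes as above. The word list and the dictionary keys are made up for illustration; the {1,2} form is standard re syntax for "between one and two repetitions".

```python
import re

words = ["ca", "cab", "cabb", "cabbb"]

patterns = {
    # '?' allows zero or one trailing consonant.
    "optional": re.compile("[cdf][aei]+[btyp]?"),
    # No quantifier means exactly one trailing consonant.
    "exactly_one": re.compile("[cdf][aei]+[btyp]"),
    # '{1,2}' allows one or two trailing consonants.
    "one_or_two": re.compile("[cdf][aei]+[btyp]{1,2}"),
}

results = {
    name: [word for word in words if pattern.fullmatch(word)]
    for name, pattern in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```

Note that the optional form also accepts "ca", a word with no final consonant at all, which may or may not be what is wanted.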


What if the word might end with one extra character that we cannot predict, possibly one that is not alphanumeric? Can we still match it? The . matches any single character except a newline.

In [9]:
pattern = re.compile("[cdf][aei]+[btyp]?")
print(pattern.fullmatch(text))
print(pattern.fullmatch(text2))
print(pattern.fullmatch(text3))
text4 = "cabb"
print(pattern.fullmatch(text4))
text5 = "cab?"
print(pattern.fullmatch(text5))

pattern = re.compile("[cdf][aei]+[btyp].")
print(pattern.fullmatch(text5))



<re.Match object; span=(0, 3), match='cab'>
<re.Match object; span=(0, 4), match='caab'>
None
None
None
<re.Match object; span=(0, 4), match='cab?'>
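
Because the . demands exactly one extra character, the pattern above no longer accepts plain "cab". The sketch below shows two possible variations using standard re syntax: .? makes the extra character optional, and \W restricts it to a non-word character. The pattern names are illustrative.

```python
import re

# '.' matches any single character except a newline, so one extra character is required.
any_char = re.compile("[cdf][aei]+[btyp].")
# '.?' makes that extra character optional.
optional_any = re.compile("[cdf][aei]+[btyp].?")
# r'\W' limits the extra character to one that is not a letter, digit, or underscore.
non_word = re.compile(r"[cdf][aei]+[btyp]\W")

outcomes = {}
for word in ["cab", "cab?", "cabb"]:
    outcomes[word] = (
        bool(any_char.fullmatch(word)),
        bool(optional_any.fullmatch(word)),
        bool(non_word.fullmatch(word)),
    )
    print(word, outcomes[word])
```

Which variant fits depends on whether a stray letter such as the second b in "cabb" should count as an acceptable ending.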


But how is this useful in NLP? One use is inspecting the tokens produced by a tokenizer.

In [10]:
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()

# Named `sentence` rather than `input` to avoid shadowing the built-in input() function.
sentence = "Specifically, we reviewed the AN/ASQ‑235 Airborne Mine Neutralization System (AMNS), Airborne Laser Mine Detection System (ALMDS), and Coastal Battlefield Reconnaissance and Analysis (COBRA) Block I systems."
tokens = treebank_tokenizer.tokenize(sentence)
print(tokens)


['Specifically', ',', 'we', 'reviewed', 'the', 'AN/ASQ‑235', 'Airborne', 'Mine', 'Neutralization', 'System', '(', 'AMNS', ')', ',', 'Airborne', 'Laser', 'Mine', 'Detection', 'System', '(', 'ALMDS', ')', ',', 'and', 'Coastal', 'Battlefield', 'Reconnaissance', 'and', 'Analysis', '(', 'COBRA', ')', 'Block', 'I', 'systems', '.']


In [11]:
pattern = re.compile("[w][aei]")
for token in tokens:
    print(pattern.fullmatch(token))

None
None
<re.Match object; span=(0, 2), match='we'>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
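
Printing a result for every token produces mostly None. As a small sketch, a list comprehension can keep only the tokens that match; the token list here is a hand-copied subset of the tokenizer output so that the example runs without NLTK.

```python
import re

# A shortened, hand-copied subset of the tokens printed above.
tokens = ["Specifically", ",", "we", "reviewed", "the", "Airborne", "Mine"]
pattern = re.compile("[w][aei]")

# Keep a token only when fullmatch returns a match object (which is truthy).
matches = [token for token in tokens if pattern.fullmatch(token)]
print(matches)  # ['we']
```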


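What if we want to find tokens written entirely in uppercase, such as acronyms? We can use ranges inside a character class: A-Z covers the uppercase letters and 0-9 the digits, while / and a trailing - are matched literally.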

In [17]:
pattern = re.compile("[A-Z0-9/-]+")
for token in tokens:
    print(pattern.fullmatch(token))

None
None
None
None
None
None
None
None
None
None
None
<re.Match object; span=(0, 4), match='AMNS'>
None
None
None
None
None
None
None
None
<re.Match object; span=(0, 5), match='ALMDS'>
None
None
None
None
None
None
None
None
None
<re.Match object; span=(0, 5), match='COBRA'>
None
None
<re.Match object; span=(0, 1), match='I'>
None
None
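
The designator token prints None even though the class includes / and -. One likely explanation, stated here as an assumption about the source text, is that its hyphen is not the ASCII hyphen but U+2011 (non-breaking hyphen). The sketch below writes that character as an explicit escape and shows one way to widen the class to accept it.

```python
import re

# Assumption: the hyphen in the designator is U+2011 (non-breaking hyphen).
token = "AN/ASQ\u2011235"

# The original class only knows the ASCII hyphen.
ascii_only = re.compile("[A-Z0-9/-]+")
# Adding U+2011 to the class lets the designator match as well.
with_nb_hyphen = re.compile("[A-Z0-9/\u2011-]+")

print(ascii_only.fullmatch(token))
print(with_nb_hyphen.fullmatch(token))
```

Another option would be to normalize the text first, for example with token.replace("\u2011", "-"), so that a single ASCII pattern applies everywhere.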
