## Regular Expressions

In [1]:
import re

# helper from the DS100 book
def reg(regex, text):
    """
    Prints text with every match of regex highlighted.
    The ANSI escape codes used below mean bold black text on a yellow background.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
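The highlighting in `reg` relies on ANSI color codes, which are hard to read wherever they are not rendered. A minimal sketch of a plain-text variant, assuming a hypothetical helper named `mark` (not part of the original notebook) that wraps each match in square brackets instead:

```python
import re

def mark(regex, text):
    """Hypothetical variant of reg(): return text with each match wrapped in [ ]."""
    # Wrapping the pattern in a group lets \1 refer to the whole match.
    return re.sub(f"({regex})", r"[\1]", text)

print(mark("a", "A DAG is a graph"))
```

Because `mark` returns the string rather than printing it, the result can also be compared against an expected value.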

In [2]:
s1 = " ".join(["A DAG is a directed graph without cycles.",
               "A tree is a DAG where every node has one parent (except the root, which has none).",
               "To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\\_(ツ)_/¯"])
print(s1)

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [3]:
s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567
"""
print(s2)

1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567



In [4]:
s3 = "In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!"
s3

'In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!'

In [5]:
s4 = """In CS 320,  there are 14 quizzes,    7 projects,
41 lectures, and 1000 things to learn.  CS 320 is awesome!"""
print(s4)

In CS 320,  there are 14 quizzes,    7 projects,
41 lectures, and 1000 things to learn.  CS 320 is awesome!


In [6]:
print(s1)

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [7]:
reg("a", s1)

A DAG is [1;30;43ma[m directed gr[1;30;43ma[mph without cycles. A tree is [1;30;43ma[m DAG where every node h[1;30;43ma[ms one p[1;30;43ma[mrent (except the root, which h[1;30;43ma[ms none). To le[1;30;43ma[mrn more, visit www.ex[1;30;43ma[mmple.com or c[1;30;43ma[mll 1-608-123-4567. :) ¯\_(ツ)_/¯


In [8]:
reg("A", s1)

[1;30;43mA[m D[1;30;43mA[mG is a directed graph without cycles. [1;30;43mA[m tree is a D[1;30;43mA[mG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [11]:
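# find the left arm: the regex \\ matches one backslash, and in a normal string each of those two backslashes must be doubled again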
reg("\\\\", s1)

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯[1;30;43m\[m_(ツ)_/¯


In [12]:
reg(r"\\", s1) # same regex, written as a raw string

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯[1;30;43m\[m_(ツ)_/¯
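The two cells above pass the same regex to `reg`. A small sketch that checks this directly (the variable names are only for illustration):

```python
import re

escaped = "\\\\"   # normal string: four typed backslashes become two characters
raw = r"\\"        # raw string: two typed backslashes stay two characters

same = (escaped == raw)
found = re.findall(raw, r"¯\_(ツ)_/¯")   # the regex \\ matches one literal backslash

print(same, len(raw), found)
```

Raw strings are usually the easier choice for regexes, since every backslash typed is a backslash the regex engine sees.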


In [13]:
reg(r"aA", s1) # matches "a" immediately followed by "A"; s1 has no such text

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [14]:
# character class
reg(r"[aA]", s1)

[1;30;43mA[m D[1;30;43mA[mG is [1;30;43ma[m directed gr[1;30;43ma[mph without cycles. [1;30;43mA[m tree is [1;30;43ma[m D[1;30;43mA[mG where every node h[1;30;43ma[ms one p[1;30;43ma[mrent (except the root, which h[1;30;43ma[ms none). To le[1;30;43ma[mrn more, visit www.ex[1;30;43ma[mmple.com or c[1;30;43ma[mll 1-608-123-4567. :) ¯\_(ツ)_/¯


In [15]:
# all the vowels
reg(r"[AEIOUaeiou]", s1)

[1;30;43mA[m D[1;30;43mA[mG [1;30;43mi[ms [1;30;43ma[m d[1;30;43mi[mr[1;30;43me[mct[1;30;43me[md gr[1;30;43ma[mph w[1;30;43mi[mth[1;30;43mo[m[1;30;43mu[mt cycl[1;30;43me[ms. [1;30;43mA[m tr[1;30;43me[m[1;30;43me[m [1;30;43mi[ms [1;30;43ma[m D[1;30;43mA[mG wh[1;30;43me[mr[1;30;43me[m [1;30;43me[mv[1;30;43me[mry n[1;30;43mo[md[1;30;43me[m h[1;30;43ma[ms [1;30;43mo[mn[1;30;43me[m p[1;30;43ma[mr[1;30;43me[mnt ([1;30;43me[mxc[1;30;43me[mpt th[1;30;43me[m r[1;30;43mo[m[1;30;43mo[mt, wh[1;30;43mi[mch h[1;30;43ma[ms n[1;30;43mo[mn[1;30;43me[m). T[1;30;43mo[m l[1;30;43me[m[1;30;43ma[mrn m[1;30;43mo[mr[1;30;43me[m, v[1;30;43mi[ms[1;30;43mi[mt www.[1;30;43me[mx[1;30;43ma[mmpl[1;30;43me[m.c[1;30;43mo[mm [1;30;43mo[mr c[1;30;43ma[mll 1-608-123-4567. :) ¯\_(ツ)_/¯


In [16]:
# everything except vowels
# ^ means "not" (when it's inside a character class)
reg(r"[^AEIOUaeiou]", s1)

A[1;30;43m [m[1;30;43mD[mA[1;30;43mG[m[1;30;43m [mi[1;30;43ms[m[1;30;43m [ma[1;30;43m [m[1;30;43md[mi[1;30;43mr[me[1;30;43mc[m[1;30;43mt[me[1;30;43md[m[1;30;43m [m[1;30;43mg[m[1;30;43mr[ma[1;30;43mp[m[1;30;43mh[m[1;30;43m [m[1;30;43mw[mi[1;30;43mt[m[1;30;43mh[mou[1;30;43mt[m[1;30;43m [m[1;30;43mc[m[1;30;43my[m[1;30;43mc[m[1;30;43ml[me[1;30;43ms[m[1;30;43m.[m[1;30;43m [mA[1;30;43m [m[1;30;43mt[m[1;30;43mr[mee[1;30;43m [mi[1;30;43ms[m[1;30;43m [ma[1;30;43m [m[1;30;43mD[mA[1;30;43mG[m[1;30;43m [m[1;30;43mw[m[1;30;43mh[me[1;30;43mr[me[1;30;43m [me[1;30;43mv[me[1;30;43mr[m[1;30;43my[m[1;30;43m [m[1;30;43mn[mo[1;30;43md[me[1;30;43m [m[1;30;43mh[ma[1;30;43ms[m[1;30;43m [mo[1;30;43mn[me[1;30;43m [m[1;30;43mp[ma[1;30;43mr[me[1;30;43mn[m[1;30;43mt[m[1;30;43m [m[1;30;43m([me[1;30;43mx[m[1;30;43mc[me[1;30;43mp[m[1;30;43mt[m[1;30;43m [m[1;30;43mt[m[1;30;43mh[me[

In [17]:
# find all capital letters
reg(r"[A-Z]", s1)

[1;30;43mA[m [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m is a directed graph without cycles. [1;30;43mA[m tree is a [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m where every node has one parent (except the root, which has none). [1;30;43mT[mo learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [18]:
# find 3 things: A, Z, and -
reg(r"[A\-Z]", s1)

[1;30;43mA[m D[1;30;43mA[mG is a directed graph without cycles. [1;30;43mA[m tree is a D[1;30;43mA[mG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1[1;30;43m-[m608[1;30;43m-[m123[1;30;43m-[m4567. :) ¯\_(ツ)_/¯
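The role of the hyphen inside a character class depends on where it sits. A short sketch on a made-up string, assuming nothing beyond the standard `re` module:

```python
import re

text = "A-Z and B"

as_range = re.findall(r"[A-Z]", text)    # range: any capital letter
escaped = re.findall(r"[A\-Z]", text)    # escaped hyphen: A, -, or Z
trailing = re.findall(r"[AZ-]", text)    # a hyphen placed last is also literal

print(as_range, escaped, trailing)
```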


In [20]:
# meta characters: \d, \s, \w, .
reg(r"\d", s1) # all digits

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call [1;30;43m1[m-[1;30;43m6[m[1;30;43m0[m[1;30;43m8[m-[1;30;43m1[m[1;30;43m2[m[1;30;43m3[m-[1;30;43m4[m[1;30;43m5[m[1;30;43m6[m[1;30;43m7[m. :) ¯\_(ツ)_/¯


In [21]:
reg(r"\s", s1) # all whitespace (tabs, spaces, newlines)

A[1;30;43m [mDAG[1;30;43m [mis[1;30;43m [ma[1;30;43m [mdirected[1;30;43m [mgraph[1;30;43m [mwithout[1;30;43m [mcycles.[1;30;43m [mA[1;30;43m [mtree[1;30;43m [mis[1;30;43m [ma[1;30;43m [mDAG[1;30;43m [mwhere[1;30;43m [mevery[1;30;43m [mnode[1;30;43m [mhas[1;30;43m [mone[1;30;43m [mparent[1;30;43m [m(except[1;30;43m [mthe[1;30;43m [mroot,[1;30;43m [mwhich[1;30;43m [mhas[1;30;43m [mnone).[1;30;43m [mTo[1;30;43m [mlearn[1;30;43m [mmore,[1;30;43m [mvisit[1;30;43m [mwww.example.com[1;30;43m [mor[1;30;43m [mcall[1;30;43m [m1-608-123-4567.[1;30;43m [m:)[1;30;43m [m¯\_(ツ)_/¯


In [22]:
# \S means not \s
reg(r"\S", s1) # all non-whitespace

[1;30;43mA[m [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m [1;30;43mi[m[1;30;43ms[m [1;30;43ma[m [1;30;43md[m[1;30;43mi[m[1;30;43mr[m[1;30;43me[m[1;30;43mc[m[1;30;43mt[m[1;30;43me[m[1;30;43md[m [1;30;43mg[m[1;30;43mr[m[1;30;43ma[m[1;30;43mp[m[1;30;43mh[m [1;30;43mw[m[1;30;43mi[m[1;30;43mt[m[1;30;43mh[m[1;30;43mo[m[1;30;43mu[m[1;30;43mt[m [1;30;43mc[m[1;30;43my[m[1;30;43mc[m[1;30;43ml[m[1;30;43me[m[1;30;43ms[m[1;30;43m.[m [1;30;43mA[m [1;30;43mt[m[1;30;43mr[m[1;30;43me[m[1;30;43me[m [1;30;43mi[m[1;30;43ms[m [1;30;43ma[m [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m [1;30;43mw[m[1;30;43mh[m[1;30;43me[m[1;30;43mr[m[1;30;43me[m [1;30;43me[m[1;30;43mv[m[1;30;43me[m[1;30;43mr[m[1;30;43my[m [1;30;43mn[m[1;30;43mo[m[1;30;43md[m[1;30;43me[m [1;30;43mh[m[1;30;43ma[m[1;30;43ms[m [1;30;43mo[m[1;30;43mn[m[1;30;43me[m [1;30;43mp[m[1;30;43ma[m[1;30;43mr[m[1;30;43me[m[1;30;43mn[m[

In [23]:
reg(r"\w", s1) # \w is word characters

[1;30;43mA[m [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m [1;30;43mi[m[1;30;43ms[m [1;30;43ma[m [1;30;43md[m[1;30;43mi[m[1;30;43mr[m[1;30;43me[m[1;30;43mc[m[1;30;43mt[m[1;30;43me[m[1;30;43md[m [1;30;43mg[m[1;30;43mr[m[1;30;43ma[m[1;30;43mp[m[1;30;43mh[m [1;30;43mw[m[1;30;43mi[m[1;30;43mt[m[1;30;43mh[m[1;30;43mo[m[1;30;43mu[m[1;30;43mt[m [1;30;43mc[m[1;30;43my[m[1;30;43mc[m[1;30;43ml[m[1;30;43me[m[1;30;43ms[m. [1;30;43mA[m [1;30;43mt[m[1;30;43mr[m[1;30;43me[m[1;30;43me[m [1;30;43mi[m[1;30;43ms[m [1;30;43ma[m [1;30;43mD[m[1;30;43mA[m[1;30;43mG[m [1;30;43mw[m[1;30;43mh[m[1;30;43me[m[1;30;43mr[m[1;30;43me[m [1;30;43me[m[1;30;43mv[m[1;30;43me[m[1;30;43mr[m[1;30;43my[m [1;30;43mn[m[1;30;43mo[m[1;30;43md[m[1;30;43me[m [1;30;43mh[m[1;30;43ma[m[1;30;43ms[m [1;30;43mo[m[1;30;43mn[m[1;30;43me[m [1;30;43mp[m[1;30;43ma[m[1;30;43mr[m[1;30;43me[m[1;30;43mn[m[1;30;43mt[m 

In [27]:
reg(r".", s1) # match anything EXCEPT a newline

[1;30;43mA[m[1;30;43m [m[1;30;43mD[m[1;30;43mA[m[1;30;43mG[m[1;30;43m [m[1;30;43mi[m[1;30;43ms[m[1;30;43m [m[1;30;43ma[m[1;30;43m [m[1;30;43md[m[1;30;43mi[m[1;30;43mr[m[1;30;43me[m[1;30;43mc[m[1;30;43mt[m[1;30;43me[m[1;30;43md[m[1;30;43m [m[1;30;43mg[m[1;30;43mr[m[1;30;43ma[m[1;30;43mp[m[1;30;43mh[m[1;30;43m [m[1;30;43mw[m[1;30;43mi[m[1;30;43mt[m[1;30;43mh[m[1;30;43mo[m[1;30;43mu[m[1;30;43mt[m[1;30;43m [m[1;30;43mc[m[1;30;43my[m[1;30;43mc[m[1;30;43ml[m[1;30;43me[m[1;30;43ms[m[1;30;43m.[m[1;30;43m [m[1;30;43mA[m[1;30;43m [m[1;30;43mt[m[1;30;43mr[m[1;30;43me[m[1;30;43me[m[1;30;43m [m[1;30;43mi[m[1;30;43ms[m[1;30;43m [m[1;30;43ma[m[1;30;43m [m[1;30;43mD[m[1;30;43mA[m[1;30;43mG[m[1;30;43m [m[1;30;43mw[m[1;30;43mh[m[1;30;43me[m[1;30;43mr[m[1;30;43me[m[1;30;43m [m[1;30;43me[m[1;30;43mv[m[1;30;43me[m[1;30;43mr[m[1;30;43my[m[1;30;43m [m[1;30;43mn[m[1;30

In [28]:
reg(r"www", s1)

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit [1;30;43mwww[m.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [29]:
reg(r"w{3}", s1)

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit [1;30;43mwww[m.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [30]:
reg(r"w{2}", s1) # RULE: no overlapping matches

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit [1;30;43mww[mw.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
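The "no overlapping matches" rule can be seen with `re.findall` on short strings. The last line is one possible workaround, not something the notebook itself uses: a lookahead consumes no characters, so overlapping pairs can be reported.

```python
import re

pairs_in_four = re.findall(r"w{2}", "wwww")     # two separate pairs
pairs_in_three = re.findall(r"w{2}", "www")     # one pair; the last w is left over
# The lookahead (?=...) matches at a position without consuming text,
# so the scan can find a pair starting at every position.
overlapping = re.findall(r"(?=(w{2}))", "www")

print(pairs_in_four, pairs_in_three, overlapping)
```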


In [32]:
# + means one or more   (prefers more -- "greedy" operator)
# * means zero or more  (prefers more)
reg(r"w+", s1)

A DAG is a directed graph [1;30;43mw[mithout cycles. A tree is a DAG [1;30;43mw[mhere every node has one parent (except the root, [1;30;43mw[mhich has none). To learn more, visit [1;30;43mwww[m.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯


In [33]:
# want everything between parens
reg(r"(.*)", s1) # BAD: unescaped parens are group syntax, so this matches the whole string

[1;30;43mA DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯[m[1;30;43m[m


In [34]:
reg(r"\(.*\)", s1) # LESS BAD: .* is greedy, so the match runs from the first "(" to the last ")"

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent [1;30;43m(except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)[m_/¯


In [35]:
# +? means one or more   (prefers fewer -- "non-greedy")
# *? means zero or more  (prefers fewer)
# ?  means zero or one
reg(r"\(.*?\)", s1) # GOOD: each match stops at the first ")" after its "("

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent [1;30;43m(except the root, which has none)[m. To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_[1;30;43m(ツ)[m_/¯
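The same greedy versus non-greedy contrast, sketched with `re.findall` on a shorter made-up string. The third pattern is an alternative that is not in the notebook: a negated character class that cannot run past a closing paren.

```python
import re

text = "f(a) + g(b)"

greedy = re.findall(r"\(.*\)", text)       # runs from the first "(" to the last ")"
lazy = re.findall(r"\(.*?\)", text)        # stops at the nearest ")"
negated = re.findall(r"\([^)]*\)", text)   # anything but ")" between the parens

print(greedy, lazy, negated)
```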


In [36]:
# anchor
# ^: beginning of string
# $: end of string
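A quick sketch of the anchors on a multi-line string similar to `s2`. The `re.MULTILINE` flag is an addition here, not something the notebook covers; with it, `^` and `$` also match at the start and end of every line.

```python
import re

text = "1-608-123-4567\n608-123-4567\n123-4567"

start_default = re.findall(r"^\d+", text)                   # start of the string only
start_multiline = re.findall(r"^\d+", text, re.MULTILINE)   # start of every line
end_default = re.findall(r"\d+$", text)                     # end of the string only
end_multiline = re.findall(r"\d+$", text, re.MULTILINE)     # end of every line

print(start_default, start_multiline)
print(end_default, end_multiline)
```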

In [42]:
# match the first two sentences

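# write good comments with regexes!!!
# ^ anchors at the start; [^\.]*\. is "non-periods, then a period" (one sentence); {2} asks for two of them
reg(r"^([^\.]*\.){2}", s1)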

[1;30;43mA DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none).[m To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
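One way to follow the "write good comments" advice is `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments inside it. A sketch of the same two-sentence regex in that style (the constant name and sample text are made up for illustration):

```python
import re

FIRST_TWO_SENTENCES = re.compile(r"""
    ^            # start at the beginning of the string
    (            # one sentence:
      [^.]*      #   any run of characters that are not a period
      \.         #   then the period that ends the sentence
    ){2}         # exactly two sentences in a row
""", re.VERBOSE)

match = FIRST_TWO_SENTENCES.search("One. Two. Three.")
print(match.group())
```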


In [44]:
reg(r"\d-\d{3}-\d{3}-\d{4}", s2)

[1;30;43m1-608-123-4567[m
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567



In [45]:
reg(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", s2)

[1;30;43m1-608-123-4567[m
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
[1;30;43m608-123-4567[m
[1;30;43m123-4567[m



In [None]:
# TODO: handle no dashes
reg(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", s2)

In [46]:
reg(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", "1-123-4567")

[1;30;43m1-123-4567[m


In [47]:
reg(r"((\d-)?\d{3}-)?\d{3}-\d{4}", "1-123-4567") # nested: the leading digit is allowed only when an area code follows

1-[1;30;43m123-4567[m
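The difference between the two patterns is easier to see with `re.fullmatch`, which requires the whole string to match. A small sketch (the names `flat` and `nested` are only labels for the two regexes above):

```python
import re

flat = r"(\d-)?(\d{3}-)?\d{3}-\d{4}"     # each prefix is optional independently
nested = r"((\d-)?\d{3}-)?\d{3}-\d{4}"   # the leading digit requires the area code

numbers = ["1-608-123-4567", "608-123-4567", "123-4567", "1-123-4567"]
flat_ok = [bool(re.fullmatch(flat, n)) for n in numbers]
nested_ok = [bool(re.fullmatch(nested, n)) for n in numbers]

for n, f, m in zip(numbers, flat_ok, nested_ok):
    print(n, f, m)
```

Only the last string differs: the flat pattern accepts it, the nested one does not.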


In [48]:
reg(r"((\d-)?\d{3}-)?\d{3}-\d{4}", s2)

[1;30;43m1-608-123-4567[m
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
[1;30;43m608-123-4567[m
[1;30;43m123-4567[m
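For the earlier TODO about numbers written without dashes, one possible approach is to make every dash optional with `-?`. This is only a sketch under that assumption, not the notebook's solution, and it also accepts mixed forms such as `608123-4567`.

```python
import re

# Same nested structure as above, but every dash is optional.
optional_dashes = r"((\d-?)?\d{3}-?)?\d{3}-?\d{4}"

candidates = ["1-608-123-4567", "16081234567", "6081234567",
              "1234567", "123456", "1-608-123-456"]
accepted = [n for n in candidates if re.fullmatch(optional_dashes, n)]
print(accepted)
```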



# re.findall and re.sub

In [49]:
import re

In [52]:
print(s3)

In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!


In [53]:
# example 1: can we extract info about a course into a dict?
re.findall(r"\d", s3)

['3', '2', '0', '1', '4', '7', '4', '1', '1', '0', '0', '0', '3', '2', '0']

In [56]:
re.findall(r"\d+", s3)

['320', '14', '7', '41', '1000', '320']

In [61]:
d = {}
for num, name in re.findall(r"(\d+) (\w+)", s3):
    d[name] = int(num)
d

{'quizzes': 14, 'projects': 7, 'lectures': 41, 'things': 1000, 'is': 320}
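The dict above also picks up `'is': 320`, because "320 is" fits the pattern. One way to leave it out, sketched here as an assumption rather than the notebook's approach, is a negative lookbehind that rejects numbers directly following "CS ":

```python
import re

s3 = "In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!"

# (?<!CS ) rejects a number that directly follows "CS ";
# \b stops the match from restarting in the middle of that number.
counts = {name: int(num)
          for num, name in re.findall(r"(?<!CS )\b(\d+) (\w+)", s3)}
print(counts)
```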

In [62]:
# example 2: make all the numbers bold (in HTML, using <b>)
print(s3)

In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!


In [63]:
re.sub(r"\d+", "###", s3)

'In CS ###, there are ### quizzes, ### projects, ### lectures, and ### things to learn.  CS ### is awesome!'

In [64]:
re.sub(r"(\d+)", r"\g<1>", s3)

'In CS 320, there are 14 quizzes, 7 projects, 41 lectures, and 1000 things to learn.  CS 320 is awesome!'

In [70]:
re.sub(r"(\d+)", r"\g<1>!", s3)

'In CS 320!, there are 14! quizzes, 7! projects, 41! lectures, and 1000! things to learn.  CS 320! is awesome!'

In [68]:
html_str = re.sub(r"(\d+)", r"<b>\g<1></b>", s3)
html_str

'In CS <b>320</b>, there are <b>14</b> quizzes, <b>7</b> projects, <b>41</b> lectures, and <b>1000</b> things to learn.  CS <b>320</b> is awesome!'
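The `\g<1>` form is useful when a group reference is followed by a digit, where the shorter `\1` would be ambiguous. The sketch below shows that case, plus a function used as the replacement, which `re.sub` also accepts (the sample string is made up):

```python
import re

s = "7 projects"

# \g<1>0 is group 1 followed by a literal "0".
appended = re.sub(r"(\d+)", r"\g<1>0", s)

# \10 is read as a reference to group 10, which does not exist here.
try:
    re.sub(r"(\d+)", r"\10", s)
    failed = False
except re.error:
    failed = True

# A function replacement receives each match object and returns the new text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), s)

print(appended, failed, doubled)
```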

In [69]:
from IPython.display import HTML
HTML(html_str)