## CS 210 Fall 2023
#### Lecture 15

In [1]:
import re

# if you restart the kernel,
# don't forget to re-import

### <font color="purple">Reg</font><font color="green">ul</font><font color="red">ar</font> <font color="purple">Ex</font><font color="blue">press</font><font color="orange">ions</font>

- literal characters match literal characters
- . matches any character (except newlines)
- R* matches 0 or more copies of R
- R+ matches 1 or more copies of R
- R? matches 0 or 1 copy of R
- [] is used to match a class of characters (e.g., [a-zA-Z0-9])
- ^ matches start of target string
  (outside of a [ ] class)
- $ matches end of target string
- | is used for alternative match, usually with ( )
- { } used for specific number of instances
- ^ negates when used as *first* character inside a class [ ]
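
The rules above can be checked with a few quick searches. This is a minimal sketch; the patterns and target strings are made-up illustrations, not examples from the lecture.

```python
import re

# Each entry: (pattern, target, whether a match is expected).
# Patterns and targets are illustrative choices only.
checks = [
    (r'c.t',         'cut',   True),   # . matches any single character
    (r'ab*c',        'ac',    True),   # * allows zero copies of b
    (r'ab+c',        'ac',    False),  # + requires at least one b
    (r'colou?r',     'color', True),   # ? makes the u optional
    (r'^[0-9]{3}$',  '2023',  False),  # {3} demands exactly three digits
    (r'^(cat|dog)$', 'dog',   True),   # | chooses between alternatives
    (r'^[^0-9]+$',   'abc',   True),   # ^ inside [ ] negates the class
]

# True for every row whose outcome agrees with the expectation
results = [bool(re.search(p, s)) == expected for p, s, expected in checks]
print(all(results))
```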

### Exercise

Match or No Match? 

In [2]:
m = re.search('^[a][a-z]{3,}$', 'artificial')

In [3]:
print(m)

<re.Match object; span=(0, 10), match='artificial'>


### Exercise

Match or No Match? 

In [4]:
m = re.search('^[^a][a-z]+', 'artificial')

In [5]:
print(m)

None


### Exercise

Match or No Match? 

In [6]:
m = re.search('^a[a-z]+tion$', 'attention')

In [7]:
print(m)

<re.Match object; span=(0, 9), match='attention'>


### Exercise

Match or No Match? 

In [8]:
m = re.search('a[a-z]+tion$', 'distraction')

In [9]:
print(m)

<re.Match object; span=(5, 11), match='action'>


In [10]:
def search_strings(pattern, test_strings): 
    for s in test_strings: 
        res = re.search(pattern, s)
        if res is None:
            print(f"'{s}' does not match")
        else:
            span = res.span()
            print(f"'{s}' matches '{s[span[0]:span[1]]}'")

**Example**
Want to match the sequence 'a\y' or 'ay', i.e. zero or one occurrence of '\\' between 'a' and 'y'

In [11]:
test_strings = ['ay', 'a\y']
search_strings('a\?y', test_strings)
search_strings('a\\?y', test_strings)
search_strings('a\\\\?y', test_strings)

'ay' does not match
'a\y' does not match
'ay' does not match
'a\y' does not match
'ay' matches 'ay'
'a\y' matches 'a\y'


**Example**
Want to match the sequence 'a\t' or 'at', i.e. zero or one occurrence of '\\' between 'a' and 't'

In [12]:
test_strings = ['at', 'a\t', 'a\\t']
search_strings('a\t', test_strings)
search_strings('a\\t', test_strings)
search_strings('a\?t', test_strings)
search_strings('a\\?t', test_strings)
search_strings('a\\\\?t', test_strings)
search_strings(r'a\?t', test_strings)

'at' does not match
'a	' matches 'a	'
'a\t' does not match
'at' does not match
'a	' matches 'a	'
'a\t' does not match
'at' does not match
'a	' does not match
'a\t' does not match
'at' does not match
'a	' does not match
'a\t' does not match
'at' matches 'at'
'a	' does not match
'a\t' matches 'a\t'
'at' does not match
'a	' does not match
'a\t' does not match



**Workaround: tag the pattern as a RAW string, with an 'r' in front**

**Python leaves it alone and sends it as-is to the search function**

In [13]:
# r in front of the pattern; the target 'a\y' is left as a regular string here
print(re.search(r'a\\?y','a\y'))

<re.Match object; span=(0, 3), match='a\\y'>


In [14]:
# with r in front of regexp string, single '\' still retains its special meaning
print(re.search(r'a\?y','a?y'))
print(re.search(r'a\?y','a\y'))

<re.Match object; span=(0, 3), match='a?y'>
None


#### Always use r'...' for the regular expression if it has an escape '\\' character
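
To see what the `r` prefix changes, it can help to compare the two spellings of the same pattern directly. This is a small sketch; the variable names `plain` and `raw` are just illustrative.

```python
import re

plain = 'a\\\\?y'   # regular string: Python collapses each '\\' into one backslash
raw = r'a\\?y'      # raw string: backslashes reach the regex engine untouched

print(plain == raw)   # both spell the same pattern
print(len(raw))       # characters: a, \, \, ?, y
print(re.search(raw, r'a\y') is not None)
```

Both spellings hand the regex engine the same five characters; the raw form is simply easier to read and harder to get wrong.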

**One final obstacle: Python special characters**

**written with a '\\' in front of them. E.g., '\\n', '\\t', etc.**

See https://chercher.tech/python-programming/python-special-characters

In [15]:
print(re.search(r'a\\?y','a\y'))

<re.Match object; span=(0, 3), match='a\\y'>


In [16]:
print(re.search(r'a\\?t','a\t'))

None


Why doesn't this code work?

**We will need to escape the '\\' in the target string as well**

In [17]:
print(re.search(r'a\\?t','a\\t'))

<re.Match object; span=(0, 3), match='a\\t'>


**Easy workaround is to write the target as r'...' as well**

In [18]:
print(re.search(r'a\\?t',r'a\t'))

<re.Match object; span=(0, 3), match='a\\t'>


In [19]:
print(re.search(r'a\\?t',r'at'))

<re.Match object; span=(0, 2), match='at'>


#### Greedy and non-greedy matching

What should the following search return?

In [20]:
myRegex = 'a+'
myString = 'caaaaaaat'

res = re.search(myRegex, myString)
print(res)

<re.Match object; span=(1, 8), match='aaaaaaa'>


In [21]:
# greedy matching
# * matches longest possible sequence 
res = re.search('<.*>', '<p class="para">This is a paragraph.</p>')
print(res)

<re.Match object; span=(0, 40), match='<p class="para">This is a paragraph.</p>'>


In [22]:
# non-greedy matching with ?
res = re.search('<.*?>', '<p class="para">This is a paragraph.</p>')
print(res)
res = re.search('^[,.();]+?', '...;,;,,')
print(res)

<re.Match object; span=(0, 16), match='<p class="para">'>
<re.Match object; span=(0, 1), match='.'>


**In the above, the search stops as soon as the first '>' is seen**

Note: ? following * or +, is different from ? following a character
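
The two roles of `?` can be compared side by side. This is a minimal sketch on a made-up target string.

```python
import re

text = 'abbb'

optional = re.search(r'ab?', text)   # ? after a character: zero or one b
greedy = re.search(r'ab+', text)     # + alone: as many b's as possible
lazy = re.search(r'ab+?', text)      # ? after +: as few b's as possible

print(optional.group(), greedy.group(), lazy.group())
```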

In [23]:
# or, we can use negation to prevent < or > characters in between
print(re.search(r'<[^<>]*>','<p class="para">This is a paragraph.</p>'))
print(re.search(r'\([^()]*\)','(abc)(def)'))

<re.Match object; span=(0, 16), match='<p class="para">'>
<re.Match object; span=(0, 5), match='(abc)'>


#### Special regular expression sequences to match predefined sets of characters
1. Whitespace: \\s, \\S
2. Word (alphanumeric, plus underscore) characters: \\w, \\W
3. Digits: \\d, \\D
4. Word Boundary: \\b

**Whitespace**
- \\s : matches any whitespace character (including tab and newline)
- \\S : matches any non-whitespace character 

In [24]:
pattern = r'[.?!]{2,}\s+'
test_strings = ["...What's going on??? ",
                "... What's going on?",
                "...  What's going on?", 
                "What's going on?! Hey..."]

In [25]:
search_strings(pattern, test_strings)

'...What's going on??? ' matches '??? '
'... What's going on?' matches '... '
'...  What's going on?' matches '...  '
'What's going on?! Hey...' matches '?! '


In [31]:
# at least 4 non-whitespace characters followed by at least one whitespace
res = re.search(r'\S{4,}\s+','The quick brown fox...')
print(res)

<re.Match object; span=(4, 10), match='quick '>


In [32]:
# can specify whitespace alternatively by using [] class with space, tab, and newline
pattern = r'[.?!]{2,}[ \t\n]+'
test_strings = ["... What the?",
                "What the?!\nHey!!",
                ".?.?.?\tWhat? "]
search_strings(pattern, test_strings)

'... What the?' matches '... '
'What the?!
Hey!!' matches '?!
'
'.?.?.?	What? ' matches '.?.?.?	'


In [33]:
# from last time... to match a '-' in a character class, use '\'
# or put it first or last
pattern = "0[-+^]?9"
test_strings = ['0-9','0+9', '09', '0^9', '0\9']
search_strings(pattern, test_strings)

'0-9' matches '0-9'
'0+9' matches '0+9'
'09' matches '09'
'0^9' matches '0^9'
'0\9' does not match


**"Word" characters (alphanumeric plus underscore)**

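- \\w : matches any alphanumeric character => [a-zA-Z0-9_]  (includes underscore; for Python 3 `str` patterns it also matches non-ASCII letters and digits unless the `re.ASCII` flag is set)
- \\W : matches any non-alphanumeric character => [^a-zA-Z0-9_]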

In [34]:
# want at least 4 word characters followed by at least one whitespace
res = re.search(r'\w{4,}\s+',"Hey! What's up?")
print(res)

None


In [35]:
# want at least 4 word characters followed by at least one whitespace
res = re.search(r'\w{4,}\s+',"Hey! What's up with_you today?")
print(res)

<re.Match object; span=(15, 24), match='with_you '>


**Digits**

- \\d : matches any digit character => [0-9] (plus other Unicode decimal digits in Python 3 `str` patterns unless `re.ASCII` is set)
- \\D : matches any non-digit character => [^0-9]
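
A short sketch of `\d` and `\D` in action, on a made-up string. It uses `re.sub`, which is covered later in this lecture, to delete every non-digit.

```python
import re

text = 'Lecture 15 of CS 210, Fall 2023'

first_number = re.search(r'\d+', text).group()   # first run of digits
digits_only = re.sub(r'\D', '', text)            # delete every non-digit

print(first_number)
print(digits_only)
```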

### Exercise 1:
Write a regular expression to determine if a given string is an acceptable phone number.
Following are the acceptable phone number formats (d stands for digit):

- (ddd)ddddddd
- (ddd)ddd-dddd

```
[In]:
test_strings = ["(123)4567890", 
                "(123)456-7890", 
                "1234567890", 
                "999-9999"]

phone_pattern_1 = # YOUR PATTERN HERE
search_strings(phone_pattern_1, test_strings)

[Out]:
'(123)4567890' matches '(123)4567890'
'(123)456-7890' matches '(123)456-7890'
'1234567890' does not match
'999-9999' does not match
```

In [39]:
test_strings = ["(123)4567890", 
                "(123)456-7890", 
                "1234567890", 
                "999-9999"]
#phone_pattern_1 = ''
search_strings(phone_pattern_1, test_strings)


'(123)4567890' matches '(123)4567890'
'(123)456-7890' matches '(123)456-7890'
'1234567890' does not match
'999-9999' does not match


**Next, let's strengthen the above with the ability to handle leading/trailing whitespaces**

In [88]:
pat = r'\s*' + phone_pattern_1 + r'\s*'  # both pieces are raw strings
res = re.search(pat, '      (123)456-7890    ')
print('match') if res else print('no match')

match


**For the variants without '( )', we can't simply use -? at each of the dash positions, because the pattern would then also match when only one of the two dashes is present**

- ddd-ddd-dddd
- dddddddddd


In [41]:
# so, for instance, it will work for this string
res = re.search(r'^\s*\d{3}-?\d{3}-?\d{4}\s*$','  848-555-4321')
print(res)

<re.Match object; span=(0, 14), match='  848-555-4321'>


In [42]:
# but also for this string, which is not an acceptable variant
res = re.search(r'^\s*\d{3}-?\d{3}-?\d{4}\s*$','  848-5554321')
print(res) 

<re.Match object; span=(0, 13), match='  848-5554321'>


**So let's do one pattern to catch both dashes**

In [43]:
# both dashes (or none); the result below is None only because ^ rejects the leading text 'hi my name is'
print(re.search(r'^\s*\d{3}(-\d{3}-|\d{3})\d{4}\s*$',' hi my name is 848-555-4321   '))

None


**And another pattern to catch a straight sequence of 10 digits**

In [44]:
# 10 digits in sequence
print(re.search(r'^\s*\d{10}\s*$','  8485554321   '))

<re.Match object; span=(0, 15), match='  8485554321   '>


**Final solution, single regexp to catch all variants**

In [45]:
phone_pattern = r'^\s*(\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-\d{4}|\d{10})\s*$'
test_strings = ["(123)4567890", 
                "(123)456-7890", 
                "1234567890", 
                "123-999-9999",
                "123999-1234",
                "222(333)4444"]
search_strings(phone_pattern, test_strings)


'(123)4567890' matches '(123)4567890'
'(123)456-7890' matches '(123)456-7890'
'1234567890' matches '1234567890'
'123-999-9999' matches '123-999-9999'
'123999-1234' does not match
'222(333)4444' does not match
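
One way to reuse the final pattern is to wrap it in a small validation helper. This is a sketch only; the names `PHONE_RE` and `is_phone_number` are hypothetical and not part of the lecture code.

```python
import re

# Same pattern as the final solution above, compiled once for reuse.
PHONE_RE = re.compile(
    r'^\s*(\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-\d{4}|\d{10})\s*$'
)

def is_phone_number(text):
    """Return True if text is in one of the accepted phone formats."""
    return PHONE_RE.search(text) is not None

print(is_phone_number(' (123)456-7890 '))
print(is_phone_number('123999-1234'))
```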


**Word boundary**
- \\b : matches only at word boundary (doesn't actually match any character, just sets the rule).
(Remember, a word is a sequence of alphanumeric characters plus underscore.)

In [46]:
# check if a string contains the word 'part'
pattern = r'\b[pP]art\b'
test_strings = ["I'm going to a party tomorrow", 
                "This is the best part of the movie.",
                "This is a big apartment.",
                'Til death do us part'] # end of string is also word boundary
search_strings(pattern, test_strings)

'I'm going to a party tomorrow' does not match
'This is the best part of the movie.' matches 'part'
'This is a big apartment.' does not match
'Til death do us part' matches 'part'


In [47]:
res = re.search(r'\b[eE]pisode\b',"Episode3 has a high rating.") 
print(res)
res = re.search(r'\b[eE]pisode\b',"Episode-3 has a high rating.") 
print(res)
res = re.search(r'\b[eE]pisode\b',"Episode 3 has a high rating.") 
print(res)

None
<re.Match object; span=(0, 7), match='Episode'>
<re.Match object; span=(0, 7), match='Episode'>


**In the above, since word characters include digits, there is no word boundary between 'Episode' and '3' in the first string**
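
The effect of `\b` is easy to see by counting matches with and without it. This sketch uses `re.findall`, a standard `re` function that returns all non-overlapping matches; the sentence is a made-up example.

```python
import re

sentence = 'Part one: a party in the apartment, part two.'

# \b on both sides keeps only the standalone word
whole_words = re.findall(r'\b[pP]art\b', sentence)
# without \b, the 'part' inside 'party' and 'apartment' is found too
anywhere = re.findall(r'[pP]art', sentence)

print(whole_words)
print(len(anywhere))
```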

#### Using the match function
**The match function always starts matching from the beginning of string**

In [48]:
print(re.search('ar','barbaric')) # 'ar' is in 'barbaric'
print(re.match('ar','barbaric')) # but 'barbaric' doesn't begin with 'ar'

<re.Match object; span=(1, 3), match='ar'>
None


In [49]:
# match all strings that start with ar, end with t, 
# and have at least one lowercase letter between

res = re.search('^ar[a-z]+t$', 'arrest')  # version 1, using search
print(res)
res = re.match('ar[a-z]+t$', 'arrest')  # version 2, using match   
print(res)

<re.Match object; span=(0, 6), match='arrest'>
<re.Match object; span=(0, 6), match='arrest'>


**Note that if you want to match an entire string with the match function, you will still need to use $ at the end**
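
As an alternative to adding `$`, the standard library also provides `re.fullmatch`, which succeeds only when the whole string matches. The lecture does not use it; this is a brief sketch for comparison.

```python
import re

prefix = re.match(r'ar[a-z]+t', 'arrested')      # matches the prefix 'arrest'
anchored = re.match(r'ar[a-z]+t$', 'arrested')   # None: $ rejects the trailing 'ed'
full = re.fullmatch(r'ar[a-z]+t', 'arrested')    # None: the whole string must match
exact = re.fullmatch(r'ar[a-z]+t', 'arrest')     # matches all of 'arrest'

print(prefix, anchored, full, exact, sep='\n')
```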

#### Using the match object returned by search/match
**Applying the methods group(), span(), start(), end()**

In [50]:
res = re.search('at', 'catch')  # returned Match object is stored in res 
res

<re.Match object; span=(1, 3), match='at'>

**group() returns the matched result string**

In [51]:
print(res.group())

at


**span() returns the tuple (start, end) of indices of the matching part of the original string (end is one past the last matched character)**

In [52]:
print(res.span())

(1, 3)


**start() and end() return starting and ending indices of matching part of original string**

In [53]:
print(res.start()) 
print(res.end()) 

1
3


**Of course, you can get these same values from the tuple returned by span()**

In [54]:
start,end = res.span()
print(start,',',end)

1 , 3


**By definition, re.match() will always return a span that starts at 0 (if a match is found)**

In [55]:
res = re.match(r'<.*?>','<span>This is within a span tag in html</span>')  # non-greedy
print(res.group())
print(res.span())
print(res.start())
print(res.end())

<span>
(0, 6)
0
6


In [56]:
res = re.match(r'<.*>','<span>This is within a span tag in html</span>')  # greedy
print(res.group())
print(res.span())
print(res.start())
print(res.end())

<span>This is within a span tag in html</span>
(0, 46)
0
46


**Be careful to check for existence of returned match object before applying methods!**

In [57]:
res = re.match('bar','sandbar')
print(res.group())

AttributeError: 'NoneType' object has no attribute 'group'

In [58]:
res = re.match('bar','sandbar')
print(res.group()) if res else print('No match')

No match


**Typical usage is to store the result in a variable, check that it is a Match object (not None), and then get the matched string with group()**

In [59]:
# find out if a string contains any sequence that starts with ar, ends with t, 
# and has at least one lowercase letter between
def substr(astr):
    res = re.search('ar[a-z]+t',astr)  
    print('Match:',res.group()) if res else print('No match')
        
substr('parasite')
substr('artist')
substr('part')

Match: arasit
Match: artist
No match


#### Splitting a string with split function

In [60]:
s = 'ab;cd'
re.split(';',s)

['ab', 'cd']

In [61]:
s.split(';')

['ab', 'cd']

In [62]:
s = 'Really? I mean, really?!'
re.split('[?!]',s)

['Really', ' I mean, really', '', '']

**Regexp split will split separately on each of the characters in the given class.
Also, notice the empty string returned between consecutive split characters,
and between consecutive split character and end of string**

In [63]:
s.split('?!')

['Really? I mean, really', '']

**But `str.split` treats its argument as a single delimiter string, so it splits only where the whole sequence '?!' appears.
An empty string is returned at the end, as in the regexp split**

In [64]:
# split into words, using \W (non-word character) as delimiter
res = re.split(r'\W+','This   is  a bunch of words!')
print(res)

['This', 'is', 'a', 'bunch', 'of', 'words', '']
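
If the trailing empty string is unwanted, one simple option is to filter it out after splitting. This is a minimal sketch, not the only way to do it.

```python
import re

text = 'This   is  a bunch of words!'

# keep only the non-empty pieces produced by the split
words = [w for w in re.split(r'\W+', text) if w]
print(words)
```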


#### Substituting in a string with sub function

In [65]:
# substitute all digits in 'Account number 1223456789' with '#'
re.sub(r'\d','#','Account number 1223456789')

'Account number ##########'

In [66]:
# substitute last 3 digits with '#'
re.sub(r'\d{3}$','###','Account number 1223456789')

'Account number 1223456###'

In [67]:
# removing comments from html
# <!-- this is a comment -->

htmlstr = 'Before comment...<!-- This is a comment -->, and after comment'
res = re.sub('<!--.*-->','',htmlstr) # replace comment with nothing
print(res)

Before comment..., and after comment


In [68]:
# Careful with greedy regexes!
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub('<!--.*-->','', htmlstr)  # replace comment with nothing
print(res)

Before first...  ... after second


**Since the regexp above does a greedy match, everything from the first '<!--' to the last '-->' is matched,
including the text between the two comments**

In [69]:
# make it non-greedy
htmlstr = 'Before first... <!-- comment1 -->between first and second <!-- comment2--> ... after second'
res = re.sub('<!--.*?-->','', htmlstr)  # replace comment with nothing
print(res)
res = re.sub('<!--[^>-]*-->','', htmlstr)  # replace comment with nothing
print(res)

Before first... between first and second  ... after second
Before first... between first and second  ... after second


In [70]:
# does it work with a multiline string?
htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""
res = re.sub('<!--.*?-->','', htmlstr2)  # replace comment with nothing
print(res)

<!-- first 
comment -->Not a comment


**The '.' metacharacter does not match a newline**
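
One option here is the `re.DOTALL` flag, which makes '.' match newlines too. This sketch re-creates the multiline string from the cell above so it runs on its own.

```python
import re

htmlstr2 = """<!-- first 
comment -->Not a comment<!-- comment2 -->"""

# flags=re.DOTALL lets '.' match newline characters as well
res = re.sub(r'<!--.*?-->', '', htmlstr2, flags=re.DOTALL)
print(res)
```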

In [71]:
res = re.sub('<!--[^>-]*-->','', htmlstr2)  # replace comment with nothing
print(res)

Not a comment


#### Grouping/Capturing

In [72]:
# want to extract ("capture") area code and local part from phone number
# assume format (ddd)ddd-dddd

res = re.match(r'\s*\((\d{3})\)(\d{3})-(\d{4})', '(848)555-4321')

**Notice the grouping/capturing with parentheses around the area code part, as in (\d{3}),
and likewise around each of the two remaining parts of the number**

In [73]:
print(res.group())  # for the whole thing
print(res.groups()) # for all parts captured with ( )
print(res.group(0)) # entire thing
print(res.group(1)) # first grouping with ( )
print(res.group(2)) # second grouping with ( )
print(res.group(3))

(848)555-4321
('848', '555', '4321')
(848)555-4321
848
555
4321


In [2]:
import re

In [3]:
# equally, you can use search instead of match, just make sure to use ^ for start of string
res = re.search(r'^\s*\((\d{3})\)(\d{3})-(\d{4})', '(848)555-4321')

In [4]:
print(res.group())  # for the whole thing
print(res.groups()) # for all parts grouped with ( )
print(res.group(0)) # entire thing
print(res.group(1)) # first grouping with ( )
print(res.group(2)) # second grouping with ( )
print(res.group(3))

(848)555-4321
('848', '555', '4321')
(848)555-4321
848
555
4321


In [76]:
# alternatively, you can index into the groups() tuple
print(res.groups()[0])
print(res.groups()[1])

848
555


In [77]:
# iterate through all the groups
res = re.match(r'\s*\((\d{3})\)(\d{3})-(\d{4})', '(848)555-4321')
if res:
    for gr in res.groups():
        print(gr)

848
555
4321


**Numbering and back-referencing capture groups**

In [78]:
# captures can be numbered, and backreferenced using numbers
res = re.search(r'(\d+)-(.*)-\2','123-456-4567')
print(res)

<re.Match object; span=(0, 11), match='123-456-456'>


In [79]:
# captures can be numbered, and backreferenced using numbers
res = re.search(r'(air).*\1','cool air or hot air')
print(res)

<re.Match object; span=(5, 19), match='air or hot air'>


**When using back references, make sure to use raw string for the regexp, otherwise it won't work, see below**

In [80]:
# same as the cell above, but without using raw string
res = re.search('(air).*\1','cool air or hot air')
print(res)
res = re.search('(air).*\\1','cool air or hot air')
print(res)

None
<re.Match object; span=(5, 19), match='air or hot air'>
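
The reason the non-raw version fails is that Python itself interprets `'\1'` as an octal escape before the regex engine ever sees it. A small sketch to inspect the strings involved (the variable names are illustrative):

```python
import re

octal = '\1'                  # Python turns this into the single character chr(1)
pattern_plain = '(air).*\1'   # so this regex ends with chr(1), not a backreference
pattern_raw = r'(air).*\1'    # raw string keeps the backslash for the regex engine

print(octal == chr(1))
print(len(pattern_plain), len(pattern_raw))
print(pattern_raw == '(air).*\\1')
```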


#### Pre-compiling a regular expression

**Sometimes it's easier to "compile" a regular expression and use it several times later**

In [81]:
pattrn = re.compile(r'\s*\((\d{3})\)(\d{3}-\d{4})')
res = pattrn.match('(848)555-4321')
print(res.groups())

('848', '555-4321')


In [82]:
patt = re.compile(r'\s*#?\s*(\d+)')
res = patt.match('#25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())
res = patt.match(' # 25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())
res = patt.match(' 25 Infinite Loop,Cupertino,CA 12345')
print(res.groups())

('25',)
('25',)
('25',)


**Exercise 2**
<pre>
Given a string of the form:
     '"&lt;last name>, &lt;first name>",&lt;netid>'

Output the string:
     '&lt;first name>,&lt;last name>,&lt;netid>@rutgers.edu'

e.g. '"  Smith,   Bob ", bs123 ' => 'Bob,Smith,bs123@rutgers.edu'
</pre>

In [83]:
#ex2_pattern = r''  # your search pattern here
#ex2_sub = r''  # your substitution pattern here

In [84]:
student_str = '"  Smith ,   Bob " , bs123 '
res = re.sub(ex2_pattern, ex2_sub, student_str)
print(res)

Bob,Smith,bs123@rutgers.edu


In [85]:
# what if we try pre-compiling both strings?
student_str = '"  Smith,   Bob ", bs123 '
target = re.compile(ex2_pattern)
repl = re.compile(ex2_sub)
res = re.sub(target,repl,student_str)
print(res)

error: invalid group reference 2 at position 1

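**The above doesn't work: `re.compile` tries to parse the replacement string as a search pattern, where `\2` is a backreference to a capture group that the string itself never defines. Only the search pattern should be compiled; the replacement stays a plain (raw) string**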
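
A working arrangement is to compile only the search pattern and pass the replacement as an ordinary raw string to the compiled pattern's `sub` method. The sketch below uses a made-up date example (not the exercise), and the names `date_pattern` and `us_style` are hypothetical.

```python
import re

# Only the search pattern is compiled; it defines three capture groups.
date_pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# The replacement is a plain raw string; \1, \2, \3 refer to the groups above.
us_style = date_pattern.sub(r'\2/\3/\1', 'Due on 2023-10-31')
print(us_style)
```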

**Solutions to Exercises**

In [86]:
phone_pattern_1 = r'\(\d{3}\)\d{3}-?\d{4}'

In [38]:
ex2_pattern = r'"\s*(\S+)\s*,\s*(\S+)\s*"\s*,\s*(\w+)\s*'
ex2_sub = r'\2,\1,\3@rutgers.edu'