## Week 2 Friday Morning
### Hammond Chapters 6-7: Regular Expressions, Text Manipulation, Comprehensions

#### Pattern matching

- Check whether a pattern matches a string
- Most programming languages have some implementation of this idea 

Interesting tool for help with constructing and debugging regular expressions: https://regex101.com/

Very good tutorial on regex: https://docs.python.org/3.8/howto/regex.html

In [1]:
# we already saw some pattern matching

'ab' in 'table'

True

In [2]:
'booking'.index('ing')

4

In [4]:
# Does the pattern (x followed by anything followed by y) match a string?

def mymatch(x, y, text):
    'x followed by y with anything in between'
    for pos, letter in enumerate(text):
        if letter == x:
            # the first x leaves the longest tail, so checking it is enough;
            # start at pos + 1 so that x itself is not counted as y
            return y in text[pos + 1:]
    return False


In [5]:
mymatch('a', 'b', 'zanzibir')

True

In [6]:
import re

result = re.search('a.*b', 'zanzibar')
result

<re.Match object; span=(1, 6), match='anzib'>

In [7]:
result.group()  # group, span, start, end are only available if match succeeds!

'anzib'

In [8]:
result.start()

1

In [9]:
result.end() # is index of character AFTER the final char of the match

6

In [10]:
result.span()

(1, 6)

#### search 
- stops with first match found (has start, end, span, group)
- returns None when there is no match
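
Because `re.search` returns `None` on failure, calling `.group()` directly on a failed search raises an `AttributeError`. A minimal sketch of guarding against that; the helper name `first_match` is hypothetical and only used for illustration:

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    result = re.search(pattern, text)
    if result is None:       # no match: result.group() would raise AttributeError
        return None
    return result.group()

found = first_match('a.*b', 'zanzibar')
missing = first_match('x.*y', 'zanzibar')
print(found, missing)
```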

#### findall
- returns a list of matches (only the strings)
- returns empty list when there is no match
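
One detail worth knowing: when the pattern contains round-bracket groups, `re.findall` returns the group contents rather than the whole match. A small sketch, reusing the date string from these notes:

```python
import re

date = '12/10/1990'

# No groups: each item is the whole match
no_groups = re.findall(r'\d+/\d+', date)

# One group: each item is the text captured by that group
one_group = re.findall(r'(\d+)/', date)

# Several groups: each item is a tuple with one entry per group
two_groups = re.findall(r'(\d+)/(\d+)', date)

print(no_groups, one_group, two_groups)
```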

#### finditer
- can be used to loop through matches
- each match has span and what is matched
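
As a sketch of what looping with `re.finditer` gives you compared to `re.findall`, the loop below collects the matched text together with its position (the variable name `found` is just for illustration):

```python
import re

found = []
for m in re.finditer('bar', 'zanzibarbar'):
    # each m is a match object, so group(), start() and end() are available
    found.append((m.group(), m.start(), m.end()))

print(found)
```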

#### match
- only succeeds if the pattern matches at the start of the string (returns None otherwise)
- the result has a span and a match (like group in search)
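
A short sketch contrasting `re.match` with `re.search`. It also uses `re.fullmatch`, which is not covered in these notes but is part of the standard `re` module and requires the pattern to cover the entire string:

```python
import re

word = 'rabarbara'

at_start = re.match('bar', word)          # None: 'bar' is not at position 0
anywhere = re.search('bar', word)         # succeeds at span (2, 5)
anchored = re.search('^bar', word)        # None: ^ ties the search to the start
whole = re.fullmatch('bar', 'barbara')    # None: 'bar' does not cover all of 'barbara'

print(at_start, anywhere.span(), anchored, whole)
```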

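Matching is greedy: a quantifier such as `*` or `+` takes as many characters as it can while still letting the whole pattern match.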
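
To make the greedy behaviour concrete, the sketch below compares `.*` with its non-greedy form `.*?` (the `?` suffix is standard `re` syntax, though it is not in the pattern list of these notes):

```python
import re

text = 'zanzibarbar'

# Greedy: .* runs as far as it can, so the match ends at the LAST 'b'
greedy = re.search('a.*b', text).group()

# Non-greedy: .*? stops as early as it can, so the match ends at the FIRST 'b'
lazy = re.search('a.*?b', text).group()

print(greedy, lazy)
```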

### Patterns
- single symbols ('a', 'f', ...)
- concatenation of symbols ('ab', 'bar', ...)
- disjunction ('a|b', '(ab)|c', 'a[bc]')  (either or, any of)
- sets
    - [abc] a b or c
    - [^abc] not a, b or c
    - [a-z] a character in the range a-z
    - [a-zA-Z] a character in the range a-z or A-Z
    - [^a-zA-Z] a character NOT in the range a-z or A-Z
- any character (.)
- start of string ^ and end of string $
- a whitespace character \s
- a non-whitespace character \S
- a digit \d
- a non-digit \D
- a word character \w
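- a non-word character \W
- a word boundary \b (a position between a word and a non-word character, not a character itself)
- a non-word boundary \B (any position that is not a word boundary)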
- quantifiers
    - a? zero or one 'a'
    - a+ one or more 'a'
    - a* zero or more 'a'
    - a{5} exactly 5 of 'a'
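
A few of the pattern elements above in action. This is only an illustrative sketch; the example sentence and variable names are made up for the purpose:

```python
import re

sentence = 'the cat concatenated 3 cats'

anywhere = re.findall(r'cat', sentence)          # concatenation of symbols
whole_word = re.findall(r'\bcat\b', sentence)    # \b: 'cat' only as a separate word
optional_s = re.findall(r'cats?', sentence)      # ?: zero or one 's'
digits = re.findall(r'\d', sentence)             # a digit
at_start = re.findall(r'^the', sentence)         # ^: start of string
at_end = re.findall(r'cats$', sentence)          # $: end of string
doubled = re.findall(r'a{2}', 'baa baaa')        # {2}: exactly two 'a'
either = re.findall(r'c|t', 'cat')               # disjunction
not_letters = re.findall(r'[^a-z]+', sentence)   # negated set

print(anywhere, whole_word, optional_s, digits)
print(at_start, at_end, doubled, either, not_letters)
```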

In [11]:
result = re.findall('bar', 'barbara')
result

['bar', 'bar']

In [12]:
result = re.search('bar', 'barbara')
result

<re.Match object; span=(0, 3), match='bar'>

In [13]:
result = re.match('bar', 'barbara')
result

<re.Match object; span=(0, 3), match='bar'>

In [14]:
result = re.search('bar', 'rabarbara')
result

<re.Match object; span=(2, 5), match='bar'>

In [18]:
result = re.match('bar', 'rabarbara')  # No match because pattern not found at the start
result

In [19]:
result = re.findall('ba', 'zanzibarbar')
result

['ba', 'ba']

In [20]:
for match in re.finditer('bar', 'zanzibarbar'):
    print(match)

<re.Match object; span=(5, 8), match='bar'>
<re.Match object; span=(8, 11), match='bar'>


In [21]:
for match in re.finditer(r'\d+', '12/10/1990'):  # raw string: avoids the invalid-escape warning for \d
    print(match)

<re.Match object; span=(0, 2), match='12'>
<re.Match object; span=(3, 5), match='10'>
<re.Match object; span=(6, 10), match='1990'>


In [22]:
re.findall('o', 'www.google.com') # find all o

['o', 'o', 'o']

In [23]:
re.search('oo', 'www.google.com') # find two consecutive o's

<re.Match object; span=(5, 7), match='oo'>

In [24]:
re.findall('[ow]+', 'www.google.com') # find sequences of 1 or more o or w

['www', 'oo', 'o']

In [25]:
re.findall('[a-z]+', 'www.Google.com') # find sequences of one or more lowercase alphabetic characters

['www', 'oogle', 'com']

In [26]:
re.search('o{2}', 'www.google.com') # find sequence of exactly 2 o's

<re.Match object; span=(5, 7), match='oo'>

In [27]:
re.findall(r'\.', 'www.google.com') # find dots (escape!)

['.', '.']

In [28]:
re.findall(r'\w', 'www.google.com') # find word characters

['w', 'w', 'w', 'g', 'o', 'o', 'g', 'l', 'e', 'c', 'o', 'm']

In [29]:
re.findall(r'\W', 'www.google.com') # find non-word characters

['.', '.']

In [30]:
print('www\n   google\tcom')

www
   google	com


In [31]:
re.findall(r'\s', 'www\n   google\tcom') # find whitespace characters

['\n', ' ', ' ', ' ', '\t']

In [32]:
re.findall('[0-9]+', '12/6/1990') # find sequences of digits

['12', '6', '1990']

In [33]:
re.findall(r'\d+', '12/10/1990') # find sequences of digits, using \d

['12', '10', '1990']

In [34]:
re.findall('[^0-9]+', '12/10/1990') # find sequences of non-digit characters

['/', '/']

### Exercise

Find a regular expression that matches time-of-day strings (such as 9:14 am or 08:20PM). Check that the matching is robust to all sorts of variations, such as:

  - 9 : 14 AM
  - 09:14 pm
  - 9: 14 Am
  - 9:14PM
  - ...

In [36]:
re.search(r'[0-9]?[0-9]\s*:\s*[0-9][0-9]\s*[aApP][mM]', 'Meet me in Starbucks at 09: 14 Am')

<re.Match object; span=(24, 33), match='09: 14 Am'>
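
One way to check robustness is to run the pattern over all the variations from the exercise. The sketch below does that with `re.fullmatch` (so the whole candidate has to fit) and also shows a limitation: the pattern checks the shape of the string only, not whether the time is valid. The names `TIME_PATTERN`, `variations` and `matched` are chosen for illustration.

```python
import re

# Same pattern as in the cell above, written as a raw string
TIME_PATTERN = r'[0-9]?[0-9]\s*:\s*[0-9][0-9]\s*[aApP][mM]'

variations = ['9 : 14 AM', '09:14 pm', '9: 14 Am', '9:14PM', '08:20PM']

matched = []
for candidate in variations:
    # fullmatch succeeds only if the entire candidate fits the pattern
    if re.fullmatch(TIME_PATTERN, candidate):
        matched.append(candidate)

print(matched == variations)

# Shape only: an impossible time still matches
shape_only = re.fullmatch(TIME_PATTERN, '99:99 am') is not None
print(shape_only)
```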

## Replacing and backreferences

In [37]:
# re.sub takes a pattern, a replacement, and a string as input and returns a string in which 
# all instances of the pattern are replaced.

re.sub('/', '-', '12/10/1990') # replace the / by hyphen in the date

'12-10-1990'

In [38]:
# By using round brackets, you can define groups in a pattern. These can be accessed by a number preceded
# by a backslash (\1, \2, ...), which allows you to change the order of the groups.

re.sub(r'([0-9]+)/([0-9]+)/([0-9]+)', r'\3-\2-\1', '12/10/1990') # raw strings (r'...') keep the backslashes as they are; / needs no escaping

'1990-10-12'
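
The group numbers used in the replacement are the same ones a match object exposes through `.group(n)`. A small sketch under that reading; the second input string is made up to show that `re.sub` rewrites every date it finds:

```python
import re

date_pattern = r'([0-9]+)/([0-9]+)/([0-9]+)'

m = re.search(date_pattern, '12/10/1990')
whole = m.group(0)      # group 0 is the entire match
day = m.group(1)        # first pair of round brackets
year = m.group(3)       # third pair of round brackets
parts = m.groups()      # all groups as a tuple

# \3-\2-\1 puts the groups back in reverse order, for every match in the string
rewritten = re.sub(date_pattern, r'\3-\2-\1', 'born 12/10/1990, moved 1/2/2003')

print(whole, day, year, parts)
print(rewritten)
```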

In [39]:
re.sub('[aiouy]', 'e', 'it was a dark and windy night') # replace the vowels a, i, o, u (and y) by e

'et wes e derk end wende neght'

In [40]:
re.sub('[aiouy]', 'e', 'it was a dark and windy night', count=3) # stop after 3 replacements

'et wes e dark and windy night'

In [41]:
# If you are replacing one character by another character, consider .translate() 

my_table = str.maketrans('aeiou', '12345')

s1 = 'The moon shone bright'
s1.translate(my_table)

'Th2 m44n sh4n2 br3ght'

In [42]:
# .split() and re.split()
# Suppose you get these lines from a csv (comma separated value) file

s1 = 'large  , blue'
s2 = 'small, red'
s3 = 'small  ,orange'
s4 = 'large,orange'

In [43]:
s3.split(',')

['small  ', 'orange']

In [46]:
re.split(r'\s*,\s*', s3)

['small', 'orange']
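
Applied to all four example lines, the same `re.split` call gives clean fields regardless of the spacing around the comma. A minimal sketch (the list names `lines` and `rows` are illustrative):

```python
import re

lines = ['large  , blue', 'small, red', 'small  ,orange', 'large,orange']

rows = []
for line in lines:
    # the separator is a comma plus any whitespace on either side of it
    rows.append(re.split(r'\s*,\s*', line))

print(rows)
```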

In [45]:
# .join() concatenates strings into a larger string separated by the string it is used on

s = 'How now brown cow'
l = s.split()
l

['How', 'now', 'brown', 'cow']

In [48]:
# now concatenate them together again!

" ".join(l)

'How now brown cow'

In [49]:
# or with a different separator
",".join(l)

'How,now,brown,cow'

### Gaining a bit more efficiency in pattern matching

In [50]:
# Compile a regular expression
p = re.compile('[0-9]+')
p.findall('12/10/1990')

['12', '10', '1990']

In [51]:
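%%time
# %%time (cell magic, first line of the cell) times the whole cell. The output below was recorded with a bare
# %time line, which times an empty statement, so its microsecond figures do not measure the loop.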
n = 0
while n < 1000000:
    re.findall('[0-9]+', '12/10/1990')
    n += 1

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


In [52]:
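%%time
# %%time (cell magic, first line of the cell) times the whole cell. The output below was recorded with a bare
# %time line, which times an empty statement, so its microsecond figures do not measure the loop.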
p = re.compile('[0-9]+')
n = 0
while n < 1000000:
    p.findall('12/10/1990')
    n += 1


CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.29 µs


In [53]:
p

re.compile(r'[0-9]+', re.UNICODE)
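
Outside a notebook, the standard `timeit` module is one way to compare the two approaches. The sketch below makes no claim about the size of the difference: the `re` module caches recently compiled patterns, so the gain from compiling yourself is often modest and will vary by machine. The repetition count of 100,000 is an arbitrary choice.

```python
import re
import timeit

text = '12/10/1990'
compiled = re.compile('[0-9]+')

# Module-level function: the pattern is looked up in re's internal cache on every call
t_module = timeit.timeit(lambda: re.findall('[0-9]+', text), number=100_000)

# Precompiled pattern object: no lookup needed
t_compiled = timeit.timeit(lambda: compiled.findall(text), number=100_000)

print(f're.findall:       {t_module:.3f} s')
print(f'compiled.findall: {t_compiled:.3f} s')
```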

## Comprehensions

In [54]:
# comprehensions work for lists, sets, and dictionaries

In [55]:
# given a list of numbers, make a list with the square of each number

l = [1, 2, 3, 4, 5]

result = []
for number in l:
    result.append(number**2)
    
result

[1, 4, 9, 16, 25]

In [57]:
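# the same loop pattern written as a list comprehension (here multiplying by 3 instead of squaring)
[number*3 for number in l]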

[3, 6, 9, 12, 15]

In [58]:
# only the ones larger than 2

[x**2 for x in l if x > 2]

[9, 16, 25]

In [59]:
word = 'banapple'
vowels = 'aeiou'

['V' if letter in vowels else 'C' for letter in word]

['C', 'V', 'C', 'V', 'C', 'C', 'C', 'V']

In [60]:
l1 = [1, 2, 3, 4]
l2 = [5, 6, 7, 8]

[x*y for x in l1 for y in l2]

[5, 6, 7, 8, 10, 12, 14, 16, 15, 18, 21, 24, 20, 24, 28, 32]

In [62]:
s1 = 'abc'
s2 = '123'
s3 = ':,.'

print([x+y+z for x in s1 for y in s2 for z in s3])

['a1:', 'a1,', 'a1.', 'a2:', 'a2,', 'a2.', 'a3:', 'a3,', 'a3.', 'b1:', 'b1,', 'b1.', 'b2:', 'b2,', 'b2.', 'b3:', 'b3,', 'b3.', 'c1:', 'c1,', 'c1.', 'c2:', 'c2,', 'c2.', 'c3:', 'c3,', 'c3.']


In [63]:
result = []

for x in s1:
    for y in s2:
        for z in s3:
            result.append(x+y+z)
            
print(result)

['a1:', 'a1,', 'a1.', 'a2:', 'a2,', 'a2.', 'a3:', 'a3,', 'a3.', 'b1:', 'b1,', 'b1.', 'b2:', 'b2,', 'b2.', 'b3:', 'b3,', 'b3.', 'c1:', 'c1,', 'c1.', 'c2:', 'c2,', 'c2.', 'c3:', 'c3,', 'c3.']


In [65]:
text = [['Linguistics', 'in', 'the', 'courtroom'], ['Literary', 'Detective', 'Work', 'on', 'the', 'Computer']]
text

[['Linguistics', 'in', 'the', 'courtroom'],
 ['Literary', 'Detective', 'Work', 'on', 'the', 'Computer']]

In [67]:
# What does this expression do?

len([letter for sentence in text for word in sentence for letter in word if letter in 'aeiou'])

# It's a vowel counter! (it counts lowercase vowels only)

23
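
If the nested comprehension is hard to read, it may help to see it unrolled. The sketch below writes the same count as explicit loops: the `for` clauses of the comprehension appear in the same order, from outermost to innermost, followed by the `if`.

```python
text = [['Linguistics', 'in', 'the', 'courtroom'],
        ['Literary', 'Detective', 'Work', 'on', 'the', 'Computer']]

count = 0
for sentence in text:              # first for-clause of the comprehension
    for word in sentence:          # second for-clause
        for letter in word:        # third for-clause
            if letter in 'aeiou':  # the trailing if acts as a filter
                count += 1

print(count)
```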

In [68]:
# Also works for dictionaries

sentence = ['Linguistics', 'in', 'the', 'courtroom']
{w: len(w) for w in sentence}

{'Linguistics': 11, 'in': 2, 'the': 3, 'courtroom': 9}
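
Sets are mentioned above but not shown. A sketch of a set comprehension, reusing the nested `text` list from earlier: curly braces without a `key: value` pair build a set, so duplicates (here 'the') collapse into one entry.

```python
text = [['Linguistics', 'in', 'the', 'courtroom'],
        ['Literary', 'Detective', 'Work', 'on', 'the', 'Computer']]

# Set comprehension: unique lowercased words across both sentences
unique_words = {word.lower() for sentence in text for word in sentence}

# Total number of words, for comparison
total_words = len([word for sentence in text for word in sentence])

print(len(unique_words), total_words)
```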