# Regular Expressions (RegEx)


#### Pattern matching

- Check whether a pattern matches a string
- Most programming languages have some implementation of this idea

A useful tool for constructing and debugging regular expressions: https://regex101.com/

Very good tutorial on regex: https://docs.python.org/3.8/howto/regex.html

In [None]:
# we already saw some pattern matching

'ab' in 'table'

In [None]:
'booking'.find('ing')

In [None]:
# Does the pattern (x followed by anything followed by y) match a string?

def mymatch(x, y, text):
    """Return True if x occurs in text with y somewhere after it."""
    pos = text.find(x)  # index of the first x, or -1 if x is absent
    # any y after the first x is enough; pos + 1 skips x itself (matters when x == y)
    return pos != -1 and y in text[pos + 1:]


In [None]:
mymatch('a', 'b', 'zanzibar')

In [None]:
import re

result = re.search('a.*b', 'zanzibar')
result

In [None]:
result.group()  # group, span, start, end are only available if match succeeds!

In [None]:
result.start()

In [None]:
result.end()  # index just AFTER the final character of the match

In [None]:
result.span()

#### search
- stops at the first match found and returns a match object (with start, end, span, group)
- returns None when there is no match
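
Because `search` returns None on failure, calling `.group()` directly on the result raises an error when nothing matched. A minimal sketch of guarding against that, using a hypothetical helper `first_match` that is not part of the `re` module:

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    result = re.search(pattern, text)
    if result is None:          # no match: there is no group() to call
        return None
    return result.group()

print(first_match('a.*b', 'zanzibar'))  # anzib
print(first_match('a.*b', 'zanzi'))     # None
```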

#### findall
- returns a list of matches (only the strings)
- returns empty list when there is no match

#### finditer
- can be used to loop through matches one at a time
- each item is a match object (with span and the matched text)

#### match
- only matches at the start of the string
- returns a match object (with span and group, like search), or None when there is no match

Matching is greedy: a quantifier such as `*` or `+` matches as much of the string as it can.
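
To see what greediness means in practice, here is a small sketch comparing `.*` with its non-greedy form `.*?`. The non-greedy form is not covered in the notes above; it is included only for contrast, and the sample text is made up.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.*>', text)   # .* grabs as much as possible
lazy = re.findall(r'<.*?>', text)    # .*? stops at the first possible '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```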

### Patterns
- single symbols ('a', 'f', ...)
- concatenation of symbols ('ab', 'bar', ...)
- disjunction ('a|b', '(ab)|c', 'a[bc]')  (either or, any of)
- sets
    - [abc] a b or c
    - [^abc] not a, b or c
    - [a-z] a character in the range a-z
    - [a-zA-Z] a character in the range a-z or A-Z
    - [^a-zA-Z] a character NOT in the range a-z or A-Z
- any character (.)
- start of string ^ and end of string $
- a whitespace character \s
- a non-whitespace character \S
- a digit \d
- a non-digit \D
- a word character \w
- a non word character \W
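- a word boundary \b (matches a position between a word and a non-word character, not a character itself)
- a non word boundary \B (a position that is not a word boundary)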
- quantifiers
    - a? zero or one 'a'
    - a+ one or more 'a'
    - a* zero or more 'a'
    - a{5} exactly 5 of 'a'
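
A few of the patterns above are easier to grasp with a short example. The sketch below tries word boundaries, the start and end anchors, and the `?` quantifier on made-up sample strings; raw strings (`r'...'`) are used so that Python passes the backslashes to `re` unchanged.

```python
import re

sentence = 'the cat scattered the catalogue'

# \b matches a position, so \bcat\b only finds 'cat' as a whole word
whole_word = re.findall(r'\bcat\b', sentence)
any_cat = re.findall(r'cat', sentence)

# ^ and $ tie the pattern to the start and end of the string
date_pattern = r'^\d{2}/\d{2}/\d{4}$'
is_date = re.search(date_pattern, '12/10/1990') is not None
not_date = re.search(date_pattern, 'on 12/10/1990') is not None

# ? makes the preceding symbol optional
spellings = re.findall(r'colou?r', 'color or colour')

print(whole_word)         # ['cat']
print(len(any_cat))       # 3
print(is_date, not_date)  # True False
print(spellings)          # ['color', 'colour']
```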

In [None]:
result = re.findall('bar', 'barbara')
result

In [None]:
result = re.search('bar', 'barbara')
result

In [None]:
result = re.match('bar', 'barbara')
result

In [None]:
result = re.search('bar', 'rabarbara')
result

In [None]:
result = re.match('bar', 'rabarbara')
result

In [None]:
result = re.findall('ba', 'zanzibarbar')
result

In [None]:
for match in re.finditer('bar', 'zanzibarbar'):
    print(match)

In [None]:
for match in re.finditer('[0-9]+', '12/10/1990'):
    print(match)

In [None]:
re.findall('o', 'www.google.com') # find all o

In [None]:
re.search('oo', 'www.google.com') # find two consecutive o's

In [None]:
re.findall('[ow]+', 'www.google.com') # find sequences of 1 or more o or w

In [None]:
re.findall(r'[a-z]+', 'www.google.com') # find sequences of one or more lowercase alphabetic character

In [None]:
re.search('o{2}', 'www.google.com') # find sequence of exactly 2 o's

In [None]:
re.findall(r'\.', 'www.google.com') # find dots (escape!); raw string keeps the backslash intact

In [None]:
re.findall(r'\w', 'www.google.com') # find word characters

In [None]:
re.findall(r'\W', 'www.google.com') # find non-word characters

In [None]:
re.findall(r'\s', 'www\n   google\tcom') # find whitespace characters

In [None]:
re.findall('[0-9]+', '12/10/1990') # find runs of one or more digits

In [None]:
re.findall(r'\d+', '12/10/1990') # same, using the digit shorthand

In [None]:
re.findall('[^0-9]+', '12/10/1990') # find runs of one or more non-digit characters

### Gaining a bit more efficiency in pattern matching

In [None]:
# Compile a regular expression
p = re.compile('[0-9]+')
p.findall('12/10/1990')

In [None]:
%%time
n = 0
while n < 1000000:
    re.findall('[0-9]+', '12/10/1990')
    n += 1

In [None]:
%%time
p = re.compile('[0-9]+')
n = 0
while n < 1000000:
    p.findall('12/10/1990')
    n += 1
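
Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The function names below are hypothetical and the repeat count is kept small so the example runs quickly. The difference tends to be modest, because `re` caches recently compiled patterns internally; the timings depend on the machine.

```python
import re
import timeit

PATTERN = '[0-9]+'
TEXT = '12/10/1990'
compiled = re.compile(PATTERN)  # compile once, reuse many times

def with_module_function():
    # re.findall has to look the pattern up on every call
    return re.findall(PATTERN, TEXT)

def with_compiled_pattern():
    # the compiled pattern object is used directly
    return compiled.findall(TEXT)

# increase number for steadier timings
t_module = timeit.timeit(with_module_function, number=100_000)
t_compiled = timeit.timeit(with_compiled_pattern, number=100_000)
print(f'module-level: {t_module:.3f}s, compiled: {t_compiled:.3f}s')
```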
