# Lesson 4 - Regular Expressions

Regular expressions (regex) are a powerful way to search text for patterns. They are available in most programming languages, including Python, in slightly different flavours. A regex uses a compact syntax to specify which characters to match and how many times they may repeat.

In Python, regular expressions are provided by the standard-library module `re`, which must be imported before it can be used.

A very useful website to build and test regular expressions is available at <https://regexr.com/>.

__NOTE__: Regex patterns often use `\` to introduce special matchers (e.g. `\s` matches any whitespace character). This conflicts with Python's own string escapes (e.g. `\n` means a newline character). To avoid the clash, regex patterns are usually written as raw strings, `r"raw string"`, in which backslashes are kept literally instead of being processed as escapes.
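The difference is easiest to see with `\b`, which is a word boundary in a regex but a backspace character in an ordinary Python string. The following is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "an id card"  # illustrative sample, not part of the lesson data

# Without the r prefix, Python turns '\b' into a single backspace character
# before the regex engine ever sees the pattern.
plain_pattern = '\bid\b'
# With the r prefix, the backslash and the 'b' both reach the regex engine,
# which reads them as a word boundary.
raw_pattern = r'\bid\b'

print(len('\b'), len(r'\b'))  # 1 2

plain_matches = re.findall(plain_pattern, text)
raw_matches = re.findall(raw_pattern, text)
print(plain_matches)  # []
print(raw_matches)    # ['id']
```

Using raw strings for every pattern, even ones without backslashes, is a common habit that avoids this class of surprise.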

## Regex syntax overview

```python
import re

long_text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum'''

print(r'match regular characters:', re.findall('qui', long_text))
print(r'match character classes with []:', re.findall('[a-zA-Z0-9]d', long_text))
print(r'built-in classes: \w (word), \s (spaces), \d (digits), and their negations \W, \S, \D...', re.findall(r'\s\wd\s', long_text))
print(r'match word boundaries with \b', re.findall(r'\b\wd\b', long_text))
print('match one or more character with +', re.findall(r'\b[a-i]+\b', long_text))
print('match zero or one character with ?', re.findall(r'\b\w?d\b', long_text))
print('match zero or more character with *', re.findall(r'\bdo\w*\b', long_text))
print('match a specific number of characters with {}', re.findall(r'\b[a-z]{1,2}\b', long_text))
print('match either pattern with |', re.findall(r'\b(a|i)d\b', long_text))
```



```text
match regular characters: ['qui', 'qui', 'qui']
match character classes with []: ['ad', 'ed', 'od', 'id', 'id', 'ad', 'ud', 'od', 'nd', 'id', 'id', 'id']
built-in classes: \w (word), \s (spaces), \d (digits), and their negations \W, \S, \D... [' ad ', ' id ']
match word boundaries with \b ['ad', 'id']
match one or more character with + ['ad', 'ea', 'id']
match zero or one character with ? ['ad', 'id']
match zero or more character with * ['dolor', 'do', 'dolore', 'dolor', 'dolore']
match a specific number of characters with {} ['do', 'ut', 'et', 'ad', 'ut', 'ex', 'ea', 'in', 'in', 'eu', 'in', 'id']
match either pattern with | ['a', 'i']
```
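Note that the last result is `['a', 'i']` rather than `['ad', 'id']`. When a pattern contains a capturing group `(...)`, `re.findall` returns the text captured by the group instead of the whole match. A non-capturing group `(?:...)` keeps the alternation but returns the full match. The sketch below uses a shortened sample string chosen for illustration.

```python
import re

# Shortened, illustrative sample containing the words 'ad' and 'id'.
text = "ut enim ad minim veniam, anim id est laborum"

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'\b(a|i)d\b', text)
print(captured)  # ['a', 'i']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\b(?:a|i)d\b', text)
print(whole)  # ['ad', 'id']

# A character class is an equivalent, simpler choice for single characters.
with_class = re.findall(r'\b[ai]d\b', text)
print(with_class)  # ['ad', 'id']
```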


## Usage in Python

```python
print('Find all matching substrings:', re.findall('is', 'This is a string'))
print('Does this *entire* string match the pattern?', re.fullmatch('is', 'This is a string'))
print('Replace all matches:', re.sub('is', 'ese', 'This is a string'))
print('Split a string by regex separators:', re.split(r'\W+', long_text))
```


```text
Find all matching substrings: ['is', 'is']
Does this *entire* string match the pattern? None
Replace all matches: These ese a string
Split a string by regex separators: ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', 'Ut', 'enim', 'ad', 'minim', 'veniam', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', 'commodo', 'consequat', 'Duis', 'aute', 'irure', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', 'nulla', 'pariatur', 'Excepteur', 'sint', 'occaecat', 'cupidatat', 'non', 'proident', 'sunt', 'in', 'culpa', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', 'est', 'laborum']
```
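`re.fullmatch` returns `None` when the whole string does not match, and a match object when it does. As a small sketch that goes slightly beyond the calls above, `re.search` (also part of the `re` module) finds the first match anywhere in the string and returns the same kind of match object. The variable names here are illustrative.

```python
import re

sentence = 'This is a string'

# fullmatch only succeeds if the pattern covers the entire string.
no_match = re.fullmatch('is', sentence)
print(no_match)  # None

# Letters, digits, underscores and spaces cover the whole sentence.
full = re.fullmatch(r'[\w ]+', sentence)
print(full.group())  # This is a string

# search returns the first match anywhere in the string.
first = re.search('is', sentence)
print(first.group(), first.span())  # is (2, 4)
```

Because a failed match is `None`, it is usually worth checking the result before calling `.group()` on it.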
