# Regular expressions

## Lab 03

### February 23, 2018

This notebook covers the basics of regular expressions.

Regular expressions (regex or regexp) are search patterns described as strings. They allow pattern matching in arbitrary strings.

Regular expressions are implemented in Python's `re` module.

The `re` module operates via two objects:

1. **pattern** objects, which are compiled regular expressions, and
2. **match** objects, which describe successful pattern matches.

In [1]:
import re

r = re.compile("abc")
type(r), r

(_sre.SRE_Pattern, re.compile(r'abc', re.UNICODE))

In [2]:
m = r.search("aabcd")
type(m), m

(_sre.SRE_Match, <_sre.SRE_Match object; span=(1, 4), match='abc'>)

`search` returns `None` when the pattern does not occur anywhere in the string:

In [3]:
m2 = r.search("def")
m2
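
Because `search` may return `None`, calling a method such as `span` on the result without a check raises an `AttributeError`. A minimal sketch of the usual guard follows; the helper name `find_span` is hypothetical and only serves the illustration.

```python
import re

r = re.compile("abc")


def find_span(text):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = r.search(text)
    if m is None:
        # No match: there is no match object to query.
        return None
    return m.span()


found = find_span("aabcd")
missing = find_span("def")
print(found, missing)
```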

The match object's `span` method returns the start index and the (exclusive) end index of the match:

In [4]:
m = r.search("aabcde")
m.span()

(1, 4)

In [5]:
m.start(), m.end()

(1, 4)

These indices can be used directly to extract the matching substring:

In [6]:
s = "aabcdef"
m = r.search(s)

s[m.start():m.end()]

'abc'

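Pattern objects do not have to be created explicitly: module-level functions such as `re.search` accept the pattern as a string. Compiling once still offers some speed-up when the same pattern is used many times.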

In [7]:
re.search("ab", "abcd")

<_sre.SRE_Match object; span=(0, 2), match='ab'>
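
The two styles give the same results. The sketch below runs one pattern over a small, made-up list of strings, once through a compiled pattern object and once through the module-level function.

```python
import re

words = ["abcd", "cab", "xyz"]  # illustrative inputs

# Compile once, then reuse the pattern object for every string.
pattern = re.compile("ab")
with_compiled = [pattern.search(w) is not None for w in words]

# Pass the pattern string to the module-level function on every call.
with_module = [re.search("ab", w) is not None for w in words]

print(with_compiled, with_module)
```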

# Basic regular expressions

Below we show the most commonly used regex constructs. For a full list, see the [official documentation](https://docs.python.org/3/library/re.html).

We will not compile the regular expressions for the sake of brevity.

To match any one of a set of characters, use `[]`:

In [8]:
re.search("[bB]", "abc")

<_sre.SRE_Match object; span=(1, 2), match='b'>

In [9]:
re.search("[bB]", "aBb")

<_sre.SRE_Match object; span=(1, 2), match='B'>

## Quantifiers

Quantifiers control how many times the preceding element may be repeated in a match.

`?` matches zero or one time:

In [10]:
re.search("a?bc", "bc")

<_sre.SRE_Match object; span=(0, 2), match='bc'>

In [11]:
re.search("a?bc", "aabc")

<_sre.SRE_Match object; span=(1, 4), match='abc'>

`*` matches zero or more times in a greedy manner (match as many as possible):

In [12]:
re.search("a*bc", "aaabc")

<_sre.SRE_Match object; span=(0, 5), match='aaabc'>
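
To see what greedy means in practice, it helps to compare `*` with its non-greedy variant `*?`, which matches as few characters as possible. The input string here is made up for the illustration.

```python
import re

text = "<a><b>"

# Greedy: .* runs to the last '>' it can reach.
greedy = re.search("<.*>", text).group(0)

# Non-greedy: .*? stops at the first '>' that completes the match.
lazy = re.search("<.*?>", text).group(0)

print(greedy, lazy)
```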

`+` matches one or more times:

In [13]:
re.search("a+bc", "daaaabc")

<_sre.SRE_Match object; span=(1, 7), match='aaaabc'>

`{N}` will match exactly $N$ times:

In [14]:
re.search("a{3}bc", "aabc")

In [15]:
re.search("a{3}bc", "aaaabc")

<_sre.SRE_Match object; span=(1, 6), match='aaabc'>

`{N,M}` will match at least $N$ and at most $M$ times:

In [16]:
re.search("a{3,5}bc", "aaaabc")

<_sre.SRE_Match object; span=(0, 6), match='aaaabc'>

In [17]:
re.search("a{3,5}bc", "aaaaabc")

<_sre.SRE_Match object; span=(0, 7), match='aaaaabc'>
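
The upper bound only becomes visible when the string contains more repetitions than $M$. In the sketch below the input has six `a` characters, so `a{3,5}` cannot start at index 0. Leaving out the upper bound, as in `{3,}`, removes the limit.

```python
import re

text = "aaaaaabc"  # six 'a' characters followed by 'bc'

# At most five 'a's may be consumed, so the match starts at index 1.
bounded = re.search("a{3,5}bc", text)

# {3,} means "three or more", so the match can start at index 0.
unbounded = re.search("a{3,}bc", text)

print(bounded.span(), unbounded.span())
```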

## Special characters

`.` matches any character besides newline:

In [18]:
re.search("a.c", "abc")

<_sre.SRE_Match object; span=(0, 3), match='abc'>

In [19]:
print(re.search("a.c", "ac"))

None


`^` (caret) matches the beginning of the string and, in MULTILINE mode, also immediately after each newline:

In [20]:
re.search("^a", "abc"), re.search("^a", "bc\na")

(<_sre.SRE_Match object; span=(0, 1), match='a'>, None)

In [21]:
re.search("^a", "bc\nab", re.MULTILINE)

<_sre.SRE_Match object; span=(3, 4), match='a'>

`$` matches at the end of the string (or just before a newline that ends the string) and, in MULTILINE mode, also before each newline:

In [22]:
re.search("c$", "abc")

<_sre.SRE_Match object; span=(2, 3), match='c'>

In [23]:
re.search("a$", "aba\nc", re.MULTILINE)

<_sre.SRE_Match object; span=(2, 3), match='a'>
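
The effect of the flag is easiest to see when every match is collected. The sketch below uses `re.findall` and `\w`, both covered later in this notebook, on a made-up two-line string.

```python
import re

text = "ab cd\nef gh"

# With MULTILINE, ^ and $ anchor at every line, not only at the string ends.
first_words = re.findall(r"^\w+", text, re.MULTILINE)
last_words = re.findall(r"\w+$", text, re.MULTILINE)

# Without the flag, ^ only matches at the very beginning of the string.
without_flag = re.findall(r"^\w+", text)

print(first_words, last_words, without_flag)
```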

## `|` in patterns

`|` allows matching any of several alternative patterns:

In [24]:
re.search("ab|cd", "ab").span()

(0, 2)

In [25]:
re.search("ab|cd", "decd").span()

(2, 4)
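
Note that `|` applies to everything on either side of it, so `ab|cd` means "`ab` or `cd`", not "`a`, then `b` or `c`, then `d`". Parentheses (see the section on capture groups) limit its scope. A short sketch of the difference:

```python
import re

# Alternation spans the whole pattern: 'ab' or 'cd'.
whole = re.search("ab|cd", "acd")

# Parentheses restrict the alternation to 'b' or 'c'.
grouped = re.search("a(b|c)d", "acd")

print(whole.group(0), grouped.group(0))
```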

## Character ranges

`[A-F]` matches any single character between A and F:

In [26]:
re.search("[A-F]{3}", "abBCDEef")

<_sre.SRE_Match object; span=(2, 5), match='BCD'>

`[A-Za-z]` matches any single English letter:

In [27]:
re.search("[A-Za-z]+", "abAaz12")

<_sre.SRE_Match object; span=(0, 5), match='abAaz'>

`[0-9]` matches digits:

In [28]:
re.search("[0-9]{3}", "ab12345cd")

<_sre.SRE_Match object; span=(2, 5), match='123'>

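`-` needs to be escaped (or placed first or last) if we want to include it literally in a character set. We write the pattern as a raw string so that Python does not try to interpret the backslash itself:

In [29]:
re.search(r"[0-9\-]+", "1-2")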

<_sre.SRE_Match object; span=(0, 3), match='1-2'>
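
Escaping is not the only way to get a literal hyphen: a `-` placed first or last inside the brackets cannot form a range, so it is taken literally. The sketch below compares the three spellings on the same input.

```python
import re

escaped = re.search(r"[0-9\-]+", "1-2").group(0)   # escaped hyphen
leading = re.search(r"[-0-9]+", "1-2").group(0)    # hyphen placed first
trailing = re.search(r"[0-9-]+", "1-2").group(0)   # hyphen placed last

print(escaped, leading, trailing)
```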

## Character classes

Some character classes are predefined, such as whitespace or word characters.

`\s` matches any Unicode whitespace:

In [30]:
re.search(r"\s+", "ab \t\n\n")

<_sre.SRE_Match object; span=(2, 6), match=' \t\n\n'>

`\S` (capital S) matches any character that is not whitespace:

In [31]:
re.search(r"\S+", "ab \t\n\n")

<_sre.SRE_Match object; span=(0, 2), match='ab'>

`\w` matches any Unicode word character, that is, any character that can be part of a word in any language, as well as digits and the underscore. `\W` matches anything else:

In [32]:
re.search(r"\w+", "tükörfúrógép")

<_sre.SRE_Match object; span=(0, 12), match='tükörfúrógép'>

In [33]:
re.search(r"\w+", "tükör fúrógép")

<_sre.SRE_Match object; span=(0, 5), match='tükör'>
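
Two details of `\w` are easy to miss: it also covers digits and the underscore, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`. A short sketch on made-up strings, using `re.findall` (covered later in this notebook) to list all matches:

```python
import re

# Digits and '_' count as word characters; '-' and spaces do not.
tokens = re.findall(r"\w+", "snake_case var2 a-b")

# With re.ASCII, accented letters are no longer word characters.
ascii_only = re.findall(r"\w+", "tükör", re.ASCII)

print(tokens, ascii_only)
```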

Negated character classes, written `[^...]`, match any character **not** in the class:

In [34]:
re.search("[^abc]+", "abcdef").group(0)

'def'

## Capture groups

Patterns may contain groups, marked by parentheses. Groups can be accessed by their indices (group 0 is the whole match), or all of them can be retrieved at once with the `groups` method:

In [35]:
patt = re.compile("([hH]ell[oó]) (.*)!")
match = patt.search("Hello people!")

In [36]:
match.group()

'Hello people!'

In [37]:
match.groups()

('Hello', 'people')

In [38]:
match.group(0)

'Hello people!'

In [39]:
match.group(1)

'Hello'

In [40]:
match.group(2)

'people'

Groups can also be named. In this case, they can be retrieved in a dict mapping group names to matched substrings using `groupdict`:

In [41]:
patt = re.compile("(?P<greeting>[hH]ell[oó]) (?P<name>.*)!")
match = patt.search("Hello people!")
match.group("name")

'people'

In [42]:
match.groupdict()

{'greeting': 'Hello', 'name': 'people'}
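
Named groups are handy when a match has several fields. As a sketch, the pattern below pulls a date in `YYYY-MM-DD` form out of a sentence and converts the fields to integers. The group names, the date format and the input sentence are assumptions made for this illustration.

```python
import re

# Hypothetical pattern for dates written as YYYY-MM-DD.
date_patt = re.compile("(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})")

m = date_patt.search("Lab 03 was held on 2018-02-23.")

# groupdict() maps each group name to the substring it matched.
parts = {name: int(value) for name, value in m.groupdict().items()}
print(parts)
```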

## Other `re` methods

`re.match` matches only at the beginning of the string:

In [43]:
re.match("ab", "abcd")

<_sre.SRE_Match object; span=(0, 2), match='ab'>

In [44]:
print(re.match("ab", "zabcd"))

None
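
`re.match` anchors the pattern at the start only, so the rest of the string may contain anything. When the whole string has to match, `re.fullmatch` can be used instead. A short comparison:

```python
import re

prefix = re.match("ab", "abcd")      # succeeds: the string starts with 'ab'
whole = re.fullmatch("ab", "abcd")   # fails: 'cd' is left over
exact = re.fullmatch("ab", "ab")     # succeeds: the entire string is 'ab'

print(prefix is not None, whole is not None, exact is not None)
```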


`re.findall` finds every non-overlapping occurrence of a pattern in a string. Unlike most other functions, `findall` directly returns the matched strings instead of match objects:

In [45]:
re.findall("[AaBb]+", "ab Abcd")

['ab', 'Ab']
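
The return value of `findall` changes when the pattern contains capture groups: with one group it returns the group's contents, and with several groups it returns a tuple per match. A sketch on a made-up `key=value` string:

```python
import re

text = "a=1 b=22"

# No groups: the full matches are returned.
no_group = re.findall("[a-z]+=[0-9]+", text)

# One group: only the contents of that group are returned.
one_group = re.findall("[a-z]+=([0-9]+)", text)

# Several groups: one tuple per match.
two_groups = re.findall("([a-z]+)=([0-9]+)", text)

print(no_group, one_group, two_groups)
```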

`re.finditer` returns an iterator that iterates over every match:

In [46]:
for match in re.finditer("[AaBb]+", "ab Abcd"):
    print(match.span(), match.group(0))

(0, 2) ab
(3, 5) Ab


`re.sub` replaces every non-overlapping occurrence of a pattern:

In [47]:
re.sub("[Aa]", "b", "acA")

'bcb'
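
Two further options of `re.sub` are often useful: the `count` argument limits the number of replacements, and the replacement string may refer back to capture groups with `\1`, `\2` and so on. A short sketch on made-up inputs:

```python
import re

# Replace only the first occurrence.
first_only = re.sub("[Aa]", "b", "acA", count=1)

# Swap two words by referring to the captured groups in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(first_only, swapped)
```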

`re.split` splits a string at every pattern match:

In [48]:
re.split(r"\s+", "words  with\t whitespace")

['words', 'with', 'whitespace']

We can keep the separators as well by wrapping the pattern in a capture group:

In [49]:
re.split(r"(\s+)", "words  with\t whitespace")

['words', '  ', 'with', '\t ', 'whitespace']
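
Two more behaviours of `re.split` are worth knowing: the `maxsplit` argument limits the number of splits, and a separator at the very start of the string produces an empty first element. A sketch on made-up inputs:

```python
import re

# Split at most once; the remainder is returned unsplit.
limited = re.split(r"\s+", "a b c", maxsplit=1)

# Leading whitespace yields an empty string as the first element.
leading = re.split(r"\s+", " a b")

print(limited, leading)
```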