# Intermediate Regular Expressions (Regex)

## Agenda

1. Greedy or lazy quantifiers
2. Alternatives
3. Substitution
4. Anchors
5. Option flags
6. Lookarounds
7. Assorted functionality

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

In [2]:
import re

## Part 1: Greedy or lazy quantifiers

Quantifiers modify the **required quantity** of a character or a pattern.

Quantifier|What it matches
---|---
**`a+`** | 1 or more occurrences of 'a' (the pattern directly to its left)
**`a*`** | 0 or more occurrences of 'a'
**`a?`** | 0 or 1 occurrence of 'a'

In [3]:
s = 'sid is missing class'

In [4]:
re.search(r'miss\w+', s).group()

'missing'

In [5]:
re.search(r'is\w*', s).group()

'is'

In [6]:
re.search(r'is\w+', s).group()

'issing'

**`+`** and **`*`** are **"greedy"**, meaning that they match as much of the string as possible while still allowing the rest of the pattern to match:

In [7]:
s = 'Some text <h1>my heading</h1> More text'

In [8]:
re.search(r'<.+>', s).group()

'<h1>my heading</h1>'

Add a **`?`** after **`+`** or **`*`** to make them **"lazy"**, meaning that they try to use up as little of the string as possible:

In [9]:
re.search(r'<.+?>', s).group()

'<h1>'
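To make the difference easier to see, here is a small sketch that runs both versions of the pattern with `re.findall()`, so every match is listed rather than only the first. The variable names `html`, `greedy`, and `lazy` are illustrative and not part of the original notebook.

```python
import re

# Same sample text as the cells above
html = 'Some text <h1>my heading</h1> More text'

# Greedy: .+ runs on to the last '>' in the string, giving one long match
greedy = re.findall(r'<.+>', html)

# Lazy: .+? stops at the first '>' it reaches, so each tag matches separately
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<h1>my heading</h1>']
print(lazy)    # ['<h1>', '</h1>']
```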

## Part 2: Alternatives

Alternatives define **multiple possible patterns** that can be used to produce a match. They are separated by a pipe (`|`) and are usually put in parentheses, which limit how much of the surrounding pattern the pipe applies to:

In [10]:
s = 'I live at 100 First St, which is around the corner.'

In [11]:
re.search(r'\d+ .+ (Ave|St|Rd)', s).group()

'100 First St'
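The parentheses matter because the pipe has very low precedence. The sketch below, which reuses the same sentence, compares the pattern with and without them. The names `no_group` and `grouped` are illustrative only.

```python
import re

s = 'I live at 100 First St, which is around the corner.'

# Without parentheses the pipe splits the WHOLE pattern into three
# alternatives: '\d+ .+ Ave', 'St', or 'Rd'. Only the bare 'St' matches here.
no_group = re.search(r'\d+ .+ Ave|St|Rd', s).group()

# With parentheses only the street suffix varies
grouped = re.search(r'\d+ .+ (Ave|St|Rd)', s).group()

print(no_group)  # St
print(grouped)   # 100 First St
```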

## Part 3: Substitution

`re.sub()` finds **all matches** in a given string and **replaces them** with a specified string:

In [12]:
s = 'my twitter is @jimmy, my emails are john@hotmail.com and jim@yahoo.com'

In [13]:
re.sub(r'jim', r'JIM', s)

'my twitter is @JIMmy, my emails are john@hotmail.com and JIM@yahoo.com'

In [14]:
re.sub(r' @\w+', r' @johnny', s)

'my twitter is @johnny, my emails are john@hotmail.com and jim@yahoo.com'

The replacement string can refer to text from **match groups**:

- `\1` refers to `group(1)`
- `\2` refers to `group(2)`
- etc.

In [15]:
re.sub(r'(\w+)@[\w.]+', r'\1@gmail.com', s)

'my twitter is @jimmy, my emails are john@gmail.com and jim@gmail.com'
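The example above uses only `\1`. As a sketch of how `\2` behaves, the pattern below captures the user name and the domain in separate groups and swaps their order in the replacement. The output format `domain (user: name)` is made up for illustration.

```python
import re

s = 'my twitter is @jimmy, my emails are john@hotmail.com and jim@yahoo.com'

# group 1 = user name, group 2 = domain; the replacement reorders them
swapped = re.sub(r'(\w+)@([\w.]+)', r'\2 (user: \1)', s)

print(swapped)
# my twitter is @jimmy, my emails are hotmail.com (user: john) and yahoo.com (user: jim)
```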

## Part 4: Anchors

Anchors define **where in a string** the regular expression pattern must occur.

Anchor|What it requires
---|---
**`^abc`** | this pattern must appear at the start of a string
**`abc$`** | this pattern must appear at the end of a string

In [16]:
s = 'sid is missing class'

In [17]:
re.search(r'^\w+', s).group()

'sid'

In [18]:
re.search(r'\w+$', s).group()

'class'

In [19]:
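# this will cause an error: 'is' is not at the start of the string,
# so re.search() returns None and calling .group() raises an AttributeError
# re.search(r'^is', s).group()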
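One way to avoid that error is to check the result of `re.search()` before calling `.group()`. The following is a minimal sketch. The fallback string `'no match'` and the variable names are arbitrary choices for illustration.

```python
import re

s = 'sid is missing class'

# 'is' occurs in the string, but not at the start, so the anchored search fails
match = re.search(r'^is', s)
print(match)  # None

# Test the match object before using it
result = match.group() if match else 'no match'
print(result)  # no match

# Without the anchor, the same pattern finds the first 'is' anywhere
unanchored = re.search(r'is', s).span()
print(unanchored)  # (4, 6)
```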

## Part 5: Option flags

Option flags **change the default behavior** of the pattern matching.

Default behavior | Option flag | Behavior when using flag
---|---|---
matching is case-sensitive | re.IGNORECASE | matching is case-insensitive
**`.`** matches any character except a newline | re.DOTALL | **`.`** matches any character including a newline
within a multi-line string, **`^`** and **`$`**<br>match start and end of entire string | re.MULTILINE | **`^`** and **`$`** match start and end of each line
spaces and **`#`** are treated as literal characters | re.VERBOSE | whitespace is ignored (except in a character class or<br>when preceded by **`\`**), and **`#`** starts a comment that runs to the end of the line

In [20]:
s = 'LINE one\nLINE two'

In [21]:
print(s)

LINE one
LINE two


In [22]:
# case-sensitive
re.search(r'..n.', s).group()

' one'

In [23]:
# case-insensitive
re.search(r'..n.', s, flags=re.IGNORECASE).group()

'LINE'

In [24]:
# . does not match a newline
re.search(r'n.+', s).group()

'ne'

In [25]:
# . matches a newline
re.search(r'n.+', s, flags=re.DOTALL).group()

'ne\nLINE two'

In [26]:
# combine option flags
re.search(r'n.+', s, flags=re.IGNORECASE|re.DOTALL).group()

'NE one\nLINE two'

In [27]:
# $ matches end of entire string
re.search(r'..o\w*$', s).group()

'two'

In [28]:
# $ matches end of each line
re.search(r'..o\w*$', s, flags=re.MULTILINE).group()

'E one'

In [29]:
# spaces are literal characters
re.search(r' \w+', s).group()

' one'

In [30]:
# spaces are ignored
re.search(r' \w+', s, flags=re.VERBOSE).group()

'LINE'

In [31]:
# use multi-line patterns and add comments in verbose mode
re.search(r'''
\     # single space
\w+   # one or more word characters
''', s, flags=re.VERBOSE).group()

' one'
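The effect of `re.MULTILINE` is often easier to see with `re.findall()`, which returns one result per line instead of only the first match. This is a small sketch using the same two-line string. The variable names are illustrative.

```python
import re

s = 'LINE one\nLINE two'

# Default: ^ matches only at the very start of the string
first_only = re.findall(r'^\w+', s)

# MULTILINE: ^ also matches immediately after each newline
each_line = re.findall(r'^\w+', s, flags=re.MULTILINE)

# MULTILINE: $ also matches immediately before each newline
last_words = re.findall(r'\w+$', s, flags=re.MULTILINE)

print(first_only)  # ['LINE']
print(each_line)   # ['LINE', 'LINE']
print(last_words)  # ['one', 'two']
```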

## Part 6: Lookarounds

A **lookahead** matches a pattern only if it is **followed by** another pattern. The text checked by the lookahead is not included in the match. For example:

- `100(?= dollars)` matches `'100'` only if it is followed by `' dollars'`

A **lookbehind** matches a pattern only if it is **preceded by** another pattern. The text checked by the lookbehind is not included in the match. For example:

- `(?<=\$)100` matches `'100'` only if it is preceded by `'$'`

In [32]:
s = 'Name: Cindy, 66 inches tall, 30 years old'

In [33]:
# find the age, without a lookahead
re.search(r'(\d+) years? old', s).group(1)

'30'

In [34]:
# find the age, with a lookahead
re.search(r'\d+(?= years? old)', s).group()

'30'

In [35]:
# find the name, without a lookbehind
re.search(r'Name: (\w+)', s).group(1)

'Cindy'

In [36]:
# find the name, with a lookbehind
re.search(r'(?<=Name: )\w+', s).group()

'Cindy'
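The two patterns given at the start of this part, `100(?= dollars)` and `(?<=\$)100`, can be tried on a sentence that contains `'100'` twice. The sentence below is a made-up sample, and `ahead` and `behind` are illustrative names.

```python
import re

# Hypothetical sample text containing '100' in two different contexts
text = 'She paid $100 for a ticket that cost 100 dollars last year.'

# Lookahead: matches only the '100' that is followed by ' dollars'
ahead = re.search(r'100(?= dollars)', text)

# Lookbehind: matches only the '100' that is preceded by '$'
behind = re.search(r'(?<=\$)100', text)

# Both matches contain just '100'; the surrounding text is checked, not consumed
print(ahead.group(), ahead.span())
print(behind.group(), behind.span())
```

The two spans differ, which shows that each pattern selected a different `'100'` even though the matched text is the same.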

## Part 7: Assorted functionality

`re.compile()` compiles a regular expression pattern for **improved readability and performance** (if the pattern is used frequently):

In [37]:
s = 'emails: john-doe@gmail.com and jane-doe@hotmail.com'

In [38]:
email = re.compile(r'[\w.-]+@[\w.-]+')

In [39]:
# these are all equivalent
re.search(r'[\w.-]+@[\w.-]+', s).group()
re.search(email, s).group()
email.search(s).group()

'john-doe@gmail.com'

In [40]:
# these are all equivalent
re.findall(r'[\w.-]+@[\w.-]+', s)
re.findall(email, s)
email.findall(s)

['john-doe@gmail.com', 'jane-doe@hotmail.com']
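Compiling pays off mainly when one pattern is applied to many strings. The sketch below compiles the same email pattern once and reuses it in a loop. The `lines` list is hypothetical sample data, not from the original notebook.

```python
import re

# Compile once, then reuse the compiled object for every string
email = re.compile(r'[\w.-]+@[\w.-]+')

# Hypothetical sample data
lines = [
    'contact: John-Doe@Gmail.com',
    'no address here',
    'cc: jane-doe@hotmail.com, bob@example.org',
]

# One list of matches per input line (empty when nothing matches)
found = [email.findall(line) for line in lines]

print(found)
# [['John-Doe@Gmail.com'], [], ['jane-doe@hotmail.com', 'bob@example.org']]

# The original pattern string stays available on the compiled object
print(email.pattern)
```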

Use the `span()` method of a match object, rather than the `group()` method, to determine the **location of a match**:

In [41]:
re.search(email, s).span()

(8, 26)

In [42]:
s[8:26]

'john-doe@gmail.com'

`re.split()` **splits a string** by the occurrences of a regular expression pattern:

In [43]:
re.split(r'john|jane', s)

['emails: ', '-doe@gmail.com and ', '-doe@hotmail.com']
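A common use of `re.split()` is splitting on separators that are not consistent, which `str.split()` cannot do with a single call. This is a minimal sketch with a made-up `row` string. The optional `maxsplit` argument limits the number of splits.

```python
import re

# Hypothetical sample with mixed separators and uneven spacing
row = 'apples, pears;bananas ,  kiwis'

# Split on a comma or semicolon, absorbing any whitespace around it
parts = re.split(r'\s*[,;]\s*', row)
print(parts)  # ['apples', 'pears', 'bananas', 'kiwis']

# maxsplit=1 splits only at the first separator
head_tail = re.split(r'\s*[,;]\s*', row, maxsplit=1)
print(head_tail)  # ['apples', 'pears;bananas ,  kiwis']
```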