# Regular expressions

Import the regular expressions module.

In [None]:
import re

Take a quote from Bram Stoker's "Dracula".

In [None]:
text = """Oh, the terrible struggle that I have had against sleep so often of late; the pain of the sleeplessness, or the pain of the fear of sleep, and with such unknown horror as it has for me!
How blessed are some people, whose lives have no fears, no dreads; to whom sleep is a blessing that comes nightly, and brings nothing but sweet dreams."""

## Simple matches

Search for the word "sleep" in the text.

In [None]:
re.search("sleep", text)

The match object reports a `span` of `(50, 55)`. Check that these positions are correct by slicing `text`.

In [None]:
text[50:55]

This is the first occurrence of "sleep" in the passage.
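
Rather than copying the positions by hand, the match object can supply them. The sketch below uses a shortened excerpt of the quote (the variable name `excerpt` is only for illustration) so that it runs on its own.

```python
import re

# A shortened excerpt of the quote, so the example is self-contained.
excerpt = "Oh, the terrible struggle that I have had against sleep so often of late"

m = re.search("sleep", excerpt)

# span() gives (start, end); group() gives the matched text itself.
print(m.span())   # (50, 55)
print(m.group())  # sleep

# Slicing with start() and end() recovers the same text.
print(excerpt[m.start():m.end()] == m.group())  # True
```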

## Using metacharacters

Find the first word that begins with "s". That's "s" at the start of a word, followed by zero or more word characters.

In [None]:
re.search(r"\bs\w*", text)

What about _all_ the words that start with "s"?

In [None]:
re.findall(r"\bs\w*", text)

Find _unique_ words that start with "s".

In [None]:
sorted(set(re.findall(r"\bs\w*", text)))
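
If the number of occurrences matters as well as the unique words, one option is to feed the `findall` result into `collections.Counter`. This is a small sketch on a made-up `sample` string rather than the full quote.

```python
import re
from collections import Counter

# A short sample string (not the full quote) to keep the example small.
sample = "sleep so often; the pain of the sleeplessness, or the fear of sleep"

# Count how often each word starting with "s" appears.
counts = Counter(re.findall(r"\bs\w*", sample))

print(counts["sleep"])          # 2
print(sorted(counts))           # ['sleep', 'sleeplessness', 'so']
```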

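Find UK postcodes in some text. The pattern is a simplification that covers the common formats shown here, not a full validator.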

In [None]:
postcode_text = "Downing Street, SW1A 2AA; Buckingham Palace SW1A 1AA; Old Trafford, M16 0RA; Aintree, L9 5AS; Shakespeare's birthplace, CV37 6QW; Wembley Stadium, HA9 0WS"
re.findall(r"[A-Z]{1,2}[0-9]{1,2}[A-Z]? \d[A-Z]{2}", postcode_text)
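
As a possible variation, wrapping the two halves of the pattern in parentheses makes `findall` return them separately as tuples. The `\b` anchors are an addition here, intended to stop the pattern matching inside a longer run of letters and digits; the shortened `postcode_text` is used only to keep the example compact.

```python
import re

postcode_text = "Downing Street, SW1A 2AA; Old Trafford, M16 0RA; Aintree, L9 5AS"

# Two capturing groups: the outward code (area and district)
# and the inward code (sector and unit).
pairs = re.findall(r"\b([A-Z]{1,2}[0-9]{1,2}[A-Z]?) (\d[A-Z]{2})\b", postcode_text)

print(pairs)  # [('SW1A', '2AA'), ('M16', '0RA'), ('L9', '5AS')]
```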

Validate a phone number.

In [None]:
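# fullmatch requires the whole string to fit the pattern, not just a prefix.
print(re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-555-5555") is not None)  # valid
print(re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-555-555") is not None)  # too short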
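
One thing to watch when validating: `re.match` anchors the pattern only at the start of the string, so trailing characters are ignored, whereas `re.fullmatch` requires the entire string to fit. A short sketch of the difference, using a number with one digit too many:

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"
too_long = "555-555-55555"  # one digit too many

# match() succeeds because the first 12 characters fit the pattern.
matched_prefix = re.match(pattern, too_long) is not None

# fullmatch() fails because the whole string must fit.
matched_whole = re.fullmatch(pattern, too_long) is not None

print(matched_prefix)  # True
print(matched_whole)   # False
```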

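Extract a day of the month. The lookahead `(?=...)` requires an ordinal suffix to follow the digits, but does not include it in the match.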

In [None]:
regex = re.compile(r"(\d{1,2})(?=(st|nd|rd|th))")
print(regex.search("2nd"))
print(regex.search("15th"))
print(regex.search("31st"))
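
To see what the lookahead does, it can help to inspect the match inside a longer string. In this sketch (the sentence is made up), the overall match covers only the digits, while the group inside the lookahead still records which suffix followed them.

```python
import re

regex = re.compile(r"(\d{1,2})(?=(st|nd|rd|th))")
m = regex.search("Due on the 31st")

# The lookahead checks for the suffix without consuming it,
# so the match and its span cover the digits only.
print(m.group())   # 31
print(m.span())    # (11, 13)

# The group inside the lookahead still captures the suffix.
print(m.group(2))  # st
```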

## Escaping metacharacters

Markdown uses two stars (\*\*) to embolden text. Because `*` is a metacharacter, it must be escaped to match a literal star. Find the bold phrases in a piece of Markdown text.

In [None]:
markdown_text = "Regular expressions are known as **regexes**. Use the **re** module to work with regexes in Python."
re.findall(r"\*{2}([^\*]+)\*{2}", markdown_text)

We can also use `+?`, which is the non-greedy version of `+`.

In [None]:
re.findall(r"\*{2}(.+?)\*{2}", markdown_text)

Regex quantifiers are usually greedy: they will match as much as they can. Non-greedy quantifiers, such as `*?` and `+?`, will match as little as they can.
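
The difference is easiest to see side by side. In this sketch (with a shortened `markdown_text`), the greedy `.+` runs from the first pair of stars to the last, while the non-greedy `.+?` stops at the nearest closing pair.

```python
import re

markdown_text = "Use **re** to work with **regexes**."

# Greedy: .+ grabs as much as possible, swallowing the inner stars.
greedy = re.findall(r"\*{2}(.+)\*{2}", markdown_text)

# Non-greedy: .+? stops at the first closing pair of stars.
lazy = re.findall(r"\*{2}(.+?)\*{2}", markdown_text)

print(greedy)  # ['re** to work with **regexes']
print(lazy)    # ['re', 'regexes']
```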

## Anchors

Find the first word in a piece of text.

In [None]:
re.search(r"^\W*\w+", text)

Find the last word in some text.

In [None]:
re.search(r"\w+(?=(\W*$))", text)
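
By default `^` matches only at the very start of the string. As a sketch of one way to get the first word of every line instead, the `re.MULTILINE` flag makes `^` match after each newline as well. The two-line `lines` string is a shortened stand-in for the quote.

```python
import re

# Two short lines standing in for the two lines of the quote.
lines = "Oh, the terrible struggle\nHow blessed are some people"

# Without the flag, ^ matches only at the start of the string.
first_only = re.findall(r"^\W*\w+", lines)

# With re.MULTILINE, ^ also matches at the start of each line.
first_per_line = re.findall(r"^\W*\w+", lines, re.MULTILINE)

print(first_only)      # ['Oh']
print(first_per_line)  # ['Oh', 'How']
```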

## Groups

Deconstruct text with capturing groups. The `(?:...)` group matches the ordinal suffix without capturing it.

In [None]:
m = re.search(r"(\d{1,2})(?:st|nd|rd|th)? ([A-Za-z]{3,}) (\d{4})", "25th Dec 2021")
m.groups()

When you have lots of groups, it can be helpful to name them.

In [None]:
m = re.search(r"(?P<date>\d{1,2})(?:st|nd|rd|th)? (?P<month>[A-Za-z]{3,}) (?P<year>\d{4})", "25th Dec 2021")
{"date": m.group("date"), "month": m.group("month"), "year": m.group("year")}
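
When every group of interest is named, the match object's `groupdict()` method builds the same dictionary in one call. A minimal sketch using the same pattern:

```python
import re

m = re.search(
    r"(?P<date>\d{1,2})(?:st|nd|rd|th)? (?P<month>[A-Za-z]{3,}) (?P<year>\d{4})",
    "25th Dec 2021",
)

# groupdict() maps each group name to the text it captured.
parts = m.groupdict()
print(parts)  # {'date': '25', 'month': 'Dec', 'year': '2021'}
```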

## Backreferences

How can we match text that starts and ends with the same word?

In [None]:
count_text = "one, two, three...three, two, one"
re.search(r"^(\w+).*\1$", count_text).groups()

Be careful about what the group is allowed to capture. In the next example the text ends with "zero", yet the first pattern still matches: `\w+` backtracks until it captures just "o", and the text does end in "o". Adding `\b` forces the group to capture the whole first word, so the second search returns `None`.

In [None]:
count_zero_text = "one, two, three...three, two, one, zero"
print(re.search(r"^(\w+).*\1$", count_zero_text).groups())
print(re.search(r"^(\w+\b).*\1$", count_zero_text))
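
Another common use of backreferences is spotting accidentally doubled words. The sketch below runs on a made-up sentence (`typo_text`); `\1` must repeat exactly what the first group captured, and the `\b` anchors keep the match to whole words.

```python
import re

# A made-up sentence with two accidental repetitions.
typo_text = "Sleep is is a blessing that that comes nightly"

# (\w+) captures a word; \1 requires the same word again after a space.
doubled = re.findall(r"\b(\w+) \1\b", typo_text)

print(doubled)  # ['is', 'that']
```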

## Applying flags

This fails to match because `[A-Z]` only matches upper-case letters and the month is mixed case.

In [None]:
re.search(r"\d{1,2}(?:st|nd|rd|th)? ([A-Z]{3,}) \d{4}", "25th Dec 2021")

We could use `[A-Za-z]`, but if we don't care about case we can say that explicitly.

In [None]:
re.search(r"\d{1,2}(?:st|nd|rd|th)? ([A-Z]{3,}) \d{4}", "25th Dec 2021", re.IGNORECASE)
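
Flags can be combined with `|`. As an illustration, the sketch below adds `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments, so the same date pattern can be laid out and documented line by line. Because literal spaces are ignored in verbose mode, `\s` is used for the separators; the extra capturing groups are added here for illustration.

```python
import re

pattern = re.compile(
    r"""
    (\d{1,2})            # day of the month
    (?:st|nd|rd|th)?     # optional ordinal suffix, not captured
    \s([A-Z]{3,})        # month name (any case, thanks to IGNORECASE)
    \s(\d{4})            # four-digit year
    """,
    re.IGNORECASE | re.VERBOSE,
)

date_parts = pattern.search("25th dec 2021").groups()
print(date_parts)  # ('25', 'dec', '2021')
```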

## Substituting values

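It's common to clean up data by removing invalid text or tokens, that is, replacing them with the empty string. Here we remove every character that is neither a word character nor whitespace.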

In [None]:
only_words_text = re.sub(r"[^\w\s]", "", text)
only_words_text
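
Substitutions are not limited to deleting text. Two further sketches on made-up strings: collapsing runs of whitespace into a single space, and using `\1` in the replacement to keep part of what was matched.

```python
import re

messy = "no fears,   no dreads;  sweet dreams."

# Remove punctuation, then collapse runs of whitespace to one space.
no_punct = re.sub(r"[^\w\s]", "", messy)
tidy = re.sub(r"\s+", " ", no_punct).strip()
print(tidy)  # no fears no dreads sweet dreams

# \1 in the replacement refers to the first captured group,
# so the digits are kept and the ordinal suffix is dropped.
plain_date = re.sub(r"(\d{1,2})(?:st|nd|rd|th)", r"\1", "25th Dec 2021")
print(plain_date)  # 25 Dec 2021
```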

## Splitting text

Splitting text into fields is a common problem when working with data (e.g. log file analysis). Regexes allow us to perform splits on complicated, heterogeneous delimiters.

In [None]:
delimited_text = "The|  quick, brown   fox; jumps+ over.the. lazy\ndog"
re.split(r"\W+", delimited_text)
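
Two details are worth knowing when splitting. The delimiter pattern can absorb surrounding whitespace, and a delimiter at the very end of the string leaves an empty field behind. The log line below is hypothetical, made up purely for illustration.

```python
import re

# A hypothetical log line with pipe-separated fields.
log_line = "2021-12-25 10:15:00 | ERROR | disk full"

# The pattern swallows the spaces around each pipe.
fields = re.split(r"\s*\|\s*", log_line)
print(fields)  # ['2021-12-25 10:15:00', 'ERROR', 'disk full']

# A trailing delimiter produces an empty string at the end.
words = re.split(r"\W+", "Hello, world!")
print(words)  # ['Hello', 'world', '']

# Filter out empty fields if they are not wanted.
non_empty = [word for word in words if word]
print(non_empty)  # ['Hello', 'world']
```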