## Tutorial 05: Regular Expressions

Here we introduce the concept of a regular expression and the
Python module `re`, which provides an efficient and easy-to-use
implementation of regular expressions.

### Matching fixed strings

Regular expressions are used to identify patterns within string
objects. Once found, these patterns can be used for tasks such
as extraction, substitution, or splitting the string into parts.

We will introduce the basic concepts through substitution and
show the other tasks at the end of the notebook.

To start, import the `re` module:

In [None]:
import re

We will use the function `re.sub` to replace every occurrence of a substring
with another string. Here, we will replace all spaces with dashes:

In [None]:
re.sub(" ", "-", "I am having fun with regular expressions! They are great!")

In [None]:
re.sub("fun", "FUN", "I am having fun with regular expressions! They are great!")

As we see, the first argument defines the pattern, the second the replacement,
and the third the string to operate on. Applied in sequence, substitutions can
clean up character data:

In [None]:
msg = "I am having fun with regular expressions! They are great!"
msg = re.sub(" ", "-", msg)
msg = re.sub("!", "", msg)
msg
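
As a minimal sketch, the same two substitutions can be wrapped in a small function so the cleaning steps can be reused. The name `clean` is a hypothetical helper introduced only for illustration; it is not part of the `re` module.

```python
import re

def clean(text):
    """Hypothetical helper: apply the two substitutions in order."""
    text = re.sub(" ", "-", text)  # every space becomes a dash
    text = re.sub("!", "", text)   # every exclamation mark is removed
    return text

cleaned = clean("I am having fun with regular expressions! They are great!")
print(cleaned)  # I-am-having-fun-with-regular-expressions-They-are-great
```

The order matters in general: each call operates on the output of the previous one.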

### Matching patterns

The power of regular expressions comes from the ability to match not just
fixed strings but patterns of strings. There is a whole language of regular
expressions; here I will show just a few of the most common examples.

The symbol `+` matches one or more repetitions of the character that precedes
it. Take the example here:

In [None]:
msg = "aardvark?"
re.sub("a", "A", msg)

And compare it to:

In [None]:
msg = "aardvark?"
re.sub("a+", "A", msg)

The expression `a+` matches both the single letter "a" and the letter pair "aa".
By default, `+` is greedy: it consumes as many repetitions as it can, so a run
such as "aa" is replaced once rather than twice.
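
To see the difference side by side, here is a small sketch that runs both patterns on the same string. It also uses `re.findall`, which is introduced near the end of this notebook, to list what `a+` actually matched. The variable names are only illustrative.

```python
import re

msg = "aardvark?"

# "a" matches one letter at a time, so "aa" is replaced twice.
one_at_a_time = re.sub("a", "A", msg)
print(one_at_a_time)  # AArdvArk?

# "a+" matches a whole run of "a", so "aa" is replaced once.
whole_runs = re.sub("a+", "A", msg)
print(whole_runs)     # ArdvArk?

# The substrings that "a+" matched, in order.
matches = re.findall("a+", msg)
print(matches)        # ['aa', 'a']
```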

We can describe a set of allowed characters using square brackets, `[]`. So to
match any sequence of digits we can use this:

In [None]:
msg = "1000x 2341y 1104z"
re.sub("[0123456789]+", "NUMBER", msg)

This reads: "replace any sequence of digits with the string 'NUMBER'". There
is a shortcut for this using the range notation `[0-9]`. Similarly, `[a-z]`
matches lowercase letters and `[A-Z]` matches uppercase letters.
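
The following sketch checks that the range shortcut behaves like the long form, and shows that ranges can be combined inside one pair of brackets. The second input string, `"Hello World 42"`, is made up for this example.

```python
import re

msg = "1000x 2341y 1104z"

# The long form and the range shortcut give the same result.
long_form = re.sub("[0123456789]+", "NUMBER", msg)
short_form = re.sub("[0-9]+", "NUMBER", msg)
print(short_form)  # NUMBERx NUMBERy NUMBERz

# Lowercase letters only.
letters = re.sub("[a-z]+", "LETTER", msg)
print(letters)     # 1000LETTER 2341LETTER 1104LETTER

# Two ranges in one set: any lowercase or uppercase letter.
mixed = re.sub("[a-zA-Z]+", "WORD", "Hello World 42")
print(mixed)       # WORD WORD 42
```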

Finally, when it is the first character inside square brackets, the symbol `^`
stands for **not**. So `[^a-z]` matches any single character that is not a
lowercase letter:

In [None]:
re.sub("[^a-z]", " ", "I am having fun with regular expressions! They are great!")
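
A negated set can be combined with `+` as well. The sketch below, on a short made-up string, compares replacing each non-lowercase character individually with replacing each run of them once.

```python
import re

msg = "Hi, Bob!"

# Each character that is not a lowercase letter becomes its own space.
each_char = re.sub("[^a-z]", " ", msg)
print(repr(each_char))  # ' i   ob '

# Each run of such characters collapses into a single space.
each_run = re.sub("[^a-z]+", " ", msg)
print(repr(each_run))   # ' i ob '
```

Without `+` the result has the same length as the input; with `+` the runs are collapsed.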

You may find that you want to match a character with a special meaning, such as the
actual caret symbol: `^`. To do this, precede the character with a backslash. Inside
an ordinary Python string the backslash itself is written as `\\`, so the pattern
below reaches `re` as `\^`:

In [None]:
re.sub("\\^", "-", "2^3")
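
Two alternatives are worth knowing, shown here as a brief sketch: a raw string (prefix `r`) lets you write the backslash only once, and `re.escape` builds the escaped pattern for you. Both are standard Python features, though neither is used elsewhere in this notebook.

```python
import re

# Ordinary string: the backslash has to be doubled.
doubled = re.sub("\\^", "-", "2^3")

# Raw string: the backslash is written once.
raw = re.sub(r"\^", "-", "2^3")

# re.escape adds the backslash automatically.
auto = re.sub(re.escape("^"), "-", "2^3")

print(doubled, raw, auto)  # 2-3 2-3 2-3
```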

### Application: HTML

A very common application of regular expressions is to match HTML tags,
which are contained between `<` and `>`. For example, `<a href="python.org">`.
To strip the HTML tags from a string, use this expression:

In [None]:
re.sub("<[^>]+>", "", "<a href='www.python.org'>click here!</a>")

Can you figure out exactly how this expression works?
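
One way to check your answer is to spell the pattern out piece by piece. The sketch below assumes the `re.VERBOSE` flag, which is not otherwise used in this notebook; it lets a pattern span several lines and carry comments. The variable name `tag` is only illustrative.

```python
import re

# The same pattern as "<[^>]+>", written one piece per line.
tag = re.compile(r"""
    <        # a literal opening angle bracket
    [^>]+    # one or more characters that are not a closing bracket
    >        # a literal closing angle bracket
""", re.VERBOSE)

stripped = tag.sub("", "<a href='www.python.org'>click here!</a>")
print(stripped)  # click here!
```

Because the middle piece refuses to match `>`, each match stops at the first closing bracket, so the two tags are removed separately and the text between them survives.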

### Find and split

As mentioned, there are other tasks we can do once a substring has been identified.
We could, for example, split a string apart wherever a substring is detected using
`re.split`:

In [None]:
re.split(" ", "I am having fun with regular expressions! They are great!")
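
The first argument of `re.split` is a pattern, not just a fixed string. As a sketch, splitting on runs of non-letters separates the words and drops the punctuation at the same time. Note the empty string at the end of the second result: the final "!" is treated as a separator with nothing after it.

```python
import re

msg = "I am having fun with regular expressions! They are great!"

# Splitting on a single space keeps punctuation attached to the words.
by_space = re.split(" ", msg)
print(by_space)
# ['I', 'am', 'having', 'fun', 'with', 'regular', 'expressions!', 'They', 'are', 'great!']

# Splitting on runs of non-letters removes the punctuation as well.
by_non_letters = re.split("[^a-zA-Z]+", msg)
print(by_non_letters)
# ['I', 'am', 'having', 'fun', 'with', 'regular', 'expressions', 'They', 'are', 'great', '']
```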

Or, extract just the matching substrings using `re.findall`:

In [None]:
re.findall("<[^>]+>", "<a href='www.python.org'>click here!</a>")

Both of these functions return list objects, which we will cover in the next
notebook.
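
As a small preview, and only a sketch, the result of `re.findall` can be stored in a variable and inspected like any other list. The variable names are illustrative.

```python
import re

# Every HTML tag found in the string, in order of appearance.
tags = re.findall("<[^>]+>", "<a href='www.python.org'>click here!</a>")
print(tags)       # ["<a href='www.python.org'>", '</a>']
print(len(tags))  # 2

# Every run of digits found in the string.
numbers = re.findall("[0-9]+", "1000x 2341y 1104z")
print(numbers)    # ['1000', '2341', '1104']
```

The matches come back as strings, even when they consist only of digits.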

-------

## Practice