## Regular Expressions

This may look like more text searching, but regular expressions are useful in many other contexts too.

Regular expressions give you a way to both match pieces of text you are interested in, and extract information from them.

First import the regular expression library `re` -- this is built in to Python.

In [None]:
import re

## Searching and matching

A regular expression is a pattern. For a given piece of text, you can then ask the question, "does the pattern occur in this text?". The `re.search()` function asks this question.

Here's an example:

```python
pattern = r'mor+ning'
```

You can test this pattern against some text like this. Try these, in separate cells so you can see the results:

```python
print(re.search(pattern, 'morning'))
print(re.search(pattern, 'good morrrrrrning'))
print(re.search(pattern, 'mornin'))
```

The `re.match()` function is similar, except that the pattern must match at the very start of the string. Compare:

```python
print(re.match(pattern, 'good morrrrrrning'))
```

In the pattern `r'mor+ning'`, the `+` means "repeat one or more times". So any text containing something like `morning`, but with one or more `r` characters in a row, will match the pattern.

Modifiers like `+` always apply to the item directly before them -- so here the `+` refers only to the `r`.
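As a small sketch of both points above (the test strings are made up for illustration), the following shows that `+` affects only the `r`, and that `re.match()` fails where `re.search()` succeeds when the pattern does not start the string:

```python
import re

pattern = r'mor+ning'

# The + applies only to the 'r' directly before it.
extra_r = re.search(pattern, 'morrrning')   # matches: several r's are allowed
extra_o = re.search(pattern, 'moorning')    # None: the 'o' may not repeat
no_r = re.search(pattern, 'moning')         # None: at least one 'r' is needed

# re.search() looks anywhere in the string; re.match() only at the start.
found_anywhere = re.search(pattern, 'good morning')
found_at_start = re.match(pattern, 'good morning')   # None

print(extra_r, extra_o, no_r, sep='\n')
print(found_anywhere, found_at_start, sep='\n')
```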

## Match objects

When a regular expression function such as `search` is unsuccessful, it returns `None`. When it's successful, it returns a match object.

The match object itself doesn't tell us much! The documentation for match objects is here:

<https://docs.python.org/3/library/re.html#match-objects>

Try capturing a match object in a variable, and run some of its methods:

```python
m = re.search(pattern, 'good morrrrrrning')
m.start()
m.end()
m.group(0)
```

## The "result"

Sometimes you'll just want to know whether a string matches a regular expression. In that case, just look at the returned match object. If it exists, there was a match! If not, the result will be `None` (the special Python "null value").

In [None]:
m = re.search('cat', 'dogularity')
if m is None:
    print("No match found!")
else:
    print("It matches!")

## Exercise: 404

Change the first line of the cell above so that a match occurs.

## Getting information out of a match

For now, it's enough to know that:

* `m.start()` gives the start position of the match, in the original string
* `m.end()` gives the end of the match, in the original string. This follows the usual Python convention -- it refers to the position after the end of the match.
* `m.group(0)` gives the part of the string that matches (more about groups later)
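A minimal sketch of how these three methods relate, reusing the earlier example string: slicing the original text from `m.start()` to `m.end()` should give back exactly `m.group(0)`.

```python
import re

text = 'good morrrrrrning'
m = re.search(r'mor+ning', text)

print(m.start())    # 5, because 'good ' occupies positions 0 to 4
print(m.end())      # one past the last matched character
print(m.group(0))   # the matched text itself

# The start and end positions work directly as slice indices.
print(text[m.start():m.end()] == m.group(0))   # True
```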

## More regular expressions

Here are some more things you can use in regular expressions.

### Repetition

A question mark **`?`** indicates that something can appear 0 or 1 times. For example, the regular expression:

`colou?r`

will match `color` and `colour`. The `u` can appear zero times (`color`) or once (`colour`).


A plus **`+`** (which we've already seen) indicates that something can appear 1 or more times.


An asterisk **`*`** indicates that something can appear 0 or more times. For example:

`Moo*`

matches `Moo`, `Mooooooo`, `Moooooooooooooooo` etc., but also matches `Mo`.


**`{n}`** A number in curly brackets indicates that number of repeats. For example:

`Ba{4}`

matches `Baaaa`.


**`{n,m}`** Two numbers indicate a range of possible repeats. For example:

`Do{1,5}`

matches `Do`, `Doo`, `Dooo`, `Doooo`, and `Dooooo`.

(Note that unlike most Python ranges, it *does* include the second number!)
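To check these claims yourself, here is a small sketch. The helper `full_matches` is a hypothetical name invented for this example; it uses `re.fullmatch()`, which only succeeds when the pattern matches the entire string.

```python
import re

# Hypothetical helper: return the candidates that the pattern matches in full.
def full_matches(pattern, candidates):
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional_u = full_matches(r'colou?r', ['color', 'colour', 'colouur'])
zero_or_more = full_matches(r'Moo*', ['M', 'Mo', 'Moo', 'Mooooo'])
exactly_four = full_matches(r'Ba{4}', ['B' + 'a' * 3, 'B' + 'a' * 4, 'B' + 'a' * 5])
one_to_five = full_matches(r'Do{1,5}', ['D', 'Do', 'D' + 'o' * 5, 'D' + 'o' * 6])

print(optional_u)     # ['color', 'colour']
print(zero_or_more)   # ['Mo', 'Moo', 'Mooooo']
print(exactly_four)   # ['Baaaa']
print(one_to_five)    # ['Do', 'Dooooo']
```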

### Exercise: Repetition

First run the cell below to define the test function:

In [None]:
def test_re(r, include=(), exclude=()):
    # re.fullmatch() only succeeds if the pattern matches the whole string
    for i in include:
        if not re.fullmatch(r, i):
            print("Doesn't match included string: "+i)
    for x in exclude:
        if re.fullmatch(r, x):
            print("Matches excluded string: "+x)

In the following cells, change the variable `pattern` to match all the strings in `include` and not match the strings in `exclude`. Unlike `re.search` and `re.match`, it must match the *whole* string. Run the cell and it will test your pattern. If you don't see a message, your pattern was correct! Note that several different answers are possible for some of these exercises.

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['xyyyz', 'xyyyyz', 'xyyz'],
    exclude = ['xyz', 'xyyyyyz']
)

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['jk', 'ijk', 'ik', 'ij', ''],
    exclude = ['ii', 'jj', 'kk', 'ijjk']
)

### Groups

So far, the repetition modifiers (`?`, `+`, `*`, `{n}` and `{n,m}`) have applied only to single characters. To apply one to a sequence of characters, use a group, denoted by parentheses `()`. For example:

`(bye)+`

will match:

* `bye`
* `byebye`
* `byebyebye`

etc.
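The difference the parentheses make can be sketched like this (using `re.fullmatch()` so that the whole string has to match):

```python
import re

# Without a group, the + applies only to the final 'e'.
ungrouped_repeat = re.fullmatch(r'bye+', 'byebye')    # None
ungrouped_es = re.fullmatch(r'bye+', 'byeee')         # matches

# With a group, the + applies to the whole sequence 'bye'.
grouped_repeat = re.fullmatch(r'(bye)+', 'byebyebye')  # matches
grouped_es = re.fullmatch(r'(bye)+', 'byeee')          # None

print(ungrouped_repeat, ungrouped_es, sep='\n')
print(grouped_repeat, grouped_es, sep='\n')
```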

### Exercise: Groups

As before, change `pattern` to something that will match the strings in `include`, and fail to match the strings in `exclude`.

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['', 'hi hi ', 'hi hi hi '],
    exclude = ['hello']
)

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['good morning', 'good morrrrrrrrning', 'morning', 'morrrning'],
    exclude = ['good']
)

### Escaping and backslashes

If you want to match an *actual* parenthesis in a regular expression, use a backslash before it. This applies to any character with a special meaning:

`"hello \(why\?\)"`

To prevent these backslashes from being misinterpreted, it's wise to use the "raw" form of strings, with an 'r' at the beginning.

`r"hello \(why\?\)"`

(You may remember this style from the handling of Windows file paths containing backslashes.)

It's good practice to use this kind of string for any regular expression. Even if your expression doesn't contain a backslash to start with, put the `r` prefix on the string, and it will still work if you add one later!
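Here is a short sketch of what escaping changes, using a made-up test string. The last line illustrates why raw strings help: in an ordinary string, Python interprets some backslash sequences itself before the regular expression library ever sees them.

```python
import re

text = 'hello (why?)'

# Unescaped: ( ) create a group and ? makes the 'y' optional, so this
# pattern looks for 'hello wh' or 'hello why' and finds neither here.
unescaped = re.search(r'hello (why?)', text)

# Escaped: the parentheses and question mark are matched literally.
escaped = re.search(r'hello \(why\?\)', text)

print(unescaped)          # None
print(escaped.group(0))   # hello (why?)

# An ordinary string turns \n into one newline character;
# a raw string keeps the backslash and the 'n' as two characters.
print(len('\n'), len(r'\n'))   # 1 2
```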

### Groups for extracting information

Try this:

```python
text = "baaaaaaa mooooooo"
m = re.search(r'(ba+) (mo+)', text)
print(m.group(1))
print(m.group(2))
```

As you've already seen, group zero is the whole match. Further groups are matches of individual parenthesised groups within the expression, from left to right.

### Wildcards

A full stop **`.`** can match any character except (by default) a newline. For example, `.{8}` will match any 8 characters.

**`\w`** This matches any word character. For example, `z\w+a` will match any word starting with 'z' and ending in 'a'.

**`\d`** This matches any digit character. For example, `\d+` will match any run of one or more digits, such as a whole number.

**`\s`** This matches any whitespace character -- usually spaces and tabs, but also newlines.

All of these can be inverted by being capitalised -- so for example, `\D` matches any character which is *not* a digit.

The definitions of which characters belong to each of these categories are taken from the Unicode standard. For most scripts the result is what you would expect; for example, `\d` also matches digits from other writing systems:

```python
m = re.match(r'\d+', "٣٤٦٢")
```

This leads to a common regular expression idiom: since `.` means "any character" and `*` means "0 or more repeats", `.*` means "any amount of anything".

Therefore (for example) `q.*y` will match anything beginning with a q and ending in a y -- but the characters in between can be anything, not just alphabetic characters.
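The wildcards above can be tried out with a sketch like the following. The test strings are invented for illustration, and the last case relies on the default behaviour that `.` does not match a newline.

```python
import re

first_number = re.search(r'\d+', 'Room 12, floor 3').group(0)
non_digits = re.search(r'\D+', 'Room 12, floor 3').group(0)
z_word = re.search(r'z\w+a', 'a zebra crossing').group(0)

print(first_number)       # 12
print(repr(non_digits))   # 'Room '
print(z_word)             # zebra

# .* accepts anything between the q and the y, including nothing at all...
anything = re.fullmatch(r'q.*y', 'q 1-2 !y')
nothing_between = re.fullmatch(r'q.*y', 'qy')
# ...but by default . does not match a newline.
across_lines = re.fullmatch(r'q.*y', 'q\ny')

print(anything is not None, nothing_between is not None, across_lines)
```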

### More groups for extracting information

Now this is starting to look useful! Imagine a collaborator has given you some data files of recordings of participants, with filenames like this:

```
001_2016-08-12_control.wav
```

So the filename has a participant number, a date, and a condition / participant group, all separated by underscores `_`.

You could match each filename with:

```python
filename_re = r'(\d{3})_(\d{4}-\d{2}-\d{2})_(\w+)\.wav'
m = re.match(filename_re, filename)
```

So the match object `m` will now contain the participant number in `m.group(1)`, date in `m.group(2)`, and condition in `m.group(3)`.

Try this out, if you need to convince yourself!

In [None]:
filename = '001_2016-08-12_control.wav'
filename_re = r'(\d{3})_(\d{4}-\d{2}-\d{2})_(\w+)\.wav'
m = re.match(filename_re, filename)
m.groups()

### Exercise: Wildcards

As before, change `pattern` to something that will match the strings in `include`, and fail to match the strings in `exclude`.

In [None]:
# addresses!
pattern = ''
test_re(
    pattern,
    include = ['12 Something Grove', '1 A Street', '8 Thingy Avenue'],
    exclude = ['743', 'Buckingham Palace', 'Princes 123 Street']
)

In [None]:
# phone numbers
pattern = ''
test_re(
    pattern,
    include = ['07771234567', '0164 210 0001', '0131 444 5555'],
    exclude = ['1234', '13579864233', '0 16 42 10 00 01']
)

### Character classes

Square brackets **`[]`** can be used to enclose alternative single characters.

For example, `[abc]` will match any of `a`, `b`, or `c` -- so `[abc]+` matches any sequence of one or more `a`, `b`, or `c`, e.g. `aaaa`, `bacab`, `ccab`, `b` etc.

You can use wildcards in character classes. For example:

`[\d-]`

matches any digit or a hyphen `-`.
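As a quick sketch (the sentence being searched is made up), a character class combined with `+` can pick out a run of allowed characters:

```python
import re

all_abc = re.fullmatch(r'[abc]+', 'bacab')   # matches: only a, b and c appear
has_d = re.fullmatch(r'[abc]+', 'abcd')      # None: 'd' is not in the class

# Digits and hyphens together are enough to pick out an ISO-style date.
date_like = re.search(r'[\d-]+', 'Recorded on 2016-08-12.').group(0)

print(all_abc is not None, has_d)
print(date_like)   # 2016-08-12
```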

### Exercise: Character classes

As before, change `pattern` to something that will match the strings in `include`, and fail to match the strings in `exclude`.

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['1234321', '1', '3', '4242'],
    exclude = ['6', '632', '7891', '0']
)

In [None]:
pattern = ''
test_re(
    pattern,
    include = ['bat', 'bit', 'but', 'bet', 'bot'],
    exclude = ['bqt', 'brt']
)

## Alternatives

Last but not least, you can use the pipe character **`|`** to mark several alternatives in a regular expression.

For example, `r'cat|dog|fish'` matches *either* `cat`, `dog`, or `fish`.

If you want to put alternatives inside a larger expression, use a group. For example:

`r'Remember to feed the (cat|dog|fish)\.'`
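The sketch below shows why the group matters: without it, `|` splits the *whole* pattern into alternatives. The variable names and test sentences are invented for this example, and the full stop is escaped so that it matches a literal `.`.

```python
import re

sentence_re = r'Remember to feed the (cat|dog|fish)\.'

m = re.search(sentence_re, 'Remember to feed the dog.')
pet = m.group(1)   # the alternative that actually matched
no_match = re.search(sentence_re, 'Remember to feed the hamster.')

print(pet)        # dog
print(no_match)   # None

# Without a group, the alternatives are 'feed the cat' and 'dog'.
bare_dog = re.fullmatch(r'feed the cat|dog', 'dog')
feed_dog = re.fullmatch(r'feed the cat|dog', 'feed the dog')

print(bare_dog is not None, feed_dog)   # True None
```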

## Timing

Regular expressions are powerful, but also sometimes slower than alternative approaches. Let's look at how to test this.

You can time some code by typing

`%%timeit`

at the top of a cell.

This will run all the code in the cell many times, and show you the average.

To demonstrate this, try:

```python
%%timeit

x = 2 + 2
```

Be careful **not** to put any `print()` commands in a cell with `%%timeit`, as this will end up printing thousands of times!
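`%%timeit` only works in a notebook. If you are working in an ordinary Python script, the standard library's `timeit` module does a similar job. The following is a rough sketch, assuming a made-up test string and an arbitrary repeat count of 10000; the timings will vary from machine to machine.

```python
import re
import timeit

text = 'the quick brown cat'

# number= sets how many times each statement is run;
# the return value is the total time in seconds for all runs.
t_in = timeit.timeit(lambda: 'cat' in text, number=10000)
t_re = timeit.timeit(lambda: re.search('cat', text), number=10000)

# Printing happens once, outside the timed statements.
print(f"in operator: {t_in:.4f} s")
print(f"re.search:   {t_re:.4f} s")
```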

## Parsing dates

If we have a date in ISO format (YYYY-MM-DD), we could match it with a regular expression. Alternatively, we could use the string `.split()` method, or string slicing.

First, write code in a cell that uses regular expressions to extract the year, month and day. At the end you should have variables `year`, `month` and `day` with the relevant values in, as strings. Print these to show that the code is working correctly.

Now write code in another cell using one of the other two approaches to achieve the same result. Again, print these to show it produces the same result as the other code.

Finally, remove the `print()` commands from both cells. Add `%%timeit` at the top of each of them. Now run them again to compare the time taken by each approach.