# Regular Expression Tutorial

## What are they good for?

Most things standard searching is good for, plus more!

### Finding lines in a file

Here I've used a regular expression to search for counties which are shires.

In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
shires = [county for county in counties if re.search("shire", county)]
print(shires)

### Advanced search and replace

Here I'm adding a comment character to the beginning of all my lines.

![searching for: ^ Replacing with //](https://raw.githubusercontent.com/ccouzens/regex_tutorial/master/advanced%20replace%20add%20comments.png)

### Finding strings that conform

Here I've written a naive Python function that checks to see if an
affirmative word is in the input text.

In [None]:
import re

def affirmation(text):
    return re.search("yes|yarp|yeah|ok|affirmative", text) is not None

print(affirmation("no"))
print(affirmation("yarp"))

### Finding strings that don't conform

Here I've written a Python function that checks your telephone number does
not contain letters.

In [None]:
import re

def is_telephone_number(text):
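    # Telephone numbers cannot contain letters.
    # [A-Za-z] is used rather than [A-z], because the range A-z also
    # covers the punctuation between "Z" and "a", such as [ \ ] ^ _ and `.
    return re.search('[A-Za-z]', text) is None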

print(is_telephone_number('01252 123 456'))
print(is_telephone_number('01252 foo'))

### Extracting information from text

Here I've written a Python function that extracts the date of birth from a
complex string about a person.

In [None]:
import re

def date_of_birth(text):
    matches = re.search("DOB: ([0-9/]+)", text)
    return matches[1]

einstein = "Name: Albert Einstein, Nationality: American, Weimar, Swiss, Prussian, DOB: 14/03/1879, Died: 18/04/1955"
print(date_of_birth(einstein))

As you can see, regular expressions can search, validate, replace and extract
text, and they can be used in a variety of contexts, from text editors to
programming languages.

## Limitations

Regular expression implementations vary.
The basic syntax can be expected to be well supported in all instances.
Advanced features may not always be present, or may work slightly differently.

Some problems can't be solved by a regular expression.
For example, we can't write a regular expression to determine if a string has matching open and close brackets nested to any depth.

```
# Can't be determined by a regular expression!
(((()))) # good
(((())) # bad
```

For more information, please study the computer science topic [automata](https://en.wikipedia.org/wiki/Regular_language#Location_in_the_Chomsky_hierarchy).
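When a regular expression isn't the right tool, a few lines of ordinary code often are. Here is a minimal sketch of a bracket checker that counts depth instead of pattern matching. The function name `brackets_balanced` is my own for illustration, and it only considers round brackets.

```python
def brackets_balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            # A close bracket with nothing open can never be matched later.
            if depth < 0:
                return False
    # Balanced only if every open bracket was eventually closed.
    return depth == 0

print(brackets_balanced("(((())))"))  # True
print(brackets_balanced("(((()))"))   # False
```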

The tooling you're using may force you to use regular expressions in a particular way.
For example, it may want you to provide a regular expression that matches positive input,
even though it would be more convenient to write one that matches negative input.

## Writing/Reading Regular Expressions

Mozilla have a good
[reference](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Writing_a_regular_expression_pattern).

### Simple patterns

If you want to search for an exact match, simply use what you're searching for
as the regular expression.

Here we're searching for counties that contain "York".


In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
yorks = [county for county in counties if re.search("York", county)]
print(yorks)

However, if your search contains any of these characters:
`\^$[](){}*+.?|`
then you need to escape them.
These are known as special characters, because they have extra meaning.
We'll cover them later.

### Escaping special characters

If you want to match a special character, you need to prepend a `\`
(backslash).

For example, if you want to search for `(text)` in this document, your regular
expression would be `\(text\)`.

Or if you want to search for file paths that match a Windows `C:\` drive, your
regular expression would be `C:\\`.
The backslash is escaped with a backslash!

Here we have an example of escaping the dollar character:

In [None]:
import re

def contains_dollars(message):
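    # A raw string (r"...") stops Python treating \$ as a string escape,
    # so the regular expression engine receives the backslash unchanged.
    return re.search(r"\$", message) is not None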

print(contains_dollars("invoice for $20"))
print(contains_dollars("receipt for £20"))
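If the text you're searching for isn't known in advance, escaping it by hand isn't practical. As a sketch, Python's `re.escape` can do the escaping for you. The helper name `contains_literal` is my own for illustration.

```python
import re

def contains_literal(needle, haystack):
    # re.escape puts a backslash before every special character in needle,
    # so the whole needle is matched literally.
    return re.search(re.escape(needle), haystack) is not None

print(contains_literal("$20", "invoice for $20"))       # True
print(contains_literal("(text)", "search for (text)"))  # True
print(contains_literal("a.c", "abc"))                   # False
```

The last line prints `False` because the escaped `.` only matches a literal full stop, not any character.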

### Matching at the start of a line

You can 'anchor' your regular expression to the start of a line by using the
`^` character.

In [None]:
import re

def starts_with_cake(text):
    return re.search("^cake", text) is not None

print(starts_with_cake("teacake"))
print(starts_with_cake("cakewalk"))

### Matching at the end of a line

You can 'anchor' your regular expression to the end of a line by using the
`$` character.

In [None]:
import re

def ends_with_cake(text):
    return re.search("cake$", text) is not None

print(ends_with_cake("spongecake"))
print(ends_with_cake("cakehouse"))

We can combine searching at the beginning and end of a line into a single regular expression.

In [None]:
import re

def is_cake(text):
    return re.search("^cake$", text) is not None

print(is_cake("coffeecake"))
print(is_cake("cake"))
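One caveat, assuming Python's `re` module as used in this tutorial: by default `^` and `$` anchor to the start and end of the whole string, which is only the same as a line when the string holds a single line. For text with several lines, the `re.MULTILINE` flag makes the anchors apply to every line. A small sketch, with a made-up `shopping` string:

```python
import re

shopping = "tea\ncake\nbiscuits"

# By default the anchors apply to the whole string, so nothing is found.
print(re.search("^cake$", shopping))  # None

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
print(re.search("^cake$", shopping, re.MULTILINE) is not None)  # True
```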

### Matching multiple options

You can allow multiple options by using the `|` or pipe character.

In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
sides_and_hams = [county for county in counties if re.search("side|ham", county)]
print(sides_and_hams)

We can use non-capturing parentheses, `(?:...)`, if we want only part of the expression to be multiple choice.

In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
berkshire_and_hampshire = [county for county in counties if re.search("(?:Berk|Hamp)shire", county)]
print(berkshire_and_hampshire)
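To see why the grouping matters, here is a small sketch comparing the same alternatives with and without the parentheses. The `names` list is made up for illustration, so it doesn't depend on `counties.txt`.

```python
import re

names = ["Berkshire", "Hampshire", "Berkeley"]

# Without grouping, | splits the whole pattern: "Berk" OR "Hampshire".
ungrouped = [name for name in names if re.search("Berk|Hampshire", name)]
print(ungrouped)  # ['Berkshire', 'Hampshire', 'Berkeley']

# With (?:...), only "Berk" and "Hamp" are alternatives; "shire" is always required.
grouped = [name for name in names if re.search("(?:Berk|Hamp)shire", name)]
print(grouped)  # ['Berkshire', 'Hampshire']
```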

We can use two multiple-choice groups in the same expression.

In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
double_vowel_counties = [county for county in counties if re.search("(?:a|e|i|o|u)(?:a|e|i|o|u)", county)]
print(double_vowel_counties)

### Repeating ourselves

Regular expressions give us several ways of expressing the previous example
without the repetition.

The following are all equivalent:

```
(?:a|e|i|o|u)(?:a|e|i|o|u)
(?:a|e|i|o|u){2}
(?:a|e|i|o|u){2,2}
```

They all match two vowels immediately after each other.

More specifically, `{n}` matches the previous expression precisely n times.
`{n,m}` matches the previous expression between n and m times (inclusive).


In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
double_vowel_counties = [county for county in counties if re.search("(?:a|e|i|o|u){2}", county)]
print(double_vowel_counties)
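The cell above only uses the exact-count form. As a sketch of the range form, this looks for runs of three or four vowels, using a made-up `words` list rather than `counties.txt`. Note that Python expects no space after the comma inside the braces.

```python
import re

words = ["book", "queue", "cat", "beautiful"]

# Three or four vowels in a row. "book" only has two, so it is left out.
pattern = "(?:a|e|i|o|u){3,4}"
matches = [word for word in words if re.search(pattern, word)]
print(matches)  # ['queue', 'beautiful']
```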

### Repeating ourselves indefinitely

We can use `{n,}` to repeat ourselves n or more times.

In [None]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
pattern = "^B(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){8,}$"
long_b_counties = [county for county in counties if re.search(pattern, county)]
print(long_b_counties)

The shortcut for `{1,}` is `+`.
The shortcut for `{0,}` is `*`.
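As a quick sketch of the difference between the two shortcuts, the following compares `o+` (one or more) with `o*` (zero or more). The helper `matches` and the example words are my own for illustration.

```python
import re

def matches(pattern, text):
    return re.search(pattern, text) is not None

# "o+" needs at least one "o".
print(matches("^go+al$", "goal"))    # True
print(matches("^go+al$", "goooal"))  # True
print(matches("^go+al$", "gal"))     # False

# "o*" is also satisfied by no "o" at all.
print(matches("^go*al$", "gal"))     # True
```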