# Regular Expression Tutorial

## What are they good for?

Anything a plain text search is good for, plus more!

### Finding lines in a file

Here I've used a regular expression to search for counties which are shires.

In [1]:
import re
counties = open("counties.txt").read().splitlines()
print(counties)
shires = [county for county in counties if re.search("shire", county)]
print(shires)

['Bedfordshire', 'Berkshire', 'Bristol', 'Buckinghamshire', 'Cambridgeshire', 'Cheshire', 'City of London', 'Cornwall', 'Cumbria', 'Derbyshire', 'Devon', 'Dorset', 'Durham', 'East Riding of Yorkshire', 'East Sussex', 'Essex', 'Gloucestershire', 'Greater London', 'Greater Manchester', 'Hampshire', 'Herefordshire', 'Hertfordshire', 'Isle of Wight', 'Kent', 'Lancashire', 'Leicestershire', 'Lincolnshire', 'Merseyside', 'Norfolk', 'North Yorkshire', 'Northamptonshire', 'Northumberland', 'Nottinghamshire', 'Oxfordshire', 'Rutland', 'Shropshire', 'Somerset', 'South Yorkshire', 'Staffordshire', 'Suffolk', 'Surrey', 'Tyne and Wear', 'Warwickshire', 'West Midlands', 'West Sussex', 'West Yorkshire', 'Wiltshire', 'Worcestershire']
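['Bedfordshire', 'Berkshire', 'Buckinghamshire', 'Cambridgeshire', 'Cheshire', 'Derbyshire', 'East Riding of Yorkshire', 'Gloucestershire', 'Hampshire', 'Herefordshire', 'Hertfordshire', 'Lancashire', 'Leicestershire', 'Lincolnshire', 'North Yorkshire', 'Northamptonshire', 'Nottinghamshire', 'Oxfordshire', 'Shropshire', 'South Yorkshire', 'Staffordshire', 'Warwickshire', 'West Yorkshire', 'Wiltshire', 'Worcestershire']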

### Advanced search and replace

Here I'm adding a comment character to the beginning of all my lines.

![searching for: ^ Replacing with //](https://raw.githubusercontent.com/ccouzens/regex_tutorial/master/advanced%20replace%20add%20comments.png)
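
The same edit can be sketched in Python with `re.sub`. This assumes the editor in the screenshot treats `^` as the start of every line; in Python that needs the `re.MULTILINE` flag. The `source` text is a made-up example, not the file from the screenshot.

```python
import re

# Hypothetical text to comment out; not taken from the screenshot.
source = "let a = 1;\nlet b = 2;\nlet c = a + b;"

# With re.MULTILINE, ^ matches at the start of every line,
# so the replacement text is inserted in front of each line.
commented = re.sub(r"^", "//", source, flags=re.MULTILINE)
print(commented)
```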

### Finding strings that conform

Here I've written a naive Python function that checks to see if an
affirmative word is in the input text.

In [2]:
import re

def affirmation(text):
    return re.search("yes|yarp|yeah|ok|affirmative", text) is not None

print(affirmation("no"))
print(affirmation("yarp"))

False
True
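
The function is naive because `re.search` looks for the pattern anywhere in the text, so an "ok" hidden inside a longer word still counts. Here's a rough sketch of a stricter check, using the anchors and grouping covered later in this tutorial. `strict_affirmation` is a hypothetical name, and insisting that the whole text is a single word is just one possible rule.

```python
import re

def affirmation(text):
    # Same naive check as above: matches anywhere in the text.
    return re.search("yes|yarp|yeah|ok|affirmative", text) is not None

def strict_affirmation(text):
    # Hypothetical stricter variant: the whole text must be one of the words.
    return re.search("^(?:yes|yarp|yeah|ok|affirmative)$", text) is not None

print(affirmation("broken"))         # "ok" is a substring of "broken"
print(strict_affirmation("broken"))
print(strict_affirmation("yarp"))
```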


### Finding strings that don't conform

Here I've written a Python function that checks your telephone number does
not contain letters.

In [3]:
import re

def is_telephone_number(text):
    # Telephone numbers cannot contain letters
    return re.search('[A-Za-z]', text) is None

print(is_telephone_number('01252 123 456'))
print(is_telephone_number('01252 foo'))

True
False
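
Letter ranges deserve some care. In ASCII, six punctuation characters sit between `Z` and `a`, so a range written as `[A-z]` matches those as well as letters, while `[A-Za-z]` matches letters only. A small sketch to show the difference (the `between` string is only there for illustration):

```python
import re

# The six ASCII characters that sit between "Z" and "a".
between = "[\\]^_`"

print(re.findall(r"[A-z]", between))     # every one of them matches
print(re.findall(r"[A-Za-z]", between))  # none of them match
```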


### Extracting information from text

Here I've written a Python function that extracts the date of birth from a
complex string about a person.

In [4]:
import re

def date_of_birth(text):
    matches = re.search("DOB: ([0-9/]+)", text)
    return matches[1]

einstein = "Name: Albert Einstein, Nationality: American, Weimar, Swiss, Prussian, DOB: 14/03/1879, Died: 18/04/1955"
print(date_of_birth(einstein))

14/03/1879
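
`re.search` returns `None` when nothing matches, so indexing the result raises a `TypeError` for text without a `DOB:` field. Here's a minimal sketch of a more forgiving variant. `date_of_birth_or_none` is a hypothetical name and the sample strings are invented.

```python
import re

def date_of_birth_or_none(text):
    # Hypothetical variant of date_of_birth that tolerates missing data.
    match = re.search("DOB: ([0-9/]+)", text)
    if match is None:
        return None
    # group(1) is the text captured by the first pair of parentheses.
    return match.group(1)

print(date_of_birth_or_none("Name: Example Person, DOB: 01/02/2003"))
print(date_of_birth_or_none("Name: Example Person"))
```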


As you can see, regular expressions can search, filter, validate, extract and
replace text, and they turn up in editors as well as in code.

## Limitations

Regular expression implementations vary.
The basic syntax can be expected to be well supported in all instances.
Advanced features may not always be present, or may work slightly differently.

Some problems can't be solved by a regular expression.
For example, we can't write a regular expression to determine whether a string has matching open and close brackets.

```
# Can't be determined by a regular expression!
(((()))) # good
(((())) # bad
```

For more information, see the computer science topic of [automata](https://en.wikipedia.org/wiki/Regular_language#Location_in_the_Chomsky_hierarchy).
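
A problem like this is usually easier with a few lines of ordinary code. Here's a minimal sketch that counts open brackets instead of using a regular expression. `brackets_balanced` is a hypothetical helper and it only handles round brackets.

```python
def brackets_balanced(text):
    # Hypothetical helper: track how many brackets are currently open.
    depth = 0
    for character in text:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:
                # A close bracket appeared before its matching open bracket.
                return False
    # Balanced only if every open bracket has been closed.
    return depth == 0

print(brackets_balanced("(((())))"))
print(brackets_balanced("(((()))"))
```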

The tooling you're using may force you to use regular expressions in a particular way.
For example, it may want you to provide a regular expression that matches positive input,
when it would be more convenient to write one that matches negative input.
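
As a rough sketch of the difference, the rule "contains no letters" can be written either way. Both function names are hypothetical, and the positive form uses anchors and negated character sets, which are covered later in this tutorial.

```python
import re

def no_letters_negative(text):
    # Negative form: look for a letter and succeed when none is found.
    return re.search("[A-Za-z]", text) is None

def no_letters_positive(text):
    # Positive form: the whole text must consist of non-letters.
    return re.search("^[^A-Za-z]*$", text) is not None

for sample in ["01252 123 456", "01252 foo"]:
    print(sample, no_letters_negative(sample), no_letters_positive(sample))
```

The positive form is the one to reach for when a tool only accepts "input must match this pattern".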

## Writing/Reading Regular Expressions

Mozilla have a good
[reference](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Writing_a_regular_expression_pattern).

### Simple patterns

If you want to search for an exact match, simply use what you're searching for as
the regular expression.

Here we're searching for counties that contain "York".


In [5]:
import re
counties = open("counties.txt").read().splitlines()
yorks = [county for county in counties if re.search("York", county)]
print(yorks)

['East Riding of Yorkshire', 'North Yorkshire', 'South Yorkshire', 'West Yorkshire']


However, if your search contains any of these characters:
`\^$[](){}*+.?|`
then you need to escape them.
These are known as special characters, because they have extra meaning.
We'll cover them later.

### Escaping special characters

If you want to match a special character, you need to prepend a `\` (backslash).

For example, if you want to search for `(text)` in this document, your regular
expression would be `\(text\)`.

Or if you want to search for file paths that match a Windows `C:\` drive, your
regular expression would be `C:\\`.
The backslash is escaped with a backslash!

Here we have an example of escaping the dollar character:

In [6]:
import re

def contains_dollars(message):
    # The r prefix makes a raw string, so Python passes the backslash through to re.
    return re.search(r"\$", message) is not None

print(contains_dollars("invoice for $20"))
print(contains_dollars("receipt for £20"))

True
False
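
When the text to search for comes from somewhere else, such as user input, Python's `re.escape` can add the backslashes for you. Here's a small sketch; the `needles` and `haystack` values are invented for illustration.

```python
import re

# Literal strings we want to find; each contains special characters.
needles = ["(text)", "C:\\", "$20"]
haystack = "see (text) in C:\\Users, invoice for $20"

for needle in needles:
    # re.escape returns a pattern that matches the needle literally.
    pattern = re.escape(needle)
    print(needle, "->", pattern, re.search(pattern, haystack) is not None)

# Without escaping, $ is treated as an anchor and nothing is found.
print(re.search("$20", haystack))
```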


### Matching at the start of a line

You can 'anchor' your regular expression to the start of a line by using the
`^` character.

In [7]:
import re

def starts_with_cake(text):
    return re.search("^cake", text) is not None

print(starts_with_cake("teacake"))
print(starts_with_cake("cakewalk"))

False
True


### Matching at the end of a line

You can 'anchor' your regular expression to the end of a line by using the
`$` character.

In [8]:
import re

def ends_with_cake(text):
    return re.search("cake$", text) is not None

print(ends_with_cake("spongecake"))
print(ends_with_cake("cakehouse"))

True
False


We can combine searching at the beginning and end of a line into a single regular expression.

In [9]:
import re

def is_cake(text):
    return re.search("^cake$", text) is not None

print(is_cake("coffeecake"))
print(is_cake("cake"))

False
True
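
One Python detail worth knowing: by default `^` and `$` anchor to the start and end of the whole string, which is the same thing as a line only when the text has a single line. To anchor to every line of a longer text, pass the `re.MULTILINE` flag. `re.fullmatch` is a close alternative to wrapping a pattern in `^` and `$`. A short sketch, with an invented `menu` string:

```python
import re

menu = "teacake\ncakewalk\ncake"

# By default ^ only matches at the very start of the string.
print(re.findall("^cake", menu))
# With re.MULTILINE, ^ also matches at the start of each line.
print(re.findall("^cake", menu, flags=re.MULTILINE))
# re.fullmatch requires the whole string to match the pattern.
print(re.fullmatch("cake", "cake") is not None)
print(re.fullmatch("cake", "coffeecake") is not None)
```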


### Matching multiple options

You can allow multiple options by using the `|` (pipe) character.

In [10]:
import re
counties = open("counties.txt").read().splitlines()
sides_and_hams = [county for county in counties if re.search("side|ham", county)]
print(sides_and_hams)

['Buckinghamshire', 'Durham', 'Merseyside', 'Northamptonshire', 'Nottinghamshire']


We can use non-capturing parentheses `(?:...)` if we want only part of the expression to be multiple choice.

In [11]:
import re
counties = open("counties.txt").read().splitlines()
berkshire_and_hampshire = [county for county in counties if re.search("(?:Berk|Hamp)shire", county)]
print(berkshire_and_hampshire)

['Berkshire', 'Hampshire']
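
To see why the parentheses matter, here's a small sketch comparing the pattern with and without them. The `places` list is invented for illustration.

```python
import re

places = ["Berkshire", "Hampshire", "Berkeley"]

# Without a group, | splits the whole pattern: "Berk" OR "Hampshire".
ungrouped = [place for place in places if re.search("Berk|Hampshire", place)]
# With a group, only "Berk"/"Hamp" varies and "shire" is always required.
grouped = [place for place in places if re.search("(?:Berk|Hamp)shire", place)]

print(ungrouped)
print(grouped)
```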


We can use two multiple-choice groups in the same expression.

In [12]:
import re
counties = open("counties.txt").read().splitlines()
double_vowel_counties = [county for county in counties if re.search("(?:a|e|i|o|u)(?:a|e|i|o|u)", county)]
print(double_vowel_counties)

['Cumbria', 'Gloucestershire', 'Greater London', 'Greater Manchester', 'Leicestershire', 'South Yorkshire', 'Tyne and Wear']


### Repeating ourselves

Regular expressions give us several ways of expressing the previous example
without the repetition.

The following are all equivalent:

```
(?:a|e|i|o|u)(?:a|e|i|o|u)
(?:a|e|i|o|u){2}
(?:a|e|i|o|u){2,2}
```

They all match two vowels immediately after each other.

More specifically, `{n}` matches the previous expression precisely n times.
`{n,m}` (no space after the comma) matches the previous expression between n and m times.


In [13]:
import re
counties = open("counties.txt").read().splitlines()
double_vowel_counties = [county for county in counties if re.search("(?:a|e|i|o|u){2}", county)]
print(double_vowel_counties)

['Cumbria', 'Gloucestershire', 'Greater London', 'Greater Manchester', 'Leicestershire', 'South Yorkshire', 'Tyne and Wear']
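
Here's a small sketch of how the two forms behave on runs of different lengths. The `text` string is invented for illustration.

```python
import re

text = "o oo ooo oooo"

# Exactly two: the run of four is found as two separate pairs.
print(re.findall("o{2}", text))
# Between two and three, preferring the longer match.
print(re.findall("o{2,3}", text))
```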


### Repeating ourselves indefinitely

We can use `{n,}` to repeat ourselves n or more times.

In [14]:
import re
counties = open("counties.txt").read().splitlines()
pattern = "^B(?:a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){8,}$"
long_b_counties = [county for county in counties if re.search(pattern, county)]
print(long_b_counties)

['Bedfordshire', 'Berkshire', 'Buckinghamshire']


The shortcut for `{1,}` is `+`.
The shortcut for `{0,}` is `*`.
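
A quick sketch of the difference between the two shortcuts, using an invented `words` list:

```python
import re

words = ["ac", "abc", "abbbc"]

# b* allows zero or more "b", so "ac" matches too.
print([word for word in words if re.search("^ab*c$", word)])
# b+ needs at least one "b".
print([word for word in words if re.search("^ab+c$", word)])
```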

### Character sets

We can use square brackets to list many single characters at once.

The previous example can be rewritten to use them.

In [15]:
import re
counties = open("counties.txt").read().splitlines()
pattern = "^B[abcdefghijklmnopqrstuvwxyz]{8,}$"
long_b_counties = [county for county in counties if re.search(pattern, county)]
print(long_b_counties)

['Bedfordshire', 'Berkshire', 'Buckinghamshire']


Within a character set, the following characters are special `]-\^`.

Dashes can be used to specify a range of characters.

In [16]:
import re
counties = open("counties.txt").read().splitlines()
pattern = "^B[a-z]{8,}$"
long_b_counties = [county for county in counties if re.search(pattern, county)]
print(long_b_counties)

['Bedfordshire', 'Berkshire', 'Buckinghamshire']
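
The characters that are special inside a set can still be matched literally by escaping them with a backslash. A minimal sketch, with an invented `sample` string:

```python
import re

sample = "a^b-c]d\\e"  # the text a^b-c]d\e

# Each special character is escaped with a backslash inside the set.
print(re.findall(r"[\^\-\]\\]", sample))
```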


### Negated character sets

We can use a caret `^` to negate a character set.

In [17]:
import re
counties = open("counties.txt").read().splitlines()
pattern = "^[^BCDEGHLNSW]"
uncommon_starts = [county for county in counties if re.search(pattern, county)]
print(uncommon_starts)

['Isle of Wight', 'Kent', 'Merseyside', 'Oxfordshire', 'Rutland', 'Tyne and Wear']
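
The caret only negates when it is the first character in the set. Anywhere else it is an ordinary caret, as this small sketch shows:

```python
import re

# Caret first: match anything that is NOT a digit.
print(re.findall("[^0-9]", "a1^b2"))
# Caret elsewhere: match a digit or a literal caret.
print(re.findall("[0-9^]", "a1^b2"))
```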


### Common character sets

There are shortcuts for common character sets.

```
\d = [0-9] Matches digits
\D = [^0-9] Matches non-digits
\s Matches white space such as spaces, tabs and new lines
\S = [^\s] Matches any character that isn't white space
\w = [A-Za-z0-9_] Matches alphanumeric characters and underscore
\W = [^\w] Matches anything but alphanumeric characters or underscore
. Matches any character except a new line (by default)
```
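
Here's a short sketch of these shortcuts in Python. The sample text is invented, and the patterns are written as raw strings (`r"..."`) so that Python leaves the backslashes alone. `re.DOTALL` is the Python flag that lets `.` match new lines too.

```python
import re

text = "Order 66 shipped\tto flat_4b\n"

print(re.findall(r"\d+", text))  # runs of digits
print(re.findall(r"\w+", text))  # runs of word characters
print(re.findall(r"\S+", text))  # runs of non-white-space
print(re.findall(r".", "a\nb"))  # . skips the new line by default
print(re.findall(r".", "a\nb", flags=re.DOTALL))
```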

## Further reading

I've covered the basic parts of regular expressions.
Please see the documentation below for more advanced features.

Microsoft .NET regular expression syntax https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference?view=netframework-4.7.2

Python regular expression syntax https://docs.python.org/3/library/re.html#regular-expression-syntax

JavaScript regular expression syntax https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Writing_a_regular_expression_pattern

## Questions

## Exercises

### IP address extraction

Modify the pattern below to search for IP addresses.

In [23]:
import re
hosts = open("hosts").read()
print(hosts)
pattern = r"local\w+"
matches = re.findall(pattern, hosts)
print(matches)

##
# Host File
#
# example file to use for IP addresses
##
127.0.0.1	localhost localapp
::1             localhost
10.100.7.3  myserver3.mynetwork
10.20.127.200  testserver8.testnetwork

['localhost', 'localapp', 'localhost']
