# Regular Expressions

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

## The rules of regex

Before we start writing patterns, it helps to cover some groundwork on *how* pattern strings are put together. The first rule is that the *order* of the characters in a pattern matters: the pattern is matched character by character, in sequence. For example, the regex string:

    'Hello'

will match any string that contains `'Hello'`, whether the string is exactly `'Hello'` or `'Hello'` appears somewhere inside it (when searching; `match()`, introduced later, only looks at the start of the string). REs are read **left to right**, and by default a search reports the *leftmost* (first) occurrence within a string.
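
As a small illustration of the point above (the sample sentences are made up for this sketch), `search()` finds `'Hello'` anywhere in a string and reports the leftmost occurrence, whereas `fullmatch()` only succeeds when the whole string is `'Hello'`:

```python
import re

pattern = re.compile("Hello")

# search() scans the whole string and reports the leftmost occurrence
first = pattern.search("Oh, Hello there. Hello again")
print(first.span())  # (4, 9)

# fullmatch() succeeds only if the entire string is 'Hello'
whole = pattern.fullmatch("Hello")
partial = pattern.fullmatch("Hello there")
print(whole is not None, partial is not None)  # True False

# Order matters: the same letters in a different order do not match
reversed_result = pattern.search("olleH")
print(reversed_result)  # None
```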

### Repeats of characters

It is very common to work with strings that have a pre-defined, extractable structure, such as a datetime, a URL, or a date of birth. REs use a few qualifier characters (often called *quantifiers*) to indicate that the preceding character or group repeats, either an unbounded number of times or within specified bounds.

| Qualifier | Example | Description |
| ------ | ---- | --------------- |
| `*` | `a*` | Match 'a' repeated zero or more times |
| `+` | `ab+` | Match 'a' followed by 'b' repeated one or more times |
| `?` | `abc?` | Match 'ab' followed by zero or one 'c' |
| `{n}` | `Hello{5}` | Match 'Hell' followed by 'o' repeated exactly 5 times |
| `{m,n}` | `atgc{2,5}` | Match 'atg' followed by 'c' repeated between 2 and 5 times |

You may notice that in each example the qualifier applies only to the **last** character before it, not to the whole sequence. This is important to remember when designing regex strings.
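
The qualifiers in the table can be tried out with a short sketch like the one below. The helper `fully_matches` is a hypothetical name introduced only for this illustration; it wraps `re.fullmatch()` so that the whole test string has to fit the pattern.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches("a*", ""))                 # True: zero repeats are allowed
print(fully_matches("ab+", "abbb"))            # True
print(fully_matches("ab+", "abab"))            # False: + applies to 'b' only
print(fully_matches("abc?", "ab"))             # True: 'c' is optional
print(fully_matches("Hello{5}", "Hellooooo"))  # True: exactly five 'o'
print(fully_matches("Hello{5}", "Hello"))      # False: only one 'o'
print(fully_matches("atgc{2,5}", "atgccc"))    # True: three 'c'
print(fully_matches("atgc{2,5}", "atgc"))      # False: only one 'c'
```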

### Selections of different characters

Being able to define a fixed character that repeats is all well and good, but in practice we often want to match a **range** of characters, such as any numerical character between 0 and 9, or any letter between 'a' and 'z'.

If we refer back to our *date of birth* example, we want to construct a regex string that follows the pattern `YYYY/MM/DD`, where `Y` refers to the year, `M` to the month and `D` to the day. To do this, we want to ensure that the three groups (*year*, *month* and *day*) consist of **numerical** characters between 0 and 9, so a fixed character is inappropriate.

This is achieved by selecting the characters of interest within the *square-bracket notation* `[]`.

| Example | Description | 
| ------ | --------------- |
| `[abc]` | Matches a, b or c |
| `[a-z]` | Matches any single lowercase character a-z |
| `[0-9]?` | Matches any digit 0-9 zero or one time |
| `[ATCG]+` | Matches A, T, C or G one or more times |
| `[A-Za-z0-9_]*` | Matches any uppercase or lowercase letter, digit or underscore zero or more times |

If you use a qualifier (`?`, `+`, `*`), place it *outside* the square brackets; inside the brackets these characters lose their special meaning and are matched literally.
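
A brief sketch of character classes in action, using made-up test strings. The last line shows what happens when a qualifier is placed inside the brackets: it is treated as an ordinary character.

```python
import re

letters = re.findall("[abc]", "a cab")
digits = re.findall("[0-9]", "a1b22")
print(letters)  # ['a', 'c', 'a', 'b']
print(digits)   # ['1', '2', '2']

dna_ok = re.fullmatch("[ATCG]+", "GATTACA") is not None
dna_bad = re.fullmatch("[ATCG]+", "GATTAXA") is not None
print(dna_ok, dna_bad)  # True False

# Inside the brackets, '+' is just a literal plus sign
literal_plus = re.findall("[a+]", "a+b")
print(literal_plus)  # ['a', '+']
```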

Coming back to the *date of birth* example, our RE may look like this:

    [0-9]{4}\/[0-9]{2}\/[0-9]{2}
    
We're looking for 4 digits between 0-9 for the *year*, followed by a slash. In Python's `re`, `/` is not a special character, so the `\` before it is optional: `\/` and `/` both match a literal slash. We then repeat the idea with two digits for the *month*, a slash, and two digits for the *day*. By learning just these two principles, we can already construct coherent REs that match and manipulate real-world strings.

In [1]:
import re

In [10]:
dobs = ["1970/12/30", "1984/09/17", "1990/04/03", "2001/02/11"]
# compile our regex pattern (raw string, so the backslashes reach the regex engine unchanged)
reg = re.compile(r"[0-9]{4}\/[0-9]{2}\/[0-9]{2}")
[reg.match(d) for d in dobs]

[<_sre.SRE_Match object; span=(0, 10), match='1970/12/30'>,
 <_sre.SRE_Match object; span=(0, 10), match='1984/09/17'>,
 <_sre.SRE_Match object; span=(0, 10), match='1990/04/03'>,
 <_sre.SRE_Match object; span=(0, 10), match='2001/02/11'>]
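
One caveat worth sketching, with made-up inputs: `match()` anchors only at the start of the string, and the pattern checks the *shape* of the text rather than whether the date exists. If the whole string must fit the pattern, `fullmatch()` is one option.

```python
import re

# '/' needs no escaping in Python's re, so it is written plainly here
reg = re.compile(r"[0-9]{4}/[0-9]{2}/[0-9]{2}")

# match() only anchors at the start, so trailing text is tolerated
loose = reg.match("1970/12/30 (approx.)") is not None
# fullmatch() requires the whole string to fit the pattern
strict = reg.fullmatch("1970/12/30 (approx.)") is not None
exact = reg.fullmatch("1970/12/30") is not None
print(loose, strict, exact)  # True False True

# The pattern checks the shape only, not whether the date is real
impossible = reg.fullmatch("9999/99/99") is not None
print(impossible)  # True
```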

### Capture Groups

Matching someone's date of birth helps to **validate** that the input string properly fulfills the format of a date of birth, but what about if one wishes to extract the *individual* elements from the date of birth? 

Let's say we're only interested in the *year* of birth. We can use parentheses `()` to mark a **capture group**, i.e. a subgroup that is returned as output from the RE search.

| Example | Description | 
| ------ | --------------- |
| `(abc)` | Matches and captures `abc` |
| `([a-z]+)` | Matches and captures any lowercase character one or more times |
| `([a-z])+` | Matches any lowercase character one or more times, captures only the last character matched |
| `(ab)*` | Matches `ab` zero or more times, captures the last repetition (if any) |
| `(abc)*[A-Z]+(abc)*` | Matches `abc` repeated zero or more times on either side of one or more uppercase letters, capturing the last repetition on each side |

A given RE can have more than one *capture group*, which means a given match or search can return zero or more subgroup results.

If we apply this to our *date of birth* RE, we can extract the *year*, *month* and *day* simultaneously:

In [17]:
reg = re.compile(r"([0-9]{4})\/([0-9]{2})\/([0-9]{2})")
m = reg.match(dobs[0])
m.groups()

('1970', '12', '30')
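
A short sketch of what can be done with the captured groups, and of the difference between placing a qualifier inside or outside the parentheses. The conversion to integers is just one possible use, not part of the original example.

```python
import re

reg = re.compile(r"([0-9]{4})/([0-9]{2})/([0-9]{2})")
m = reg.match("1970/12/30")

# Groups are numbered from 1; group(0) is the whole match
year, month, day = (int(part) for part in m.groups())
print(year, month, day)  # 1970 12 30
print(m.group(0))        # 1970/12/30

# A qualifier outside the parentheses repeats the group,
# and only the last repetition is kept
last_only = re.match(r"([a-z])+", "abc").group(1)
everything = re.match(r"([a-z]+)", "abc").group(1)
print(last_only, everything)  # c abc
```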

### Naming Capture Groups

REs give the opportunity to provide a name to each capture group, such that it can be referenced later. 

This is achieved by placing a rather strange-looking tag at the start of a given capture group: `?P<name>`, where `name` must be a valid Python identifier. It can contain letters, digits and the underscore `_`, but it cannot start with a digit or contain special characters such as `<`,`>`,`-`,`\`,`'`,`:`,`;`,`,` and so on.

In [20]:
reg = re.compile(r"(?P<year>[0-9]{4})\/(?P<month>[0-9]{2})\/(?P<day>[0-9]{2})")
m = reg.match(dobs[1])
m.groupdict()

{'day': '17', 'month': '09', 'year': '1984'}
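
Besides `groupdict()`, individual named groups can be looked up directly. The sketch below shows three equivalent ways of reaching a group; subscript access (`m["month"]`) assumes Python 3.6 or later.

```python
import re

reg = re.compile(r"(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/(?P<day>[0-9]{2})")
m = reg.match("1984/09/17")

by_name = m.group("year")
by_subscript = m["month"]   # subscript access, Python 3.6+
by_number = m.group(3)      # named groups keep their numbers too
print(by_name, by_subscript, by_number)  # 1984 09 17
```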

### Non-capturing Groups

Let's say we need to include a grouping of characters in an RE, but do **not** want to extract that group as a capture group. This is useful when you wish to choose between two alternative groupings of characters, or to define a group that a qualifier can be attached to, without extracting it. REs have another rather non-intuitive way of handling this use case: `(?:...)`.

| Example | Description | 
| ------ | --------------- |
| `(?:abc)+(def)` | Matches but does not capture `abc` one or more times, matches and captures `def`|
| `(?:ATG)[ATCG]+` | Matches `ATG` followed by one or more A, T, C or Gs; nothing is captured |
| `((?:https?)\|(?:ftp))` | Matches and captures `http`, `https` or `ftp`; the inner alternatives are non-capturing groups |
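
The first and last rows of the table can be checked with a quick sketch (the test strings are invented for illustration):

```python
import re

# Only the second group is captured
m = re.match(r"(?:abc)+(def)", "abcabcdef")
captured = m.groups()
print(captured)  # ('def',)

# The outer group captures whichever alternative matched
scheme = re.compile(r"((?:https?)|(?:ftp))")
schemes = [scheme.match(u).group(1) for u in ["http://a", "https://b", "ftp://c"]]
print(schemes)  # ['http', 'https', 'ftp']
```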

## Simple Patterns

We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.

In [5]:
re.search("e", "e")

<_sre.SRE_Match object; span=(0, 1), match='e'>

## Using Regular Expressions

Now that we’ve looked at some simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [None]:
p = re.compile("ab*")
p

`re.compile()` also accepts an optional flags argument, used to enable various special features and syntax variations. We’ll go over the available settings later, but for now a single example will do:

In [None]:
p = re.compile('ab*', re.IGNORECASE)
p
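
To see what the flag changes, here is a minimal sketch comparing the same pattern compiled with and without `re.IGNORECASE` (the test string is made up):

```python
import re

case_sensitive = re.compile("ab*")
case_insensitive = re.compile("ab*", re.IGNORECASE)

strict_result = case_sensitive.match("ABBB")
relaxed_result = case_insensitive.match("ABBB")
print(strict_result)           # None
print(relaxed_result.group())  # ABBB
```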

The RE is passed to `re.compile()` as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the `re` module is simply a C extension module included with Python, just like the socket or zlib modules.

Putting REs in strings keeps the Python language simpler, but has one disadvantage which is the topic of the next section.

## The Backslash Plague

As stated earlier, regular expressions use the backslash character (`'\'`) to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.

Let’s say you want to write a RE that matches the string `\section`, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, so the string that must be passed to `re.compile()` is `\\section`. However, to express this as a Python string literal, both backslashes must be escaped again.

| Characters | Stage | 
| ------ | ---- |
| `\section` | Text string to be matched |
| `\\section` | Escaped backslash for `re.compile()` |
| `\\\\section` | Escaped backslashes for a string literal |

In short, to match a literal backslash, one has to write `'\\\\'` as the RE string, because the regular expression must be `\\`, and each backslash must be expressed as `\\` inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with `r`, so `r"\n"` is a two-character string containing `'\'` and `'n'`, while `"\n"` is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.

In addition, escape sequences that are valid in regular expressions but not in Python string literals (such as `\d` or `\s`) now trigger a `DeprecationWarning` when written in a regular string literal (a `SyntaxWarning` from Python 3.12 onwards) and will eventually become a `SyntaxError`. Such patterns will therefore stop working unless you use raw string notation or escape the backslashes.

| Regular String | Raw String | 
| ------ | ---- |
| `"ab*"` | `r"ab*"` |
| `"\\\\section"` | `r"\\section"` |
| `"\\w+\\s+\\1"` | `r"\w+\s+\1"` |
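
The equivalences in the table can be verified directly. The sketch below compares the regular and raw spellings and then uses the raw form to find `\section` in a made-up LaTeX line.

```python
import re

# A raw string keeps the backslash; a regular string interprets it
cooked_len, raw_len = len("\n"), len(r"\n")
print(cooked_len, raw_len)  # 1 2

# Two spellings of the same two-backslash pattern
same_pattern = "\\\\section" == r"\\section"
print(same_pattern)  # True

latex = r"\section{Introduction}"
found = re.search(r"\\section", latex).group()
print(found)  # \section
```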

## Performing Matches

Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones will be covered here; consult the re docs for a complete listing.

| Method/Attribute | Purpose | 
| ------ | ---- |
| `match()` | Determine if the RE matches at the beginning of the string. |
| `search()` | Scan through a string, looking for any location where this RE matches. |
| `findall()` | Find all substrings where the RE matches, and return them as a list. |
| `finditer()` | Find all substrings where the RE matches, and return them as an iterator. |

In [None]:
p = re.compile("[a-z]+")
p

Now, you can try matching various strings against the RE `[a-z]+`. An empty string shouldn’t match at all, since `+` means ‘one or more repetitions’. `match()` should return `None` in this case, which will cause the interpreter to print no output. You can explicitly print the result of `match()` to make this clear.

In [None]:
print(p.match(""))

Now, let’s try it on a string that it should match, such as `tempo`. In this case, `match()` will return a match object, so you should store the result in a variable for later use.

In [None]:
m = p.match("tempo")
print(m)

Now you can query the match object for information about the matching string. Match object instances also have several methods and attributes; the most important ones are:

| Method/Attribute | Purpose | 
| ------ | ---- |
| `group()` | Return the string matched by the RE. |
| `start()` | Return the starting position of the match. |
| `end()` | Return the ending position of the match. |
| `span()` | Return a tuple containing the (start, end) positions of the match. |

Trying these methods will soon clarify their meaning:

In [None]:
m.group()

In [None]:
m.start(), m.end()

In [None]:
m.span()

`group()` returns the substring that was matched by the RE. `start()` and `end()` return the starting and ending index of the match. `span()` returns both start and end indexes in a single tuple. Since the `match()` method only checks if the RE matches at the start of a string, `start()` will always be zero. However, the `search()` method of patterns scans through the string, so the match may not start at zero in that case.

In [None]:
print(p.match('::: message'))

In [None]:
m = p.search('::: message'); print(m)

In [None]:
m.group()

In [None]:
m.span()

In actual programs, the most common style is to store the match object in a variable, and then check if it was `None`. This usually looks like:

In [None]:
p = re.compile(r'\d+')
m = p.match( 'string goes here' )
if m:
    print('Match found: ', m.group())
else:
    print('No match')

Two pattern methods return all of the matches for a pattern. `findall()` returns a list of matching strings:

In [None]:
p.findall("12 drummers drumming, 11 pipers piping, 10 lords a-leaping")

The `r` prefix, which makes the literal a raw string, is needed in this example because escape sequences that Python does not recognize in a normal “cooked” string literal (even though the regular expression engine does) now result in a `DeprecationWarning` and will eventually become a `SyntaxError`. See The Backslash Plague.

`findall()` has to create the entire list before it can be returned as the result. The `finditer()` method returns a sequence of match object instances as an iterator:

In [None]:
iterator = p.finditer('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
iterator

In [None]:
for match in iterator:
    print(match.span())

## Module-Level Functions

You don’t have to create a pattern object and call its methods; the re module also provides top-level functions called `match()`, `search()`, `findall()`, `sub()`, and so forth. These functions take the same arguments as the corresponding pattern method with the RE string added as the first argument, and still return either None or a match object instance.

In [None]:
print(re.match(r'From\s+', 'Fromage amk'))

In [None]:
re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')

Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE won’t need to parse the pattern again and again.

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
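
The two styles can be compared side by side. This is only a sketch with made-up input lines; `from_re` is a name chosen for the illustration.

```python
import re

lines = ["From amk Thu May 14", "Fromage amk", "From  guido"]

# Module-level function: the pattern string is passed on every call
module_level = [re.match(r"From\s+", line) is not None for line in lines]

# Pre-compiled pattern: compile once, reuse the object inside the loop
from_re = re.compile(r"From\s+")
precompiled = [from_re.match(line) is not None for line in lines]

print(module_level)                 # [True, False, True]
print(module_level == precompiled)  # True
```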

## More Pattern Power

So far we’ve only covered a part of the features of regular expressions. In this section, we’ll cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.

`|`

Alternation, or the “or” operator. If `A` and `B` are regular expressions, `A|B` will match any string that matches either `A` or `B`. `|` has very low precedence in order to make it work reasonably when you’re alternating multi-character strings. `Crow|Servo` will match either `'Crow'` or `'Servo'`, not `'Cro'`, a `'w'` or an `'S'`, and `'ervo'`.

To match a literal `'|'`, use `\|`, or enclose it inside a character class, as in `[|]`.
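
A minimal sketch of alternation and of the two ways to match a literal `|`; the sample strings are invented for illustration.

```python
import re

names = re.findall(r"Crow|Servo", "Crow, Servo and Crowbar")
print(names)  # ['Crow', 'Servo', 'Crow']

# A literal '|' needs escaping or a character class
escaped = re.split(r"\|", "a|b|c")
in_class = re.split(r"[|]", "a|b|c")
print(escaped, in_class)  # ['a', 'b', 'c'] ['a', 'b', 'c']
```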
    
`^`

Matches at the beginning of lines. Unless the `MULTILINE` flag has been set, this will only match at the beginning of the string. In `MULTILINE` mode, this also matches immediately after each newline within the string.

For example, if you wish to match the word `From` only at the beginning of a line, the RE to use is `^From`.

In [None]:
print(re.search('^From', 'From Here to Eternity'))  


In [None]:
print(re.search('^From', 'Reciting From Memory'))

To match a literal `'^'`, use `\^`.

`$`

Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.
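
The effect of the `MULTILINE` flag on `^` can be sketched with a made-up three-line string:

```python
import re

text = "From here\nTo there\nFrom everywhere"

# Without the flag, ^ matches only at the start of the string
default_mode = re.findall(r"^From", text)
# With re.MULTILINE, ^ also matches right after each newline
multiline_mode = re.findall(r"^From", text, re.MULTILINE)

print(default_mode)    # ['From']
print(multiline_mode)  # ['From', 'From']
```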

In [None]:
print(re.search('}$', '{block}'))  

In [None]:
print(re.search('}$', '{block} '))

In [None]:
print(re.search('}$', '{block}\n')) 

To match a literal `'$'`, use `\$` or enclose it inside a character class, as in `[$]`.

`\A`

Matches only at the start of the string. When not in `MULTILINE` mode, `\A` and `^` are effectively the same. In `MULTILINE` mode, they’re different: `\A` still matches only at the beginning of the string, but `^` may match at any location inside the string that follows a newline character.

`\Z`

Matches only at the end of the string.
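
A small sketch contrasting `$` with `\Z`, and `^` with `\A`, on made-up strings:

```python
import re

text = "{block}\n"
# '$' also matches just before a trailing newline
dollar = re.search(r"}$", text) is not None
# '\Z' matches only at the very end of the string
end_only = re.search(r"}\Z", text) is not None
print(dollar, end_only)  # True False

multi = "first\nsecond"
caret = re.findall(r"^\w+", multi, re.MULTILINE)
start_only = re.findall(r"\A\w+", multi, re.MULTILINE)
print(caret)       # ['first', 'second']
print(start_only)  # ['first']
```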
    
`\b`

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches `class` only when it’s a complete word; it won’t match when it’s contained inside another word.

In [None]:
p = re.compile(r'\bclass\b')
print(p.search('no class at all'))

In [None]:
print(p.search('the declassified algorithm'))

In [None]:
print(p.search('one subclass is'))
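
`\b` is also a place where the raw string prefix matters: in a regular Python string literal, `"\b"` is the backspace character, so the pattern silently means something else. A short sketch of the difference:

```python
import re

raw = re.compile(r"\bclass\b")    # \b is a regex word boundary
cooked = re.compile("\bclass\b")  # "\b" is a backspace character here

raw_span = raw.search("no class at all").span()
cooked_result = cooked.search("no class at all")
print(raw_span)       # (3, 8)
print(cooked_result)  # None

# The cooked pattern only matches text that contains literal backspaces
backspace_span = cooked.search("\b" + "class" + "\b").span()
print(backspace_span)  # (0, 7)
```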

## Grouping

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ':', like this:

    From: author@example.com
    User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
    MIME-Version: 1.0
    To: editor@example.com
    
This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header’s value.
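
One possible way to write such an expression is sketched below. The pattern `header_re` is an assumption made for this illustration, not a full RFC-822 parser: group 1 takes everything up to the first colon, and group 2 takes the rest of the line.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 the value
header_re = re.compile(r"([^:]+):\s*(.*)")

lines = [
    "From: author@example.com",
    "User-Agent: Thunderbird 1.5.0.9 (X11/20061227)",
    "MIME-Version: 1.0",
]

parsed = [header_re.match(line).groups() for line in lines]
for name, value in parsed:
    print(name, "->", value)
```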

Groups are marked by the `'('`, `')'` metacharacters. `'('` and `')'` have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as `*`, `+`, `?`, or `{m,n}`. For example, `(ab)*` will match zero or more repetitions of ab.

In [26]:
p = re.compile('(ab)*')
print(p.match('ababababab').span())

(0, 10)


Groups indicated with `'('`, `')'` also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to `group()`, `start()`, `end()`, and `span()`. Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument. Later we’ll see how to express groups that don’t capture the span of text that they match.

In [24]:
p = re.compile('(a)b')
m = p.match('ab')
m.group()

'ab'

In [25]:
m.group(0)

'ab'

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

In [21]:
p = re.compile('(a(b)c)d')
m = p.match('abcd')
m.group(0)

'abcd'

In [22]:
m.group(1)

'abc'

In [23]:
m.group(2)

'b'

`group()` can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups.

In [None]:
m.group(2,1,2)

The `groups()` method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

In [None]:
m.groups()
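
The position methods accept a group number as well. A brief sketch using the same nested pattern:

```python
import re

m = re.match(r"(a(b)c)d", "abcd")

print(m.span())               # (0, 4)  group 0, the whole match
print(m.span(1))              # (0, 3)  'abc'
print(m.start(2), m.end(2))   # 1 2     'b'
print(m.group(2, 1, 2))       # ('b', 'abc', 'b')
print(m.groups())             # ('abc', 'b')
```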

## Non-capturing and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers.

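Perl 5 added several features to standard regular expressions, and its developers could not introduce new single-character metacharacters or new `\` sequences without making Perl’s regexes confusingly different from standard REs. The solution chosen by the Perl developers was to use `(?...)` as the extension syntax. `?` immediately after a parenthesis was a syntax error because the `?` would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the `?` indicate what extension is being used, so `(?=foo)` is one thing (a positive lookahead assertion) and `(?:foo)` is something else (a non-capturing group containing the subexpression `foo`).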

Python supports several of Perl’s extensions and adds an extension syntax to Perl’s extension syntax. If the first character after the question mark is a P, you know that it’s an extension that’s specific to Python.

Now that we’ve looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs.

Sometimes you’ll want to use a group to denote a part of a regular expression, but aren’t interested in retrieving the group’s contents. You can make this fact explicit by using a non-capturing group: `(?:...)`, where you can replace the `...` with any other regular expression.

In [None]:
m = re.match("([abc])+", "abc")
m.groups()

In [None]:
m = re.match("(?:[abc])+", "abc")
m.groups()

Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as `*`, and nest it within other groups (capturing or non-capturing). `(?:...)` is particularly useful when modifying an existing pattern, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there’s no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.
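
The point about group numbering can be sketched as follows. Both patterns and the `ID:` prefix are made up for this illustration.

```python
import re

original = re.compile(r"(\d+)-(\d+)")
# A new optional prefix is added as a non-capturing group,
# so the two number groups keep their positions 1 and 2
extended = re.compile(r"(?:ID:)?(\d+)-(\d+)")

before = original.match("10-20").groups()
after = extended.match("ID:10-20").groups()
print(before, after)  # ('10', '20') ('10', '20')

second = extended.match("10-20").group(2)
print(second)  # 20
```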

A more significant feature is named groups: instead of referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions: `(?P<name>...)`. `name` is, obviously, the name of the group. Named groups behave exactly like capturing groups, and additionally associate a name with a group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group’s name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

In [None]:
p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search( '(((( Lots of punctuation )))' )
m.group('word')

In [None]:
m.group(1)

Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here’s an example RE from the `imaplib` module:

In [None]:
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

It’s obviously much easier to retrieve `m.group('zonem')`, instead of having to remember to retrieve group 9.
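
To make the comparison concrete, the sketch below applies the pattern to a made-up input string in the shape the pattern expects and retrieves the same group by name and by number.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Made-up sample input, chosen only to fit the pattern
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

print(m.group('zonem'), m.group(9))     # 00 00
print(m.group('mon'), m.group('year'))  # Jul 1996
```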

The syntax for backreferences in an expression such as `...)\1` refers to the number of the group. There’s naturally a variant that uses the group name instead of the number. This is another Python extension: `(?P=name)` indicates that the contents of the group called name should again be matched at the current point. The regular expression for finding doubled words, `\b(\w+)\s+\1\b` can also be written as `\b(?P<word>\w+)\s+(?P=word)\b`:

In [None]:
p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
p.search('Paris in the the spring').group()
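
The numbered and named forms can be compared directly; both find the same doubled word in the sample sentence.

```python
import re

numbered = re.compile(r'\b(\w+)\s+\1\b')
named = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')

text = 'Paris in the the spring'
numbered_hit = numbered.search(text).group()
named_hit = named.search(text).group()
repeated_word = named.search(text).group('word')

print(numbered_hit)   # the the
print(named_hit)      # the the
print(repeated_word)  # the
```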

## Tasks

### Task 1

Develop and compile a regex pattern that can:

Match the following strings:

    can
    man
    fan
    
**but** ignore the following strings:

    dan
    ran
    pan

In [None]:
strings = ["can", "man", "fan", "dan", "ran", "pan"]

# your code here

### Task 2

Develop, compile and match a regex pattern that extracts the numbers in the middle of each username.

In [None]:
usernames = [
    "wA854k_12", "xQ764b-19", "oZ488n_86", "vK221i_09"
]

# your code here

### Task 3

Develop, compile and match a regex pattern that matches to URLs.

Bear in mind that the scheme (`http://` or `https://`) is optional, the `s` in `https` is optional, and the slashes are forward slashes (`/`), which do not need escaping in Python's `re`.

For example, a few standard URLs may be:

    https://www.google.co.uk/
    http://www.yahoo.com
    www.wikipedia.org/wiki/Main_Page

In [None]:
# your code here

### Task 4

Develop, compile and match a regex pattern that extracts each attribute (such as `src`, `alt`) from the HTML `<img>` tags below. Ensure that you label each extracted group with an appropriate name, such as **SRC**, **ALT**.

For example, a standard tag may look like:

    <img src="smileyface.gif" alt="Smiley Face" height="42" width="35">
    <img src="http://www.example.com/image.gif" alt="An example image" style="width:500px; height:600px;">

In [None]:
# your code here