## Regular Expressions

The methods of Python's `str` type give you a powerful set of tools for formatting, splitting, and manipulating string data. But even more powerful tools are available in Python's built-in *regular expression* module.

Fundamentally, regular expressions are a means of *flexible pattern matching* in strings.

The Python interface to regular expressions is contained in the built-in `re` module. We can duplicate a lot of the functionality of the methods of strings with regular expressions.

As an example, we can check whether a string begins with some substring as follows:

In [29]:
txt = 'Apple computer'
txt.startswith('Apple')

True

If we would like to know whether it starts with `"Apple"` or `"apple"`, we could call the `startswith` method twice or pass it a tuple of prefixes. Regular expressions offer a more general solution:

In [30]:
import re
re.match(r"[Aa]pple", txt)

<re.Match object; span=(0, 5), match='Apple'>

The bracket notation is one example of the special syntax of *regular expressions*. In this case, it says that any of the characters inside the brackets will do: either `"A"` or `"a"`. The other letters in `"pple"` will act normally. The string `r"[Aa]pple"` is called a *pattern*.

A more complicated example asks whether the string starts with either `apple` or `banana` (no matter if the first letter is capital or not):

In [31]:
if re.match(r"[Aa]pple|[Bb]anana", txt):
    print("Yay!")

Yay!


In this example, we see a new special character `|` that denotes an alternative. On either side of the bar character, we have a *subpattern*.

A legal variable name in Python starts with a letter or an underscore, and the following characters can be letters, underscores, or digits. So, for instance, `_hidden`, `L_value`, and `A123_` are legal names, but `2abc` is not. A regular expression pattern that recognizes such names would be:

`r"[A-Za-z_][A-Za-z_0-9]*\Z"`

The first character of the variable name is defined in the first brackets. The subsequent characters are defined in the second brackets. The special character `*` means that we allow any number (0, 1, 2, ...) of the previous subpattern. For example, the pattern `r"ba*"` allows strings `"b"`, `"ba"`, `"baa"`, `"baaa"`, and so on. The special syntax `\Z` denotes the end of the string.
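
As a quick check, the sketch below runs the pattern against the names mentioned above. The helper name `is_valid_name` and the extra test string `a-b` are illustrative assumptions, not part of the original notes. Note that the pattern does not exclude reserved keywords such as `for`.

```python
import re

# Pattern from the text: a letter or underscore, then letters, digits, or underscores
NAME = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

def is_valid_name(candidate):
    """Return True if the whole string looks like a variable name (hypothetical helper)."""
    return NAME.match(candidate) is not None

results = {c: is_valid_name(c) for c in ["_hidden", "L_value", "A123_", "2abc", "a-b"]}
print(results)
# {'_hidden': True, 'L_value': True, 'A123_': True, '2abc': False, 'a-b': False}
```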

Backslash notations like `\Z` also interact awkwardly with Python's string handling, because ordinary string literals have backslash notations of their own:

- `\n` stands for newline
- `\t` stands for tab
- And more ...

So, both string literals and regular expressions use similar-looking notations, which can create serious confusion. This can be solved by using *raw strings*. We denote a raw string by putting the letter `r` before the first quotation mark, like `r"ab*\Z"`. In a raw string, the newline (`\n`), tab (`\t`), and other special string literal notations aren't interpreted, so the backslashes reach the `re` module unchanged. One should always use raw strings when defining regular expression patterns.
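
To make the difference concrete, here is a minimal sketch; the sample text `"cat concat"` is made up for illustration. In an ordinary literal `"\b"` is a backspace character, while in a raw literal the two characters `\b` reach the regex engine and mean a word boundary.

```python
import re

# In an ordinary literal "\n" is one character; in a raw literal it is two
lengths = (len("\n"), len(r"\n"))
print(lengths)  # (1, 2)

# "\b" is a backspace in an ordinary literal, so this pattern looks for
# backspace + "cat" + backspace and finds nothing
plain = re.findall("\bcat\b", "cat concat")
# In a raw literal the regex engine sees \b and treats it as a word boundary
raw = re.findall(r"\bcat\b", "cat concat")
print(plain)  # []
print(raw)    # ['cat']
```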

### Patterns

A pattern represents a set of strings. This set can be potentially infinite. In patterns, normal characters (letters, numbers) just represent themselves, unless preceded by a backslash, which may trigger some special meaning. Many punctuation characters have a special meaning, unless preceded by a backslash (`\`), which removes that special meaning. Use `\\` to represent a backslash character without any special meaning.

#### Special Characters Match Character Groups

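In the following, note that a caret (`^`) as the first character inside brackets will create a complement set of characters. The bracket equivalents hold for ASCII text; for `str` patterns Python also matches non-ASCII digits and letters unless the `re.ASCII` flag is set: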

- `\d`: matches a digit, same as `[0-9]`
- `\D`: matches anything but a digit, same as `[^0-9]`
- `\s`: matches a whitespace character (space, newline, tab, ...)
- `\S`: matches a nonwhitespace character
- `\w`: matches one alphanumeric character or underscore, same as `[a-zA-Z0-9_]`
- `\W`: matches one character that is neither alphanumeric nor an underscore, same as `[^a-zA-Z0-9_]`

Using the above notation, we can shorten our previous variable name example, `r"[A-Za-z_][A-Za-z_0-9]*\Z"`, to `r"[a-zA-Z_]\w*\Z"`.
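
The two patterns agree on plain ASCII names, but they are not strictly identical: for `str` patterns, Python's `\w` also accepts non-ASCII letters unless the `re.ASCII` flag is given. A small sketch of the difference follows; the sample word and the variable names are only illustrations.

```python
import re

long_form = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")
short_form = re.compile(r"[a-zA-Z_]\w*\Z")
ascii_short = re.compile(r"[a-zA-Z_]\w*\Z", re.ASCII)

word = "na\u00efve"  # 'naïve', with a non-ASCII letter in the middle
comparison = (
    bool(long_form.match(word)),    # explicit ASCII ranges reject the letter
    bool(short_form.match(word)),   # Unicode-aware \w accepts it
    bool(ascii_short.match(word)),  # re.ASCII restores the ASCII-only behavior
)
print(comparison)  # (False, True, False)
```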

Another example of these:

In [18]:
regex = re.compile(r"\w\s\w")
regex.findall("the fox is 9 years old")

['e f', 'x i', 's 9', 's o']

The patterns `\A`, `\b`, `\B`, and `\Z` will all match an empty string, but in specific places. The patterns `\A` and `\Z` will recognize the beginning and end of the string, respectively. 

The pattern `\b` matches at the start or end of a word, the so-called "word boundary". There are three different positions that qualify as word boundaries:

1. Before the first character in a string, if that character is alphanumeric
2. After the last character in a string, if that character is alphanumeric
3. Between two characters in the string, where one is an alphanumeric character and the other is a non-alphanumeric character

Simply put, `\b` allows you to perform a "whole words only" search using a regular expression in the form of `\bword\b`. 

The pattern `\B` does the reverse of `\b` and matches at every position that `\b` does not. Effectively, `\B` matches at any position between two alphanumeric characters as well as at any position between two non-alphanumeric characters.
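
The following sketch contrasts the two patterns by listing where they match; the sample strings are made up for illustration.

```python
import re

text = "cat concat cat."
# \b...\b: only the standalone occurrences of "cat"
whole_words = [m.span() for m in re.finditer(r"\bcat\b", text)]
print(whole_words)  # [(0, 3), (11, 14)]

# \B...\B: only occurrences with word characters on both sides
inside = [m.span() for m in re.finditer(r"\Bcat\B", "concatenate cat")]
print(inside)  # [(3, 6)]
```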

#### Wildcards Match Repeated Characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, `"\w\w\w"`. Because this is such a common need, there is a specific syntax to match repetitions - curly braces with a number:

In [21]:
regex = re.compile(r"\w{3}")
regex.findall("The quick brown fox")

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions. Here are some of the more common RegEx notations:

- `.`: matches any character except a newline
- `[...]`: matches any character contained within the brackets
- `[^...]`: matches any character not appearing after the caret (`^`)
- `^`: matches the start of the string
- `$`: matches the end of the string
- `*`: matches zero or more previous RegEx
- `+`: matches one or more previous RegEx
- `{n}`: matches `n` occurrences of previous RegEx
- `{m,n}`: matches `m` to `n` occurrences of previous RegEx (write it without a space after the comma)
- `?`: matches zero or one occurrences of previous RegEx
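
To compare the repetition markers side by side, here is a small sketch that applies four of them to the same made-up sample string:

```python
import re

sample = "a ab abb abbb"
star = re.findall(r"ab*", sample)         # zero or more "b"
plus = re.findall(r"ab+", sample)         # one or more "b"
optional = re.findall(r"ab?", sample)     # zero or one "b"
bounded = re.findall(r"ab{2,3}", sample)  # two to three "b"

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(bounded)   # ['abb', 'abbb']
```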

Here's an example using the `"+"` character:

In [22]:
regex = re.compile(r"\w+")
regex.findall("The quick brown fox")

['The', 'quick', 'brown', 'fox']

Here's a more complex example to match email addresses:

In [45]:
email = re.compile(r"\w+.\w+@\w+\.[a-z]{3}")
email.findall("barack.obama@whitehouse.gov")

['barack.obama@whitehouse.gov']

In [46]:
email.findall("erin_mcconnell@gmail.com")

['erin_mcconnell@gmail.com']

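This pattern reads: one or more word characters (`"\w+"`), then any single character (the unescaped `"."`), then one or more word characters again, followed by the *at sign* (`"@"`), one or more word characters (`"\w+"`), a period (`"\."` - note the need for a backslash escape), and exactly three lowercase letters. Because the unescaped `.` accepts any character rather than only a period, the pattern is looser than it looks. It is also stricter in one way: it needs at least three characters before the at sign, so it cannot match a short address such as `a@b.com`.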

If we change our code to:

In [40]:
email2 = re.compile(r"[\w.]+@\w+\.[a-z]{3}")
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

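We have replaced `"\w+.\w+"` with `"[\w.]+"`, so the part before the at sign is now any run of word characters *or* periods. Inside square brackets the `.` is a literal period, not a wildcard. With this more flexible expression, we can match a wider range of email addresses, including short ones such as `a@b.com`.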
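
To see the difference between the two dots, here is a small sketch; the input string is made up for illustration. Outside brackets an unescaped `.` accepts any character, even a space, while inside brackets it only accepts a literal period.

```python
import re

loose = re.compile(r"\w+.\w+@\w+\.[a-z]{3}")  # unescaped dot: any character
strict = re.compile(r"[\w.]+@\w+\.[a-z]{3}")  # dot inside brackets: literal period

sample = "x y@site.org"
loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)
print(loose_hits)   # ['x y@site.org']
print(strict_hits)  # ['y@site.org']
```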

We have already seen that a `|` character denotes alternatives. For example, the pattern `r"Get (on|off|ready)"` matches the following strings: 

`"Get on"`, `"Get off"`, `"Get ready"`

We can use parentheses to create groupings inside a pattern: `r"(ab)+"` will match the strings:

`"ab"`, `"abab"`, `"ababab"`, and so on

These groups are also given a reference number starting from 1. We can refer to groups using back references: `\number`. For example, we can find separated patterns that get repeated: `r"([a-z]{3,}) \1 \1"`. This will recognize, for example, the following strings:

`"aca aca aca"`, `"turn turn turn"`

But not the strings `"aca aba aca"` or `"ac ac ac"`.

In [102]:
re.match(r"([a-z]{3,}) \1 \1", "erin erin erin")

<re.Match object; span=(0, 14), match='erin erin erin'>

In [47]:
re.match(r"([a-z]{3,}) \1 \1", "erin data erin")
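
The last call returns `None`, so the notebook shows no output. A common use of back references is spotting accidentally doubled words. The following is a small sketch with a made-up sentence; since the pattern contains a single group, `findall` returns just the text of that group.

```python
import re

# A word, a space, and the same word again, bounded on both sides
doubled = re.compile(r"\b(\w+) \1\b")
repeats = doubled.findall("this is is a test test")
print(repeats)  # ['is', 'test']
```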

#### Square Brackets Match Custom Character Groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in. For example, the following will match any lowercase vowel:

In [19]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example `"[a-z]"` will match any lowercase letter, and `"[1-3]"` will match any of `"1"`, `"2"`, or `"3"`. For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [20]:
regex = re.compile("[A-Z][0-9]")
regex.findall('1043879, G2, H6')

['G2', 'H6']

### Match and Search Functions

Apart from `findall` and `split`, we have so far only used the `re.match` function, which tries to find a match at the beginning of the string. As we saw last class, the function `re.search` allows us to match any substring of a string.

Example:

In [5]:
s = "a back is a body part"
re.search(r"\bback\b", s)

<re.Match object; span=(2, 6), match='back'>

In [48]:
re.search(r"\bback\b", "get back")

<re.Match object; span=(4, 8), match='back'>

In [52]:
re.search(r"\bback\b", "back")

<re.Match object; span=(0, 4), match='back'>

In [49]:
re.search(r"\bback\b", "backspace")

As these calls show, the pattern matches `"back"` and `"get back"`, but it does not match `"backspace"`; for the same reason it would not match `"comeback"`.
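
The difference between the two functions is easiest to see when they are given the same pattern and string. A minimal sketch:

```python
import re

sentence = "get back"
at_start = re.match(r"back", sentence)   # only tries position 0
anywhere = re.search(r"back", sentence)  # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (4, 8)
```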

The function `re.search` finds only the first occurrence. We can use the `re.findall` function to find all occurrences. Let's say we want to find all present participles in a string `s`, that is, words ending in `'ing'`. The function call would look like this:

In [54]:
s = "Doing things, going home, staying awake, sleeping later"
re.findall(r'\w+ing\b', s)

['Doing', 'going', 'staying', 'sleeping']

Let's say we want to pick up all the integers from a string. We can try that with the following function call:

In [27]:
re.findall(r"[+-]?\d*", "23 + -24 = -1")

['23', '', '+', '', '-24', '', '', '', '-1', '']
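
The empty strings and the lone `'+'` show up because `\d*` is satisfied by zero digits, so the pattern can match an empty string or a bare sign. One possible fix, offered as a sketch, is to require at least one digit with `\d+`:

```python
import re

integers = re.findall(r"[+-]?\d+", "23 + -24 = -1")
print(integers)  # ['23', '-24', '-1']

# findall returns strings; convert them if numbers are needed
values = [int(token) for token in integers]
print(values)  # [23, -24, -1]
```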

#### Parentheses Indicate *Groups* to Extract

For compound regular expressions, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:

In [97]:
email3 = re.compile(r"([\w.]+)@(\w+)\.([a-z]{3})")
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

In [98]:
re.findall(r"([\w.]+)@(\w+)\.([a-z]{3})", text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

As we see, this grouping actually extracts a list of the sub-components of the email address.

Suppose we are given a string of if/then sentences and we would like to extract the conditions from these sentences. Let's try the following function call:

In [58]:
s = ("If I’m not in a hurry, then I should stay. On the other hand, if I leave, then I can sleep.")
re.findall(r'[Ii]f (.*), then', s)

['I’m not in a hurry, then I should stay. On the other hand, if I leave']

If instead we wanted the result `['I’m not in a hurry', 'I leave']`, we need to change `.*`.

That subpattern tries to match as many characters as possible. This is called *greedy matching*. One way of solving this problem is to notice that the two sentences are separated by a full stop (`.`). So, instead of matching any character, we match everything but the dot character. This can be achieved by using the complement character class: `[^.]`. The caret character (`^`) at the beginning of a character class means the complement of that class.

After the modification, the function call looks like this:

`re.findall(r'[Ii]f ([^.]*), then', s)`

Another way of solving this problem is to use non-greedy matching. The repetition specifiers `+`, `*`, `?`, and `{m,n}` have corresponding non-greedy versions: `+?`, `*?`, `??`, and `{m,n}?`. These expressions use as few characters as possible to make the whole pattern match some substring. By using the non-greedy version, the function call looks like this:

`re.findall(r'[Ii]f (.*?), then', s)`

In [59]:
re.findall(r'[Ii]f ([^.]*), then', s)

['I’m not in a hurry', 'I leave']

In [60]:
re.findall(r'[Ii]f (.*?), then', s)

['I’m not in a hurry', 'I leave']

### Functions in the `re` Module

Below is a list of the most common functions in the `re` module:

- `re.match(pattern, str)`
- `re.search(pattern, str)`
- `re.findall(pattern, str)`
- `re.finditer(pattern, str)`
- `re.sub(pattern, replacement, str, count=0)`

The functions `match` and `search` return a *match object*, or `None` if nothing matches. A match object describes the found occurrence. The function `findall` returns a list of all the occurrences of the pattern. The elements in the list are strings, or tuples of strings if the pattern contains more than one group. The function `finditer` works like the `findall` function except that instead of returning a list, it returns an iterator whose items are match objects. The function `sub` replaces the occurrences of the pattern in `str` with the string `replacement` and returns the new string; a nonzero `count` limits the number of replacements.
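
As a brief sketch of `finditer` and of the `count` argument of `sub`, the code below reuses the email text from earlier; the second input string `"a1b2c3"` is made up for illustration.

```python
import re

text = "To email Guido, try guido@python.org or the older address guido@google.com."
pattern = r"([\w.]+)@(\w+)\.([a-z]{3})"

# Each item is a match object, so both the text and its position are available
found = [(m.group(0), m.start()) for m in re.finditer(pattern, text)]
print(found)  # [('guido@python.org', 20), ('guido@google.com', 58)]

# count limits how many occurrences sub replaces (0 means all of them)
masked = re.sub(r"\d", "#", "a1b2c3", count=2)
print(masked)  # a#b#c3
```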

Example - the following code will replace all "she" words with "he":

In [73]:
txt = "She goes where she wants to, she's a sheriff."
newstr = re.sub(r'\b[Ss]he\b', 'he', txt)
print(newstr)

he goes where he wants to, he's a sheriff.


The `sub` function can also use back references to refer to the matched string. The back references `\1`, `\2`, and so on, refer to the groups of the pattern, in order.

Example:

In [92]:
txt = """He is the president of Russia. 
He’s a powerful man."""
newstr = re.sub(r'(\b[Hh]e\b)', r'\1 (Putin)', txt, count=1)
print(newstr)

He (Putin) is the president of Russia. 
He’s a powerful man.


When we use the back reference `\1` inside the replacement string, Python inserts the text matched by the first group of the pattern, here `(\b[Hh]e\b)`. The matched word `He` is therefore kept and followed by ` (Putin)`. Because the `count` argument is 1, only the first occurrence is replaced.
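
With more than one group, the back references can also reorder the matched pieces. A minimal sketch, using a made-up "Last, First" string:

```python
import re

# Group 1 is the family name, group 2 the given name; the replacement swaps them
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")
print(swapped)  # Ada Lovelace
```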

### Match Object

The functions `match`, `search`, and `finditer` use `match` objects to describe the found occurrence. The method `groups()` of the match object returns the tuple of all the substrings matched by the groups of the pattern. Each pair of parentheses in the pattern creates a new group. These groups are referred to by indices 1, 2, .... The group 0 is a special one: it refers to the match created by the whole pattern.

Let's look at the match object returned by the call:

In [93]:
mo = re.search(r"\d+ (\d+) \d+ (\d+)",
               "first 123 45 67 890 last")
mo.groups()

('45', '890')

In [94]:
mo

<re.Match object; span=(6, 19), match='123 45 67 890'>

We can access individual groups by using the method `group(group_id, ...)`. For example:

In [15]:
mo.group(1)

'45'

In [95]:
mo.group(2)

'890'

The zeroth group will represent the whole match:

In [13]:
mo.group(0)

'123 45 67 890'

In addition to accessing the strings matched by the pattern and its groups, the corresponding indices of the original string can be accessed:

- The `start(group_id)` and `end(group_id)` methods return the start and end indices of the given group (group 0, the whole match, by default)
- The `span(group_id)` method returns the pair of these start and end indices
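
For the match object from the example above, these methods can be sketched as follows; the variable names are illustrative.

```python
import re

mo = re.search(r"\d+ (\d+) \d+ (\d+)", "first 123 45 67 890 last")

whole = mo.span(0)                # same as mo.span()
first = (mo.start(1), mo.end(1))  # indices of group 1
second = mo.span(2)               # indices of group 2
print(whole, first, second)  # (6, 19) (10, 12) (16, 19)

# Slicing the original string with a span gives back the group text
text_of_second = mo.string[second[0]:second[1]]
print(text_of_second)  # 890
```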

The match object `mo` can also be used like a boolean value:

In [None]:
mo = re.search(r"\d+ (\d+) \d+ (\d+)",
               "first 123 45 67 890 last")
if mo:
    # do something with the match
    print(mo.group(0))

This code block will do something if a match is found. Alternatively, the match object can be converted to a boolean value by the call:

In [17]:
found = bool(mo)
print(found)

True


In [96]:
no = re.search(r"\d+ (\d+) \d+ (\d+)",
               "first last")
lost = bool(no)
print(lost)

False


### Miscellaneous Stuff

If the same pattern is used in many function calls, it may be wise to precompile the pattern, mainly for efficiency reasons. This can be done using the `compile(pattern, flags=0)` function in the `re` module. The function returns a RegEx object, which has method versions of the functions found in the `re` module. The only difference is that the first parameter is not the pattern since the precompiled pattern is stored in the RegEx object.
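
A short sketch of a precompiled pattern and a few of its methods; the input strings are made up for illustration.

```python
import re

word = re.compile(r"\w+")

# The methods take the same arguments as the module functions, minus the pattern
tokens = word.findall("The quick brown fox")
first = word.search("  padded").group(0)
replaced = word.sub("X", "a b c")

print(tokens)    # ['The', 'quick', 'brown', 'fox']
print(first)     # padded
print(replaced)  # X X X
```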

The details of matching operations can be specified using optional flags. These flags can be given either inside the pattern or as a parameter to the compile function. Some of the more common flags are below:

- `(?i)`: `re.IGNORECASE`
- `(?m)`: `re.MULTILINE`
- `(?s)`: `re.DOTALL`

The elements on the left are written inside the pattern and should be placed at its very beginning; since Python 3.11, a global inline flag anywhere else raises an error. On the right, there are attributes of the `re` module that can be given to the `compile()` function as the second parameter `flags`.

The `IGNORECASE` flag makes matching case-insensitive. The `MULTILINE` flag makes the special characters `^` and `$` match the beginning and end of each line in addition to the beginning and end of the whole string. With this flag set, `\A` differs from `^` and `\Z` differs from `$`. The `DOTALL` flag makes the `.` (dot) also accept newline characters in addition to all other characters.

When giving multiple flags to the compile function, the flags can be separated with the `|` sign. For example,

`re.compile(pattern, re.MULTILINE | re.DOTALL)` 

is equal to 

`re.compile("(?m)(?s)" + pattern)`
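
The effect of each flag can be sketched on small made-up inputs; the variable names and sample strings below are illustrations only.

```python
import re

lines = "one\ntwo\nthree"
default_starts = re.findall(r"^\w+", lines)
multiline_starts = re.findall(r"^\w+", lines, re.MULTILINE)
string_start = re.findall(r"\A\w+", lines, re.MULTILINE)  # \A ignores MULTILINE
print(default_starts)    # ['one']
print(multiline_starts)  # ['one', 'two', 'three']
print(string_start)      # ['one']

# By default the dot stops at a newline; DOTALL lets it cross
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", re.DOTALL)
print(dot_default)  # []
print(dot_all)      # ['a\nb']

# An inline flag at the start of the pattern
ignore_case = re.findall(r"(?i)apple", "Apple APPLE apple")
print(ignore_case)  # ['Apple', 'APPLE', 'apple']

# Flags passed as an argument and inline flags give the same result
pattern = r"^t.*e$"
by_argument = re.compile(pattern, re.MULTILINE | re.DOTALL).findall(lines)
by_inline = re.compile("(?m)(?s)" + pattern).findall(lines)
print(by_argument == by_inline)  # True
```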