<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Regular expressions

<div class="alert alert-warning">
Parts of this section have been adapted from copyrighted material in *Jones, M: Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**

</div>

## The importance of patterns in biology

A lot of what we do when writing programs for biology can be described as searching for patterns in strings. The obvious examples come from the analysis of biological sequence data – remember that DNA, RNA and protein sequences are just strings. Many of the things we want to look for in biological sequences can be described in terms of patterns:

* protein domains
* DNA transcription factor binding motifs
* restriction enzyme cut sites
* degenerate PCR primer sites
* runs of mononucleotides

However, it’s not just sequence data that can have interesting patterns. As we discussed before, most of the other types of data we have to deal with in biology come in the form of strings inside text files – things like:

* read mapping locations
* geographical sample coordinates
* taxonomic names
* gene names
* gene accession numbers
* BLAST searches

In previous sections, we’ve looked at some programming tasks that involve pattern recognition in strings. We’ve seen how to count individual amino acid residues (and even groups of amino acid residues) in protein sequences, and how to identify restriction enzyme cut sites in DNA sequences. We’ve also seen how to examine parts of gene names and match them against individual characters.

The common theme among all these problems is that they involve searching for a fixed set of characters. But there are many problems that we want to solve that require more flexible patterns. For example:

* Given a DNA sequence, what’s the length of the poly-A tail?
* Given a gene accession name, extract the part between the third character and the underscore
* Given a protein sequence, determine if it contains this highly-redundant domain motif

Because these types of problems crop up in so many different fields, there’s a standard set of tools for dealing with them: *regular expressions*. Regular expressions are not specific to Python; they pre-date it, and many modern languages support them. They are not part of the core Python language either. Instead, Python provides an extensive regular expression *library* called `re`, which is part of the Python *standard library* and is therefore available wherever Python is installed.

Regular expressions are a topic that might not be covered in a general-purpose programming book, but because they’re so useful in biology, we’re going to devote the whole of this section to looking at them.

Since the tools for dealing with regular expressions are not built into the core Python language, they are not automatically available when you write a program. Before we can use them, we must first talk about *modules* and *packages*.

## Recap: Finding a fixed substring

We've seen several ways of doing this:

* `<pattern> in <sequence>` – test whether the substring `<pattern>` occurs anywhere in the string `<sequence>` and return a boolean value (`True` or `False`)

In [None]:
my_seq = "ACTAGCGTTTACT"
"CG" in my_seq

* `<sequence>.find(<pattern>)` — find index of the _first_ `<pattern>` in `<sequence>`;  return `-1` if not found

In [None]:
my_seq.find("CG")

* `<sequence>.index(<pattern>)` – find index of the _first_ `<pattern>` in `<sequence>`;  raise a `ValueError` if not found

In [None]:
my_seq.index("CC")

* `<sequence>.count(<pattern>)` – count the number of non-overlapping occurrences of `<pattern>` in `<sequence>` and return that number

In [None]:
my_seq.count('C')
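
The difference between `find` and `index` only shows up when the pattern is missing, and "non-overlapping" matters for `count` when a pattern can overlap with itself. Here is a small sketch of both points; the sequence `seq` is made up for illustration:

```python
seq = "AAAAGC"

# find() signals "not found" with -1, whereas index() raises a ValueError
print(seq.find("TT"))
try:
    seq.index("TT")
except ValueError:
    print("index() raised ValueError")

# count() only counts non-overlapping occurrences:
# "AA" starts at positions 0, 1 and 2, but only two copies fit side by side
print(seq.count("AA"))
```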

## Extended example

In this example I'll try to illustrate why regular expressions are useful.

Let's say we need to find the p53 transcription factor in a longer protein sequence:

In [1]:
protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"
p53 = "MCNSSCMGGMNRR"
location = protein.find(p53)
print(location)

10


In [2]:
protein[location:location + len(p53)]

'MCNSSCMGGMNRR'

However, the p53 transcription factor is variable in one residue.  It can be either of:

* <tt>MCNSSC<font color="red">M</font>GGMNRR</tt>
* <tt>MCNSSC<font color="red">V</font>GGMNRR</tt>

How can we write a test to see whether either variant is present in a protein sequence?  Well, we could combine two conditional tests with `or`:

In [3]:
p53_1 = "MCNSSCMGGMNRR"
p53_2 = "MCNSSCVGGMNRR"

if p53_1 in protein or p53_2 in protein:
    print("Found!")

Found!


…and in fact, this is quite doable for this example where we have just two variants.  You can imagine how tedious this could get if there were many variants, though!
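
To get a feel for how the fixed-substring approach scales, here is a sketch that checks a list of variants one by one. The third entry in `p53_variants` is invented purely to make the list longer; it is not a real variant. The built-in `any()` function may not have been covered yet.

```python
protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"

# Every variant has to be spelled out in full.
# The last entry is a made-up variant, included only for illustration.
p53_variants = ["MCNSSCMGGMNRR", "MCNSSCVGGMNRR", "MCNSSCLGGMNRR"]

# any() is True as soon as one of the variants is found in the protein
found = any(variant in protein for variant in p53_variants)
print("Found!" if found else "Not found!")
```

With more than one variable position, the list of variants grows multiplicatively, which is where this approach stops being practical.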

For now, however, let's continue with our simple example.  Wouldn't it be nice if we could write some sort of *pattern* before we test whether the string representing the p53 transcription factor is present in the string representing a protein sequence?

What if we could combine the two pattern strings:

* <tt>"MCNSSC<font color="red">M</font>GGMNRR"</tt> 
* <tt>"MCNSSC<font color="red">V</font>GGMNRR"</tt>

…and make a string something like this:

* <tt>"MCNSSC<font color="red">[MV]</font>GGMNRR"</tt>

In our putative pattern matching language, this would mean:

* *a string that starts with `"MCNSSC"`, followed by either an `'M'` or a `'V'`, followed by `"GGMNRR"`*

As it happens, such a pattern language exists, and it goes by the name of *regular expressions*.  Run the following bit of code:

In [4]:
import re

if re.search("MCNSSC[MV]GGMNRR", protein):
    print("Found!")

Found!


In fact, we can use regular expressions to search for patterns much more complex than that one, which is fortunate, since biology presents us with many such complex patterns.

But first a little detour as we find out what that `import re` statement was all about!

## Python modules

### The standard library

Regular expressions are **not** a part of the core Python language.  In fact, regular expressions predate Python, and are today implemented in a large variety of programming languages and other tools.

Very often when there's such a generally useful tool out there (which is nevertheless self-contained and independent, and should therefore not obviously be a part of the language itself), the Python maintainers have decided to make it available as a *module* in the Python *standard library*.

We mentioned the standard library in the Introduction to this course:  it's a large collection of generally useful add-on modules that ship with Python, are maintained along with the Python interpreter (and are therefore guaranteed to be of a high standard) and are available wherever Python is installed.  (Most programming languages have a standard library.)

The standard library contains a large collection of *modules*, each of which covers a single field of application.

You'll find a list of modules included in the Python standard library (along with links to their documentation) here:

* http://docs.python.org/3/py-modindex.html

Python's support for regular expressions is in a module called `re`, and you can read *its* documentation here:

* http://docs.python.org/3/library/re.html

### Importing modules

When we want to use the functions and types defined in a Python module, we first have to import it into our current program using the `import` statement.  For instance, to use Python's regular expression support in the `re` module, we add a line like this to our program:

In [5]:
import re

We only have to import a specific module once in a program.  (Though if we do it more than once it's not problematic:  Python is smart enough to recognise that a module has already been imported and won't do any extra work to import it again.)

It's customary to group all of a program's `import` statements together, right near the top.

>Also in a Jupyter Notebook, we need to import a module only once.  We've now already had two `import re` statements, but no matter!  As long as either (or both) of them have been executed, `re` will remain imported for the rest of this Notebook.

Once a module has been imported, you can read its documentation using Python's built-in `help` function or Jupyter's “`?`” magic.  Warning:  A whole module's documentation can be very long!

In [6]:
?re

>What actually happens when you use the `import` statement is that a new variable gets created in the current program's *namespace* — in this case, named `re` — which references a special object in memory where Python stores the functions and types of that module.  See what happens if you evaluate `re`:

In [7]:
re

<module 're' from '/Users/sabineurban/miniconda3/lib/python3.6/re.py'>

### Using an imported module

Once we've imported a module like `re`, we now have a variable `re` in the current namespace which we can use like a handle to get access to the functions and types defined in that module.

For the moment, most of our interaction with the `re` module is going to be via *functions* defined in it.  Using a function in a module is no different from using a built-in function (like `print()` or `range()`) or a function we have defined ourselves, except that we have to *qualify* the name of the function with the module's name, using Python's qualification notation, the period (“`.`”):

```python
re.search(<pattern>, <string>)
```

Or, in the example we saw above:

In [None]:
if re.search("MCNSSC[MV]GGMNRR", protein):
    print("Found!")

This syntax looks identical to the way in which we accessed *methods* of types, e.g. the invocation of `protein.find()` in this example:

In [15]:
protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"
if protein.find("MCNSSCMGGMNRR") != -1:
    print("Found!")

Found!


…and that's no coincidence!  In Python, a type and a module will each have its own *namespace*, and the period operator (“`.`”) is a *namespace resolution operator*:

* `protein.find` means *"the function (method) `find` in the namespace of the string object `protein`"*

* `re.search` means "*the function `search` in the namespace of the module `re`*"

Nevertheless, types and modules are very different kinds of objects, so don't confuse the two.  It's just that they (and many other things in Python) each have a *namespace* wherein variables can be bound to objects.

As you might by now expect, you can view the help of the `search` function in the `re` module like so:

In [None]:
help(re.search)

### Aside: `None`?!

There was an interesting statement in the help of `re.search`, namely that it returns *None if no match was found*.

None?

In [17]:
None

Nothing happens when we evaluate `None`, but Python also doesn't complain the way it would if we evaluate an arbitrary word:

In [18]:
Hello

NameError: name 'Hello' is not defined

Let's try to `print` it:

In [19]:
print(None)

None


The textual representation of `None` is just … `None`.

OK, clearly `None` has meaning to the language, just like `True` and `False`.  Well, `True` and `False` were of type `bool`, so what's the type of `None`?

In [16]:
type(None)

NoneType

Okaaaay.  So `None` is an object of type `NoneType`.  I wonder if it's truthy or falsy;  I think I can guess:

In [20]:
bool(None)

False

Yep, it's false.  In fact, `None` is a special value in Python.  It has its own type — `NoneType` — and it is in fact the only value of type `NoneType`.

`None` is Python's way of saying "nothing".  When a function wants to return something to actively indicate that it's returning **nothing**, it returns `None`.

So we see in the documentation of `re.search` that it returns *"a match object, or None if no match was found"*.  We don't know what a "match object" is at this stage (though we can assume it's some type of object defined in the `re` module), but we can be pretty sure of one thing:  A "match object" will be truthy.  Why?  Because almost everything in Python is truthy, except for a couple of exceptions we've already listed:  The number `0`, an empty string, an empty list, and now also `None`.

This is why we can do the following conditional test:

In [None]:
match_object = re.search("MCNSSC[MV]GGMNRR", protein)
if match_object:
    print("Found!")

Let's evaluate `match_object`:

In [21]:
match_object

Because the search succeeded, this displays a match object; its representation includes the span of the match and the matched text.

Let's see what the match_object looks like when the match definitely fails.  E.g. when we try to match a DNA sequence against a protein sequence:

In [22]:
match_object = re.search("ACTAGAACTG", protein)

if match_object:
    print("Found!")
else:
    print("Not found!")

Not found!


In [None]:
match_object

That didn't help.  Let's try `print`ing it:

In [None]:
print(match_object)

Yes, as expected (from the documentation of `re.search()`), the `match_object` is now `None`.  

In a sense `None` is a non-object;  it's the *opposite* of an object.  Usually, a well-written function will always return the same kind of value.  If a function returns a string, then its way of returning "nothing" will be the empty string `""`.  If a function returns an arbitrary object, its way of returning "nothing" will be to return the non-object `None`.
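
Because `None` is falsy, `if match_object:` works, but comparing explicitly with `is None` states the intent more directly. Here is a small sketch; the helper `describe` is a hypothetical name used only for this illustration:

```python
import re

protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"

def describe(pattern, target):
    """Return a short message saying whether `pattern` occurs in `target`."""
    match_object = re.search(pattern, target)
    if match_object is None:      # re.search returns None when nothing matches
        return "Not found!"
    return "Found!"

print(describe("MCNSSC[MV]GGMNRR", protein))
print(describe("ACTAGAACTG", protein))
```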

## Back to Regular expressions

We took a short detour through modules, and then a detour-from-the-detour talking about `None`.  Let's get back to regular expressions now, and try to give our first vague definition of what they are:

* A *regular expression* is a **pattern** (or **template**, or **boilerplate**) against which strings (**targets**) are matched.
* A match either *succeeds* or *fails*.
* Sometimes we're only interested in success or failure, but often we want to extract the part(s) of the target that match(es) the regular expression.
* And finally, regular expressions are written in a formal language.
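
As a sketch of the third point, the match object returned by `re.search` can report what was matched and where. The `group()`, `start()` and `end()` methods used here belong to match objects in the `re` module; we have not looked at them in detail yet:

```python
import re

protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"
m = re.search("MCNSSC[MV]GGMNRR", protein)

if m:
    print(m.group())   # the part of the target that matched
    print(m.start())   # index where the match begins
    print(m.end())     # index just past the end of the match
```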

>"Regular expressions" is an unwieldy phrase, so it's often shortened to "regex" or even "regexp".  I don't like "regexp" since I have no idea how to pronounce it, but I'll often say "regex".

Regular expressions have their roots in formal computer science.  The specific formal language that we use today to express them predates Python.  The exact syntax of that language has changed a lot over the years, resulting in incompatible versions of regular expressions in various tools.  Today, the Perl-style syntax is the most widely used flavour, and Python's `re` module follows it closely.

Regular expression libraries exist for most programming languages, either in the standard library (as is the case with Python) or as third party libraries.  A couple of scripting-oriented languages even have regular expressions built into the core language itself.  Regular expressions are also implemented in a variety of UNIX command-line tools, like `grep` and `sed`.

When your job involves parsing a lot of textual data (as ours does), regular expressions at first look like a panacea.  So let's start with a warning:

<div class="alert alert-info">

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”

Now they have two problems.

<div align="right">
Jamie Zawinski, co-creator of Netscape Navigator
</div>

</div>

## Regular expression syntax

The formal language used to define regular expressions has a simple yet compact syntax.  We'll look at some of the most common elements of that grammar now.

### Literals

A *literal* part of a regular expression is a string of characters that matches the identical string in the target.  Here are some examples of literal regexes.  (The quotes are not part of the regular expression):

    * "abc"
    * "    123  456"
    * "This is a sentence"
    * "     "

Let's look at this literal regular expression:

* <tt><font color="red">abc</font></tt>

This table contains a list of targets that will be matched by <tt><font color="red">abc</font></tt>.  In each target, the *part that will be matched* is highlighted in red:

| Matching target |
|-----------------|
| <tt><font color="red">abc</font></tt> |
| <tt>aaa<font color="red">abc</font>ccc</tt> |
| <tt><font color="red">abc</font>def</tt> |
| <tt>123<font color="red">abc</font>456</tt> |
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font color="red">abc</font></tt> |
| <tt>This is a sentence containing "<font color="red">abc</font>".</tt> |
| <tt><font color="red">abc</font>abcabcabcabc</tt> |

Note in the last example that the regex <tt><font color="red">abc</font></tt> only matches the *first* occurrence of the string `abc` in the target!  We can use Python to test:

In [23]:
def test_targets(pattern, targets):
    for target in targets:
        if re.search(pattern, target):
            print("Match:", '"' + target + '"')
        else:
            print("No match:", '"' + target + '"')

matching_targets = ["abc",
                    "aaaabcccc",
                    "abcdef",
                    "123abc456",
                    "        abc",
                    'This is a sentence containing "abc".',
                    "abcabcabcabcabc"]

test_targets("abc", matching_targets)

Match: "abc"
Match: "aaaabcccc"
Match: "abcdef"
Match: "123abc456"
Match: "        abc"
Match: "This is a sentence containing "abc"."
Match: "abcabcabcabcabc"


Here are a couple of targets that will not be matched by <tt><font color="red">abc</font></tt>:

| Non-matching target |
|---------------------|
| `ABC`               |
| `cba`               |
| `abdc`              |
| `ab c`              |
| `abbc`              |

Note that regular expressions are *case sensitive* by default, so <tt><font color="red">abc</font></tt> will *not* match the target `ABC`!

Let's test these targets too:

In [24]:
nonmatching_targets = ["ABC", "cba", "abdc", "ab c", "abbc"]
test_targets("abc", nonmatching_targets)

No match: "ABC"
No match: "cba"
No match: "abdc"
No match: "ab c"
No match: "abbc"
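
To confirm that only the first occurrence is matched, we can ask the match object for its position. The sketch below also shows one way to request a case-insensitive match, using the `re.IGNORECASE` flag of the `re` module, which this section does not otherwise cover:

```python
import re

m = re.search("abc", "abcabcabcabcabc")
print(m.span())   # (start, end) of the first occurrence only

# By default "abc" does not match "ABC" ...
print(re.search("abc", "ABC"))
# ... unless we ask for case-insensitive matching
print(re.search("abc", "ABC", re.IGNORECASE).group())
```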


### Character classes

Remember that first example when we were still discussing regular expressions in the abstract?

* `MCNSSC[MV]GGMNRR`
    
We "invented" the `[MV]` syntax to mean *something that matches either a single `M` or a single `V`*.  We then saw that this syntax actually exists in regular expressions.

In regex grammar, a construction like `[MV]` is called a *character class*.  **A character class always matches a single character in the target**.

* A basic character class consists of any number of characters enclosed in square brackets.  E.g. `[MV]` matches any single `M` or `V`.


* You can *negate* a character class by making the first character after the opening bracket a caret (`^`):  `[^PWM]` matches any single character that's *not* a `P`, a `W` or an `M`


* You can create a character class to match a range of characters by separating the characters forming the start and end of the range by a dash (minus) sign (`-`):  `[A-Z]` matches any single capital letter that is in the range `A`, `B`, `C`, … through `Z`.

  You can concatenate ranges and single characters within a character class:
  
    * `[A-Za-z]` matches any single character that's a capital *or* lower case letter
    * `[A-Za-z0-9_]` matches any single character that's an uppercase letter, a lowercase letter, a digit, or the underscore (`_`)


* The dot or period (`.`) simply matches *any single character*
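
The rules above can be tried out directly with `re.search`. This is only a sketch; the target strings are made up for illustration:

```python
import re

# [MV] matches exactly one character: an M or a V
print(re.search("C[MV]G", "MCNSSCVGG").group())

# [^PWM] matches one character that is not a P, a W or an M
print(re.search("[^PWM]", "PWMK").group())

# [0-9] is a range; here it finds the first digit
print(re.search("[0-9]", "chr7").group())

# The dot matches any single character, not only DNA bases
print(re.search("GC.GC", "ATGC9GCTT").group())
```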

An example:  A "Homeobox" antennapedia-type protein signature looks like this:

| Description | In regex syntax |
|-------------|-----------------|
| Starts with `L`, `I`, `V`, `M`, `F`, or `E` | `[LIVMFE]` |
| …then an `F` or a `Y` | `[FY]` |
| …then a `P`, followed by a `W`, followed by an `M` | `PWM` |
| …and ending in `K`, `R`, `Q`, `T` or `A` | `[KRQTA]` |

Put that all together, and the regular expression to match this signature looks like this:

* `[LIVMFE][FY]PWM[KRQTA]`

The following three short fragments of protein sequence all contain the Homeobox signature.  See if you can spot them by eye.

    ...LHNEANLRIYPWMRSAGADR...
    ...PTVGKQIFPWMKES...
    ...VFPWMKMGGAKGGESKRTR...

No, really take your time.  I'll wait.

It's starting to become clear that this is easier using Python.  Let's try it.

First I'll write myself a little utility function to highlight the part of the target that matches a regular expression.  (Don't worry, you're not expected to understand how this works yet!)

In [25]:
# You're not expected to understand the Python in this code box (yet!)
def highlight_match(regex, target, fg=35):
    m = re.search(regex, target)
    if m is None:
        # No match: return the target unchanged
        return target
    before = target[:m.start()]
    match = m.group()
    after = target[m.end():]
    # Wrap the matched part in ANSI escape codes (bold, colour number `fg`)
    return "{}\x1b[01;{}m{}\x1b[00m{}".format(before, fg, match, after)

Now let's test it on those three protein sequences:

In [26]:
targets = ["LHNEANLRIYPWMRSAGADR",
           "PTVGKQIFPWMKES",
           "VFPWMKMGGAKGGESKRTR"]

homeobox = r"[LIVMFE][FY]PWM[KRQTA]"

for target in targets:
    print(highlight_match(homeobox, target))

LHNEANLR[01;35mIYPWMR[00mSAGADR
PTVGKQ[01;35mIFPWMK[00mES
[01;35mVFPWMK[00mMGGAKGGESKRTR


Previous challenge too easy?  The same signature is also present in the protein sequence below.  Can you spot it by eye **now**?

In [27]:
prot_seq = """
MDPDCFAMSS YQFVNSLASC YPQQMNPQQN HPGAGNSSAG GSGGGAGGSG GVVPSGGTNG
GQGSAGAATP GANDYFPAAA AYTPNLYPNT PQPTTPIRRL ADREIRIWWT TRSCSRSDCS
CSSSSNSNSS NMPMQRQSCC QQQQQLAQQQ HPQQQQQQQQ ANISCKYAND PVTPGGSGGG
GVSGSNNNNN SANSNNNNSQ SLASPQDLST RDISPKLSPS SVVESVARSL NKGVLGGSLA
AAAAAAGLNN NHSGSGVSGG PGNVNVPMHS PGGGDSDSES DSGNEAGSSQ NSGNGKKNPP
QIYPWMKRVH LGTSTVNANG ETKRQRTSYT RYQTLELEKE FHFNRYLTRR RRIEIAHALC
LTERQIKIWF QNRRMKWKKE HKMASMNIVP YHMGPYGHPY HQFDIHPSQF AHLSA
"""

In [28]:
print(highlight_match(homeobox, prot_seq))


MDPDCFAMSS YQFVNSLASC YPQQMNPQQN HPGAGNSSAG GSGGGAGGSG GVVPSGGTNG
GQGSAGAATP GANDYFPAAA AYTPNLYPNT PQPTTPIRRL ADREIRIWWT TRSCSRSDCS
CSSSSNSNSS NMPMQRQSCC QQQQQLAQQQ HPQQQQQQQQ ANISCKYAND PVTPGGSGGG
GVSGSNNNNN SANSNNNNSQ SLASPQDLST RDISPKLSPS SVVESVARSL NKGVLGGSLA
AAAAAAGLNN NHSGSGVSGG PGNVNVPMHS PGGGDSDSES DSGNEAGSSQ NSGNGKKNPP
Q[01;35mIYPWMK[00mRVH LGTSTVNANG ETKRQRTSYT RYQTLELEKE FHFNRYLTRR RRIEIAHALC
LTERQIKIWF QNRRMKWKKE HKMASMNIVP YHMGPYGHPY HQFDIHPSQF AHLSA



And that, ladies and gentlemen, is why we have computers.

### Character class shortcuts

Some character classes are so useful that there are predefined shortcuts for them.

The simplest and most useful of these is the period (“`.`”) which matches *any single character* in the target — we've already mentioned that.

The pattern `GC.GC` would match all of `GCAGC`, `GCTGC`, `GCGGC` and `GCCGC`. However, the period would also match any character which is not a DNA base, or even a letter. Therefore, the whole pattern would also match `GCFGC`, `GC&GC` and `GC9GC`.

Here are some examples of patterns that include the dot, and what they match in a variety of targets:

| `.` | matches |
|---|---|
|| <code><span style="background-color: #FFFF00">a</span></code> |
|| <code><span style="background-color: #FFFF00">a</span>bc</code> |
|| <code><span style="background-color: #FFFF00">1</span>23</code> |
|| <code><span style="background-color: #FFFF00">a</span></code> |
|| <code><span style="background-color: #FFFF00">&nbsp;</span>&nbsp;&nbsp;&nbsp;abc</code> |

| `..` | matches | does not match |
|---|---|---|
|| <code><span style="background-color: #FFFF00">ab</span></code> | `a` |
|| <code><span style="background-color: #FFFF00">ab</span>cdef</code> | `1` |
|| <code><span style="background-color: #FFFF00">&nbsp;a</span>bc</code> | `q` |

| `a.` | matches | does not match |
|---|---|---|
|| <code><span style="background-color: #FFFF00">a1</span></code> | `bc ` |
|| <code><span style="background-color: #FFFF00">ab</span>c</code> | `a` |
|| <code>in&nbsp;<span style="background-color: #FFFF00">a&nbsp;</span>bin</code> | `cba` |
|| <code>T<span style="background-color: #FFFF00">an</span>ia</code> | `Sonia` |

These other character class shortcuts all start with a backslash (“`\`”) followed by a character, i.e. they look very similar to the special characters Python uses in its strings to indicate newlines, tabs, etc.  Be careful not to confuse them.  A regular expression pattern may be represented in Python as a string, but the contents of that pattern are interpreted by the regular expression engine.

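Here are some of the most commonly-used character class abbreviations.  (The equivalents shown hold for ASCII text.  For ordinary Python 3 strings, the shortcuts also match the corresponding Unicode characters unless the `re.ASCII` flag is used.)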

| shortcut | description                 | equivalent to    |
| ---------|-----------------------------|------------------|
| `\w`     | alphanumeric and underscore | `[A-Za-z0-9_]`      |
| `\W`     | non-alphanumeric            | `[^A-Za-z0-9_]`     |
| `\s`     | whitespace                  | `[ \t\n\r\f\v]`  |
| `\S`     | non-whitespace              | `[^ \t\n\r\f\v]` |
| `\d`     | decimal digit               | `[0-9]`          |
| `\D`     | non-digit                   | `[^0-9]`         |
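
Here is a short sketch of these shortcuts in action. The string `accession` is an invented example, not a real accession number, and the patterns are written with an `r` prefix so that Python passes the backslashes through to the regex engine untouched:

```python
import re

accession = "AB_12345 human"

print(re.search(r"\d", accession).group())     # the first decimal digit
print(re.search(r"\s", accession).start())     # position of the first whitespace character
print(re.search(r"\D\d", accession).group())   # a non-digit followed by a digit

# The underscore counts as a "word" character, so the first \W is the space
print(re.search(r"\W", accession).start())
```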



### Aside: Raw strings

As we've just seen, there are some regular expression patterns (like character class shortcuts) that contain backslash (“`\`”) characters.  Recall from the section on strings and text that certain combinations starting in `\` are interpreted by Python to have special meaning. For example, `\n` means start a new line, and `\t` means insert a tab character.

Obviously, when we define a regular expression pattern, we'd prefer that Python **does not** interpret its own set of special characters, since we need the pattern — backslashes and all — to be passed to the regex engine.

Python’s way round this problem is to have a special way of defining strings: If we put the letter `r` immediately before the opening quotation mark, then any special characters inside the string are ignored:

In [29]:
print(r"one\ttwo\n")

one\ttwo\n


The `r` stands for "raw", which is Python’s description for a string where special characters are ignored. Notice that the `r` goes outside the quotation marks – it is not part of the string itself. We can see from the output that the above code prints out the string just as we’ve written it without any tabs or new lines.  (Note that you can write raw strings with double quotes, single quotes or triple quotes, just like any other Python string.)

It's a good idea to get into the habit of using raw strings to define regular expressions, to prevent Python from misinterpreting any special characters in those regular expressions.
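
The sketch below shows what can go wrong without the `r` prefix. It uses the regex `\b`, which marks a word boundary. In a normal Python string, `\b` is read as a backspace character, so the regex engine never sees the backslash. The target string is made up for illustration:

```python
import re

# In a normal string, Python turns \t into a single tab character
print(len("\t"))
# In a raw string, the backslash and the t are kept as two separate characters
print(len(r"\t"))

# With a raw string the regex engine receives \b and looks for word boundaries
print(re.search(r"\bcat\b", "a cat sat"))
# Without it, Python has already replaced \b by a backspace character,
# which does not occur in the target, so the search fails
print(re.search("\bcat\b", "a cat sat"))
```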

### Multipliers

A character class, as we've seen, matches only a single character in the target, irrespective of whether it's a manually created character class like `[^PWM]`, a character class abbreviation like `\S`, or the dot (`.`).

What if we want to match two or more consecutive characters belonging to the same character class?  In that case, we have to use a *multiplier*.

A multiplier is written directly after a character class **or** a literal character.  It specifies how many times the preceding character or character class must occur consecutively.  Basic multipliers are written in curly braces, as follows:

| syntax | description |
|---|---|
| `{n}` | match exactly `n` times |
| `{n,}` | match `n` or more times |
| `{,m}` | match between 0 and `m` times |
| `{n,m}` | match between `n` and `m` times |

So this regular expression:

    [FILAPVM]{5}
    
…is equivalent to this regular expression:

    [FILAPVM][FILAPVM][FILAPVM][FILAPVM][FILAPVM]

Here are some examples of targets it matches:

| `[FILAPVM]{5}` | matches |
|---|---|
|| <code><span style="background-color: #FFFF00">AAAAA</span></code> |
|| <code><span style="background-color: #FFFF00">AAPAP</span></code> |
|| <code><span style="background-color: #FFFF00">LAPMA</span>VAILA</code> |
|| <code><span style="background-color: #FFFF00">VILLA</span>MAP</code> |

Another example:

| `A{3,5}` | matches | does not match |
|---|---|---|
|| <code><span style="background-color: #FFFF00">AAA</span></code> | `AA` |
|| <code><span style="background-color: #FFFF00">AAAA</span></code> | `ATA` |
|| <code><span style="background-color: #FFFF00">AAAAA</span>A</code> | `ATTAATA` |

Some multipliers are used so often that there are special abbreviations for them:

| syntax | meaning | equivalent to |
|--------|---------|---------------|
| `*`    | match 0 or more times | `{0,}` |
| `+`    | match 1 or more times | `{1,}` |
| `?`    | match 0 or 1 times, i.e. “optional” | `{0,1}` |

A question mark (“`?`”) immediately following a literal character or character class means that that character (or class) is optional – it can match either zero or one times. So in the pattern `GAT?C` the `T` is optional, and the pattern will match either `GATC` or `GAC`.

A plus sign (“`+`”) immediately following a character or group means that the character or group must be present but can be repeated any number of times – in other words, it will match one or more times. For example, the pattern `GGGA+TTT` will match three `G`s, followed by one or more `A`s, followed by three `T`s. So it will match `GGGATTT`, `GGGAATTT`, `GGGAAATTT`, etc., but not `GGGTTT`.

An asterisk (“`*`”) immediately following a character or group means that the character or group is optional, but can also be repeated. In other words, it will match zero or more times. For example, the pattern `GGGA*TTT` will match three `G`s, followed by zero or more `A`s, followed by three `T`s. So it will match `GGGTTT`, `GGGATTT`, `GGGAATTT`, etc.

If we want to specify a different number of repeats, we have to use curly braces. Following a literal character or character class with a single number inside curly brackets will match exactly that number of repeats. For example, the pattern `GA{5}T` will match `GAAAAAT` but not `GAAAAT` or `GAAAAAAT`. Following a character or group with a pair of numbers inside curly brackets separated with a comma allows us to specify an acceptable range of number of repeats. For example, the pattern `GA{2,4}T` will match `GAAT`, `GAAAT` and `GAAAAT` but not `GAT` or `GAAAAAT`.
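
The examples above can be checked with a short loop. This is a sketch that pairs some of the patterns with targets mentioned in the text:

```python
import re

# (pattern, target) pairs taken from the examples in the text
examples = [
    ("GAT?C", "GAC"),          # the T is optional
    ("GGGA+TTT", "GGGTTT"),    # at least one A is required
    ("GGGA*TTT", "GGGTTT"),    # zero A's is fine
    ("GA{5}T", "GAAAAT"),      # only four A's, five are required
    ("GA{2,4}T", "GAAAT"),     # three A's is within the range 2 to 4
]

for pattern, target in examples:
    result = "match" if re.search(pattern, target) else "no match"
    print(pattern, target, result)
```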

>Some of these pattern characters — specifically the `?` and `*` — look a lot like the characters we type on the UNIX shell command line to match filenames.  But *they're not the same thing*, so don't confuse them.  The so-called "globbing" syntax used by the shell is not the same as regular expressions, and is much more limited.

### Groups

Square brackets, as we've seen, denote character classes. Parentheses, on the other hand, merely *group* characters (or character classes).  By themselves they don't change how matching is performed, but they can be used to group characters for multipliers.  Some examples:

| regex with multiplier | equivalent to |
|---|---|
| `[CG]{5}` | `[CG][CG][CG][CG][CG]` |
| `CG{5}` | `CGGGGG` |
| `(CG){5}` | `CGCGCGCGCG` |

If we want to apply a multiplier (say, `?`) to more than one character, we can group the characters in parentheses. For example, in the pattern `GGG(AAA)?TTT` the group of three `A`s is optional, so the pattern will match either `GGGAAATTT` or `GGGTTT`.

To state it more plainly:  A multiplier applies to either…

* the directly preceding literal character, or
* the directly preceding character class in `[]`, or
* the directly preceding group in `()`.
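
As a sketch, here are the three patterns from the table tried against the same target, followed by the optional group:

```python
import re

target = "CGCGCGCGCG"   # five repeats of CG

print(bool(re.search("(CG){5}", target)))   # the group CG, five times
print(bool(re.search("CG{5}", target)))     # a C followed by five G's
print(bool(re.search("[CG]{5}", target)))   # any five characters that are C or G

# An optional group: the three A's are either all present or all absent
print(bool(re.search("GGG(AAA)?TTT", "GGGTTT")))
print(bool(re.search("GGG(AAA)?TTT", "GGGAATTT")))
```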

### Alternation

A vertical bar (“`|`”) *within a group* (delimited by parentheses) denotes *alternatives*.

For instance, a stop codon can be `TAA`, `TAG` or `TGA`.

We might try to write a pattern to match a stop codon like this:

    T[AG][AG]
    
…but this would match (for example) `TGG`, which is wrong!

To write a regular expression pattern to match a stop codon we have to use alternation syntax:

    (TAA|TAG|TGA)
    
Or also:

    T(AA|AG|GA)

Note that a character class like `[KRQTA]` can be written in terms of alternation as `(K|R|Q|T|A)`, which works but is more verbose.  On the other hand, alternation syntax offers no way to write negated character classes (like `[^PWC]`) or ranges (like `[A-Z]`).
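
The difference between the two stop codon patterns can be seen in a small sketch. The names `too_loose` and `exact` are just labels chosen for this illustration:

```python
import re

codons = ["TAA", "TAG", "TGA", "TGG"]

for codon in codons:
    too_loose = bool(re.search("T[AG][AG]", codon))      # character classes
    exact = bool(re.search("(TAA|TAG|TGA)", codon))      # alternation
    print(codon, too_loose, exact)
```

Only `TGG` gives different answers: the character-class pattern accepts it, while the alternation correctly rejects it.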

### Opportunism and greediness

Regular expressions are *opportunistic*.  They consume (match) the **first** substring of the target that matches the pattern:

| regex | matches |
|---|---|
| `abc` | <code><span style="background-color: #FFFF00">abc</span>abcabc</code> |
| `..` | <code><span style="background-color: #FFFF00">ab</span>cabcabc</code> |

Multipliers are *greedy*.  They consume (match) as many characters of the target as they can:

| regex | matches |
|---|---|
| `a*x` | <code><span style="background-color: #FFFF00">aaax</span>aaxax</code> |
| `a.*x` | <code><span style="background-color: #FFFF00">aaaxaaxax</span></code> |

>Python's `re` module offers ways to change some of these aspects of regular expressions.  For instance, one can toggle greediness on and off.  We won't cover that in this course, but you can find out how in the `re` module's documentation.
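The default behaviour shown in the two tables above can be reproduced with `re.search()`. A small sketch, using the same targets as the tables:

```python
import re

# Opportunism: the first possible match in the target wins
print(re.search(r"abc", "abcabcabc").span())    # (0, 3)
print(re.search(r"..", "abcabcabc").group())    # ab

# Greediness: multipliers consume as much as they can
target = "aaaxaaxax"
print(re.search(r"a*x", target).group())        # aaax
print(re.search(r"a.*x", target).group())       # aaaxaaxax
```

In the last case, `.*` runs all the way to the end of the target, because the final character is an `x` that can still satisfy the rest of the pattern.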

### Anchors

*Anchors* are different from literal characters or character classes, in that they match *positions* rather than *characters* in the target string. Here are some of the more common anchors:

| anchor | description |
|---|---|
| `^` | match at the beginning of the target |
| `$` | match at the end of the target |
| `\b` | match beginning or end of a word |
| `\B` | match zero-length string not at beginning or end of a word |

Some examples:

| regex | description |
|---|---|
| `^A` | start with an `A` |
| `^[MPK]` | start with an `M`, `P` or `K` |
| `E$` | end with an `E` |
| `[QSN]$` | end with a `Q`, `S` or `N` |
| `^[^P]` | start with anything *except* `P` |
| `^A.*E$` | matches entire target that starts with `A` and ends with `E` |

Some examples of regular expressions using anchors:

| regex | matches | doesn't match |
|--|--|--|
| `^abc` | <code><span style="background-color: #FFFF00">abc</span></code> | &nbsp;abc |
| `^abc` | <code><span style="background-color: #FFFF00">abc</span>def</code> | xabc |
| `^a.*e$` | <code><span style="background-color: #FFFF00">ae</span></code> | abc |
| `^a.*e$` | <code><span style="background-color: #FFFF00">abcde</span></code> | abcdef |
| `^$` | `<empty string>` | `<anything else>` |
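The anchor examples can be tried out in Python as well. The sketch below runs a few pattern/target pairs through `re.search()`. The two `\b` cases use made-up targets that aren't in the tables above.

```python
import re

cases = [
    (r"^abc", "abcdef"),             # found: target starts with abc
    (r"^abc", "xabc"),               # not found: abc is not at the start
    (r"^a.*e$", "abcde"),            # found: starts with a, ends with e
    (r"^a.*e$", "abcdef"),           # not found: ends with f
    (r"^$", ""),                     # found: the empty string
    (r"\bgene\b", "the gene name"),  # found: gene is a whole word
    (r"\bgene\b", "pseudogenes"),    # not found: gene is inside a word
]

for pattern, target in cases:
    found = re.search(pattern, target) is not None
    print(repr(pattern), repr(target), found)
```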

### Putting it all together

The real power of regular expressions comes from combining these tools. We can use multipliers together with alternations and character classes to specify very flexible patterns. For example, here’s a complex pattern to identify full-length eukaryotic messenger RNA sequences:

    ^ATG[ATGC]{30,1000}A{5,10}$

Reading the pattern from left to right, it will match:

* an `ATG` start codon at the beginning of the sequence
* followed by between 30 and 1000 bases which can be `A`, `T`, `G` or `C`
* followed by a poly-`A` tail of between 5 and 10 bases at the end of the sequence
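To get a feeling for how the three parts interact, we can try the pattern on a few sequences. The sequences below are hypothetical and constructed purely for illustration (they're not real transcripts), and the variable names are ours.

```python
import re

mrna_pattern = r"^ATG[ATGC]{30,1000}A{5,10}$"

body = "GCT" * 12   # 36 made-up bases for the middle section

candidates = {
    "full length": "ATG" + body + "A" * 8,
    "no start codon": "CTG" + body + "A" * 8,
    "tail too short": "ATG" + body + "AAA",
    "body too short": "ATG" + "GCT" * 3 + "A" * 8,
}

for label, seq in candidates.items():
    matched = re.search(mrna_pattern, seq) is not None
    print(label, ":", matched)
```

Only the first candidate satisfies all three conditions at once. Each of the others violates exactly one part of the pattern.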

As you can see, regular expressions can be quite tricky to read until you’re familiar with them! However, it’s well worth investing a bit of time learning to use them, as the same notation is used across multiple different tools. The regular expression skills that you learn in Python are transferable to other programming languages, command line tools, and text editors.

The features we’ve discussed above are the ones most useful in biology, and are sufficient to tackle all the exercises at the end of the section. However, there are many more regular expression features available in Python. If you want to become a regular expression master, it’s worth reading up on greedy vs. minimal quantifiers, back-references, lookahead and lookbehind assertions, and built-in character classes.

## Using the `re` module

### `re.search()`

Now that we've covered the basics of regular expression syntax, let's get back to Python's standard library `re` module and have a look at some of the tools it contains.

We've already seen the simplest regular expression tool:  `re.search()` is a function that matches a regular expression pattern against a target string, returning a match object if it matches, or else `None`.

It takes two arguments, both strings. The first argument is the pattern that you want to search for (the regular expression, usually expressed in a raw string), and the second argument is the target string that you want to search in. For example, here’s how we test if a DNA sequence contains an EcoRI restriction site:

In [None]:
dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")

Notice that we’ve used the raw notation for the pattern, even though it’s not strictly necessary as it doesn’t contain any special characters.  It's just a good habit to get into when writing regular expression patterns!

Also notice that this simple example is not a good use of regular expressions, since we could have performed the same search using the `in` operator, or even the `find` method of a string, both of which are more efficient and don't require importing the `re` module:

In [None]:
if "GAATTC" in dna:
    print("restriction site found!")

In [None]:
if dna.find("GAATTC") != -1:
    print("restriction site found!")

However, we'll soon encounter examples where the flexibility and power of regular expressions will allow us to tackle problems that `in` or the `find` method of a string can't solve.

>The `re` module also includes a function `re.match()` which works almost exactly like `re.search()`, except it always matches only at the beginning of the target string.  Personally, I'm not 100% sure why `re.match()` even exists.  I just mention it so you don't get confused between the two.
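The difference between the two functions is easiest to see side by side. A short sketch, reusing the EcoRI example from above:

```python
import re

dna = "ATCGCGAATTCAC"

# re.search() looks for the pattern anywhere in the target
print(re.search(r"GAATTC", dna) is not None)    # True

# re.match() only succeeds if the pattern matches at the very beginning
print(re.match(r"GAATTC", dna) is not None)     # False
print(re.match(r"ATCG", dna) is not None)       # True

# For this single-line target, anchoring the search with ^ gives the same result
print(re.search(r"^GAATTC", dna) is not None)   # False
```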

### Extracting the part of the string that matched

In the section above we used `re.search()` as the condition in an `if` statement to decide whether or not a string contained a pattern. Often in our programs, we want to find out not only if a pattern matched, but what part of the string was matched. To do this, we need to store the match object returned by `re.search()`, then use its `group()` method to find the matching parts of the target.

The value that's actually returned by `re.search()` is a match object – a new object type defined in the `re` module. A match object doesn’t represent a simple thing, like a number or string. Instead, it represents something abstract:  The results of a regular expression search. A match object has a number of useful methods for getting data out of it.

One such method is `group()`. If we call this method on a match object that represents the result of a regular expression search, we get the portion of the target string that matched the pattern:

In [30]:
dna = "ATGACGTACGTACGACTG"
 
# store the match object in the variable m
m = re.search(r"GA[ATGC]{3}AC", dna)
print(m.group())

GACGTAC


In the above code, we’re searching inside a DNA sequence for `GA`, followed by exactly three bases, followed by `AC`. By calling the group method on the resulting match object, we can see the part of the DNA sequence that matched, and figure out what the middle three bases were:

    GACGTAC

What if we want to extract more than one bit of the pattern? Say we want to match this pattern:

    GA[ATGC]{3}AC[ATGC]{2}AC

That’s `GA`, followed by three bases, followed by `AC`, followed by two bases, followed by `AC` again. We can group the bits of the pattern that we want to extract with parentheses.  In this case, adding these groups doesn't change what the regular expression will or won't match, but it does influence what will be returned by the match object's `group` method.  We say that the groups *capture* bits of the target string.

    GA([ATGC]{3})AC([ATGC]{2})AC

We can now refer to the captured bits of the pattern by supplying an argument to the group method. `group(1)` will return the bit of the string matched by the section of the pattern in the first set of parentheses, `group(2)` will return the bit matched by the second, etc.:

In [31]:
dna = "ATGACGTACGTACGACTG"
 
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match:\t", m.group())
print("first bit:\t", m.group(1))
print("second bit:\t", m.group(2))

entire match:	 GACGTACGTAC
first bit:	 CGT
second bit:	 GT


The output shows that the three bases in the first variable section were `CGT`, and the two bases in the second variable section were `GT`.
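The calls to `group()` above rely on the search having succeeded. Since `re.search()` returns `None` when nothing matches, calling `group()` on the result of a failed search raises an error. The following sketch wraps the check in a small helper function. The name `variable_sections` is hypothetical and not part of the `re` module.

```python
import re

def variable_sections(dna):
    """Return the two captured sections as a tuple, or None if there is no match."""
    m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
    if m is None:
        # no match: there is no match object to ask for groups
        return None
    return m.group(1), m.group(2)

print(variable_sections("ATGACGTACGTACGACTG"))   # ('CGT', 'GT')
print(variable_sections("TTTTTTTT"))             # None
```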

As well as containing information about the contents of a match, the match object also holds information about the position of the match. The `start` and `end` methods return the positions where the match starts and ends in the target string:

In [None]:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start:\t", m.start())
print("end:\t", m.end())

Remember that we start counting from zero, so in this case, the match starting at the third base has a start position of two.

We can get the start and end positions of individual groups by supplying a number as the argument to `start` and `end`:

In [None]:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start:", m.start())
print("end:", m.end())
print("group one start:", m.start(1))
print("group one end:", m.end(1))
print("group two start:", m.start(2))
print("group two end:", m.end(2))

In this particular case, we could figure out the start and end positions of the individual groups from the start and end positions of the whole pattern, but that might not always be possible for patterns that have variable length repeats.
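Here is a small sketch of such a variable-length case. The sequence is made up, and we use the match object's `span()` method, which returns the start and end positions of a group together as a tuple.

```python
import re

dna = "GGAAATTTTCC"

# a run of As followed by a run of Ts, each of unknown length
m = re.search(r"(A+)(T+)", dna)

for number in (1, 2):
    start, end = m.span(number)
    print("group", number, "from", start, "to", end, "->", dna[start:end])
```

Because the lengths of the two runs aren't fixed by the pattern, the group positions can't be worked out from `m.start()` and `m.end()` alone.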

### Splitting a string using a regular expression

Occasionally it can be useful to split a string using a regular expression pattern as the delimiter. The normal `split()` method of string objects doesn't allow this, but the `re` module has a `split()` function of its own that uses a regular expression pattern as the delimiter. The first argument is the pattern (to use as delimiter), the second argument is the string to be split.

Imagine we have a consensus DNA sequence that contains ambiguity codes, and we want to extract all runs of contiguous unambiguous bases. We need to split the DNA string wherever we see a base that *isn't* `A`, `T`, `G` or `C`:

In [None]:
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
print(runs)

Recall that putting a caret (“`^`”) at the start of a character class negates it. The output shows how the function works – the return value is a list of strings.
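One detail worth knowing: if two ambiguity codes sit next to each other, splitting on a single character produces an empty string between them. The sketch below shows this with a made-up sequence, and one possible way around it (adding a `+` multiplier so that a whole run of ambiguous bases acts as a single delimiter).

```python
import re

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
print(re.split(r"[^ATGC]", dna))
# ['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']

# Hypothetical sequence with two adjacent ambiguity codes (N and R)
dna_adjacent = "ACTNRGCAT"
print(re.split(r"[^ATGC]", dna_adjacent))    # ['ACT', '', 'GCAT']
print(re.split(r"[^ATGC]+", dna_adjacent))   # ['ACT', 'GCAT']
```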

### Finding multiple matches

The examples we’ve seen so far deal with cases where we’re only interested in a single occurrence of a pattern in a string. If instead we want to find *every* place where a pattern occurs in a string, there are two functions in the `re` module to help us.

`re.findall()` returns a list of all matches of a pattern in a string. The first argument is the pattern, and the second argument is the target string. Say we want to find all runs of `A` and `T` that are longer than five bases in a DNA sequence:

In [None]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{6,}", dna)
print(runs)

Notice that the return value of the `findall()` function is not a match object – it is a straightforward list of strings.  Thus we have no way to extract the positions. If we want to do anything more complicated than simply extracting the text of the matches, we need to use the `re.finditer()` function. `finditer()` returns an iterator that yields match objects, so to do anything useful with it, we need to iterate over it:

In [None]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{6,}", dna)

for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT rich region from", run_start, "to", run_end)

As we can see from the output, `finditer` gives us considerably more flexibility than `findall`.
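Since each item produced by `finditer()` is a full match object, we can collect the positions and the matched text in one go. A minimal sketch, using the same sequence as above (the variable name `regions` is ours):

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# one (start, end, text) tuple per AT-rich run
regions = [(m.start(), m.end(), m.group())
           for m in re.finditer(r"[AT]{6,}", dna)]

for start, end, text in regions:
    print("AT rich region from", start, "to", end, ":", text, "length", len(text))

# findall() gives us only the text of the same matches
print(re.findall(r"[AT]{6,}", dna))   # ['ATTATAT', 'AAATTATA']
```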

---

## Exercises

### 1. Accession names

Here’s a list of made-up gene accession names:

    xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp

Write a program that will print only the accession names that satisfy the following criteria – treat each criterion separately:

* contain the number 5
* contain the letter `d` or `e`
* contain the letters `d` and `e` in that order
* contain the letters `d` and `e` in that order with a single letter between them
* contain both the letters `d` and `e` in any order
* start with `x` or `y`
* start with `x` or `y` and end with `e`
* contain three or more numbers in a row
* end with `d` followed by either `a`, `r` or `p`

In [None]:
# Exercise 1
accessions = \
"xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp".split(', ')

regexes = [r'']

### 2. Double digest

In the `files` subdirectory there’s a file called `regex_dna.txt` which contains a made-up DNA sequence. Predict the fragment lengths that we will get if we digest the sequence with two made-up restriction enzymes – AbcI, whose recognition site is `ANT*AAT`, and AbcII, whose recognition site is `GCRW*TG` (asterisks indicate the position of the cut site).

In [None]:
%cd files

In [None]:
# Exercise 2

### 3. Open reading frames

Regular expressions are very useful for detecting all kinds of sequence features. In this exercise we’ll use them to detect open reading frames. 

A DNA sequence can be read in one of three frames (ignoring the reverse direction for the purpose of this exercise):

```
Sequence:     ATGCCCAAGCTGAATAGCGTAGAGGGGTTTTAA

Frame 1: ATG CCC AAG CTG AAT AGC GTA GAG GGG TTT TAA
Frame 2:  TGC CCA AGC TGA ATA GCG TAG AGG GGT TTT AA
Frame 3:   GCC CAA GCT GAA TAG CGT AGA GGG GTT TTA A
```

The region of the nucleotide sequence from the start codon to the stop codon is called the open reading frame (ORF).

* The start codon is `ATG`.
* A stop codon is `TAA`, `TAG` or `TGA`.

In the example above, Frame 1 contains the longest (and only) open reading frame. In this case, it extends over the entire length of the sequence, which starts with a start codon and ends with a stop codon.

**Write a function which takes a sequence as argument, and determines which of the three (forward) reading frames contains the longest open reading frame. Use a regular expression.**

Assume for the sake of simplicity that each frame will contain at most one ORF.

*Hints:*

* Keep the documentation of the re module open throughout!


* You can compile your regular expression to a pattern object (`re.Pattern` in current Python versions) by using `re.compile()`. The `search()` method of a pattern object takes a useful optional second parameter which indicates the index in the string where the search is to start. If `search()` finds a match, it returns a match object. (This parameter is missing from the basic `search()` function in the `re` module.)


* A match object has a `group()` method which can be used to extract parts of the target string matched by parenthesised sections of the regular expression pattern. The `re` module’s documentation has some clear examples explaining this concept.


* In regex syntax, you can’t apply a multiplier directly to another multiplier. But you can do so if you group the “inner” multiplier (and the pattern it applies to).

  In other words, if you want to match one or more groups of three characters (each of which can be `A`, `T`, `G` or `C`) you can do it with the following pattern:

      ([ACGT]{3})+
      
  (Oops, this hint just gave the entire game away!)
  
Finally, test your function using some sequences you’ve constructed manually.

In [None]:
# Exercise 3