<h1 id="toctitle">Regular expressions</h1>
<ul id="toc"/>

## What are regular expressions?

Regular expressions (a.k.a. _regex_) are a special mini-language for describing patterns in strings.

They are handy for working with patterns in DNA and protein sequences, and in text files of many different types.

They also crop up in other tools, such as text editors and `grep`.

### Regular expression module

The tools for using regular expressions live in the `re` module. We have to `import` the module, then use the module name when running functions:

In [1]:
import re
re.search('a', 'abc')  # look for 'a' in 'abc'

<re.Match object; span=(0, 1), match='a'>

In [2]:
search('a', 'abc')

NameError: name 'search' is not defined

## Searching for patterns with variation

`re.search()` takes two arguments: a pattern and a string. It searches for the pattern in the string and returns `True` or `False`. (Actually it returns a "match object" or `None` but we can treat these as `True` or `False` for the purpose of an `if` statement.)
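
A minimal sketch of what `re.search()` actually hands back, using short made-up strings. The variable names `found` and `missing` are only for illustration:

```python
import re

# A successful search returns a match object, which counts as true
found = re.search("b", "abc")
# An unsuccessful search returns None, which counts as false
missing = re.search("z", "abc")

print(found is not None)           # True
print(missing is None)             # True
print(bool(found), bool(missing))  # True False
```

This is why the result can be used directly as the condition of an `if` statement.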

Here is a boring pattern:

In [3]:
dna = "ATCGCGAATTCAC"

if re.search("GAATTC", dna):
    print("restriction site found!")
else:
    print("no restriction site!")

restriction site found!


### Alternation

Here's an example of a pattern that's a bit more interesting. When there are two different possibilities we surround them with parentheses and separate with pipe characters:

In [None]:
dna = "ATCGCGAATTCAC"

if re.search("GG(A|T)CC", dna):  # matches GGACC or GGTCC
    print("restriction site found!")
else:
    print("no restriction site!")

### Character groups

A very common type of alternation is when we want to allow any one of a list of characters. We can write it like this:

In [None]:
dna = "ATCGCGAATTCAC"
if re.search("GC(A|T|G|C)GC", dna):
    print("restriction site found!")

or with a shorthand like this:

In [None]:
if re.search("GC[ATGC]GC", dna):  # square brackets = any one of the characters inside
    print("restriction site found!")

Sometimes it's easier to describe a character group by listing the characters that are __not__ allowed. Special rule: if the character group starts with ^ then it means any character except these ones:

In [4]:
dna = "ATCGCGYAATTCAC"

if re.search("[^ATGC]", dna):  # ^ inside brackets: any character that is not A, T, G or C
    print("ambiguous base found!")

ambiguous base found!


There are useful shortcuts for some commonly-used character groups:

- A full stop (aka dot, period, `.`) stands for any character (except a newline, by default)
- `\d` means any digit
- `\s` means any whitespace (spaces, tabs, newlines)
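
A small sketch of the three shortcuts in action. The `line` string is invented for this illustration. The patterns containing backslashes are written as raw strings (`r"..."`) so that Python passes the backslash through to `re` unchanged:

```python
import re

# Hypothetical tab-separated record, made up for this example
line = "gene7\tATGC"

# "." matches any single character
any_char = re.search("A.GC", line).group()
# "\d" matches a single digit
digit = re.search(r"gene\d", line).group()
# "\s" matches whitespace, here the tab between the two fields
whitespace = re.search(r"\d\sA", line).group()

print(any_char)               # ATGC
print(digit)                  # gene7
print(whitespace == "7\tA")   # True
```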

### Quantifiers

Another type of variation is the number of times something is repeated.

A question mark means the thing preceding it is optional. In the pattern `GCG?Y` the second G is optional. The pattern will match `GCGY` and `GCY`.

A plus means that the thing preceding it must appear at least once and can be repeated. In the pattern `GCG+Y` the second G can be repeated, so it matches `GCGY`, `GCGGY`, `GCGGGY`, etc. but __not__ `GCY`.

An asterisk is the most flexible quantifier; the thing preceding it is optional, but can also be repeated. The pattern `GCG*Y` will match `GCY`, `GCGY`, `GCGGY`, `GCGGGY`, etc.

For more specificity, we can specify a minimum and maximum number of repetitions:

`GCG{2,4}Y` will match `GCGGY`, `GCGGGY` and `GCGGGGY` but __not__ `GCGY` or `GCGGGGGY`, etc. 

We can also leave out the maximum: `{2,}` means two or more times, because only the lower limit is set.
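
One way to check these rules is to run each quantifier against the same handful of strings. The sketch below uses `re.fullmatch()`, which only succeeds when the whole string fits the pattern. The `candidates` list is made up for illustration:

```python
import re

candidates = ["GCY", "GCGY", "GCGGY", "GCGGGGGY"]
patterns = ["GCG?Y", "GCG+Y", "GCG*Y", "GCG{2,4}Y", "GCG{2,}Y"]

results = {}
for pattern in patterns:
    # keep only the candidates that the pattern describes completely
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# GCG?Y ['GCY', 'GCGY']
# GCG+Y ['GCGY', 'GCGGY', 'GCGGGGGY']
# GCG*Y ['GCY', 'GCGY', 'GCGGY', 'GCGGGGGY']
# GCG{2,4}Y ['GCGGY']
# GCG{2,}Y ['GCGGY', 'GCGGGGGY']
```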



### Anchors

Unlike all the stuff above, __anchors__ specify where the pattern has to match the string. 

`^` means the start of the string when it appears outside square brackets (inside square brackets it negates the character group, as described earlier). So `^G` will match `GATC` but not `ATGC`.

`$` means the end, so the pattern `G$` will match `ATCG` but not `AGTC` 
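
A quick sketch of both anchors, using the strings from the text. The last line shows one detail worth knowing when reading sequences from files: `$` also matches just before a newline at the very end of the string.

```python
import re

starts_with_g = [s for s in ["GATC", "ATGC"] if re.search("^G", s)]
ends_with_g = [s for s in ["ATCG", "AGTC"] if re.search("G$", s)]
print(starts_with_g)  # ['GATC']
print(ends_with_g)    # ['ATCG']

# "$" still matches when the string ends with a single trailing newline
trailing_newline = bool(re.search("G$", "ATCG\n"))
print(trailing_newline)  # True
```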

### Combinations

The real power of regular expressions comes from combining all these features. Here's a complex regular expression that describes a full length messenger RNA with start codon and polyA tail:

`^ATG[ATGC]{30,1000}A{5,10}$`

Look at the features:

- string must start with ATG
- then between 30 and 1000 bases that must be A/T/G/C
- string must end with between 5 and 10 consecutive As
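
To see the combined pattern at work, here is a sketch that tests it against a few sequences. All four sequences are hypothetical and built only for this illustration:

```python
import re

mrna_pattern = "^ATG[ATGC]{30,1000}A{5,10}$"

# Hypothetical sequences, assembled from repeats to keep them readable
sequences = {
    "good": "ATG" + "GC" * 20 + "A" * 6,       # start codon, 40 bases, 6 As
    "no_start": "TTG" + "GC" * 20 + "A" * 6,   # wrong first codon
    "too_short": "ATG" + "GC" * 5 + "A" * 6,   # too few bases after ATG
    "long_tail": "ATG" + "GC" * 20 + "A" * 15, # 15 As at the end
}

matches = {name: bool(re.search(mrna_pattern, seq)) for name, seq in sequences.items()}
print(matches)
# {'good': True, 'no_start': False, 'too_short': False, 'long_tail': True}
```

The `long_tail` result is worth a second look: because `[ATGC]` also matches A, extra As at the end can be absorbed by the middle part of the pattern. In practice the pattern guarantees at least 5 trailing As rather than enforcing a hard upper limit of 10.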



## Other stuff we can do with regular expressions

### Extracting the match

In an `if` statement, `re.search()` just gives us a true/false answer, but in the case of finding a match it actually returns a "match object". We can save that match object and use methods to get information from it. For example with our non-ATGC base example:

In [5]:
dna = "ATCGCGYAATTCAC"

if re.search("[^ATGC]", dna):
    print("ambiguous base found!")

ambiguous base found!


We know that we found a non-ATGC base, but what was it? Calling `group()` on the match object will tell us, and `start()` will show the position:

In [6]:
dna = "CGATNCGGAAYCGATC"
mo = re.search("[^ATGC]", dna)

# mo is now a match object (or None if nothing matched)
if mo:  # a match object counts as true, so this block runs only when there is a match
    print("ambiguous base found!")
    ambig = mo.group()
    position = mo.start()  # position in the string where the match starts
    print("the base is " + ambig)
    print("the base is at position " + str(position))

ambiguous base found!
the base is N
the base is at position 4


As always, the position is counted from 0. Also `end()` will give the end, and just like with substring indexing this is one past the last character, so in this case will be 5.
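
A short sketch confirming how `start()`, `end()` and `group()` relate, using the same sequence as above. `span()` is a standard match object method that returns both positions as a tuple:

```python
import re

dna = "CGATNCGGAAYCGATC"
mo = re.search("[^ATGC]", dna)

print(mo.start(), mo.end())  # 4 5
print(mo.span())             # (4, 5)

# Slicing with start() and end() recovers the same text as group()
same_text = dna[mo.start():mo.end()] == mo.group()
print(same_text)             # True
```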

### Splitting a string with a regex

`re.split()` works just like regular `split()`, but takes a regular expression pattern as the separator. Here we split a DNA sequence whenever we see a non-ATGC base. Note that this reuses the pattern from the previous example!

In [7]:
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split("[^ATGC]", dna)  # split at any character that isn't A, T, G or C, i.e. at ambiguous bases

print(runs)

['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']


The output is a list of strings. 

Above, we exclude the bits of the string that matched the pattern and just keep the non-matching bits. For the opposite, use `re.findall()`. E.g. find all runs of A/T that are at least four bases long:

In [8]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall("[AT]{4,}", dna)  # runs of at least four bases that are each A or T

print(runs)

['ATTATAT', 'AAATTATA']


### Finding multiple matches

Some problems require getting a list of match objects for all matches, not just the first - use `re.finditer()`. E.g. using the same pattern and sequence, what are the start/stop positions of all runs of A/T?

In [9]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer("[AT]{4,}", dna)

for mo in runs:
    run_start = mo.start()  # where the run starts
    run_end = mo.end()      # one past where the run ends
    print(run_start, run_end)

5 12
18 26
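
If the positions are needed later rather than just printed, one option is to collect them in a list. This is a sketch with the same pattern and sequence; the name `run_info` is only for illustration:

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# One (start, end, sequence) tuple per run of four or more A/T bases
run_info = [(mo.start(), mo.end(), mo.group())
            for mo in re.finditer("[AT]{4,}", dna)]

print(run_info)
# [(5, 12, 'ATTATAT'), (18, 26, 'AAATTATA')]
```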


## Exercises

### Gene names

Here's a list of made-up gene accession numbers:

`xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp`

Write a program that will print only the accession names that satisfy the following criteria – treat each criterion separately:

- contain the number 5
- contain the letter d or e
- contain the letters d and e in that order
- contain the letters d and e in that order with a single letter between them
- contain both the letters d and e in any order
- start with x or y
- start with x or y and end with e
- contain three or more digits in a row
- end with d followed by either a, r or p

__Warning: this is another exercise where it's easy to get the wrong answer, so check your outputs.__

In [83]:
acc_n = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', '45da', 'de37dp']

# one pattern per criterion, in the order listed above
# 'd.*e|e.*d' requires both letters, in either order ('[de]' would accept just one of them)
# r'\d{3,}' is a raw string so the backslash reaches re unchanged
criteria = ['5', 'd|e', 'de', 'd.e', 'd.*e|e.*d', '^(x|y)', '^[xy].*e$', r'\d{3,}', 'd[arp]$']

for specific in criteria:
    list_of_acc = []
    for item in acc_n:
        if re.search(specific, item):
            list_of_acc.append(item)
    print(str(list_of_acc) + " match: " + specific)

['xkn59438', 'hedle3455', 'xjhd53e', '45da'] match: 5
['yhdck2', 'eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', '45da', 'de37dp'] match: d|e
['de37dp'] match: de
['hedle3455'] match: d.e
['eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', 'de37dp'] match: d.*e|e.*d
['xkn59438', 'yhdck2', 'xjhd53e'] match: ^(x|y)
['xjhd53e'] match: ^[xy].*e$
['xkn59438', 'chdsye847', 'hedle3455'] match: \d{3,}
['45da', 'de37dp'] match: d[arp]$


### Double digest

Look at the file *long_dna.txt* which contains a made up DNA sequence. 

__/ indicates the position of the cut site__ for the following explanation.

Predict the fragment lengths that we will get if we digest the sequence with a made up restriction enzyme __AbcI__, whose recognition site is `ANT/AAT`.

What will the fragment lengths be if we do a double digest with both __AbcI__ and __AbcII__, whose recognition site is `GCRW/TG` (hard)? Can you predict the sequences of the fragments themselves?

In [76]:
with open("long_dna.txt") as dna_file:
    long_dna = dna_file.read().rstrip('\n')

# use finditer in case there are multiple restriction sites
rist_site = re.finditer("A[ATGC]TAAT", long_dna)

for mo in rist_site:
    rist_start = mo.start()  # where the recognition site starts
    rist_end = mo.end()      # one past where the recognition site ends
    print(rist_start, rist_end)

print(len(long_dna))

1140 1146
1625 1631
2012
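
The site positions still have to be turned into fragment lengths. Below is a minimal sketch of one way to do that, assuming the cut falls a fixed number of bases after the start of each recognition site (3 for a site written as `ANT/AAT`). The function name `fragment_lengths` and the toy sequence are hypothetical, not part of the exercise files:

```python
import re

def fragment_lengths(sequence, site_pattern, cut_offset):
    """Return fragment lengths after cutting at every match of site_pattern.

    cut_offset is the number of bases between the start of the
    recognition site and the cut position.
    """
    cuts = [mo.start() + cut_offset for mo in re.finditer(site_pattern, sequence)]
    # fragment boundaries: start of sequence, every cut, end of sequence
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]

# Hypothetical toy sequence, NOT the contents of long_dna.txt
toy = "GGGGACTAATCCCCCAGTAATTT"
toy_fragments = fragment_lengths(toy, "A[ATGC]TAAT", 3)
print(toy_fragments)  # [7, 11, 5]
```

Applied to the positions printed above (sites starting at 1140 and 1625 in a 2012-base sequence), the cuts fall at 1143 and 1628, giving fragments of 1143, 485 and 384 bases. For the double digest, one approach is to gather the cut positions for both enzymes into a single sorted list before taking the differences. Note that `re.finditer()` only reports non-overlapping matches, so overlapping sites would need extra handling.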

### Super bonus exercise: many restriction enzymes

The file *restriction_enzyme_data.txt* contains names and motifs for many different restriction enzymes, one per line. Each line has the name of the enzyme and its motif separated by a comma. In the motifs, forward slash indicates the cut position. Write a program that will take a DNA sequence and predict the fragment lengths when digested with each of these enzymes. You'll have to turn the motifs, which use ambiguity codes, into regular expressions. Use the file *ce1.txt* to test your program (it contains *C. elegans* chromosome one).
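
As a starting point for the motif conversion, here is a sketch that maps the standard IUPAC nucleotide ambiguity codes to character groups. The function name `motif_to_regex` is hypothetical, and the code assumes each motif contains at most one forward slash; the real file may need extra handling:

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Convert a motif such as 'GCRW/TG' into (pattern, cut_offset).

    Assumption: at most one '/' marks the cut position.
    cut_offset is -1 when the motif has no slash.
    """
    cut_offset = motif.find("/")
    bases = motif.replace("/", "")
    pattern = "".join(IUPAC[base] for base in bases)
    return pattern, cut_offset

pattern, offset = motif_to_regex("GCRW/TG")
print(pattern, offset)  # GC[AG][AT]TG 4

# The resulting pattern can be used like any other regular expression
print(bool(re.search(pattern, "TTGCATTGAA")))  # True
```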