## BS22003 Bioinformatics tutorial

Welcome to the BS22003 bioinformatics tutorial. Here we will explore how a computer can handle biological data and help us analyse it.
We will develop a strategy for cloning a gene into a plasmid. To do that we need to identify the appropriate enzymes to use: enzymes that have a single cut site in the right place and produce complementary sticky ends.

To do this we have to learn how to:
* represent sequences in a computer
* find the text corresponding to a restriction site
* count the number of matches in the gene
* make decisions, e.g. deciding whether the matches are in the correct place for cloning
* work through lists, e.g. go through a database of restriction enzymes to test them all
* store information in a structured way

## Using this notebook
This notebook is made up of **cells** that can be edited and run. This cell contains [Markdown](https://daringfireball.net/projects/markdown/), a text formatting language. To edit this cell double click on it and change what you see. Then press **CTRL + ENTER** to make the changes.

Other cell types are code cells. In this notebook they run a language called [Python](http://www.python.org) which is the same language used for many scientific packages. You can see the code cells because they have a [ ] next to them and have a grey background. After you have run the cell (with **CTRL + ENTER** or by using the run button above) a number appears in that box and any output appears below the cell. Try this on the cell below. Click on it so it becomes active then press *CTRL* and *ENTER* together.

In [1]:
sequence = 'ACTGATCTCGGATCTCGAGGATCGATTCGATGCTAGTCCGGACGATTCGATCGCGATTAGGAGCTTGATTAGCTCTCTAGGATCTCTAGGATTCTAGTAGGCTG'

Not much has happened. We know the cell has run because the number on the left has changed. What we have done is to create a _variable_, a named value. The value of the variable is the sequence text. We know this is text because it is surrounded by quotes ' '. If we just type the name of the variable, Python will print out the value stored in that name.

In [2]:
sequence

'ACTGATCTCGGATCTCGAGGATCGATTCGATGCTAGTCCGGACGATTCGATCGCGATTAGGAGCTTGATTAGCTCTCTAGGATCTCTAGGATTCTAGTAGGCTG'

If we type a name that we haven't yet defined, Python will give an error.

In [3]:
sequnce # note the misspelling

NameError: name 'sequnce' is not defined

The `#` indicates a comment. Python ignores all the text after the `#` which is a comment for us, the human reader.

Python has many useful functions. For example we can find the length of the text we have stored in `sequence`.

In [4]:
len(sequence)

104

A string is a piece of text, and with a string we can run all sorts of methods to find another piece of text within it, count the number of times it appears and so on. Suppose we want to look for the sequence motif `AGGA` in our sequence.

In [5]:
sequence.find('AGGA')
# finds the first occurrence of 'AGGA' in the text in sequence

17

The number that `.find(text)` returns is the position of the first letter of the text we are looking for (`AGGA`) in `sequence`, counting from 0. Is there another occurrence of `AGGA`? We can use the `.count(text)` method to see.

In [6]:
sequence.count('AGGA')

4

> #### Thinking point
> What happens if we use lower case instead of upper case?
> Try finding the text `agga` in the sequence

In [7]:
sequence.count('agga')

0
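Searching is case sensitive, so `agga` and `AGGA` are different pieces of text. A minimal sketch of one way round this, assuming we are happy to ignore case altogether: convert both the sequence and the motif to upper case before counting. The names `mixed` and `motif` and the short sequence are made up for illustration.

```python
mixed = 'acTGAggaTTAGGA'   # a made-up sequence with mixed case
motif = 'agga'

# .upper() returns a new upper-case copy of the text; the original is unchanged
hits = mixed.upper().count(motif.upper())
print(hits)
```

The same idea works with `.lower()`; what matters is that both pieces of text use the same case.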

### Challenge: 
In the cell below create a new variable `enzyme` that contains the cut site for a restriction enzyme `TCCGGA`. Then run the following cell. This will read a much longer sequence into the variable `longsequence` from the file [longsequence.txt](longsequence.txt) and check it for the restriction site.

In [8]:
enzyme='TCCGGA'

In [9]:
longsequence = open('longsequence.txt').read()
exec(open('./bs22003.py').read())
test1()

variable enzyme created correctly


What about the enzyme EcoRI (cut site `GAATTC`)? Fix and run the cell below.

In [10]:
ecori = 'GAATTC'  # fill this in properly
longsequence.count(ecori)

2

If we try to find the EcoRI sites in our long sequence we only find the first one if we call `longsequence.find(ecori)`. Change the cell below to store the position of the first cut site. You should use the variable name `cut1` and set it to the value we get from running the command in the cell.

In [11]:
cut1=longsequence.find(ecori)

We can then print this out using the variable name.
> #### Variable names
> Which variable names can we have? Python has some simple rules.
> 1. They cannot contain spaces
> 2. They can only start with letters or the underscore
> 3. They cannot be a Python keyword

In [12]:
cut1

840
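The variable name rules in the box above can be checked by Python itself. This is only a sketch: it uses the string method `isidentifier()` and the standard `keyword` module, and the candidate names in the list are made up for illustration.

```python
import keyword

candidates = ['cut1', '_temp', '5prime', 'my site', 'for']
results = {}
for name in candidates:
    # isidentifier() checks the spelling rules; iskeyword() checks for reserved words
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, results[name])
```

Only `cut1` and `_temp` pass: `5prime` starts with a digit, `my site` contains a space, and `for` is a Python keyword.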

With `.find(text)` we can add a second value to tell Python where to start looking in our sequence. We have to add 1 to the position we found in `cut1`, otherwise we just find the same position again. Change the cell below to save the position of the second restriction site to the variable `cut2`.

In [13]:
cut2=longsequence.find(ecori,cut1+1)

In [14]:
#Check you have this correct by running this cell
test1a()

cut1 value set correctly
Correct answers. cut1 and cut2 set correctly
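To see how the start position changes the search, here is a small sketch on a made-up sequence (not `longsequence`) that contains exactly two `GAATTC` sites. It also shows what `.find()` gives when there is nothing left to find: the value -1.

```python
seq = 'TTGAATTCAAGAATTCGG'   # made-up sequence with two GAATTC sites
site = 'GAATTC'

first = seq.find(site)               # search from the start
second = seq.find(site, first + 1)   # search again, starting just after the first hit
third = seq.find(site, second + 1)   # no more sites, so find() returns -1

print(first, second, third)
```

Because -1 is never a real position, it is worth checking for it before using the result in any arithmetic.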


### Challenge
You can use simple arithmetic with raw values, such as *1*, variables such as *a* or *b*, or the value returned by functions such as *len(text)*. For example, `a - b` subtracts the value in *b* from the value in *a*, `a/2` divides *a* by the literal value 2, and `b + len(text)` adds the length of *text* to the value in *b*.  
The diagram below gives a guide to calculating the fragment lengths for the EcoRI cuts of our sequence.  
![fragments](fragments.png)

Save the lengths of the fragments of `longsequence` produced by the enzyme EcoRI in the variables `frag1-frag3` in the cell below. Then run the next cell to check your answer.

In [15]:
frag1 = cut1
frag2 = cut2-cut1
frag3 = len(longsequence)-cut2

In [16]:
#Check your code by running this cell
test2()

Fragment 1 correct
Fragment 2 correct
Fragment 3 correct
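A quick way to check this kind of arithmetic is to confirm that the fragment lengths add up to the length of the whole sequence. The sketch below does this on a short made-up sequence, using the same simplification as above (each fragment is measured from the start of one matched pattern to the start of the next).

```python
seq = 'TTGAATTCAAGAATTCGG'   # made-up sequence with two GAATTC sites
site = 'GAATTC'

cut1 = seq.find(site)
cut2 = seq.find(site, cut1 + 1)

frag1 = cut1                 # from the start of the sequence to the first site
frag2 = cut2 - cut1          # between the two sites
frag3 = len(seq) - cut2      # from the second site to the end

print(frag1, frag2, frag3)
print(frag1 + frag2 + frag3 == len(seq))   # every base is counted exactly once
```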


### Getting bits of a text string 

So far we have been looking at positions in a piece of text. We have found the location where our enzyme cut site is, but it would be good to get the complete fragment.

We can access parts of a piece of text using code like `text[start:end]` where _text_ is the variable we want to get part of, _start_ is the first position we want to get, and _end_ is the position **_after_** the last letter we want. This is very much like R but Python counts differently - **the first position is 0, not 1.**
As we know the position of our first cut site, and the length of our enzyme cut site, we should be able to retrieve it.

In [17]:
cutsite=longsequence[cut1: cut1+len(ecori)]
print(cutsite)


GAATTC


We can easily get the sequence that is created by the cleavage. Let's also convert it to lower case. A little Python trick: text behaves like a list of letters (characters), and if the part we want begins at the very start we can leave out the start number, while if it runs to the very end we can leave out the end number.

In [18]:
print(longsequence[:cut1].lower()) 
# longsequence[:cut1] gives a piece of text, so we can then run 
# .lower() on it which gives another piece of text, which is then printed.

ctcgcactccctctggccggcccagggcgccttcagcccaacctccccagccccacgggcgccacggaacccgctcgatctcgccgccaactggtagacatggagacccctgcctggccccgggtcccgcgccccgagaccgccgtcgctcggacgctcctgctcggctgggtcttcgcccaggtggccggcgcttcaggcactacaaatactgtggcagcatataatttaacttggaaatcaactaatttcaagacaattttggagtgggaacccaaacccgtcaatcaagtctacactgttcaaataagcactaagtcaggagattggaaaagcaaatgcttttacacaacagacacagagtgtgacctcaccgacgagattgtgaaggatgtgaagcagacgtacttggcacgggtcttctcctacccggcagggaatgtggagagcaccggttctgctggggagcctctgtatgagaactccccagagttcacaccttacctggagacaaacctcggacagccaacaattcagagttttgaacaggtgggaacaaaagtgaatgtgaccgtagaagatgaacggactttagtcagaaggaacaacactttcctaagcctccgggatgtttttggcaaggacttaatttatacactttattattggaaatcttcaagttcaggaaagaaaacagccaaaacaaacactaatgagtttttgattgatgtggataaaggagaaaactactgtttcagtgttcaagcagtgattccctcccgaacagttaaccggaagagtacagacagcccggtagagtgtatgggccaggagaaaggg
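Slicing is easier to follow on a short piece of text. The sketch below uses a made-up eight-letter sequence so that each result can be checked by eye.

```python
text = 'ACGTTGCA'    # positions 0 to 7

first = text[0]      # the first letter is at position 0
middle = text[2:5]   # positions 2, 3 and 4 (position 5 is not included)
start = text[:3]     # from the start up to, but not including, position 3
end = text[5:]       # from position 5 to the end

print(first, middle, start, end)
print(len(middle))   # the length of a slice is end - start
```

Note that `text[:3] + text[3:]` always gives back the whole text, which is why the end position is the one *after* the last letter we want.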


What happens if we add bits of text together? If we use a `+` then Python will join two bits of text together. So we can splice out the middle fragment with:

In [19]:
print(longsequence[:cut1]+longsequence[cut2:].lower())
# I used lower so you can see where the join is between the two sequences.


CTCGCACTCCCTCTGGCCGGCCCAGGGCGCCTTCAGCCCAACCTCCCCAGCCCCACGGGCGCCACGGAACCCGCTCGATCTCGCCGCCAACTGGTAGACATGGAGACCCCTGCCTGGCCCCGGGTCCCGCGCCCCGAGACCGCCGTCGCTCGGACGCTCCTGCTCGGCTGGGTCTTCGCCCAGGTGGCCGGCGCTTCAGGCACTACAAATACTGTGGCAGCATATAATTTAACTTGGAAATCAACTAATTTCAAGACAATTTTGGAGTGGGAACCCAAACCCGTCAATCAAGTCTACACTGTTCAAATAAGCACTAAGTCAGGAGATTGGAAAAGCAAATGCTTTTACACAACAGACACAGAGTGTGACCTCACCGACGAGATTGTGAAGGATGTGAAGCAGACGTACTTGGCACGGGTCTTCTCCTACCCGGCAGGGAATGTGGAGAGCACCGGTTCTGCTGGGGAGCCTCTGTATGAGAACTCCCCAGAGTTCACACCTTACCTGGAGACAAACCTCGGACAGCCAACAATTCAGAGTTTTGAACAGGTGGGAACAAAAGTGAATGTGACCGTAGAAGATGAACGGACTTTAGTCAGAAGGAACAACACTTTCCTAAGCCTCCGGGATGTTTTTGGCAAGGACTTAATTTATACACTTTATTATTGGAAATCTTCAAGTTCAGGAAAGAAAACAGCCAAAACAAACACTAATGAGTTTTTGATTGATGTGGATAAAGGAGAAAACTACTGTTTCAGTGTTCAAGCAGTGATTCCCTCCCGAACAGTTAACCGGAAGAGTACAGACAGCCCGGTAGAGTGTATGGGCCAGGAGAAAGGGgaattcctgacctcagttgatccacccaccttggcctcccaaagtgctagtattatgggcgtgaaccaccatgcccagccgaaaagcttttgaggggctgacttcaatccatgtaggaaagtaaaatggaaggaaattgggtgcatttctaggacttttc

> ### Challenge
> Print out the full sequence, but in lower case with the restriction sites in upper case.
> **Hint:** add each fragment together with the enzyme 'ecori' in between. Don't include the enzyme site when you add 
> the fragment. Do this by first assigning the formatted sequence to the variable `prettyseq`.

In [21]:
prettyseq =longsequence[:cut1].lower()+ecori+longsequence[cut1+6:cut2].lower()+ecori+longsequence[cut2+6:].lower()
print(prettyseq)

ctcgcactccctctggccggcccagggcgccttcagcccaacctccccagccccacgggcgccacggaacccgctcgatctcgccgccaactggtagacatggagacccctgcctggccccgggtcccgcgccccgagaccgccgtcgctcggacgctcctgctcggctgggtcttcgcccaggtggccggcgcttcaggcactacaaatactgtggcagcatataatttaacttggaaatcaactaatttcaagacaattttggagtgggaacccaaacccgtcaatcaagtctacactgttcaaataagcactaagtcaggagattggaaaagcaaatgcttttacacaacagacacagagtgtgacctcaccgacgagattgtgaaggatgtgaagcagacgtacttggcacgggtcttctcctacccggcagggaatgtggagagcaccggttctgctggggagcctctgtatgagaactccccagagttcacaccttacctggagacaaacctcggacagccaacaattcagagttttgaacaggtgggaacaaaagtgaatgtgaccgtagaagatgaacggactttagtcagaaggaacaacactttcctaagcctccgggatgtttttggcaaggacttaatttatacactttattattggaaatcttcaagttcaggaaagaaaacagccaaaacaaacactaatgagtttttgattgatgtggataaaggagaaaactactgtttcagtgttcaagcagtgattccctcccgaacagttaaccggaagagtacagacagcccggtagagtgtatgggccaggagaaagggGAATTCagagaaatattctacatcattggagctgtggtatttgtggtcatcatccttgtcatcatcctggctatatctctacacaagtgtagaaaggcaggagtggggcagagctggaaggagaactccccactgaatgtttcataaaggaagcactgtt

In [22]:
test4()

congratulations. Your sequence is suitably pretty
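Adding the fragments together by hand works for two sites but gets long for more. As an alternative sketch (not the method the challenge asks for), the string method `.replace(old, new)` can put every site back into upper case in one step. The short sequence is made up for illustration.

```python
seq = 'TTGAATTCAAGAATTCGG'   # made-up sequence with two GAATTC sites
site = 'GAATTC'

# lower-case everything, then swap each lower-case site for the upper-case version
pretty = seq.lower().replace(site.lower(), site)
print(pretty)
```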


### Trying several restriction sites

Try the following restriction sites. 

| Enzyme | Recognition site | Forward cut | Reverse cut |
|---|---|---|---|
| BsaI | GGTCTC | 7 | 11 |
| HindIII | AAGCTT | 1 | 5 |
| MvnI | CGCG | 2 | 2 |
| SmaI | CCCGGG | 3 | 3 |


* Which pattern isn't found at all?
* Which produces the largest fragment? 
* Which produces the smallest fragment?

To answer these questions, create a variable to hold each text pattern. You can use `longsequence.count(enzyme)` for each enzyme to find the one that is not found, and `longsequence.find(enzyme, start)` to find the cut sites. Use the cell below for your working, then fill in the values you calculate in the variables in the following cell and run it. When you can run it without error, try the test routine.

In [25]:
bsai = 'GGTCTC' # I've made a start.. keep the variable name lower case and the value upper case.
hindiii='AAGCTT'
mvni='CGCG'
smai='CCCGGG'

print(longsequence.count(bsai)) # one piece - not cut
print(longsequence.count(hindiii)) # four pieces
print(longsequence.count(mvni)) # two pieces
print(longsequence.count(smai)) # two pieces



0
3
1
1


In [27]:
noncutter=bsai # eg. noncutter=ecori
largest_frag = 2009 # in base pairs
largest_cutter = 'SmaI'# name of the enzyme giving the largest fragment, e.g. 'EcoRI'
smallest_frag = 118 # in base pairs
smallest_cutter = 'SmaI'# name of the enzyme giving the smallest fragment

In [28]:
test5()

Correct. BsaI does not cut.
Correct largest fragment


### Splitting into fragments
We can now find the number of times a pattern appears in a sequence with `sequence.count(pattern)` where _sequence_ is the text we are searching and _pattern_ is the text we are searching for. We can also split a piece of text on a particular pattern using the `text.split(pattern)` method. You can find a lot more commands to do things to text [here](https://docs.python.org/3/library/string.html).

In [30]:
frags=longsequence.split('GCGC')
#this creates a list of fragments, each of which is a bit of text.
len(frags)
#gives us the number of fragments in the list.

6

`frags` is a list variable. It contains a list of values. These values can be anything - numbers, text, or even other lists or objects. Just as with text earlier we can get to any item in the list with `list[itemnumber]` or a sub-list (a *slice* ) with `list[startwith:endbefore]` where _startwith_ is the position of the first item in the sub-list and _endbefore_ is the position of the item immediately after the last one we want.

In [31]:
print(len(frags[0]))
print(len(frags[1]))
print(len(frags[2]))


26
28
66
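One thing to keep in mind is that `.split(pattern)` throws the pattern itself away, so the pieces add up to less than the original sequence. A small sketch on a made-up sequence shows this, along with the list positions and slices described above.

```python
seq = 'AAGCGCTTTGCGCA'       # made-up sequence containing GCGC twice
pieces = seq.split('GCGC')

print(pieces)        # a list of three pieces of text
print(pieces[0])     # the first item
print(pieces[1:3])   # a sub-list holding items 1 and 2
print(pieces[-1])    # negative positions count back from the end

joined = ''.join(pieces)          # stick the pieces back together with nothing in between
print(len(seq) - len(joined))     # the two missing GCGC patterns: 2 x 4 letters
```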


It would be quite tedious to write out 
```
print(len(frags[0]))
print(len(frags[1]))
...
```
to get the length of every fragment. Instead we can go through the list one item at a time with a **_loop_**. This looks like:
```
for item in list:
    do something with item
    and something more
    and something more
    
```
Python takes the list and sets the variable _item_ (it can be any valid name we want) to the first item in the list. It then runs all the code that is indented (that is how Python knows that the code is part of the loop). Python repeats this for every item in the list. Let's try that for our list of fragments.

In [32]:
for f in frags:
    print(len(f))
    

26
28
66
58
1301
628
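If we also want to know *which* fragment each length belongs to, the built-in `enumerate()` hands us the position and the item together on each pass through the loop. This is a sketch on a made-up list; `pieces` and `lengths` are illustrative names.

```python
pieces = ['AA', 'TTT', 'A']   # a made-up list of fragments
lengths = []

for number, piece in enumerate(pieces):
    # number counts 0, 1, 2 ... while piece is the matching item
    print('fragment', number, 'is', len(piece), 'bp long')
    lengths.append(len(piece))   # keep the lengths for later use

print(lengths)
```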


It doesn't matter how long the list is, Python will work through it one item at a time. We can put loops inside loops as long as we indent them properly.
```
for thingy in biglist:
    commands for doing things to thingy
    for bit in thingy:
        commands for doing things to each bit of thingy
    more commands for doing things to thingy
    
```
Note how the commands are indented so that Python knows which ones relate to which loop.
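Here is the same nested pattern as runnable code, a minimal sketch on a made-up list of two fragments. The outer loop visits each fragment and the inner loop visits each letter of that fragment.

```python
pieces = ['AAG', 'TTGA']   # a made-up list of fragments
total_bases = 0

for piece in pieces:                  # outer loop: one fragment at a time
    print('fragment', piece)
    for base in piece:                # inner loop: one letter at a time
        print('    base', base)
        total_bases = total_bases + 1
    print('  finished', piece)        # back in the outer loop

print(total_bases)
```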

#### Challenge: Calculate the variance of the fragment lengths for MvnI

You will need two loops, one to calculate the total length of the fragments (which gives us the average length), and then another to calculate the sum of squares used in the variance. The first part is done for you.

In [34]:
#This uses a variable mvni which contains the pattern for MvnI
frags=longsequence.split(mvni)
totallength=0
for f in frags:
    totallength = totallength+len(f)
averagelength = totallength/len(frags)

sumofsquares=0
for f in frags:
    # remember the sum of squares is the sum of the squares of (fragment length - average fragment length)
    # a quick way to raise something to a power is to do variable ** 2 to get variable squared
    #FILL IN THE RIGHT COMMAND HERE
    sumofsquares+=(averagelength-len(f))**2
    
variance = sumofsquares/len(frags)
print(variance)

873290


In [35]:
test6()

Correct. The variance has been calculated.
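A hand-written calculation like this can be cross-checked against Python's standard `statistics` module, whose `pvariance()` function computes the same population variance. The sketch below assumes Python 3, where `/` always gives a decimal result, and uses a made-up list of lengths.

```python
import statistics

lengths = [2, 4, 6]   # made-up fragment lengths

mean = sum(lengths) / len(lengths)
sumofsquares = 0
for n in lengths:
    sumofsquares += (n - mean) ** 2
variance = sumofsquares / len(lengths)

print(variance)
print(statistics.pvariance(lengths))   # should agree with our own value
```

If we wanted the sample variance instead (dividing by one less than the number of items), the same module provides `statistics.variance()`.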


### Making decisions
Sometimes we want to be able to do different things depending on the situation. For example, if we want to find enzymes that cut only once, or cut between certain positions.
We can do this using an `if: ... else:` command

```
if something is true:
    do this
else:
    do that
```
As an example, let's find the shortest fragment in our list.

In [36]:
shortest=len(frags[0])
#the shortest fragment will either be frags[0] or shorter than frags[0] so we can set that value first.
for f in frags:
    if len(f) < shortest:
        shortest=len(f)
print(shortest)



127


How about finding the longest?
Note that just like the `for .. in ..:`, the `if .. :` and `else:` also end with colons and the commands to carry out are indented.

In [37]:
longest=0
for f in frags:
    if len(f)> longest:#fill in the rest.
        longest=len(f)
print(longest)

1996


In [38]:
test7()

Correct. The longest fragment has been successfully found
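Writing the loop ourselves is good practice, but Python also has built-in `min()` and `max()` functions that do the same job. The sketch below uses the fragment lengths printed earlier for the `GCGC` split, plus a made-up list of text fragments to show the optional `key` argument.

```python
lengths = [26, 28, 66, 58, 1301, 628]   # the lengths from the GCGC split above

shortest = min(lengths)
longest = max(lengths)
print(shortest, longest)

pieces = ['AA', 'TTT', 'A']             # a made-up list of fragments
longest_piece = max(pieces, key=len)    # compare the items by their length
print(longest_piece)
```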


### Reading information from file
So far we have written all our enzymes out by hand. The [REBASE](http://rebase.com) database contains many thousands of restriction enzymes. The file `enzymes.txt` contains a selection of several hundred of these. We can read them in from a file and process our sequence. 

To open a file we create a _filehandle_, a bookmark that lets us get to any point in the file. We can read from the file one line at a time or read all the lines at once as a list.


In [39]:
fh = open('enzymes.txt')

`fh` is our bookmark. Note that we put _enzymes.txt_ between quotes as it is the text name of the file. 

Run the next cell several times (with CTRL-ENTER) and see what it gives us.

In [51]:
fh.readline()

'AciI\tCCGC\t1\t3\n'

The first time we run the cell after opening the file, the line returned is
```
'AanI\tTTATAA\t3\t3\n'
```
This is the enzyme name, a TAB character (represented by `\t`), the pattern at which the enzyme cuts, another TAB, the cut position (number of bases from the start) of the pattern on the forward strand and the cut position on the reverse strand. The text finishes off with a NEWLINE character `\n`. 

> ##### Special characters
> The _special characters_ are ones that we can't easily type at the keyboard. These are relics of the old days of computing. 
> They are represented in text by a backslash `\` followed by a letter or other symbol. 
> These are known as _escape codes_ as the normal meaning of the letter is 'escaped' by the `\`. 
>
> | code | character |
> | --- | --- |
> | \t | Tab |
> | \n | Newline|
> | \r | carriage return |
> | \b | backspace |
> | \\\\  | \ itself|

We can see the effect of these special characters if we print the value instead.


In [52]:
print(fh.readline())

AclI	AACGTT	2	4
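The difference between the two views of the same text can be seen in this short sketch. It uses the first line of the enzyme file quoted above; `repr()` gives the form with escape codes that the notebook shows for a bare value.

```python
line = 'AanI\tTTATAA\t3\t3\n'

tab_length = len('\t')      # \ and t are typed as two symbols but stored as ONE character
print(tab_length)

print(repr(line))           # shows the escape codes
print(line)                 # shows real tabs and the newline

cleaned = line.strip()      # strip() removes whitespace, including \n, from both ends
print(repr(cleaned))
```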



Note that the filehandle `fh` moves on to the next line each time we call `fh.readline()`. To get to the start of the file again we need to run the open command again or run `fh.seek(0)` which will take us back to the top of the file.

We can read the line from `fh.readline()` into a variable so that we can process it further. If we then split it to a list (splitting on the TAB character that separates each part of the line) we can get each of the parts we want from each entry.

In [53]:
fh.seek(0)


In [71]:
line =  fh.readline() # read in the next line from the file
parts= line.split('\t') # create a list of parts by splitting it on the tab character
name=parts[0]    # create a variable 'name'  with the value of the first part. remember that we start counting at 0 in Python, not 1 
pattern=parts[1] # create a variable with the value of the second part
forwardcut=parts[2]
reversecut=parts[3]
cutsites = longsequence.count(pattern) # remember what this does from before? We use the variable pattern as the text to find 
print('Enzyme '+name+' cuts at '+pattern+' which occurs '+str(cutsites)+' times in the gene sequence')

Enzyme AfeI cuts at AGCGCT which occurs 0 times in the gene sequence


Run the cell above repeatedly to see each line in turn.
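Note that the last part of each line still carries the newline character, and that the cut positions are text rather than numbers. A small sketch of one way to tidy this up, using a line quoted earlier from the file: strip the line before splitting it and convert the cut positions with `int()`.

```python
line = 'AciI\tCCGC\t1\t3\n'          # one line, as returned by fh.readline()

parts = line.strip().split('\t')     # remove the trailing newline, then split on tabs
name = parts[0]
pattern = parts[1]
forwardcut = int(parts[2])           # convert the text '1' to the number 1
reversecut = int(parts[3])

print(name, pattern, reversecut - forwardcut)   # now we can do arithmetic with the cuts
```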

We have so far read each line one at a time. We can get a list of all the remaining lines with `fh.readlines()` (notice it is plural). This can be used as a list to go through every line in a file.

```
for line in fh.readlines():
    do stuff to the line
    do more stuff to the line
```
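A common alternative, shown here only as a sketch, is to open the file in a `with` block and loop over the filehandle directly; Python then closes the file for us when the block ends. So that the example runs on its own it first writes a tiny stand-in file (the name `toy_enzymes.txt` is made up); in the notebook you would open `enzymes.txt` instead.

```python
import os
import tempfile

# write a two-line stand-in file so that the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'toy_enzymes.txt')
with open(path, 'w') as out:
    out.write('AanI\tTTATAA\t3\t3\n')
    out.write('AciI\tCCGC\t1\t3\n')

names = []
with open(path) as toy_fh:       # the file is closed automatically after this block
    for line in toy_fh:          # a filehandle can be looped over line by line
        parts = line.strip().split('\t')
        names.append(parts[0])   # append() adds an item to the end of a list

print(names)
```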

So let's go through each line in the file and find all the enzymes that cut our gene sequence only once.

In [72]:
fh.seek(0) # go back to the start of the file
for line in fh.readlines(): # spot the difference between fh.readline() and fh.readlines()
    parts=line.split('\t')
    name=parts[0]
    pattern=parts[1]
    if longsequence.count(pattern) == 1: 
        print(name+' cuts once')


AanI cuts once
AccII cuts once
Acc16I cuts once
AccBSI cuts once
AflII cuts once
AgeI cuts once
Alw26I cuts once
AseI cuts once
AsiGI cuts once
AsuHPI cuts once
BalI cuts once
Bbr7I cuts once
BbsI cuts once
BbvI cuts once
BbvII cuts once
BbvCI cuts once
BccI cuts once
BcoDI cuts once
BfiI cuts once
BfrI cuts once
BmrI cuts once
BmuI cuts once
BpiI cuts once
BsbI cuts once
BseXI cuts once
Bsh1236I cuts once
BshTI cuts once
BsmAI cuts once
BsmBI cuts once
Bsp19I cuts once
BspFNI cuts once
BspTI cuts once
BsrBI cuts once
BstAFI cuts once
BstFNI cuts once
BstMAI cuts once
BstUI cuts once
BstV1I cuts once
BstV2I cuts once
BtsI cuts once
Cfr9I cuts once
CseI cuts once
CspAI cuts once
EsaBC3I cuts once
Esp3I cuts once
FauNDI cuts once
FnuDII cuts once
FseI cuts once
FspI cuts once
HauII cuts once
HgaI cuts once
HpaI cuts once
HphI cuts once
HpyCH4IV cuts once
HpySE526I cuts once
KspAI cuts once
LmnI cuts once
Lsp1109I cuts once
MaeII cuts once
MbiI cuts once
MlsI cuts once
MluNI cuts once
Mly

## Challenge 1
How many enzymes cut only once? How many do not cut at all?
**Hint:** create an empty list, then add the enzymes that cut only once, using `list.append(item)` to add _item_ to _list_.

In [74]:
fh.seek(0) # to go back to the start of the file

singlecutters=[] # this is our empty list.

enzcount=0 # here is a counter to count the total number of enzymes we read in.

for line in fh.readlines():
    enzcount += 1 # add one to enzcount
    parts=line.split('\t')
    name=parts[0]
    pattern=parts[1]
    if longsequence.count(pattern) == 1: 
        #put your code here
        singlecutters.append(pattern)

print('The number of single cutting enzymes is '+str(len(singlecutters))+' from '+str(enzcount))


The number of single cutting enzymes is 92 from 430


In [75]:
test8()

Correct. There are 92 enzymes in the list that cut exactly once


### Types of data
So far we have seen several sorts of data
* text - a list of characters
* numbers - either whole numbers (integers) or floating point (decimals)
* lists - an ordered collection of data items

We can convert one type to another with simple commands. `'101'` is the letters '1', '0', and '1', but we can turn it into the number `101` with the command
```
int('101')
```
This is very useful when reading values from text files like our enzyme file. Use `float(text)` to make a decimal number from its text representation.  
We can turn numbers into text for printing or writing to file using the command `str(number)`, so that

```
str(101)
```
will give us the text `'101'`.
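The type of a value changes what an operator such as `+` does with it. This short sketch shows the same characters treated first as text and then as a number.

```python
text = '101'

joined = text + text              # + joins two pieces of text
added = int(text) + int(text)     # + adds two numbers
doubled = float('2.5') * 2        # float() handles decimal numbers
label = 'Length: ' + str(104)     # a number must become text before joining it to text

print(joined, added, doubled, label)
```

Trying to convert text that is not a number, for example `int('ACGT')`, gives a `ValueError`.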


What if we want to keep a collection of data items together that don't have any natural order but would be best referred to by name? We can do that with a _dictionary_ which is a set where each data item is referred to by a unique tag or _key_. That _key_ and its associated _value_ make a _key:value pair_. A dictionary is a set of key-value pairs that have no explicit order so instead of referring to each item by its position (as in a list) we refer to the value using the key.

In [77]:
#create a dictionary
mydict={'key':'value', 'anotherkey':'anothervalue', 'key3':3}
mydict

{'anotherkey': 'anothervalue', 'key': 'value', 'key3': 3}

In [78]:
#Note that the key:value pairs can come in any order
mydict['anotherkey'] #retrieve the value stored with the key ['anotherkey']


'anothervalue'

In [79]:
# change the value stored in ['anotherkey'] and add a new key ['more keys']
mydict['anotherkey']='a different value'
mydict['more keys']='more values'
mydict

{'anotherkey': 'a different value',
 'key': 'value',
 'key3': 3,
 'more keys': 'more values'}

In [80]:
# remove a key from a dictionary
del(mydict['key'])
mydict

{'anotherkey': 'a different value', 'key3': 3, 'more keys': 'more values'}
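Three more dictionary tools are worth knowing: testing whether a key exists with `in`, reading a value safely with `.get()`, and looping over the key:value pairs with `.items()`. The sketch below uses a small made-up dictionary holding two of the recognition sites we met earlier.

```python
sites = {'EcoRI': 'GAATTC', 'HindIII': 'AAGCTT'}

has_ecori = 'EcoRI' in sites              # True: the key exists
has_bamhi = 'BamHI' in sites              # False: no such key
fallback = sites.get('BamHI', 'unknown')  # get() returns a default instead of an error

print(has_ecori, has_bamhi, fallback)

for name, site in sites.items():          # each pass gives one key and its value
    print(name, 'recognises', site)
```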

We can use a dictionary of dictionaries to store our enzyme data. We can read each line into a dictionary, labelling each part. Then we can add the enzyme information as a value to a master enzymes dictionary, using the enzyme name as the key.

In [81]:
#go back to the top of our enzyme file
fh=open('enzymes.txt')
enzymes={} # make an empty dictionary to which we can append our data.
for line in fh.readlines():
    parts=line.split('\t') # make our line into a list of values
    #now create a dictionary to put our values with their names
    enz={'name':parts[0], 'pattern':parts[1], 'forcut':int(parts[2]), 'revcut':int(parts[3])}
    enzymes[parts[0]]=enz #add our dictionary to the master dictionary

print('read '+str(len(enzymes))+' enzymes') 
# note that we use str() to convert the value len() gives us (an integer) to text so that it can be printed

read 430 enzymes


In [82]:
#Try to retrieve an enzyme by name to see its details, e.g. HindIII or EcoRI
enzymes['BamHI']

{'forcut': 1, 'name': 'BamHI', 'pattern': 'GGATCC', 'revcut': 5}

## Getting some numbers on the enzymes
Now we have a dictionary of enzymes, we can do some counting and analysis. Follow the examples below and try the challenges:

### How many have a 5' overlap?
If the forward cut is before the reverse cut then the enzyme will have a 5' overlap (and vice versa if it is a 3' overlap). We can loop through the _values_ in our dictionary (the _keys_ are the enzyme name, each value is another dictionary containing the information for that enzyme) and check to see if they have 5' overhangs with an `if` statement.


In [84]:
fiveprime=0 # why can't we use the name 5prime?

for enz in enzymes.values():
    # enz is now a dictionary with our enzyme information
    if enz['forcut'] < enz['revcut']:
        fiveprime = fiveprime + 1
print(str(fiveprime)+" enzymes give a 5' overhang")

227 enzymes give a 5' overhang


### How many recognition sites start with a G? 

In [85]:
count=0 # a counter for the patterns that start with G

for enz in enzymes.values():
    # enz is now a dictionary with our enzyme information
    if enz['pattern'].startswith('G'):
        count = count + 1
print(str(count)+" enzymes have a pattern starting with G")

169 enzymes have a pattern starting with G


### Challenge: How many enzymes are blunt cutters?
### Challenge: Do more enzymes with patterns 6 bp or more have a pattern starting with 'C'?
**Hint: use an `and` to join two tests together in an `if` statement**
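As a sketch of how `and` works (using a different question from the challenges, and a made-up three-enzyme list rather than the full `enzymes` dictionary), here we count the patterns that are exactly 4 bp long *and* end with a G. Both tests must be true for the indented code to run.

```python
toy_enzymes = [
    {'name': 'EcoRI', 'pattern': 'GAATTC'},
    {'name': 'MvnI', 'pattern': 'CGCG'},
    {'name': 'SmaI', 'pattern': 'CCCGGG'},
]

toy_count = 0
for enz in toy_enzymes:
    # both conditions must hold for this enzyme to be counted
    if len(enz['pattern']) == 4 and enz['pattern'].endswith('G'):
        toy_count = toy_count + 1

print(toy_count)
```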

In [None]:
#do your working here

## Making it stick

So far we have presumed that the position where the pattern of our restriction enzyme matched was the cut site. As we know, 
restriction enzymes cut at different positions relative to their patterns, and they can give compatible overlaps even though the patterns differ. 
This also means that most of our fragment length calculations are slightly wrong 
(all the end fragments are wrong; the internal fragments are mostly right). 

Check the diagram below which shows the three main classes of restriction enzyme cut site. 
A cut can form a 5' overlap where the 5' end of the fragment (the reverse strand) is not paired with the complementary strand, 
a 3' overlap where the 3' end of the fragment (the forward strand) is not paired, or a blunt cut where there is no 'sticky end'.

![restrictioncuts.png](restrictioncuts.png)

If the pattern is NNNNNN, then the forward cut value is the number of bases before the cut on the forward strand, and the reverse cut is the number of bases before the cut on the reverse strand (or after, if you look at the reverse strand from 5' to 3'). If we look at our enzymes we can divide them into four classes:

| class | criteria | cut location |
|---|---|---|
|Blunt cutter| Forward cut == Reverse Cut | Inside pattern |
|5' overhang | Forward cut < Reverse cut | Inside Pattern | 
|3' overhang | Forward cut > Reverse cut | Inside Pattern |
|External cutter | Either | Cuts outside the pattern |

If we want to be able to find compatible sticky ends then we need to classify our enzymes by their sticky ends. What we will do is to create a key 'overlap' for each enzyme. It will contain the text 'Blunt' for blunt cutters, 'External' for those that cut outside the pattern, and the overlap for the enzymes that create sticky ends inside their recognition site. To distinguish between the 3' overlap and 5' overlap we will make the 3' overlap upper case and the 5' overlap lower case.
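For an enzyme that cuts inside its pattern, the overlap is simply the slice of the pattern between the two cut positions. Here is that idea worked through as a sketch for BamHI, using the values stored for it in our enzyme dictionary, and for the blunt cutter SmaI from the table earlier.

```python
pattern = 'GGATCC'   # BamHI
forcut = 1
revcut = 5

overlap = pattern[forcut:revcut].lower()   # 5' overhang, so we store it in lower case
print(overlap)

# for a blunt cutter the two positions are equal, so the slice is empty
blunt_slice = 'CCCGGG'[3:3]                # SmaI: forward cut 3, reverse cut 3
print(blunt_slice == '')
```

An empty piece of text would not make a very helpful label, which is why blunt cutters are given the text 'Blunt' instead.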


In [86]:
overlaps={} # create a dictionary to hold the overlaps
for enz in enzymes.values(): # go through our list of enzymes one at a time. Each time the dictionary is called enz
    overlap='' # a variable to hold our overlap value
    if enz['forcut'] > len(enz['pattern']) or enz['revcut'] > len(enz['pattern']):
        # enzyme cuts outwith the pattern
        overlap='External'
    elif enz['forcut']==enz['revcut']:
        #fill this in 
        overlap='Blunt'
    elif enz['forcut'] < enz['revcut']: #fill this in and don't forget the : at the end - this should get the 5' overhangs
        overlap=enz['pattern'][enz['forcut']:enz['revcut']].lower() 
    else:
        # this should be the upper case overlap so set overlap to the overlap in upper case. 
        #remember to put the smaller value first when getting the overlap
        overlap=enz['pattern'][enz['revcut']:enz['forcut']]
    enz['overlap']=overlap
    try:
        overlaps[overlap].append(enz)
    except:
        overlaps[overlap]=[enz]
        
print('We found '+str(len(overlaps.keys()))+' sticky ends')  


We found 42 sticky ends


In [87]:
#run this block when you can run the block above without an error to make sure it all worked properly. 
#If it gives an error, try to fix it and then run this block again.
test9()

congratulations, you have found the overlaps


In [88]:
# Check our enzyme information to see the new data we have calculated
enzymes['BamHI']

{'forcut': 1,
 'name': 'BamHI',
 'overlap': 'gatc',
 'pattern': 'GGATCC',
 'revcut': 5}

In [89]:
# find all the enzymes with the same overlap
overlaps[enzymes['BamHI']['overlap']]

[{'forcut': 1,
  'name': 'BclI',
  'overlap': 'gatc',
  'pattern': 'TGATCA',
  'revcut': 5},
 {'forcut': 1,
  'name': 'Ksp22I',
  'overlap': 'gatc',
  'pattern': 'TGATCA',
  'revcut': 5},
 {'forcut': 1,
  'name': 'BglII',
  'overlap': 'gatc',
  'pattern': 'AGATCT',
  'revcut': 5},
 {'forcut': 1,
  'name': 'FbaI',
  'overlap': 'gatc',
  'pattern': 'TGATCA',
  'revcut': 5},
 {'forcut': 1,
  'name': 'BamHI',
  'overlap': 'gatc',
  'pattern': 'GGATCC',
  'revcut': 5}]

> #### try: ... except:
>
> What is that `try:` and `except:` all about? If we try to read a value from a dictionary using a key that doesn't exist 
> then we will get an error. This will normally cause the program to crash but Python has a way of letting
> us make use of it. If our dodgy code that might crash is in a `try:` block then if (and only if) 
> it gives an error, Python will run the code in the `except:` block. 
> So it works like this:
> ```
> try:
>     add to the list that is at the key overlap in  the dictionary overlaps
>     #If the key doesn't exist, it will give an error and run the except: block
> except:
>     #We get here if an error happened
>     Create a list containing the item and put it in overlaps with the key overlap
> ```
> Pretty clever and means we don't have to write an if statement to check if the list exists.

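A bare `except:` catches *every* kind of error, including ones we did not expect, such as a mistyped variable name. A slightly safer sketch names the error we are expecting, which for a missing dictionary key is `KeyError`. The list of (name, overlap) pairs below is a small illustrative sample, and `groups` and `groups2` are made-up names. The second loop shows the dictionary method `setdefault()`, which does the same job in one line.

```python
pairs = [('BamHI', 'gatc'), ('BglII', 'gatc'), ('EcoRI', 'aatt')]

groups = {}
for name, overlap in pairs:
    try:
        groups[overlap].append(name)   # works if the key already exists
    except KeyError:                   # only a missing key is handled here
        groups[overlap] = [name]

# setdefault() returns the existing list, or stores and returns a new empty one
groups2 = {}
for name, overlap in pairs:
    groups2.setdefault(overlap, []).append(name)

print(groups)
print(groups == groups2)
```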


### Putting it all together
We are now ready to start putting it all together. The challenge, now that we have our enzymes, is to clone a gene. 

We need to:
1. Load a vector (plasmid) sequence (in the file `'plasmid.txt'`) and identify which enzymes cut in a particular location but nowhere else. 
2. Then we will try to find suitable enzymes that will make compatible sticky ends taking as much of our gene sequence as possible. They should only cut once (or could cut twice, once at each end but we'll go for once to start with).
3. Then we will put it all together to get the sequence we expect so we could check it against an experimental DNA sequence.

In [97]:
# Part 1. read in the plasmid sequence
with open('plasmid.txt') as fh:
    plasmid=fh.read() # reads in the entire file, then closes it

#now get a list of enzymes that cut once only and cut between bp 650 and 750
pcutters=[] # a list of enzymes that cut our plasmid in the right place.
for enz in enzymes.values():
    site=plasmid.find(enz['pattern']) # start of the first match, or -1 if there is none
    # check that there is exactly one site and that it lies between 650 and 750
    if plasmid.count(enz['pattern']) == 1 and 650 < site < 750:
        enz['pcut'] = site+enz['forcut'] # record the actual enzyme cut site in the plasmid
        pcutters.append(enz)

# group the plasmid cutters by the sticky end (overlap) they leave
stickies = {}
for p in pcutters:
    try:
        stickies[p['overlap']].append(p)
    except KeyError: # no list for this overlap yet, so start one
        stickies[p['overlap']]=[p]

print(str(len(stickies))+' candidate sticky ends')
print([(x,len(stickies[x])) for x in stickies])

14 candidate sticky ends
[('ctag', 4), ('aatt', 1), ('gtac', 2), ('ccgg', 3), ('tcga', 6), ('cg', 7), ('GTAC', 1), ('GGCC', 1), ('TGCA', 2), ('GC', 5), ('ggcc', 10), ('Blunt', 4), ('agct', 1), ('gatc', 1)]


In [93]:
test10()

An error occurred in your code. 
Please check the previous cells have all been run correctly and in order


So now we have read our plasmid sequence in. We have a list of enzymes and their sticky ends. The next stage is to see if we can find a suitable sticky end in our gene sequence (previously in the variable _longsequence_). We'll read the sequence in again, this time into the variable _gene_, just to be on the safe side, then make a list of all the enzymes that cut once.

In [110]:
#enter your matriculation number to get your sequence into gene

matricnum= 12345  # put your matric number here

gene=getmysequence(matricnum) # get the sequence for this matric number into gene
print('My gene sequence is '+str(len(gene))+' bp long.')

gcutters=[] # enzymes that cut only once

#use some previous code to make a list of enzymes that cut only once in our gene sequence.
for enz in enzymes.values():
    if gene.count(enz['pattern'])==1:
        gcutters.append(enz)
    #fill me in 

My gene sequence is 12615 bp long.


In [103]:
test11()

NameError: name 'test11' is not defined

In [112]:
# we can check if a key exists in a dictionary with the test 'if key in dictionary:'
# let's use this to find out where our first cutting enzyme and last cutting enzyme are.
firstcutsite=len(gene)
lastcutsite=0
firstcutter='' #name of our first cutting enzyme
lastcutter='' #name of our last cutting enzyme
for gc in gcutters:
    if gc['overlap'] in stickies:
        # yes, we found a compatible cutter
        ecut=gene.find(gc['pattern']) # add something to make this the correct position - where does the enzyme actually cut?
        if ecut < firstcutsite:
            firstcutsite=ecut
            firstcutter=gc
        #add an if statement and block to check if it is the last cutter.
        if ecut > lastcutsite:
            lastcutsite=ecut
            lastcutter=gc
print('gene cut between '+str(firstcutsite)+ ' ('+firstcutter['name']+') and '+str(lastcutsite)+' ('+lastcutter['name']+')')

gene cut between 3 (HpaI) and 12266 (KroI)


In [None]:
test12()

We nearly have our cloning strategy. We just have to look up the two sets of enzymes we could use to clone our gene into the plasmid. How can we find which enzymes could be used to create the compatible sticky ends in the plasmid? **Hint:** we have them all sorted by their overlaps in _stickies_, we have the overlap for the first and last cutter as _firstcutter['overlap']_ and _lastcutter['overlap']_, and we captured the enzyme cut site in the plasmid as _enzyme['pcut']_.
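
The lookup in the hint can be sketched on made-up data: use the gene cutter's overlap as a key into the dictionary of plasmid cutters, then read `'pcut'` from a matching enzyme. The variables `toy_stickies` and `toy_firstcutter` and the `'pcut'` numbers are invented for illustration, and taking the first enzyme in the list is an arbitrary choice rather than the tutorial's required answer.

```python
# Toy stand-ins for stickies and firstcutter (pcut values are made up)
toy_stickies = {
    'gatc': [{'name': 'BamHI', 'overlap': 'gatc', 'pcut': 701}],
    'aatt': [{'name': 'EcoRI', 'overlap': 'aatt', 'pcut': 688}],
}
toy_firstcutter = {'name': 'BglII', 'overlap': 'gatc'}

# every plasmid enzyme that leaves the same sticky end as the gene cutter
candidates = toy_stickies[toy_firstcutter['overlap']]
chosen = candidates[0]  # arbitrary: take the first compatible enzyme

print(chosen['name'], chosen['pcut'])
```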

In [None]:
#Find the sites and the enzymes
plasmidcutsite1=
plasmidcutter1=
plasmidcutsite2=
plasmidcutter2=

In [None]:
test13()

## Final Challenge
The final part is to get the full cloned sequence. Join together the first part of the plasmid sequence (up to the first cut),
the cut-out gene sequence, and the plasmid sequence after the second cut.
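
Joining the pieces is plain string slicing and concatenation. Here is a minimal sketch with made-up sequences and cut positions; every name and number in it is illustrative, so substitute your own variables when you fill in the cell.

```python
toy_plasmid = 'aaaabbcccc'   # 'bb' is the stretch lost between the two plasmid cuts
toy_gene = 'xxGGGGGGxx'      # the part we want to clone is the run of G
p_cut1, p_cut2 = 4, 6        # made-up plasmid cut positions
g_cut1, g_cut2 = 2, 8        # made-up gene cut positions

# plasmid up to the first cut + gene fragment + plasmid after the second cut
toy_clone = toy_plasmid[:p_cut1] + toy_gene[g_cut1:g_cut2] + toy_plasmid[p_cut2:]

print(toy_clone)       # → aaaaGGGGGGcccc
print(len(toy_clone))  # → 14
```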

In [None]:
clonedsequence=

In [None]:
test14()


## Assessment
In the assessment you will need to answer some questions, reusing code that you have already written. Here are some example questions, and you can try the practice CMA.

1. How many cut sites are there in the original plasmid sequence for BglI?
2. If the cloned plasmid is cut with EcoRI, how long is the shortest fragment? (Remember that the plasmid is circular, so you have to add the first and the last pieces together to get over the break in the sequence.)
3. Which enzyme has the longest recognition pattern?
4. How many 3' cutters are there in the list?
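
For questions like number 2, the wrap-around can be handled by adding the piece after the last cut to the piece before the first cut. This is a minimal sketch; the function name `circular_fragment_lengths` and the example positions are assumptions for illustration, and finding the cut positions in your own cloned sequence is left to you.

```python
def circular_fragment_lengths(cut_positions, total_length):
    """Lengths of the fragments made by cutting a circular sequence at the given positions."""
    cuts = sorted(cut_positions)
    if not cuts:
        return [total_length]  # an uncut circle stays in one piece
    # pieces between neighbouring cuts
    lengths = [b - a for a, b in zip(cuts, cuts[1:])]
    # the piece that runs past the end of the text and round to the first cut
    lengths.append(total_length - cuts[-1] + cuts[0])
    return lengths

fragments = circular_fragment_lengths([100, 400, 900], 1000)
print(fragments)       # → [300, 500, 200]
print(min(fragments))  # → 200
```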

In [None]:
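import random

# count how many times each enzyme's recognition pattern occurs in the plasmid
cuts={}
for e in enzymes.values():
    cuts[e['name']]=plasmid.count(e['pattern'])
qfh=open('assessmentqs.txt','w')
for e,c in cuts.items(): # .items() gives (name, count) pairs; looping over cuts alone gives only the names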
    qfh.write('<Q>How many times does the enzyme %s cut the original plasmid sequence?\n<T>MC\n<D>%s plasmid cut sites\n<C+>%s\n'%(e,e,c))
    values=list(range(2,25))
    wrongs=[0,1]+random.sample(values,3)
    if c in wrongs:
        wrongs.remove(c)
    else:
        wrongs=random.sample(wrongs,4)
    for w in wrongs:
        qfh.write('<C>%s\n'%w)
    qfh.write('\n')