# ESBS Python courses TD 1

## Around a piece of DNA

This TD aims to show how DNA sequences can be manipulated using Python string objects. We will analyse a plasmid sequence named PBR322, one of the historical vectors used in molecular biology. The following code reads a file on disk named PBR322.seq that contains the vector sequence.

In [68]:
## Read sequence from a file
import re
def read_DNA (fname = "PBR322.seq"):
    with open(fname) as f: # Open the file; it is closed automatically when the block ends
        all_lines = f.readlines() # Read all the lines contained in the file
    base = re.compile(r"([atgc]+)")
    sequence = ""
    print (all_lines[0]) # Print the first line of the file
    for line in all_lines[1:]: # Process all lines except the first one
        wds = re.findall(base,line) # Find the motifs corresponding to a group of DNA bases
        for seq in wds: # Concatenate segments of 10 bases
            sequence = sequence + seq
    print (" {:d} bases read ".format(len(sequence)))
    return sequence

## 1. Computing a complementary strand

The string method **replace** can be used to compute a complementary strand efficiently by substituting A with T, C with G and so on. In theory, four lines should do the work. However, running the four substitutions sequentially causes trouble, because the A to T substitution is reversed by the T to A substitution.

Modify the following script to get the correct result:

    5'>  ttctcatgtttgacagctta
    3'>  aagagtacaaactgtcgaat

In [72]:
seq = read_DNA (fname = "PBR322.seq")

comp = seq.replace('t','a')
comp = seq.replace('c','g')
comp = seq.replace('g','c')
comp = seq.replace('a','t')

print ("5'> ", seq[:20])
print ("3'> ", comp[:20])

SQ   Sequence 4361 BP; 983 A; 1210 C; 1134 G; 1034 T; 0 other;

 4361 bases read 
5'>  ttctcatgtttgacagctta
3'>  ttctcttgtttgtctgcttt
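
One possible way around the clash, given here only as a sketch: perform all four substitutions in a single pass with `str.maketrans` and `str.translate`, or keep `replace` but write the new bases in upper case so that a later substitution cannot touch them. The helper names `complement` and `complement_replace` are hypothetical and not part of the TD.

```python
# Hypothetical helpers, not part of the original TD.
# Translation table mapping each lowercase base to its complement.
COMPLEMENT = str.maketrans("atgc", "tacg")

def complement(dna):
    """Return the complementary strand, base by base (no reversal)."""
    return dna.translate(COMPLEMENT)

def complement_replace(dna):
    """Same result with str.replace: new bases are written in upper case
    so that a later replace cannot modify them, then lower-cased."""
    comp = dna.replace("a", "T").replace("t", "A")
    comp = comp.replace("c", "G").replace("g", "C")
    return comp.lower()

example = "ttctcatgtttgacagctta"  # the 20 bases quoted in the exercise
print("5'> ", example)
print("3'> ", complement(example))
```

Both functions return the complement without reversing it, which matches the 5'/3' layout expected by the exercise.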


## 2. Base composition statistics

- Write a script to compute the percentage of each of the four bases.
  The script should use a for loop.
  
  Note: In this TD, many print statements use the [format method](https://www.python-course.eu/python3_formatted_output.php) to control exactly how numbers and text are formatted.


In [99]:
# Solution 
seq = read_DNA (fname = "PBR322.seq")


SQ   Sequence 4361 BP; 983 A; 1210 C; 1134 G; 1034 T; 0 other;

 4361 bases read 
 The percentage of a is :  22.54 %
 The percentage of t is :  23.71 %
 The percentage of g is :  26.00 %
 The percentage of c is :  27.75 %
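
A minimal sketch of such a loop, run on a toy sequence rather than on PBR322 so that it works without the file. The function name `base_percentages` and the toy string are assumptions made for illustration.

```python
def base_percentages(dna):
    """Return {base: percentage} for the four bases of a lowercase DNA string."""
    percentages = {}
    for base in "atgc":
        # str.count gives the number of occurrences of the base
        percentages[base] = 100 * dna.count(base) / len(dna)
    return percentages

toy = "aattggccat"  # toy sequence, not PBR322
for base, pct in base_percentages(toy).items():
    # {:6.2f} reserves 6 characters and keeps 2 decimals
    print(" The percentage of {} is : {:6.2f} %".format(base, pct))
```

With the real plasmid, `toy` would be replaced by the string returned by `read_DNA`.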


- What does the following script do?
- Modify the script by adding a for statement so that the result is given for all four bases.


In [109]:
c = 1
while seq.count('a'*c) > 0:
    c = c + 1
print ('a'*c)

aaaaaaaa


aaaaaaaa
tttttttt
ccccccc
gggggg
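
The same idea can be wrapped in a function and looped over the four bases. This is only a sketch on a toy sequence, and `longest_run` is a hypothetical name. Note that the loop in the cell above exits at the first length that is *not* found, so the string it prints is one base longer than the longest run actually present; the version below returns the length of the longest run itself.

```python
def longest_run(dna, base):
    """Length of the longest uninterrupted run of `base` in `dna`."""
    length = 0
    # Keep extending while a run one base longer still exists
    while dna.count(base * (length + 1)) > 0:
        length += 1
    return length

toy = "aaatccggggta"  # toy sequence, not PBR322
for base in "atcg":
    print(base, longest_run(toy, base), base * longest_run(toy, base))
```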


## 3. Restriction enzymes

Restriction enzymes are endonucleases that recognize specific sequence motifs and cut the DNA at a defined position within or next to the motif. The following exercises show how string methods can be used to generate restriction maps with little effort.

The enzymes we will use are :

    BamH1 G/GATCC
    HpaII C/CGG
    MboI /GATC
    Sau3A GATC/
    SmaI CCC/GGG
    AccI GT/CGAC
    KpnI GGTAC/C
    EcoRI G/AATTC
    PstI CTGCA/G
    BglII A/GATCT

### A dictionary of Restriction Enzymes

   * Create a dictionary **RE_dict** with the names of the above restriction enzymes (RE) as keys and the recognized sequences as values.
      Note: In this first version, the precise cutting point indicated by the character "/" is not considered.
      
    
   * Use the method [split](https://www.tutorialspoint.com/python3/string_split.htm) to generate a first restriction map of the PBR322 plasmid with the enzyme MboI. Beware that the plasmid is circular, whereas the string containing the sequence is not. The output should have the following form:
    
    
    SQ   Sequence 4361 BP; 983 A; 1210 C; 1134 G; 1034 T; 0 other;
    4361 bases read       
    22 fragments generated by the enzyme MboI with following length :
    4
    7
    8
    11
    13
    

In [None]:
# Define here the dictionary


# Reading the PBR322 sequence
seq = read_DNA (fname = "PBR322.seq")

# Complete the script

re_map = 
print (" {} fragments generated by the enzyme MboI with following length :".format(len(re_map)-1))

map_len = []


for frag in map_len:
    print(frag)
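
As a rough sketch of the approach (not the reference solution), the cell below builds a partial dictionary and handles the circular plasmid by joining the last piece returned by `split` to the first one. The function name `split_lengths` and the toy sequence are assumptions. A site that straddles the junction between the end and the start of the string would be missed by this simple version.

```python
# Subset of the enzyme table, without the "/" cutting mark
RE_dict = {"BamH1": "GGATCC", "MboI": "GATC", "EcoRI": "GAATTC"}

def split_lengths(dna, site):
    """Sorted fragment lengths of a circular sequence cut with str.split.

    split removes the recognition site, so each length is short by len(site).
    """
    pieces = dna.split(site.lower())  # the sequence file is in lower case
    if len(pieces) == 1:
        return []  # no site found: the circle is not cut
    # On a circle, the last piece continues into the first one
    pieces[0] = pieces.pop() + pieces[0]
    return sorted(len(p) for p in pieces)

toy = "aagatcttttgatccc"  # toy circular sequence with two MboI sites
lengths = split_lengths(toy, RE_dict["MboI"])
print(" {} fragments generated by the enzyme MboI with following length :".format(len(lengths)))
for frag_len in lengths:
    print(frag_len)
```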

### Restriction site version 2

The script above does not return the exact restriction fragments, because the **split** method removes the recognition sequence it splits on. Create a second dictionary that contains the cutting position within each site, such as:

    RE_cut = {"BamH1" : 1, "MboI" : 0 , ...}

Then use this dictionary to get the correct fragment lengths and sequences.

SQ   Sequence 4361 BP; 983 A; 1210 C; 1134 G; 1034 T; 0 other;

 4361 bases read 
 1 fragments generated by the enzyme MboI with following length :
4361
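
One way to use the cutting position, sketched on a toy sequence: locate every site with `find`, shift each hit by the enzyme's offset, and slice the string between consecutive cuts. The function name `digest` is hypothetical, and the sites are lower-cased because the sequence file is in lower case (an upper-case site would never match it).

```python
RE_dict = {"BamH1": "GGATCC", "MboI": "GATC", "Sau3A": "GATC"}
RE_cut = {"BamH1": 1, "MboI": 0, "Sau3A": 4}  # offset of "/" inside the site

def digest(dna, enzyme):
    """Return the fragments of a circular sequence cut by `enzyme`."""
    site = RE_dict[enzyme].lower()
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos + RE_cut[enzyme])  # exact cutting position
        pos = dna.find(site, pos + 1)
    if not cuts:
        return [dna]  # uncut circle, reported here as a single piece
    fragments = [dna[a:b] for a, b in zip(cuts, cuts[1:])]
    # The last fragment wraps around the origin of the string
    fragments.append(dna[cuts[-1]:] + dna[:cuts[0]])
    return fragments

toy = "aagatcttttgatccc"  # toy circular sequence, not PBR322
for frag in digest(toy, "MboI"):
    print(len(frag), frag)
```

Unlike the `split` version, the fragment lengths now add up to the length of the whole sequence.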


## 4. Searching open reading frames

We will use the [find method](https://www.tutorialspoint.com/python/string_find.htm) to search for all occurrences of the start codon in the PBR322 sequence. 

The find method only returns the index of the first occurrence of the string, or -1 if the string is not found. To find them all, we need to repeat the search as many times as needed using a [while loop](https://www.tutorialspoint.com/python/python_while_loop.htm).

### Searching for START codon

Modify the following script to return a list with all occurrences of the atg codon.

* How many atg codons are there in the PBR322 sequence?

In [77]:
seq = read_DNA (fname = "PBR322.seq")
start = 0
start_pos = []
while seq.find('atg',start) != -1: # find returns -1 when no match is left; index 0 is a valid hit
    

SQ   Sequence 4361 BP; 983 A; 1210 C; 1134 G; 1034 T; 0 other;

 4361 bases read 
36
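
The repeated search can be sketched as follows on a toy sequence. The function name `find_all` is an assumption; the key point is that each new search restarts one position after the previous hit.

```python
def find_all(dna, motif):
    """Return the 0-based index of every occurrence of `motif`."""
    positions = []
    pos = dna.find(motif)
    while pos != -1:  # -1 means "not found"; 0 is a valid position
        positions.append(pos)
        pos = dna.find(motif, pos + 1)  # restart just after the last hit
    return positions

toy = "atgccatgatgtaa"  # toy sequence, not PBR322
start_pos = find_all(toy, "atg")
print(start_pos)
print(len(start_pos))
```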


### Searching for STOP codons

Using the same procedure as above, search for all occurrences of the three STOP codons:

    TAA (ochre)
    TAG (amber)
    TGA (opal)
    
Complete the following script to create a list named stop_pos that contains the indices of the stop codons. The list should be sorted by position.

In [None]:
seq = read_DNA (fname = "PBR322.seq")
stop_pos = []
for stop in ['taa', 'tag', 'tga']:
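
A possible shape for the completed loop, shown on a toy sequence and wrapped in a hypothetical function `find_stops`: the positions of the three codons are collected in one list, which is sorted once at the end.

```python
def find_stops(dna):
    """Return the sorted positions of all taa, tag and tga codons in `dna`."""
    stop_pos = []
    for stop in ["taa", "tag", "tga"]:
        pos = dna.find(stop)
        while pos != -1:
            stop_pos.append(pos)
            pos = dna.find(stop, pos + 1)
    stop_pos.sort()  # positions were collected codon by codon, so sort them
    return stop_pos

toy = "atgtaaccctagtgaa"  # toy sequence, not PBR322
print(find_stops(toy))
```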


### A method (algorithm) to find open reading frames

We propose to implement the following algorithm to identify the ORFs:

For every start codon, look through the list of stop codons located after the start position until a stop is found in the same reading frame.

- Modify the following script to return a dictionary that contains the position of the start codon as key, and the length of the corresponding ORF as value

- Complete the script to print the ten longest ORFs in decreasing order of length, together with the corresponding sequences

- Put the sequences of these ORF in a list named orf_list

Note: The keys of a dictionary can be sorted by their values using the Python built-in sorted:

    # if d is a dictionary:
    for key in sorted(d, key=d.get, reverse=True):
        print(d[key])


In [129]:
for start in start_pos:
    for stop in stop_pos:
        if stop > start and (stop-start) % 3 == 0:
            # print (stop-start)
            break
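
The algorithm described above could be completed along these lines. This is a sketch on a toy sequence, with hypothetical helper names (`find_all`, `find_orfs`), and it assumes that the ORF length includes the stop codon.

```python
def find_all(dna, motif):
    """Return the index of every occurrence of `motif` in `dna`."""
    positions, pos = [], dna.find(motif)
    while pos != -1:
        positions.append(pos)
        pos = dna.find(motif, pos + 1)
    return positions

def find_orfs(dna):
    """Map each start position to the length of its ORF (stop codon included)."""
    start_pos = find_all(dna, "atg")
    stop_pos = sorted(p for stop in ("taa", "tag", "tga")
                      for p in find_all(dna, stop))
    orfs = {}
    for start in start_pos:
        for stop in stop_pos:
            # First stop located after the start and in the same frame
            if stop > start and (stop - start) % 3 == 0:
                orfs[start] = stop + 3 - start
                break
    return orfs

toy = "atgaaaccctaacatgtgag"  # toy sequence, not PBR322
orfs = find_orfs(toy)
orf_list = []
for start in sorted(orfs, key=orfs.get, reverse=True)[:10]:
    orf_list.append(toy[start:start + orfs[start]])
    print(start, orfs[start], orf_list[-1])
```

A start codon with no in-frame stop after it is simply left out of the dictionary.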

## 5. Translation

The dictionary gcode below contains the genetic code. Write a few lines of Python to translate the ten longest ORFs found in the PBR322 sequence.

Using BLAST, can you identify the protein?

In [133]:
gcode = {"TTT":"F","TTC":"F","TTA":"L","TTG":"L","TCT":"S","TCC":"S",
	"TCA":"S","TCG":"S","TAT":"Y","TAC":"Y","TAA":"*","TAG":"*",
	"TGT":"C","TGC":"C","TGA":"*","TGG":"W","CTT":"L","CTC":"L",
	"CTA":"L","CTG":"L","CCT":"P","CCC":"P","CCA":"P","CCG":"P",
	"CAT":"H","CAC":"H","CAA":"Q","CAG":"Q","CGT":"R","CGC":"R",
	"CGA":"R","CGG":"R","ATT":"I","ATC":"I","ATA":"I","ATG":"M",
	"ACT":"T","ACC":"T","ACA":"T","ACG":"T","AAT":"N","AAC":"N",
	"AAA":"K","AAG":"K","AGT":"S","AGC":"S","AGA":"R","AGG":"R",
	"GTT":"V","GTC":"V","GTA":"V","GTG":"V","GCT":"A","GCC":"A",
	"GCA":"A","GCG":"A","GAT":"D","GAC":"D","GAA":"E","GAG":"E",
	"GGT":"G","GGC":"G","GGA":"G","GGG":"G"}

In [137]:
prot_list = []
for sequence in orf_list:

    
print(prot_list)

['MKSNNALIVILGTVTLDAVGIGLVMPVLPGLLRDIVHSDSIASHYGVLLALYALMQFLCAPVLGALSDRFGRRPVLLASLLGATIDYAIMATTPVLWILYAGRIVAGITGATGAVAGAYIADITDGEDRARHFGLMSACFGVGMVAGPVAGGLLGAISLHAPFLAAAVLNGLNLLLGCFLMQESHKGERRPMPLRAFNPVSSFRWARGMTIVAALMTVFFIMQLVGQVPAALWVIFGEDRFRWSATMIGLSLAVFGILHALAQAFVTGPATKRFGEKQAIIAGMAADALGYVLLAFATRGWMAFPIMILLASGGIGMPALQAMLSRQVDDDHQGQLQGSLAALTSLTSITGPLIVTAIYAASASTWNGLAWIVGAALYLVCLPALRRGAWSRATST*', 'MPVLPGLLRDIVHSDSIASHYGVLLALYALMQFLCAPVLGALSDRFGRRPVLLASLLGATIDYAIMATTPVLWILYAGRIVAGITGATGAVAGAYIADITDGEDRARHFGLMSACFGVGMVAGPVAGGLLGAISLHAPFLAAAVLNGLNLLLGCFLMQESHKGERRPMPLRAFNPVSSFRWARGMTIVAALMTVFFIMQLVGQVPAALWVIFGEDRFRWSATMIGLSLAVFGILHALAQAFVTGPATKRFGEKQAIIAGMAADALGYVLLAFATRGWMAFPIMILLASGGIGMPALQAMLSRQVDDDHQGQLQGSLAALTSLTSITGPLIVTAIYAASASTWNGLAWIVGAALYLVCLPALRRGAWSRATST*', 'MQFLCAPVLGALSDRFGRRPVLLASLLGATIDYAIMATTPVLWILYAGRIVAGITGATGAVAGAYIADITDGEDRARHFGLMSACFGVGMVAGPVAGGLLGAISLHAPFLAAAVLNGLNLLLGCFLMQESHKGERRPMPLRAFNPVSSFRWARGMTIVAALMTVFFIMQLVGQVPAALWVIFGEDRFRWSATMIGLSLAVFGILHALAQAFVTGPATKRF
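
A sketch of the translation step on two toy ORFs. To keep the cell self-contained, the table is rebuilt here in a compact form (bases in the order T, C, A, G and the matching amino-acid string of the standard genetic code), which is meant to give the same 64 entries as the gcode dictionary above. The function name `translate` is an assumption. Since the keys are in upper case and the sequence is in lower case, the ORF is upper-cased first.

```python
from itertools import product

# Compact rebuild of the standard genetic code (stand-in for gcode above)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
gcode = {"".join(codon): aa
         for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(orf):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    orf = orf.upper()  # gcode keys are upper case
    protein = ""
    for i in range(0, len(orf) - 2, 3):  # step through complete codons only
        protein += gcode[orf[i:i + 3]]
    return protein

orf_list = ["atgaaaccctaa", "atgtga"]  # toy ORFs, not taken from PBR322
prot_list = [translate(orf) for orf in orf_list]
print(prot_list)
```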

## Conclusion

Read carefully the sequences found by our algorithm. What do you notice?

Propose a correction to make the code more effective.
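
One possible correction, given as a sketch: it assumes the observation is that several ORFs end at the same stop codon, so that the shorter ones are nested inside the longest one. The function `longest_per_stop` and the example dictionary are hypothetical; the idea is to keep, for each stop position, only the earliest start.

```python
def longest_per_stop(orfs):
    """Keep only the longest ORF for each stop codon.

    `orfs` maps a start position to an ORF length (stop codon included).
    """
    best = {}  # end position -> earliest start found so far
    for start, length in orfs.items():
        end = start + length
        if end not in best or start < best[end]:
            best[end] = start
    return {start: end - start for end, start in best.items()}

# Toy input: the ORFs starting at 0 and 6 share the same end (15)
orfs = {0: 15, 6: 9, 20: 6}
print(longest_per_stop(orfs))
```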
