# Lab 13

## Open Reading Frames & Virtual Ribosome

__BACKGROUND__: Given a DNA coding sequence, we frequently wish to see the protein sequence that it encodes. We can do this by building a Python program that translates the DNA sequence codon-by-codon into a protein (amino acid) sequence, somewhat like a "virtual ribosome." Such a program is also useful for gene finding: we can translate a given DNA sequence in all six possible reading frames and look for an open reading frame (ORF). Recall from the lecture that an ORF is a stretch of DNA sequence that begins with a start/methionine codon (ATG), proceeds for some significant length without a stop codon, and finally terminates with a stop codon (TAA, TAG, or TGA in DNA; UAA, UAG, or UGA in mRNA).
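
The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the lab solution: the helper name `first_orf` and the test sequence are made up, it scans only the forward strand, and it ignores the "significant length" requirement.

```python
# Hypothetical helper for illustration; not part of the lab materials.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spellings of UAA, UAG, UGA

def first_orf(dna):
    """Return the first ATG-to-stop stretch in dna, or None if there is none."""
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon from the start codon, staying in its frame.
        for i in range(start, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop codon after this ATG; try the next ATG.
        start = dna.find("ATG", start + 1)
    return None

print(first_orf("CCATGAAATTTTAGGG"))  # ATGAAATTTTAG
```

A real gene finder would also require a minimum ORF length and would check the reverse strand.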

__TASK__: Build a virtual ribosome program that will translate the three positive reading frames of a given DNA sequence. Find the true reading frame and identify the corresponding protein using BLAST.

### Part A: load the DNA sequence

In order to use a script to translate the DNA sequence into an amino acid sequence, first we need to load the sequence. The DNA sequence is provided for you in FASTA format as part of the lab materials on eng-grid (```dna.fasta```). If you followed the instructions, this file should be in the same directory as the current notebook.

Write Python code to read in `dna.fasta` and create a variable that contains the concatenated DNA sequence as a string. You may use or adapt your solution from Lab 9 for this purpose.

In [25]:
## Add your Python code here:
with open('dna.fasta', 'r') as file:
    dna = ''  # accumulates the concatenated sequence
    for line in file:
        if line.startswith('>'):
            continue  # skip FASTA header lines
        dna += line.strip()  # drop the newline and append this sequence line

print(dna)



CTAGGCTAATGCAAATTTTTGTCAAGACTTTGACTGGTAAGACCATCACTTTGGAAGTTGAATCTTCTGACACTATTGACAATGTCAAGTCAAAGATTCAAGACAAGGAAGGTATCCCACCTGACCAACAAAGATTGATCTTTGCTGGTAAGCAATTGGAAGACGGTAGAACCTTGTCTGACTACAACATTCAAAAAGAATCCACTTTGCACTTAGTCTTGAGATTGAGAGGTGGTATCATTGAACCATCTTTGAAAGCTTTGGCTTCCAAGTACAACTGTGACAAATCTGTTTGCCGTAAGTGTTATGCTAGATTGCCACCAAGAGCTACCAACTGTAGAAAGAGAAAGTGTGGTCACACCAACCAATTGCGTCCAAAGAAGAAGTTAAAATGACGGATTCCGGATCTCGCGCTAG


### Part B: load and store the genetic code

In order to translate from DNA to protein, we must know which codons code for which amino acids, and this is best accomplished by saving the information as a dictionary. We have provided the genetic code as a separate file on Blackboard (```universal_genetic_code.tab```). Each of the 64 lines in the file looks like: ```AAA\tK\n```, i.e. the three-letter codon, a tab (\t), the single-letter amino acid designation, and a newline character (\n).

1. Read this file line‐by‐line.
2. Split each line into a codon string and a one‐letter amino acid string.
3. Store this pair in a dictionary, with the codon being the key, and the amino acid being the value.

A "```*```" is used to represent a translated STOP codon.

__HINT__: we accomplished a similar task, i.e. splitting a string into substrings and then using those substrings to build a dictionary, in Lab 9.

In [29]:
## Add your Python code here:
with open('universal_genetic_code.tab', 'r') as file:
    mydict = {}  # maps each codon to its one-letter amino acid
    for line in file:
        gen = line.split("\t")
        mydict[gen[0]] = gen[1].strip()  # codon is the key, amino acid is the value

print(mydict)


{'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}


### Part C: translating from DNA to protein in 3 frames

We have loaded our DNA sequence and stored the genetic code in a dictionary. Now we need to split the DNA into codons and use our dictionary to translate them into amino acids. However, we need to do this in three reading frames!

1. Use a for loop and the functions range and len to split the DNA into codons. __HINT__: in Lab 6 we used ```range``` and ```len``` to build a matrix. Although we don't want to build a matrix, we want to use the ```step``` argument of ```range``` to move through the sequence one codon at a time, and the ```start``` argument to change the reading frame.
2. Translate the codons of the DNA sequence by looking the codons up in the dictionary, and printing the corresponding amino acid.
3. Print the translated sequence to the screen, or save it to a variable to be printed later.
4. Visually inspect the amino acid sequence to see if it corresponds to an ORF, i.e. does it begin with a start/methionine codon (ATG)?
5. Use BLAST to identify the protein.
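
The visual inspection in step 4 can be backed up with a quick programmatic check. The sketch below assumes each translated frame is a string that uses `*` for a stop codon; the helper name `candidate_orf` and the shortened example peptide are made up for illustration.

```python
# Hypothetical helper for illustration; the name is not from the lab.
def candidate_orf(protein):
    """Return the stretch from the first M through the next '*', or None."""
    start = protein.find("M")
    if start == -1:
        return None  # no methionine, so no start codon in this frame
    stop = protein.find("*", start)
    if stop == -1:
        return None  # no stop codon after the methionine
    return protein[start:stop + 1]

# Made-up translated frames; the last one is a shortened example peptide.
for frame in ["LG*", "*", "RLMQIFVK*"]:
    print(frame, "->", candidate_orf(frame))
```

A frame that yields a long candidate peptide is the one worth submitting to BLAST.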

---

```
range(start, stop, step)
```

This function produces an arithmetic progression of integers. It is often used in ```for``` loops. The arguments must be integers. If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0. In Python 3, ```range``` returns a lazy ```range``` object rather than a list; wrapping it in ```list()``` shows the integers ```[start, start + step, start + 2*step, ...]``` that it yields.

Example:

```
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(1, 11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> list(range(0, 30, 5))
[0, 5, 10, 15, 20, 25]
```
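
Applied to a sequence, the same idea splits a string into codons. The sketch below is only an illustration: the helper name `split_codons` and the 11-base sequence are made up. The `start` argument selects the reading frame and a `step` of 3 advances one codon at a time.

```python
# Hypothetical helper for illustration; not part of the lab materials.
def split_codons(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1, or 2)."""
    # Stopping at len(seq) - 2 skips an incomplete codon at the end.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

seq = "ATGGCCTAAGT"  # made-up sequence
for frame in range(3):
    print(frame + 1, split_codons(seq, frame))
# 1 ['ATG', 'GCC', 'TAA']
# 2 ['TGG', 'CCT', 'AAG']
# 3 ['GGC', 'CTA', 'AGT']
```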

---

In [27]:


## Add your Python code here:
def orf(dna):
    for start in range(3):  # three reading frames in the forward direction
        protein = ''  # accumulates the translated protein
        for i in range(start, len(dna) - 2, 3):  # skip an incomplete trailing codon
            seq = dna[i:i+3]  # a codon is three consecutive letters
            aa = mydict[seq]
            protein += aa  # append the amino acid to the protein sequence
            if aa == '*':  # stop translating at the first stop codon
                break
        print("reading frame", start + 1, protein)  # print even if no stop codon was found

orf(dna)

reading frame 1 LG*
reading frame 2 *
reading frame 3 RLMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGIIEPSLKALASKYNCDKSVCRKCYARLPPRATNCRKRKCGHTNQLRPKKKLK*


In [28]:
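Three forward reading frames were translated. The third reading frame contains a start codon (ATG, translated as M) followed by a long stretch of codons before the first stop codon, so it is the true reading frame. Using BLAST, I identified the protein as ubiquitin-ribosomal 60S subunit protein L40A fusion protein [Saccharomyces cerevisiae S288C].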


### EXTRA: translating from DNA to protein in 6 frames

This is __NOT REQUIRED__ for completion of the lab. However, if you would like a personal challenge, we encourage you to translate the DNA sequence in the full 6 frames. There are an additional 3 reading frames in the reverse complement of the DNA sequence that could also code for a protein. The "reverse complement" of a sequence is the sequence read backwards, with each nucleotide replaced by its complement (e.g. the reverse complement of "```ATTTGC```" is "```GCAAAT```").

1. Build the reverse complement in a new variable by, e.g. using a for loop to read the original DNA sequence, and concatenating or adding the complementary nucleotide. Make sure the output is reversed!
2. Now you can use the code you wrote for Part C to also translate this extra sequence.
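
As an alternative to the for loop suggested in step 1, Python's built-in `str.maketrans` and `str.translate` can build the reverse complement. The sketch below is an illustration only: the name `revcomp` is made up, and it assumes the sequence contains only uppercase A, C, G, and T.

```python
# Translation table mapping each base to its complement (uppercase DNA only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse the sequence, then swap each base for its complement."""
    return seq[::-1].translate(COMPLEMENT)

print(revcomp("ATTTGC"))  # GCAAAT
```

Taking the reverse complement twice returns the original sequence, which makes a convenient sanity check.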

In [23]:
        
        
# Return the complement of a DNA sequence
def complement(sequences):
    pairs = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}  # complementary base pairs
    return ''.join(pairs[base] for base in sequences)  # complement each base in order

# Return the reverse complement of a DNA sequence
def reverse_complement(string):
    return complement(string[::-1])  # reverse the sequence, then complement it


orf(dna)  # three reading frames in the forward direction
print("Next three protein reading")
orf(reverse_complement(dna))  # three reading frames on the reverse strand

reading frame 1 LG*
reading frame 2 *
reading frame 3 RLMQIFVKTLTGKTITLEVESSDTIDNVKSKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGIIEPSLKALASKYNCDKSVCRKCYARLPPRATNCRKRKCGHTNQLRPKKKLK*
Next three protein reading
reading frame 1 LARDPESVILTSSLDAIGWCDHTFSFYSW*
reading frame 2 *
reading frame 3 SARSGIRHFNFFFGRNWLV*
