<a href="https://colab.research.google.com/github/ebatty/ComputationalBootcamp/blob/master/content/DNAParsing_Solutions.ipynb" target="_blank"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"/></a>

## DNA Parsing 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

We will be parsing a real DNA sequence to translate it into the corresponding protein sequence.

DNA consists of a sequence of nucleotides (A, G, T, or C). In the genetic code, each group of three consecutive nucleotides forms a codon that translates to a single amino acid. There are only 20 standard amino acids, so we can use a lookup table to pair each codon with its amino acid. Some codons are stop codons, which signal that translation should end there.

In this way, we can work through a DNA sequence, taking each group of three nucleotides (the first three, then the next three, and so on) and translating it to the corresponding amino acid. The resulting sequence of amino acids constitutes the protein that the DNA sequence encodes.
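As a minimal sketch of the "first three, then the next three" idea, the snippet below splits a short sequence into codons. The name `toy_sequence` and its value are made up for illustration and are not part of the exercise data.

```python
# Hypothetical toy sequence, not the notebook's data
toy_sequence = 'ATGGCCTGA'

# Step through the string three characters at a time
codons = [toy_sequence[i:i + 3] for i in range(0, len(toy_sequence), 3)]

print(codons)  # ['ATG', 'GCC', 'TGA']
```

The third argument to `range` is the step size, so `i` takes the values 0, 3, 6 and each slice picks out one codon.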

We are getting our data from a public repository of DNA sequences from NCBI.  We will be looking at a DNA sequence from a Golden Retriever. The data can be found here.


The cell below assigns the data to variables. You do not need to do anything.

`dna_sequence` contains the DNA sequence.

`dna_codons` stores the pairings from triplets/codons to amino acids in a dictionary. The triplets are the keys and the amino acids are the values.

`true_translation` contains the translated protein (from the NCBI website, under CDS/translation).
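To illustrate how a codon dictionary is used, here is a small sketch with a three-entry excerpt. The name `codon_excerpt` is hypothetical; the full table is `dna_codons` in the cell below, and lookups on it work the same way.

```python
# Hypothetical three-entry excerpt of the codon table, for illustration only
codon_excerpt = {'ATG': 'M', 'AGC': 'S', 'TGA': 'STOP'}

# Index with a triplet (the key) to get its amino acid (the value)
print(codon_excerpt['ATG'])    # M
print(codon_excerpt['TGA'])    # STOP

# Triplets that are not keys can be detected with `in`
print('NNN' in codon_excerpt)  # False
```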

In [None]:
dna_sequence = 'ATGAGCGAGTCGAGCTCGAAGTCCAGCCAGCCTTTGGCCTCCAAGCAGGAAAAGGACGGCACTGAGAAGCGAGGGCGGGGCCGGCCGCGCAAGCAGCCTCCGAAGGAACCCAGTGAAGTGCCAACACCTAAGAGACCTCGGGGCCGACCAAAGGGGAGCAAAAACAAGGGTGCTGCCAAGACCCGGAAAACTACCACAACTCCAGGGAGGAAACCGAGGGGCAGACCCAAAAAACTGGAGAAGGAGGAAGAAGAGGGCATCTCGCAGGAGTCCTCCGAAGAGGAGCAGTGA'

dna_codons = {'TTT' : 'F', 'CTT' : 'L', 'ATT' : 'I', 'GTT' : 'V',
           'TTC' : 'F', 'CTC' : 'L', 'ATC' : 'I', 'GTC' : 'V',
           'TTA' : 'L', 'CTA' : 'L', 'ATA' : 'I', 'GTA' : 'V',
           'TTG' : 'L', 'CTG' : 'L', 'ATG' : 'M', 'GTG' : 'V',
           'TCT' : 'S', 'CCT' : 'P', 'ACT' : 'T', 'GCT' : 'A',
           'TCC' : 'S', 'CCC' : 'P', 'ACC' : 'T', 'GCC' : 'A',
           'TCA' : 'S', 'CCA' : 'P', 'ACA' : 'T', 'GCA' : 'A',
           'TCG' : 'S', 'CCG' : 'P', 'ACG' : 'T', 'GCG' : 'A',
           'TAT' : 'Y', 'CAT' : 'H', 'AAT' : 'N', 'GAT' : 'D',
           'TAC' : 'Y', 'CAC' : 'H', 'AAC' : 'N', 'GAC' : 'D',
           'TAA' : 'STOP', 'CAA' : 'Q', 'AAA' : 'K', 'GAA' : 'E',
           'TAG' : 'STOP', 'CAG' : 'Q', 'AAG' : 'K', 'GAG' : 'E',
           'TGT' : 'C', 'CGT' : 'R', 'AGT' : 'S', 'GGT' : 'G',
           'TGC' : 'C', 'CGC' : 'R', 'AGC' : 'S', 'GGC' : 'G',
           'TGA' : 'STOP', 'CGA' : 'R', 'AGA' : 'R', 'GGA' : 'G',
           'TGG' : 'W', 'CGG' : 'R', 'AGG' : 'R', 'GGG' : 'G' 
           }

true_translation = 'MSESSSKSSQPLASKQEKDGTEKRGRGRPRKQPPKEPSEVPTPK\
RPRGRPKGSKNKGAAKTRKTTTTPGRKPRGRPKKLEKEEEEGISQESSEEEQ'



1) Translation

For each consecutive triplet, we want to find the corresponding amino acid and add it to our amino acid sequence. 

Let's go through an example. Let's say the DNA sequence is:
CCCCATAGTGGGAGC. We would separate this into triplets: CCC CAT AGT GGG AGC. We would then use the lookup dictionary to convert these to amino acids: P H S G S. We can store the protein as a string: 'PHSGS'.
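The worked example can be checked with a short sketch. The names `example_sequence` and `example_codons` are introduced here for illustration; `example_codons` holds only the five triplets that appear in the example, with the same amino acids as the full table.

```python
# Subset of the codon table covering only the triplets in the example
example_codons = {'CCC': 'P', 'CAT': 'H', 'AGT': 'S', 'GGG': 'G', 'AGC': 'S'}
example_sequence = 'CCCCATAGTGGGAGC'

example_protein = ''
for ind in range(0, len(example_sequence), 3):
    # Slice out one codon and append its amino acid
    triplet = example_sequence[ind : ind + 3]
    example_protein += example_codons[triplet]

print(example_protein)  # PHSGS
```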


In [None]:
# @markdown Click here to see hints

"""

Stuck? Break it up into little steps.

1) How can you index into dna_sequence to get 3 consecutive nucleotides?

2) How do we get the amino acid that corresponds to that triplet?

3) How can you loop through those indices so on each loop, you get a new triplet?

4) How do we continually add those amino acids to a string? (Hints: initialize an empty string with '', remember string concatenation)

""";


In [None]:
# your code here
protein = ''

# Loop over codons
for ind in range(0, len(dna_sequence), 3):

  # Extract codon
  triplet = dna_sequence[ind : ind + 3]

  # Extract amino acid
  amino_acid = dna_codons[triplet]
  
  # Add to protein sequence
  protein += amino_acid
  
print(protein)

MSESSSKSSQPLASKQEKDGTEKRGRGRPRKQPPKEPSEVPTPKRPRGRPKGSKNKGAAKTRKTTTTPGRKPRGRPKKLEKEEEEGISQESSEEEQSTOP


2) Stop codons

When we hit a stop codon (the amino acid is 'STOP'), we want to stop translating. Add this to the code above. Enter this part of the code in LC.

If our sequence in the example above had been CCCCATAGTGGGAGCTAG, the code from part 1 would give 'PHSGSSTOP', since TAG is a stop codon. We do not want to include the 'STOP' in the protein string, so the result should be 'PHSGS'.
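One way to sketch the stop-codon behavior on the short example is to wrap the loop in a function. The function name `translate_until_stop` and the small table `example_codons_with_stop` are assumptions made for this illustration, not part of the exercise. The loop bound `len(sequence) - 2` is an extra safeguard that skips a trailing incomplete codon.

```python
def translate_until_stop(sequence, codon_table):
    """Translate codon by codon, ending before the first stop codon."""
    protein = ''
    for ind in range(0, len(sequence) - 2, 3):
        amino_acid = codon_table[sequence[ind : ind + 3]]
        if amino_acid == 'STOP':
            break  # leave the loop without appending 'STOP'
        protein += amino_acid
    return protein

# Subset of the codon table covering only the triplets used here
example_codons_with_stop = {'CCC': 'P', 'CAT': 'H', 'AGT': 'S',
                            'GGG': 'G', 'AGC': 'S', 'TAG': 'STOP'}

print(translate_until_stop('CCCCATAGTGGGAGCTAG', example_codons_with_stop))  # PHSGS
```

Because `break` runs before the concatenation, nothing after the stop codon is translated either.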

In [None]:
# your code here
protein = ''

# Loop over codons
for ind in range(0, len(dna_sequence), 3):

  # Extract codon
  triplet = dna_sequence[ind : ind + 3]

  # Extract amino acid
  amino_acid = dna_codons[triplet]

  # Check if stop codon
  if amino_acid == 'STOP':
    break

  # Add to protein sequence
  protein += amino_acid
  
print(protein)

MSESSSKSSQPLASKQEKDGTEKRGRGRPRKQPPKEPSEVPTPKRPRGRPKGSKNKGAAKTRKTTTTPGRKPRGRPKKLEKEEEEGISQESSEEEQ


3) Verify success

Compare the translated protein from your code to the true translation (stored as `true_translation`). Return True if they match (they should).


In [None]:
protein == true_translation

True
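If the comparison ever returns False, it helps to know where the two strings diverge. The helper below is a hypothetical debugging aid (the name `first_mismatch` is not part of the exercise) that reports the first position at which two sequences differ.

```python
def first_mismatch(seq_a, seq_b):
    """Return the index of the first difference, or None if the strings are equal."""
    for ind, (char_a, char_b) in enumerate(zip(seq_a, seq_b)):
        if char_a != char_b:
            return ind
    # One string may simply be longer than the other
    if len(seq_a) != len(seq_b):
        return min(len(seq_a), len(seq_b))
    return None

print(first_mismatch('PHSGS', 'PHSGS'))      # None
print(first_mismatch('PHSGS', 'PHSGR'))      # 4
print(first_mismatch('PHSGS', 'PHSGSSTOP'))  # 5
```

In the notebook, it could be called as `first_mismatch(protein, true_translation)`.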