DNA Translation
===============

If you haven't done the "DNA String" or the "DNA Dictionary" exercises yet,
you probably should do those exercises before attempting this one.

Sequences of DNA are frequently represented by strings of letters corresponding
to the bases:

    "A" is adenine
    "C" is cytosine
    "G" is guanine
    "T" is thymine

A gene encodes a protein by specifying the amino acids that compose it via
groups of 3 bases (called "codons").  Each codon corresponds to an amino acid,
or a special "start" or "stop" sequence.
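
To make the idea of codons concrete, here is a minimal sketch that cuts a short, made-up DNA string into groups of 3 bases. The helper name `split_codons` and the example string are illustrative assumptions, not part of the exercise.

```python
def split_codons(dna):
    """Split a DNA string into consecutive 3-base codons.

    Any 1 or 2 leftover bases at the end are dropped, because they
    do not form a complete codon.
    """
    usable_length = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable_length, 3)]


# Hypothetical 11-base example: three full codons plus two leftover bases.
codons = split_codons("ATGTCGGAAAC")
print(codons)
```

Reading always starts at the first base here; choosing a different starting offset changes which codons you get.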

In the standard genetic code, the sequence "ATG" marks the start of the
encoding of the protein (and also encodes the amino acid methionine).
The start codon does not necessarily begin at an offset that is a multiple
of 3 bases into the DNA sequence. The three codons "TAA", "TAG" and "TGA" are
stop codons and indicate that the protein is finished.
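
The point about the start codon's offset can be illustrated with a small sketch on a made-up sequence. The names `toy_dna`, `STOP_CODONS` and `first_stop_in_frame` are assumptions introduced only for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_in_frame(dna, start):
    """Return the index of the first stop codon reachable from `start`
    in steps of 3 bases, or -1 if there is none."""
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return i
    return -1


# Hypothetical sequence: two bases of padding, then ATG TCG TAA, then padding.
toy_dna = "CCATGTCGTAAGG"

start = toy_dna.find("ATG")      # index of the first start codon
frame = start % 3                # offset of the reading frame
stop = first_stop_in_frame(toy_dna, start)

print(start, frame, stop)
```

Here the start codon sits at index 2, so the codons that matter are the ones at indices 2, 5, 8 and so on, not the ones at 0, 3, 6.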

In the code below there is a dictionary ``codon_table`` that maps codons to
their corresponding amino acid abbreviations.

In [None]:
codon_table = {
    'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
    'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
    'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',

    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',

    'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',

    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}
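
As a cross-check on the table above, the same mapping can be generated from a 64-letter string, because the table lists codons in a regular order (bases taken in the order T, C, A, G for each position). This is only a sketch; the names `BASES`, `AMINO_ACID_ORDER` and `generated_table` are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"

# One amino acid letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACID_ORDER = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)

# product(..., repeat=3) varies the last base fastest, matching the order above.
generated_table = {
    first + second + third: amino_acid
    for (first, second, third), amino_acid
    in zip(product(BASES, repeat=3), AMINO_ACID_ORDER)
}

print(len(generated_table))
print(generated_table["ATG"], generated_table["TGG"], generated_table["TAA"])
```

Comparing `generated_table == codon_table` is a quick way to catch a typo in a hand-written table.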

In this example, we will look at a genetic sequence from the human genome that
encodes the [histone cluster 1, H1b](http://www.ncbi.nlm.nih.gov/nuccore/NM_005322) protein.

In [None]:
sequence = (
    "AACCTGCTCTTTAGATTTCGAGCTTATTCTCTTCTAGCAGTTTCTTGCCACCATGTCGGAAACCGCTCCT" +
    "GCCGAGACAGCCACCCCAGCGCCGGTGGAGAAATCCCCGGCTAAGAAGAAGGCAACTAAGAAGGCTGCCG" +
    "GCGCCGGCGCTGCTAAGCGCAAAGCGACGGGGCCCCCAGTCTCAGAGCTGATCACCAAGGCTGTGGCTGC" +
    "TTCTAAGGAGCGCAATGGCCTTTCTTTGGCAGCCCTTAAGAAGGCCTTAGCGGCCGGTGGCTACGACGTG" +
    "GAGAAGAATAACAGCCGCATTAAGCTGGGCCTCAAGAGCTTGGTGAGCAAGGGCACCCTGGTGCAGACCA" +
    "AGGGCACTGGTGCTTCTGGCTCCTTTAAACTCAACAAGAAGGCGGCCTCCGGGGAAGCCAAGCCCAAAGC" +
    "CAAGAAGGCAGGCGCCGCTAAAGCTAAGAAGCCCGCGGGGGCCACGCCTAAGAAGGCCAAGAAGGCTGCA" +
    "GGGGCGAAAAAGGCAGTGAAGAAGACTCCGAAGAAGGCGAAGAAGCCCGCGGCGGCTGGCGTCAAAAAGG" +
    "TGGCGAAGAGCCCTAAGAAGGCCAAGGCCGCTGCCAAACCGAAAAAGGCAACCAAGAGTCCTGCCAAGCC" +
    "CAAGGCAGTTAAGCCGAAGGCGGCAAAGCCCAAAGCCGCTAAGCCCAAAGCAGCAAAACCTAAAGCTGCA" +
    "AAGGCCAAGAAGGCGGCTGCCAAAAAGAAGTAGGAAGCTGGCGTGTGAAAACCGCAACAAAGCCCCAAAG" +
    "GCTCTTTTCAGAGCCACCCA"
)

Question 1
----------

Write Python code that:

a. Finds the first start codon in the sequence (Hint: remember what you did
   in the "DNA String" exercise).

b. Loops over the codons, building a string of the abbreviations of the
   protein's amino acids (e.g. the protein should start with "MSETAPA...").
   
c. Stops when it reaches a stop codon.

d. Prints out the amino acid string.

In [None]:
# your code goes here

Solution:

In [None]:
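# find the first start codon; str.find returns -1 if there is none
location = sequence.find('ATG')

amino_acids = ""

# read one codon (3 bases) at a time, stopping at a stop codon
# or when fewer than 3 bases remain in the sequence
while 0 <= location <= len(sequence) - 3:
    codon = sequence[location:location + 3]
    amino_acid = codon_table[codon]
    if amino_acid == '*':
        break
    amino_acids += amino_acid
    location += 3

print(amino_acids)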

Question 2
----------

Print the number of amino acids in the protein.

In [None]:
# your code goes here

Solution:

In [None]:
print("The number of amino acids is", len(amino_acids))

Question 3
----------

The dictionary `amino_acid_table` below maps the abbreviations to their
full names.  Using the string of amino acid abbreviations from Question 1,
print out, for each amino acid in the table, its full name and whether or
not it is used by the protein.

In [None]:
amino_acid_table = {
    'A': "alanine",
    'C': "cysteine",
    'D': "aspartic acid",
    'E': "glutamic acid",
    'F': "phenylalanine",
    'G': "glycine",
    'H': "histidine",
    'I': "isoleucine",
    'K': "lysine",
    'L': "leucine",
    'M': "methionine/start",
    'N': "asparagine",
    'P': "proline",
    'Q': "glutamine",
    'R': "arginine",
    'S': "serine",
    'T': "threonine",
    'V': "valine",
    'W': "tryptophan",
    'Y': "tyrosine",
    '*': "stop",
}

In [None]:
# your code goes here

Solution:

In [None]:
for code, amino_acid in amino_acid_table.items():
    # '*' marks a stop codon, not an amino acid, so skip it
    if code == '*':
        continue
    if code in amino_acids:
        print(amino_acid + " is in protein")
    else:
        print(amino_acid + " is not in protein")
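
An alternative way to answer the same question is with sets, which make the "used" and "unused" groups explicit. The sketch below uses the first few amino acids of the protein as a stand-in; `protein_prefix` and `ALL_CODES` are names assumed for this example only.

```python
# One-letter codes of the 20 standard amino acids (stop marker '*' excluded).
ALL_CODES = set("ACDEFGHIKLMNPQRSTVWY")

# Stand-in for the full amino acid string: just the first seven residues.
protein_prefix = "MSETAPA"

used = set(protein_prefix)      # codes that appear at least once
unused = ALL_CODES - used       # codes that never appear

print("used:  ", sorted(used))
print("unused:", sorted(unused))
```

Set membership tests do not depend on the length of the protein, whereas `code in amino_acids` on a string scans the string each time. For a protein of this size the difference is negligible.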

Bonus
-----

Because most amino acids have multiple codons which can produce them, there
are many different sequences that will potentially produce this protein.
Compute how many there are.

In [None]:
# your code goes here

Solution:

In [None]:
# build a table of amino acid -> number of codons
codon_count = {code: 0 for code in amino_acid_table}
for amino_acid in codon_table.values():
    codon_count[amino_acid] += 1

# now for each amino acid, multiply together the choices
combinations = 1
for amino_acid in amino_acids:
    combinations *= codon_count[amino_acid]

# aren't you glad Python handles arbitrarily long integer values :)
print("There are {0} different ways of encoding this protein".format(combinations))
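
The same calculation can be written more compactly with `collections.Counter` and `math.prod` (available from Python 3.8). This is a sketch on a short stand-in protein rather than the full one; the function name `count_encodings` and the compact table construction are assumptions made so the example runs on its own.

```python
from collections import Counter
from itertools import product
from math import prod

# Rebuild the standard codon table compactly so this example is self-contained.
amino_acid_order = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
table = {
    "".join(bases): amino_acid
    for bases, amino_acid in zip(product("TCAG", repeat=3), amino_acid_order)
}


def count_encodings(protein, codon_table):
    """Number of distinct codon sequences that translate to `protein`."""
    codons_per_amino_acid = Counter(codon_table.values())
    return prod(codons_per_amino_acid[amino_acid] for amino_acid in protein)


# "MSETAPA": M has 1 codon, S has 6, E has 2, and T, A, P have 4 each.
example_count = count_encodings("MSETAPA", table)
print(example_count)
```

Like the solution above, this counts only the amino acid codons and ignores the choice of stop codon.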


Note
----

If you need to do this sort of bioinformatics manipulation, the "Biopython"
library provides sequence translation and much more.
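
As a rough illustration, translating a short fragment with Biopython might look like the sketch below. It assumes Biopython is installed (for example via `pip install biopython`); the fragment is the first six codons of the protein with a stop codon appended for illustration, and the import is guarded so the cell still runs without the library.

```python
# Assumption: Biopython is installed; fall back gracefully if it is not.
try:
    from Bio.Seq import Seq
except ImportError:
    Seq = None

# First six codons of the coding region, plus an illustrative stop codon (TAG).
fragment = "ATGTCGGAAACCGCTCCTTAG"

if Seq is not None:
    # to_stop=True ends translation at the first stop codon and omits it.
    protein = str(Seq(fragment).translate(to_stop=True))
    print(protein)
else:
    protein = None
    print("Biopython is not installed")
```

Note that `translate` reads from the first base of the sequence it is given, so finding the start codon is still up to you.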


Copyright 2008-2016, Enthought, Inc.  
Use only permitted under license.  Copying, sharing, redistributing or other unauthorized use strictly prohibited.  
http://www.enthought.com