### Intro to DNA Translation

We can think of DNA as a one-dimensional string of characters
drawn from a four-letter alphabet.
These characters are A, C, G, and T, the first letters of the four nucleotides used to construct DNA.

The full names of these nucleotides are
- adenine
- cytosine
- guanine
- thymine

Each three-character sequence of nucleotides,
sometimes called a nucleotide triplet or codon, corresponds to one amino acid.
The sequence of amino acids is unique for each type of protein,
and all proteins are built from the same set of just 20 amino acids
for all living things.
Protein molecules dominate the behavior of the cell,
serving as structural supports, chemical catalysts, molecular motors, and so on.
The so-called central dogma of molecular biology
describes the flow of genetic information in a biological system.

Instructions in the DNA are first transcribed into RNA
and the RNA is then translated into proteins.
We can think of DNA, when read as sequences of three letters,
as a dictionary of life.

In this case study, we will
- download a DNA strand as a text file from a public web-based repository of DNA sequences
- write code to translate the DNA sequence to a sequence of amino acids, where each amino acid is represented by a unique letter
- download the amino acid sequence to check our solution

The input to our program is going to be a DNA sequence
written in a four-letter alphabet.
We then read this sequence three letters at a time,
translate each triplet to a single letter
that stands for a specific amino acid, and then proceed
to the next set of three letters.
We do this until we have reached the end of the input sequence.
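
The three-letters-at-a-time reading described above can be sketched as follows. This is only a minimal illustration, not the translation code itself; the helper name `split_codons` is hypothetical and is not used elsewhere in this case study.

```python
def split_codons(seq):
    """Return the complete three-letter codons of seq, in order."""
    # Ignore any trailing one or two letters that do not form a full triplet.
    usable = len(seq) - len(seq) % 3
    # Step through the string three characters at a time.
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print(split_codons("ATGTCTACT"))  # ['ATG', 'TCT', 'ACT']
```

Each element of the returned list is one triplet that would then be looked up and replaced by a single amino acid letter.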

### 4 Tasks in this case study
- Manually download DNA and Protein sequence data
- Import the DNA data into Python
- Create an algorithm to translate the DNA
- Check if the translation matches the download

#### Downloading DNA data

The NCBI is the National Center for Biotechnology Information,
and it is the United States' main public repository of DNA and related
information.
We will download two files from it: the first is a strand of DNA and the second
is the corresponding protein sequence of amino acids translated from this DNA.

Go to the NCBI website. Specify that we are searching in Nucleotide, then type `NM_207618.2` in the search bar.

At the top of the page, click on `FASTA`, then copy the sequence.

Make sure you include the very first letter, which is G,
and the very last letter, which is T.
Save it to a file named `dna.txt`.

Go back to the page where you clicked on `FASTA` and click on `CDS`.

You will now see the translation at the bottom.

Copy and paste it into a new file named `protein.txt`.

### Importing DNA Data into Python

In [1]:
#this is to check the python working directory
#pwd

In [2]:
inputfile="dna.txt"
f=open(inputfile,"r")
seq=f.read()
f.close() #close the file once its contents have been read
seq

'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\nGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\nCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\nTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\nCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\nAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\nACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\nGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\nTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\nGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\nCCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\nTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\nGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\nTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGA

We can see `\n` in the data above, but if we print it, the line breaks are rendered instead.

In [3]:
print(seq)

GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA
GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT
CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT
TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT
CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG
AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA
ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA
GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT
TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA
GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA
CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT
TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT
GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG
TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT
GCTAAT

The `\n` characters are line breaks, not nucleotides, so we use the `replace` method to remove all of them.

In [4]:
seq=seq.replace("\n","")
seq

'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACT

The extra line breaks are gone.
Sometimes another character, the carriage return `\r`, may be hiding in a string,
and depending on your computer, it may or may not be visible.
Just to be on the safe side, let's remove that as well.
There is no harm in running this extra step,
because if the string does not contain the extra character, nothing happens.

In [5]:
seq=seq.replace("\r","")
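
As an alternative sketch, both kinds of line break (and any other whitespace) can be removed in a single step, because `str.split()` called with no arguments splits on every run of whitespace. The variable names and the short sample string here are illustrative only.

```python
# A short sample containing both "\r\n" and "\n" line endings.
raw = "GGTCAG\r\nAAAAAG\nCCC"

# Split on all whitespace, then glue the pieces back together.
clean = "".join(raw.split())

print(clean)  # GGTCAGAAAAAGCCC
```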

### Translating the DNA Sequence

In [6]:
table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
}

In [7]:
table['AAC']

'N'
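
Typing 64 entries by hand is error-prone, so it may help to see that the same mapping can be generated. The sketch below assumes the conventional compact layout of the standard genetic code (bases in the order T, C, A, G, with one amino acid letter per codon) and uses `_` for stop codons to match the table above. The names `BASES`, `AMINO_ACIDS`, and `generated_table` are introduced only for this example.

```python
from itertools import product

# Base order and amino acid letters follow the compact layout of the
# standard genetic code; '_' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY__CC_W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# product(BASES, repeat=3) yields TTT, TTC, TTA, TTG, TCT, ... in the
# same order as the letters in AMINO_ACIDS.
generated_table = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(len(generated_table))    # 64
print(generated_table["AAC"])  # N
```

A quick way to gain confidence in a hand-typed table is to compare it with a generated one using `==`.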

#### Thought process
- Check that the length of the sequence is actually divisible by three. It should be, but sometimes things go wrong, so it's important to check this.
- Look up each three-letter string in our table and store the result somewhere.
- Keep doing this in a loop until you get to the end of the sequence.

In [8]:
#check seq length is divisible by three
len(seq)%3

2

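The remainder is 2, so the full strand is not a whole number of triplets: the downloaded strand extends beyond the protein-coding region, which we select later when comparing translations.

We use slicing to read the sequence three letters at a time.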

In [9]:
seq[0:3]

'GGT'

In [10]:
seq[3:6]

'CAG'

In [11]:
protein=""
if len(seq)%3==0:
    for i in range(0,len(seq),3):
        codon=seq[i:i+3] #extract the next three-letter codon
        protein+=table[codon] #append the amino acid letter for this codon

In [12]:
def translate(seq):
    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    protein=""
    if len(seq)%3==0:
        for i in range(0,len(seq),3):
            codon=seq[i:i+3] #extract the next three-letter codon
            protein+=table[codon] #append the amino acid letter for this codon
    return protein

In [13]:
translate("ATA")

'I'
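
Note that `translate` returns an empty string when the input length is not divisible by three, which can hide a mistake. A stricter variant, sketched here under the assumption that failing loudly is preferable, raises an error instead. The name `translate_strict`, its `table` parameter, and the tiny `demo_table` are hypothetical and exist only for this illustration.

```python
def translate_strict(seq, table):
    """Translate seq codon by codon; raise ValueError on incomplete input."""
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of 3")
    # Look up each triplet and join the resulting one-letter codes.
    return "".join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))


# A deliberately tiny table, for illustration only.
demo_table = {"ATG": "M", "TCT": "S", "TAA": "_"}

print(translate_strict("ATGTCTTAA", demo_table))  # MS_

try:
    translate_strict("ATGTC", demo_table)
except ValueError as err:
    print(err)  # sequence length 5 is not a multiple of 3
```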

#### Adding a docstring
Let's add a docstring to our function. A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition, and it becomes part of that object.
The docstring should summarize the behavior of the function
and document its arguments, returned values, possible side effects,
and anything else that would be important for a user
to know about the function.
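
To make "becomes part of that object" concrete, here is a small sketch showing that Python stores the docstring in the function's `__doc__` attribute. The function `gc_count` is a made-up example, not part of this case study.

```python
def gc_count(seq):
    """Return the number of G and C characters in seq."""
    return seq.count("G") + seq.count("C")


# The string literal is now attached to the function object itself.
print(gc_count.__doc__)    # Return the number of G and C characters in seq.
print(gc_count("GGTCAG"))  # 4
```

This attribute is also what `help()` reads when it displays documentation for a function.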

In [14]:
def translate(seq):
    """ Translate a string containing a nucleotide sequence into a string containing the corresponding sequence of amino acids. Nucleotides are translated in triplets using the table dictionary: each amino acid is encoded with a string of length 1."""
    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
    protein=""
    if len(seq)%3==0:
        for i in range(0,len(seq),3):
            codon=seq[i:i+3] #extract the next three-letter codon
            protein+=table[codon] #append the amino acid letter for this codon
    return protein

In [15]:
help(translate)

Help on function translate in module __main__:

translate(seq)
    Translate a string containing a nucleotide sequence into a string containing the corresponding sequence of amino acids. Nucleotides are translated in triplets using the table dictionary: each amino acid is encoded with a string of length 1.



In [16]:
table["GCC"]

'A'

In [17]:
138%13

8

In [18]:
seq[40:50]

'CCTGAAAACC'

#### Comparing your translation
`with` is a compound statement that is the preferred way to open files: it closes the file automatically when the block ends, even if an error occurs inside it.
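
The following sketch illustrates the automatic closing. It writes a throwaway file in a temporary directory so that it can run anywhere; the file name `example.txt` and the variable names are assumptions made for this example.

```python
import os
import tempfile

# Create a small throwaway file to read back.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("GGT\nCAG\n")

# Read it again; the file is closed as soon as the block ends.
with open(path, "r") as f:
    text = f.read()

print(f.closed)  # True
```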

In [19]:
def read_seq(inputfile):
    """Reads and returns input sequence with special characters removed """
    with open(inputfile,"r") as f:
        seq=f.read()
        seq=seq.replace("\n","")
        seq=seq.replace("\r","")
        return seq

In [20]:
prt= read_seq("protein.txt")
dna= read_seq("dna.txt")



According to the NCBI page, the coding sequence (CDS) spans positions 21 to 938 of the DNA strand only.
In Python, indexing starts at 0, so sequence positions 21 and 938
correspond to Python string positions 20 and 937.
And since slicing stops before the end value specified, we will use the slice `[20:938]`.
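
The conversion from 1-based, inclusive positions to a Python slice can be sketched as a small helper. The name `cds_slice` is hypothetical; it simply encodes the arithmetic described above.

```python
def cds_slice(start, end):
    """Convert 1-based inclusive positions into a Python slice object."""
    # Subtract 1 from the start; the end stays the same because
    # Python slices stop just before the end value.
    return slice(start - 1, end)


s = cds_slice(21, 938)
print(s.start, s.stop)          # 20 938
print((s.stop - s.start) % 3)   # 0
```

The second line of output confirms that the selected region is a whole number of triplets, so a slice such as `dna[cds_slice(21, 938)]` can be translated codon by codon.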

In [21]:
translate(dna[20:938])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

In [22]:
prt

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

At the very end of a protein-coding sequence, nature places what's called a stop codon. There are three stop codons, and their function is to tell whoever is reading the sequence that this is where to stop reading. It's almost like an end-of-paragraph sign.
The stop codon is not included in the downloaded protein, because it's usually not of interest.
But when we download the DNA sequence and translate it ourselves,
the stop codon is included in the translation, where it appears as the final `_`.

So we leave out the last three letters of the DNA slice, which are the stop codon, in order to match.

In [23]:
translate(dna[20:935])

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'

In [24]:
prt==translate(dna[20:935])

True

In [25]:
prt==translate(dna[20:938])[:-1]

True
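
Rather than adjusting slice indices by hand, the stop symbol can also be removed from the translated string. The sketch below assumes, as in the table used here, that stop codons are written as `_`; the helper name `trim_at_stop` is hypothetical.

```python
def trim_at_stop(protein, stop="_"):
    """Return protein up to, but not including, the first stop symbol."""
    # split(stop, 1) cuts at the first stop symbol only; if there is
    # none, the whole string is returned unchanged.
    return protein.split(stop, 1)[0]


print(trim_at_stop("MSTHD_"))  # MSTHD
print(trim_at_stop("MSTHD"))   # MSTHD
```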