# DNA Translation
Life depends on the ability of cells to store, retrieve, and translate genetic instructions.
These instructions are needed to make and maintain living organisms.
For a long time, it was not clear what molecules were able to copy and transmit genetic information.
We now know that this information is carried by deoxyribonucleic acid, or DNA, in all living things.
DNA is a discrete code physically present in almost every cell of an organism.
We can think of DNA as a one-dimensional string of characters with four characters to choose from.
These characters are A, C, G, and T. They stand for the first letters of the four nucleotides used to construct DNA.
The full names of these nucleotides are adenine, cytosine, guanine, and thymine.
Each unique three-character sequence of nucleotides, sometimes called a nucleotide triplet or codon, corresponds to one amino acid.
The sequence of amino acids is unique for each type of protein, and in all living things proteins are built from the same set of just 20 amino acids.
Protein molecules dominate the behavior of the cell, serving as structural supports, chemical catalysts, molecular motors, and so on.
The so-called central dogma of molecular biology describes the flow of genetic information in a biological system.
Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins.
We can think of DNA, when read as sequences of three letters, as a dictionary of life.
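
To make the dictionary analogy concrete, here is a minimal sketch that cuts a short DNA string into triplets and looks each one up. The sequence and the name `mini_table` are made up for illustration; the four entries shown are a small subset of the full codon table used later in this case study.

```python
# A tiny, illustrative subset of the codon table (hypothetical name).
mini_table = {"ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W"}

seq = "ATGGCCAAATGG"  # made-up example sequence

# Cut the string into consecutive, non-overlapping triplets.
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
print(codons)  # ['ATG', 'GCC', 'AAA', 'TGG']

# Look up each triplet, just like looking up words in a dictionary.
amino_acids = [mini_table[codon] for codon in codons]
print("".join(amino_acids))  # MAKW
```
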
In this case study, we will first download a DNA strand as a text file from a public web-based repository of DNA sequences.
We will then write code to translate the DNA sequence into a sequence of amino acids.

In [1]:

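def translate(seq):
    """Translate a string containing a nucleotide sequence into a string
    containing the corresponding sequence of amino acids. Nucleotides are
    translated in triplets using the table dictionary; each amino acid is
    encoded as a string of length 1."""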
    table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',}
    protein = ""
    # Check that the sequence length is divisible by 3
    if len(seq) % 3 == 0:
        # Loop over the sequence in steps of 3
        for i in range(0, len(seq), 3):
            # Extract a codon
            codon = seq[i:i+3]
            # Look up the codon and store the result
            protein += table[codon]
    return protein
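
Note that `translate` returns an empty string when the sequence length is not a multiple of 3, and raises a `KeyError` for any triplet missing from the table (for example, lowercase letters). The sketch below is one possible stricter variant, not part of the original case study. The names `translate_strict`, `CODON_TABLE`, `BASES`, and `AMINO_ACIDS` are hypothetical, and the table is built from a compact 64-character string instead of being written out by hand. It is meant to contain the same mappings as the table above.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ...
# "_" marks a stop codon, as in the table used by translate().
AMINO_ACIDS = (
    "FFLLSSSSYY__CC_W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)

# product() varies the last base fastest, which matches the order above.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_strict(seq):
    """Translate seq, raising ValueError instead of failing silently."""
    seq = seq.upper()  # assumption: lowercase input should be accepted
    if len(seq) % 3 != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of 3")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon not in CODON_TABLE:
            raise ValueError(f"unknown codon {codon!r} at index {i}")
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


print(translate_strict("ATGGCCTAA"))  # MA_
```

Raising an error makes a bad slice visible immediately, whereas an empty result can be mistaken for a valid translation.
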
            

In [2]:
def read_seq(inputfile):
    """Reads and returns the input sequence with special characters removed"""
    with open(inputfile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq
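
To see what `read_seq` does with line breaks, the following sketch writes a small made-up sequence to a temporary file and reads it back. The file contents are an assumption for illustration only; `read_seq` is repeated in condensed form so the example runs on its own.

```python
import os
import tempfile


def read_seq(inputfile):
    """Reads and returns the input sequence with line breaks removed."""
    with open(inputfile, "r") as f:
        seq = f.read()
    return seq.replace("\n", "").replace("\r", "")


# Write a made-up two-line sequence, using both Windows and Unix line endings.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"ATGGCC\r\nTAA\n")
    path = tmp.name

try:
    cleaned = read_seq(path)
finally:
    os.remove(path)  # clean up the temporary file

print(cleaned)  # ATGGCCTAA
```

When a file is opened in text mode with default settings, Python already converts `\r\n` and `\r` to `\n` while reading, so the `\r` replacement is a harmless extra safeguard.
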

In [3]:
dna = read_seq("dna.txt")
prt = read_seq("protein.txt")


On the NCBI web page https://www.ncbi.nlm.nih.gov/nuccore/NM_201917.1 , look for where it says CDS: you will see two numbers next to it, 21 and 938.
These are the positions in the sequence where the coding sequence starts and ends.
So instead of taking the entire DNA sequence, we will translate starting at position 21
and ending at 938.
However, we need to be careful with the indices. If you investigate the NCBI website, you will
see that the sequence positions are numbered from 1 to 1157.
Python indexing starts at 0, so genome positions 21 and 938 correspond to Python string positions 20 and 937. Because a slice excludes its stop index, the starting point of the string slice will be 20, but the stopping point will be 938.
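
The index conversion described above can be sketched as a small helper. The name `cds_slice` and the toy sequence are hypothetical and only illustrate the off-by-one reasoning.

```python
def cds_slice(seq, start, end):
    """Return the part of seq from 1-based position start to end, inclusive."""
    # Subtract 1 from the start to move to 0-based indexing; the end stays
    # unchanged because a Python slice excludes its stop index.
    return seq[start - 1:end]


toy = "GGATGGCCTAAGG"  # made-up sequence; positions 3..11 hold ATGGCCTAA
print(cds_slice(toy, 3, 11))  # ATGGCCTAA

# For the CDS positions quoted above, the slice covers 938 - 21 + 1 = 918
# characters, which is a multiple of 3 as translation requires.
print((938 - 21 + 1) % 3)  # 0
```
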

In [4]:
translate(dna[20:938])


'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'

There is an underscore character at the end of our translated sequence.
At the very end of a protein coding sequence, nature places what's called a stop codon.
There are three stop codons (TAA, TAG, and TGA, which our table maps to an underscore), and their function is to tell whoever is reading the sequence that this is where to stop reading. It's almost like an end-of-paragraph sign.
The stop codon is not included in the downloaded protein, because it's usually not of interest.
So we will just remove it.

In [5]:
translate(dna[20:938])[:-1]

'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'
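
Slicing with `[:-1]` always drops the last character, even when the translation does not end in a stop symbol. As a hedged alternative, the sketch below removes the trailing underscore only if it is present. The helper name `strip_stop` and the toy strings are hypothetical. In the notebook, the cleaned result could then be compared against `prt` in the same way.

```python
def strip_stop(protein):
    """Remove one trailing stop symbol ("_"), if present."""
    if protein.endswith("_"):
        return protein[:-1]
    return protein


translated = "MAKW_"  # made-up translation ending in a stop symbol
reference = "MAKW"    # made-up reference protein without the stop symbol

print(strip_stop(translated))               # MAKW
print(strip_stop(reference))                # MAKW (unchanged)
print(strip_stop(translated) == reference)  # True
```
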