# Translating RNA to protein (rosalind-PROT)

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

**Given:** An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

**Return:** The protein string encoded by s.

### Sample Dataset

```
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
```

### Sample Output

```
MAMAPRTEINSTRING
```

## Solution:

The task is to translate an RNA sequence into amino acids. We are clearly given an open reading frame, so there is no need to go searching for start codons. What we need to solve this are

1. a way to split up the RNA sequence into triplets
2. a way to map each triplet of nucleotides to a single amino acid

### 1. splitting the RNA into triplets

Let's first try to solve the first part, splitting the RNA sequence into triplets. I will be using the example sequence to practice.

In [1]:
s = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

What is the first codon?

In [2]:
s[0:3]

'AUG'

I can verify that manually: the first codon indeed occupies indices 0, 1 and 2 (the slice `s[0:3]` stops just before index 3). What is the second codon?

In [3]:
s[3:6]

'GCC'

Again, I can verify this by eye. How would I get the i'th codon?

$\rightarrow$ this might be too hard to imagine immediately, so let's try writing everything down:

In [4]:
s[0:3] # first codon, or 0th, for Python
s[3:6] # second codon, or 1st for Python
s[6:9] # third codon, or 2nd for Python

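'AUG'

The notebook only displays the value of the last expression in a cell, here the third codon, which happens to be 'AUG' again.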

The 10th codon will have 9 codons before it; since every codon is 3nt long, this means there will have been $3 \times 9 = 27$ nucleotides before we arrive at the 10th codon. The 10th codon will then also stretch for 3 nucleotides.

This means that for the i'th codon there will be $i-1$ codons before it, and therefore $(i-1) \times 3$ nucleotides. If we count from 0, like Python does, codon $i$ has $i$ codons before it, so we can use $i \times 3$ as the starting position of the i'th codon:

In [5]:
i = 1 # in Python notation
start = i * 3
end = start + 3
s[start:end]

'GCC'
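
To tie the two ways of counting together, here is a minimal sketch of a helper that takes the codon number the way I would say it out loud (starting at 1) and converts it to Python's 0-based start position. The name `nth_codon` is my own invention for illustration, not something from the exercise.

```python
def nth_codon(rna, n):
    """Return the n-th codon of rna, counting from 1."""
    start = (n - 1) * 3  # n - 1 codons of 3 nt each come before it
    return rna[start:start + 3]


s = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
print(nth_codon(s, 1))   # same slice as s[0:3]
print(nth_codon(s, 10))  # starts after 3 * 9 = 27 nucleotides, i.e. s[27:30]
```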

Then $i$ represents the current codon; how many of those will be contained in an ORF? $\rightarrow$ of course the length of the sequence divided by three, as the sequence is made up of triplets.

In [6]:
no_codons = len(s) // 3 # integer division
for i in range(no_codons):
    start = i * 3
    end = start + 3
    codon = s[start:end]
    print(codon)

AUG
GCC
AUG
GCG
CCC
AGA
ACU
GAG
AUC
AAU
AGU
ACC
CGU
AUU
AAC
GGG
UGA


Looking at the sequence verifies that we did this correctly!

Of course, now that we get how this works, we can also do it more compactly by giving `range` a step size, exploiting the fact that _every third position_ in the RNA string will be a codon start position:

In [7]:
for i in range(0, len(s), 3): # the third argument to range is the step size!
    # start = i
    # end = start + 3
    # codon = s[start:end]
    codon = s[i:i+3] # both options are equally correct; this is just more compact!
    print(codon)

AUG
GCC
AUG
GCG
CCC
AGA
ACU
GAG
AUC
AAU
AGU
ACC
CGU
AUU
AAC
GGG
UGA
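
Since we will need the codons again later, the loop can be wrapped into a small function. This is only a sketch: `split_codons` is a name I made up, and I am assuming that leftover nucleotides at the end (when the length is not a multiple of three) should simply be ignored. That matches the `len(s) // 3` version, whereas the plain `range(0, len(s), 3)` loop would hand us a final slice that is shorter than three letters.

```python
def split_codons(rna):
    """Split rna into consecutive triplets, ignoring an incomplete last one."""
    usable = len(rna) - len(rna) % 3  # drop 1 or 2 leftover nucleotides, if any
    return [rna[i:i + 3] for i in range(0, usable, 3)]


s = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
codon_list = split_codons(s)
print(len(codon_list))        # number of complete codons
print(codon_list[:3])         # the first three codons
print(split_codons("AUGGC"))  # the trailing "GC" is not a full codon
```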


### 2. replacing codons with their respective amino acids

Following the link to the codon table we see that each codon is paired up with one amino acid. We clearly have pairs of things that belong together; maybe a dictionary would be a natural way of saving this information.

But if it's a dictionary, what are the values and what are the keys? We know from the definition of dictionaries that the keys have to be unique. Each amino acid can be encoded by multiple codons, but each codon encodes exactly one amino acid. Therefore, the keys should be codons, and the values amino acids.
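
To convince ourselves that the direction matters, here is a small experiment on a four-codon excerpt of the table. The variable names are just for illustration. Building the dictionary the "wrong" way round silently loses codons, because a later entry with the same key overwrites the earlier one.

```python
# Excerpt of the codon table as (codon, amino acid) pairs.
pairs = [("UUU", "F"), ("UUC", "F"), ("UUA", "L"), ("UUG", "L")]

codon_to_aa = {codon: aa for codon, aa in pairs}  # codons as keys
aa_to_codon = {aa: codon for codon, aa in pairs}  # amino acids as keys

print(len(codon_to_aa))  # all four codons are kept
print(aa_to_codon)       # only the last codon seen per amino acid survives
```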

We can copy the codon table from the exercise, but we need to be careful: this dictionary assigns a one-letter amino acid code to each codon. This means that the stop codons will need to be handled in a special manner, since "Stop" is a four-letter word, breaking the symmetry of the rest of the table. We will leave them like that for now, but may need to revisit this later.

In [8]:
codons = {
    "UUU": "F", "CUU": "L", "AUU": "I", "GUU": "V",
    "UUC": "F", "CUC": "L", "AUC": "I", "GUC": "V",
    "UUA": "L", "CUA": "L", "AUA": "I", "GUA": "V",
    "UUG": "L", "CUG": "L", "AUG": "M", "GUG": "V",
    "UCU": "S", "CCU": "P", "ACU": "T", "GCU": "A",
    "UCC": "S", "CCC": "P", "ACC": "T", "GCC": "A",
    "UCA": "S", "CCA": "P", "ACA": "T", "GCA": "A",
    "UCG": "S", "CCG": "P", "ACG": "T", "GCG": "A",
    "UAU": "Y", "CAU": "H", "AAU": "N", "GAU": "D",
    "UAC": "Y", "CAC": "H", "AAC": "N", "GAC": "D",
    "UAA": "Stop", "CAA": "Q", "AAA": "K", "GAA": "E",
    "UAG": "Stop", "CAG": "Q", "AAG": "K", "GAG": "E",
    "UGU": "C", "CGU": "R", "AGU": "S", "GGU": "G",
    "UGC": "C", "CGC": "R", "AGC": "S", "GGC": "G",
    "UGA": "Stop", "CGA": "R", "AGA": "R", "GGA": "G",
    "UGG": "W", "CGG": "R", "AGG": "R", "GGG": "G"
}
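
Typing out 64 entries by hand invites typos, so as a cross-check here is a sketch that builds the same standard table programmatically. It relies on the usual ordering of the genetic code with the bases in the order U, C, A, G and the third position changing fastest. The name `codon_table` and the `"*"` placeholder for stop codons are my own choices, and I convert `"*"` back to `"Stop"` so the result has the same shape as the dictionary above.

```python
from itertools import product

bases = "UCAG"
# One letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ...; "*" marks a stop codon.
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codon_table = {
    "".join(triplet): ("Stop" if aa == "*" else aa)
    for triplet, aa in zip(product(bases, repeat=3), amino_acids)
}

print(len(codon_table))  # number of codons in the table
stop_codons = sorted(c for c, aa in codon_table.items() if aa == "Stop")
print(stop_codons)       # the codons that map to "Stop"
```

If both dictionaries live in the same session, comparing them with `codon_table == codons` is a quick way to spot a mistyped entry.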

Now we can simply go through our codons and write out the correct amino acid for each codon!

In [9]:
for i in range(0, len(s), 3):
    codon = s[i:i+3]
    aminoacid = codons[codon]
    print(aminoacid)

M
A
M
A
P
R
T
E
I
N
S
T
R
I
N
G
Stop


We could try making this into a word, but the "Stop" at the end will mess things up. Maybe we should check for this somehow.

In [10]:
peptide = ""
for i in range(0, len(s), 3):
    codon = s[i:i+3]
    aminoacid = codons[codon]
    if aminoacid != "Stop":
        peptide = peptide + aminoacid
print(peptide)

MAMAPRTEINSTRING


We're done!

Please note: this approach will not stop reading an RNA string when a stop codon is seen; it skips the stop codon and continues until there is no more sequence to parse.
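
If we did want translation to end at the first stop codon, one possible variation looks like this. It is a sketch rather than part of the solution above: `translate` and `mini_table` are names I made up, the function takes the codon dictionary as an argument, and I am assuming an incomplete codon at the very end should be ignored. The demo uses a tiny excerpt of the table so that the example runs on its own.

```python
def translate(rna, table):
    """Translate rna codon by codon, stopping at the first stop codon."""
    residues = []
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    for i in range(0, usable, 3):
        aminoacid = table[rna[i:i + 3]]
        if aminoacid == "Stop":
            break  # end translation here and ignore everything downstream
        residues.append(aminoacid)
    return "".join(residues)


# Tiny excerpt of the codon table, just enough for the demo string below.
mini_table = {"AUG": "M", "GCC": "A", "UGA": "Stop"}
print(translate("AUGGCCUGAGCCAUG", mini_table))  # codons after UGA are not translated
```

Collecting the letters in a list and joining them once at the end does the same job as the repeated `peptide = peptide + aminoacid`.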