# Problem

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a
given DNA string implies six total reading frames, or ways in which the same region of DNA can be
translated into amino acids: three reading frames result from reading the string itself, whereas
three more result from reading its reverse complement.
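As a rough illustration of the six reading frames described above, the sketch below enumerates them for a short DNA string. The names `reverse_complement` and `six_frames` are illustrative helpers chosen here, and the input is assumed to contain only uppercase `A`, `C`, `G`, `T`.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Return the six reading frames, each trimmed to a whole number of codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frame = strand[offset:]
            # Drop the trailing 1 or 2 bases that do not form a full codon.
            frames.append(frame[: len(frame) - len(frame) % 3])
    return frames


for frame in six_frames("ATGCCGTA"):
    print(frame)
```

The first three frames come from the string itself at offsets 0, 1, and 2; the last three come from its reverse complement at the same offsets.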

An open reading frame (ORF) is a reading frame that begins with a start codon and ends with a stop
codon, with no other stop codons in between. Thus, a candidate protein string is derived by
translating an open reading frame into amino acids until a stop codon is reached.
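One compact way to express this definition is a regular expression with a lookahead, so that overlapping ORFs are all reported. This is a sketch of an alternative technique, not the approach used in the cells of this notebook. The names `DNA_CODONS`, `ORF_PATTERN`, and `candidate_proteins` are assumptions made for illustration, and the codon table is the standard genetic code written over DNA bases.

```python
import re

# Standard genetic code in compact form: bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODONS = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The lookahead makes every match zero-width, so overlapping ORFs are found.
# The lazy quantifier stops at the first in-frame stop codon after the ATG.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def candidate_proteins(dna):
    """Return the set of distinct proteins encoded by ORFs on either strand."""
    complement = str.maketrans("ACGT", "TGCA")
    proteins = set()
    for strand in (dna, dna.translate(complement)[::-1]):
        for match in ORF_PATTERN.finditer(strand):
            orf = match.group(1)[:-3]  # drop the stop codon itself
            codons = (orf[i:i + 3] for i in range(0, len(orf), 3))
            proteins.add("".join(DNA_CODONS[codon] for codon in codons))
    return proteins


print(sorted(candidate_proteins("ATGAAATAG")))
```

Because the pattern is anchored at every `ATG` rather than at a fixed offset, a single scan per strand covers all three frames of that strand.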

<font color="green">Given</font>: A DNA string $s$ of length at most 1 kbp in FASTA format.

<font color="green">Return</font>: Every distinct candidate protein string that can be translated
from ORFs of $s$. Strings can be returned in any order.

### Sample Dataset

```
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
```

### Sample Output

```
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
```

In [2]:
def read_fasta(file):
    """
    Reads a fasta file and returns a dictionary with the name of the sequence as key and the
    nucleotide sequence as value.

    Parameters
    ----------
        file : str
            Path to the fasta file

    Returns
    -------
        fasta : dict
            Dictionary with the name of the sequence as key and the nucleotide sequence as value

    Examples
    --------
    >>> read_fasta("data/sequence.fasta")
    {'seq1': 'ACGTGAGCTAGC', 'seq2': 'ACGTGAGCTAGC'}
    """
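    fasta = {}
    with open(file, "r") as f:
        # splitting on ">" leaves an empty chunk before the first header, so skip blank records
        for record in f.read().split(">"):
            if not record.strip():
                continue
            lines = record.splitlines()
            name = lines[0].strip()
            nuc = "".join(line.strip() for line in lines[1:])
            fasta[name] = nuc
    return fasta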
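FASTA text can also be parsed line by line, which keeps multi-line sequences together and never produces an empty record. The sketch below works on an in-memory string so it can be tried without a file; `parse_fasta_text` is a hypothetical helper name, and the whole header line after `>` is assumed to be the record name.

```python
def parse_fasta_text(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines belong to the most recent header.
            records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = ">Rosalind_99\nAGCC\nATGT\n\n>seq2\nAC\n"
print(parse_fasta_text(example))
```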

In [3]:
CODONS = {
    "UUU": "F", "CUU": "L", "AUU": "I", "GUU": "V",
    "UUC": "F", "CUC": "L", "AUC": "I", "GUC": "V",
    "UUA": "L", "CUA": "L", "AUA": "I", "GUA": "V",
    "UUG": "L", "CUG": "L", "AUG": "M", "GUG": "V",
    "UCU": "S", "CCU": "P", "ACU": "T", "GCU": "A",
    "UCC": "S", "CCC": "P", "ACC": "T", "GCC": "A",
    "UCA": "S", "CCA": "P", "ACA": "T", "GCA": "A",
    "UCG": "S", "CCG": "P", "ACG": "T", "GCG": "A",
    "UAU": "Y", "CAU": "H", "AAU": "N", "GAU": "D",
    "UAC": "Y", "CAC": "H", "AAC": "N", "GAC": "D",
    "UAA": "Stop", "CAA": "Q", "AAA": "K", "GAA": "E",
    "UAG": "Stop", "CAG": "Q", "AAG": "K", "GAG": "E",
    "UGU": "C", "CGU": "R", "AGU": "S", "GGU": "G",
    "UGC": "C", "CGC": "R", "AGC": "S", "GGC": "G",
    "UGA": "Stop", "CGA": "R", "AGA": "R", "GGA": "G",
    "UGG": "W", "CGG": "R", "AGG": "R", "GGG": "G"
}

def reverse_complement(s):
    # complement every base in one pass, then reverse the string
    complement = str.maketrans("ACGT", "TGCA")
    return s.translate(complement)[::-1]

def transcribe(s):
    return s.replace('T', 'U')

def orf_trim(dna):
    # trim the dna sequence to be a multiple of 3
    return dna[:len(dna) - (len(dna) % 3)]

def translate(s, codon_table=CODONS):
    protein = ''
    for i in range(0, len(s), 3):
        codon = s[i:i+3]
        protein += codon_table[codon]
    return protein

def find_orfs(s, codon_table=CODONS):
    s = orf_trim(s)
    # find start, stop codons
    starts = []
    stops = []
    for i in range(0, len(s), 3):
        codon = s[i:i+3]
        if codon == 'AUG':
            starts.append(i)
        elif codon_table[codon] == "Stop":
            stops.append(i)
    return starts, stops

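def find_next_stop(start, stops):
    # return the position of the first stop codon after start, or None if there is none;
    # sorted() leaves the caller's list unmodified
    for stop in sorted(stops):
        if stop > start:
            return stop
    return None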

def extract_orfs(s, codon_table=CODONS):
    # pass the codon table through so a custom table is used for stop detection too
    starts, stops = find_orfs(s, codon_table)
    # translate whatever is between a start and the next stop codon
    proteins = []
    for start in starts:
        stop = find_next_stop(start, stops)
        if stop is None:
            continue
        mrna = s[start:stop]
        protein = translate(mrna, codon_table)
        proteins.append(protein)
    return proteins

In [4]:
dna = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"

In [5]:
def six_frame_translation(s):
    reverse = reverse_complement(s)
    mrna_forward = transcribe(s)
    mrna_reverse = transcribe(reverse)

    # print(mrna_forward)
    # print(mrna_reverse)
    
    forward_frame_1 = extract_orfs(mrna_forward)
    forward_frame_2 = extract_orfs(mrna_forward[1:])
    forward_frame_3 = extract_orfs(mrna_forward[2:])
    reverse_frame_1 = extract_orfs(mrna_reverse)
    reverse_frame_2 = extract_orfs(mrna_reverse[1:])
    reverse_frame_3 = extract_orfs(mrna_reverse[2:])

    return forward_frame_1, forward_frame_2, forward_frame_3, reverse_frame_1, reverse_frame_2, reverse_frame_3

def unique_proteins(s):
    orfs = six_frame_translation(s)
    proteins = set()
    for frame in orfs:
        for protein in frame:
            proteins.add(protein)
    
    for protein in proteins:
        print(protein)
    
    # return proteins

In [6]:
unique_proteins(dna)

MGMTPRLGLESLLE
MLLGSFRLIPKETLIQVAGSSPCNLS
MTPRLGLESLLE
M


In [7]:
fasta = read_fasta("../files/rosalind_orf.txt")

In [8]:
unique_proteins(fasta["Rosalind_5776"])

MSLGGARIFRGANSVLKQVLVCQPTRSYVMLTG
MTQYPWHTDRY
M
MTLMNVVTG
MFYFQIRFSYFQMVTSRKSVREAKL
MSCCPQGSDRMAPMGSSERVSV
MFVICLRHRYSTINYPLWCFATRLRSGVFRASLTDAFEPNKTG
MAHRQILMLLVCCCLTGLIRFESVC
MLQ
MAPMGSSERVSV
MLTG
MSCCNSL
MAQADYKHPRAGRLQFTTFR
MNVVTG
MLC
MGSSERVSV
MLLALAPAV
MVL
MLLVCCCLTGLIRFESVC
MLPTGI
MVTSRKSVREAKL
MSHESVLTLAVVFR
