# Rosalind ORF

**Given:** a DNA sequence in FASTA format

**Asked:** every distinct candidate peptide that can be translated from its open reading frames (ORFs)

## The Plan

- [x] transcribe the DNA to RNA (RNA exercise)
- [x] find all reading frames $\rightarrow$ three forward, three reverse $\rightarrow$ REVC exercise
- [x] find valid ORFs in an RNA sequence <font color="red">!!!!</font>
- [x] translate the mRNAs to peptides (PROT exercise)
- [x] make sure there are no duplicate protein sequences!

Ideas:

- much like in PROT, we might need to work with triplets
- whatever we do for a single reading frame, we will have to repeat five more times $\rightarrow$ functions?
    - function that takes RNA sequence and produces all valid peptides

## Set up environment

- [x] load helper functions
- [x] get testing data

In [1]:
from util import read_fasta, reverse_complement
from util import codons
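
The `util` module itself is not shown in this notebook. For readers following along without it, here is a minimal sketch of what the three imported names might look like. Everything in it is an assumption inferred from how the helpers are used later: `read_fasta` returns a dict of header to sequence, `reverse_complement` accepts RNA as well as DNA, and `codons` maps each stop codon to `None`.

```python
from itertools import product

def read_fasta(path):
    """Parse a FASTA file into {header: sequence} (assumed interface)."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                header = line[1:]
                records[header] = ''
            elif header is not None:
                records[header] += line
    return records

def reverse_complement(seq):
    """Reverse complement; pairs A with U if the input contains U, else with T."""
    partner = 'U' if 'U' in seq else 'T'
    table = str.maketrans('ACG' + partner, partner + 'GCA')
    return seq.translate(table)[::-1]

# Standard genetic code, codons enumerated in U, C, A, G order.
# '*' marks a stop codon and is stored as None (assumed convention).
_BASES = 'UCAG'
_AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codons = {
    ''.join(triplet): (None if aa == '*' else aa)
    for triplet, aa in zip(product(_BASES, repeat=3), _AMINO)
}
```

Storing stop codons as `None` is what lets the translation step later test `codons[codon] is None` instead of comparing against a list of stop codons.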

In [2]:
fasta_file = read_fasta('test.fa')
fasta = list(fasta_file.values())[0]

In [3]:
fasta

'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'

### 1. transcribe to RNA

In [4]:
rna = fasta.replace('T', 'U')

### 2. find all reading frames

In [5]:
def get_reading_frames(rna):
    # three forward, three backward reading frames
    rf1 = rna
    rf2 = rna[1:]
    rf3 = rna[2:]

    revc = reverse_complement(rna)
    rf4 = revc
    rf5 = revc[1:]
    rf6 = revc[2:]
    return [rf1, rf2, rf3, rf4, rf5, rf6]

In [6]:
frames = get_reading_frames(rna)
frames

['AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG',
 'GCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG',
 'CCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG',
 'CUGAGAUGCUACUCGGAUCAUUCAGGCUUAUUCCAAAAGAGACUCUAAUCCAAGUCGCGGGGUCAUCCCCAUGUAACCUGAGUUAGCUACAUGGCU',
 'UGAGAUGCUACUCGGAUCAUUCAGGCUUAUUCCAAAAGAGACUCUAAUCCAAGUCGCGGGGUCAUCCCCAUGUAACCUGAGUUAGCUACAUGGCU',
 'GAGAUGCUACUCGGAUCAUUCAGGCUUAUUCCAAAAGAGACUCUAAUCCAAGUCGCGGGGUCAUCCCCAUGUAACCUGAGUUAGCUACAUGGCU']
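
The six slices follow one pattern: two strands, offsets 0 to 2. A more compact sketch of the same idea is shown here. The `reverse_complement` in it is an RNA-only stand-in for the `util` helper (an assumption, since that module is not shown), and the short input sequence is made up.

```python
def reverse_complement(rna):
    # stand-in for util.reverse_complement, RNA alphabet only
    return rna.translate(str.maketrans('ACGU', 'UGCA'))[::-1]

def get_reading_frames(rna):
    # same order as before: three forward frames, then three reverse frames
    strands = (rna, reverse_complement(rna))
    return [strand[offset:] for strand in strands for offset in range(3)]

frames = get_reading_frames('AUGGCCUAA')
print(frames)
# ['AUGGCCUAA', 'UGGCCUAA', 'GGCCUAA', 'UUAGGCCAU', 'UAGGCCAU', 'AGGCCAU']
```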

### 3. find ORFs

In [7]:
# just to practice, use RF1
rf = frames[0]

I notice that I cannot solve this in one go; the problem is overlapping peptides. Why are they a
problem? They share a stop codon but have different start codons $\rightarrow$ one for loop over
the codons is not enough, because each start codon needs its own scan up to the stop codon.

My idea:

- [x] find all start codons
- [x] given a start codon, find the corresponding ORF
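
To make the overlap concrete, here is a toy example. The sequence and the four-entry codon table are made up for illustration and are not part of the exercise data. Two start codons lie in the same frame before a single stop codon, so one stop closes two ORFs and each start is scanned separately.

```python
# toy codon table: only the codons used below; a stop codon maps to None
toy_codons = {'AUG': 'M', 'GGG': 'G', 'ACC': 'T', 'UAA': None}
toy_rf = 'AUGGGGAUGACCUAA'  # AUG GGG AUG ACC UAA

peptides = []
# outer loop: look for start codons in this frame
for start in range(0, len(toy_rf) - 2, 3):
    if toy_rf[start:start + 3] != 'AUG':
        continue
    # inner loop: translate from this start until the shared stop codon
    peptide = ''
    for i in range(start, len(toy_rf) - 2, 3):
        amino_acid = toy_codons[toy_rf[i:i + 3]]
        if amino_acid is None:
            peptides.append(peptide)
            break
        peptide += amino_acid

print(peptides)  # ['MGMT', 'MT']
```

The shorter peptide is a suffix of the longer one, which is the pattern that shows up again in the real results.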

In [8]:
def find_AUG(rf):
    starting_positions = []
    for i in range(0, len(rf), 3):
        codon = rf[i:i+3]
        if codon == 'AUG':
            starting_positions.append(i)
    return starting_positions

In [9]:
starts = find_AUG(rf)
starts

[24, 30, 75]

In [10]:
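def find_ORF_and_translate(rf, start, codons):
    # translate codon by codon from `start`; stop codons map to None in `codons`
    peptide = ''
    # stop at len(rf) - 2 so an incomplete trailing codon is never looked up
    for i in range(start, len(rf) - 2, 3):
        codon = rf[i:i+3]
        if codons[codon] is None:
            return peptide
        peptide += codons[codon]
    # ran off the end without meeting a stop codon: not a valid ORF
    return None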

Negative test: this sequence contains a start codon but no stop codon, so our function should
return `None`.

In [11]:
neg = 'AUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA'
peptide = find_ORF_and_translate(neg, 0, codons)
print(peptide)

None
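
A positive control and an edge case complement the negative test. The sketch below uses a hypothetical `translate_from`, which mirrors `find_ORF_and_translate` with the loop bounded at `len(rf) - 2`, so a trailing one- or two-nucleotide fragment is never looked up. The tiny codon table is a stand-in for `util.codons`, and both sequences are made up.

```python
# stand-in for util.codons, reduced to the codons used here (stop -> None)
mini_codons = {'AUG': 'M', 'GGG': 'G', 'UAA': None}

def translate_from(rf, start, codons):
    # hypothetical name; same logic as find_ORF_and_translate, bounded range
    peptide = ''
    for i in range(start, len(rf) - 2, 3):
        amino_acid = codons[rf[i:i + 3]]
        if amino_acid is None:
            return peptide
        peptide += amino_acid
    return None

print(translate_from('AUGGGGUAA', 0, mini_codons))  # MG   (stop codon found)
print(translate_from('AUGGGGUA', 0, mini_codons))   # None (stop codon cut short)
```

Frames 2, 3, 5 and 6 are produced by slicing, so their length is often not a multiple of three; that is when the second case matters.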


In [12]:
def find_all_proteins(rf, codons):
    starts = find_AUG(rf)
    proteins = []
    for start in starts:
        peptide = find_ORF_and_translate(rf, start, codons)
        if peptide is not None:
            proteins.append(peptide)
    return proteins

In [13]:
find_all_proteins(frames[0], codons)

['MGMTPRLGLESLLE', 'MTPRLGLESLLE']

## Putting it all together

In [14]:
def all_possible_proteins(fasta_path, codons):
    # 1. read the file
    fasta_file = read_fasta(fasta_path)
    fasta = list(fasta_file.values())[0]
    # 2. transcribe to RNA
    rna = fasta.replace('T', 'U')
    # 3. get reading frames
    frames = get_reading_frames(rna)
    proteins = []
    for frame in frames:
        to_add = find_all_proteins(frame, codons)
        # make sure the new proteins are not
        # already in our list
        for prot in to_add:
            if prot not in proteins:
                proteins.append(prot)

    print('\n'.join(proteins))
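
The `prot not in proteins` check scans the whole list for every candidate. As a sketch of an alternative (not what the function above does), `dict.fromkeys` removes duplicates in one pass and keeps first-seen order, because dicts preserve insertion order in Python 3.7 and later. The peptide list is made up.

```python
# peptides collected from several reading frames, with repeats
found = ['MGMT', 'MT', 'M', 'MT', 'MGMT']

# dict keys are unique and keep insertion order, so this deduplicates in order
unique = list(dict.fromkeys(found))
print(unique)  # ['MGMT', 'MT', 'M']
```

For inputs of Rosalind size the difference is negligible; a plain `set` would also work if output order does not matter.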

## Solve example

In [15]:
all_possible_proteins('test.fa', codons)

MGMTPRLGLESLLE
MTPRLGLESLLE
M
MLLGSFRLIPKETLIQVAGSSPCNLS


## Solve exercise

In [16]:
all_possible_proteins('/Users/npapadop/Downloads/rosalind_orf.txt', codons)

MGAIGTMRDKAHNSFQSYAWLYGYYNDRCTRALTKPYELLFERVKNTHQHLILCSVSYSSPILSLDTSCCNR
MRDKAHNSFQSYAWLYGYYNDRCTRALTKPYELLFERVKNTHQHLILCSVSYSSPILSLDTSCCNR
MPVPLIGAAMAFETQPR
MAFETQPR
MSCMGPESVQIV
MGPESVQIV
MIYNSFLRAIKRSKRRDFCHSYVEYARCRTFSRQALAVFLTSTFPQEHHRPEVYIES
MLGYTVITTTDVLER
MNRPFYEFLGRPLR
MALRQDWSTCRCP
MPLFSPIS
MGLLGIIFSYL
MRAAVRLAGKL
MPQSVKNGHHN
MY
MGRPLTLDVYFWSMMFLRES
MMFLRES
MFLRES
MTEVAPF
MMPSNPMRAAARLKVPYRALPRLLKWSG
MPSNPMRAAARLKVPYRALPRLLKWSG
MRAAARLKVPYRALPRLLKWSG
MGENNGMAFETQPR
MYTSGR
MAWRSKHSPVS
MWTSPA
MFVT
MVGSSLIEY
MMSVFNTLGHGAPSNSRCILLVDDVPAGKLKSGIPLELAC
MSVFNTLGHGAPSNSRCILLVDDVPAGKLKSGIPLELAC
MNDRSRAFLNAL
MVGLRDGRKQWHGVRNTAPLANGAVFRTP
MYPATV
