Problem
-------
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.
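As a rough illustration of the six reading frames described above, the sketch below splits a DNA string and its reverse complement into codons at offsets 0, 1, and 2. The helper name `six_reading_frames` and the example string are assumptions made for illustration; they are not part of the problem statement.

```python
def six_reading_frames(dna):
    """Return the six reading frames of dna as lists of codons (hypothetical helper)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    reverse = "".join(complement[base] for base in reversed(dna))
    frames = []
    for strand in (dna, reverse):
        for offset in range(3):
            frame = strand[offset:]
            # Keep only complete codons; a trailing 1- or 2-base remainder is dropped.
            codons = [frame[i:i + 3] for i in range(0, len(frame) - 2, 3)]
            frames.append(codons)
    return frames

# Made-up example string: three frames from the string, three from its reverse complement.
frames = six_reading_frames("ATGCCGTAA")
for codons in frames:
    print(" ".join(codons))
```

The first three frames come from the string itself and the last three from its reverse complement, in the order the paragraph above lists them.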

An open reading frame (ORF) is a reading frame that starts at a start codon and ends at a stop codon, with no other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.
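To make the ORF definition concrete, here is a minimal sketch that scans a single reading frame at the nucleotide level. It assumes `ATG` is the only start codon and `TAA`, `TAG`, `TGA` are the stop codons (the standard genetic code); the function name `find_orfs_in_frame` and the example strings are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def find_orfs_in_frame(frame):
    """Return nucleotide ORFs (start codon through stop codon) found in one reading frame."""
    codons = [frame[i:i + 3] for i in range(0, len(frame) - 2, 3)]
    orfs = []
    for start, codon in enumerate(codons):
        if codon != "ATG":
            continue
        # Scan forward to the first in-frame stop codon; without one, this start yields no ORF.
        for end in range(start + 1, len(codons)):
            if codons[end] in STOP_CODONS:
                orfs.append("".join(codons[start:end + 1]))
                break
    return orfs

nested = find_orfs_in_frame("ATGAAAATGCCCTAGGGG")
unterminated = find_orfs_in_frame("ATGAAACCC")
print(nested)
print(unterminated)
```

In the first example a second start codon lies inside the first ORF, so two ORFs share the same stop codon. The second example has a start codon but no stop codon, so it contributes nothing.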

**Given**: A DNA string $s$ of length at most 1 kbp in FASTA format.

**Return**: Every distinct candidate protein string that can be translated from ORFs of $s$. Strings can be returned in any order.

In [33]:
DNA_codon_table = {
    'TTT': 'F',     'CTT': 'L',     'ATT': 'I',     'GTT': 'V',
    'TTC': 'F',     'CTC': 'L',     'ATC': 'I',     'GTC': 'V',
    'TTA': 'L',     'CTA': 'L',     'ATA': 'I',     'GTA': 'V',
    'TTG': 'L',     'CTG': 'L',     'ATG': 'M',     'GTG': 'V',
    'TCT': 'S',     'CCT': 'P',     'ACT': 'T',     'GCT': 'A',
    'TCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'TCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'TCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'TAT': 'Y',     'CAT': 'H',     'AAT': 'N',     'GAT': 'D',
    'TAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'TAA': '-',     'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'TAG': '-',     'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'TGT': 'C',     'CGT': 'R',     'AGT': 'S',     'GGT': 'G',
    'TGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'TGA': '-',     'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'TGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G'
}
def readTab(infile):
    # Read a tab-delimited text file into a list of rows (each row is a list of fields).
    output = []
    with open(infile, 'r') as input_file:
        for input_line in input_file:
            input_line = input_line.strip()
            temp = input_line.split('\t')
            output.append(temp)
    return output
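def extract_fasta(fasta):
    # fasta: rows from readTab; the first field of each row holds one line of the file.
    sequences = {}  # header line -> concatenated sequence
    headers = []    # header lines in file order
    flag = ""       # header of the record currently being read
    for i in fasta:
        if i[0].startswith(">"):
            headers.append(i[0])
            flag = i[0]
            sequences[flag] = ""
        else:
            sequences[flag] = sequences[flag] + i[0]
    return sequences, headers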
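def translateDNA_protein(sequence):
    # Translate codon by codon; return the protein once a stop codon is reached.
    protein = ""
    for i in range(0, len(sequence), 3):
        codon = sequence[i:i+3]
        if len(codon) == 3:
            if DNA_codon_table[codon] == "-":
                return protein
            protein = protein + DNA_codon_table[codon]
    # No stop codon was found, so this is not an open reading frame.
    return None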
def reverse_complement(sequence):
    complement = {"A":"T","C":"G","G":"C","T":"A"}
    rev = ""
    for i in sequence[::-1]:
        rev += complement[i] 
    return rev
def extract_ORFs(sequence):
    # Collect the distinct candidate proteins from all six reading frames.
    reverse = reverse_complement(sequence)
    ORFs = []
    for i in range(3):
        # Frame i of the given strand, then frame i of its reverse complement.
        for strand in (sequence, reverse):
            seq = strand[i:]
            for j in range(0, len(seq), 3):
                if seq[j:j+3] == "ATG":
                    # None means no in-frame stop codon follows this start codon.
                    temp = translateDNA_protein(seq[j:])
                    if temp is not None and temp not in ORFs:
                        ORFs.append(temp)
    return "\n".join(ORFs)

In [21]:
sequences, headers = extract_fasta(readTab("ORF.fasta"))

In [34]:
print(extract_ORFs(sequences[headers[0]]))

MGMTPRLGLESLLE
MTPRLGLESLLE
M
MLLGSFRLIPKETLIQVAGSSPCNLS
