In [None]:
import pandas as pd
import glob


# Code from the lab
To ensure that everyone is working with the same material, I have provided a set of common functions and code that might be useful to you as part of the homework. Feel free to build off of these in your answers. 


In [None]:
# Reading in a codon table 
codon_dict = {}
with open('codon-tables/std-code.tab', 'r') as f: #open file
    for line in f: # read through each line
        trip, aa1, code3 = line.strip().split(' ') # strip whitespace and get the codon, the one-letter code, and the three-letter code
        codon_dict[trip] = aa1 # write out a dictionary

In [None]:
# Basic translate function (converts given nucleotide sequence to amino acids)

def get_protein(sequence='AATGCTGACTCCTTCGT'): # if no sequence is provided, defaults to the tester sequence
    # The assertion verifies that the sequence contains only A, C, T, G;
    # any other character raises an AssertionError
    assert set(sequence).issubset({'A', 'C', 'T', 'G'}), "Sequence must consist only of A, C, T, G"
    protein = '' # start with an empty string that can be added to
    for i in range(0, len(sequence), 3): # step through the sequence three nucleotides at a time
        codon = sequence[i:i+3] # take a chunk of 3 starting at position i
        if len(codon) != 3: # skip a trailing chunk of 1 or 2 nucleotides so lengths that are not multiples of 3 still work
            pass
        else:
            protein += codon_dict[codon] # otherwise, look the codon up in the dictionary and append it to the protein string
    return protein


In [None]:
#Basic reverse complement function 

def reverse_complement(sequence='AATGCTGACTCCTTCGTG'):   
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} #complement dictionary 
    comp_seq = '' # initiate empty string to be added to
    for base in sequence: #loop over input sequence 
        comp_seq += complement[base] #write complement string
    rev_comp_seq = comp_seq[::-1] #reverse string   
    return(rev_comp_seq)


# Six Frame translation 
The code we wrote in lab translates a nucleotide sequence into protein space. However, a DNA sequence can be translated in a total of 6 different reading frames: three forward frames (5' --> 3') starting at nucleotide position 0, 1, or 2, and three frames on the reverse complement (3' --> 5' relative to the input) starting at position 0, 1, or 2 of the reverse-complemented sequence. Our initial function (`get_protein`) only returns 1 frame (forward, starting at position 0).

**Q1) Write a python function called `six_frame` which returns each of the 6 frames as a dictionary with the form: dict={'F0':SEQUENCE, 'F1':SEQUENCE, 'F2':SEQUENCE, 'R0':SEQUENCE, 'R1':SEQUENCE, 'R2':SEQUENCE}. Make sure that you *assert* that the input sequence only contains ACTGs. Make sure that the function can run on strings that are not divisible by 3.**  

**Hint:** You can call on the reverse complement function that we wrote previously and use the function we wrote in class (get_protein) to help you think about how to tackle this problem. 
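To make the idea of a reading frame concrete, here is a minimal sketch that only splits a sequence into codons for each forward offset; it does not translate anything and is not a solution to Q1. The helper name `codons_in_frame` and the toy sequence are hypothetical, chosen just for illustration.

```python
# Illustration only: how the starting offset changes which codons are read.
# `codons_in_frame` is a hypothetical helper, not part of the lab code.
def codons_in_frame(sequence, offset):
    """Return the complete codons of `sequence` when reading starts at `offset`."""
    usable = sequence[offset:]
    # Stop early enough that a trailing chunk of 1 or 2 nucleotides is dropped
    return [usable[i:i + 3] for i in range(0, len(usable) - 2, 3)]

toy = 'ATGGCCTAAG'  # 10 nucleotides, deliberately not divisible by 3
for offset in range(3):
    print(offset, codons_in_frame(toy, offset))
# 0 ['ATG', 'GCC', 'TAA']
# 1 ['TGG', 'CCT', 'AAG']
# 2 ['GGC', 'CTA']
```

The same slicing idea applies to the reverse frames once the sequence has been reverse complemented.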


In [None]:
def six_frame(sequence='AATGCTGACTCCTTCGTGTGATTA'):
    #Fill in code here    
    
    return()

six_frame()


## TESTING
To help you **test** your code, you can run the following assertion statements. If any of them raises an error with a `FAILED` message, something is not working in your code. If nothing is printed after you run the cell, your code is passing.

**Once your code passes all the assertion statements below, add one more assertion statement to test your code.**
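When writing your own assertion, it can help to test the failure path as well as the success path. The sketch below uses a hypothetical stand-in function, `only_dna`, that mirrors the input check in `get_protein`; it shows one possible way to confirm that invalid input is rejected, and you would adapt the idea to your own function.

```python
# Hypothetical stand-in that mirrors the input check used in get_protein
def only_dna(sequence):
    assert set(sequence).issubset({'A', 'C', 'G', 'T'}), "Sequence must consist only of A, C, T, G"
    return True

# Success path: a valid sequence should pass silently
assert only_dna('ACGT'), 'FAILED: valid sequence was rejected'

# Failure path: an invalid character should raise an AssertionError
try:
    only_dna('ACGU')
except AssertionError:
    rejected = True
else:
    rejected = False
assert rejected, 'FAILED: invalid sequence was accepted'
```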

In [None]:
TESTOUT = six_frame('CTAGCTCTCTAGCTATGATTA')
TESTOUT2 = six_frame()
assert set(TESTOUT.keys()) == {'F0','F1','F2','R0','R1','R2'}, ("FAILED: dictionary keys are missing or incorrect")
assert TESTOUT['F0']=='LAL*L*L', ('FAILED: F0 does not have correct translation')
assert TESTOUT['R2']=='IIAREL', ('FAILED: R2 does not have correct translation')
assert TESTOUT2['R1']=='NHTKESA', ("FAILED: R1 does not have correct translation OR default is not set")

# Returning one frame only
**Q2) Create a new function called `six_frame_specify` that is a modification of your `six_frame` function. This function should let users who know which translation frame they want specify it with the parameter `frame` (F0, F1, etc.) and receive only that frame as the output. If the user does not provide an input for `frame`, return the whole dictionary as in `six_frame`. If the user inputs something other than one of the keys of your dictionary (F0, F1, etc.), use an assertion statement to prompt the user to input a proper key.**

In [None]:
def six_frame_specify(sequence='AATGCTGACTCCTTCGTGTGATTA'):
    ## FILL IN YOUR CODE HERE
    return()
six_frame_specify()

## TESTING
Again, `assert` statements are provided to help you test your code. 

In [None]:
TESTOUT3 = six_frame_specify('CTAGCTCTCTAGCTATGATTA')
TESTOUT4 = six_frame_specify('CTAGCTCTCTAGCTATGATTA', frame='R0')

assert type(TESTOUT3) is dict, ('FAILED: if no frame is provided function should return a dictionary')
assert type(TESTOUT4) is str, ('FAILED: if frame is provided function should return a string')
assert TESTOUT4 == '*S*LES*', ('FAILED: incorrect translation provided for R0')


# Alternate codon dictionaries
Not all organisms use the same codon dictionary. Moreover, organelles can use a different code from the nuclear genome: vertebrate and invertebrate mitochondria each have their own code. Look at the folder `codon-tables`. I have included examples of four different codon tables: 1) the standard eukaryotic (`std-code.tab`), 2) the vertebrate mitochondrial (`vemito-code.tab`), 3) the invertebrate mitochondrial (`inmito-code.tab`), and 4) the ciliate *Mesodinium* (`meso-code.tab`).

**Q3) Create a new function called `six_frame_changetable` by copying and modifying `six_frame_specify`. This new function should allow users to specify which codon table they would like to use with the parameter `codon_table`. `codon_table` can be set to the following keywords: `std-code`, `inmito-code`, `vemito-code`, or `meso-code`, all of which correspond to the filenames in the directory `codon-tables/`. If the table is not specified, make the standard table the default. If an improper key is used, you should also default to standard. In all cases, you should PRINT which codon dictionary is used.**

Note: You should create another function called `read_in_tables` which, when provided with a *directory*, will read in all codon tables within that directory. This function can assume that only codon tables are in that directory and that all tables have the same format. Hint: You might find the `glob` module (specifically `glob.glob`) useful. You might also decide to read in the tables with pandas, but then you will likely have to modify some elements of the other function.
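As a small illustration of the hint, the sketch below shows how `glob.glob` can list the files in a directory and how a keyword such as `std-code` can be derived from each filename. It runs against a temporary directory containing empty placeholder files (an assumption made so the example is self-contained) and does not read or parse any codon table.

```python
import glob
import os
import tempfile

# Build a throwaway directory with empty placeholder files (illustration only)
with tempfile.TemporaryDirectory() as folder:
    for name in ('std-code.tab', 'meso-code.tab'):
        open(os.path.join(folder, name), 'w').close()

    # List every .tab file in the directory
    paths = sorted(glob.glob(os.path.join(folder, '*.tab')))

    # Strip the directory and the extension to get a keyword per table
    keywords = [os.path.splitext(os.path.basename(p))[0] for p in paths]

print(keywords)  # ['meso-code', 'std-code']
```

Keywords derived this way match the values that `codon_table` is expected to accept, which is one reason a dictionary keyed by filename stem may be convenient.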

In [None]:
def read_in_tables(folder='codon-tables/'):
    
    # read in all codon tables into EITHER a dictionary of dictionaries OR a pandas DataFrame
    return()

def six_frame_changetable(sequence='TAAAGGGCTGACTCCTTCGTGTGATTA'):
    ## FILL IN YOUR CODE HERE
    return()


## TESTING 

In [None]:
TESTOUT5 = six_frame_changetable('CTAGCTCTCTAGCTATGATTA', frame='F0', codon_table='std-code')
TESTOUT6 = six_frame_changetable('CTAGCTCTCTAGCTATGATTA', frame='F0', codon_table='meso-code')
TESTOUT7 = six_frame_changetable('CTAGCTCTCTAGCTATGATTA', frame='F0', codon_table='messedUP')
TESTOUT8 = six_frame_changetable('CTAGCTCTCTAGCTATGATTA', codon_table='invert-mito-code')

assert TESTOUT5 == 'LAL*L*L', ('FAILED: incorrect translation provided. Is correct codon table being used?')
assert TESTOUT6 == 'LALYL*L', ('FAILED: incorrect translation provided. Is correct codon table being used?')
assert TESTOUT7 == TESTOUT5, ('FAILED: if there is a typo should default to std code.')
assert type(TESTOUT8) is dict, ('FAILED: if no frame is specified function should return a dictionary.')


## BioPython

Great job! You created a function that can programmatically help users translate their sequences of interest. Unsurprisingly, you are not the first to write a program like this. A Python package, `BioPython`, does a fantastic job of handling biological sequences and has the functionality that we just created.

Read up on the `Seq` module in `BioPython` here: https://biopython.org/wiki/Seq. 

**Q4) Compare the output of your `reverse_complement` and `six_frame_changetable` functions with the corresponding `Seq` module functions. How do the outputs compare?**

You will want to install BioPython within your conda environment and then run the reverse complement and six frame translation function with BioPython as well as your own functions for `sequence1` and `sequence2`. 

In [None]:
sequence1 = 'CATTATCTGGGGGGGG'
sequence2 = 'CTATATTTAAAGGAAAAA'
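As a starting point for the comparison, here is a minimal sketch of the two `Seq` methods involved, `reverse_complement()` and `translate()`. It assumes BioPython is installed in your environment (the import is guarded so the cell still runs if it is not) and uses a short hypothetical toy sequence rather than `sequence1` or `sequence2`. The `table=1` argument is the NCBI ID for the standard code.

```python
# Guarded import: BioPython is an assumption about your environment
try:
    from Bio.Seq import Seq
except ImportError:
    Seq = None

if Seq is not None:
    toy = Seq('ATGGCCTAA')           # hypothetical 9-nt example, a multiple of 3
    print(toy.reverse_complement())  # TTAGGCCAT
    print(toy.translate(table=1))    # MA* (standard code, forward frame 0 only)
else:
    print('BioPython is not installed; install it in your conda environment first.')
```

Note that `translate()` covers a single frame, so reproducing all six frames still requires slicing the sequence and its reverse complement yourself.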