__This portion of the code deals with aligning the 5'-UTR portion of sequences__

Sequences must have a 5'-UTR of at least 100 nucleotides so that an alignment can be produced. The filtering steps that were included in the previous section to make Synplot2 run are not applied here or in downstream analyses.

__Mammalia Results__


The sequences that lack a conserved CUG codon are Camelus_ferus, Vombatus_ursinus, Phascolarctos_cinereus, and Monodelphis_domestica.

Sarcophilus_harrisii has the CUG codon, but the surrounding sequence is a bit odd (it is excluded from the RNA analysis below via mammal_ignore_list_extended).

Unsurprisingly, the alternative reading frame in the CDS is short for these sequences:

Camelus_ferus: 197 nucleotides
Sarcophilus_harrisii: 86 nucleotides
Vombatus_ursinus: 86 nucleotides
Phascolarctos_cinereus: 86 nucleotides
Monodelphis_domestica: 86 nucleotides

This supports the idea that the mammals that lack the CUG initiation codon also lack the long ORF.
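
The lengths above come from `determineAltFrameLength`, defined further down. As a rough illustration of the same arithmetic without Biopython, here is a minimal sketch on a made-up CDS. The helper name `alt_frame_length` and the toy sequences are assumptions for illustration only; the `+2` is carried over unchanged from the notebook's formula.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def alt_frame_length(cds):
    """Mirror determineAltFrameLength on a plain string.

    The start codon is replaced by 'ATGG', which shifts the reading frame
    by one nucleotide; codons are then counted up to the first stop codon.
    """
    shifted = "ATGG" + cds[3:]
    codons = 0
    # Walk complete codons only; a trailing partial codon is ignored
    for i in range(0, len(shifted) - 2, 3):
        if shifted[i:i + 3] in STOP_CODONS:
            break
        codons += 1
    return 3 * codons + 2

# Toy CDS (not real data): the shifted frame reads ATG GCT TAA ...
toy_length = alt_frame_length("ATGCTTAACCCGGG")
```

A short value here means the shifted frame hits a stop codon soon after the start codon, which is the pattern seen in the five species listed above.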

Monodelphis_domestica has a gap where the CUG is located in the other mammals, so I've arbitrarily chosen the next nucleotide as the start of the reference location. This special case is handled in the kozakMotif function.
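
To make the role of alignment gaps concrete, here is a small gap-aware sketch of pulling a Kozak-style context (6 nt upstream, the codon, 1 nt downstream) out of one aligned sequence. It follows the same idea as `kozakMotif` but is not that function: the name `gapped_motif` and the toy alignment row are assumptions, and a gap inside the codon itself (the Monodelphis_domestica case) is not handled.

```python
def gapped_motif(aligned, column, upstream=6):
    """Return upstream context + codon + next nucleotide, skipping gaps.

    aligned : one row of an alignment, with '-' for gaps
    column  : 1-based alignment column of the first nucleotide of the codon
    """
    start = column - 1                      # convert to a 0-based index
    codon = aligned[start:start + 3]
    # First non-gap character after the codon (the +4 position)
    downstream = next((c for c in aligned[start + 3:] if c != "-"), "")
    # Last `upstream` non-gap characters before the codon
    before = aligned[:start].replace("-", "")[-upstream:]
    return before + codon + downstream

# Toy alignment row (not real data); the CTG starts at column 10
motif = gapped_motif("TTACA-AAACTG-AT", 10)
```

Stripping the gaps before counting keeps the upstream context at six real nucleotides even when the alignment inserts gap columns next to the codon.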


__Sauropsid Results__

Most sequences do not appear to have a long ORF in the incorrect reading frame. Unlike for Mammalia, an alignment with ALL the Kozak sequences isn't yielding anything informative, so I may have to manually inspect the sequences and select the ones that are best suited for alignment.

The peptide generated from the CUG uORF of Gallus gallus aligns poorly to the Homo sapiens protein.

--> Sequences must have a 3'-UTR of 100 nucleotides or greater
--> The extension in the correct ORF must be greater than 300 nucleotides
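
The two criteria above amount to a simple filter. A minimal sketch, assuming each candidate is summarized by a UTR length and an extension length; the function name, the tuple layout, and the `species_*` entries are made up for illustration and are not results from this analysis.

```python
def passes_filters(utr_length, extension_length, min_utr=100, min_extension=300):
    """UTR of 100 nt or greater, and extension strictly greater than 300 nt."""
    return utr_length >= min_utr and extension_length > min_extension

# Hypothetical candidates: name -> (UTR length, extension length)
candidates = {
    "species_A": (150, 471),   # passes both criteria
    "species_B": (99, 800),    # UTR too short
    "species_C": (120, 300),   # extension not greater than 300
}

kept = [name for name, (utr, ext) in candidates.items() if passes_filters(utr, ext)]
```

Note the asymmetry taken from the wording of the criteria: the UTR threshold is inclusive, the extension threshold is exclusive.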

_Gallus_gallus_

-->CUG is properly offset from start codon
-->Length of ORF from CUG to TF stop codon is 1110 nucleotides

Empidonax_traillii

-->CUG is not properly offset from start codon to access frame
-->Length of start codon to supposed TF stop codon is 398 nucleotides

Neopelma_chrysocephalum
-->CUG is not properly offset from start codon to access frame
-->Length of start codon to supposed TF stop codon is 398 nucleotides

_Parus_major_
-->CUG is properly offset from start codon
-->Length of ORF from CUG to TF stop codon is 777 nucleotides

_Zonotrichia_albicollis_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 813 nucleotides

Corvus_brachyrhynchos
-->CUG is not properly offset from the start codon
-->Length of ORF from start codon to supposed TF stop codon is 536 nucleotides

_Corapipo_altera_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

_Lepidothrix_coronata_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

_Pipra_filicauda_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

Ficedula_albicollis
-->CUG is not properly offset from the start codon
-->Length of ORF from start to supposed stop codon is 353 nucleotides

_Anser_cygnoides_domesticus_
-->There are two CUGs that are properly offset from the start codon. One is 100 nucleotides away and the other is 178 nucleotides away. I assume that the one that is 100 nucleotides is likely to be the one used (or formerly used). 
-->Length of ORF from CUG to TF stop codon is 708 nucleotides
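
The "properly offset" check in the entries above can be sketched as a scan for upstream CTGs whose distance to the start codon falls in a given frame. This is only an illustration under stated assumptions: distance is counted here from the first nucleotide of the CTG to the first nucleotide of the start codon, which these notes do not spell out, and the required remainder is a parameter (`remainder=1` is chosen because both 100 and 178 leave remainder 1 when divided by 3). The function name and toy transcript are hypothetical.

```python
def upstream_ctgs(transcript, start, remainder=1):
    """Distances (in nt) of upstream CTGs whose offset from `start` is in frame.

    transcript : DNA string
    start      : 0-based index of the first nucleotide of the start codon
    remainder  : required value of distance % 3 (an assumption, see above)
    """
    hits = []
    # Only consider CTGs that end before the start codon begins
    for pos in range(0, start - 2):
        distance = start - pos
        if transcript[pos:pos + 3] == "CTG" and distance % 3 == remainder:
            hits.append(distance)
    return sorted(hits)

# Toy transcript (not real data): CTGs at index 0 and 4, ATG at index 10
toy_hits = upstream_ctgs("CTGACTGAAAATGGCC", start=10)
```

When several candidates are returned, as for Anser_cygnoides_domesticus, the choice between them still needs the kind of manual judgement described above.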


List of sauropsids that require further analysis: sauropsid_list = ['Gallus_gallus', 'Parus_major', 'Zonotrichia_albicollis', 'Corapipo_altera', 'Lepidothrix_coronata', 'Pipra_filicauda', 'Anser_cygnoides_domesticus']

In a MUSCLE alignment of the 5'-UTRs of these sauropsids, the prospective CUG initiation codons do not align with one another.

__Amphibia__

Only three organisms are analyzed. None has an overlapping extension greater than 300 nucleotides.

Xenopus_tropicalis:XM_002932235
Length into TF ORF 71 nucleotides

Xenopus_laevis:XM_018250789
Length into TF ORF 62 nucleotides

Nanorana_parkeri: XM_018571894
Length into TF ORF 38 nucleotides


__Teleost fish__

Only 3 sequences have an overlapping extension greater than 300 nucleotides

Austrofundulus_limnaeus (XM_014005514): 323 nucleotides
Contains two CUGs in the proper frame, 70 and 82 nucleotides away, but there is an intervening stop codon.

Sinocyclocheilus_rhinocerous (XM_016528575): 314 nucleotides
Contains one CUG in the proper frame, but there is an intervening stop codon

Clupea_harengus (XM_012832592): 320 nucleotides
Contains one CUG in the proper frame without an intervening stop codon
-->The surrounding sequence for this CUG is ACAAAA __CTG__ A
>Potential_protein
MKRNVAVMVTPLSTTTVPCTGAQEVAKFLGRASPGTGAQLSSEKSKPTQHPDDVCKPAQADIPRIGASIHTGRCGAEYQAPAEASALGQRSCASSRRGAEVARNVWQRH
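
Whether a candidate CUG has an "intervening stop codon" can be checked by reading codons in the CUG's own frame up to some downstream boundary. A minimal sketch on plain DNA strings; the function names, the `boundary` argument, and the toy sequences are assumptions for illustration, not the procedure actually used for the three teleost sequences above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop_after(seq, ctg_pos):
    """0-based index of the first stop codon in frame with the CTG, or None."""
    for i in range(ctg_pos + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def has_intervening_stop(seq, ctg_pos, boundary):
    """True if an in-frame stop codon starts before index `boundary`."""
    stop = first_stop_after(seq, ctg_pos)
    return stop is not None and stop < boundary

# Toy sequences (not real data), CTG at index 0, boundary at index 12
blocked = has_intervening_stop("CTGAAATAGCCCATGAAA", 0, 12)   # TAG at index 6
open_frame = has_intervening_stop("CTGAAACCCATGAAA", 0, 12)   # no stop in frame
```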

In [122]:
from Bio.Seq import Seq
from Bio import Entrez
from Bio import SeqIO
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from Bio.Align.Applications import MuscleCommandline
import os
import csv
from itertools import islice
from Bio.Emboss.Applications import TranalignCommandline

In [123]:
cwd = os.getcwd()+os.sep
muscle_executable = cwd+"muscle3.8.31_i86win32.exe"
BLAST_genbank = 'Representative_Species/tblastn_refseqrna_mammalia_queryhomosapiens_1-12-20.gb'
BLAST_textoutput = 'Representative_Species/tblastn_refseqrna_mammalia_queryhomosapiens_1-12-20.txt'
tranalign_exe = r"C:\mEMBOSS\Tranalign.exe"
tranalignseq_out = cwd + 'POLG_tranalign_output_80.fasta'
fiveUTR_100nt = cwd+'trimmed_5UTR.fasta'
aligned_fiveUTR_100nt = cwd+'aligned_trimmed_5UTR.fasta'
kozak_file = cwd+'kozak_sequence_aligned.fasta'
CUGRNAfasta = cwd + 'CUG_RNA.fasta'
CUGalignedRNA = cwd + 'CUG_RNA_aligned.clw'
CUG_RNA_aligned_fasta = cwd + 'CUG_RNA_aligned.fasta'

In [124]:
#This is the Ensembl POLG-201 transcript sequence for Gallus gallus, used because the NCBI version has the CUG in the incorrect frame
custom_POLG201_Gallus_gallus_sequence = Seq('GTCGCCCCGGAGCCCCGCGTTGCACCGCGATCCGACCCGGGCGGCGCGGTGTGGGGGCGGGGGGGCGCGTGGGGACAGCGCGGGGCTGCACGGCGCGGGGAAGGGGAGGTGCGGAGCTTTGGGGGCGCGTGCAGAGCTGTGGGGAGCGGCGGGGCGTGCCGCGTGCCTTGCAGGGGTGCCGCGTGCTTTGCAGGGGTGCCGCGTGCGTGCAGAGGTGTTGCACGCCTTGCAGGGGTGCTGCGTGCTTCGACGCAGTGGTGCGCCCCAGCTGCGCATCCCGGCGCGCCGCACACCTACGAGGTGTCGCTTTTAGCAGCGCCGGTTGAAGCTCCCGCTGCGGACCCCTCCCCTCACCGCCGCTCTCCCCCTGCAGGAGGACCCCCCCCTTATCCGGCAGCACCGCGATGCTCCGCGCGCTCCGCCGAGGCTCAGCGCCGCGCCGCGCCGCCTCCCGGCCGTGCTCCGGGCCCTCCGCGCACCGCCCGCAGCCACGCGGCGACGAGGCCGAGCCGTCGGAGCGGAGCGAGCGCCGCGTGAACCCGCTGCACATCCAGATGTTGTCCCGGAACCTCCACGAGCAGATCTTCCGCGGGGCGCCCGTGCGGCACTCGGAGGCGGCCGTGCGGCGCAGCGTCGAACACCTGCAGCGGCACGGCCTGTGGGGCCGGCACGGCCCGTCGCTGCCCGACGTGAGTTTGCGCCTGCCCCGCATGTACGGCGCCGACATCGACGAGCATTTCCGCCGCCTGGCGCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCGAGGAGCTGCTGCGCTGCCGCCTGCCCCCCGCACCACAGAGCTGGGCCCGGCAGCAGGGCTGGACGCGCTACGGCCCCGACGGGCGGCCCGAGGCGGTGGAGTGCCCGCGGGAGCGCGCGTTGGTGCTGGACGTGGAGGTGTGCGTGGCCGCCGGGCAGTGCCCCACTATGGCCGTGGCGGTGTCGCCGCACGCGTGGTACTCGTGGTGCAGCCGGCGCCTGCTGGAGCAGCGCTACTCGTGGGGCCCCCGGCTGGCGCTGCACGACCTCGTGCCTCTGGAGGGGACCGGCAGGCAGCAGGAGGGCGGCGAGAGGGTGGTGGTGGGGCACAACGTGGCTTTCGACCGCGCCTTCATCAGGGAGCAGTACCTCGTGCAGGGCTCCCGGGTGCGCTTCCTGGACACCATGAGCATGCACATGGCCATCTCGGGGCTCACGGGCTTCCAGCGCAGCCTCTGGATGGCCGCCAAGCACGGCAAGAGGAAGGGGCTGCAGCAGGTCAGGCAGCACATGAAGAAGACACGCAGCAAAGCCGAGGGGCCGGCGGTCTCTTCATGGGACTGGGTGCACGTCAGCAGCATCAACAACCTGGCAGATGTGCATGCACTGTACGTGGGAGGGGAACCGCTGCAGAAGGAGGCACGAGAGCTGTTTGTTAAGGGGACCATGGCTGACGTCAGGAATAACTTCCAGGAGCTGATGTCGTACTGTGCCAGCGATGTCCGGGCCACCTATGAGGTGTTCCAGGAGCAGCTGCCGCTCTTCATGGAGAGGTGCCCCCACCCCGTGACGTTTGCTGGGATGTTGGAGATGGGGGTGTCCTACCTGCCGGTCAACAGCAACTGGAGGAGGTACCTGGACGATGCTCAGGGCACCTATGAGGAGCTGCAGAAGGAGATGAAAAAGTCCTTGATGAACCTGGCCAACGATGCCTGCCAGCTGCTGCACGAGGACAGGTACAAGGAGGACCCCTGGCTCTGGGATCTGGAGTGGGACACGCAAGAGTTTAAGCAGAAGAAACCCGCTAAGAGGAAGAAGGATCAGAAAATAAACAGTGAAGCTTCCGAGACGGGCTCTGCTCAGGAGTGGAGGGAAGACCCCGGTCCCCCCAGCGAGGAGGAGGAGCTGAGAGCCCCCGAGAGCAGCACCTGCCTGGAGCGCCTGAAGGAG
ACGATCACACTGCAGCCCAAGAGGCTGCAGCACCTCCCGGGCCACCCGGGCTGGTACCGCAAGCTCTGCCCGCGCCTGGAGGAGGAGGGCTGGGTGCCGGGGCCCAGCCTCATCAGCCTGCAGATGCGGGTGACCCCGAAACTGATGCGCCTGGCCTGGGATGGCTTCCCTCTGCACTACTCGGAGAAGCACGGCTGGGGCTACCTGGTGCCGGGGCGGCAGGACAACCTGCCTGCAGCCTCTGCGGAGCCAGAGGGGCCTGTCTGCCCACACAGGGCGATCGAGCGGCTGTATCGGCAGCACTGCCTGCAGAGGGGCCAGGAGCAGCCCCCAGAGGAGGCTGGCGTGGAGGATGAGCTGATGGTGCTGGAGGGCAGCAGCATGTGGCAGAAGGTGGAGGAGCTGAGCCAGCTGGAGCTGGACATGGAGCGGCCGGGCAGGGCAGAGCAGAGCCAGATGCAGGATGAGGACGGGCTGCCAGAGCTGGTGGAGGAGAGCAGCCAGCCCTCATTCCACCACGGCAATGGCCCCTACAACGACGTCAACATCCCTGGATGCTGGTTCTTCAAGCTGCCCCACAAGGACGGCAATGAGAACAACGTGGGGAGCCCCTTTGCCAAGGACTTCCTGCCCCGCATGGAGGATGGCACGCTGCGGGCCACCGTGGGCCGCACCCATGGGACCAGAGCCCTGGAGATCAACAAGATGGTGTCCTTCTGGAGGAACGCTCACAAGCGGGTCAGTTCCCAGGTGGTTGTGTGGCTGAAGAAGGGGGAGCTGCCCCGTGCGGTGACCAGGCACCCGGCCTACAGCGAGGAGGAGGACTACGGGGCCATCCTGCCGCAGGTGGTGACTGCGGGTACCATCACCCGTCGGGCCGTGGAGCCCACGTGGCTGACAGCCAGCAATGCCCGGGCTGACCGTGTGGGCAGCGAGCTGAAGGCCATGGTCCAGGTGCCGCCCGGCTACTCTCTGGTGGGTGCAGATGTGGACTCCCAGGAGCTGTGGATAGCGGCGGTCCTGGGCGAGGCTCACTTTGCTGGCATGCACGGGTGCACGGCCTTCGGCTGGATGACCCTGCAAGGGAAGAAGAGCGACGGGACCGACCTGCATAGCAAGACGGCCGCCACGGTGGGCATCAGCCGGGAGCACGCCAAGGTCTTCAACTACGGGCGCATCTACGGGGCTGGGCAGCCCTTTGCCGAGCGGCTGCTGATGCAGTTCAATCACCGGCTGACACAGCAGCAGGCACGTGAGAAGGCACAGCAGATGTATGCAGTCACAAAGGGCATCCGGAGGTTTCATCTCAGCGAGGAGGGCGAGTGGCTGGTGAAGGAACTGGAGCTGGCTGTGGACAAAGCAGAAGATGGTACGGTGTCGGCCCAGGATGTGCAGAAGATCCAGAGAGAAGCCATGAGAAAGTCCCGAAGGAAGAAGAAGTGGGACGTGGTGGCTCACCGAATGTGGGCTGGAGGCACCGAGTCCGAAATGTTCAACAAGCTGGAGAGCATCGCTCTGTCCGCCTCGCCACAGACCCCGGTGCTGGGCTGTCATATCAGCAGGGCTCTGGAGCCTGCAGTGGCCAAAGGGGAGTTTCTAACCAGCAGAGTGAACTGGGTGGTGCAGAGCTCAGCTGTTGACTACCTGCACCTCATGCTGGTCTCCATGAAGTGGCTCTTTGAGGAGTATGACATAAATGGTCGCTTCTGCATCAGCATCCACGACGAGGTGCGCTACCTGGTGCAGGAGCAGGACCGCTACCGGGCAGCACTGGCCCTGCAGATCACCAACCTGCTCACACGGTGCATGTTTGCCTACAAGCTGGGCCTCCAGGATCTGCCGCAGTCCGTGGCTTTCTTCAGCGCTGTGGACATTGACCGGTGCTTAAGGAAGGAGGTGACCATGAACTGTGCGACTCCATCAAATCCAACCGGCATGGAGAAGAAGTACGGCATTCCTCGAGGAGAAGCACTGGATATATATCAGATAATTGAAATAAC
CAAAGGCTCACTGGAGAAGAAGTGATAACGTGAGAGTGCCAGAAGGTGCAAGTTGTCCAGAGAGCACACGGGAACCTGGCTGTCCTTTCAGAAGCACATACATGGCAGGGACCAATCCTGGTTGCGCCGCTTCCTTCTCGTGGTAAGAAAAAGATGTTCCTGATGAAGATTTTCATAGCAGCACATCTGAATGGGAGAGCTTGCATATTTGAATGGCTGGCAGCCAGCTTTAAGACCTGAGACACCTGACAGAGTCACTGCTTGCACACCCGTGGGGATGAAGAAAGAAGTCTTGAGTATTTGCCAGGAGACAGAATCAAATCAATCATCTGTACGTGCAGTTCTCCAAGACCAAGGTGAGGCTGCCACAGCACAGGTGCTGTAGGAGAAGGAGGTGGCAGCAGTTGCAAGCACACATTCTATTTTTTTCGCCTTCTTTTCTTTTGGGGTTCCTGGTTTTCATCTGGCTGCTCTGCTGTGCCGGACTGGAGAGAAATAGAGAGTTAAGAGTACCAAGTGTGAACGTTTGTGT')

In [125]:
#This function requires the GenBank file from a BLAST result and the text hit table of the same BLAST result. The optional
#parameter optional_cutoff is a subject accession as it appears in the 'subject acc.ver' column --> if the hits were sorted
#by something like query cover before the BLAST result was downloaded, processing stops at that accession so that only the
#hits listed above it are considered in downstream analysis. Alternatively, one could simply download manually only the
#sequences that are above a certain query cover
def processHitTable(genbank_file,text_table, optional_cutoff = ''):
    Sequence_dict = {}
    for file in SeqIO.parse(genbank_file, 'gb'):
        for feature in file.features:
                if feature.type == 'gene':
                    if 'gene' in feature.qualifiers.keys():
                        symbol = feature.qualifiers['gene']
                    if 'locus_tag' in feature.qualifiers.keys():
                        symbol = feature.qualifiers['locus_tag']
                if feature.type == 'source':
                    organism = feature.qualifiers['organism'][0].replace(" ", "_")
                    
                    #automatically should use POLG-201 transcript
                    if organism == 'Gallus_gallus':
                        Sequence_dict['Gallus_gallus'] = {}
                        Sequence_dict['Gallus_gallus']['POLG_201'] = {}
                        Sequence_dict['Gallus_gallus']['POLG_201']['nam'] = 'Ensembl_transcript_POLG-201'
                        Sequence_dict['Gallus_gallus']['POLG_201']['seq'] = (custom_POLG201_Gallus_gallus_sequence)
                        Sequence_dict['Gallus_gallus']['POLG_201']['start'] = 404
                        Sequence_dict['Gallus_gallus']['POLG_201']['end'] = 3983
                        Sequence_dict['Gallus_gallus']['POLG_201']['bit score'] = 1000000
                        
                if feature.type == 'CDS':
                    CDS = [int(a) for a in feature.location]
                    start = CDS[0]
                    end = CDS[-1]
                    accession = file.name
                    full_name = file.description
                    if organism not in Sequence_dict.keys():
                        Sequence_dict[organism] = dict()
                    Sequence_dict[organism][accession] = dict()
                    Sequence_dict[organism][accession]['nam'] = full_name
                    Sequence_dict[organism][accession]['seq'] = file.seq
                    Sequence_dict[organism][accession]['start'] = start
                    Sequence_dict[organism][accession]['end'] = end + 1
    
    final_hit_dict = {}
    with open(text_table) as f:
        reader = csv.DictReader(f, delimiter = "\t")
        for initial_row in islice(reader, 4, 5):
            header_list = str((initial_row['# tblastn'])).split('# Fields: ')[1].split(', ')
        hit_number = 0
        for row in islice(reader, 1, None):   
            hit_dict = {}
            hit_number +=1
            query_id = []
            query_id.append(str(row['# tblastn']))
            result_list = row[None]
            combined_results = query_id + result_list
            i = 0
            for item in header_list:
                hit_dict[item] = combined_results[i]
                i+=1  
            if optional_cutoff != '':
                if hit_dict['subject acc.ver'] == optional_cutoff:
                    break
            key = (hit_dict['subject acc.ver'].split('.'))[0]
            
            
            #Attach the BLAST statistics to the matching GenBank record, if there is one
            for organism in Sequence_dict:
                if key in Sequence_dict[organism]:
                    Sequence_dict[organism][key].update(hit_dict)
    return Sequence_dict
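
The hit-table half of `processHitTable` relies on `csv.DictReader` and fixed `islice` offsets to find the `# Fields:` line. As an alternative sketch, the same table can be read line by line, keyed on the `# Fields: ` prefix that the function already splits on. The function name `parse_hit_table`, the toy text, and the accession `XM_000001.1` are made up for illustration; only the `# Fields: ` prefix, the `', '` separator, and the tab-delimited rows are taken from the code above.

```python
import io

def parse_hit_table(handle):
    """Return one dict per hit row, keyed by the names on the '# Fields:' line."""
    fields = []
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("# Fields: "):
            fields = line[len("# Fields: "):].split(", ")
        elif line and not line.startswith("#"):
            # Data rows are tab-delimited, in the same order as the field names
            hits.append(dict(zip(fields, line.split("\t"))))
    return hits

# Toy hit table (not real data)
example = (
    "# TBLASTN\n"
    "# Fields: query acc.ver, subject acc.ver, bit score\n"
    "QUERY_1\tXM_000001.1\t250\n"
)
toy_hits = parse_hit_table(io.StringIO(example))
```

Because this does not depend on the number of comment lines before the header, it should tolerate small differences between BLAST versions, though that has not been checked against the files used here.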

In [126]:
def bestHitPerOrganism(hitTable):
    singleHitDict = {}
    for organism in hitTable:
        bestScore = 0.0
        final_accession = ''
        for accession in hitTable[organism]:
            #Records that never appeared in the hit table (e.g. below optional_cutoff) carry no bit score
            if 'bit score' not in hitTable[organism][accession]:
                continue
            current_score = float(hitTable[organism][accession]['bit score'])
            if current_score > bestScore:
                bestScore = current_score
                final_accession = accession
        #Skip organisms for which no scored hit was found
        if final_accession == '':
            continue
        singleHitDict[organism] = {'accession':final_accession, 'bit_score': bestScore,
                                   'sequence':hitTable[organism][final_accession]['seq'],
                                   'nam':hitTable[organism][final_accession]['nam'],
                                   'start':hitTable[organism][final_accession]['start'],
                                   'end':hitTable[organism][final_accession]['end']}
    return singleHitDict

In [127]:
hitTable = processHitTable(BLAST_genbank,BLAST_textoutput)
singleTranscriptTable = bestHitPerOrganism(hitTable)

In [128]:
def extractfiveUTR(singleTranscriptTable, UTR_size = 100):
    five_UTR_dict = {}
    for item in singleTranscriptTable:
        if singleTranscriptTable[item]['start'] > UTR_size:
            accession = singleTranscriptTable[item]['accession']
            sequence = singleTranscriptTable[item]['sequence']
            fiveUTR = sequence[0:singleTranscriptTable[item]['start']]
            five_UTR_dict[item] = {'accession':accession,'fiveUTR':fiveUTR}
    return five_UTR_dict

In [129]:
def fiveUTRTrim(five_UTR_dict, trim_size = 100):
    UTR_trim_dict = {}
    for item in five_UTR_dict:
        len_UTR = len(five_UTR_dict[item]['fiveUTR'])
        start = len_UTR-trim_size
        trimmed_UTR = five_UTR_dict[item]['fiveUTR'][start:len_UTR]
        UTR_trim_dict[item] = {'accession':five_UTR_dict[item]['accession'],'sequence':trimmed_UTR}
    return UTR_trim_dict
    

In [130]:
def trimmedfiveUTRwrite(UTR_trim_dict):
    #Writes the trimmed 5'-UTRs as FASTA to the path stored in the global fiveUTR_100nt
    with open(fiveUTR_100nt,'w') as UTR_file:
        for item in UTR_trim_dict:
            UTR_file.write('>'+item+'\n'+str(UTR_trim_dict[item]['sequence'])+'\n')
    
    

In [131]:
def determineAltFrameLength(singleTranscriptTable):
    plus1_dict = {}
    for item in singleTranscriptTable:
        sequence = singleTranscriptTable[item]['sequence']
        start = singleTranscriptTable[item]['start']
        stop = singleTranscriptTable[item]['end']
        accession = singleTranscriptTable[item]['accession']
        CDS = sequence[start:stop]
        CDS_truncateStart = CDS[3:]
        plusOne = 'ATGG' + CDS_truncateStart
        plusOneLength = (3*len(plusOne.translate(to_stop=True)))+2
        plus1_dict[item] = {'accession':accession,'+1_length_intoCDS':plusOneLength}
    return plus1_dict

In [132]:
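def kozakMotif(fiveUTRalignment_dict,reference_location):
    reference_location -=1   #convert the 1-based alignment column to a 0-based index
    kozak_dict = {}
    for item in fiveUTRalignment_dict:
        #Monodelphis_domestica has a gap where the CUG sits in the other mammals, so its
        #reference is shifted by one column. A per-sequence copy keeps the shift from
        #carrying over to the sequences processed afterwards.
        location = reference_location
        if item == 'Monodelphis_domestica':
            location +=1

        fiveUTR = fiveUTRalignment_dict[item]
        CUG = fiveUTR[location:location+3]
        motif = CUG

        #First non-gap nucleotide downstream of the CUG
        i = 0
        nextnt = fiveUTR[location+3+i:location+3+i+1]
        while nextnt == '-':
            i +=1
            nextnt = fiveUTR[location+3+i:location+3+i+1]
        motif += nextnt

        #Six non-gap nucleotides upstream of the CUG; stop at the start of the alignment
        k = 0
        j = 6
        while j > 0 and location+k > 0:
            previousnt = fiveUTR[location-1+k:location+k]
            if previousnt != '-':
                motif = previousnt + motif
                j -=1
            k -=1
        kozak_dict[item] = motif
    return kozak_dict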
    

In [133]:
def runMuscle(in_file, out_file, muscle_executable):
    muscle_cline = MuscleCommandline(muscle_executable, input=in_file, out=out_file)
    muscle_cline()
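
Newer Biopython releases deprecate the command-line wrapper classes such as `MuscleCommandline`, so `runMuscle` may need replacing on a fresh install. A minimal sketch of the same call through `subprocess`, assuming a MUSCLE 3.8 executable, which takes `-in` and `-out` (MUSCLE 5 uses different options). The function names here are hypothetical and are not part of this notebook.

```python
import subprocess

def muscle_command(muscle_executable, in_file, out_file):
    """Build the argument list for a MUSCLE 3.8 alignment run."""
    return [muscle_executable, "-in", in_file, "-out", out_file]

def run_muscle_subprocess(muscle_executable, in_file, out_file):
    """Run MUSCLE and raise CalledProcessError if it exits with an error."""
    subprocess.run(muscle_command(muscle_executable, in_file, out_file), check=True)

# Example argument list (nothing is executed here)
example_command = muscle_command("muscle3.8.31_i86win32.exe", "trimmed_5UTR.fasta", "aligned_trimmed_5UTR.fasta")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.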

In [134]:
def writeKozakMotif(kozak_motif_dict, kozak_file):
    with open(kozak_file, 'w') as file:
        for item in kozak_motif_dict:
            file.write('>'+item+'\n'+kozak_motif_dict[item]+'\n')

In [135]:
def readAlignment(in_file):
    alignment_dict = {}
    for seq_record in (SeqIO.parse(in_file, 'fasta')):
        name = seq_record.id
        seq = seq_record.seq
        alignment_dict[name] = str(seq)
    return (alignment_dict)

In [136]:
five_UTR_dict = extractfiveUTR(singleTranscriptTable)

In [137]:
UTR_trim_dict = fiveUTRTrim(five_UTR_dict)

In [138]:
trimmedfiveUTRwrite(UTR_trim_dict)

In [139]:
runMuscle(fiveUTR_100nt, aligned_fiveUTR_100nt, muscle_executable)

In [140]:
fiveUTRalignment_dict = readAlignment(aligned_fiveUTR_100nt)

In [141]:
plus1_dict = determineAltFrameLength(singleTranscriptTable)



In [142]:
#Custom number that depends on the alignment and requires manual inspection
mammalian_CUG_location = 86

In [143]:
kozak_motif_dict = kozakMotif(fiveUTRalignment_dict,mammalian_CUG_location)

In [144]:
#This will output a file with your aligned sequences with the kozak sequence
writeKozakMotif(kozak_motif_dict, kozak_file)

__Looking at RNA secondary structure between CUG and AUG start codon__

__Mammalia__

I'm going to use the same sequences that I used for the protein analysis. If the CUG is in the P-site, then the following 3 nucleotides will be in the A-site, and the 5 nucleotides after that will be in the entry tunnel of the ribosome. Thus I should only consider nucleotides that lie more than 8 nucleotides downstream of the CUG initiation site.
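
The arithmetic in the paragraph above can be sketched as follows: 3 nt in the A-site plus 5 nt in the entry tunnel gives 8 nt to skip after the CUG. The constant and function names and the toy RNA are assumptions for illustration; `getCUGRNA` below does the actual trimming on the aligned sequences.

```python
P_SITE = 3        # the CUG itself
A_SITE = 3        # the codon following the CUG
ENTRY_TUNNEL = 5  # nucleotides inside the mRNA entry tunnel

def downstream_of_ribosome(sequence, cug_index):
    """Return the part of `sequence` that is free of the initiating ribosome.

    cug_index is the 0-based index of the C of the CUG.
    """
    skip = A_SITE + ENTRY_TUNNEL          # 8 nucleotides after the CUG
    return sequence[cug_index + P_SITE + skip:]

# Toy RNA (not real data): AA | CUG | AAA | CCCCC | GGGUUU
free_rna = downstream_of_ribosome("AACUGAAACCCCCGGGUUU", 2)
```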

In [145]:
mammal_ignore_list = ['Camelus_ferus', 'Vombatus_ursinus', 'Phascolarctos_cinereus' ,'Monodelphis_domestica']
mammal_ignore_list_extended = ['Camelus_ferus', 'Vombatus_ursinus', 'Phascolarctos_cinereus' ,'Monodelphis_domestica','Sarcophilus_harrisii']

In [146]:
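def getCUGRNA(UTR_trim_dict,ignore_list,singleTranscriptTable):
    #NOTE: reads the global fiveUTRalignment_dict; singleTranscriptTable is currently unused
    CUG_RNA_dict = {}
    for item in UTR_trim_dict:
        if item in ignore_list:
            continue
        #mammalian_CUG_location = 86 is the 1-based column of the C of the CUG,
        #so index 86 onward starts at the second nucleotide of the CUG
        fiveUTRcode = fiveUTRalignment_dict[item][86:]
        #Drop the alignment gaps
        fiveUTRCDS = fiveUTRcode.replace('-', '')
        #Skip the remaining 2 nucleotides of the CUG plus the 8 nucleotides in the A-site and entry tunnel
        fiveUTRCDS = fiveUTRCDS[10:]
        CUG_RNA_dict[item] = Seq(fiveUTRCDS).transcribe()
    return CUG_RNA_dict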

In [147]:
def writeRNA(RNA_dict, outfile):
    with open(outfile,'w') as file:
        for item in RNA_dict:
            file.write('>'+item+'\n'+str(RNA_dict[item])+'\n')
        

In [148]:
def generateAlignmentFile(alignment_in, alignment_dict):
    with open(alignment_in, 'w') as alignment_file:
        for item in alignment_dict:
            alignment_file.write('>'+item+'\n'+(alignment_dict[item])+'\n')

In [149]:
CUGRNA_dict = getCUGRNA(UTR_trim_dict,mammal_ignore_list_extended,singleTranscriptTable)

In [150]:
writeRNA(CUGRNA_dict, CUGRNAfasta)

In [151]:
runMuscle(CUGRNAfasta, CUGalignedRNA, muscle_executable)

In [152]:
CUG_RNA_alignment_dict = readAlignment(CUGalignedRNA)

In [153]:
generateAlignmentFile(CUG_RNA_aligned_fasta, CUG_RNA_alignment_dict)