5/9/19

I am attempting my POLG analysis with a more rigorous method for choosing my sequences, based on the approach of Dinan et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5644247/). For this study, I will obtain sequences by performing tblastn searches against the NCBI nr/nt database using selected reference species.

--> I think it would be best to BLAST against the RefSeq RNA database rather than the nr/nt database

https://www.ncbi.nlm.nih.gov/homologene/2016 

HomoloGene claims to find orthologous sequences all the way back to S. pombe, but I will start with vertebrate clades first. If it is completely conserved there, then I should look further back

My selected reference species' sequences will be


__Homo sapiens__: NM_002693.2, corresponding protein: NP_002684.1. TaxID Mammalia (taxid 40674).

__Gallus gallus__: XM_015292047.2 is the NCBI transcript for POLG, but the Ensembl transcript POLG-201 has the upstream CUG in the correct frame. I suspect this won't change my BLAST results, so I'll use the Ensembl version for all analyses downstream of BLAST and force it into my dictionary. Corresponding protein: XP_015147533.1. TaxID Sauropsid

__Xenopus tropicalis__: XM_002932235.4, corresponding protein: XP_002932281.2. TaxID Amphibia (taxid:8292)

__Danio rerio__: XM_001921095.6, corresponding protein: XP_001921130.3. TaxID teleost fishes (taxid:32443)

I am performing tblastn searches with the protein sequence of each selected reference species, using default parameters except that I request the top 500 hits. Each search is restricted to one taxon: Mammalia (taxid:40674), sauropsid, amphibian, teleost, or vertebrate excluding mammalia, sauropsid, amphibian and teleost. The database is the RefSeqRNA database

Sequences that had 'partial mRNA' in the name were removed
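A minimal sketch of the 'partial mRNA' filter described above, assuming the hits are available as (accession, title) pairs. The function name `drop_partial` and the second accession are invented for illustration; only the first title comes from this notebook.

```python
def drop_partial(records):
    """Keep only (accession, title) pairs whose title does not mark a partial mRNA."""
    return [(acc, title) for acc, title in records
            if "partial mrna" not in title.lower()]

# Hypothetical hit list: the first entry is from this notebook, the second is invented.
hits = [
    ("XM_010955340.1",
     "PREDICTED: Camelus bactrianus polymerase (DNA directed), gamma (POLG), partial mRNA"),
    ("XM_000000000.1",
     "PREDICTED: Example species DNA polymerase gamma (POLG), mRNA"),
]
kept = drop_partial(hits)
print([acc for acc, _ in kept])
```

The comparison is case-insensitive so that 'Partial mRNA' in a title would also be caught.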

__Mammalia investigation:__

These sequences were removed from the synplot analysis because of the number of gaps they introduce into the mammalian alignment:

PREDICTED: Camelus bactrianus polymerase (DNA directed), gamma (POLG), partial mRNA
NCBI Reference Sequence: XM_010955340.1  --> This sequence is a partial mRNA sequence (the genomic assembly lacks the 5' portion of the CDS) and has thus been excluded from the synplot and CUG Kozak analyses.

These sequences were also removed because they were so gappy that synplot wouldn't run: Bison_bison_bison (XM_010841133.1), Oryctolagus_cuniculus (XM_017337563), Camelus_ferus (XM_006192570)
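The gappy sequences above were picked out by eye. A rough way to flag candidates automatically is to compute the fraction of gap characters per aligned sequence. This is only a sketch: the 0.2 threshold, the function names and the toy alignment are all assumptions, not values used in this analysis.

```python
def gap_fraction(aligned_seq):
    """Fraction of alignment columns that are gaps ('-') for one aligned sequence."""
    if not aligned_seq:
        return 0.0
    return aligned_seq.count("-") / len(aligned_seq)

def flag_gappy(alignment, max_gap_fraction=0.2):
    """Return the names whose gap fraction exceeds the (assumed) threshold."""
    return [name for name, seq in alignment.items()
            if gap_fraction(seq) > max_gap_fraction]

# Invented toy alignment, keyed by name like the dict returned by readAlignment.
toy_alignment = {
    "species_A": "ATGGCCAAGTTT",
    "species_B": "ATG---AAG---",   # 6 of 12 columns are gaps
    "species_C": "ATGGCCAAG-TT",   # 1 of 12 columns is a gap
}
print(flag_gappy(toy_alignment))
```

A flagged name could then be added to the relevant filter list by hand after checking the alignment.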

__Sauropsid investigation__

Sequences that had 'partial mRNA' in the name were removed during the BLAST search. The synplot doesn't look terribly significant, but Gallus_gallus definitely has a stop-codon-free region in the +1 frame

__Amphibian investigation__

Too few sequences are present to do a synplot analysis. Xenopus_tropicalis doesn't have a stop-codon-free region in the +1 frame.

There are only three organisms that are analyzed

Xenopus_tropicalis:XM_002932235

Xenopus_laevis:XM_018250789

Nanorana_parkeri: XM_018571894

__Teleost fish investigation__

Austrofundulus_limnaeus was filtered out due to gappiness

The synplot doesn't look significant and there isn't a stop-codon-free region in the +1 frame for Danio_rerio

10/1/19

I realized that I can get the ribosome profiling counts for POLG directly from Trips-Viz. It would be useful to compare the read counts across the different ORFs: ORF-Z, ORF-Y (split into the portion not overlapping the main CDS and the portion overlapping it), and the CDS (split the same way as ORF-Y, into non-overlapping and overlapping portions).
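A sketch of the comparison I have in mind, assuming the Trips-Viz export can be reduced to one read count per transcript position. The counts and region coordinates below are invented toy values, not the real POLG coordinates, and the function names are hypothetical.

```python
def region_counts(counts, regions):
    """Sum per-nucleotide read counts over half-open [start, end) regions."""
    return {name: sum(counts[start:end]) for name, (start, end) in regions.items()}

def reads_per_nt(total, start, end):
    """Normalise by region length so regions of different sizes are comparable."""
    return total / (end - start)

# Toy transcript of 20 nt with invented counts and invented coordinates.
counts = [0, 2, 1, 0, 3, 5, 4, 0, 1, 1, 6, 7, 8, 2, 0, 0, 9, 9, 1, 0]
regions = {
    "ORF-Z": (1, 4),
    "ORF-Y (not overlapping CDS)": (4, 10),
    "ORF-Y (overlapping CDS)": (10, 14),
    "CDS (not overlapping ORF-Y)": (14, 20),
}
totals = region_counts(counts, regions)
for name, (start, end) in regions.items():
    print(name, totals[name], round(reads_per_nt(totals[name], start, end), 2))
```

Dividing by region length matters here because the overlapping and non-overlapping portions will generally differ in size.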




In [2]:
from Bio.Seq import Seq
from Bio import Entrez
from Bio import SeqIO
from Bio.Alphabet import IUPAC
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from Bio.Align.Applications import MuscleCommandline
import os
import csv
from itertools import islice
from Bio.Emboss.Applications import TranalignCommandline

__Before running the code below to generate tranaligned files, remember to change:__

1) BLAST_genbank,

2) BLAST_textoutput,

3) and which filter list is being used in the writeCDS and writeProtein commands.

__This portion of the code deals with generating tranaligned sequences__
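For reference, the idea behind the tranalign step is to thread each CDS onto its aligned protein, so that every gap in the protein alignment becomes a three-nucleotide gap in the codon alignment. The sketch below illustrates that idea only. It is not the EMBOSS implementation, and `thread_codons` is a hypothetical name.

```python
def thread_codons(aligned_protein, cds):
    """Place the codons of `cds` under an aligned protein, writing '---' for each gap."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    residues = aligned_protein.replace("-", "")
    if len(codons) < len(residues):
        raise ValueError("CDS is shorter than the protein it should encode")
    out = []
    position = 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")
        else:
            out.append(codons[position])
            position += 1
    return "".join(out)

# Toy example: a 3-residue protein aligned with one gap, and its CDS with a stop codon.
print(thread_codons("MA-K", "ATGGCCAAGTAA"))
```

This is also why the CDS file has to be written in the same order as the protein alignment (orderCDSfile) before tranalign is run.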

In [3]:
cwd = os.getcwd()+os.sep

BLAST_genbank = 'Representative_Species/tblastn_refseqrna_mammalia_queryhomosapiens_5-9-19_80cover.gb'
BLAST_textoutput = 'Representative_Species/tblastn_refseqrna_mammalia_queryhomosapiens_5-9-19_80cover.txt'

muscle_executable = cwd+"muscle3.8.31_i86win32.exe"
output_all_CDS = cwd+"80_CDS_POLG_new.fasta"
output_protein = cwd+'80_POLG_protein.fasta'
output_all_CDS_ali = cwd+"80_POLG_muscle_out_new.clw"
ordered_CDS = cwd+'80_Ordered_CDS_POLG.fasta'
alignment_file = cwd+'80_POLG_muscle_alignment.fasta'

mammal_filter_list = ['Camelus_bactrianus','Bison_bison_bison','Oryctolagus_cuniculus','Camelus_ferus']
sauropsid_filter_list = []
teleost_filter_list = ['Austrofundulus_limnaeus']
amphibian_filter_list = []


tranalign_exe = r"C:\mEMBOSS\Tranalign.exe"
tranalignseq_out = cwd + 'POLG_tranalign_output_80.fasta'


fiveUTR_100nt = cwd+'trimmed_5UTR.fasta'
aligned_fiveUTR_100nt = cwd+'aligned_trimmed_5UTR.fasta'
kozak_file = cwd+'kozak_sequence_aligned.fasta'

CUGproteinfasta = cwd+'CUG_protein.fasta'
CUGalignedprotein = cwd+'CUG_protein_aligned.clw'
CUG_aligned_fasta = cwd+'CUG_protein_aligned.fasta'

CUGRNAfasta = cwd + 'CUG_RNA.fasta'
CUGalignedRNA = cwd + 'CUG_RNA_aligned.clw'
CUG_RNA_aligned_fasta = cwd + 'CUG_RNA_aligned.fasta'

full_5UTR_fasta = cwd + 'full_5UTR.fasta'
aligned_fiveUTR_full = cwd+'aligned_full_5UTR.fasta'

uORF_kozak_file = cwd + 'uORF_kozak_sequence_aligned.fasta'
uORFproteinfasta = cwd + 'uORF_protein_sequence.fasta'
uORFalignedprotein = cwd + 'uORF_aligned_protein.clw'
uORF_aligned_fasta = cwd + 'uORF_aligned_protein.fasta'
uORF_RNA_outfile = cwd + 'uORF_RNA_sequences.fasta'
uORF_alignedRNA = cwd + 'uORF_RNA_aligned.clw'
uORF_RNA_aligned_fasta = cwd + 'uORF_RNA_aligned.fasta'

riboseq_table = cwd + 'POLG_counts_10_1_19_ENST00000268124.csv'

In [4]:
#This is the POLG201 sequence for gallus gallus because the NCBI version has the CUG in the incorrect frame
custom_POLG201_Gallus_gallus_sequence = Seq('GTCGCCCCGGAGCCCCGCGTTGCACCGCGATCCGACCCGGGCGGCGCGGTGTGGGGGCGGGGGGGCGCGTGGGGACAGCGCGGGGCTGCACGGCGCGGGGAAGGGGAGGTGCGGAGCTTTGGGGGCGCGTGCAGAGCTGTGGGGAGCGGCGGGGCGTGCCGCGTGCCTTGCAGGGGTGCCGCGTGCTTTGCAGGGGTGCCGCGTGCGTGCAGAGGTGTTGCACGCCTTGCAGGGGTGCTGCGTGCTTCGACGCAGTGGTGCGCCCCAGCTGCGCATCCCGGCGCGCCGCACACCTACGAGGTGTCGCTTTTAGCAGCGCCGGTTGAAGCTCCCGCTGCGGACCCCTCCCCTCACCGCCGCTCTCCCCCTGCAGGAGGACCCCCCCCTTATCCGGCAGCACCGCGATGCTCCGCGCGCTCCGCCGAGGCTCAGCGCCGCGCCGCGCCGCCTCCCGGCCGTGCTCCGGGCCCTCCGCGCACCGCCCGCAGCCACGCGGCGACGAGGCCGAGCCGTCGGAGCGGAGCGAGCGCCGCGTGAACCCGCTGCACATCCAGATGTTGTCCCGGAACCTCCACGAGCAGATCTTCCGCGGGGCGCCCGTGCGGCACTCGGAGGCGGCCGTGCGGCGCAGCGTCGAACACCTGCAGCGGCACGGCCTGTGGGGCCGGCACGGCCCGTCGCTGCCCGACGTGAGTTTGCGCCTGCCCCGCATGTACGGCGCCGACATCGACGAGCATTTCCGCCGCCTGGCGCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCGAGGAGCTGCTGCGCTGCCGCCTGCCCCCCGCACCACAGAGCTGGGCCCGGCAGCAGGGCTGGACGCGCTACGGCCCCGACGGGCGGCCCGAGGCGGTGGAGTGCCCGCGGGAGCGCGCGTTGGTGCTGGACGTGGAGGTGTGCGTGGCCGCCGGGCAGTGCCCCACTATGGCCGTGGCGGTGTCGCCGCACGCGTGGTACTCGTGGTGCAGCCGGCGCCTGCTGGAGCAGCGCTACTCGTGGGGCCCCCGGCTGGCGCTGCACGACCTCGTGCCTCTGGAGGGGACCGGCAGGCAGCAGGAGGGCGGCGAGAGGGTGGTGGTGGGGCACAACGTGGCTTTCGACCGCGCCTTCATCAGGGAGCAGTACCTCGTGCAGGGCTCCCGGGTGCGCTTCCTGGACACCATGAGCATGCACATGGCCATCTCGGGGCTCACGGGCTTCCAGCGCAGCCTCTGGATGGCCGCCAAGCACGGCAAGAGGAAGGGGCTGCAGCAGGTCAGGCAGCACATGAAGAAGACACGCAGCAAAGCCGAGGGGCCGGCGGTCTCTTCATGGGACTGGGTGCACGTCAGCAGCATCAACAACCTGGCAGATGTGCATGCACTGTACGTGGGAGGGGAACCGCTGCAGAAGGAGGCACGAGAGCTGTTTGTTAAGGGGACCATGGCTGACGTCAGGAATAACTTCCAGGAGCTGATGTCGTACTGTGCCAGCGATGTCCGGGCCACCTATGAGGTGTTCCAGGAGCAGCTGCCGCTCTTCATGGAGAGGTGCCCCCACCCCGTGACGTTTGCTGGGATGTTGGAGATGGGGGTGTCCTACCTGCCGGTCAACAGCAACTGGAGGAGGTACCTGGACGATGCTCAGGGCACCTATGAGGAGCTGCAGAAGGAGATGAAAAAGTCCTTGATGAACCTGGCCAACGATGCCTGCCAGCTGCTGCACGAGGACAGGTACAAGGAGGACCCCTGGCTCTGGGATCTGGAGTGGGACACGCAAGAGTTTAAGCAGAAGAAACCCGCTAAGAGGAAGAAGGATCAGAAAATAAACAGTGAAGCTTCCGAGACGGGCTCTGCTCAGGAGTGGAGGGAAGACCCCGGTCCCCCCAGCGAGGAGGAGGAGCTGAGAGCCCCCGAGAGCAGCACCTGCCTGGAGCGCCTGAAGGAG
ACGATCACACTGCAGCCCAAGAGGCTGCAGCACCTCCCGGGCCACCCGGGCTGGTACCGCAAGCTCTGCCCGCGCCTGGAGGAGGAGGGCTGGGTGCCGGGGCCCAGCCTCATCAGCCTGCAGATGCGGGTGACCCCGAAACTGATGCGCCTGGCCTGGGATGGCTTCCCTCTGCACTACTCGGAGAAGCACGGCTGGGGCTACCTGGTGCCGGGGCGGCAGGACAACCTGCCTGCAGCCTCTGCGGAGCCAGAGGGGCCTGTCTGCCCACACAGGGCGATCGAGCGGCTGTATCGGCAGCACTGCCTGCAGAGGGGCCAGGAGCAGCCCCCAGAGGAGGCTGGCGTGGAGGATGAGCTGATGGTGCTGGAGGGCAGCAGCATGTGGCAGAAGGTGGAGGAGCTGAGCCAGCTGGAGCTGGACATGGAGCGGCCGGGCAGGGCAGAGCAGAGCCAGATGCAGGATGAGGACGGGCTGCCAGAGCTGGTGGAGGAGAGCAGCCAGCCCTCATTCCACCACGGCAATGGCCCCTACAACGACGTCAACATCCCTGGATGCTGGTTCTTCAAGCTGCCCCACAAGGACGGCAATGAGAACAACGTGGGGAGCCCCTTTGCCAAGGACTTCCTGCCCCGCATGGAGGATGGCACGCTGCGGGCCACCGTGGGCCGCACCCATGGGACCAGAGCCCTGGAGATCAACAAGATGGTGTCCTTCTGGAGGAACGCTCACAAGCGGGTCAGTTCCCAGGTGGTTGTGTGGCTGAAGAAGGGGGAGCTGCCCCGTGCGGTGACCAGGCACCCGGCCTACAGCGAGGAGGAGGACTACGGGGCCATCCTGCCGCAGGTGGTGACTGCGGGTACCATCACCCGTCGGGCCGTGGAGCCCACGTGGCTGACAGCCAGCAATGCCCGGGCTGACCGTGTGGGCAGCGAGCTGAAGGCCATGGTCCAGGTGCCGCCCGGCTACTCTCTGGTGGGTGCAGATGTGGACTCCCAGGAGCTGTGGATAGCGGCGGTCCTGGGCGAGGCTCACTTTGCTGGCATGCACGGGTGCACGGCCTTCGGCTGGATGACCCTGCAAGGGAAGAAGAGCGACGGGACCGACCTGCATAGCAAGACGGCCGCCACGGTGGGCATCAGCCGGGAGCACGCCAAGGTCTTCAACTACGGGCGCATCTACGGGGCTGGGCAGCCCTTTGCCGAGCGGCTGCTGATGCAGTTCAATCACCGGCTGACACAGCAGCAGGCACGTGAGAAGGCACAGCAGATGTATGCAGTCACAAAGGGCATCCGGAGGTTTCATCTCAGCGAGGAGGGCGAGTGGCTGGTGAAGGAACTGGAGCTGGCTGTGGACAAAGCAGAAGATGGTACGGTGTCGGCCCAGGATGTGCAGAAGATCCAGAGAGAAGCCATGAGAAAGTCCCGAAGGAAGAAGAAGTGGGACGTGGTGGCTCACCGAATGTGGGCTGGAGGCACCGAGTCCGAAATGTTCAACAAGCTGGAGAGCATCGCTCTGTCCGCCTCGCCACAGACCCCGGTGCTGGGCTGTCATATCAGCAGGGCTCTGGAGCCTGCAGTGGCCAAAGGGGAGTTTCTAACCAGCAGAGTGAACTGGGTGGTGCAGAGCTCAGCTGTTGACTACCTGCACCTCATGCTGGTCTCCATGAAGTGGCTCTTTGAGGAGTATGACATAAATGGTCGCTTCTGCATCAGCATCCACGACGAGGTGCGCTACCTGGTGCAGGAGCAGGACCGCTACCGGGCAGCACTGGCCCTGCAGATCACCAACCTGCTCACACGGTGCATGTTTGCCTACAAGCTGGGCCTCCAGGATCTGCCGCAGTCCGTGGCTTTCTTCAGCGCTGTGGACATTGACCGGTGCTTAAGGAAGGAGGTGACCATGAACTGTGCGACTCCATCAAATCCAACCGGCATGGAGAAGAAGTACGGCATTCCTCGAGGAGAAGCACTGGATATATATCAGATAATTGAAATAAC
CAAAGGCTCACTGGAGAAGAAGTGATAACGTGAGAGTGCCAGAAGGTGCAAGTTGTCCAGAGAGCACACGGGAACCTGGCTGTCCTTTCAGAAGCACATACATGGCAGGGACCAATCCTGGTTGCGCCGCTTCCTTCTCGTGGTAAGAAAAAGATGTTCCTGATGAAGATTTTCATAGCAGCACATCTGAATGGGAGAGCTTGCATATTTGAATGGCTGGCAGCCAGCTTTAAGACCTGAGACACCTGACAGAGTCACTGCTTGCACACCCGTGGGGATGAAGAAAGAAGTCTTGAGTATTTGCCAGGAGACAGAATCAAATCAATCATCTGTACGTGCAGTTCTCCAAGACCAAGGTGAGGCTGCCACAGCACAGGTGCTGTAGGAGAAGGAGGTGGCAGCAGTTGCAAGCACACATTCTATTTTTTTCGCCTTCTTTTCTTTTGGGGTTCCTGGTTTTCATCTGGCTGCTCTGCTGTGCCGGACTGGAGAGAAATAGAGAGTTAAGAGTACCAAGTGTGAACGTTTGTGT')

In [6]:
#This function requires the GenBank file and the text hit table downloaded from a BLAST result. The optional
#parameter optional_cutoff is a subject accession (acc.ver): if the hits were sorted by something like query cover before
#downloading, processing stops at that accession so that only sequences above a certain query cover are considered
#in downstream analysis. Alternatively, one could simply download only the sequences that are above
#a certain query cover
def processHitTable(genbank_file,text_table, optional_cutoff = ''):
    Sequence_dict = {}
    for file in SeqIO.parse(genbank_file, 'gb'):
        for feature in file.features:
                if feature.type == 'gene':
                    if 'gene' in feature.qualifiers.keys():
                        symbol = feature.qualifiers['gene']
                    if 'locus_tag' in feature.qualifiers.keys():
                        symbol = feature.qualifiers['locus_tag']
                if feature.type == 'source':
                    organism = feature.qualifiers['organism'][0].replace(" ", "_")
                    
                    #automatically should use POLG-201 transcript
                    if organism == 'Gallus_gallus':
                        Sequence_dict['Gallus_gallus'] = {}
                        Sequence_dict['Gallus_gallus']['POLG_201'] = {}
                        Sequence_dict['Gallus_gallus']['POLG_201']['nam'] = 'Ensembl_transcript_POLG-201'
                        Sequence_dict['Gallus_gallus']['POLG_201']['seq'] = (custom_POLG201_Gallus_gallus_sequence)
                        Sequence_dict['Gallus_gallus']['POLG_201']['start'] = 404
                        Sequence_dict['Gallus_gallus']['POLG_201']['end'] = 3983
                        Sequence_dict['Gallus_gallus']['POLG_201']['bit score'] = 1000000
                        
                if feature.type == 'CDS':
                    CDS = [int(a) for a in feature.location]
                    start = CDS[0]
                    end = CDS[-1]
                    accession = file.name
                    full_name = file.description
                    if organism not in Sequence_dict.keys():
                        Sequence_dict[organism] = dict()
                    Sequence_dict[organism][accession] = dict()
                    Sequence_dict[organism][accession]['nam'] = full_name
                    Sequence_dict[organism][accession]['seq'] = file.seq
                    Sequence_dict[organism][accession]['start'] = start
                    Sequence_dict[organism][accession]['end'] = end + 1
    
    final_hit_dict = {}
    with open(text_table) as f:
        reader = csv.DictReader(f, delimiter = "\t")
        for initial_row in islice(reader, 4, 5):
            header_list = str((initial_row['# tblastn'])).split('# Fields: ')[1].split(', ')
        hit_number = 0
        for row in islice(reader, 1, None):   
            hit_dict = {}
            hit_number +=1
            query_id = []
            query_id.append(str(row['# tblastn']))
            result_list = row[None]
            combined_results = query_id + result_list
            i = 0
            for item in header_list:
                hit_dict[item] = combined_results[i]
                i+=1  
            if optional_cutoff != '':
                if hit_dict['subject acc.ver'] == optional_cutoff:
                    break
            key = (hit_dict['subject acc.ver'].split('.'))[0]
            
            
            organism = ''
            accession = ''
            for item in Sequence_dict:
                for item2 in Sequence_dict[item]:
                    if item2 == key:
                        organism = item
                        accession = item2
                        Sequence_dict[organism][accession].update(hit_dict)
            
            
            
            
            #final_hit_dict[key] = hit_dict     
    return Sequence_dict

In [7]:
def bestHitPerOrganism(hitTable):
    singleHitDict = {}
    for organism in hitTable:
        bestScore = 0.0
        final_accession = ''
        transcript_variant = 0
        for accession in hitTable[organism]:
            current_score = float(hitTable[organism][accession]['bit score'])
            if current_score > bestScore:
                bestScore = current_score
                final_accession = accession
        singleHitDict[organism] = {'accession':final_accession, 'bit_score': bestScore,
                                   'sequence':hitTable[organism][final_accession]['seq'],
                                  'nam':hitTable[organism][final_accession]['nam'],
                                  'start':hitTable[organism][final_accession]['start'],
                                  'end':hitTable[organism][final_accession]['end']}
    return singleHitDict     

In [8]:
def writeCDS(out_file, D, skip = [],nam='organism'):
    text_file = open(out_file, 'w')
    for item in D:
        if item in skip:
            continue 
        if nam == 'organism':
            text_file.write(">%s\n%s\n" % (item,
                                           (D[item]['sequence'][D[item]['start']:D[item]['end']])))
        else:
            text_file.write(">%s\n%s\n" % (D[item][nam],
                                           D[item]['sequence'][D[item]['start']:D[item]['end']]))            
    text_file.close()

In [9]:
def writeProtein(out_file, D,skip = [], nam='organism'):
    text_file = open(out_file, 'w')
    for item in D:
        if item in skip:
            continue
        if nam == 'organism':
            text_file.write(">%s\n%s\n" % (item,
                                           ((D[item]['sequence'][D[item]['start']:D[item]['end']])).translate(to_stop=True)))
        else:
            text_file.write(">%s\n%s\n" % (D[item][nam],
                                           (D[item]['sequence'][D[item]['start']:D[item]['end']]).translate(to_stop=True)))
    text_file.close()

In [10]:
def runMuscle(in_file, out_file, muscle_executable):
    muscle_cline = MuscleCommandline(muscle_executable, input=in_file, out=out_file)
    muscle_cline()

In [11]:
def readAlignment(in_file):
    alignment_dict = {}
    for seq_record in (SeqIO.parse(in_file, 'fasta')):
        name = seq_record.id
        seq = seq_record.seq
        alignment_dict[name] = str(seq)
    return (alignment_dict)

In [12]:
def orderCDSfile(final, alignment_dict, filename):
    CDS_order_file = open(filename,'w')
    for item in alignment_dict:
        CDS_order_file.write('>'+item+'\n'+str(final[item]['sequence'][final[item]['start']:final[item]['end']])+'\n')
    CDS_order_file.close()

In [13]:
def generateAlignmentFile(alignment_in, alignment_dict):
    alignment_file = open(alignment_in, 'w')
    for item in alignment_dict:
        alignment_file.write('>'+item+'\n'+(alignment_dict[item])+'\n')
    alignment_file.close()

In [14]:
def runTranalign(Tranalign_exe, orderedCDS_file_in, POLG_muscle_alignment_in, tranalignseq_out):
    tranalign_cline = TranalignCommandline(Tranalign_exe,asequence=orderedCDS_file_in,
                                 bsequence=POLG_muscle_alignment_in,
                                 stdout=True,outseq=tranalignseq_out)
    tranalign_cline()

In [15]:
hitTable = processHitTable(BLAST_genbank,BLAST_textoutput)
singleTranscriptTable = bestHitPerOrganism(hitTable)

FileNotFoundError: [Errno 2] No such file or directory: 'Representative_Species/tblastn_refseqrna_mammalia_queryhomosapiens_1-12-20.gb'

In [36]:
writeCDS(output_all_CDS, singleTranscriptTable, mammal_filter_list)

In [37]:
writeProtein(output_protein, singleTranscriptTable, mammal_filter_list)

In [38]:
runMuscle(output_protein, output_all_CDS_ali, muscle_executable)

In [39]:
alignment_dict = readAlignment(output_all_CDS_ali)

In [40]:
orderCDSfile(singleTranscriptTable, alignment_dict,ordered_CDS)

In [41]:
generateAlignmentFile(alignment_file, alignment_dict)

In [42]:
runTranalign(tranalign_exe, ordered_CDS, alignment_file, tranalignseq_out)

In [43]:
tranaligned_dict = {}
for file in SeqIO.parse(tranalignseq_out, 'fasta'):
    tranaligned_dict[file.name] = str(file.seq)

In [44]:
#Run synplot on the online database: http://guinevere.otago.ac.nz/cgi-bin/aef/synplot.pl

In [45]:
#CleanupFiles
os.remove(output_all_CDS)
os.remove(output_protein)
os.remove(alignment_file)
os.remove(ordered_CDS)
os.remove(output_all_CDS_ali)
os.remove(tranalignseq_out)

__This portion of the code deals with aligning the 5'-UTR portion of sequences__

Sequences that are aligned must have a 5'-UTR of at least 100 nucleotides so that an alignment can be produced. The filtering steps that were included in the previous section to make Synplot2 run are not applied here or in downstream analyses
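A small sketch of the trimming rule above: take the 100 nucleotides directly upstream of the start codon and skip transcripts whose 5'-UTR is too short. The function name and the toy transcript are invented for illustration.

```python
def last_utr_window(transcript, cds_start, window=100):
    """Return the `window` nt directly upstream of the start codon, or None if the 5'-UTR is shorter."""
    if cds_start < window:
        return None
    return transcript[cds_start - window:cds_start]

# Toy transcript: 9 nt of 5'-UTR followed by a short ORF (invented sequence).
toy = "GGGGGACGT" + "ATGAAATAA"
print(last_utr_window(toy, cds_start=9, window=4))
print(last_utr_window(toy, cds_start=9, window=10))
```

Note that extractfiveUTR in this notebook uses a strict `>` comparison, so a 5'-UTR of exactly 100 nt is skipped there but kept by this sketch.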

__Mammalia Results__


The sequences that lack a conserved CUG codon are Camelus_ferus, Vombatus_ursinus, Phascolarctos_cinereus, and Monodelphis_domestica.

Sarcophilus_harrisii has the CUG codon but the surrounding sequence is a bit odd

Unsurprisingly, the lengths of the alternative frame within the CDS for these sequences are as follows:

Camelus_ferus: 197 nucleotides

Sarcophilus_harrisii: 86 nucleotides

Vombatus_ursinus: 86 nucleotides

Phascolarctos_cinereus: 86 nucleotides

Monodelphis_domestica: 86 nucleotides

This supports the idea that the mammals that lack the CUG initiation codon also lack the long ORF
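The lengths above come from determineAltFrameLength. The same idea can be sketched without Biopython by walking a chosen frame of the CDS until the first stop codon. Which offset corresponds to the frame I call "+1" is left as a parameter here, and the toy CDS is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_length(cds, offset):
    """Number of nucleotides readable from cds[offset] before the first in-frame stop codon."""
    length = 0
    for i in range(offset, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            break
        length += 3
    return length

# Invented toy CDS; each offset selects one of the three forward frames.
toy_cds = "ATGGCTGACCCTAAGTGA"
for offset in (0, 1, 2):
    print(offset, open_length(toy_cds, offset))
```

A short open length in the alternative frame, as in the marsupials above, means a stop codon appears soon after the main start codon in that frame.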

Monodelphis_domestica has a gap where the CUG is located in other mammals, so I've artificially chosen the next nucleotide as where the reference location begins. This can be seen in the kozakMotif function


__Sauropsid Results__

Most sequences do not appear to have a long ORF in the incorrect reading frame. Unlike mammalia, I may have to manually pick the sequences that work best to align, since an alignment with ALL the Kozak sequences isn't yielding anything informative. Perhaps a manual inspection is in order.

The peptide encoded by the CUG-initiated ORF in Gallus gallus aligns very poorly to the Homo sapiens protein

--> Sequences must have a 3'-UTR of 100 nucleotides or greater
--> The extension in the correct ORF must be greater than 300 nucleotides

_Gallus_gallus_

-->CUG is properly offset from start codon
-->Length of ORF from CUG to TF stop codon is 1110 nucleotides

Empidonax_traillii

-->CUG is not properly offset from start codon to access frame
-->Length of start codon to supposed TF stop codon is 398 nucleotides

Neopelma_chrysocephalum
-->CUG is not properly offset from start codon to access frame
-->Length of start codon to supposed TF stop codon is 398 nucleotides

_Parus_major_
-->CUG is properly offset from start codon
-->Length of ORF from CUG to TF stop codon is 777 nucleotides

_Zonotrichia_albicollis_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 813 nucleotides

Corvus_brachyrhynchos
-->CUG is not properly offset from the start codon
-->Length of ORF from start codon to supposed TF stop codon is 536 nucleotides

_Corapipo_altera_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

_Lepidothrix_coronata_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

_Pipra_filicauda_
-->CUG is properly offset from the start codon
-->Length of ORF from CUG to TF stop codon is 471 nucleotides

Ficedula_albicollis
-->CUG is not properly offset from the start codon
-->Length of ORF from start to supposed stop codon is 353 nucleotides

_Anser_cygnoides_domesticus_
-->There are two CUGs that are properly offset from the start codon. One is 100 nucleotides away and the other is 178 nucleotides away. I assume that the one that is 100 nucleotides is likely to be the one used (or formerly used). 
-->Length of ORF from CUG to TF stop codon is 708 nucleotides


List of sauropsids that require further analysis: sauropsid_list = ['Gallus_gallus', 'Parus_major', 'Zonotrichia_albicollis', 'Corapipo_altera', 'Lepidothrix_coronata', 'Pipra_filicauda', 'Anser_cygnoides_domesticus']

These sauropsids show no alignment of their prospective CUG initiation codons in a MUSCLE alignment of their 5'-UTRs

__Amphibia__

There are only three organisms that are analyzed. None have an overlapping extension greater than 300 nucleotides

Xenopus_tropicalis:XM_002932235
Length into TF ORF 71 nucleotides

Xenopus_laevis:XM_018250789
Length into TF ORF 62 nucleotides

Nanorana_parkeri: XM_018571894
Length into TF ORF 38 nucleotides


__Teleost fish__

Only 3 sequences have an overlapping extension greater than 300 nucleotides

Austrofundulus_limnaeus (XM_014005514): 323 nucleotides
Contains two CUGs in the proper frame 70 and 82 nucleotides away but there is an intervening stop codon. 

Sinocyclocheilus_rhinocerous (XM_016528575): 314 nucleotides
Contains one CUG in the proper frame but there is an intervening stop codon

Clupea_harengus (XM_012832592): 320 nucleotides
Contains one CUG in the proper frame without an intervening stop codon
-->The surrounding sequence for this CUG is ACAAAA __CTG__ A
>Potential_protein
MKRNVAVMVTPLSTTTVPCTGAQEVAKFLGRASPGTGAQLSSEKSKPTQHPDDVCKPAQADIPRIGASIHTGRCGAEYQAPAEASALGQRSCASSRRGAEVARNVWQRH
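
The checks applied to the sauropsid and teleost sequences above (is a CUG in the proper frame, and is there an intervening stop codon?) can be sketched as follows. I am assuming that "nucleotides away" means the distance from the C of the CUG to the A of the main AUG. The distances quoted above (70, 82, 100 and 178) all leave remainder 1 when divided by 3, so the example uses `frame_shift=1`. The function name and the toy 5'-UTR are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def upstream_ctg_candidates(five_utr, frame_shift):
    """
    List CTG codons in the 5'-UTR that sit in the chosen frame relative to the main ATG.

    five_utr    : sequence ending immediately before the main start codon
    frame_shift : wanted value of (distance from the CTG to the ATG) modulo 3
    Returns (distance_to_ATG, has_intervening_stop) tuples, nearest CTG first.
    Only the 5'-UTR is checked for stops; the CDS portion is not examined here.
    """
    results = []
    n = len(five_utr)
    for i in range(n - 3, -1, -1):
        if five_utr[i:i + 3] != "CTG":
            continue
        distance = n - i
        if distance % 3 != frame_shift:
            continue
        codons = [five_utr[j:j + 3] for j in range(i + 3, n - 2, 3)]
        has_stop = any(codon in STOP_CODONS for codon in codons)
        results.append((distance, has_stop))
    return results

# Invented toy 5'-UTR with two in-frame CTGs, the distal one followed by a TAA.
print(upstream_ctg_candidates("CTGTAAGCACTGGCAA", frame_shift=1))
```

A candidate with `has_intervening_stop` set to True corresponds to the Austrofundulus_limnaeus and Sinocyclocheilus_rhinocerous situation described above.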

In [46]:
def extractfiveUTR(singleTranscriptTable, UTR_size = 100):
    five_UTR_dict = {}
    for item in singleTranscriptTable:
        if singleTranscriptTable[item]['start'] > UTR_size:
            accession = singleTranscriptTable[item]['accession']
            sequence = singleTranscriptTable[item]['sequence']
            fiveUTR = sequence[0:singleTranscriptTable[item]['start']]
            five_UTR_dict[item] = {'accession':accession,'fiveUTR':fiveUTR}
    return five_UTR_dict

In [47]:
def fiveUTRTrim(five_UTR_dict, trim_size = 100):
    UTR_trim_dict = {}
    for item in five_UTR_dict:
        len_UTR = len(five_UTR_dict[item]['fiveUTR'])
        start = len_UTR-trim_size
        trimmed_UTR = five_UTR_dict[item]['fiveUTR'][start:len_UTR]
        UTR_trim_dict[item] = {'accession':five_UTR_dict[item]['accession'],'sequence':trimmed_UTR}
    return UTR_trim_dict
    

In [48]:
def trimmedfiveUTRwrite(UTR_trim_dict):
    UTR_file = open(fiveUTR_100nt,'w')
    for item in UTR_trim_dict:
        UTR_file.write('>'+item+'\n'+str(UTR_trim_dict[item]['sequence'])+'\n')
    UTR_file.close()
    
    

In [49]:
def determineAltFrameLength(singleTranscriptTable):
    plus1_dict = {}
    for item in singleTranscriptTable:
        sequence = singleTranscriptTable[item]['sequence']
        start = singleTranscriptTable[item]['start']
        stop = singleTranscriptTable[item]['end']
        accession = singleTranscriptTable[item]['accession']
        CDS = sequence[start:stop]
        CDS_truncateStart = CDS[3:]
        plusOne = 'ATGG' + CDS_truncateStart
        plusOneLength = (3*len(plusOne.translate(to_stop=True)))+2
        plus1_dict[item] = {'accession':accession,'+1_length_intoCDS':plusOneLength}
    return plus1_dict

In [50]:
def kozakMotif(fiveUTRalignment_dict,reference_location):
    reference_location -=1
    kozak_dict = {}
    for item in fiveUTRalignment_dict:
        #Monodelphis_domestica has a gap where the CUG should be, so shift by one nucleotide for this species only
        location = reference_location
        if item == 'Monodelphis_domestica':
            location +=1
        
        fiveUTR = fiveUTRalignment_dict[item]
        CUG = fiveUTR[location:location+3]
        motif = CUG
        
        #append the first non-gap nucleotide downstream of the CUG (the +4 position)
        i = 0
        nextnt = fiveUTR[location+3+i:location+3+i+1]
        while nextnt == '-':
            i +=1
            nextnt = fiveUTR[location+3+i:location+3+i+1]
        motif += nextnt
    
        #prepend the six nearest non-gap nucleotides upstream of the CUG (positions -6 to -1)
        k = 0
        j = 6
        while j > 0:
            previousnt = fiveUTR[location-1+k:location+k]
            if previousnt == '-':
                k -=1
            else:
                motif = previousnt + motif
                k -=1
                j -=1
        kozak_dict[item] = motif
    return kozak_dict
    

In [51]:
def writeKozakMotif(kozak_motif_dict, kozak_file):
    file = open(kozak_file, 'w')
    for item in kozak_motif_dict:
        file.write('>'+item+'\n'+kozak_motif_dict[item]+'\n')
    file.close()

In [52]:
five_UTR_dict = extractfiveUTR(singleTranscriptTable)

In [53]:
UTR_trim_dict = fiveUTRTrim(five_UTR_dict)

In [54]:
trimmedfiveUTRwrite(UTR_trim_dict)

In [55]:
runMuscle(fiveUTR_100nt, aligned_fiveUTR_100nt, muscle_executable)

In [56]:
fiveUTRalignment_dict = readAlignment(aligned_fiveUTR_100nt)

In [57]:
plus1_dict = determineAltFrameLength(singleTranscriptTable)



In [58]:
#custom number that depends on the alignment and requires manual inspection



mammalian_CUG_location = 86

In [59]:
kozak_motif_dict = kozakMotif(fiveUTRalignment_dict,mammalian_CUG_location)

In [60]:
writeKozakMotif(kozak_motif_dict, kozak_file)

In [71]:
singleTranscriptTable['Sarcophilus_harrisii']

{'accession': 'XM_003755551',
 'bit_score': 1793.0,
 'sequence': Seq('CTGGCGCCTCGTTCTAGACCAATCTCTGAGTTTTGCGGCGGGAGGGGGCGGGAC...AAG', IUPACAmbiguousDNA()),
 'nam': 'PREDICTED: Sarcophilus harrisii DNA polymerase gamma, catalytic subunit (POLG), transcript variant X1, mRNA',
 'start': 500,
 'end': 4097}

__Looking at protein__

__Mammals__

For this section, I am selecting all mammals that have a conserved CUG uORF start codon. 

Original ignore list I used --> every mammal (with a 100 nt 3UTR) is included except Camelus_ferus, Vombatus_ursinus, Phascolarctos_cinereus, and Monodelphis_domestica.

I am switching to the extended ignore list (which also drops Sarcophilus_harrisii) to see if it improves the protein alignment

The most significant portion of the mammalian synplot (p-value below 10^-20) corresponds roughly to the following amino acid sequence of the CUG ORF: GHPRGAGPGVRRGGLL. This starts at position 205 in the Homo sapiens CUG ORF and ends at 221
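
To double-check coordinates like these, a peptide can be located in a translated ORF with a plain string search. This is only a sketch: `locate_peptide` is a hypothetical helper, and the example uses a short C-terminal excerpt of the Homo_sapiens CUG-ORF translation printed further down in this notebook, so the positions it reports are relative to the excerpt, not to the full ORF.

```python
def locate_peptide(protein, peptide):
    """Return 1-based (start, end) of the first occurrence of `peptide`, end inclusive, or None."""
    index = protein.find(peptide)
    if index == -1:
        return None
    return index + 1, index + len(peptide)

# Excerpt of the Homo_sapiens CUG-ORF protein (positions are relative to this excerpt only).
excerpt = "GLGLGGGLDPVRPRGGGRTRGHPRGAGPGVRRGGLLGRGNLPHIGGGH"
print(locate_peptide(excerpt, "GHPRGAGPGVRRGGLL"))
```

With an inclusive end, a 16-residue peptide that starts at position p ends at p + 15, which is worth keeping in mind when quoting start and end positions.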

For transmembrane analysis, I'm going to take one mammal from each of the major orders
-->Rodentia: Mus_musculus
-->Chiroptera: Myotis_lucifugus
-->Eulipotyphla: I can't find one,
------> I'm going to just use Orcinus_orca
-->Primates: Homo_sapiens

In [131]:
mammal_ignore_list = ['Camelus_ferus', 'Vombatus_ursinus', 'Phascolarctos_cinereus' ,'Monodelphis_domestica']
mammal_ignore_list_extended = ['Camelus_ferus', 'Vombatus_ursinus', 'Phascolarctos_cinereus' ,'Monodelphis_domestica','Sarcophilus_harrisii']

In [132]:
def getCUGProtein(UTR_trim_dict,ignore_list,singleTranscriptTable):
    
    CUG_protein_dict = {}
    for item in UTR_trim_dict:
        if item in ignore_list:
            continue
        fiveUTRcode = fiveUTRalignment_dict[item][86:]
        fiveUTRCDS = ''
        for nt in fiveUTRcode:
            if nt != '-':
                fiveUTRCDS = fiveUTRCDS+ nt
        CUG_protein_dict[item] = (Seq('A') + fiveUTRCDS[1:] + (singleTranscriptTable[item]['sequence'])[singleTranscriptTable[item]['start']:]).translate(to_stop=True)
    return CUG_protein_dict

In [133]:
def writeUORFproteins(CUGprotein_dict, outfile):
    file = open(outfile,'w')
    for item in CUGprotein_dict:
        file.write('>'+item+'\n'+str(CUGprotein_dict[item])+'\n')
    file.close()
        

In [134]:
CUGprotein_dict = getCUGProtein(UTR_trim_dict,mammal_ignore_list_extended,singleTranscriptTable)



In [135]:
writeUORFproteins(CUGprotein_dict,CUGproteinfasta)

In [136]:
runMuscle(CUGproteinfasta, CUGalignedprotein, muscle_executable)

In [137]:
CUG_protein_alignment_dict = readAlignment(CUGalignedprotein)

In [138]:
generateAlignmentFile(CUG_aligned_fasta, CUG_protein_alignment_dict)

In [150]:
str(singleTranscriptTable['Myotis_lucifugus']['sequence'])

'CGCGGGACCGTGCGCGGCGCAGACGGGAAGTTGCGGCTGCCAGCGGAGCGCACGGGGCGCGGGGAGGCCACACGCCACCCCGAGGCTGCGTAGGCCGCGCGGAGGGAGCAGCCGCGCCGCTGGCCTGGGGTCGGGAGCGGCAGGCCCGGAGGCCCTAGGCGGACCGAGGATTGCGGGTGGAAGGCAGGCATGGTCAGGCCCATTGCACTGACGGGAAGACAGAGACGAGACGTGTCTCTCCCCACGTCTTGCATCCGGTAAAAGCAGCCAAGCTGGAGCCCAAAGCCAGGGGTCCCGAATCCCAGCGGGGAGCTCCCTGCACCCACCATGAGCCGCCTGCTCTGGAGGAAAGTGGCCGGCTCCGCGGCCGTCGGGCCCGGGCCGGGGCCAGCAGCTCCGGGGCGCTGGGTCTCCAGCTCCGCCGCCATCCCCGGCCCCAGCGACGGGCTGCCGCCGCCGCCGCCAGCGCCATCCTCGGAGGAGCAGATATTAGGGGCCGGCGGCGGGGAGACGCCAGAAGAGTCCGCGGTGCGCCGCAGCGTGGAGCAATTGCAGAAGCACGGTCTTTGGGGGCACCCGGCAGCGCCCCTGCCCGACGTGGAGCTGCGCCTGCCGCCCCTCTACGGGGGCAGCCTGGACCAGCACTTCCGCCTCCTGGCGCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCCACTCGCTCTTGCAGGCCCAGCTGCCCCCCAGGCCCCCGAGCTGGGCCTGGGCGGAGGGCTGGACCCGGTACGGCCCCGCGGGGGAGGCCGAACCCGTGGCCATCCCCGAGGAGCGGGCCCTGGTGTTCGACGTGGAGGTCTGCTTGGCAGAGGGAACCTGCCCCACGCTGGCGGTGGCCATATCCCCCTCGGCCTGGTATTCCTGGTGCAGCCGGCGGCTGGTGGAAGAGCGTTACTCCTGGACCAGCCAGCTGTCGCCGGCTGACCTCATCCCCCTGGAGGCCCCCGCCAGCGCCGGCCCCCCC

In [153]:
(str(CUGprotein_dict['Homo_sapiens']))

'MEPKARCSDSQRGGPCTNHEPPALEEGGRRHRRARAGSSSGALGLQLRPRVRPQRRAAAAAAAAAAAAAAATAASAAASAILGGRAAAAQPIGHPDALERAARANLRARRGDAWRGRGAPQRRAPAEARALGAASRALARRGAAPAAPLRGQPGPALPPPGPEAEPALPGGGQLAVAGPAAPEAPGLGLGGGLDPVRPRGGGRTRGHPRGAGPGVRRGGLLGRGNLPHIGGGHIPLGLVFLVQPAAGGRALLLDQPAVAG'

__Looking at RNA secondary structure between CUG and AUG start codon__

__Mammalia__

I'm going to use the same sequences that I used for the protein analysis. If the CUG is in the P-site, then the following 3 nucleotides will be in the A-site, and the 5 nucleotides after that will be in the entry tunnel of the ribosome. Thus I should only consider nucleotides more than 8 nucleotides downstream of the CUG initiation site
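
The reasoning above can be written as a small helper that skips the nucleotides covered by the initiating ribosome and returns the remainder as RNA. The function name, the parameter names and the toy sequence are assumptions for illustration. The 3 nt A-site and 5 nt entry tunnel are the values stated above and are kept as parameters.

```python
def downstream_of_initiating_ribosome(seq, cug_index, a_site=3, entry_tunnel=5):
    """
    Return the RNA downstream of the nucleotides covered by a ribosome whose P-site holds the CUG.

    seq       : DNA-alphabet sequence (T instead of U)
    cug_index : 0-based index of the C of the CUG/CTG codon
    """
    first_free = cug_index + 3 + a_site + entry_tunnel
    return seq[first_free:].replace("T", "U")

# Invented toy sequence with a CTG starting at index 2.
print(downstream_of_initiating_ribosome("AACTGGCATTTCCGGATCC", 2))
```

The returned sequence is what would then be passed to a secondary structure prediction.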

In [115]:
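def getCUGRNA(UTR_trim_dict,ignore_list,singleTranscriptTable):
    # Note: sequences are read from the global fiveUTRalignment_dict;
    # UTR_trim_dict only supplies the species names to loop over.
    CUG_RNA_dict = {}
    for item in UTR_trim_dict:
        if item in ignore_list:
            continue
        # Keep the aligned 5'UTR from column index 86 onwards
        fiveUTRcode = fiveUTRalignment_dict[item][86:]
        # Strip alignment gaps
        fiveUTRCDS = ''
        for nt in fiveUTRcode:
            if nt != '-':
                fiveUTRCDS = fiveUTRCDS+ nt
        # Drop the first 10 ungapped nucleotides (presumably those covered by the initiating ribosome)
        fiveUTRCDS = fiveUTRCDS[10:]
        CUG_RNA_dict[item] = Seq(fiveUTRCDS).transcribe()
    return CUG_RNA_dict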

In [116]:
def writeRNA(RNA_dict, outfile):
    # Write each sequence as a two-line FASTA record
    with open(outfile,'w') as file:
        for item in RNA_dict:
            file.write('>'+item+'\n'+str(RNA_dict[item])+'\n')
        

In [117]:
CUGRNA_dict = getCUGRNA(UTR_trim_dict,mammal_ignore_list_extended,singleTranscriptTable)

In [118]:
writeRNA(CUGRNA_dict, CUGRNAfasta)

In [119]:
runMuscle(CUGRNAfasta, CUGalignedRNA, muscle_executable)

In [120]:
CUG_RNA_alignment_dict = readAlignment(CUGalignedRNA)

In [121]:
generateAlignmentFile(CUG_RNA_aligned_fasta, CUG_RNA_alignment_dict)

__Looking at second uORF__

In Homo sapiens, the uORF is located 138 nucleotides away from the main AUG, and the 5'-UTR is 334 nucleotides long in total (POLG-201 from Ensembl). The uORF itself is 72 nucleotides long. I think I will filter sequences here using a 150-nucleotide 5'-UTR length cutoff and try to align the entire 5'-UTRs of the sequences.
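
A quick way to sanity-check uORF coordinates and lengths like those quoted above is to scan a 5'-UTR for start codons and read each one through to its first in-frame stop. The sketch below is an illustration on a made-up sequence; `orf_from` and `upstream_orfs` are hypothetical helper names, and lengths are reported including the stop codon, which is an assumption about how the 72-nucleotide figure was counted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_from(seq, start):
    """Return the ORF beginning at `start` (0-based) up to and including
    the first in-frame stop codon, or None if no stop is reached."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]
    return None


def upstream_orfs(utr, start_codon="ATG"):
    """List (0-based start, ORF length in nt) for every start codon in
    `utr` that reaches an in-frame stop within the UTR."""
    hits = []
    pos = utr.find(start_codon)
    while pos != -1:
        orf = orf_from(utr, pos)
        if orf is not None:
            hits.append((pos, len(orf)))
        pos = utr.find(start_codon, pos + 1)
    return hits


toy_utr = "GGATGAAACCCTAAGG"  # made-up sequence, not POLG
print(upstream_orfs(toy_utr))
```

Subtracting a reported start from the UTR length gives the distance to the main AUG, which can be compared against the figure above.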

I decided to add Sarcophilus harrisii to the ignore list: having the CUG does not mean that the uORFs are conserved, and this sequence also disrupts the alignment.

This is additionally interesting because 4 of the 5 sequences on the ignore list are marsupials and the fifth is Camelus_ferus. For some reason, camel sequences seem to be very gappy when aligned with other species. Perhaps the quality of the camel sequences deposited at NCBI is poor?

The consensus Kozak sequence for the upstream AUG is quite decent.
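
One common rule of thumb for judging a Kozak context looks only at the two most influential positions: a purine at -3 and a G at +4, counting the A of the AUG as +1. The sketch below applies that rule; `kozak_strength` is a hypothetical helper written for illustration and is not the `kozakMotif` function used in this notebook.

```python
def kozak_strength(seq, start):
    """Classify the context of the start codon at 0-based index `start`.

    Checks for a purine (A/G) at -3 and a G at +4, where the first
    nucleotide of the start codon is +1.
    """
    if start < 3 or start + 3 >= len(seq):
        return "undetermined"  # not enough flanking sequence
    minus3 = seq[start - 3].upper()
    plus4 = seq[start + 3].upper()
    hits = (minus3 in "AG") + (plus4 == "G")
    return {2: "strong", 1: "adequate", 0: "weak"}[hits]


# Made-up contexts, start codon at index 6 in each
print(kozak_strength("GCCACCATGGCG", 6))
print(kozak_strength("TTTTTTATGTTT", 6))
```

Running this per species over the aligned start codon would give a rough tally to set beside the motif output.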





In [84]:
mammal_ignore_list_extended = ['Camelus_ferus', 'Vombatus_ursinus', 'Phascolarctos_cinereus' ,'Monodelphis_domestica','Sarcophilus_harrisii']

In [113]:
def writeUTRfile(five_150_UTR_dict, filename, ignore_list):
    # Write the 5'UTRs as FASTA, printing the name of every skipped species
    with open(filename, 'w') as file:
        for item in five_150_UTR_dict:
            if item in ignore_list:
                print(item)
                continue
            file.write('>'+item+'\n'+str(five_150_UTR_dict[item]['fiveUTR'])+'\n')

In [187]:
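def getuORFProtein(UTR_dict,ignore_list,singleTranscriptTable,reference):
    # reference is a 1-based alignment column; convert it to a 0-based index
    reference -= 1
    uORF_protein_dict = {}
    for item in UTR_dict:
        if item in ignore_list:
            continue
        # Aligned 5'UTR from the uORF AUG column onwards, with gaps removed
        fiveUTRcode = UTR_dict[item][reference:]
        fiveUTRCDS = ''
        for nt in fiveUTRcode:
            if nt != '-':
                fiveUTRCDS = fiveUTRCDS+ nt

        # Locate that fragment in the full transcript and translate up to the first stop
        start = int(singleTranscriptTable[item]['sequence'].find(fiveUTRCDS))
        if start == -1:
            # find() returns -1 when the fragment is absent; skip rather than translate from the wrong position
            print(item + ': 5\'UTR fragment not found in transcript')
            continue
        uORF_protein_dict[item] = (singleTranscriptTable[item]['sequence'][start:]).translate(to_stop=True)

        #uORF_protein_dict[item] = (Seq('A') + fiveUTRCDS[1:] + singleTranscriptTable[item]['sequence']).translate(to_stop=True)
    return uORF_protein_dict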

In [227]:
def getuORFRNA(UTR_dict,ignore_list,singleTranscriptTable):
    
    uORF_RNA_dict = {}
    for item in UTR_dict:
        if item in ignore_list:
            continue
        fiveUTRcode = UTR_dict[item][uORF_AUG_reference:]
        fiveUTRCDS = ''
        for nt in fiveUTRcode:
            if nt != '-':
                fiveUTRCDS = fiveUTRCDS+ nt
        fiveUTRCDS = fiveUTRCDS[10:]
        uORF_RNA_dict[item] = Seq(fiveUTRCDS).transcribe()
    return uORF_RNA_dict

In [145]:
five_150_UTR_dict = extractfiveUTR(singleTranscriptTable, 150)

In [146]:
writeUTRfile(five_150_UTR_dict, full_5UTR_fasta, mammal_ignore_list_extended)

Phascolarctos_cinereus
Vombatus_ursinus
Sarcophilus_harrisii
Camelus_ferus


In [147]:
runMuscle(full_5UTR_fasta, aligned_fiveUTR_full, muscle_executable)

In [148]:
full5UTR_alignment_dict = readAlignment(aligned_fiveUTR_full)

In [149]:
# Write the alignment with Homo_sapiens first so that it serves as the reference row
with open('reordered_5UTR_Full_alignment.fasta','w') as file:
    for item in full5UTR_alignment_dict:
        if item == 'Homo_sapiens':
            file.write('>'+item+'\n'+full5UTR_alignment_dict[item]+'\n')
    for item in full5UTR_alignment_dict:
        if item != 'Homo_sapiens':
            file.write('>'+item+'\n'+full5UTR_alignment_dict[item]+'\n')

With this alignment, the uORF AUG begins at position 1222 in the alignment

In [150]:
uORF_AUG_reference = 1222
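
An alignment column such as the one recorded above is shared by every species, but the matching position in each ungapped sequence differs with the number of gaps that precede it. The following sketch shows the conversion on a toy alignment; `column_to_ungapped` is a hypothetical helper, and it assumes columns are counted from 1 and gaps are written as `-`.

```python
def column_to_ungapped(aligned_seq, column, gap="-"):
    """Map a 1-based alignment column to a 0-based index in the ungapped
    sequence. Returns None if that column is a gap in this sequence."""
    idx = column - 1
    if aligned_seq[idx] == gap:
        return None
    # Subtract the gaps that occur strictly before the column
    return idx - aligned_seq[:idx].count(gap)


# Made-up alignment in which the ATG starts at column 7 for both rows
toy_alignment = {
    "species_a": "--ACG-ATGCC",
    "species_b": "TTACGGATG--",
}
for name, aligned in toy_alignment.items():
    print(name, column_to_ungapped(aligned, 7))
```

A `None` result flags a species that has a gap at the reference column, which is worth checking before slicing its sequence there.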

In [151]:
uORF_AUG_kozak_dict = kozakMotif(full5UTR_alignment_dict,uORF_AUG_reference)

In [152]:
writeKozakMotif(uORF_AUG_kozak_dict, uORF_kozak_file)

In [188]:
uORF_protein_dict = getuORFProtein(full5UTR_alignment_dict,mammal_ignore_list_extended,singleTranscriptTable,uORF_AUG_reference)



In [193]:
writeUORFproteins(uORF_protein_dict,uORFproteinfasta)

In [196]:
runMuscle(uORFproteinfasta, uORFalignedprotein, muscle_executable)

In [197]:
uORF_protein_alignment_dict = readAlignment(uORFalignedprotein)

In [199]:
generateAlignmentFile(uORF_aligned_fasta, uORF_protein_alignment_dict)

In [238]:
uORF_RNA_dict = getuORFRNA(full5UTR_alignment_dict,mammal_ignore_list_extended,singleTranscriptTable)

In [239]:
writeRNA(uORF_RNA_dict, uORF_RNA_outfile)

In [241]:
runMuscle(uORF_RNA_outfile, uORF_alignedRNA, muscle_executable)

In [244]:
uORF_RNA_alignment_dict = readAlignment(uORF_alignedRNA)

In [245]:
generateAlignmentFile(uORF_RNA_aligned_fasta, uORF_RNA_alignment_dict)

__Analysis of ribosome profiling__

These data were downloaded from Trips-Viz on 10-1-19 from the following link (https://trips.ucc.ie/homo_sapiens/Gencode_v25/interactive_plot/?files=&ribo_studies=18,20,21,23,24,27,28,29,31,32,33,34,35,38,39,42,43,44,45,56,58,60,62,63,64,67,89,90,99,101,102,103,107,113,122,124,130,134,138,141,144,150,152,153,165,171,172,176,177,178,179,181,183,190,191,192,201,204,212,&tran=ENST00000268124&minread=25&maxread=150&user_dir=fiveprime&ambig=F&cov=F&lg=T&nuc=F&rs=0&crd=F&short=mcr) with the following transcript identifier (ENST00000268124)

In [1]:
import csv

def processRiboSeqcsv(riboseq_file):
    # Column-wise lists, one entry per transcript position
    Position_list = []
    Sequence_list = []
    Frame1_list = []
    Frame2_list = []
    Frame3_list = []
    RNASeq_list = []
    # utf-8-sig strips the byte-order mark that otherwise garbles the first header name
    with open(riboseq_file, encoding='utf-8-sig', newline='') as f:
        reader = csv.DictReader(f, delimiter = ",")
        for row in reader:
            Position_list.append(row['Position'])
            Sequence_list.append(row['Sequence'])
            Frame1_list.append(row['Frame 1'])
            Frame2_list.append(row['Frame 2'])
            # This header is looked up with its leading space, as in the downloaded file
            Frame3_list.append(row[' Frame 3'])
            RNASeq_list.append(row['RNA-Seq'])

    Riboseq_dict = {'Position':Position_list,'Sequence':Sequence_list,
                   'Frame1':Frame1_list,'Frame2':Frame2_list,'Frame3':Frame3_list,'RNA_Seq':RNASeq_list}
    return Riboseq_dict
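
CSV exports often begin with a UTF-8 byte-order mark, and decoding the file with a legacy code page leaves those three bytes glued to the first column name. The sketch below reproduces the effect on a tiny made-up table held in memory; `first_header` is a hypothetical helper and the two-column layout is only an illustration.

```python
import csv
import io

# A tiny CSV whose bytes start with the UTF-8 byte-order mark (EF BB BF)
raw = b"\xef\xbb\xbfPosition,Sequence\n1,A\n2,T\n"


def first_header(data, encoding):
    """Decode `data` with `encoding` and return the first CSV column name."""
    text = io.StringIO(data.decode(encoding), newline="")
    return csv.DictReader(text).fieldnames[0]


# With cp1252 the mark survives as three stray characters in the header
print(repr(first_header(raw, "cp1252")))
# With utf-8-sig the mark is stripped and the header is clean
print(repr(first_header(raw, "utf-8-sig")))
```

Opening the real file with `encoding='utf-8-sig'` gives the same clean header regardless of the platform's default encoding.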

In [2]:
def queryRiboSeq(sequence,Riboseq_dict):
    # Rebuild the full transcript from the per-position nucleotide column
    full_length_sequence = ''.join(Riboseq_dict['Sequence'])
    occurrences = full_length_sequence.count(sequence)
    if occurrences == 0:
        print('Sequence was not found')
        return
    if occurrences != 1:
        print('This sequence occurs more than once, try again with a unique sequence')
        return
    # beg_index is 1-based; end_index is one past the last matched nucleotide
    beg_index = 1+full_length_sequence.find(sequence)
    end_index = beg_index + len(sequence)

    print(beg_index)
    print(end_index)
    subsequence_dict = {}
    index_counter = beg_index
    for nt in sequence:
        # The count lists are 0-based, hence index_counter-1
        Frame1_count = Riboseq_dict['Frame1'][index_counter-1]
        Frame2_count = Riboseq_dict['Frame2'][index_counter-1]
        Frame3_count = Riboseq_dict['Frame3'][index_counter-1]
        RNASeq_count = Riboseq_dict['RNA_Seq'][index_counter-1]

        # Values are stored as one-element sets; printRiboSeqAnalysis iterates over them
        subsequence_dict[index_counter] = {}
        subsequence_dict[index_counter]['Nucleotide'] = {nt}
        subsequence_dict[index_counter]['Frame1'] = {int(Frame1_count)}
        subsequence_dict[index_counter]['Frame2'] = {int(Frame2_count)}
        subsequence_dict[index_counter]['Frame3'] = {int(Frame3_count)}
        subsequence_dict[index_counter]['RNASeq'] = {int(RNASeq_count)}
        index_counter +=1
    return subsequence_dict
        
        
    

In [37]:
def printRiboSeqAnalysis(sequence_dict):
    full_subsequence = ''
    x_total_fr1_counts = float(0)
    x_total_fr2_counts = float(0)
    x_total_fr3_counts = float(0)
    total_fr1_counts = float(0)
    total_fr2_counts = float(0)
    total_fr3_counts = float(0)
    total_all_counts = float(0)
    for index in sequence_dict:
        for item in sequence_dict[index]['Nucleotide']:
            nt = item
        full_subsequence += nt
        for item in sequence_dict[index]['Frame1']:
            fr1_count = float(item)
        for item in sequence_dict[index]['Frame2']:
            fr2_count = float(item)
        for item in sequence_dict[index]['Frame3']:
            fr3_count = float(item)
        for item in sequence_dict[index]['RNASeq']:
            RNASeq_count = float(item)
            if RNASeq_count == 0:
                RNASeq_count = 1
        
        
        x_total_fr1_counts += (fr1_count/RNASeq_count)
        x_total_fr2_counts += (fr2_count/RNASeq_count)
        x_total_fr3_counts += (fr3_count/RNASeq_count)
    total_fr1_counts = x_total_fr1_counts/float(len(full_subsequence))
    total_fr2_counts = x_total_fr2_counts/float(len(full_subsequence))
    total_fr3_counts = x_total_fr3_counts/float(len(full_subsequence))
    total_all_counts = total_fr1_counts+total_fr2_counts+total_fr3_counts
    #counts_per_nt = total_all_counts/float(len(full_subsequence))
    print('For the sequence %s, Frame 1 has %f reads, Frame 2 has %f reads, Frame 3 has %f reads, and there are a total of %f reads. '%(full_subsequence,total_fr1_counts,total_fr2_counts,total_fr3_counts,total_all_counts))
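
The numbers this function prints are not raw read counts: each frame's count at a position is divided by the RNA-seq count at that position (with 0 treated as 1), and the results are averaged over the length of the queried sequence. A compact sketch of the same arithmetic on made-up numbers is given below; `mean_normalised_frames` and its plain-list inputs are assumptions chosen for illustration.

```python
def mean_normalised_frames(frame_counts, rna_counts):
    """frame_counts: one (frame1, frame2, frame3) tuple per nucleotide.
    rna_counts: RNA-seq count per nucleotide (0 is treated as 1).
    Returns the per-nucleotide mean of count / RNA-seq for each frame."""
    totals = [0.0, 0.0, 0.0]
    for frames, rna in zip(frame_counts, rna_counts):
        denom = rna if rna != 0 else 1
        for k in range(3):
            totals[k] += frames[k] / denom
    n = len(frame_counts)
    return [t / n for t in totals]


# Made-up counts for a 4-nt stretch
toy_frames = [(4, 0, 0), (0, 2, 0), (0, 0, 6), (8, 0, 0)]
toy_rna = [2, 0, 3, 4]
print(mean_normalised_frames(toy_frames, toy_rna))
```

Because the values are length-averaged, regions of different sizes can be compared directly.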

In [38]:
uORF_sequence = 'ATGGTCAAACCCATTTCACTGACAGGAGAGCAGAGACAGGACGTGTCTCTCTCCACGTCTTCCAGCCAGTAAA'
CTG_sequence = 'CTGGAGCCCAAAGCCAGGTGTTCTGACTCCCAGCGTGGGGGTCCCTGCACCAACCATGAGCCGCCTGCTCTGGAGGAAGGTGGCCGGCGCCACCGTCGGGCCAGGGCCGGTTCCAGCTCCGGGGCGCTGGGTCTCCAGCTCCGTCCCCGCGTCCGACCCCAGCGACGGGCAGCGGCGGCGGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAGCCTCAGCAGCCGCAAGTGCTATCCTCGGAGGGCGGGCAGCTGCGGCACAACCCATTGGACATCCAGATGCTCTCGAGAGGGCTGCACGAGCAAATCTTCGGGCAAGGAGGGGAGATGCCTGGCGAGGCCGCGGTGCGCCGCAGCGTCGAGCACCTGCAGAAGCACGGGCTCTGGGGGCAGCCAGCCGTGCCCTTGCCCGACGTGGAGCTGCGCCTGCCGCCCCTCTACGGGGACAACCTGGACCAGCACTTCCGCCTCCTGGCCCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCAACTTGCTGTTGCAGGCCCAGCTGCCCCCGAAGCCCCCGGCTTGGGCCTGGGCGGAGGGCTGGACCCGGTACGGCCCCGAGGGGGAGGCCGTACCCGTGGCCATCCCCGAGGAGCGGGCCCTGGTGTTCGACGTGGAGGTCTGCTTGGCAGAGGGAACTTGCCCCACATTGGCGGTGGCCATATCCCCCTCGGCCTGGTATTCCTGGTGCAGCCAGCGGCTGGTGGAAGAGCGTTACTCTTGGACCAGCCAGCTGTCGCCGGCTGA'
Main_ORF= 'ATGAGCCGCCTGCTCTGGAGGAAGGTGGCCGGCGCCACCGTCGGGCCAGGGCCGGTTCCAGCTCCGGGGCGCTGGGTCTCCAGCTCCGTCCCCGCGTCCGACCCCAGCGACGGGCAGCGGCGGCGGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAGCCTCAGCAGCCGCAAGTGCTATCCTCGGAGGGCGGGCAGCTGCGGCACAACCCATTGGACATCCAGATGCTCTCGAGAGGGCTGCACGAGCAAATCTTCGGGCAAGGAGGGGAGATGCCTGGCGAGGCCGCGGTGCGCCGCAGCGTCGAGCACCTGCAGAAGCACGGGCTCTGGGGGCAGCCAGCCGTGCCCTTGCCCGACGTGGAGCTGCGCCTGCCGCCCCTCTACGGGGACAACCTGGACCAGCACTTCCGCCTCCTGGCCCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCAACTTGCTGTTGCAGGCCCAGCTGCCCCCGAAGCCCCCGGCTTGGGCCTGGGCGGAGGGCTGGACCCGGTACGGCCCCGAGGGGGAGGCCGTACCCGTGGCCATCCCCGAGGAGCGGGCCCTGGTGTTCGACGTGGAGGTCTGCTTGGCAGAGGGAACTTGCCCCACATTGGCGGTGGCCATATCCCCCTCGGCCTGGTATTCCTGGTGCAGCCAGCGGCTGGTGGAAGAGCGTTACTCTTGGACCAGCCAGCTGTCGCCGGCTGACCTCATCCCCCTGGAGGTCCCTACTGGTGCCAGCAGCCCCACCCAGAGAGACTGGCAGGAGCAGTTAGTGGTGGGGCACAATGTTTCCTTTGACCGAGCTCATATCAGGGAGCAGTACCTGATCCAGGGTTCCCGCATGCGTTTCCTGGACACCATGAGCATGCACATGGCCATCTCAGGGCTAAGCAGCTTCCAGCGCAGTCTGTGGATAGCAGCCAAGCAGGGCAAACACAAGGTCCAGCCCCCCACAAAGCAAGGCCAGAAGTCCCAGAGGAAAGCCAGAAGAGGCCCAGCGATCTCATCCTGGGACTGGCTGGACATCAGCAGTGTCAACAGTCTGGCAGAGGTGCACAGACTTTATGTAGGGGGGCCTCCCTTAGAGAAGGAGCCTCGAGAACTGTTTGTGAAGGGCACCATGAAGGACATTCGTGAGAACTTCCAGGACCTGATGCAGTACTGTGCCCAGGACGTGTGGGCCACCCATGAGGTTTTCCAGCAGCAGCTACCGCTCTTCTTGGAGAGGTGTCCCCACCCAGTGACTCTGGCCGGCATGCTGGAGATGGGTGTCTCCTACCTGCCTGTCAACCAGAACTGGGAGCGTTACCTGGCAGAGGCACAGGGCACTTATGAGGAGCTCCAGCGGGAGATGAAGAAGTCGTTGATGGATCTGGCCAATGATGCCTGCCAGCTGCTCTCAGGAGAGAGGTACAAAGAAGACCCCTGGCTCTGGGACCTGGAGTGGGACCTGCAAGAATTTAAGCAGAAGAAAGCTAAGAAGGTGAAGAAGGAACCAGCCACAGCCAGCAAGTTGCCCATCGAGGGGGCTGGGGCCCCTGGTGATCCCATGGATCAGGAAGACCTCGGCCCCTGCAGTGAGGAGGAGGAGTTTCAACAAGATGTCATGGCCCGCGCCTGCTTGCAGAAGCTGAAGGGGACCACAGAGCTCCTGCCCAAGCGGCCCCAGCACCTTCCTGGACACCCTGGATGGTACCGGAAGCTCTGCCCCCGGCTAGACGACCCTGCATGGACCCCGGGCCCCAGCCTCCTCAGCCTGCAGATGCGGGTCACACCTAAACTCATGGCACTTACCTGGGATGGCTTCCCTCTGCACTACTCAGAGCGTCATGGCTGGGGCTACTTGGTGCCTGGGCGGCGGGACAACCTGGCCAAGCTGCCGACAGGTACCACCCTGGAGTCAGCTGGGGTGGTCTGCCCCTACAGAGCCATCGAGTCCCTGTACAGGAAGCACTGTCTCGAACAG
GGGAAGCAGCAGCTGATGCCCCAGGAGGCCGGCCTGGCGGAGGAGTTCCTGCTCACTGACAATAGTGCCATATGGCAAACGGTAGAAGAACTGGATTACTTAGAAGTGGAGGCTGAGGCCAAGATGGAGAACTTGCGAGCTGCAGTGCCAGGTCAACCCCTAGCTCTGACTGCCCGTGGTGGCCCCAAGGACACCCAGCCCAGCTATCACCATGGCAATGGACCTTACAACGACGTGGACATCCCTGGCTGCTGGTTTTTCAAGCTGCCTCACAAGGATGGTAATAGCTGTAATGTGGGAAGCCCCTTTGCCAAGGACTTCCTGCCCAAGATGGAGGATGGCACCCTGCAGGCTGGCCCAGGAGGTGCCAGTGGGCCCCGTGCTCTGGAAATCAACAAAATGATTTCTTTCTGGAGGAACGCCCATAAACGTATCAGCTCCCAGATGGTGGTGTGGCTGCCCAGGTCAGCTCTGCCCCGTGCTGTGATCAGGCACCCCGACTATGATGAGGAAGGCCTCTATGGGGCCATCCTGCCCCAAGTGGTGACTGCCGGCACCATCACTCGCCGGGCTGTGGAGCCCACATGGCTCACCGCCAGCAATGCCCGGCCTGACCGAGTAGGCAGTGAGTTGAAAGCCATGGTGCAGGCCCCACCTGGCTACACCCTTGTGGGTGCTGATGTGGACTCCCAAGAGCTGTGGATTGCAGCTGTGCTTGGAGACGCCCACTTTGCCGGCATGCATGGCTGCACAGCCTTTGGGTGGATGACACTGCAGGGCAGGAAGAGCAGGGGCACTGATCTACACAGTAAGACAGCCACTACTGTGGGCATCAGCCGTGAGCATGCCAAAATCTTCAACTACGGCCGCATCTATGGTGCTGGGCAGCCCTTTGCTGAGCGCTTACTAATGCAGTTTAACCACCGGCTCACACAGCAGGAGGCAGCTGAGAAGGCCCAGCAGATGTACGCTGCCACCAAGGGCCTCCGCTGGTATCGGCTGTCGGATGAGGGCGAGTGGCTGGTGAGGGAGTTGAACCTCCCAGTGGACAGGACTGAGGGTGGCTGGATTTCCCTGCAGGATCTGCGCAAGGTCCAGAGAGAAACTGCAAGGAAGTCACAGTGGAAGAAGTGGGAGGTGGTTGCTGAACGGGCATGGAAGGGGGGCACAGAGTCAGAAATGTTCAATAAGCTTGAGAGCATTGCTACGTCTGACATACCACGTACCCCGGTGCTGGGCTGCTGCATCAGCCGAGCCCTGGAGCCCTCGGCTGTCCAGGAAGAGTTTATGACCAGCCGTGTGAATTGGGTGGTACAGAGCTCTGCTGTTGACTACTTACACCTCATGCTTGTGGCCATGAAGTGGCTGTTTGAAGAGTTTGCCATAGATGGGCGCTTCTGCATCAGCATCCATGACGAGGTTCGCTACCTGGTGCGGGAGGAGGACCGCTACCGCGCTGCCCTGGCCTTGCAGATCACCAACCTCTTGACCAGGTGCATGTTTGCCTACAAGCTGGGTCTGAATGACTTGCCCCAGTCAGTCGCCTTTTTCAGTGCAGTCGATATTGACCGGTGCCTCAGGAAGGAAGTGACCATGGATTGTAAAACCCCTTCCAACCCAACTGGGATGGAAAGGAGATACGGGATTCCCCAGGGTGAAGCGCTGGATATTTACCAGATAATTGAACTCACCAAAGGCTCCTTGGAAAAACGAAGCCAGCCTGGACCATAG'
Overlap_ORF = 'ATGAGCCGCCTGCTCTGGAGGAAGGTGGCCGGCGCCACCGTCGGGCCAGGGCCGGTTCCAGCTCCGGGGCGCTGGGTCTCCAGCTCCGTCCCCGCGTCCGACCCCAGCGACGGGCAGCGGCGGCGGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAGCCTCAGCAGCCGCAAGTGCTATCCTCGGAGGGCGGGCAGCTGCGGCACAACCCATTGGACATCCAGATGCTCTCGAGAGGGCTGCACGAGCAAATCTTCGGGCAAGGAGGGGAGATGCCTGGCGAGGCCGCGGTGCGCCGCAGCGTCGAGCACCTGCAGAAGCACGGGCTCTGGGGGCAGCCAGCCGTGCCCTTGCCCGACGTGGAGCTGCGCCTGCCGCCCCTCTACGGGGACAACCTGGACCAGCACTTCCGCCTCCTGGCCCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCAACTTGCTGTTGCAGGCCCAGCTGCCCCCGAAGCCCCCGGCTTGGGCCTGGGCGGAGGGCTGGACCCGGTACGGCCCCGAGGGGGAGGCCGTACCCGTGGCCATCCCCGAGGAGCGGGCCCTGGTGTTCGACGTGGAGGTCTGCTTGGCAGAGGGAACTTGCCCCACATTGGCGGTGGCCATATCCCCCTCGGCCTGGTATTCCTGGTGCAGCCAGCGGCTGGTGGAAGAGCGTTACTCTTGGACCAGCCAGCTGTCGCCGGCTGA'
Nonoverlap_ORF = 'CCTCATCCCCCTGGAGGTCCCTACTGGTGCCAGCAGCCCCACCCAGAGAGACTGGCAGGAGCAGTTAGTGGTGGGGCACAATGTTTCCTTTGACCGAGCTCATATCAGGGAGCAGTACCTGATCCAGGGTTCCCGCATGCGTTTCCTGGACACCATGAGCATGCACATGGCCATCTCAGGGCTAAGCAGCTTCCAGCGCAGTCTGTGGATAGCAGCCAAGCAGGGCAAACACAAGGTCCAGCCCCCCACAAAGCAAGGCCAGAAGTCCCAGAGGAAAGCCAGAAGAGGCCCAGCGATCTCATCCTGGGACTGGCTGGACATCAGCAGTGTCAACAGTCTGGCAGAGGTGCACAGACTTTATGTAGGGGGGCCTCCCTTAGAGAAGGAGCCTCGAGAACTGTTTGTGAAGGGCACCATGAAGGACATTCGTGAGAACTTCCAGGACCTGATGCAGTACTGTGCCCAGGACGTGTGGGCCACCCATGAGGTTTTCCAGCAGCAGCTACCGCTCTTCTTGGAGAGGTGTCCCCACCCAGTGACTCTGGCCGGCATGCTGGAGATGGGTGTCTCCTACCTGCCTGTCAACCAGAACTGGGAGCGTTACCTGGCAGAGGCACAGGGCACTTATGAGGAGCTCCAGCGGGAGATGAAGAAGTCGTTGATGGATCTGGCCAATGATGCCTGCCAGCTGCTCTCAGGAGAGAGGTACAAAGAAGACCCCTGGCTCTGGGACCTGGAGTGGGACCTGCAAGAATTTAAGCAGAAGAAAGCTAAGAAGGTGAAGAAGGAACCAGCCACAGCCAGCAAGTTGCCCATCGAGGGGGCTGGGGCCCCTGGTGATCCCATGGATCAGGAAGACCTCGGCCCCTGCAGTGAGGAGGAGGAGTTTCAACAAGATGTCATGGCCCGCGCCTGCTTGCAGAAGCTGAAGGGGACCACAGAGCTCCTGCCCAAGCGGCCCCAGCACCTTCCTGGACACCCTGGATGGTACCGGAAGCTCTGCCCCCGGCTAGACGACCCTGCATGGACCCCGGGCCCCAGCCTCCTCAGCCTGCAGATGCGGGTCACACCTAAACTCATGGCACTTACCTGGGATGGCTTCCCTCTGCACTACTCAGAGCGTCATGGCTGGGGCTACTTGGTGCCTGGGCGGCGGGACAACCTGGCCAAGCTGCCGACAGGTACCACCCTGGAGTCAGCTGGGGTGGTCTGCCCCTACAGAGCCATCGAGTCCCTGTACAGGAAGCACTGTCTCGAACAGGGGAAGCAGCAGCTGATGCCCCAGGAGGCCGGCCTGGCGGAGGAGTTCCTGCTCACTGACAATAGTGCCATATGGCAAACGGTAGAAGAACTGGATTACTTAGAAGTGGAGGCTGAGGCCAAGATGGAGAACTTGCGAGCTGCAGTGCCAGGTCAACCCCTAGCTCTGACTGCCCGTGGTGGCCCCAAGGACACCCAGCCCAGCTATCACCATGGCAATGGACCTTACAACGACGTGGACATCCCTGGCTGCTGGTTTTTCAAGCTGCCTCACAAGGATGGTAATAGCTGTAATGTGGGAAGCCCCTTTGCCAAGGACTTCCTGCCCAAGATGGAGGATGGCACCCTGCAGGCTGGCCCAGGAGGTGCCAGTGGGCCCCGTGCTCTGGAAATCAACAAAATGATTTCTTTCTGGAGGAACGCCCATAAACGTATCAGCTCCCAGATGGTGGTGTGGCTGCCCAGGTCAGCTCTGCCCCGTGCTGTGATCAGGCACCCCGACTATGATGAGGAAGGCCTCTATGGGGCCATCCTGCCCCAAGTGGTGACTGCCGGCACCATCACTCGCCGGGCTGTGGAGCCCACATGGCTCACCGCCAGCAATGCCCGGCCTGACCGAGTAGGCAGTGAGTTGAAAGCCATGGTGCAGGCCCCACCTGGCTACACCCTTGTGGGTGCTGATGTGGACTCCCAAGAGCTGTGGATTGCAGCTGTGCTTGGAG
ACGCCCACTTTGCCGGCATGCATGGCTGCACAGCCTTTGGGTGGATGACACTGCAGGGCAGGAAGAGCAGGGGCACTGATCTACACAGTAAGACAGCCACTACTGTGGGCATCAGCCGTGAGCATGCCAAAATCTTCAACTACGGCCGCATCTATGGTGCTGGGCAGCCCTTTGCTGAGCGCTTACTAATGCAGTTTAACCACCGGCTCACACAGCAGGAGGCAGCTGAGAAGGCCCAGCAGATGTACGCTGCCACCAAGGGCCTCCGCTGGTATCGGCTGTCGGATGAGGGCGAGTGGCTGGTGAGGGAGTTGAACCTCCCAGTGGACAGGACTGAGGGTGGCTGGATTTCCCTGCAGGATCTGCGCAAGGTCCAGAGAGAAACTGCAAGGAAGTCACAGTGGAAGAAGTGGGAGGTGGTTGCTGAACGGGCATGGAAGGGGGGCACAGAGTCAGAAATGTTCAATAAGCTTGAGAGCATTGCTACGTCTGACATACCACGTACCCCGGTGCTGGGCTGCTGCATCAGCCGAGCCCTGGAGCCCTCGGCTGTCCAGGAAGAGTTTATGACCAGCCGTGTGAATTGGGTGGTACAGAGCTCTGCTGTTGACTACTTACACCTCATGCTTGTGGCCATGAAGTGGCTGTTTGAAGAGTTTGCCATAGATGGGCGCTTCTGCATCAGCATCCATGACGAGGTTCGCTACCTGGTGCGGGAGGAGGACCGCTACCGCGCTGCCCTGGCCTTGCAGATCACCAACCTCTTGACCAGGTGCATGTTTGCCTACAAGCTGGGTCTGAATGACTTGCCCCAGTCAGTCGCCTTTTTCAGTGCAGTCGATATTGACCGGTGCCTCAGGAAGGAAGTGACCATGGATTGTAAAACCCCTTCCAACCCAACTGGGATGGAAAGGAGATACGGGATTCCCCAGGGTGAAGCGCTGGATATTTACCAGATAATTGAACTCACCAAAGGCTCCTTGGAAAAACGAAGCCAGCCTGGACCATAG'
ORFY_only_nonoverlap = 'CTGGAGCCCAAAGCCAGGTGTTCTGACTCCCAGCGTGGGGGTCCCTGCACCAACC'

In [42]:
Riboseq_dict = processRiboSeqcsv(riboseq_table)

In [51]:
subsequence_dict = queryRiboSeq(Main_ORF,Riboseq_dict)

335
4055
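
The two printed numbers are the 1-based start of the match and the position one past its last nucleotide, so the queried region spans 4055 - 335 = 3720 nt, or 1240 codons including the stop. A start at 335 is also consistent with the 334-nucleotide 5'-UTR noted earlier. The coordinate convention can be sketched as follows; `locate_unique` is a hypothetical helper and the transcript is made up.

```python
def locate_unique(full_seq, query):
    """Return (start, end) of `query` in `full_seq`, where start is 1-based
    and end is one past the last matched nucleotide (also 1-based).
    Raises ValueError unless the query occurs exactly once."""
    first = full_seq.find(query)
    if first == -1:
        raise ValueError("query not found")
    if full_seq.find(query, first + 1) != -1:
        raise ValueError("query occurs more than once")
    start = first + 1
    return start, start + len(query)


toy_transcript = "GGGCCATGAAATAGTT"  # made-up sequence
start, end = locate_unique(toy_transcript, "ATGAAATAG")
print(start, end, end - start)
```

The difference `end - start` always equals the query length, which gives a quick check on any pair of printed coordinates.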


In [52]:
printRiboSeqAnalysis(subsequence_dict)

For the sequence ATGAGCCGCCTGCTCTGGAGGAAGGTGGCCGGCGCCACCGTCGGGCCAGGGCCGGTTCCAGCTCCGGGGCGCTGGGTCTCCAGCTCCGTCCCCGCGTCCGACCCCAGCGACGGGCAGCGGCGGCGGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAGCCTCAGCAGCCGCAAGTGCTATCCTCGGAGGGCGGGCAGCTGCGGCACAACCCATTGGACATCCAGATGCTCTCGAGAGGGCTGCACGAGCAAATCTTCGGGCAAGGAGGGGAGATGCCTGGCGAGGCCGCGGTGCGCCGCAGCGTCGAGCACCTGCAGAAGCACGGGCTCTGGGGGCAGCCAGCCGTGCCCTTGCCCGACGTGGAGCTGCGCCTGCCGCCCCTCTACGGGGACAACCTGGACCAGCACTTCCGCCTCCTGGCCCAGAAGCAGAGCCTGCCCTACCTGGAGGCGGCCAACTTGCTGTTGCAGGCCCAGCTGCCCCCGAAGCCCCCGGCTTGGGCCTGGGCGGAGGGCTGGACCCGGTACGGCCCCGAGGGGGAGGCCGTACCCGTGGCCATCCCCGAGGAGCGGGCCCTGGTGTTCGACGTGGAGGTCTGCTTGGCAGAGGGAACTTGCCCCACATTGGCGGTGGCCATATCCCCCTCGGCCTGGTATTCCTGGTGCAGCCAGCGGCTGGTGGAAGAGCGTTACTCTTGGACCAGCCAGCTGTCGCCGGCTGACCTCATCCCCCTGGAGGTCCCTACTGGTGCCAGCAGCCCCACCCAGAGAGACTGGCAGGAGCAGTTAGTGGTGGGGCACAATGTTTCCTTTGACCGAGCTCATATCAGGGAGCAGTACCTGATCCAGGGTTCCCGCATGCGTTTCCTGGACACCATGAGCATGCACATGGCCATCTCAGGGCTAAGCAGCTTCCAGCGCAGTCTGTGGATAGCAGCCAAGCAGGGCAAACACAAGGTCCAGCCCCCCACAAAGCA