# 4. Data preprocessing - ORF and transcript characteristics

The initial data preprocessing steps calculate the relevant data and organize them in a format that is easy to analyze later: tab- and comma-separated files compatible with the DataFrame format ("\_main.df" and "\_profile.df"). These can be read easily in either R or Python (using the pandas package).
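As a rough illustration of reading such a file back in Python, the sketch below parses a tiny in-memory stand-in for a "\_main.df" file. The two rows, the transcript IDs and the choice of `literal_eval` as a converter for list-valued columns are assumptions made for this example; only the column names come from this notebook.

```python
import io
from ast import literal_eval

import pandas as pd

# Toy stand-in for a tab-separated "_main.df" file (made-up values).
toy_main_df = (
    "Transcript\tnum_uORFs\tuORFs_length\n"
    "tx1\t2\t[30, 45]\n"
    "tx2\t0\t[]\n"
)

# List-valued columns are stored as text, so parse them back into lists.
main = pd.read_csv(
    io.StringIO(toy_main_df),
    sep="\t",
    index_col="Transcript",
    converters={"uORFs_length": literal_eval},
)

print(main.loc["tx1", "uORFs_length"])
```

Columns that hold one value per transcript (such as `num_uORFs`) need no converter.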

This data preprocessing step collates ORF characteristics (sequence features, RNA-seq and ribosome profiling data) for downstream analyses.

ORF parameters are fully described later, but include the initiation context Weighted Relative ENTropy (WRENT) score, secondary structure, length, number of reads and position within the transcript.

Before ORF parameters can be calculated, the position specific scoring matrix for determining WRENT scores must first be constructed.

## Dataset

Both the WRENT scoring matrix computation and the collation of ORF characteristics take substantial computational time (a few hours) on a desktop computer / laptop. Copies of this notebook may be concurrently opened to run these analyses on multiple datasets (by commenting / uncommenting the cell below), especially for computers with multi-core/multi-thread capabilities.

In [None]:
species = "mm"
stage = "mES"
ASSEMBLY = "GRCm38_ens"

# species = "dr"
# stage = "Shield"
# ASSEMBLY = "Zv9_ens"

# species = "hs"
# stage = "HeLa"
# ASSEMBLY = "GRCh37_ens"

## Imports, File Locations and Parameters

In [None]:
# IMPORTS
import RNA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import corebio
import weblogolib


from Bio import SeqIO
from numpy import log1p, log2
from ast import literal_eval
from pandas import Series, DataFrame

%matplotlib inline

'\_canonical.trpedf' means the .trpedf file was generated from a transcriptome that used the longest transcript with the longest coding sequence (historically the "canonical" transcript in transcript annotations).

Because the ORF_starts, ORF_ends, RPF_csvProfile and CDS columns in the .trpedf files are comma-separated values, **literal_eval** is used to parse them.
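A small sketch of what `literal_eval` does with such fields may help here. The example strings are made up. Note that a field holding a single value parses to a plain integer rather than a one-element tuple, which is why a single-entry correction is applied when the transcripts are read.

```python
from ast import literal_eval

# A comma-separated field with several values parses to a tuple ...
multi = literal_eval("12,48,103")
# ... but a field holding a single value parses to a plain int.
single = literal_eval("12")

# Wrapping a lone value in a tuple gives both cases the same shape.
if isinstance(single, int):
    single = (single,)

print(multi, single)
```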

The corresponding FASTA file is indexed using Biopython's SeqIO module.

In [None]:
# FILENAME PARAMETERS
DATA_DIR = "./data/" + species + "/"
ANNOTATIONS_DIR = "./annotations/"
TRPE_FILE = DATA_DIR + stage + "_canonical.trpedf"
CONVERTERS = {i:literal_eval for i in ("ORF_starts", "ORF_ends", "RPF_csvProfile", "CDS")}

FASTA_FILE = ANNOTATIONS_DIR + ASSEMBLY + "_genes_canonical.fasta"
SEQS = SeqIO.index(FASTA_FILE, "fasta")

Further parameters are defined below. Transcripts are required to have a minimum 5' UTR length and RNA-seq expression value (in FPKM) to be used in determining the weighted relative entropy (WRENT) scoring matrix.

The size/position of the initiation context used for WRENT scoring and secondary structure prediction are defined around the start codon and start position respectively.

When reads over ORFs are considered, a number of positions upstream of the stop are discarded, as determined from biases detected in metagene profiles.

For plotting ribosome profiling reads around the start and stop codon of a metagene (for quality control purposes), minimum CDS and UTR lengths are required.

In [None]:
# OTHER PARAMETERS
UTR5_LENGTH_MIN = 25
FPKM_MIN = 1

WRENT_CONTEXT = (10, 10) # BEFORE START CODON, AFTER START CODON
SS_CONTEXT_L = (25, 10)     # BEFORE ORF START, AFTER ORF START
SS_CONTEXT_R = (-1, 36)

ORF_END_TRIM = 10 # NUMBER OF POSITIONS BEFORE uORF END TO
                  # DISCARD READS DUE TO ARTIFACTUAL EXPERIMENTAL
                  # ACCUMULATION OF RPF READS AT uORF END.

uORF_LENGTH_MIN = 21
uORF_FROM_TSS_MIN = 25

NT_COLOR = ('#00d700', '#df1f00', '#0226cc', '#ffb700')

The helper functions below retrieve the sequences of initiation contexts used for weighted Kozak scoring and RNA secondary structure prediction, calculate the translational efficiency (TE) of an ORF, and compute the weighted Kozak score of a sequence.

In [None]:
def wrent_context_seq(seq, ORF, WRENT_CONTEXT):
    '''Takes transcript sequence and ORF (in transcript coordinates),
    along with weighted Kozak context size as inputs,
    returns sequence'''
    left_flank, right_flank = WRENT_CONTEXT
    seq_to_return = str(seq[ORF[0] - left_flank:ORF[0]] + seq[ORF[0] + 3: ORF[0] + right_flank + 3])
    if len(seq_to_return) == left_flank + right_flank:
        return seq_to_return
    else: return "A"

def ss_context_seq(seq, ORF, SS_CONTEXT):
    '''Takes transcript sequence and ORF (in transcript coordinates),
    along with secondary structure context size as inputs,
    returns sequence'''
    left_flank, right_flank = SS_CONTEXT
    seq_to_return = str(seq[ORF[0] - left_flank : ORF[0] + right_flank])
    if len(seq_to_return) == left_flank + right_flank:
        return seq_to_return
    else: return "A"

def TE(RPF_csvProfile, ORF, expression, ORF_END_TRIM):
    '''Normalizes the density of ribosome profiling reads over ORF
    by the expression of the transcript'''
    trimmed_ORF_RPF_reads = sum(RPF_csvProfile[ORF[0]:ORF[1] - ORF_END_TRIM])
    trimmed_ORF_length = max(((ORF[1] - ORF[0]) - ORF_END_TRIM), 1)
    return float(trimmed_ORF_RPF_reads) / trimmed_ORF_length / expression

def wrent_score_seq(en_score, seq_to_score):
    '''Scores a sequence based on weighted entropy score matrix'''
    return sum([en_score[pos][nt] for pos, nt in enumerate(seq_to_score)])
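To make the slicing in these helpers concrete, here is a worked toy example that repeats the same arithmetic inline. The sequence, read counts and expression value are invented, and the variable names are illustrative rather than part of the notebook.

```python
# Toy transcript: 10 nt leader, ATG start codon, 10 nt downstream.
toy_seq = "AAAAGCCACC" + "ATG" + "GCGTTTAAAC"
orf_start = 10
left_flank, right_flank = 10, 10

# Flanks on either side of the start codon; the codon itself is skipped.
context = (toy_seq[orf_start - left_flank:orf_start]
           + toy_seq[orf_start + 3:orf_start + 3 + right_flank])

# Translational efficiency: reads per position over the trimmed ORF,
# divided by transcript expression.
toy_profile = [0] * 10 + [4] * 30 + [9] * 10   # made-up read counts
toy_cds = (10, 50)                              # (start, end)
end_trim = 10                                   # positions dropped before the ORF end
expression = 2.0                                # made-up FPKM
reads = sum(toy_profile[toy_cds[0]:toy_cds[1] - end_trim])
length = max((toy_cds[1] - toy_cds[0]) - end_trim, 1)
te = float(reads) / length / expression

print(context, te)
```

The last ten positions of the toy CDS carry an exaggerated pile-up of reads; trimming them keeps that pile-up out of the TE estimate.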

In [None]:
def UTR5_uORF_segments(CDS, uORFs):
    is_uORF = [False for i in xrange(int(CDS[0]))]
    for uORF in uORFs:
        for pos in xrange(uORF[0], min(uORF[1], CDS[0])):
            is_uORF[pos] = True
    return is_uORF
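The function above builds a per-position mask of the 5' UTR that marks which positions are covered by a uORF. The sketch below restates the same idea with `range` so that it also runs under Python 3. The name `utr5_uorf_mask` and the coordinates are hypothetical.

```python
def utr5_uorf_mask(cds, uorfs):
    """Return one boolean per 5' UTR position: True where a uORF covers it."""
    mask = [False] * int(cds[0])
    for uorf_start, uorf_end in uorfs:
        # uORFs that run into the CDS are clipped at the CDS start.
        for pos in range(uorf_start, min(uorf_end, cds[0])):
            mask[pos] = True
    return mask

toy_cds = (12, 60)
toy_uorfs = [(2, 8), (9, 21)]   # second uORF overlaps the CDS
mask = utr5_uorf_mask(toy_cds, toy_uorfs)
covered = sum(mask)             # number of 5' UTR positions inside uORFs
print(covered)
```

Summing the mask gives the total uORF-covered length of the 5' UTR, and zipping it with a read profile selects the reads that fall within uORFs.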

## Calculating weighted relative entropy (WRENT) scoring matrix

Using translational efficiencies as weights, we determine the sequence motif for efficient translation initiation based on information content, with the nucleotide frequencies at the initiation context as background.
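Written out, the scores computed in the cells that follow take roughly the form below. The symbols are notation introduced here for illustration and are not names used in the notebook: $x_{t,i}$ is the nucleotide at context position $i$ of transcript $t$, $c_{i,b}$ is the unweighted count of nucleotide $b$ at position $i$, and $b$ ranges over A, T, C and G.

```latex
w_t = \ln\left(1 + \mathrm{TE}_t\right), \qquad
p_{i,b} = \frac{\sum_t w_t \, [x_{t,i} = b]}{\sum_{b'} \sum_t w_t \, [x_{t,i} = b']}, \qquad
q_b = \frac{\sum_i c_{i,b}}{\sum_{b'} \sum_i c_{i,b'}}, \qquad
S_{i,b} = \log_2 \frac{p_{i,b}}{q_b}
```

The unweighted matrix uses the same expression with $w_t = 1$, and a sequence is scored by summing $S_{i,x_i}$ over its positions.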

To do this, we read in the frequencies (weighted and unweighted) of nucleotides at each position independently. Position-specific scoring matrices (PSSMs) are initialized as pandas DataFrames.

A file iterator is created to read the .trpedf file line-by-line, and a list is initialized to collect the transcript IDs, from which the number of transcripts is later counted.

In [None]:
# INITIALIZE PSSMS FOR RELATIVE ENTROPY SCORING MATRICES
en_pssm = DataFrame(index=[nt for nt in 'ATCGN'], columns=range(sum(WRENT_CONTEXT)))
en_pssm.fillna(0., inplace=True)
en_pssm_unweighted = DataFrame(index=[nt for nt in 'ATCGN'], columns=range(sum(WRENT_CONTEXT)))
en_pssm_unweighted.fillna(0., inplace=True)

uORF_en_pssm = DataFrame(index=[nt for nt in 'ATCGN'], columns=range(sum(WRENT_CONTEXT)))
uORF_en_pssm.fillna(0., inplace=True)
uORF_en_pssm_unweighted = DataFrame(index=[nt for nt in 'ATCGN'], columns=range(sum(WRENT_CONTEXT)))
uORF_en_pssm_unweighted.fillna(0., inplace=True)

In [None]:
# INITIALIZE FILE ITERATOR, LIST OF TRANSCRIPTS
trpedf_file_iterator = pd.read_table(TRPE_FILE, converters=CONVERTERS, chunksize=1)
transcript_list = []
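The sketch below shows this one-row-per-chunk reading pattern on a toy in-memory table. The rows are invented, and `pd.read_csv(..., sep="\t")` is used in place of `pd.read_table` purely for the example. Depending on the pandas version, the row labels of successive chunks may continue from the previous chunk rather than restart at 0, so positional access with `.iloc[0]` is assumed here to be the safer way to fetch the single row.

```python
import io
from ast import literal_eval

import pandas as pd

# Toy stand-in for a .trpedf file: one transcript per line (made-up values).
toy_trpedf = (
    "Transcript\tORF_starts\tORF_ends\n"
    "tx1\t5,40\t35,100\n"
    "tx2\t12\t60\n"
)
converters = {"ORF_starts": literal_eval, "ORF_ends": literal_eval}

orfs_by_transcript = {}
for chunk in pd.read_csv(io.StringIO(toy_trpedf), sep="\t",
                         converters=converters, chunksize=1):
    row = chunk.iloc[0]                      # the single row in this chunk
    starts, ends = row["ORF_starts"], row["ORF_ends"]
    if not isinstance(starts, tuple):        # single ORF: wrap in a tuple
        starts, ends = (starts,), (ends,)
    orfs_by_transcript[row["Transcript"]] = list(zip(starts, ends))

print(orfs_by_transcript)
```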

The .trpedf file is read line-by-line. ORF starts and ends are paired into ORFs, and uORFs are defined as ORFs beginning before the CDS.

If the transcript passes the filter parameters defined earlier (minimum 5' UTR length and RNA-seq expression), the initiation context sequence is added to the PSSMs, either unweighted (i.e. with a weight of 1) or weighted by the translational efficiency of the transcript's CDS.

In [None]:
# ITERATES OVER TRANSCRIPTS IN .trpedf FILE

CDS_motif_count = 0
uORF_motif_count = 0

for transcript in trpedf_file_iterator:
    transcript_list.append(transcript["Transcript"][0])
    
    seq = SEQS[transcript["Transcript"][0]].seq        # get transcript sequence
    
    expression = transcript["Gene_Expression_FPKM"][0]
    RPF_csvProfile = transcript["RPF_csvProfile"][0]
    CDS = transcript["CDS"][0]

    ORF_starts = transcript["ORF_starts"][0]           
    ORF_ends = transcript["ORF_ends"][0]
    if type(ORF_starts) is np.int64:                   # corrects for single-entry
        ORF_starts = (ORF_starts,)
        ORF_ends = (ORF_ends,)
    ORFs = zip(ORF_starts, ORF_ends)                   # zips starts and stops into ORF
    uORFs = [ORF for ORF in ORFs if (ORF[0] < CDS[0])] # uORFs defined as beginning before CDS

    if CDS[0] >= UTR5_LENGTH_MIN and expression >= FPKM_MIN:
        if len(uORFs) == 0:                            # filter for transcripts with
                                                       # minimum 5' UTR length and expression
            weight = log1p(TE(RPF_csvProfile, CDS, expression, ORF_END_TRIM))
            context_seq = wrent_context_seq(seq, CDS, WRENT_CONTEXT)
            for pos1, nt1 in enumerate(context_seq):
                en_pssm[pos1][nt1] += weight
                en_pssm_unweighted[pos1][nt1] += 1
            CDS_motif_count += 1
        
        elif (len(uORFs) == 1)\
        and (uORFs[0][1] < CDS[0])\
        and (uORFs[0][1] - uORFs[0][0] > uORF_LENGTH_MIN)\
        and (uORFs[0][0] > uORF_FROM_TSS_MIN):             # 1 uORF, non-overlapping, min uORF length, min distance from TSS
            weight = log1p(TE(RPF_csvProfile, uORFs[0], expression, ORF_END_TRIM))
            context_seq = wrent_context_seq(seq, uORFs[0], WRENT_CONTEXT)
            for pos1, nt1 in enumerate(context_seq):
                uORF_en_pssm[pos1][nt1] += weight
                uORF_en_pssm_unweighted[pos1][nt1] += 1
            uORF_motif_count += 1
        
transcript_count = len(transcript_list)
print "%g transcripts used for CDS WRENT motif" % CDS_motif_count
print "%g transcripts used for uORF WRENT motif" % uORF_motif_count

We first determine relative entropy: unweighted, then weighted for TE.

'N's are ignored.

In [None]:
# IGNORE N
en_pssm_unweighted.drop('N', inplace=True)
en_pssm.drop('N', inplace=True)
uORF_en_pssm_unweighted.drop('N', inplace=True)
uORF_en_pssm.drop('N', inplace=True)

PSSMs are stored for later.

In [None]:
en_pssm_unweighted.to_csv(DATA_DIR + stage + "_pssm_unweighted.df", sep="\t")
en_pssm.to_csv(DATA_DIR + stage + "_pssm.df", sep="\t")

In [None]:
uORF_en_pssm_unweighted.to_csv(DATA_DIR + stage + "_pssm_uORF_unweighted.df", sep="\t")
uORF_en_pssm.to_csv(DATA_DIR + stage + "_pssm_uORF.df", sep="\t")

For calculating both weighted and unweighted relative entropies, the background nucleotide distribution is taken from the total nucleotide frequencies of the unweighted PSSM.

In [None]:
en_pssm = DataFrame.from_csv(DATA_DIR + stage + "_pssm.df", sep="\t")
en_pssm_unweighted = DataFrame.from_csv(DATA_DIR + stage + "_pssm_unweighted.df", sep="\t")
uORF_en_pssm = DataFrame.from_csv(DATA_DIR + stage + "_pssm_uORF.df", sep="\t")
uORF_en_pssm_unweighted = DataFrame.from_csv(DATA_DIR + stage + "_pssm_uORF_unweighted.df", sep="\t")

In [None]:
en_unweighted_nt_freq = en_pssm_unweighted.mean(axis='columns')  # background model is the unweighted PSSM's total nucleotide freq
en_unweighted_nt_prob = en_unweighted_nt_freq / en_unweighted_nt_freq.sum(axis='rows')

en_pssm_unweighted_prob = en_pssm_unweighted / en_pssm_unweighted.sum(axis="rows")
en_unweighted_score = en_pssm_unweighted_prob.divide(en_unweighted_nt_prob, axis='rows').apply(log2)

en_pssm_prob = en_pssm / en_pssm.sum(axis='rows')             # normalized nucleotide probabilities
en_score = en_pssm_prob.divide(en_unweighted_nt_prob, axis='rows').apply(log2)
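As a small numerical illustration of the same steps, the sketch below builds a score matrix from a made-up two-position count table and scores a two-nucleotide sequence against it. The counts and variable names are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Toy counts for a 2-position context (rows: nucleotides, columns: positions).
counts = pd.DataFrame({0: [6.0, 2.0, 1.0, 1.0], 1: [2.0, 2.0, 4.0, 2.0]},
                      index=list("ATCG"))

# Background: overall nucleotide frequencies pooled across positions.
background = counts.sum(axis=1) / counts.values.sum()

# Per-position probabilities, then log2 odds against the background.
position_prob = counts / counts.sum(axis=0)
score = np.log2(position_prob.divide(background, axis=0))

# Score a sequence by summing the entry for its nucleotide at each position.
seq_score = sum(score[pos][nt] for pos, nt in enumerate("AC"))
print(round(seq_score, 3))
```

Nucleotides that are enriched at a position relative to the background get positive entries, and depleted ones get negative entries.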

Stores the scoring matrices (weighted and unweighted) for later.

In [None]:
en_score.to_csv(DATA_DIR + stage + "_en_score.df", sep="\t")
en_unweighted_score.to_csv(DATA_DIR + stage + "_en_unweighted_score.df", sep="\t")

In [None]:
en_score = DataFrame.from_csv(DATA_DIR + stage + "_en_score.df", sep="\t")
en_unweighted_score = DataFrame.from_csv(DATA_DIR + stage + "_en_unweighted_score.df", sep="\t")

en_score.columns = [int(i) for i in en_score.columns]
en_unweighted_score.columns = [int(i) for i in en_unweighted_score.columns]
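The integer conversion above is needed because column labels come back as strings after a round trip through a text file. The sketch below shows this on a toy matrix. It uses `pd.read_csv` rather than `DataFrame.from_csv` (which newer pandas releases no longer provide), and the values are made up.

```python
import io

import pandas as pd

# Toy scoring matrix with integer position labels as columns.
toy_score = pd.DataFrame([[0.5, -1.0], [-0.5, 1.0]],
                         index=["A", "T"], columns=[0, 1])

# Round trip through tab-separated text, as when saving and reloading.
buffer = io.StringIO()
toy_score.to_csv(buffer, sep="\t")
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t", index_col=0)

labels_after_reload = list(reloaded.columns)      # strings, not integers
reloaded.columns = [int(i) for i in reloaded.columns]
print(labels_after_reload, list(reloaded.columns))
```

Without the conversion, a lookup by integer position such as `reloaded[1]["T"]` would fail with a `KeyError`.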

**en_score** is subsequently used to score all ORFs.

## Precalculating weighted relative entropy (WRENT) scores, secondary structure ensemble free energies
For faster downstream analysis, weighted relative entropy (WRENT) scores and the computationally-predicted ensemble free energies of secondary structure around initiation contexts of all ORFs are precalculated and stored.

Ribosome profiling reads contained within ORFs, as well as other ORF and transcript parameters are also determined and stored.

The DataFrame columns are described in the following table:
<table>
<tr><td>**Column**</td><td>**Description**</td></tr>
<tr><td>Transcript</td><td>Transcript ID</td></tr>
<tr><td>Gene</td><td>Gene ID</td></tr>
<tr><td>Gene_Name</td><td>Gene Name</td></tr>
<tr><td>Gene_Expression_FPKM</td><td>Expression at gene level (from corresponding RNA-seq data; Tophat + Cufflinks)</td></tr>

<tr><td>UTR5_length</td><td>Length of 5' UTR</td></tr>
<tr><td>UTR3_length</td><td>Length of 3' UTR</td></tr>
<tr><td>UTR5_reads</td><td>Ribosome profiling reads in 5' UTR</td></tr>
<tr><td>UTR5_reads_trunc</td><td>Ribosome profiling reads in 5' UTR, truncated</td></tr>
<tr><td>UTR5_GC</td><td>Total GC content of 5' leader</td></tr>
<tr><td>num_uORFs</td><td>Number of uORFs (ORFs beginning before CDS)</td></tr>
<tr><td>uORFs_reads</td><td>Number of reads in each uORF, comma-separated values (CSV)</td></tr>
<tr><td>UTR5_uORF_reads</td><td>Number of reads in 5' leader within uORFs</td></tr>
<tr><td>UTR5_uORF_tlength</td><td>Total length of uORFs within 5' leader</td></tr>
<tr><td>uORFs_length</td><td>Length of uORFs, CSV</td></tr>
<tr><td>uORFs_wrent_score</td><td>Calculated WRENT score of uORFs, CSV</td></tr>
<tr><td>uORFs_urent_score</td><td>Calculated URENT (unweighted) score of uORFs, CSV</td></tr>
<tr><td>uORFs_wrent_seq</td><td>Sequences used to calculate WRENT scores, CSV</td></tr>
<tr><td>uORFs_sec_struct_EFE_L</td><td>Computationally predicted ensemble free energy of secondary structure around uORF start left (-25), CSV</td></tr>
<tr><td>uORFs_sec_struct_EFE_R</td><td>Computationally predicted ensemble free energy of secondary structure around uORF start right (+1), CSV</td></tr>
<tr><td>uORFs_start_pos_wrt_tss</td><td>Start position of uORFs with respect to 5' end of transcript, CSV</td></tr>
<tr><td>uORFs_end_pos_wrt_CDS</td><td>End position of uORFs with respect to start of CDS, CSV</td></tr>
<tr><td>CDS_reads</td><td>Number of reads in CDS</td></tr>
<tr><td>CDS_GC</td><td>Total GC content of CDS</td></tr>
<tr><td>CDS_length</td><td>Length of CDS</td></tr>
<tr><td>CDS_wrent_score</td><td>Calculated WRENT score of CDS</td></tr>
<tr><td>CDS_urent_score</td><td>Calculated URENT score of CDS</td></tr>
<tr><td>CDS_wrent_seq</td><td>Sequence used to calculate WRENT score</td></tr>
<tr><td>CDS_sec_struct_EFE_L</td><td>Computationally predicted ensemble free energy of secondary structure around CDS start, left (-25)</td></tr>
<tr><td>CDS_sec_struct_EFE_R</td><td>Computationally predicted ensemble free energy of secondary structure around CDS start, right (+1)</td></tr>
<tr><td>ORFs_5CI3</td><td>Annotation of each ORF within the transcript: begins in the **5**' UTR, is the **C**DS, begins with**I**n the CDS, or begins in the **3**' UTR; string of letters</td></tr>
<tr><td>ORFs_wrent_score</td><td>Calculated WRENT score of ORFs, CSV</td></tr>
<tr><td>ORFs_urent_score</td><td>Calculated URENT score of ORFs, CSV</td></tr>
<tr><td>ORFs_wrent_seq</td><td>Sequences used to calculate WRENT score, CSV</td></tr>
<tr><td>ORFs_sec_struct_EFE_L</td><td>Computationally predicted ensemble free energy of secondary structure around ORF start, left (-25), CSV</td></tr>
<tr><td>ORFs_sec_struct_EFE_R</td><td>Computationally predicted ensemble free energy of secondary structure around ORF start, right (+1), CSV</td></tr>
</table>

In [None]:
INDEX = "Transcript"
COLUMNS = ["Gene", "Gene_Name", "Gene_Expression_FPKM", "UTR5_length", "UTR3_length", "UTR5_reads", "UTR5_reads_trunc",
           "UTR5_GC", "num_uORFs", "uORFs_reads", "UTR5_uORF_reads", "UTR5_uORF_tlength", "uORFs_length",
           "uORFs_wrent_score", "uORFs_urent_score", "uORFs_wrent_seq", "uORFs_sec_struct_EFE_L", "uORFs_sec_struct_EFE_R",
           "uORFs_start_pos_wrt_tss", "uORFs_end_pos_wrt_CDS",
           "CDS_reads", "CDS_length", "CDS_GC",
           "CDS_wrent_score", "CDS_urent_score", "CDS_wrent_seq", "CDS_sec_struct_EFE_L", "CDS_sec_struct_EFE_R",
           "ORFs_5CI3", "ORFs_wrent_score", "ORFs_urent_score", "ORFs_wrent_seq",
           "ORFs_sec_struct_EFE_L", "ORFs_sec_struct_EFE_R"]

Calculated values are first stored in a dictionary, and a DataFrame is constructed afterward.

In [None]:
# INITIALIZE DATAFRAME FOR STORAGE
df_main = DataFrame(columns=COLUMNS)
df_main.index.name = INDEX

# FILE ITERATOR FOR .trpedf
trpedf_file_iterator = pd.read_table(TRPE_FILE, converters=CONVERTERS, chunksize=1)
transcript_count2 = 0      # counter

A counter prints progress to the output every 100 transcripts.

Initiation contexts for weighted Kozak scoring and secondary structure prediction are first determined using the earlier helper functions. uORFs, CDSes, then all ORFs are scored, with ORFs annotated for whether they begin in the 5' UTR, are the CDS, begin within the CDS, or in the 3' UTR.

An additional dictionary stores the profile of ribosome profiling reads around the starts and ends of transcript CDSes, with equal weight to each transcript, to generate a metaprofile.

In [None]:
# ITERATE OVER .trpedf
for transcript in trpedf_file_iterator:
    
    # Counter for iPython Notebook display, every 100 transcripts
    transcript_count2 += 1
    if transcript_count2 % 100 == 0:
        print "%d transcripts read" % transcript_count2
    
    # Reads in data from each transcript
    RPF_csvProfile = transcript["RPF_csvProfile"][0]
    seq = SEQS[transcript["Transcript"][0]].seq     # get transcript sequence
    if "N" in seq: continue
    ORF_starts = transcript["ORF_starts"][0]
    ORF_ends = transcript["ORF_ends"][0]
    if type(ORF_starts) is np.int64:      # corrects for single-entry
        ORF_starts = (ORF_starts,)
        ORF_ends = (ORF_ends,)
    CDS = transcript["CDS"][0]
    ORFs = zip(ORF_starts, ORF_ends)
    uORFs = [ORF for ORF in ORFs if ORF[0] < CDS[0]]  # uORFs defined as beginning before CDS

    # Determines initiation contexts for weighted Kozak scoring of:
    # uORFs
    uORF_wrent_seqs = [wrent_context_seq(seq, uORF, WRENT_CONTEXT) for uORF in uORFs]
        
    # CDS
    CDS_wrent_seq = wrent_context_seq(seq, CDS, WRENT_CONTEXT)
    
    # All ORFs
    ORF_wrent_seqs = [wrent_context_seq(seq, ORF, WRENT_CONTEXT) for ORF in ORFs]
        
    # Determines initiation contexts for secondary structure prediction of:
    # uORFs
    uORF_ss_seqs_l = [ss_context_seq(seq, uORF, SS_CONTEXT_L) for uORF in uORFs]
    uORF_ss_seqs_r = [ss_context_seq(seq, uORF, SS_CONTEXT_R) for uORF in uORFs]
    
    # CDS
    CDS_ss_seq_l = ss_context_seq(seq, CDS, SS_CONTEXT_L)
    CDS_ss_seq_r = ss_context_seq(seq, CDS, SS_CONTEXT_R)
    
    # All ORFs
    ORF_ss_seqs_l = [ss_context_seq(seq, ORF, SS_CONTEXT_L) for ORF in ORFs]
    ORF_ss_seqs_r = [ss_context_seq(seq, ORF, SS_CONTEXT_R) for ORF in ORFs]
    
    # Calculate and store values in main dictionary
    entry = {}
    for j in ("Gene", "Gene_Name", "Gene_Expression_FPKM"):
        entry[j] = transcript[j][0]
    entry["UTR5_length"] = CDS[0]
    entry["UTR3_length"] = len(RPF_csvProfile) - CDS[1]
    entry["UTR5_reads"] = sum(RPF_csvProfile[:CDS[0]])
    # clamp at 0 so a 5' UTR shorter than ORF_END_TRIM does not give a negative slice end
    entry["UTR5_reads_trunc"] = sum(RPF_csvProfile[:max(CDS[0]-ORF_END_TRIM, 0)])
    entry["UTR5_GC"] = 0.0 if CDS[0] == 0 else float(seq[:CDS[0]].count("G") + seq[:CDS[0]].count("C")) / CDS[0]
    entry["num_uORFs"] = len(uORFs)

    entry["uORFs_reads"] = [sum(RPF_csvProfile[uORF[0]:uORF[1]-ORF_END_TRIM]) for uORF in uORFs]
    
    transcript_UTR5_uORF_segments = UTR5_uORF_segments(CDS, uORFs)
    entry["UTR5_uORF_reads"] = sum([i for i, j in zip(RPF_csvProfile, transcript_UTR5_uORF_segments) if j])
    entry["UTR5_uORF_tlength"] = sum(transcript_UTR5_uORF_segments)
    
    entry["uORFs_length"] = [uORF[1] - uORF[0] for uORF in uORFs]
    entry["uORFs_wrent_score"] = [wrent_score_seq(en_score, init_seq) \
                                            for init_seq in uORF_wrent_seqs]
    entry["uORFs_urent_score"] = [wrent_score_seq(en_unweighted_score, init_seq) \
                                            for init_seq in uORF_wrent_seqs]
    entry["uORFs_wrent_seq"] = uORF_wrent_seqs
    entry["uORFs_sec_struct_EFE_L"] = [RNA.pf_fold(init_seq)[1] for init_seq in uORF_ss_seqs_l]
    entry["uORFs_sec_struct_EFE_R"] = [RNA.pf_fold(init_seq)[1] for init_seq in uORF_ss_seqs_r]
    entry["uORFs_start_pos_wrt_tss"] = [uORF[0] for uORF in uORFs]
    entry["uORFs_end_pos_wrt_CDS"] = [uORF[1] - CDS[0] for uORF in uORFs]

    entry["CDS_reads"] = sum(RPF_csvProfile[CDS[0]:CDS[1]-ORF_END_TRIM])
    entry["CDS_GC"] = 0.0 if CDS[0] == CDS[1] else float(seq[CDS[0]:CDS[1]].count("G") \
                                                         + seq[CDS[0]:CDS[1]].count("C")) \
                                                         / (CDS[1] - CDS[0])
    entry["CDS_length"] = CDS[1] - CDS[0]
    entry["CDS_wrent_score"] = wrent_score_seq(en_score, CDS_wrent_seq)
    entry["CDS_urent_score"] = wrent_score_seq(en_unweighted_score, CDS_wrent_seq)
    entry["CDS_wrent_seq"] = CDS_wrent_seq
    entry["CDS_sec_struct_EFE_L"] = RNA.pf_fold(CDS_ss_seq_l)[1]
    entry["CDS_sec_struct_EFE_R"] = RNA.pf_fold(CDS_ss_seq_r)[1]

    # 5CI3 refers to annotation of ORF as beginning in 5' UTR ('5'),
    # being the CDS ('C'),
    # beginning in the CDS ('I'), or
    # beginning in the 3' UTR ('3')
    entry["ORFs_5CI3"] = "".join(['C' if ORF_start == CDS[0] else \
                                  '5' if ORF_start < CDS[0] else \
                                  'I' if CDS[0] < ORF_start < CDS[1] else \
                                  '3' for ORF_start in ORF_starts])
    entry["ORFs_wrent_score"] = [wrent_score_seq(en_score, init_seq) \
                                           for init_seq in ORF_wrent_seqs]
    entry["ORFs_urent_score"] = [wrent_score_seq(en_unweighted_score, init_seq) \
                                           for init_seq in ORF_wrent_seqs]
    entry["ORFs_wrent_seq"] = ORF_wrent_seqs
    entry["ORFs_sec_struct_EFE_L"] = [RNA.pf_fold(init_seq)[1] for init_seq in ORF_ss_seqs_l]
    entry["ORFs_sec_struct_EFE_R"] = [RNA.pf_fold(init_seq)[1] for init_seq in ORF_ss_seqs_r]
    
    df_main.loc[transcript["Transcript"][0]] = Series(entry)
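The 5CI3 annotation used in the loop above can be pictured with a standalone sketch. The function name `annotate_orf_start` and the coordinates are hypothetical. The logic follows the same order of checks as the conditional expression in the loop.

```python
def annotate_orf_start(orf_start, cds):
    """Classify an ORF by where it starts relative to the CDS (start, end)."""
    if orf_start == cds[0]:
        return "C"      # the annotated CDS itself
    if orf_start < cds[0]:
        return "5"      # begins in the 5' UTR
    if orf_start < cds[1]:
        return "I"      # begins inside the CDS
    return "3"          # begins at or after the CDS end

toy_cds = (100, 400)
toy_orf_starts = (20, 100, 250, 450)
annotation = "".join(annotate_orf_start(s, toy_cds) for s in toy_orf_starts)
print(annotation)
```

Each transcript therefore gets one letter per ORF, in the same order as its ORF start positions.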

In [None]:
df_main.to_csv(DATA_DIR + stage + "_main.df", sep="\t")