## Finding index and length of frameshift

08/04/2016: this code goes over the exon sequences I received from Shilpa, which correspond to the HMMER results.
In some cases Shilpa noted a frameshift, meaning that one or two bases were missing at a particular location and need to be
added in order to get the correct translation to amino acids.
The code here finds all those cases and retrieves the filename along with the frameshift index and the length (1 or 2) of the bases
that need to be added for the correct translation.

The code then saves the results as a pickled dictionary called "exons_index_length.pik",
and also as a readable tab-separated table called "exons_index_length_table.csv".
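The header parsing done in the main loop can be sketched as a standalone function. This is a minimal sketch, assuming the header layout implied by the code in this notebook: a location string that starts at "GRCh37", ends just before "length", and marks a frameshift with a first segment that has a leading "-". The function name `parse_frameshift_index` and the sample header are hypothetical and are not taken from Shilpa's files.

```python
def parse_frameshift_index(header):
    """Return the frameshift index from a header line, or -1 if none is marked."""
    start = header.find("GRCh37")
    end = header.find("length")
    if start < 0 or end < 0:
        return -1
    location = header[start:end].strip()
    # Strip the complement(...) and join(...) wrappers, if present
    for wrapper in ("complement(", "join("):
        pos = location.find(wrapper)
        if pos >= 0:
            location = location[pos + len(wrapper):-1]
    first = location.split(",")[0]
    # A frameshift is assumed to appear as a first segment such as "-227..228"
    if first.startswith("-") and ".." in first:
        return int(first[1:first.find("..")])
    return -1


# Hypothetical header line, made up for illustration only
example = "chromosome:GRCh37:3:join(-227..228,136581050..136581200) length=153"
frameshift_index = parse_frameshift_index(example)
```

Returning -1 for headers without a marked frameshift matches the sentinel the main loop uses before it decides whether to record a file.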

In [123]:
import fileinput
import sys
import pickle
import pandas as pd

from IPython.core.display import HTML
HTML("<style>.container { width:100% !important; }</style>")

In [2]:
curr_dir = !pwd
my_path = curr_dir[0]+"/from_shilpa/exons_seqs/"

In [4]:
chromosome_names = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"]

In [132]:
index_list = []
length_list = []
filename_list = []
bps_list = []
frameshifts_dict = {}
for chrom in chromosome_names:
    chrom_dir = my_path+chrom+"/"
    chrom_files = !ls $chrom_dir
    for gene_dir in chrom_files:
        exons_files = !ls $chrom_dir$gene_dir
        for f in exons_files:
            index = -1
            for line in fileinput.input(chrom_dir+"/"+gene_dir+"/"+f):
                
                #Getting exons data from the first line
                if (line.find("chromosome") >= 0):
                    chrom_raw_data = line[line.find("GRCh37"):line.find("length")-1]
                    #Removing the complement brackets if they exist
                    if (chrom_raw_data.find("complement(") >= 0):
                        chrom_raw_data = chrom_raw_data[chrom_raw_data.find("complement(")+11:-1]
                    #Removing the join brackets if they exist
                    if (chrom_raw_data.find("join(") >= 0):
                        chrom_raw_data = chrom_raw_data[chrom_raw_data.find("join(")+5:-1]

                    exons_list = chrom_raw_data.split(",")
                    if (exons_list[0][0] == "-"):
                        index = int(exons_list[0][1:exons_list[0].find("..")])
                        continue
                
                #Getting the frameshift length from another line
                if (index > 0 and line[0] == "-"):
                    length = len(line.split("\t")[1][:-1])
                    bps = line.split("\t")[1][:-1]
                    break
                    
        
            #After iterating over all the lines: saving index and length information
            if (index > 0):
                frameshifts_dict[f] = (index, length, bps)
                index_list.append(index)
                length_list.append(length)
                filename_list.append(f)
                bps_list.append(bps)
            
            fileinput.close()
        
    print("Finished Chromosome "+chrom)
    
with open(curr_dir[0]+"/domains_frameshifts/exons_index_length.pik", 'wb') as handle:
    pickle.dump(frameshifts_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

Finished Chromosome 1
Finished Chromosome 2
Finished Chromosome 3
Finished Chromosome 4
Finished Chromosome 5
Finished Chromosome 6
Finished Chromosome 7
Finished Chromosome 8
Finished Chromosome 9
Finished Chromosome 10
Finished Chromosome 11
Finished Chromosome 12
Finished Chromosome 13
Finished Chromosome 14
Finished Chromosome 15
Finished Chromosome 16
Finished Chromosome 17
Finished Chromosome 18
Finished Chromosome 19
Finished Chromosome 20
Finished Chromosome 21
Finished Chromosome 22
Finished Chromosome X
Finished Chromosome Y
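
A quick way to check the saved dictionary is to load it back and inspect it. The sketch below is only an illustration: it round-trips three entries copied from the `frameshifts_dict` output in this notebook through a temporary file (a hypothetical scratch location, not the real `domains_frameshifts` folder) and lists the files whose missing bases are unknown ("N").

```python
import os
import pickle
import tempfile

# A few entries copied from the frameshifts_dict output in this notebook
sample = {
    "NCK1.005.exons.txt": (227, 2, "GC"),
    "ATG13.013.exons.txt": (171, 1, "C"),
    "TBL3.001.exons.txt": (1, 2, "NN"),
}

# Hypothetical scratch location; the notebook itself writes to domains_frameshifts/
path = os.path.join(tempfile.mkdtemp(), "exons_index_length.pik")
with open(path, "wb") as handle:
    pickle.dump(sample, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Pickle files must be opened in binary mode for reading as well
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

# Files whose missing bases are all unknown ("N")
unknown = sorted(name for name, (index, length, bps) in loaded.items()
                 if set(bps) == {"N"})
```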


In [117]:
fileinput.close()

In [130]:
frameshifts_table = pd.DataFrame([filename_list, index_list, length_list, bps_list])
frameshifts_table = frameshifts_table.transpose()
frameshifts_table.columns = ["filename", "index", "length", "base-pairs"]
frameshifts_table.to_csv(curr_dir[0]+"/domains_frameshifts/exons_index_length_table.csv", sep='\t')
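
Although the table file has a ".csv" extension, it is written with `sep='\t'`, so anything reading it back has to use a tab delimiter. A minimal sketch using only the standard library follows; the helper name `read_frameshift_table` and the two-row sample text are hypothetical, and the sample assumes the layout pandas writes by default (an unnamed first column holding the row index).

```python
import csv
import io


def read_frameshift_table(handle):
    """Read the tab-separated table into {filename: (index, length, bps)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        table[row["filename"]] = (int(row["index"]),
                                  int(row["length"]),
                                  row["base-pairs"])
    return table


# Hypothetical sample in the assumed layout; values copied from frameshifts_dict
sample_text = (
    "\tfilename\tindex\tlength\tbase-pairs\n"
    "0\tNCK1.005.exons.txt\t227\t2\tGC\n"
    "1\tATG13.013.exons.txt\t171\t1\tC\n"
)
table = read_frameshift_table(io.StringIO(sample_text))
```

For the real file, the same function can be given a handle from `open(path, newline="")` in Python 3.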

In [133]:
frameshifts_dict

{'TTC3.009.exons.txt': (2342, 3, 'AAA'),
 'NCK1.005.exons.txt': (227, 2, 'GC'),
 'TXNRD1.019.exons.txt': (160, 3, 'GGT'),
 'ATG13.013.exons.txt': (171, 1, 'C'),
 'TBL3.001.exons.txt': (1, 2, 'NN'),
 'PRSS16.005.exons.txt': (1, 2, 'NN'),
 'PCM1.005.exons.txt': (1, 1, 'N'),
 'AP1M1.009.exons.txt': (316, 3, 'CAG'),
 'ACTR10.005.exons.txt': (1, 2, 'NN'),
 'TMEM144.004.exons.txt': (262, 3, 'ACC'),
 'DNAJB2.004.exons.txt': (446, 2, 'AT'),
 'DDX11.003.exons.txt': (675, 1, 'T'),
 'CRISPLD2.003.exons.txt': (57, 1, 'A'),
 'NAALAD2.003.exons.txt': (266, 2, 'AG'),
 'CAMKMT.005.exons.txt': (280, 1, 'N'),
 'MORF4L1.013.exons.txt': (223, 3, 'GAG'),
 'TRPV2.005.exons.txt': (1, 2, 'NN'),
 'MYCBPAP.007.exons.txt': (1, 2, 'NN'),
 'FAM234A.005.exons.txt': (1, 1, 'N'),
 'P2RX6.002.exons.txt': (1, 2, 'NN'),
 'CDK14.009.exons.txt': (383, 2, 'CA'),
 'NUP35.004.exons.txt': (161, 2, 'GT'),
 'COG1.006.exons.txt': (1, 2, 'NN'),
 'TENM3.002.exons.txt': (512, 2, 'AG'),
 'C12orf65.005.exons.txt': (110, 2, 'CT'),
 'P