In this IPython notebook, we integrate the data needed to run SVM/logistic regression, from start to finish. The code depends on quicksect.py.

In [2]:
import os
import glob
import pandas as pd
import numpy as np

<h3>Import data</h3>

Place the data files in their respective folders under data/.  
- CHROMATIN - contains the chromatin mark files, excluding chromM and chromX
- p300 - contains the distal p300 txt file
- eRNA - contains the condensed eRNA txt file, which only has data for the 4 cell types we looked at
- TFBS - contains the sequence data gz file, which is used for all cells

In [4]:
#Set cell name
CELLNAME = "HepG2"        # used to name the output file
CELLNAME_2 = "CNhs12328"  # column of the eRNA file that holds this cell type

In [5]:
#function for importing chromatin marks, to be used under matching intervals
def get_data(filename, chrom):
    # skip the first line of the file; the next line is read as the header
    dat = pd.read_table(filename, compression='gzip', skiprows=1)
    nrows = dat.shape[0]
    dat['chr'] = chrom
    # each row is a 200-bp bin: 1-200, 201-400, ...
    dat['lower'] = np.arange(1, 200*nrows, 200)
    dat['upper'] = np.arange(200, 200*(nrows+1), 200)  # 1-D, one value per row
    return dat
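
The bin coordinates that `get_data` attaches can be checked on a small case. The sketch below is only an illustration in plain Python; the helper name `bin_bounds` and its `width` parameter are hypothetical and not part of the notebook.

```python
def bin_bounds(nrows, width=200):
    """Return 1-based, inclusive (lower, upper) bounds for nrows consecutive bins.

    Hypothetical helper mirroring the np.arange calls in get_data.
    """
    lower = [i * width + 1 for i in range(nrows)]
    upper = [(i + 1) * width for i in range(nrows)]
    return lower, upper

lower, upper = bin_bounds(3)
print(list(zip(lower, upper)))  # [(1, 200), (201, 400), (401, 600)]
```

Both lists have exactly one entry per row, so they can be assigned directly as columns.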

In [6]:
file_name = glob.glob('data/p300/*.txt')[0] # get filename
p300 = pd.read_table(file_name, header=None)
p300.columns = ['chr','lower']
p300['upper'] = p300['lower']+200
p300.head(2)

     chr      lower      upper
0  chr10  100009001  100009201
1  chr10  100022401  100022601


In [7]:
df = pd.read_table('data/eRNA/eRNA_condensed.txt')
eRNA = df[df[CELLNAME_2]>0] # select rows with TPM>0
eRNA = eRNA.iloc[:,0:3]
eRNA.head(2)

  chr   lower   upper
7   1  918449  918555
8   1  936652  937138


In [8]:
tfbs = pd.read_table('data/TFBS/tfbsConsSites.txt.gz', compression = 'gzip', header=None)
tfbs = tfbs.iloc[:, 1:4] # keep columns 1-3 (chromosome, start, end); .ix is no longer available in pandas
tfbs.columns = ['chr','lower','upper']
tfbs.head(2)

    chr   lower   upper
0  chr1  894640  894654
1  chr1  894641  894657


<h3>Match overlapping intervals</h3>
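
The matching cell derives the chromosome label from each chromatin file name: it takes the second underscore-separated field of the base name and drops its first three characters. A minimal sketch of that step follows; the function name `chrom_from_filename` and the example file names are hypothetical, chosen only to fit that pattern.

```python
import os

def chrom_from_filename(path):
    """Extract the chromosome label from a path such as '.../<cell>_chr10_<rest>'.

    Hypothetical helper; assumes the second '_'-separated field starts with 'chr'.
    """
    base = os.path.basename(path)  # same as os.path.split(path)[1]
    field = base.split("_")[1]     # e.g. 'chr10'
    return field[3:]               # drop the leading 'chr'

# The file name below is made up for illustration.
print(chrom_from_filename("data/CHROMATIN/HepG2_chr10_binary.txt.gz"))  # 10
```

The label comes back as a string, which is why p300 and tfbs are filtered with 'chr' + chrom while eRNA is filtered with chrom alone.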

In [None]:
# Modified code from: https://www.biostars.org/p/99/
from quicksect import IntervalNode

def find(start, end, tree):
    # Collect the intervals in the tree that overlap [start, end]
    out = []
    tree.intersect(start, end, lambda x: out.append(x))
    return int(bool(out)) # return 1 if there is an intersection, else 0

files = glob.glob('data/CHROMATIN/*.txt.gz') # create the list of chromatin marks file names
data_names = ['p300','eRNA','tfbs']
data_chroms = []
for filename in files:
    chrom = os.path.split(filename)[1].split("_")[1][3:] # get chromosome number from file name
    print("Integrating data for Chromosome " + chrom)
    data_histones = get_data(filename, chrom)     # Import chromatin marks
    # Select only rows of data with corresponding chromosome number
    data_list = [
        p300[p300['chr'] == 'chr' + chrom],
        eRNA[eRNA['chr'] == chrom],
        tfbs[tfbs['chr'] == 'chr' + chrom],
    ]

    # Find overlapping intervals; list() keeps the pairs indexable and reusable
    query = list(zip(data_histones['lower'], data_histones['upper']))

    for d, name in zip(data_list, data_names):
        data = list(zip(d['lower'], d['upper']))

        # start the root at the first element
        start, end = data[0]
        tree = IntervalNode(start, end)

        # build an interval tree from the rest of the data
        for start, end in data[1:]:
            tree = tree.insert(start, end)

        # flag each chromatin bin that overlaps at least one interval
        data_histones[name] = [find(start, end, tree) for start, end in query]
        print("Added " + name)
    data_chroms.append(data_histones)

Integrating data for Chromosome 10
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 11
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 12
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 13
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 14
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 15
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 16
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 17
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 18
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 19
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 1
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 20
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 21
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 22
Added p300
Added eRNA
Added tfbs
Integrating data for Chromosome 2
Added p300
Adde
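
To sanity-check the 0/1 overlap columns without quicksect, the same flag can be computed from sorted interval starts and a running maximum of interval ends. This is a sketch of an alternative technique, not the interval tree used above. The function name `overlap_flags` is hypothetical, and it assumes closed intervals, so intervals that only share an endpoint count as overlapping; quicksect's boundary handling may differ.

```python
from bisect import bisect_right

def overlap_flags(queries, intervals):
    """Return 1 for each query (start, end) that overlaps any interval, else 0.

    Assumes closed intervals [start, end] (an assumption, see the text above).
    """
    intervals = sorted(intervals)
    starts = [s for s, _ in intervals]
    # prefix_max_end[i] is the largest end among intervals[0..i]
    prefix_max_end = []
    running = float("-inf")
    for _, e in intervals:
        running = max(running, e)
        prefix_max_end.append(running)

    flags = []
    for qs, qe in queries:
        k = bisect_right(starts, qe)  # number of intervals starting at or before qe
        # overlap exists if one of those intervals ends at or after qs
        flags.append(int(k > 0 and prefix_max_end[k - 1] >= qs))
    return flags

bins = [(1, 200), (201, 400), (401, 600), (601, 800)]
sites = [(150, 210), (650, 660)]
print(overlap_flags(bins, sites))  # [1, 1, 0, 1]
```

Unlike the tree-building loop, this version returns all zeros when a chromosome has no intervals, instead of failing on an empty list.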

In [None]:
features = pd.concat(data_chroms)
features = features.iloc[:, [8,9,10,0,1,2,3,4,5,6,7,11,12,13]] #rearrange columns by position: chr, lower, upper first
features.head(2)

<h3>Write to output file</h3>

In [None]:
new_filename = "data/" + CELLNAME + "_features.txt"
features.to_csv(new_filename, sep='\t', index=False)