In this IPython notebook, we **add a feature to an existing dataset of features contained in one text file**. The example below adds transcription factor binding sites (tfbs) to an existing dataset of 10 columns (8 histone marks, 1 TPM, and 1 p300). The output file has the same format as the input file. The code depends on quicksect.py.

In [1]:
import pandas as pd
import numpy as np

<h3>Import data</h3>

In [2]:
CELLNAME = 'GM12878'

In [10]:
tfbs = pd.read_table('data/tfbs/tfbsConsSites.txt.gz', compression = 'gzip', header=None)
tfbs.columns = ['bin','chrom','chromStart','chromEnd','name','score','strand','zScore']
tfbs.head(2)

Unnamed: 0,bin,chrom,chromStart,chromEnd,name,score,strand,zScore
0,591,chr1,894640,894654,V$P300_01,842,-,1.68
1,591,chr1,894641,894657,V$ELK1_01,898,-,2.7


In [8]:
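# Keep chrom, chromStart and chromEnd by position (.ix is deprecated in favour of .iloc)
df_add = tfbs.iloc[:, [1, 2, 3]].copy()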
df_add.columns = ['chr','lower','upper']
df_add.head(2)

Unnamed: 0,chr,lower,upper
0,chr1,894640,894654
1,chr1,894641,894657


In [5]:
features_data = pd.read_table('data/'+CELLNAME+'_features.txt.gz', compression='gzip')
features_data.head(2)

Unnamed: 0,chr,lower,upper,H3K27ac,H3K27me3,H3K36me3,H3K4me1,H3K4me2,H3K4me3,H3K9ac,H4K20me1,p300,eRNA,tfbs
0,1,1,200,0,0,0,0,0,0,0,0,0,0,0
1,1,201,400,0,0,0,0,0,0,0,0,0,0,0


<h3>Match intervals</h3>
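
The cell in this section flags each feature bin with 1 if it overlaps at least one tfbs interval and 0 otherwise. As a cross-check that does not depend on quicksect.py, the same kind of flag can be sketched with sorted arrays and `np.searchsorted`. This is only a sketch, not the method used in this notebook: the helper name `has_overlap` and the toy intervals are hypothetical, and it assumes closed intervals with start <= end, which may differ from how quicksect.py treats intervals that only touch at an endpoint.

```python
import numpy as np

def has_overlap(query_starts, query_ends, data_starts, data_ends):
    """Return a 0/1 array: 1 where a query interval overlaps any data interval.

    Assumption: closed intervals with start <= end on both sides.
    """
    starts = np.sort(np.asarray(data_starts))
    ends = np.sort(np.asarray(data_ends))
    q_lo = np.asarray(query_starts)
    q_hi = np.asarray(query_ends)
    # Number of data intervals that begin at or before the query's end ...
    n_started = np.searchsorted(starts, q_hi, side='right')
    # ... minus the number that already finished before the query begins.
    n_finished = np.searchsorted(ends, q_lo, side='left')
    # The difference counts the data intervals overlapping each query.
    return (n_started - n_finished > 0).astype(int)

# Toy data (hypothetical): four 200-bp bins and two binding sites
bins_lower = [1, 201, 401, 601]
bins_upper = [200, 400, 600, 800]
sites_lower = [150, 610]
sites_upper = [210, 620]

flags = has_overlap(bins_lower, bins_upper, sites_lower, sites_upper)
```

Here the first site spans the boundary between the first two bins, the third bin contains no site, and the second site falls inside the fourth bin. Sorting the starts and ends independently is valid because only counts are compared, not the identity of individual intervals.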

In [11]:
from quicksect import IntervalNode

# Modified code from: https://www.biostars.org/p/99/
def find(start, end, tree):
    # Return 1 if (start, end) intersects any interval in the tree, else 0
    out = []
    tree.intersect(start, end, lambda x: out.append(x))
    return int(bool(out))

dats = []
for chrom in range(1, 23):
    print('Chromosome %d' % chrom)
    # Select only rows with the corresponding chromosome number in both input files
    dat = features_data[features_data['chr'] == chrom]
    d = df_add[df_add['chr'] == 'chr' + str(chrom)]

    # Intervals to annotate (query) and intervals to match against (data)
    query = list(zip(dat['lower'], dat['upper']))
    data = list(zip(d['lower'], d['upper']))

    # Start the root at the first element
    start, end = data[0]
    tree = IntervalNode(start, end)

    # Build an interval tree from the rest of the data
    for start, end in data[1:]:
        tree = tree.insert(start, end)

    # Flag each query interval: 1 if it overlaps a tfbs interval, 0 otherwise
    overlap = [find(start, end, tree) for start, end in query]

    dat['tfbs'] = overlap
    print(dat['tfbs'].value_counts())
    dats.append(dat)

Chromosome 1
0    1068307
1     177946
dtype: int64
Chromosome 2
0    1032127
1     183869
dtype: int64
Chromosome 3
0    836544
1    153568
dtype: int64
Chromosome 4
0    831106
1    124665
dtype: int64
Chromosome 5
0    768766
1    135810
dtype: int64
Chromosome 6
0    737496
1    118079
dtype: int64
Chromosome 7
0    684934
1    110759
dtype: int64
Chromosome 8
0    634882
1     96938
dtype: int64
Chromosome 9
0    616936
1     89131
dtype: int64
Chromosome 10
0    579660
1     98013
dtype: int64
Chromosome 11
0    568797
1    106235
dtype: int64
Chromosome 12
0    578069
1     91190
dtype: int64
Chromosome 13
0    514786
1     61063
dtype: int64
Chromosome 14
0    466287
1     70460
dtype: int64
Chromosome 15
0    444049
1     68607
dtype: int64
Chromosome 16
0    389411
1     62362
dtype: int64
Chromosome 17
0    336706
1     69270
dtype: int64
Chromosome 18
0    340195
1     50191
dtype: int64
Chromosome 19
0    266433
1     29211
dtype: int64
Chromosome 20
0    268504
1     4662

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
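
The warning above is raised because `dat` is a filtered slice of `features_data`, so pandas cannot tell whether the new column is meant for the slice or for the original frame. One common way to avoid it is to take an explicit copy before assigning. A minimal sketch on toy data (the frame, its values, and the names `features` and `subset` are hypothetical):

```python
import pandas as pd

# Toy stand-in for features_data (hypothetical values)
features = pd.DataFrame(
    {'chr': [1, 1, 2], 'lower': [1, 201, 1], 'upper': [200, 400, 200]},
    columns=['chr', 'lower', 'upper'],
)

# .copy() makes the subset an independent frame, so adding a column
# is unambiguous and leaves the original frame untouched.
subset = features[features['chr'] == 1].copy()
subset['tfbs'] = [0, 1]
```

The per-chromosome results are concatenated afterwards anyway, so working on copies should not change the final table.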


In [7]:
result = pd.concat(dats)

<h3>Write to Output file</h3>

In [8]:
new_filename = 'data/'+CELLNAME+'_features_2.txt'
with open(new_filename, 'w') as the_file:
    result.to_csv(the_file, sep='\t', index=False)
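
The input file is gzip-compressed, while the file written above is plain text. If a compressed output is preferred, `to_csv` can compress when given a file path rather than an open handle. A minimal sketch on a hypothetical toy frame and a temporary path (neither is part of the original notebook):

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for `result` (hypothetical values)
result_demo = pd.DataFrame(
    {'chr': [1, 1], 'lower': [1, 201], 'upper': [200, 400], 'tfbs': [0, 1]},
    columns=['chr', 'lower', 'upper', 'tfbs'],
)

# Write a tab-separated, gzip-compressed file to a temporary directory
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'demo_features_2.txt.gz')
result_demo.to_csv(out_path, sep='\t', index=False, compression='gzip')

# Read it back the same way the input file is read
round_trip = pd.read_csv(out_path, sep='\t', compression='gzip')
```

Reading the file back is a quick way to confirm that the column names and values survive the round trip.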