In [1]:
import TADselect



# Input data

## From an array-like object

In [2]:
r1 = TADselect.GenomicRanges([[10, 20], [30, 40], [50, 60]])

The representation shows one range per line: start, end, and coverage. By default, every coverage equals 1.

In [3]:
r1

10	20	1
30	40	1
50	60	1

## From a BED-like file using the `load_BED()` function

`load_BED()` returns a dictionary whose keys are the chromosome names (chrN) found in the file. The coverage is also taken from the file.
Accepted formats are:

|columns|column names|
|-------|------------|
|2|start, end|
|3|chr, start, end|
|\>=6|chr, start, end, name, score, strand|
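
To make the three accepted layouts concrete, here is a minimal pure-Python sketch of how such a file could be grouped by chromosome. The helper `parse_bed_like` is hypothetical and is not TADselect's implementation; in particular, reading the coverage from the score column of a 6-column file is an assumption.

```python
import io
from collections import defaultdict


def parse_bed_like(handle, chrm=None):
    """Group BED-like rows by chromosome as (start, end, coverage) tuples.

    Hypothetical helper for illustration only.
    """
    ranges = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) == 2:
            # start, end: the chromosome must come from the caller
            key, start, end, cov = chrm, fields[0], fields[1], 1
        elif len(fields) == 3:
            # chr, start, end
            key, start, end, cov = fields[0], fields[1], fields[2], 1
        elif len(fields) >= 6:
            # chr, start, end, name, score, strand (score used as coverage: assumption)
            key, start, end, cov = fields[0], fields[1], fields[2], fields[4]
        else:
            raise ValueError("unsupported number of columns: %d" % len(fields))
        ranges[key].append((int(start), int(end), int(float(cov))))
    return dict(ranges)


sample = io.StringIO(
    "chr7\t4690503\t4690699\t.\t593\t.\t4.81784\t-1.00000\t-0.02902\t98\n"
    "chr18\t42548160\t42548356\t.\t612\t.\t4.88871\t-1.00000\t-0.02238\t98\n"
)
parsed = parse_bed_like(sample)
```

Keeping one list per chromosome mirrors the dictionary that `load_BED()` is described as returning.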

In [6]:
%%bash
head -n3 ../example_data/ENCFF329MOF_human_hepg2_CTCF.bed

chr7	4690503	4690699	.	593	.	4.81784	-1.00000	-0.02902	98
chr18	42548160	42548356	.	612	.	4.88871	-1.00000	-0.02238	98
chr16	13483111	13483307	.	696	.	4.89992	-1.00000	-0.02358	98


In [4]:
TADselect.load_BED(filename='../example_data/ENCFF329MOF_human_hepg2_CTCF.bed')

{'chr1': 55788	55984	871
 81452	81648	562
 94351	94547	947
 122443	122639	691
 148837	149033	713
 190409	190605	758
 208637	208833	647
 224257	224453	844
 253106	253302	631
 268502	268698	658
 286612	286808	595
 290183	290379	720
 292366	292562	699
 325697	325893	698
 369955	370151	738
 382956	383152	619
 390054	390250	745
 412595	412791	890
 414333	414529	703
 415814	416010	714
 459911	460107	624
 465014	465210	826
 499023	499219	557
 501242	501438	814
 520754	520950	554
 524163	524359	586
 530446	530642	612
 535319	535515	552
 537464	537660	572
 541431	541627	821
 548376	548572	541
 556135	556331	657
 558142	558338	612
 571553	571749	619
 587799	587995	668
 629987	630183	593
 634172	634368	901
 657590	657786	546
 667834	668030	683
 683019	683215	874
 696805	697001	609
 705641	705837	577
 718096	718292	598
 776721	776917	555
 816826	817022	828
 861694	861890	643
 882624	882820	725
 892793	892989	754
 894893	895089	613
 903233	903429	595
 923109	923305	665
 936105	936301	719
 939873	94

You can select a preferred chromosome by using its name as a dictionary key. For a 2-column file, which has no chromosome column, pass the chromosome name to the function through the `chrm` argument.

In [5]:
TADselect.load_BED(filename='../example_data/ENCFF329MOF_human_hepg2_CTCF.bed', chrm='chr10')

{'chr1': 55788	55984	871
 81452	81648	562
 94351	94547	947
 122443	122639	691
 148837	149033	713
 190409	190605	758
 208637	208833	647
 224257	224453	844
 253106	253302	631
 268502	268698	658
 286612	286808	595
 290183	290379	720
 292366	292562	699
 325697	325893	698
 369955	370151	738
 382956	383152	619
 390054	390250	745
 412595	412791	890
 414333	414529	703
 415814	416010	714
 459911	460107	624
 465014	465210	826
 499023	499219	557
 501242	501438	814
 520754	520950	554
 524163	524359	586
 530446	530642	612
 535319	535515	552
 537464	537660	572
 541431	541627	821
 548376	548572	541
 556135	556331	657
 558142	558338	612
 571553	571749	619
 587799	587995	668
 629987	630183	593
 634172	634368	901
 657590	657786	546
 667834	668030	683
 683019	683215	874
 696805	697001	609
 705641	705837	577
 718096	718292	598
 776721	776917	555
 816826	817022	828
 861694	861890	643
 882624	882820	725
 892793	892989	754
 894893	895089	613
 903233	903429	595
 923109	923305	665
 936105	936301	719
 939873	94

To downscale your segmentation by the resolution of the interaction matrix, pass that resolution to the `scale` argument. Each coordinate is divided by this value, for example 55788 / 20000 = 2.7894.

In [10]:
TADselect.load_BED(filename='../example_data/ENCFF329MOF_human_hepg2_CTCF.bed', scale=20000)['chr1']

2.7894	2.7992	871
4.0726	4.0824	562
4.71755	4.72735	947
6.12215	6.13195	691
7.44185	7.45165	713
9.52045	9.53025	758
10.43185	10.44165	647
11.21285	11.22265	844
12.6553	12.6651	631
13.4251	13.4349	658
14.3306	14.3404	595
14.50915	14.51895	720
14.6183	14.6281	699
16.28485	16.29465	698
18.49775	18.50755	738
19.1478	19.1576	619
19.5027	19.5125	745
20.62975	20.63955	890
20.71665	20.72645	703
20.7907	20.8005	714
22.99555	23.00535	624
23.2507	23.2605	826
24.95115	24.96095	557
25.0621	25.0719	814
26.0377	26.0475	554
26.20815	26.21795	586
26.5223	26.5321	612
26.76595	26.77575	552
26.8732	26.883	572
27.07155	27.08135	821
27.4188	27.4286	541
27.80675	27.81655	657
27.9071	27.9169	612
28.57765	28.58745	619
29.38995	29.39975	668
31.49935	31.50915	593
31.7086	31.7184	901
32.8795	32.8893	546
33.3917	33.4015	683
34.15095	34.16075	874
34.84025	34.85005	609
35.28205	35.29185	577
35.9048	35.9146	598
38.83605	38.84585	555
40.8413	40.8511	828
43.0847	43.0945	643
44.1312	44.141	725
44.63965	44.64945	754
44.7
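
The effect of `scale` can be checked by hand. The sketch below divides base-pair coordinates by a resolution; `downscale` is a hypothetical helper written for this illustration, not a TADselect function.

```python
def downscale(ranges, scale):
    """Divide every start and end coordinate by `scale`."""
    return [(start / scale, end / scale) for start, end in ranges]


# First two chr1 ranges from the output above, resolution 20000.
scaled = downscale([(55788, 55984), (81452, 81648)], 20000)
```

The results are approximately (2.7894, 2.7992) and (4.0726, 4.0824), matching the first two rows printed above.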

# GenomicRanges instance attributes

Range coordinates are stored in `.data`

In [14]:
r1.data

array([[10, 20],
       [30, 40],
       [50, 60]])

Coverages are stored in `.coverage`

In [15]:
r1.coverage

array([1, 1, 1])

The number of ranges is stored in `.length`

In [16]:
r1.length

3

Range sizes are stored in `.sizes`

In [17]:
r1.sizes

array([10, 10, 10])
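
The four attributes relate to each other in a simple way. The class below is a minimal pure-Python stand-in that only illustrates those relations; `RangesSketch` is a hypothetical name and uses lists instead of the arrays that TADselect returns.

```python
class RangesSketch:
    """Toy container mirroring .data, .coverage, .length and .sizes."""

    def __init__(self, data, coverage=None):
        self.data = [list(pair) for pair in data]
        # Default coverage of 1 per range, as stated earlier in this notebook.
        self.coverage = list(coverage) if coverage is not None else [1] * len(self.data)

    @property
    def length(self):
        # Number of ranges.
        return len(self.data)

    @property
    def sizes(self):
        # Size of each range is end minus start.
        return [end - start for start, end in self.data]


sketch = RangesSketch([[10, 20], [30, 40], [50, 60]])
```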

# Count various coefficients

The main method is `.count_coef()`. It can return the Jaccard index (JI), overlap coefficient (OC), TPR, FDR, or PPV (computed with respect to the second GenomicRanges object), either for the ranges themselves (TADs) or for their boundaries.

In [25]:
r1 = TADselect.GenomicRanges([[10, 20], [30, 40], [50, 60]])
r2 = TADselect.GenomicRanges([[10, 20], [35, 40], [50, 60], [70, 80]])

In [26]:
r1.count_coef(r2, coef='JI TADs')

0.4

In [27]:
r2.count_coef(r1, coef='JI TADs')

0.4

In [28]:
r1.count_coef(r2, coef='JI boundaries')

0.5555555555555556

In [29]:
r1.count_coef(r2, coef='TPR TADs')

0.5

In [30]:
r2.count_coef(r1, coef='TPR TADs')

0.6666666666666666
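
The numbers above can be reproduced with plain set arithmetic. This is a sketch under assumptions: ranges count as shared only when they are identical, and TPR is taken to be the number of shared ranges divided by the size of the second track, which is inferred from the outputs rather than from the library source. The function names are hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def tpr(given, reference):
    """Shared items divided by the number of items in the reference track."""
    given, reference = set(given), set(reference)
    return len(given & reference) / len(reference)


def boundaries(ranges):
    """Set of all start and end coordinates."""
    return {coord for pair in ranges for coord in pair}


r1 = [(10, 20), (30, 40), (50, 60)]
r2 = [(10, 20), (35, 40), (50, 60), (70, 80)]

ji_tads = jaccard(r1, r2)                                # 2 shared / 5 in union
ji_boundaries = jaccard(boundaries(r1), boundaries(r2))  # 5 shared / 9 in union
tpr_r1_r2 = tpr(r1, r2)                                  # 2 shared / 4 in r2
tpr_r2_r1 = tpr(r2, r1)                                  # 2 shared / 3 in r1
```

The Jaccard index is symmetric, which is why swapping `r1` and `r2` gives the same value, while TPR depends on which track is treated as the reference.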

If you suspect that the boundaries in your data are slightly shifted, you can pass an `offset` argument that tolerates such shifts.

In [31]:
r1 = TADselect.GenomicRanges([[10, 20], [30, 40], [50, 60]])
r2 = TADselect.GenomicRanges([[10, 20], [31, 42], [50, 60]])

In [32]:
r1.count_coef(r2, coef='JI TADs')

0.5

In [33]:
r1.count_coef(r2, coef='JI TADs', offset=2)

1.0
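
One way to picture the `offset` tolerance is a Jaccard index where two ranges match if both of their boundaries differ by at most `offset`. The sketch below assumes a greedy one-to-one matching and an inclusive threshold; `jaccard_with_offset` is a hypothetical helper, not the library's code.

```python
def jaccard_with_offset(a, b, offset=0):
    """Jaccard index where ranges match if both boundaries are within `offset`."""
    unmatched = list(b)
    shared = 0
    for start, end in a:
        for cand in unmatched:
            if abs(cand[0] - start) <= offset and abs(cand[1] - end) <= offset:
                unmatched.remove(cand)  # each range in b is matched at most once
                shared += 1
                break
    # Union size: every matched pair counts once.
    return shared / (len(a) + len(b) - shared)


r1 = [(10, 20), (30, 40), (50, 60)]
r2 = [(10, 20), (31, 42), (50, 60)]

strict = jaccard_with_offset(r1, r2)             # 2 / 4
relaxed = jaccard_with_offset(r1, r2, offset=2)  # 3 / 3
```

With `offset=2` the pair (30, 40) and (31, 42) becomes a match, so all three ranges are shared.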

You can get the fraction of ranges shared at a given identity via `.count_shared()`. It is almost identical to JI, but instead of `offset`, which tolerates shifted boundaries, the `ident` argument defines how much two ranges must overlap to be considered shared.

In [35]:
r1.count_shared(r2)

0.5

In [36]:
r1.count_shared(r2, ident=0.8)

1.0
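
The role of `ident` can be sketched as a reciprocal-overlap test. Several details here are assumptions made for illustration: the overlap must cover at least `ident` of both ranges, the default is 1.0, and the result is normalised like a Jaccard index. `shared_fraction` is a hypothetical name.

```python
def shared_fraction(a, b, ident=1.0):
    """Fraction of ranges shared between a and b at the given identity."""

    def is_shared(x, y):
        overlap = min(x[1], y[1]) - max(x[0], y[0])
        if overlap <= 0:
            return False
        # The overlap must cover at least `ident` of each range.
        return overlap / (x[1] - x[0]) >= ident and overlap / (y[1] - y[0]) >= ident

    unmatched = list(b)
    shared = 0
    for x in a:
        for y in unmatched:
            if is_shared(x, y):
                unmatched.remove(y)  # one-to-one matching
                shared += 1
                break
    return shared / (len(a) + len(b) - shared)


r1 = [(10, 20), (30, 40), (50, 60)]
r2 = [(10, 20), (31, 42), (50, 60)]

exact = shared_fraction(r1, r2)             # (30, 40) and (31, 42) are not identical
loose = shared_fraction(r1, r2, ident=0.8)  # overlap 9 covers 0.9 and about 0.82
```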

To check whether a dot (a point given by two coordinates) lies inside the triangle built on some range, use `.dot_in_triangle()`. If the dot falls inside a triangle, the index of that range is returned; otherwise `None` is returned.

In [37]:
r1.dot_in_triangle([35, 39])

1

In [38]:
r1.dot_in_triangle([25,27])
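
Geometrically, a range [start, end] defines a triangle above the diagonal of the interaction matrix, and a dot lies inside it when both of its coordinates fall within the range. The sketch below assumes inclusive boundaries; the standalone function is a hypothetical re-implementation for illustration.

```python
def dot_in_triangle(ranges, dot):
    """Return the index of the first range whose triangle contains the dot, else None."""
    low, high = sorted(dot)
    for index, (start, end) in enumerate(ranges):
        if start <= low and high <= end:
            return index
    return None


ranges = [(10, 20), (30, 40), (50, 60)]
inside = dot_in_triangle(ranges, [35, 39])   # both coordinates within (30, 40)
outside = dot_in_triangle(ranges, [25, 27])  # falls between two ranges
```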

# Finding closest ranges

Some analyses require finding, for each feature in a given track, the closest feature in another track, or computing the corresponding distances. Two methods, `.find_closest()` and `.dist_closest()`, serve this purpose.

There are 3 modes for finding closest features:
* Boundariwise: returns the boundaries in the other track that are closest to the boundaries in the given track.
* Binwise: treats each range as one object and returns the bins in the other track that are closest to the bins in the given track. If ranges overlap, the distance is zero.
* Bin-boundariwise: works like boundariwise, except that when a boundary in the given track overlaps a range in the other track, the distance is zero.

In [2]:
r1 = TADselect.GenomicRanges([[10, 20], [30, 40], [70, 80]])
r2 = TADselect.GenomicRanges([[7, 11], [25, 28], [72, 74]])

## `.find_closest`

`.find_closest` in __boundariwise mode__ returns an array __[indices_start, indices_end]__, where __indices_start__ holds the (row, column) coordinates of the closest boundaries in the other track for the start boundaries in the given track, and __indices_end__ holds the same for the end boundaries.

In [47]:
r1.find_closest(r2, mode='boundariwise')

array([[[0, 1],
        [1, 1],
        [2, 0]],

       [[1, 0],
        [1, 1],
        [2, 1]]])
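
The output above can be reproduced by flattening the other track into a list of boundaries and taking the nearest one for each start and end. This is a pure-Python sketch; `closest_boundaries` is a hypothetical helper, and resolving ties in favour of the first boundary is an assumption.

```python
def closest_boundaries(given, other):
    """For each start and end in `given`, locate the nearest boundary in `other`.

    Returns [indices_start, indices_end], each a list of [row, column] pairs.
    """
    flat = [coord for pair in other for coord in pair]
    result = []
    for column in (0, 1):  # 0 = start boundaries, 1 = end boundaries
        rows = []
        for pair in given:
            best = min(range(len(flat)), key=lambda i: abs(flat[i] - pair[column]))
            rows.append(list(divmod(best, 2)))  # flat index -> [row, column]
        result.append(rows)
    return result


r1 = [(10, 20), (30, 40), (70, 80)]
r2 = [(7, 11), (25, 28), (72, 74)]
closest = closest_boundaries(r1, r2)
```

For example, the start boundary 10 is nearest to 11, which is the end (column 1) of the first range (row 0) in `r2`, giving `[0, 1]`.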

`.find_closest` in __binwise__ mode returns a single array with the indices of the closest ranges in the first column and a code in the second column. The code values are:
* 0: the range in the given track lies to the __left__ of the range in the other track;
* 1: the range in the given track lies to the __right__ of the range in the other track;
* 2: the ranges __overlap__.

In [51]:
r1.find_closest(r2, mode='binwise')

array([[0, 2],
       [1, 1],
       [2, 2]])
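
The binwise codes can be illustrated with a small gap computation between whole ranges. In this sketch, `closest_bins` is a hypothetical helper; treating touching ranges as overlapping and keeping the first of several equally close ranges are assumptions.

```python
def closest_bins(given, other):
    """For each range in `given`, return [index of closest range in `other`, code]."""
    result = []
    for g_start, g_end in given:
        best = None
        for index, (o_start, o_end) in enumerate(other):
            if g_end < o_start:
                gap, code = o_start - g_end, 0  # given range is to the left
            elif o_end < g_start:
                gap, code = g_start - o_end, 1  # given range is to the right
            else:
                gap, code = 0, 2                # ranges overlap
            if best is None or gap < best[0]:
                best = (gap, index, code)
        result.append([best[1], best[2]])
    return result


r1 = [(10, 20), (30, 40), (70, 80)]
r2 = [(7, 11), (25, 28), (72, 74)]
bins = closest_bins(r1, r2)
```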

__bin-boundariwise__ mode returns the same array as __boundariwise__ mode, but the column coordinate is equal to 2 when the boundary in the given track overlaps a range in the other track.

In [52]:
r1.find_closest(r2, mode='bin-boundariwise')

array([[[0, 2],
        [1, 1],
        [2, 0]],

       [[1, 0],
        [1, 1],
        [2, 1]]])

## `.dist_closest`

__boundariwise__ mode returns a 2-dimensional array with the distances from the start and end boundaries in the given track to the closest ones in the other track. A negative distance means that the boundary in the other track lies to the left of the boundary in the given track.

In [54]:
r1.dist_closest(r2, mode='boundariwise')

array([[  1,   5],
       [ -2, -12],
       [  2,  -6]])
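
The sign convention becomes clear when the distance is written as the closest boundary in the other track minus the boundary in the given track. The sketch below follows that reading; `boundary_distances` is a hypothetical helper and not the library's implementation.

```python
def boundary_distances(given, other):
    """Signed distance from each boundary in `given` to the nearest boundary in `other`."""
    flat = [coord for pair in other for coord in pair]
    result = []
    for pair in given:
        row = []
        for coord in pair:
            nearest = min(flat, key=lambda c: abs(c - coord))
            row.append(nearest - coord)  # negative when the nearest boundary is to the left
        result.append(row)
    return result


r1 = [(10, 20), (30, 40), (70, 80)]
r2 = [(7, 11), (25, 28), (72, 74)]
distances = boundary_distances(r1, r2)
```

Each row corresponds to one range in `r1`, with the start distance first and the end distance second.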

The other modes work the same as in `.find_closest`, but return distances instead of indices.

In [55]:
r1.dist_closest(r2, mode='binwise')

array([ 0, -2,  0])

In [56]:
r1.dist_closest(r2, mode='bin-boundariwise')

array([[  0,   5],
       [ -2, -12],
       [  2,  -6]])
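
The only difference between the boundariwise and bin-boundariwise distances is the zero returned when a boundary falls inside a range of the other track. A sketch of that rule, with the hypothetical helper `bin_boundary_distances` and inclusive range ends as an assumption:

```python
def bin_boundary_distances(given, other):
    """Like boundary-to-boundary distances, but 0 if the boundary lies inside a range."""
    flat = [coord for pair in other for coord in pair]
    result = []
    for pair in given:
        row = []
        for coord in pair:
            if any(start <= coord <= end for start, end in other):
                row.append(0)  # boundary overlaps a range in the other track
            else:
                nearest = min(flat, key=lambda c: abs(c - coord))
                row.append(nearest - coord)
        result.append(row)
    return result


r1 = [(10, 20), (30, 40), (70, 80)]
r2 = [(7, 11), (25, 28), (72, 74)]
mixed = bin_boundary_distances(r1, r2)
```

Only the first start boundary changes: 10 lies inside (7, 11), so its distance drops from 1 to 0.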

# Save to file

Two output formats are available: 2-column for TAD segmentations and 6-column for BED files. The format is specified in the `filetype` argument and the chromosome in the `chrm` argument.

In [3]:
r1.save('sample_tads.txt', filetype='TADs')

In [4]:
%%bash
cat sample_tads.txt

10	20
30	40
70	80


In [3]:
r1.save('sample_bed.bed', filetype='BED', chrm='chr10')

In [5]:
%%bash
cat sample_bed.bed

chr10	10	20	.	1	.
chr10	30	40	.	1	.
chr10	70	80	.	1	.
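
The two layouts shown above are simple tab-separated records. The sketch below writes them to any file-like object; `write_ranges` is a hypothetical helper, and the placeholder `.` for the name and strand columns is copied from the output above.

```python
import io


def write_ranges(handle, ranges, coverage=None, filetype="TADs", chrm="chr1"):
    """Write ranges as 2-column TADs or 6-column BED lines."""
    if coverage is None:
        coverage = [1] * len(ranges)
    for (start, end), cov in zip(ranges, coverage):
        if filetype == "TADs":
            handle.write(f"{start}\t{end}\n")
        elif filetype == "BED":
            # chr, start, end, name, score, strand
            handle.write(f"{chrm}\t{start}\t{end}\t.\t{cov}\t.\n")
        else:
            raise ValueError(f"unknown filetype: {filetype}")


ranges = [(10, 20), (30, 40), (70, 80)]

tads_out = io.StringIO()
write_ranges(tads_out, ranges, filetype="TADs")

bed_out = io.StringIO()
write_ranges(bed_out, ranges, filetype="BED", chrm="chr10")
```

Writing to `io.StringIO` keeps the example free of side effects; passing an open file handle works the same way.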
