
performance for pgltools merge #3

Closed
crazyhottommy opened this issue Dec 19, 2017 · 19 comments

Comments

@crazyhottommy

I have a pgl file with over 1.5 million rows, and pgltools merge is taking a very long time to run.
Is there any way to speed it up?

Thanks for this useful tool!

Tommy

@billgreenwald
Owner

I have a few ideas, but I won't be able to get to them for a little while. If you are willing, could you tell me:

  1. What kind of data are you looking at?
  2. How many items do you expect to be merged together on average (i.e., every 4 entries end up as 1 after merging)?

@crazyhottommy
Author

Thanks for replying.

  1. These are chromatin interaction data, from "Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines." They have 127 interaction datasets from ENCODE. I was trying to merge all 127 files together to make a super-set.

you can download the data:

wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip

#unzip and test one file
head encoderoadmap_lasso.1.csv 
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55

## convert to pgl format, add 2kb flanking for the promoter site

cat   encoderoadmap_lasso.1.csv | tr "$" ":" |  sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe |  sed 's/[ \t]*$//'> encoderoadmap_lasso1.bedpe
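For reference, the same conversion can be sketched in Python (a rough equivalent of the pipeline above, assuming the JEME csv layout shown in the head output; `csv_to_bedpe` is a hypothetical helper name, not part of pgltools):

```python
def csv_to_bedpe(line, flank=2000):
    """Convert one JEME csv line to pgl/bedpe-style fields, adding a
    +/- `flank` bp window around the promoter (TSS) coordinate."""
    enhancer, gene, _score = line.rstrip("\n").split(",")
    # enhancer field: chrX:100040000-100041800
    chrom_a, coords = enhancer.split(":")
    start_a, end_a = map(int, coords.split("-"))
    # gene field: ENSG...$SYMBOL$chr$TSS$strand
    _, _, chrom_b, tss, _ = gene.split("$")
    tss = int(tss)
    return (chrom_a, start_a, end_a, chrom_b, tss - flank, tss + flank)
```

The output tuple can then be written tab-separated and passed through `pgltools formatbedpe` as above.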

If I merge all 127 bedpe files together, it is 3,049,673 rows in total.

## takes very long time, let me try something else..., 3 million rows!
cat *bedpe | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe
  2. I expect to see some overlaps between different files, but I am not sure how frequent they are. If they are from the same cell type, most of them will overlap, since enhancers are cell-type specific.

Or I could merge the files recursively: merge 1.bedpe with 2.bedpe, then merge the resulting bedpe with 3.bedpe, and so on.

To implement this recursive approach in Python, I need the pygl.sort() API to work (see issue #4).

Thanks!
Tommy

@billgreenwald
Owner

A few quick things that may help, since I won't be able to make many changes to the implementation for about 2.5 weeks:

  1. If you expect the same regions to pop up, and you simply want a merged set with no duplicates, would running the sorted file through the unix "uniq" command help?
  2. Are you running this through CPython or through PyPy? PyPy would speed up the process a ton.
  3. The reason the merge command is slow is that, since the pgl format has two discrete anchors, merging two items in the list can make a previous item mergeable even though it was not at first (quick sketch below). This requires going back through the list of already-merged entries on each iteration, which greatly increases run time. If you only want to do an immediate merge, I could quickly add an option for that. Otherwise, there is probably an intelligent way to hash the data that could speed this up, but I don't have one in mind right now.

Also, just to put it here as well, the sort function in the python implementation is pygl.pyglSort()

Sketch:
Start with 3 entries:

#####----------#####--------
--#####--------------#####--
------#####--------#########

1A overlaps 2A, but not 3A. 1B overlaps 3B, but not 2B. Entries 2 and 3 overlap on both anchors.
Entries 2 and 3 get merged, resulting in:

#####----------#####--------
--#########--------#########

Now entries 1 and 2 can be merged. This requires re-iterating through the list, which increases run time.
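The re-iteration described above can be illustrated with a minimal Python sketch (this is an illustration of the idea, not pgltools' actual implementation; entries are `((startA, endA), (startB, endB))` tuples with closed intervals):

```python
def overlaps(i, j):
    """Closed-interval overlap test for a single anchor."""
    return i[0] <= j[1] and j[0] <= i[1]

def pgl_overlaps(p, q):
    # Two pgl entries can merge only if BOTH anchors overlap.
    return overlaps(p[0], q[0]) and overlaps(p[1], q[1])

def merge_pair(p, q):
    # Union of each anchor's interval.
    return (
        (min(p[0][0], q[0][0]), max(p[0][1], q[0][1])),
        (min(p[1][0], q[1][0]), max(p[1][1], q[1][1])),
    )

def merge_all(entries):
    """Merge until a fixed point: a merge can make an earlier
    entry mergeable, so we must re-scan after every pass."""
    entries = list(entries)
    changed = True
    while changed:
        changed = False
        out = []
        for e in entries:
            for k, m in enumerate(out):
                if pgl_overlaps(e, m):
                    out[k] = merge_pair(e, m)
                    changed = True
                    break
            else:
                out.append(e)
        entries = out
    return entries
```

With the three sketched entries, entries 2 and 3 merge on the first pass, and only then does entry 1 become mergeable on the second pass, collapsing everything to one entry.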

@crazyhottommy
Author

Thanks for the illustration!

  1. Man! This is smart; I should have done this first! Initially I thought they would not be exactly the same regions, just some bases off. Apparently, I was wrong:
 cat *bedpe | wc -l
3049673
cat *bedpe | cut -f1-6 | sort | uniq | wc -l
667490

This reduced the region number a lot!

  2. I do not know much Python :) I will check out PyPy.

  3. I now understand the situation much better (thanks for the visual presentation). I think an immediate merge will help if it speeds things up a lot; then I can do multiple immediate merges to get the final merged file if necessary.

Learned a lot!

Best,
Tommy

@crazyhottommy
Author

I will report back on how long it takes for the 60k regions. Memory is not an issue, though?

@billgreenwald
Copy link
Owner

How much memory are you working with? PyPy will reduce memory consumption as well, though it may hit a problem depending on how much you have. I haven't tried running merge on a file with hundreds of thousands of entries, but I have run coverage on files with millions of entries; it took a while, and it did use quite a bit of RAM.

@crazyhottommy
Author

crazyhottommy commented Dec 20, 2017

I have 35G of RAM. Thanks; memory seems to be fine now. I have an idea for speeding this up: one can split the files by chromosome, run pgltools merge on each chromosome in parallel, and then concatenate the final lists,
assuming only intra-chromosomal interactions are in the data (similar to parallelizing mutation calling by chromosome).
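A minimal sketch of the splitting step (assuming tab-separated pgl/bedpe lines with chrA in column 1 and chrB in column 4, and intra-chromosomal data only; `split_by_chrom` is a hypothetical helper, not a pgltools function):

```python
from collections import defaultdict

def split_by_chrom(pgl_lines):
    """Group pgl/bedpe lines by chromosome, assuming chrA == chrB
    (intra-chromosomal interactions only)."""
    by_chrom = defaultdict(list)
    for line in pgl_lines:
        fields = line.rstrip("\n").split("\t")
        assert fields[0] == fields[3], "inter-chromosomal entry found"
        by_chrom[fields[0]].append(line)
    return by_chrom
```

Each per-chromosome chunk can then be merged independently (e.g., one pgltools merge process per chromosome) and the results concatenated, since entries on different chromosomes can never merge with each other.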

@billgreenwald
Owner

That's a good idea; it will take a bit of time to implement. I will give it a go in a few weeks.

@crazyhottommy
Author

Splitting by chromosome speeds it up!! It takes only several minutes to merge.
Thanks again! I wrote it up here: http://crazyhottommy.blogspot.com/2017/12/merge-enhancer-promoter-interaction.html

@billgreenwald
Owner

I'm going to close this and mark as a feature improvement for now.

@StayHungryStayFool

I also ran into a problem when using pgltools merge.
My input file looks like this:
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2
chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1

My command line:
pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 > test.10.merge.pgl

The output file looks like this:
#chrA startA endA chrB startB endB collapse_of_7 distinct_of_8
chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4
My question is: why does pgltools merge two loops when only one anchor locus overlaps? I think it is only reasonable to merge loops in which both anchors overlap.

@billgreenwald
Owner

That would be because I let edges count as overlapping when they are exactly the same, i.e., bp 1000 as a start overlaps bp 1000 as an end. I can and should change that; I'll put out a fix this week.

Or, if you have time, feel free to change it and put in a pull request.

@StayHungryStayFool

Thanks for replying.
I am not confident I can finish this work, but I will try. I would be very grateful if you could fix it.
Best wishes

@billgreenwald
Owner

billgreenwald commented Jul 14, 2019 via email

@StayHungryStayFool

That's all right. Please tell me when you finish. Thanks very much.
Best wishes

@billgreenwald
Owner

I pushed a new change (version 2.1.5). Could you pull it down and test it and let me know if it works as intended?

I only updated merge right now, but if it runs as it should now, I will add the fix to all the scripts and then push another new version.

Thanks!

@billgreenwald billgreenwald reopened this Jul 18, 2019
@StayHungryStayFool

I have tested the new version, and it works as intended.
But I found another issue with the collapse and distinct parameters in merge. For example:
chr10 690000 700000 chr10 1100000 1110000 B,B,A,B 4_Strong_Enhancer_16337,PRR26_1,WDR37_1,WDR37_3
Looking at the relationship between collapse and distinct, I find PRR26_1 should correspond to A, but it corresponds to B in the result.
I am confused.
Thanks!

@StayHungryStayFool


I find the order between the collapse and distinct outputs is reversed.

@billgreenwald
Owner

To clarify first: "collapse" returns all annotations in their original order. "distinct" returns the unique set of original annotations, as close to the original order as possible.

So, if four loops are merged, and their annotations originally were A, B, C, D, then collapse and distinct will both return A, B, C, D.

However, if the annotations were A, A, B, A, then collapse will return A, A, B, A, and distinct will return A, B.
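Those semantics can be sketched with two small helpers (hypothetical illustrations, not pgltools' internals):

```python
def collapse(annots):
    """All annotations, in their original order."""
    return ",".join(annots)

def distinct(annots):
    """Unique annotations, keeping first-seen order."""
    seen = []
    for a in annots:
        if a not in seen:
            seen.append(a)
    return ",".join(seen)
```

So for merged annotations A, A, B, A, collapse gives "A,A,B,A" while distinct gives "A,B"; for A, B, C, D both give "A,B,C,D".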

I had an ordering function being called, but it wasn't doing what I wanted. I changed it and pushed a new version (2.2.0) so the behavior should be as described above; can you let me know if it works?

I also updated all functions to include the edge case you found earlier.
