
performance for pgltools merge #3

Closed
crazyhottommy opened this issue Dec 19, 2017 · 19 comments

Comments

@crazyhottommy

I have a pgl file with over 1.5 million rows, and pgltools merge is taking a very long time to run.
Is there any way to speed it up?

Thanks for this useful tool!

Tommy

@billgreenwald
Owner

I have a few ideas, but I won't be able to get to them for a little while. If you are willing, could you tell me:

  1. What kind of data are you looking at?
  2. How many items do you expect to be merged together on average (i.e., every 4 entries end up as 1 after merging)?

@crazyhottommy
Author

Thanks for replying.

  1. These are chromatin interaction data, from "Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines." They have 127 interaction datasets from ENCODE. I was trying to merge all 127 files together to make a super-set.

you can download the data:

wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip

#unzip and test one file
head encoderoadmap_lasso.1.csv 
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55

## convert to pgl format, add 2kb flanking for the promoter site

cat   encoderoadmap_lasso.1.csv | tr "$" ":" |  sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe |  sed 's/[ \t]*$//'> encoderoadmap_lasso1.bedpe
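For reference, the same conversion can be sketched in Python (a rough equivalent of the pipeline above, assuming the JEME csv layout shown in the head output; `csv_to_bedpe` is a hypothetical helper name, not part of pgltools):

```python
def csv_to_bedpe(line, flank=2000):
    """Convert one JEME csv line to pgl/bedpe-style fields, adding a
    +/- `flank` bp window around the promoter (TSS) coordinate."""
    enhancer, gene, _score = line.rstrip("\n").split(",")
    # enhancer field: chrX:100040000-100041800
    chrom_a, coords = enhancer.split(":")
    start_a, end_a = map(int, coords.split("-"))
    # gene field: ENSG...$SYMBOL$chr$TSS$strand
    _, _, chrom_b, tss, _ = gene.split("$")
    tss = int(tss)
    return (chrom_a, start_a, end_a, chrom_b, tss - flank, tss + flank)
```

The output tuple can then be written tab-separated and passed through `pgltools formatbedpe` as above.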

If I merge all 127 bedpe files together, it is 3,049,673 rows in total.

## takes very long time, let me try something else..., 3 million rows!
cat *bedpe | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe
  2. I expect to see some overlaps between different files, but I am not sure how frequent they are. If they are from the same cell type, most of them will overlap, since enhancers are cell-type specific.

Or I could merge the files recursively: merge 1.bedpe with 2.bedpe, then merge the resulting bedpe with 3.bedpe, and so on.

To implement this recursive approach in Python, I need the pygl.sort() API to work (see issue #4).

Thanks!
Tommy

@billgreenwald
Owner

A few quick things that may help, since I won't be able to make many changes to the implementation for about 2.5 weeks:

  1. If you expect the same regions to pop up, and you simply want a merged set with no duplicates, would running the sorted file through the unix "uniq" command help?
  2. Are you running this through CPython or through PyPy? PyPy would speed up the process a ton.
  3. The reason the merge command is slow is that, since the pgl format has two discrete anchors, merging two items in the list can make a previous item mergeable even though it was not at first (quick sketch below). This requires going back through the list of already-merged entries on each iteration, which greatly increases run time. If you only want to do an immediate merge, I could quickly add an option for that. Otherwise, there is probably an intelligent way to hash the data that could speed this up, but I don't have one in mind right now.

Also, just to put it here as well, the sort function in the python implementation is pygl.pyglSort()

Sketch:
Start with 3 entries:

#####----------#####--------
--#####--------------#####--
------#####--------#########

1A overlaps 2A, but not 3A. 1B overlaps 3B, but not 2B. Entries 2 and 3 overlap on both anchors.
Entries 2 and 3 get merged, resulting in:

#####----------#####--------
--#########--------#########

Now entries 1 and 2 can be merged. This requires re-iterating through the list, which increases run time.
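The re-iteration described above can be illustrated with a minimal Python sketch (this is an illustration of the idea, not pgltools' actual implementation; entries are `((startA, endA), (startB, endB))` tuples with closed intervals):

```python
def overlaps(i, j):
    """Closed-interval overlap test for a single anchor."""
    return i[0] <= j[1] and j[0] <= i[1]

def pgl_overlaps(p, q):
    # Two pgl entries can merge only if BOTH anchors overlap.
    return overlaps(p[0], q[0]) and overlaps(p[1], q[1])

def merge_pair(p, q):
    # Union of each anchor's interval.
    return (
        (min(p[0][0], q[0][0]), max(p[0][1], q[0][1])),
        (min(p[1][0], q[1][0]), max(p[1][1], q[1][1])),
    )

def merge_all(entries):
    """Merge until a fixed point: a merge can make an earlier
    entry mergeable, so we must re-scan after every pass."""
    entries = list(entries)
    changed = True
    while changed:
        changed = False
        out = []
        for e in entries:
            for k, m in enumerate(out):
                if pgl_overlaps(e, m):
                    out[k] = merge_pair(e, m)
                    changed = True
                    break
            else:
                out.append(e)
        entries = out
    return entries
```

With the three sketched entries, entries 2 and 3 merge on the first pass, and only then does entry 1 become mergeable on the second pass, collapsing everything to one entry.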

@crazyhottommy
Author

Thanks for the illustration!

  1. Man! This is smart; I should have done this first! Initially I thought they would not be exactly the same regions, just some bases off. Apparently, I was wrong:
 cat *bedpe | wc -l
3049673
cat *bedpe | cut -f1-6 | sort | uniq | wc -l
667490

This reduced the region number a lot!

  2. I do not know much Python :) I will check out PyPy.

  3. I now understand the situation much better (thanks for the visual presentation). I think an immediate merge will help if it speeds things up a lot; then I can do multiple immediate merges to get the final merged file if necessary.

Learned a lot!

Best,
Tommy

@crazyhottommy
Author

I will report back on how long it takes for the 60k regions. Memory is not an issue, though?

@billgreenwald
Copy link
Owner

How much memory are you working with? PyPy will reduce memory consumption as well, though it may hit a problem depending on how much you have. I haven't tried running merge on a file with hundreds of thousands of entries, but I have run coverage on files with millions of entries; it took a while, and it did use quite a bit of RAM.

@crazyhottommy
Author

crazyhottommy commented Dec 20, 2017

I have 35G of RAM. Thanks; memory seems to be fine now. I have an idea for speeding this up: one can split the files by chromosome, run pgltools merge on each chromosome in parallel, and then concatenate the final lists,
assuming only intra-chromosomal interactions are in the data (similar to parallelizing mutation calling by chromosome).
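A minimal sketch of the splitting step (assuming tab-separated pgl/bedpe lines with chrA in column 1 and chrB in column 4, and intra-chromosomal data only; `split_by_chrom` is a hypothetical helper, not a pgltools function):

```python
from collections import defaultdict

def split_by_chrom(pgl_lines):
    """Group pgl/bedpe lines by chromosome, assuming chrA == chrB
    (intra-chromosomal interactions only)."""
    by_chrom = defaultdict(list)
    for line in pgl_lines:
        fields = line.rstrip("\n").split("\t")
        assert fields[0] == fields[3], "inter-chromosomal entry found"
        by_chrom[fields[0]].append(line)
    return by_chrom
```

Each per-chromosome chunk can then be merged independently (e.g., one pgltools merge process per chromosome) and the results concatenated, since entries on different chromosomes can never merge with each other.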

@billgreenwald
Owner

That's a good idea; it will take a bit of time to implement. I will give it a go in a few weeks.

@crazyhottommy
Author

Splitting by chromosome speeds it up!! It takes only several minutes to merge.
Thanks again! I wrote it up here: http://crazyhottommy.blogspot.com/2017/12/merge-enhancer-promoter-interaction.html

@billgreenwald
Owner

I'm going to close this and mark as a feature improvement for now.

@StayHungryStayFool

I also ran into a problem when using pgltools merge.
My input file looks like this:
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2
chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1

My command line:
pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 > test.10.merge.pgl

The output file looks like this:
#chrA startA endA chrB startB endB collapse_of_7 distinct_of_8
chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4
My question is: why does pgltools merge two loops when only one anchor locus overlaps? I think it is only reasonable to merge loops in which both anchors overlap.

@billgreenwald
Owner

That would be because I let edges count as overlapping when they are exactly the same, i.e., bp 1000 as a start overlaps bp 1000 as an end. I can and should change that; I'll put out a fix this week.

Or, if you have time, feel free to change it and put in a pull request.

@StayHungryStayFool

Thanks for replying.
I am not confident I can finish this work, but I will try. I would be very grateful if you could fix it.
Best wishes

@billgreenwald
Owner

billgreenwald commented Jul 14, 2019 via email

@StayHungryStayFool

That's all right. Please tell me when you finish. Thanks very much.
Best wishes

@billgreenwald
Owner

I pushed a new change (version 2.1.5). Could you pull it down and test it and let me know if it works as intended?

I only updated merge right now, but if it runs as it should now, I will add the fix to all the scripts and then push another new version.

Thanks!

@billgreenwald billgreenwald reopened this Jul 18, 2019
@StayHungryStayFool

I have tested the new version, and it works as intended.
But I found another issue with the collapse and distinct parameters in merge. For example:
chr10 690000 700000 chr10 1100000 1110000 B,B,A,B 4_Strong_Enhancer_16337,PRR26_1,WDR37_1,WDR37_3
Looking at the relationship between collapse and distinct, I find PRR26_1 should correspond to A, but it corresponds to B in the result.
I am confused.
Thanks!

@StayHungryStayFool


I find the order between the collapse and distinct outputs is reversed.

@billgreenwald
Owner

To clarify first: "collapse" returns all annotations in their original order. "distinct" returns the unique set of original annotations, as close to the original order as possible.

So, if four loops are merged, and their annotations originally were A, B, C, D, then collapse and distinct will both return A, B, C, D.

However, if the annotations were A, A, B, A, then collapse will return A, A, B, A, and distinct will return A, B.
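Those semantics can be sketched with two small helpers (hypothetical illustrations, not pgltools' internals):

```python
def collapse(annots):
    """All annotations, in their original order."""
    return ",".join(annots)

def distinct(annots):
    """Unique annotations, keeping first-seen order."""
    seen = []
    for a in annots:
        if a not in seen:
            seen.append(a)
    return ",".join(seen)
```

So for merged annotations A, A, B, A, collapse gives "A,A,B,A" while distinct gives "A,B"; for A, B, C, D both give "A,B,C,D".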

I had an ordering function being called, but it wasn't doing what I wanted. I changed it and pushed a new version (2.2.0) so the behavior should be as described above; can you let me know if it works?

I also updated all functions to include the edge case you found earlier.
