New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
performance for pgltools merge #3
Comments
I have a few ideas, but wont be able to get to it for a little while. If you are willing could you tell me :
|
thanks for replying.
you can download the data: wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip
#unzip and test one file
head encoderoadmap_lasso.1.csv
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
## convert to pgl format, add 2kb flanking for the promoter site
cat encoderoadmap_lasso.1.csv | tr "$" ":" | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe | sed 's/[ \t]*$//'> encoderoadmap_lasso1.bedpe
If I merge all 127 bedpe together, total 3049673 rows.
Or I can merge the files recursively. 1.bedpe merge with 2.bedpe, and then the resulted bedpe merge with 3.bedpe...continue... To implement this recursive function in python, I need the pygl.sort() API to work.( in this issue #4) Thanks! |
A few quick things that may help, since I won't be able to do much change to the implementation for about 2.5 weeks:
Also, just to put it here as well, the sort function in the python implementation is pygl.pyglSort() Sketch:
1A overlaps 2A, but not 3A. 1B overlaps 3B, but not 2B. 2 and 3 overlap.
Now entries 1 and 2 can be merged. This requires re-iterating through the list, which boosts run time |
Thanks for the illustration!
This reduced the region number a lot!
Learned a lot! Best, |
will report back how long it takes for the 60k regions. memory is not an issue though? |
How much memory are you working with? Pypy will drop memory consumption as well, though it may hit a problem depending on how much you have. I haven't practically tried running merge on a file with hundreds of thousands of entries, but I have run coverage on files with millions of entries. it took a while, but it did use quite a bit of RAM. |
I have 35G RAM. Thanks, memory seems to be fine now. I have an idea of speeding up. one can split the files by chromosome, and run pgltools merge by chromosome in parallel and then merge the final list. |
Thats a good idea; will take a bit of time to implement. I will give it go in a few weeks |
splitting by chr speeds up!! takes only several minutes to merge. |
I'm going to close this and mark as a feature improvement for now. |
I also meet some problem when I use pgltools merge. my command line: output file like this: |
That would be because I think I let the edges count as overlapping when they are exactly the same. ie, bp 1000 as a start overlaps bp 1000 as an end -- I can and should change that. I'll put a fix out this week on it. Or if you have time, fell free to change it and put in a pull request. |
thanks for replying |
Don't worry about it if you can wait a few days.
…On Sat, Jul 13, 2019, 7:14 PM LiZhe ***@***.***> wrote:
I also meet some problem when I use pgltools merge.
input file like this:
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2
chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1
my command line:
pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 >
test.10.merge.pgl
output file like this:
#chrA startA endA chrB startB endB collapse_of_7 distinct_of_8
chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A
ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4
my question is why pgtools merge two loops when just one archor loci was
overlapping. I think that is reasonable merging loop which both anchor were
overlaping.
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#3?email_source=notifications&email_token=ADDS5EI7XAI4BA725V552BLP7KDZ5A5CNFSM4EIZWYU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ34XIA#issuecomment-511167392>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ADDS5EKF7YIAX43ZA44XBXTP7KDZ5ANCNFSM4EIZWYUQ>
.
|
That's all right. Please tell me when you finish, please. Thanks very much. |
I pushed a new change (version 2.1.5). Could you pull it down and test it and let me know if it works as intended? I only updated merge right now, but if it runs as it should now, I will add the fix to all the scripts and then push another new version. Thanks! |
I have tested the new version, and it works as intended. |
I find the sequence between collapse and distinct is reversed. |
To clarify first: "collapse" return all annotations in their original order. "distinct" returns a unique set of original annotations, as close to the original order as possible. So, if four loops are merged, and their annotations originally were A, B, C, D, then collapse and distinct will both return A, B, C, D. However, if the annotations were A, A, B, A, then collapse will return A, A, B, A, and distinct will return A,B. I had an ordering function being called but it wasn't doing what I wanted. I changed it and pushed a new version (2.2.0) so that the behavior should be as above; can you let me know if it works? I also updated all functions to include the edge case you found earlier. |
I have a pgl file with over 1.5 million rows, pgltools merge is taking very long time to run.
Any way to speed it up?
Thanks for this useful tool!
Tommy
The text was updated successfully, but these errors were encountered: