bedgraph loading is very slow #13

Closed
Phlya opened this issue Mar 15, 2018 · 7 comments

@Phlya
Contributor

Phlya commented Mar 15, 2018

Hi, I understand that bigWigs should be preferred to bedgraphs for high resolution data, and they should be much faster than bedgraphs, but currently it seems that loading bedgraphs is unreasonably slow... A ~360Mb file takes a few seconds to load with pandas, but I have been waiting a few minutes for it to load as a BedGraphTrack and just had to interrupt it because I got bored. Is it something that might get improved? Thanks!

@fidelram
Contributor

fidelram commented Mar 15, 2018

Honestly, it takes less time to convert the bedgraph to bigwig than to wait each time you want to plot something. We use UCSCTools wigToBigWig or bedGraphToBigWig and they are fast.

Pandas has super optimized parsers that can read tables very fast. But besides reading the file, pyGenomeTracks needs to create an interval tree of the bedgraph file to be able to index it and this also adds some further time.
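To illustrate why indexing adds time on top of parsing, here is a minimal sketch of a per-chromosome interval index. It is a stdlib-only stand-in built on sorted lists and `bisect`, not the interval tree that pyGenomeTracks actually builds, and the names `build_index` and `query` are hypothetical. The point is that every line has to be turned into Python objects and organized before the first region can be queried.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(lines):
    """Group bedgraph records by chromosome and sort them by start position."""
    index = defaultdict(list)
    for line in lines:
        # Skip blank lines and the usual header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, value = line.split()[:4]
        index[chrom].append((int(start), int(end), float(value)))
    for records in index.values():
        records.sort()
    return dict(index)


def query(index, chrom, start, end):
    """Return records on `chrom` overlapping the half-open range [start, end).

    Assumes records on one chromosome do not overlap each other, as in a
    well-formed bedgraph, so the ends are sorted in the same order as the starts.
    """
    records = index.get(chrom, [])
    # Rebuilt on every call for brevity; a real index would cache these.
    starts = [r[0] for r in records]
    ends = [r[1] for r in records]
    lo = bisect_right(ends, start)  # first record ending after `start`
    hi = bisect_left(starts, end)   # first record starting at or after `end`
    return records[lo:hi]
```

Building the index touches every record of the file once, whereas each query afterwards only needs a binary search.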

Another solution, besides the bigwig idea, is to split your bedgraph per chromosome. In the case of a mammalian-size genome, that should reduce the loading time by a factor of about 20 (e.g. 1 min vs. 20 min).
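A minimal sketch of the per-chromosome split, assuming a whitespace-delimited bedgraph with the chromosome name in the first column. The function name `split_bedgraph_by_chrom` and the `<chrom>.bedgraph` output naming are hypothetical choices for illustration, not something pyGenomeTracks provides.

```python
import os


def split_bedgraph_by_chrom(path, out_dir):
    """Write one bedgraph per chromosome; return {chrom: output_path}."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    paths = {}
    try:
        with open(path) as src:
            for line in src:
                # Skip blank lines and header/comment lines.
                if not line.strip() or line.startswith(("#", "track", "browser")):
                    continue
                chrom = line.split(None, 1)[0]
                if chrom not in handles:
                    paths[chrom] = os.path.join(out_dir, f"{chrom}.bedgraph")
                    handles[chrom] = open(paths[chrom], "w")
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

One output file stays open per chromosome, so an assembly with thousands of scaffolds could hit the open-file limit; for a typical mammalian genome this should not be an issue.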

@fidelram
Contributor

fidelram commented Apr 5, 2018

@Phlya I added support for tabix files for the bedGraph Track. If you convert the bedgraph file to tabix, then the loading is quite fast. To convert to tabix you need to have samtools installed and do:

$ sort -k1,1 -k2,2n bedgraph_file | bgzip > bedgraph_file_sorted.bg.bgz
$ tabix -p bed bedgraph_file_sorted.bg.bgz
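For anyone preparing the file from Python rather than the shell, the sort step can be sketched as below. This is only an illustration with a hypothetical helper name, `sort_bedgraph_lines`; it assumes tab-separated lines without header lines, holds everything in memory, and still leaves compression and indexing to `bgzip` and `tabix`.

```python
def sort_bedgraph_lines(lines):
    """Sort bedgraph lines by chromosome name, then numerically by start.

    Mirrors `sort -k1,1 -k2,2n`, except that chromosome names are compared
    by code point rather than by the shell's locale collation.
    """
    def key(line):
        fields = line.split("\t")
        return fields[0], int(fields[1])

    # Blank lines are dropped; everything else is kept verbatim.
    return sorted((line for line in lines if line.strip()), key=key)
```

For very large files the shell `sort` is likely the better choice, since it can spill to disk.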

@Phlya
Contributor Author

Phlya commented Apr 5, 2018

Awesome, thanks!

@BenoitM-I2BC

Hi, I recently discovered pyGenomeTracks and I’m very pleased to start using it.
This thread is rather old but my question is related.

The tabix support for bedgraph does indeed speed up the loading of the data.
Yet, the initialization of the track (before the loading bar appears) is rather slow. An 800 MB tabix-indexed bedgraph can take more than 1 min to initialize (and 2-3 sec to retrieve the data).

Is this normal behavior? Am I missing something obvious?
Thanks.

====== track.ini file =====
[test bedgraph tabix]
file = Data.bg.gz
title = tabix-indexed bedgraph
type = fill
summary_method = max
number_of_bins = 3000
file_type = bedgraph
===========================
(python 3.8)
(pyGenomeTracks 3.5)

@lldelisle
Collaborator

Hi,
First of all, we are happy to have new users...
For the moment, the code only considers a bedgraph as a potential tabix file if its name ends with .bgz. So, if I am not wrong, the fact that you tabix-indexed your file is not used.
To test this, you can simply change your file name (don't forget to change the name of the index also) and see if it is quicker.
Since version 3.5, we do an intersection (with bedtools) between the bedgraph and the regions to plot, so it should be quicker than before. Can you give us the output you get during the initialization (especially the progress bar)?
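The rename suggested above can be sketched as follows. The helper name `rename_to_bgz` is hypothetical, and the sketch assumes the index sits next to the data file as `<file>.tbi`, which is the default name tabix gives it.

```python
import os


def rename_to_bgz(path):
    """Rename `path` and its tabix index so the data file ends with .bgz."""
    if path.endswith(".bgz"):
        return path
    if not path.endswith(".gz"):
        raise ValueError(f"expected a bgzip-compressed .gz file, got {path!r}")
    new_path = path[: -len(".gz")] + ".bgz"
    os.rename(path, new_path)
    # Keep the index name in sync with the data file name.
    if os.path.exists(path + ".tbi"):
        os.rename(path + ".tbi", new_path + ".tbi")
    return new_path
```

With the file from the question, `Data.bg.gz` would become `Data.bg.bgz`, and the `file =` entry in the track.ini would need the same change.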

Thanks

@lldelisle lldelisle reopened this Aug 28, 2020
@BenoitM-I2BC

Hi,
Awesome! That's what I was missing.
Indeed, the file name needs to end with .bgz for the track to initialize almost instantly (0.04 sec versus 79 sec).

====== Using a .bgz file =======
INFO:pygenometracks.tracksClass:initialize 1. [test bedgraph tabix]
INFO:pygenometracks.tracksClass:initialize 2. [x-axis]
INFO:pygenometracks.tracksClass:time initializing track(s):
INFO:pygenometracks.tracksClass:0.045243263244628906
DEBUG:pygenometracks.tracksClass:Figure size in cm is 40 x 7.3138297872340425. Dpi is set to 72
====== Using a .bg.gz file =======
INFO:pygenometracks.tracksClass:initialize 1. [test bedgraph tabix]
100%|█████████████████████████| 65475/65475 [00:02<00:00, 24004.00it/s]
INFO:pygenometracks.tracksClass:initialize 2. [x-axis]
INFO:pygenometracks.tracksClass:time initializing track(s):
INFO:pygenometracks.tracksClass:79.6509759426117
DEBUG:pygenometracks.tracksClass:Figure size in cm is 40 x 7.3138297872340425. Dpi is set to 72

Out of curiosity: I thought that one could use the file_type parameter to specify the type of files irrespective of the file extension. Should I conclude that's not possible for (at least) tabix-indexed bedgraphs?

Thanks for this very nice piece of work,
and thanks so much for your super fast answer!

@lldelisle
Collaborator

lldelisle commented Aug 28, 2020

The file_type will in fact specify the type of track to display. Associated with each file_type you have different parameters that you can customize. The file_type can take a limited number of values, which are described here as a list (https://pygenometracks.readthedocs.io/en/latest/content/all_tracks.html) or here as the columns of a table with the possible parameters (https://pygenometracks.readthedocs.io/en/latest/content/possible-parameters.html).
Here it is more subtle: only the way the data is loaded differs between bedgraph and tabix-indexed bedgraph, and this is not customizable by the user.
I proposed a change where loading as tabix would be tested independently of the extension (#276).
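One way such an extension-independent test could look is sketched below. This is not the change proposed in #276, whose contents are not shown in this thread; it is only an illustration with hypothetical helper names. It assumes the BGZF layout that bgzip writes (a gzip header with the FEXTRA flag and a `BC` extra subfield first) and a `.tbi` index next to the data file.

```python
import os
import struct


def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(16)
    if len(header) < 16:
        return False
    # gzip magic bytes, deflate method, FEXTRA flag set.
    if header[:4] != b"\x1f\x8b\x08\x04":
        return False
    # Assumption: the 'BC' subfield (payload length 2) comes first in the
    # extra field, as bgzip writes it.
    return header[12:14] == b"BC" and struct.unpack("<H", header[14:16])[0] == 2


def is_tabix_indexed(path):
    """Return True if `path` is bgzip-compressed and has a .tbi index beside it."""
    return looks_like_bgzf(path) and os.path.exists(path + ".tbi")
```

A file produced with plain `gzip` fails the first check, so it would fall back to the regular bedgraph loader whatever its extension.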
