CreateHadoopBamSplittingIndex on cram? #4506
Should the CreateHadoopBamSplittingIndex tool also work on a cram? I am getting the error below, which suggests not. What are the benefits of a splitting index for a Spark job? On average, how long should it take a Spark job to get the splits for a 30x bam or cram?
@jjfarrell You don't need a splitting index for cram. The index works around a bam-specific problem that makes it hard to find good split points in the file. Cram is designed in a way that makes it easier to find the split points, so the index is unnecessary. I don't have good numbers for how long it takes to find the split points for bam; it depends on your filesystem. If you have a low-latency filesystem like a local disk or an hdfs setup, then finding split points takes very little time (~seconds), but if you have a high-latency filesystem like something backed by a google object store, then finding split points may take a long time (on the order of minutes to tens of minutes, depending on latency and file size).
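(Aside, for anyone following along: the bam problem described here comes down to BGZF virtual offsets, which a splitting index samples every N reads so a reader can seek straight to a record boundary instead of probing the compressed file for one. A minimal Java sketch of the arithmetic; the bit layout is from the SAM/BAM spec, the helper names are my own:)

```java
// A BGZF "virtual file offset" packs two numbers into one long (SAM/BAM spec):
// high 48 bits = byte offset of the compressed BGZF block in the file,
// low 16 bits  = the record's offset inside that block once decompressed.
static long blockAddress(long virtualOffset) {
    return virtualOffset >>> 16;           // compressed-block start in the file
}

static int offsetWithinBlock(long virtualOffset) {
    return (int) (virtualOffset & 0xFFFF); // position in the uncompressed block
}
```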
Thanks for the explanation!
We're working on speeding up the splitting on cloud filesystems, but it's going to take a while before we have a good solution other than the splitting index.
On our hadoop system, there is a long delay of about 30+ minutes before the tasks begin. See the delay in the job log between 13:24 and 13:59. Once the tasks start, it takes a few minutes. During the delay, the executors are idle, waiting for the tasks to start. I'm just surprised at how long getting the splits is taking. This is the command line...
@jjfarrell Huh. I expected that sort of annoying delay from splitting on a cloud system, but not on a hadoop one. Does running with a splitting index avoid the delay?
I ran the FlagStatSpark tool on a bam with a splitting index on hdfs. It ran blazing fast, with no delay at all and a total time of 1m41s. So it looks like the delay must be related to the splits on the cram. I see a similar delay (30-40 min) when testing the StructuralVariationDiscoveryPipelineSpark jobs on 50 crams. Below are some excerpts from the log of the fast FlagStatSpark run on the bam. No delay, and the tasks start right up...
Processed 1.2 billion reads in less than 2 minutes...
FlagStatSpark was also run on the bam file without the splitting index. There was no delay, just a slightly longer run of 2.5 min. Both runs had the same results. The 30-40 minute delay is only seen when reading crams. @lbergelson Is there a fix for this long delay when processing a cram?
@lbergelson @jjfarrell I'm not sure how much of the difference this accounts for, but the cram splitter iterates through all of the cram containers using htsjdk's CramContainerIterator, which decodes and materializes all SAMRecords in each container it sees. The bam (probabilistic) splitter only materializes a few records around each putative split boundary. And decoding cram is inherently slower than bam to start with. |
@cmnbroad @lbergelson For Spark tools, shouldn't the cram-splitter be using Hadoop-Bam and not htsjdk's CramContainerIterator? That would probably explain the 30-40 minute extra time for cram splits versus bam splits.
@jjfarrell Yes, but Hadoop-Bam in turn uses htsjdk.
@cmnbroad @lbergelson The cram index looks like it has all the info required to generate the splits without using the CramContainerIterator to look at the cram file directly. Could using the crai index for splits be a potential solution to the glacially slow cram split generation? From the spec: a CRAM index (.crai) is a gzipped tab-delimited file containing the following columns:

1. Reference sequence id
2. Alignment start
3. Alignment span
4. Container start byte offset in the file
5. Slice start byte offset inside the container
6. Slice size in bytes
In Hadoop-Bam this code could read the crai instead of the cram to find the container boundaries: `public List<InputSplit> getSplits(List<InputSplit> splits, Configuration conf)`
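(A rough sketch of the idea, my own code rather than Hadoop-Bam's, relying only on the gzipped tab-delimited layout quoted above, where column 4 is the container's absolute byte offset:)

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public final class CraiSplitGuide {
    // Collect distinct container start offsets from a .crai; these are the
    // natural split candidates, found without ever opening the cram itself.
    public static List<Long> containerOffsets(String craiPath) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long previous = -1;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(craiPath))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                long containerStart = Long.parseLong(fields[3]); // column 4
                if (containerStart != previous) { // several slices may share a container
                    offsets.add(containerStart);
                    previous = containerStart;
                }
            }
        }
        return offsets;
    }
}
```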
@jjfarrell I agree that it would make a lot more sense to use the .crai file. That's something we could certainly do in Hadoop-BAM or the new code. @cmnbroad does CramContainerIterator materialize each record? I was under the impression that it was just finding boundaries, but the slow runtime suggests it may not be.
We've already got code that does this for BAM indices; even if CRAIs have a different API, they should be able to reuse the split calculation code.
@tomwhite I just took a look, and I did overstate the case when I said CramContainerIterator materializes SAMRecords. It stops short of doing that, but it does crack each container open and iterate through and decompress each data block in each slice in each container as it goes along. It's not clear to me how much this affects the difference in split calculation time vs. bam.
I wrote an alternative to CramContainerIterator, a CramContainerHeaderIterator that reads only the container headers and seeks past the block data rather than decompressing it. On a 6GB CRAM file it found the container boundaries dramatically faster. So I think we should use the header-only approach for computing splits.
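(The pattern in sketch form. This uses a deliberately simplified stand-in format with a plain 4-byte length prefix; a real CRAM container header is ITF-8 encoded with many more fields, so this shows the shape of the optimization, not htsjdk's actual parsing:)

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public final class SkipScan {
    // Header-only skip scan: read just the size field, note the offset, and
    // seek past the payload, so compressed block data is never decompressed.
    public static List<Long> recordOffsets(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            while (file.getFilePointer() < file.length()) {
                long offset = file.getFilePointer();
                int payloadSize = file.readInt();               // stand-in header
                offsets.add(offset);
                file.seek(file.getFilePointer() + payloadSize); // skip payload
            }
        }
        return offsets;
    }
}
```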
@tomwhite That sounds like a pretty good improvement! For reasons that are not clear to me, htsjdk doesn't generate .crai index files, only .bai, so we'd definitely want something like the CramContainerHeaderIterator method for those. One other thought that occurs to me is that we should think about how to ensure that mates are kept together for CRAM. The spec doesn't require that mates be contained in the same slice, and since the default slices-per-container for both htslib and htsjdk is 1, they don't even have to be in the same container.
@cmnbroad do you know what would be needed for htsjdk to generate .crai files? Keeping mates in a pair together is something we already do in GATK (https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/engine/spark/datasources/ReadsSparkSource.java#L216), but it would make sense to keep it in the new Spark code in #196.
@tomwhite Yes, it has to do with htsjdk's crai->bai conversion. For some reason, the original CRAM implementation used the bai structure internally to satisfy CRAM queries instead of crai, probably because that was easier than writing a native crai implementation. It writes .bai.
There is an initial release of the faster and more accurate replacement for Hadoop-Bam at https://github.com/disq-bio/disq. It would be great to see faster reading of crams in Spark GATK with this. Any plans for testing this release?
@jjfarrell Yes, we're working on migrating to disq. See #5138
The 4.1.0.0 version with disq is reading crams much faster!