SortSam ShardedInput tests failing. #5881
Comments
@jamesemery This seems to be striking everywhere now. Any further thoughts?
Further investigation: This seems to be caused by the fact that when reading part files, Hadoop does not order them by filename. Instead they come back in whatever order the file system happens to return them, which is explicitly NOT sorted. The root cause can be seen in [...]. We can fix this by sorting the results of listStatus.
* We were relying on FileInputFormat.listStatus() returning a sorted set of statuses when loading multiple part files. This is not a safe assumption on some file systems, which return them in essentially random order. This corrects the problem by overriding listStatus in FileSplitInputFormat and sorting the results.
* See broadinstitute/gatk#5881 for additional information.
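To make the idea concrete, here is a minimal sketch of "sort the listing before using it", written against the plain JDK rather than the Hadoop API. The class and method names (PartFileOrdering, sortedParts) are hypothetical and are not part of Disq or Hadoop; in the real fix the same sort would be applied to the statuses returned by listStatus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Disq or Hadoop: it only illustrates sorting a
// directory listing before treating the part files as one ordered input.
final class PartFileOrdering {

    // Returns a new list with the part file names in lexicographic order,
    // leaving the caller's list untouched.
    static List<String> sortedParts(List<String> listing) {
        List<String> sorted = new ArrayList<>(listing);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Simulates a file system that hands back the shards in the "wrong" order.
        List<String> asReturned = Arrays.asList("filename-0001", "filename-0000");
        System.out.println(sortedParts(asReturned)); // prints [filename-0000, filename-0001]
    }
}
```

Lexicographic order only matches shard order here because the shard suffixes are zero-padded to a fixed width; names like part-2 and part-10 would need a numeric comparison instead.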
* Disabling SortSamSparkIntegrationTest.testSortBAMsSharded()
* This test is failing in master because of a disq bug.
* This should be re-enabled when #5881 is fixed.
@cmnbroad and I have both observed that the SortSamSparkIntegrationTest.testSortBAMsSharded tests fail locally on our machines despite apparently passing on Travis. The tests fail because the comparator detects that the files are out of their reported sort order. When I went digging into the failing tests, it appeared that the files are sorted correctly and written out correctly into 2 shards with proper names (filename-0000 and filename-0001). After reading the sharded directory as input, however, the two files are read out of order. That is to say, calling readsRDD.collect() clearly places all of the filename-0001 reads before the filename-0000 reads.

After digging around, it appears the problem might lie somewhere in Disq, as everything works as expected until the abstractSamSource.getReads() line is encountered in HtsjdkReadsRddStorage. I suspect something is going awry with the file system mechanism for ordering the input files on our Macs that Travis is sidestepping.

Out of curiosity, @tomwhite: I thought that the sharded output wrote headerless BAM chunks, but that appears not to be the case at all. Was I wrong in that assumption, or did that change when we switched to Disq?