SortSam ShardedInput tests failing. #5881

Open
jamesemery opened this issue Apr 12, 2019 · 2 comments

Comments

@jamesemery
Collaborator

jamesemery commented Apr 12, 2019

@cmnbroad and I have both observed that the SortSamSparkIntegrationTest.testSortBAMsSharded tests fail locally on our machines even though they apparently pass on Travis. The tests fail because the comparator detects that the files are out of their reported sort order. When I dug into the failing tests, the files appeared to be sorted correctly and written out into 2 shards with proper names (filename-0000 and filename-0001). When the sharded directory is read back as input, however, the two files are read out of order: calling readsRDD.collect() clearly places all of the filename-0001 reads before the filename-0000 reads.
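To illustrate the symptom, here is a minimal sketch in plain Java (no Spark, Disq, or htsjdk). The class `ShardOrderCheck` and its methods are hypothetical, and the integers are stand-ins for read coordinates. It only shows that two individually sorted shards stop being globally sorted once they are concatenated in the wrong order, which is roughly what the comparator is complaining about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShardOrderCheck {
    /** Concatenates the shards in the order given, like reading part files one after another. */
    public static List<Integer> concat(List<List<Integer>> shards) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> shard : shards) {
            all.addAll(shard);
        }
        return all;
    }

    /** Returns true if every element is >= the one before it. */
    public static boolean isSorted(List<Integer> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i - 1) > values.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical contents of filename-0000 and filename-0001; each shard is sorted on its own.
        List<Integer> shard0 = Arrays.asList(1, 2, 3);
        List<Integer> shard1 = Arrays.asList(4, 5, 6);

        // Reading 0000 then 0001 preserves the global sort order.
        System.out.println(isSorted(concat(Arrays.asList(shard0, shard1)))); // prints true

        // Reading 0001 then 0000 breaks it, even though nothing is wrong with either shard.
        System.out.println(isSorted(concat(Arrays.asList(shard1, shard0)))); // prints false
    }
}
```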

After digging around, it appears the problem might lie somewhere in Disq, since everything works as expected until the abstractSamSource.getReads() line is encountered in HtsjdkReadsRddStorage. I suspect something is going awry with the filesystem mechanism for ordering the input files on our Macs that Travis is sidestepping.

Out of curiosity, @tomwhite: I thought that the sharded output wrote headerless BAM chunks, but that appears not to be the case at all. Was I wrong in that assumption, or did that change when we switched to Disq?

@droazen droazen added the bug label Apr 12, 2019
@droazen droazen added this to the Engine-Q2-2019 milestone Apr 12, 2019
@tomwhite tomwhite removed their assignment Oct 28, 2019
@lbergelson
Member

@jamesemery This seems to be striking everywhere now. Any further thoughts?

@lbergelson
Member

Further investigation:

This seems to be caused by the fact that when reading part files, Hadoop does not order them by filename. Instead, they are ordered by whatever order the file system happens to return them in, which is explicitly NOT sorted.

The root cause can be seen in FileInputFormat.listStatus(), which calls singleThreadedListStatus in one of its branches to get a list of input files. This delegates to whatever the file system does, and the file system apparently returns things in random order.

We can fix this by sorting the results of listStatus.
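As a rough sketch of the idea (not the actual Disq change), the snippet below sorts a directory listing by file name before the parts are consumed. It deliberately avoids the Hadoop API so that it runs on its own: the class `SortedPartListing`, its methods, and the `out/` paths are hypothetical, and plain strings stand in for the file statuses that listStatus returns. It also assumes the part names are zero-padded (filename-0000, filename-0001), so that lexicographic order matches shard order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedPartListing {
    /** Returns the last path component, e.g. "filename-0001" for "out/filename-0001". */
    public static String fileName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Returns a copy of the listing sorted by file name; the input list is left untouched. */
    public static List<String> sortByFileName(List<String> listing) {
        List<String> sorted = new ArrayList<>(listing);
        sorted.sort(Comparator.comparing(SortedPartListing::fileName));
        return sorted;
    }

    public static void main(String[] args) {
        // Stand-in for a listing that the file system returned in an unhelpful order.
        List<String> asReturned = Arrays.asList("out/filename-0001", "out/filename-0000");

        // Sorting restores shard order regardless of what the file system did.
        System.out.println(sortByFileName(asReturned));
        // prints [out/filename-0000, out/filename-0001]
    }
}
```

In the real code the same sort would presumably be applied to the list returned by listStatus, inside an override of that method, so that every caller sees the parts in a deterministic order.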

lbergelson added a commit to disq-bio/disq that referenced this issue May 29, 2020
* We were relying on the results of FileInputFormat.listStatus() returning a sorted set of statuses when loading multiple parts files.
  This is not a safe assumption on some files systems which return them in essentially random order.
  This corrects the problem by overriding listStatus in FileSplitInputFormat and sorting the results.
* See broadinstitute/gatk#5881 for additional information.
lbergelson added a commit that referenced this issue Jun 3, 2020
* Disabling SortSamSparkIntegrationTest.testSortBAMsSharded()
* This test is failing in master because of a disq bug.
* This should be re-enabled when #5881 is fixed.
lbergelson added a commit that referenced this issue Jun 3, 2020
* Disabling SortSamSparkIntegrationTest.testSortBAMsSharded()
* This test is failing in master because of a disq bug.
* This should be re-enabled when #5881 is fixed.
jonn-smith pushed a commit that referenced this issue Jul 14, 2020
* Disabling SortSamSparkIntegrationTest.testSortBAMsSharded()
* This test is failing in master because of a disq bug.
* This should be re-enabled when #5881 is fixed.