SortSam ShardedInput tests failing. #5881

jamesemery · 2019-04-12T16:48:39Z

@cmnbroad and I have both observed that the SortSamSparkIntegrationTest.testSortBAMsSharded tests fail locally on our machines, even though they apparently pass on Travis. The tests fail because the comparator detects that the files are out of their reported sort order. When I dug into the failing tests, it appeared that the files are sorted correctly and written out correctly into 2 shards with proper names (filename-0000 and filename-0001). When the sharded directory is read back as input, however, the two files are read out of order: calling readsRDD.collect() clearly places all of the filename-0001 reads before the filename-0000 reads.

After digging around, it appears the problem might lie somewhere in Disq, since everything works as expected until the abstractSamSource.getReads() line is reached in HtsjdkReadsRddStorage. I suspect something is going awry with the filesystem mechanism for ordering the input files on our Macs, and that Travis is sidestepping it.

Out of curiosity, @tomwhite: I thought that the sharded output wrote headerless BAM chunks, but that appears not to be the case at all. Was I wrong in that assumption, or did that change when we switched to Disq?


lbergelson · 2020-05-29T17:59:04Z

@jamesemery This seems to be striking everywhere now. Any further thoughts?

lbergelson · 2020-05-29T20:46:00Z

Further investigation:

This seems to be caused by the fact that, when reading part files, Hadoop does not order them by filename. Instead they are ordered in whatever order the file system happens to return them, which is explicitly NOT sorted.

The root cause can be seen in FileInputFormat.listStatus(), which calls singleThreadedListStatus in one of its branches to get the list of input files. This delegates to whatever the file system does, and the file system apparently returns entries in an arbitrary order.

We can fix this by sorting the results of listStatus.
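To illustrate the idea outside of Hadoop, here is a minimal sketch using only the JDK. It is not the Disq change itself: the class name `SortedPartListing` and the helper `listSortedParts` are hypothetical, and `java.nio.file.Files.list` stands in for the Hadoop listing call. Like the Hadoop listing, `Files.list` makes no promise about the order of the entries it returns, so the sketch sorts them by file name before using them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SortedPartListing {

    // Hypothetical helper: lists the regular files in a sharded output
    // directory and returns them sorted by file name, instead of trusting
    // the order in which the file system happens to return them.
    public static List<Path> listSortedParts(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .sorted(Comparator.comparing(p -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Create the shards in "wrong" order to mimic an unsorted listing.
        Path dir = Files.createTempDirectory("sharded");
        Files.createFile(dir.resolve("filename-0001"));
        Files.createFile(dir.resolve("filename-0000"));

        for (Path part : listSortedParts(dir)) {
            System.out.println(part.getFileName());
        }
    }
}
```

Plain lexicographic sorting is enough here only because the shard suffixes are zero-padded (filename-0000, filename-0001); names without padding would need a numeric comparison instead.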

* We were relying on FileInputFormat.listStatus() returning a sorted set of statuses when loading multiple part files. This is not a safe assumption on some file systems, which return them in essentially random order. This corrects the problem by overriding listStatus in FileSplitInputFormat and sorting the results.
* See broadinstitute/gatk#5881 for additional information.

* Disabling SortSamSparkIntegrationTest.testSortBAMsSharded()
* This test is failing in master because of a disq bug.
* This should be re-enabled when #5881 is fixed.

jamesemery added Spark disq labels Apr 12, 2019

jamesemery assigned tomwhite Apr 12, 2019

droazen added the bug label Apr 12, 2019

droazen added this to the Engine-Q2-2019 milestone Apr 12, 2019

tomwhite removed their assignment Oct 28, 2019

lbergelson mentioned this issue May 29, 2020

Fix shard ordering bug on some filesystems. disq-bio/disq#138

Open

lbergelson mentioned this issue Jun 3, 2020

Disable SortSamSparkIntegrationTest.testSortBAMsSharded() #6635

Merged
