GenomicsDBImport scrambles sample names #3682

lbergelson · 2017-10-10T16:47:08Z

There was a critical bug in GenomicsDBImport which permuted the sample names of the data in the output GenomicsDB.

This was caused by a mismatch between the order in which sample names were sorted when files were written to the database and the order in which they were sorted when the callset.json was written.

The problem only occurred when:

- the input VCFs (or, more likely, the sample name mapping file) were not lexicographically sorted
- the `batchSize` argument was set to something other than `0`
- the number of samples was large enough to be processed in multiple batches

Each batch is sorted internally by GenomicsDB, while the callset.json is sorted separately and globally. When the per-batch order differs from the global order, the two disagree and sample names are assigned to the wrong data.

Ex:
Sample name map file:

Sample3	sample3.vcf
Sample1	sample1.vcf
Sample2	sample2.vcf

With batch size 2:

as input:
| Sample3, Sample1 | Sample2 |

as sorted by batch and written:
| Sample1, Sample3 | Sample2 |

as the labels are sorted and recorded:
Sample1, Sample2, Sample3

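In this case, Sample2 and Sample3 are mislabelled: the data for Sample3 is recorded under the name Sample2, and the data for Sample2 under the name Sample3. This causes call sets genotyped from this database to be mislabelled.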
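The example above can be reproduced with a small simulation. This is only a sketch of the behaviour described in this issue, not the GenomicsDBImport code itself; the function name `simulate_buggy_import` is hypothetical, and it assumes that a batch size of `0` means all samples are imported in a single batch.

```python
def simulate_buggy_import(sample_names, batch_size):
    """Model the mismatch described above (illustrative only).

    Returns (written_order, recorded_labels):
      written_order   -- order in which the data ends up in the database,
                         with each batch sorted on its own
      recorded_labels -- names as recorded after one global sort
    """
    names = list(sample_names)
    if batch_size <= 0:
        # Assumption: 0 means "no batching", i.e. one batch with everything.
        batches = [names]
    else:
        batches = [names[i:i + batch_size]
                   for i in range(0, len(names), batch_size)]

    # Each batch is sorted internally, then written in batch order.
    written_order = [name for batch in batches for name in sorted(batch)]
    # The labels are sorted separately and globally.
    recorded_labels = sorted(names)
    return written_order, recorded_labels


written, labels = simulate_buggy_import(["Sample3", "Sample1", "Sample2"], 2)
# Map each sample's real data to the label it was given, where they differ.
mislabelled = {actual: label
               for actual, label in zip(written, labels)
               if actual != label}

print(written)      # ['Sample1', 'Sample3', 'Sample2']
print(labels)       # ['Sample1', 'Sample2', 'Sample3']
print(mislabelled)  # {'Sample3': 'Sample2', 'Sample2': 'Sample3'}
```

With a batch size of `0`, or with an input that is already sorted, the same simulation reports no mislabelled samples.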

This issue has been fixed in #3667 by sorting the input file globally before passing it to the importer. A tool to fix the labelling of mangled call sets is being added in #3675.
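To illustrate why a global sort before batching removes the mismatch, here is a minimal sketch under the same simplified model of batching. It is not the actual change made in #3667; the function name `import_order_with_global_sort` is hypothetical, and a batch size of `0` is again assumed to mean a single batch.

```python
def import_order_with_global_sort(sample_names, batch_size):
    """Return the order data would be written in if the input is
    sorted globally before being split into batches (illustrative only)."""
    ordered = sorted(sample_names)  # the fix: one global sort up front
    if batch_size <= 0:
        # Assumption: 0 means "no batching".
        batches = [ordered]
    else:
        batches = [ordered[i:i + batch_size]
                   for i in range(0, len(ordered), batch_size)]
    # Sorting inside each batch is now a no-op, because every batch is a
    # contiguous slice of an already sorted list.
    return [name for batch in batches for name in sorted(batch)]


names = ["Sample3", "Sample1", "Sample2"]
for size in (0, 1, 2, 3):
    # The written order now matches the globally sorted labels.
    assert import_order_with_global_sort(names, size) == sorted(names)
```

Because each batch is a contiguous slice of the sorted list, the per-batch sort and the global sort can no longer disagree, whatever the batch size.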


…3675) * new tool FixCallSetSampleOrdering. This tool fixes the assignment of names to samples in a VCF that was produced with the buggy version of GenomicsDBImport by rewriting the VCF with a corrected header. This tool is hidden and should only be used if you know that your callset is definitely affected by #3682.

lbergelson added bug GenomicsDB PRIORITY_HIGH labels Oct 10, 2017

lbergelson closed this as completed Oct 10, 2017

droazen added this to the Engine-4.0 milestone Oct 10, 2017

lbergelson mentioned this issue Oct 10, 2017

new tool FixCallSetSampleOrdering #3675

Merged
