There was a critical bug in GenomicsDBImport which permuted the sample names of the data in the output GenomicsDB.
This was caused by a mismatch between the order in which the samples were written to the database and the order in which their names were written to callset.json.
The problem only occurred when:
- the input VCFs (or, more likely, the entries in the sample name mapping file) were not lexicographically sorted
- the `batchSize` argument was set to something other than `0`
- the number of samples was large enough to be processed in multiple batches
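The three conditions above can be sketched as a quick check. This is only an illustration: `may_be_affected` is a hypothetical helper, not part of GATK, and it assumes a batch size of `0` means the samples are not split into batches. Python's `sorted` stands in for the lexicographic sort, which matches for plain ASCII names. The conditions are necessary rather than sufficient, so a `True` result means the import deserves a closer look, not that it is definitely mislabelled.

```python
def may_be_affected(sample_names, batch_size):
    """Return True if all three conditions listed above hold.

    sample_names: names in the order given by the sample name map (or input VCFs).
    batch_size:   value passed as batchSize; 0 is assumed to mean "no batching".
    """
    unsorted_input = list(sample_names) != sorted(sample_names)
    batching_enabled = batch_size != 0
    # More samples than fit in one batch means at least two batches.
    multiple_batches = batching_enabled and len(sample_names) > batch_size
    return unsorted_input and batching_enabled and multiple_batches
```

For instance, `may_be_affected(["Sample3", "Sample1", "Sample2"], 2)` returns `True`, while the same names with `batch_size=0` return `False`.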
Each batch is sorted internally by GenomicsDB, while the sample names in callset.json are sorted separately and globally. Together these produce a mismatch in sample name assignment.
Example, with a batch size of 2:
as input: | Sample3, Sample1 | Sample2 |
as sorted by batch and written: | Sample1, Sample3 | Sample2 |
as the labels are sorted and recorded: Sample1, Sample2, Sample3
In this case, Sample2 and Sample3 are mislabelled: the data written for Sample3 is recorded under the label Sample2, and vice versa. Any call set genotyped from this database inherits the same mislabelling.
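The mismatch can be reproduced with a small simulation. This is a simplified model of the behaviour described above, not GenomicsDB code: the function names are hypothetical, `batch_size` is assumed to be positive, and the input order used here is Sample3, Sample1, Sample2.

```python
def written_order(input_order, batch_size):
    """Order in which sample data is written: each batch is sorted on its own."""
    batches = [input_order[i:i + batch_size]
               for i in range(0, len(input_order), batch_size)]
    return [name for batch in batches for name in sorted(batch)]


def recorded_labels(input_order):
    """Order in which names are recorded: one global sort."""
    return sorted(input_order)


def label_to_actual(input_order, batch_size):
    """Map each recorded label to the sample whose data it actually points at."""
    return dict(zip(recorded_labels(input_order),
                    written_order(input_order, batch_size)))


mapping = label_to_actual(["Sample3", "Sample1", "Sample2"], 2)
mislabelled = {label: actual for label, actual in mapping.items() if label != actual}
print(mislabelled)  # {'Sample2': 'Sample3', 'Sample3': 'Sample2'}
```

Sample1 is unaffected because it sorts first both within its batch and globally.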
This issue has been fixed in #3667 by sorting the input file globally before passing it to the importer. A tool to fix the labelling of mangled call sets is being added in #3675.
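A minimal sketch of why a global sort before batching removes the mismatch, under the same simplified model (per-batch sort on write, global sort for the labels). `batches_after_fix` is a hypothetical name and does not reflect the actual change in #3667.

```python
def batches_after_fix(input_order, batch_size):
    """Sort all names globally first, then split them into batches."""
    globally_sorted = sorted(input_order)
    return [globally_sorted[i:i + batch_size]
            for i in range(0, len(globally_sorted), batch_size)]


names = ["Sample3", "Sample1", "Sample2"]
batches = batches_after_fix(names, 2)

# Sorting each batch again changes nothing, so the written order
# now equals the globally sorted label order.
written = [name for batch in batches for name in sorted(batch)]
assert written == sorted(names)
```

Because every batch is a contiguous slice of an already sorted list, the per-batch sort becomes a no-op and the two orders can no longer diverge.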
…3675)
* new tool FixCallSetSampleOrdering
this tool fixes the assignment of names to samples in a VCF that was produced
with the buggy version of GenomicsDBImport by rewriting the vcf with a corrected header
this tool is hidden and should only be used if you know that your callset is definitely affected by #3682
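The idea of correcting the header can be sketched as follows. This is not the implementation of FixCallSetSampleOrdering. It is a hypothetical illustration that assumes the original sample name map order and a positive batch size are known, and that the VCF has the standard nine fixed columns before the sample columns.

```python
FIXED_COLUMNS = 9  # #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def label_to_actual(input_order, batch_size):
    """Map each recorded (globally sorted) label to the sample actually written there."""
    written = [name
               for i in range(0, len(input_order), batch_size)
               for name in sorted(input_order[i:i + batch_size])]
    return dict(zip(sorted(input_order), written))


def fix_header_line(header_line, input_order, batch_size):
    """Return the #CHROM header line with each sample column renamed to its true sample."""
    fields = header_line.rstrip("\n").split("\t")
    mapping = label_to_actual(input_order, batch_size)
    samples = [mapping[name] for name in fields[FIXED_COLUMNS:]]
    return "\t".join(fields[:FIXED_COLUMNS] + samples)
```

Only the header changes; the genotype columns stay where they are, which matches the commit's description of rewriting the VCF with a corrected header.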