GenomicsDBImport scrambles sample names #3682

Closed
lbergelson opened this issue Oct 10, 2017 · 0 comments

There was a critical bug in GenomicsDBImport which permuted the sample names of the data in the output GenomicsDB.

This was caused by a mismatch between two sort orders: the order in which sample data was written to the database, and the order in which the sample names were written to callset.json.

The problem only occurred when:

  1. the input VCFs (or, more likely, the sample name mapping file) were not lexicographically sorted
  2. the `batchSize` argument was set to something other than `0`
  3. the number of samples was large enough to be processed in multiple batches.

Each batch is sorted internally by GenomicsDB, while the callset.json is sorted separately and globally. Together these produce a mismatch in sample name assignment.

Ex:
Sample name map file:

Sample3	sample3.vcf
Sample1	sample1.vcf
Sample2	sample2.vcf

With batch size 2:

as input:
| Sample3, Sample1 | Sample2 |

as sorted by batch and written:
| Sample1, Sample3 | Sample2 |

as the labels are sorted and recorded:
Sample1, Sample2, Sample3

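In this case, Sample2 and Sample3 are mislabelled: the data stored second belongs to Sample3 but is labelled Sample2, and the data stored third belongs to Sample2 but is labelled Sample3. Call sets genotyped from this database carry the same wrong labels.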
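The example can be reproduced with a small simulation. This is only a sketch of the ordering logic described in this issue, not GATK or GenomicsDB code; the function names are made up for illustration, and treating a batch size of `0` as "one batch containing every sample" is an assumption.

```python
def written_order(input_names, batch_size):
    """Order in which sample data lands in the database (buggy behaviour)."""
    if batch_size == 0:
        # Assumption: a batch size of 0 means one batch with every sample.
        return sorted(input_names)
    order = []
    for start in range(0, len(input_names), batch_size):
        batch = input_names[start:start + batch_size]
        order.extend(sorted(batch))  # each batch is sorted on its own
    return order


def recorded_labels(input_names):
    """Order of the labels, which are sorted once across all samples."""
    return sorted(input_names)


def mislabelled(input_names, batch_size):
    """Map each wrong label to the sample whose data it actually points at."""
    data = written_order(input_names, batch_size)
    labels = recorded_labels(input_names)
    return {label: actual for label, actual in zip(labels, data) if label != actual}


name_map_order = ["Sample3", "Sample1", "Sample2"]
print(written_order(name_map_order, 2))    # data order after per-batch sorting
print(recorded_labels(name_map_order))     # label order after the global sort
print(mislabelled(name_map_order, 2))      # labels that point at the wrong data
```

With a batch size of 2 the simulation flags Sample2 and Sample3, matching the example; with a single batch, or with an already sorted name map, it reports nothing.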

This issue has been fixed in #3667 by sorting the input file globally before passing it to the importer. A tool to fix the labelling of mangled call sets is being added in #3675.
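Why a global sort is enough can be seen in a few lines. This is a sketch of the idea only, not the change made in #3667; `import_order` and its `sort_globally_first` flag are hypothetical names, and the batch size is assumed to be positive.

```python
def import_order(names, batch_size, sort_globally_first):
    """Return the order in which sample data is written, batch by batch."""
    if sort_globally_first:
        names = sorted(names)  # the fix: sort once before splitting into batches
    order = []
    for start in range(0, len(names), batch_size):
        order.extend(sorted(names[start:start + batch_size]))  # per-batch sort
    return order


names = ["Sample3", "Sample1", "Sample2"]
labels = sorted(names)  # labels are always sorted globally

buggy = import_order(names, 2, sort_globally_first=False)
fixed = import_order(names, 2, sort_globally_first=True)

print(buggy == labels)  # False: data order and label order disagree
print(fixed == labels)  # True: batches of a sorted list concatenate in sorted order
```

Once the input is sorted, every batch is a contiguous slice of the sorted list, so the per-batch sort is a no-op and the data order matches the label order for any batch size.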

@droazen droazen added this to the Engine-4.0 milestone Oct 10, 2017
lbergelson added a commit that referenced this issue Oct 11, 2017
…3675)

* new tool FixCallSetSampleOrdering

this tool fixes the assignment of names to samples in a VCF that was produced
with the buggy version of GenomicsDBImport by rewriting the vcf with a corrected header

this tool is hidden and should only be used if you know that your callset is definitely affected by #3682
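The header correction can be illustrated as follows. This is not the implementation of FixCallSetSampleOrdering; it is a minimal sketch that assumes the original sample name map order and the batch size used for the import are both known, that the batch size is positive, and that the affected VCF lists its samples in globally sorted order. All function names are hypothetical.

```python
FIXED_COLUMNS = 9  # #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT


def true_sample_order(name_map_order, batch_size):
    """Order in which the data columns were actually written, batch by batch."""
    order = []
    for start in range(0, len(name_map_order), batch_size):
        order.extend(sorted(name_map_order[start:start + batch_size]))
    return order


def corrected_header_line(header_line, name_map_order, batch_size):
    """Rewrite the #CHROM line so each label sits over its own data column."""
    fields = header_line.rstrip("\n").split("\t")
    fixed, samples = fields[:FIXED_COLUMNS], fields[FIXED_COLUMNS:]
    if sorted(samples) != sorted(name_map_order):
        raise ValueError("header samples do not match the sample name map")
    return "\t".join(fixed + true_sample_order(name_map_order, batch_size))


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample2\tSample3"
print(corrected_header_line(header, ["Sample3", "Sample1", "Sample2"], 2))
```

Only the header line changes; the genotype columns stay where they are, which is why rewriting the header is sufficient to repair the labelling.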