loadVcf does not dedupe sample ID #1874

Closed
fnothaft opened this Issue Jan 15, 2018 · 0 comments

fnothaft commented Jan 15, 2018

This happens when loading multiple VCFs that contain the same sample IDs. E.g., from 1kg:

val vcs = sc.loadVcf("1kg/release/20130502/ALL*.vcf.gz")
vcs: org.bdgenomics.adam.rdd.variant.VariantContextRDD = VariantContextRDD with 86 reference sequences and 58825 samples

That might be the right number of samples for ExAC (it isn't), but definitely too many for 1kg...
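
To make the failure mode concrete, here is a minimal sketch in plain Scala collections. It does not use ADAM's API: the object name, the helper names, and the per-file sample lists are all hypothetical stand-ins for the sample IDs declared in each VCF header. It only shows why concatenating per-file sample lists inflates the count and how deduping on sample ID brings it back down.

```scala
// Illustrative sketch only: plain Scala, not ADAM's loadVcf implementation.
object SampleDedupeSketch {

  // Hypothetical per-file sample IDs, as if read from three VCF headers
  // of one project split by chromosome.
  val samplesPerFile: Seq[Seq[String]] = Seq(
    Seq("HG00096", "HG00097", "HG00099"), // e.g. a chr1 file
    Seq("HG00096", "HG00097", "HG00099"), // e.g. a chr2 file
    Seq("HG00096")                        // e.g. a chrY file with a subset
  )

  // Naive merge: every sample is counted once per file it appears in.
  def concatenate(files: Seq[Seq[String]]): Seq[String] =
    files.flatten

  // Deduped merge: each sample ID is kept once, in first-seen order.
  def dedupe(files: Seq[Seq[String]]): Seq[String] =
    files.flatten.distinct

  def main(args: Array[String]): Unit = {
    val naive = concatenate(samplesPerFile)
    val unique = dedupe(samplesPerFile)
    // prints "concatenated: 7, deduped: 3"
    println(s"concatenated: ${naive.size}, deduped: ${unique.size}")
  }
}
```

The naive count grows with the number of files, which matches the symptom above; the deduped count depends only on the distinct sample IDs.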

@fnothaft fnothaft added the bug label Jan 15, 2018

@fnothaft fnothaft added this to the 0.24.0 milestone Jan 15, 2018

@fnothaft fnothaft self-assigned this Jan 15, 2018

fnothaft added a commit to fnothaft/adam that referenced this issue Jan 16, 2018

[ADAM-1874] Dedupe samples when loading VCFs.
Resolves bigdatagenomics#1874. While samples should be unique in a single VCF, we may load data
from multiple VCFs that contain the same samples (e.g., VCFs from a single
sequencing project where the VCFs are split by chromosome). This change dedupes
sample IDs on load.

heuermh added a commit that referenced this issue Jan 22, 2018

[ADAM-1874] Dedupe samples when loading VCFs.

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018