Loading GZipped VCF returns an empty RDD #1333

Closed
fnothaft opened this Issue Jan 1, 2017 · 7 comments

2 participants
@fnothaft
Member

fnothaft commented Jan 1, 2017

Can be reproduced on any of the VCFs from the Variant DB challenge 2. It looks like this is probably not our problem; if you unzip the files, we load them a-OK.
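One way to rule out an empty or corrupt archive before suspecting the loader is to count the records with the JDK's own gzip reader and compare that number against what the RDD reports. The sketch below is plain Scala and does not touch ADAM; `GzVcfCount` and `countRecords` are hypothetical names used only for illustration.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Hypothetical helper, not part of ADAM.
object GzVcfCount {
  /** Counts the non-empty, non-header lines (those not starting with '#') in a gzipped VCF. */
  def countRecords(path: String): Long = {
    val reader = new BufferedReader(
      new InputStreamReader(
        new GZIPInputStream(new FileInputStream(path)),
        StandardCharsets.UTF_8))
    try {
      Iterator
        .continually(reader.readLine())
        .takeWhile(_ != null)
        .count(line => line.nonEmpty && !line.startsWith("#"))
        .toLong
    } finally {
      reader.close()
    }
  }
}
```

If this count is non-zero while the loaded RDD is empty, the file itself is probably fine and the problem is somewhere in the loading path.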

@fnothaft fnothaft added the bug label Jan 1, 2017

@fnothaft fnothaft added this to the 0.22.0 milestone Jan 1, 2017

@heuermh

Member

heuermh commented Jan 2, 2017

How do you reproduce this? Is it with the 0.20.0 release, the pull request #1288 branch, or HEAD? And is it via vcf2anno/anno2vcf or a different code path?

I had this working for me at one point, almost certainly with those same files.

We even have unit test coverage:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L176

@fnothaft

Member

fnothaft commented Jan 2, 2017

This is with the #1288 branch; I haven't tested on HEAD yet. I'm just doing a sc.loadVcf on the file paths from the ADAM shell on a clean build. Let me try on HEAD and I will ping back.

@heuermh

Member

heuermh commented Jan 25, 2017

Works for me on a local file with 0.21.0 release

$ cp adam-core/src/test/resources/random.vcf random.vcf
$ gzip random.vcf 
$ adam-shell
Using SPARK_SHELL=/usr/local/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> var vcs = sc.loadVcf("random.vcf.gz")
vcs: org.bdgenomics.adam.rdd.variant.VariantContextRDD =
VariantContextRDD(MapPartitionsRDD[1] at flatMap at ADAMContext.scala:933,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> vcs.rdd.count
res1: Long = 6

and on git HEAD

$ ./bin/adam-shell 
Using SPARK_SHELL=/usr/local/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> var vcs = sc.loadVcf("random.vcf.gz")
vcs: org.bdgenomics.adam.rdd.variant.VariantContextRDD =
VariantContextRDD(MapPartitionsRDD[1] at flatMap at ADAMContext.scala:978,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> vcs.rdd.count
Warning: file:/Users/heuermh/working/adam/random.vcf.gz is not splittable, consider using block compressed gzip (BGZF).
res0: Long = 6


$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit

       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b

ADAM version: 0.21.1-SNAPSHOT
Commit: 01297b6cd9eca3fc28757a7edbc761915c539da2 Build: 2017-01-25
Built for: Scala 2.11.8 and Hadoop 2.7.3
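
The warning in the HEAD session distinguishes plain gzip from block compressed gzip (BGZF). A BGZF file is still valid gzip, but each block's header sets the FEXTRA flag and carries a "BC" extra subfield, so the two can be told apart from the first few bytes. The following is a simplified sketch that only checks fixed offsets in the first block header; `BgzfCheck`, `looksLikeBgzf`, and `readHeader` are hypothetical names and not ADAM or htsjdk API.

```scala
import java.io.FileInputStream

// Hypothetical helper, not part of ADAM or htsjdk.
object BgzfCheck {
  /** True if the leading bytes look like a BGZF block header (simplified fixed-offset check). */
  def looksLikeBgzf(header: Array[Byte]): Boolean = {
    def u(i: Int): Int = header(i) & 0xff
    header.length >= 14 &&
      u(0) == 0x1f && u(1) == 0x8b &&             // gzip magic bytes
      u(2) == 8 &&                                // compression method: deflate
      (u(3) & 4) != 0 &&                          // FLG.FEXTRA is set
      u(12) == 'B'.toInt && u(13) == 'C'.toInt    // BGZF extra subfield identifier
  }

  /** Reads up to the first 14 bytes of a file. */
  def readHeader(path: String): Array[Byte] = {
    val in = new FileInputStream(path)
    try {
      val buf = new Array[Byte](14)
      var off = 0
      var n = in.read(buf, off, buf.length - off)
      while (n > 0) {
        off += n
        n = if (off < buf.length) in.read(buf, off, buf.length - off) else -1
      }
      buf.take(off)
    } finally {
      in.close()
    }
  }
}
```

For a file produced with plain `gzip`, as in the session above, `BgzfCheck.looksLikeBgzf(BgzfCheck.readHeader("random.vcf.gz"))` should return false; recompressing with a BGZF writer such as htslib's `bgzip` should make it return true.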
@fnothaft

Member

fnothaft commented Mar 6, 2017

This seems to repro on the GRCh37 20160601 build of dbSNP. Looking deeper now...

@fnothaft

Member

fnothaft commented Mar 6, 2017

Hah, actually, that was user error. I was saving the sites-only VCF as Genotypes (of which there are none...). While I'm at it though, let me see if this does reproduce on the VariantDB challenge files.
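
For anyone hitting the same thing: a sites-only VCF has no FORMAT or sample columns, so a genotype view of it is legitimately empty even though the variants load fine. A minimal sketch of checking this from the `#CHROM` header line, in plain Scala with hypothetical names (`SitesOnly`, `sampleNames`, `isSitesOnly`), might look like:

```scala
// Hypothetical helper, not part of ADAM.
object SitesOnly {
  /** Sample names from a VCF #CHROM header line; empty for a sites-only VCF. */
  def sampleNames(chromLine: String): Seq[String] = {
    val cols = chromLine.stripPrefix("#").split("\t").toSeq
    // Columns 0-7 are the fixed fields CHROM..INFO, column 8 is FORMAT,
    // and any sample names follow from column 9 onward.
    cols.drop(9)
  }

  /** True if the header declares no samples, so no genotypes can exist. */
  def isSitesOnly(chromLine: String): Boolean = sampleNames(chromLine).isEmpty
}
```

If `isSitesOnly` is true for the file's header line, an empty genotype output is expected rather than a loading bug.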

@fnothaft

Member

fnothaft commented Mar 6, 2017

OK, I don't see this on the VariantDB challenge files anymore. I'm going to close this for now and will reopen if I see it on another file at a later date.

@fnothaft fnothaft closed this Mar 6, 2017

@heuermh

Member

heuermh commented Mar 6, 2017

Thanks for following up!

@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
