Loading GZipped VCF returns an empty RDD #1333

Closed · fnothaft opened this issue Jan 1, 2017 · 7 comments

@fnothaft (Member) commented Jan 1, 2017

Can be reproduced on any of the VCFs from the Variant DB challenge 2. It looks like this is probably not our problem; if you unzip the files, we load them a-OK.
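
Roughly, what I'm doing from the ADAM shell (file names below are placeholders, not the actual challenge files):

import org.bdgenomics.adam.rdd.ADAMContext._

// placeholder path for one of the Variant DB challenge 2 VCFs
val gzipped = sc.loadVcf("challenge2/sample.vcf.gz")
gzipped.rdd.count  // comes back 0 on the gzipped copy

// gunzip the same file first and it loads fine
val plain = sc.loadVcf("challenge2/sample.vcf")
plain.rdd.count    // non-zero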

@fnothaft added the bug label Jan 1, 2017
@fnothaft added this to the 0.22.0 milestone Jan 1, 2017
@heuermh (Member) commented Jan 2, 2017

How do you reproduce? Is this with the 0.20.0 release, the pull request #1288 branch, or HEAD? And is it via vcf2anno/anno2vcf or a different code path?

I had this working for me at one point, almost certainly with those same files.

We even have unit test coverage:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L176
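
The rough shape of that coverage (not the actual test at that link; this assumes ADAMFunSuite's sparkTest fixture and a gzipped copy of a test resource, with illustrative names):

import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.util.ADAMFunSuite

class GzippedVcfSuite extends ADAMFunSuite {
  sparkTest("can load a gzipped VCF") {
    // resolve a gzipped VCF from the test classpath (illustrative resource name)
    val path = getClass.getClassLoader.getResource("random.vcf.gz").getFile
    val vcs = sc.loadVcf(path)
    assert(vcs.rdd.count > 0)
  }
}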

@fnothaft (Member, Author) commented Jan 2, 2017

This is with the #1288 branch; I haven't tested on HEAD yet. I'm just calling sc.loadVcf on the file paths from the ADAM shell on a clean build. Let me try on HEAD and ping back.

@heuermh (Member) commented Jan 25, 2017

Works for me on a local file with the 0.21.0 release:

$ cp adam-core/src/test/resources/random.vcf random.vcf
$ gzip random.vcf 
$ adam-shell
Using SPARK_SHELL=/usr/local/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> var vcs = sc.loadVcf("random.vcf.gz")
vcs: org.bdgenomics.adam.rdd.variant.VariantContextRDD =
VariantContextRDD(MapPartitionsRDD[1] at flatMap at ADAMContext.scala:933,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> vcs.rdd.count
res1: Long = 6

and on git HEAD

$ ./bin/adam-shell 
Using SPARK_SHELL=/usr/local/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> var vcs = sc.loadVcf("random.vcf.gz")
vcs: org.bdgenomics.adam.rdd.variant.VariantContextRDD =
VariantContextRDD(MapPartitionsRDD[1] at flatMap at ADAMContext.scala:978,SequenceDictionary{
1->249250621, 0
2->249250621, 1
13->249250621, 2},ArrayBuffer({"sampleId": "NA12878", "name": null, "attributes": {}}, {"sampleId": "NA12891", "name": null, "attributes": {}}, {"sampleId": "NA12892", "name": null, "attributes": {}}),ArrayBuffer(FILTER=<ID=IndelFS,Description="FS > 200.0">, FILTER=<ID=IndelQD,Description="QD < 2.0">, FILTER=<ID=IndelReadPosRankSum,Description="ReadPosRankSum < -20.0">, FILTER=<ID=LowQual,Description="Low quality">, FILTER=<ID=VQSRTrancheSNP99.50to99.60,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -0.5377 <= x < -0.1787">, FILTER=<ID=VQSRTrancheSNP99.60to99.70,Description="Truth se...

scala> vcs.rdd.count
Warning: file:/Users/heuermh/working/adam/random.vcf.gz is not splittable, consider using block compressed gzip (BGZF).
res0: Long = 6


$ ./bin/adam-submit --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit

       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b

ADAM version: 0.21.1-SNAPSHOT
Commit: 01297b6cd9eca3fc28757a7edbc761915c539da2 Build: 2017-01-25
Built for: Scala 2.11.8 and Hadoop 2.7.3
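
Side note on the warning above: a plain-gzipped VCF still loads, it just isn't splittable, so the whole file ends up in a single partition. If that matters, the file can be recompressed as BGZF, e.g. with bgzip, or programmatically with htsjdk (a sketch, not an ADAM API; paths illustrative):

import java.io.{ File, FileInputStream }
import java.util.zip.GZIPInputStream
import htsjdk.samtools.util.BlockCompressedOutputStream

// re-write a plain-gzipped VCF as block-compressed gzip (BGZF)
val in  = new GZIPInputStream(new FileInputStream("random.vcf.gz"))
val out = new BlockCompressedOutputStream(new File("random.vcf.bgz"))
val buf = new Array[Byte](64 * 1024)
Iterator.continually(in.read(buf)).takeWhile(_ > 0).foreach(n => out.write(buf, 0, n))
in.close()
out.close()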
@fnothaft (Member, Author) commented Mar 6, 2017

This seems to repro on the GRCh37 20160601 build of dbSNP. Looking deeper now...

@fnothaft (Member, Author) commented Mar 6, 2017

Hah, actually, that was user error. I was saving the sites-only VCF as Genotypes (of which there are none...). While I'm at it, though, let me see if this does reproduce on the VariantDB challenge files.
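
In other words, something like this (illustrative path): a sites-only VCF has no sample columns, so the variant side is populated but the genotype side is legitimately empty.

import org.bdgenomics.adam.rdd.ADAMContext._

// illustrative path to a sites-only dbSNP VCF (no sample columns)
val variants  = sc.loadVariants("dbsnp_grch37_20160601_sites.vcf.gz")
val genotypes = sc.loadGenotypes("dbsnp_grch37_20160601_sites.vcf.gz")

variants.rdd.count  // non-zero: one Variant per record
genotypes.rdd.count // 0: no samples in the VCF, hence no Genotypes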

@fnothaft (Member, Author) commented Mar 6, 2017

OK, I don't see this on the VariantDB challenge files anymore. I'm going to close this for now and will reopen if I see it on another file at a later date.

@fnothaft closed this Mar 6, 2017
@heuermh (Member) commented Mar 6, 2017

Thanks for following up!

@heuermh added this to Completed in Release 0.23.0 Mar 8, 2017