File _rgdict.avro does not exist #1150

Closed
ooliynyk opened this Issue Sep 3, 2016 · 7 comments

ooliynyk commented Sep 3, 2016

I converted a VCF file to ADAM format using this command:

    # adam-submit vcf2adam file:///tmp/A7VAGPU.vcf.gz file:///tmp/a7.adam

When I then tried to run

    # adam-submit count_kmers file:///tmp/a7.adam/ file:///tmp/kmers.adam 10

I got this error:

Command body threw exception:
java.io.FileNotFoundException: File file:/tmp/a7.adam/_rgdict.avro does not exist
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/a7.adam/_rgdict.avro does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAvro(ADAMContext.scala:442)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAvroReadGroupMetadata(ADAMContext.scala:160)
    at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:520)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1023)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1002)
    at scala.Option.fold(Option.scala:157)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAlignments(ADAMContext.scala:1000)
    at org.bdgenomics.adam.cli.CountReadKmers.run(CountReadKmers.scala:63)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.CountReadKmers.run(CountReadKmers.scala:54)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:132)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:72)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Files in .adam directory:

root@2ef6c96e995a:/tmp# ls -la /tmp/a7.adam/
total 2044
drwxr-xr-x  2 root root    4096 Sep  3 20:03 .
drwxrwxrwt 75 root root    4096 Sep  3 20:04 ..
-rw-r--r--  1 root root       8 Sep  3 20:03 ._SUCCESS.crc
-rw-r--r--  1 root root     108 Sep  3 20:03 ._common_metadata.crc
-rw-r--r--  1 root root     140 Sep  3 20:03 ._metadata.crc
-rw-r--r--  1 root root      20 Sep  3 20:03 ._samples.avro.crc
-rw-r--r--  1 root root      20 Sep  3 20:03 ._seqdict.avro.crc
-rw-r--r--  1 root root   15628 Sep  3 20:03 .part-r-00000.gz.parquet.crc
-rw-r--r--  1 root root       0 Sep  3 20:03 _SUCCESS
-rw-r--r--  1 root root   12467 Sep  3 20:03 _common_metadata
-rw-r--r--  1 root root   16770 Sep  3 20:03 _metadata
-rw-r--r--  1 root root    1301 Sep  3 20:03 _samples.avro
-rw-r--r--  1 root root    1402 Sep  3 20:03 _seqdict.avro
-rw-r--r--  1 root root 1999220 Sep  3 20:03 part-r-00000.gz.parquet

A7VAGPU.vcf.gz

fnothaft (Member) commented Sep 6, 2016

Hi @ooliynyk! The count_kmers command only works on read data, not on variant data. That said, I'd like to better understand your use case here. Are you trying to create consensus sequences from the genotype calls, which you then count k-mers from?
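For anyone hitting the same error, the distinction between read data and variant data can be checked before submitting a job. The sketch below is not part of ADAM; it is a small Python heuristic based only on what this thread shows: the stack trace indicates that loading reads looks for `_rgdict.avro`, and the directory written by vcf2adam contains `_samples.avro` instead. The function name `guess_record_type` is hypothetical, and the sidecar file names may differ in other ADAM versions.

```python
from pathlib import Path


def guess_record_type(adam_dir):
    """Guess what kind of records an ADAM Parquet directory holds.

    Heuristic only: the file names are taken from the stack trace and
    directory listing in this thread and are an assumption for other
    ADAM versions.
    """
    names = {p.name for p in Path(adam_dir).iterdir()}
    if "_rgdict.avro" in names:
        # The read-loading path in the stack trace opens this file.
        return "reads"
    if "_samples.avro" in names:
        # Present in the vcf2adam output listed above.
        return "variants"
    return "unknown"
```

Under these assumptions, calling `guess_record_type("/tmp/a7.adam")` on the directory listed above would return `"variants"`, which is a hint that count_kmers is not applicable to it.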

BrandonColbyMD commented Sep 7, 2016

Hi @fnothaft - Thank you for your reply. I'm working with @ooliynyk on this project. We have installed ADAM as part of Sequencing.com's Altruist Database, a free, open-data initiative. The database contains human genome VCF files as well as those files converted to ADAM format.

Our goal is to let users of the Altruist Database use ADAM for any type of analysis they want to perform on one or more genotypic files in the database. For example, they may choose to analyze all Altruist records that are female, or all Altruist records that are carriers of a cystic fibrosis variant in the CFTR gene.

Using the Altruist UI (which is still in development), users will be able to analyze data within the Altruist Database by uploading or entering their own commands, by programming ADAM to their own specifications, or by using commands and programs created and shared by other users.

We were testing the count_kmers command to make sure it worked on our dataset in case a user entered that command.

Hope this info was helpful. Any advice and guidance you can provide will be much appreciated so we can make sure that we enable ADAM to be used to its full potential.

heuermh (Member) commented Sep 20, 2016

Very interesting use case, @ooliynyk @BrandonColbyMD!

As @fnothaft mentioned, count_kmers doesn't make sense to run on VCF files or on ADAM Parquet directories of Variants or Genotypes. We could help by wrapping the thrown exception in a more user-friendly error message, and perhaps by documenting which ADAM bdg-formats schema records each ADAM CLI command supports.

What do you think?
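To make the "wrap the exception" idea concrete, here is a minimal sketch of the pattern. It is written in Python purely for illustration and is not ADAM's code or its eventual fix; `NotReadDataError` and `open_read_group_dictionary` are hypothetical names, and the only detail taken from this thread is the missing `_rgdict.avro` file.

```python
from pathlib import Path


class NotReadDataError(ValueError):
    """Raised when a reads-only command is pointed at other data."""


def open_read_group_dictionary(adam_dir, command="count_kmers"):
    """Open the read group dictionary, translating a missing file
    into a message that says what the command expected."""
    path = Path(adam_dir) / "_rgdict.avro"
    try:
        return path.open("rb")
    except FileNotFoundError as err:
        # Chain the original exception so the low-level detail
        # stays available for debugging.
        raise NotReadDataError(
            f"{command} expects read (alignment) data, but {adam_dir} "
            "has no _rgdict.avro. Was this directory written by vcf2adam?"
        ) from err
```

The original `FileNotFoundError` is kept as the cause, so the full trace is still there when needed, while the top-level message tells the user which kind of input the command wants.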

plexteq commented Oct 17, 2016

Hi @heuermh, @fnothaft.

So we converted VCF to ADAM using the vcf2adam command. Which operations can be applied to the resulting ADAM file? What kinds of analysis can be performed on it?

Regards,
Alex

BrandonColbyMD commented Oct 26, 2016

Hi @heuermh and @fnothaft - wanted to follow up on @plexteq's question above. We are in the process of implementing operations that users can run on a user-defined subset of ADAM files in the Altruist Database. All ADAM files are being converted from gVCF files using vcf2adam, so they aren't from Avro files.

Please let us know what operations you recommend to allow for analyzing one or more ADAM files within the Altruist Database.

Thank you!

heuermh (Member) commented Oct 26, 2016

@BrandonColbyMD ADAM is kind of like a Swiss Army knife for getting data in traditional bioinformatics file formats ready for analysis on Spark; most of the interesting bits can be found in downstream repositories or in workbooks. I'll let others chime in with specific examples.

fnothaft (Member) commented Mar 3, 2017

Closing as this was a version change issue.

fnothaft closed this Mar 3, 2017
