ADAM files created by vcf2adam are not recognizable #1496

mhaghdad · 2017-04-18T20:01:20Z

Hi

I am running into something strange and would sincerely appreciate any feedback. I am working with the chromosome 22 (chr22) VCF file from 1000 Genomes, which is about 11 GB. To reduce its size and speed things up, I am trying to convert it to .adam. I ran adam-submit vcf2adam, and it created the .adam files successfully. The problem is that when I then run adam-submit flagstat or popstrat, I get the following error message:

java.io.FileNotFoundException: Couldn't find any files matching maprfs:///thelocation

But the files are there, and you can see them by running hadoop fs -ls maprfs:///thelocation.
I initially thought this was due to the large size of the chromosome 22 file (11 GB), so I tried the following:

adam-submit transform maprfs:///thelocation/small.sam maprfs:///thelocation/small.adam
This creates the small.adam files successfully.
adam-submit flagstat maprfs:///thelocation/small.adam
This works without any problem.

adam-submit vcf2adam maprfs:///thelocation/small.vcf maprfs:///thelocation/small.adam
This creates the small.adam files successfully.
adam-submit flagstat maprfs:///thelocation/small.adam
This does NOT work, and I get the same error as I did with chromosome 22:

java.io.FileNotFoundException: Couldn't find any files matching maprfs:///thelocation

I looked at the structure of the two small.adam outputs, and they are different!
Am I missing something with vcf2adam? Is this the right way to convert VCF to ADAM? Why do I get the error even though the files are there?

Thank you

heuermh · 2017-04-18T20:32:44Z

.adam files are Avro-formatted records stored in Parquet format.

For BAM/CRAM/SAM files converted to "ADAM format" that means AlignmentRecords stored in Parquet. For VCF files converted to "ADAM format" it will be Genotypes or Variants stored in Parquet.

The flagstat command works on AlignmentRecords stored in Parquet, not on Genotypes or Variants stored in Parquet. An error is therefore expected, although a FileNotFoundException is probably not the right one. We should see whether a more user-friendly error could be thrown.

parquet-tools allows you to inspect the structure of Parquet files on disk.

$ parquet-tools schema foo.adam
message org.bdgenomics.formats.avro.AlignmentRecord {
  optional group contig {
    optional binary contigName (UTF8);
    optional int64 contigLength;
    optional binary contigMD5 (UTF8);
    optional binary referenceURL (UTF8);
    optional binary assembly (UTF8);
    optional binary species (UTF8);
  }
  optional int64 start;
  optional int64 oldPosition;
  optional int64 end;
  optional int32 mapq;
  optional binary readName (UTF8);
  optional binary sequence (UTF8);
  optional binary qual (UTF8);
  optional binary cigar (UTF8);
...

$ parquet-tools head foo.adam
contig:
.contigName = 1
.contigLength = 249250621
start = 26472783
end = 26472858
mapq = 60
readName = simread:1:26472783:false
sequence = GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA
cigar = 75M
...
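
As a rough illustration of the check described above, here is a minimal Python sketch that reads the `message ... {` header line from `parquet-tools schema` output and reports whether a command can consume that record type. Everything in it is hypothetical and not part of ADAM: the function names, the `RecordTypeMismatch` error, and the `EXPECTED_RECORD` table are made up for this example, and the Genotype header line is an assumed shape for vcf2adam output that is not shown in this thread.

```python
import re

# Only the flagstat -> AlignmentRecord pairing is stated in this thread.
# This table is a hypothetical helper, not something ADAM provides.
EXPECTED_RECORD = {"flagstat": "AlignmentRecord"}


class RecordTypeMismatch(Exception):
    """Hypothetical error: the file holds a record type the command cannot read."""


def record_type(schema_text):
    """Return the simple record name from `parquet-tools schema` output."""
    for line in schema_text.splitlines():
        match = re.match(r"\s*message\s+([\w.]+)\s*\{", line)
        if match:
            # Drop the package prefix, keeping e.g. "AlignmentRecord".
            return match.group(1).rsplit(".", 1)[-1]
    raise ValueError("no 'message <name> {' line found in schema output")


def check_command(command, schema_text):
    """Return the record type, or raise if the command expects a different one."""
    found = record_type(schema_text)
    expected = EXPECTED_RECORD.get(command)
    if expected is not None and found != expected:
        raise RecordTypeMismatch(
            f"{command} expects {expected} records, "
            f"but the file contains {found} records"
        )
    return found


# Header line copied from the schema output above.
alignment_schema = "message org.bdgenomics.formats.avro.AlignmentRecord {\n..."
# Assumed header line for a file written by vcf2adam (not shown in this thread).
genotype_schema = "message org.bdgenomics.formats.avro.Genotype {\n..."
```

With these inputs, `check_command("flagstat", alignment_schema)` returns the record name, while `check_command("flagstat", genotype_schema)` raises `RecordTypeMismatch` with a message naming both record types, which is closer to the user-friendly error discussed above than a FileNotFoundException.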

mhaghdad · 2017-04-19T07:26:44Z

Thank you heuermh for clarifying that :)
I also found why popstrat was giving the error: it had been written for an older version of Spark, and I was able to modify the Scala code. Having two similar-looking but unrelated errors made things even more confusing. Thanks

heuermh · 2017-04-19T17:22:51Z

No problem!

heuermh closed this as completed Apr 19, 2017

heuermh modified the milestone: 0.23.0 Jul 22, 2017
