Adam files created by vcf2adam is not recognizable #1496

Closed
mhaghdad opened this Issue Apr 18, 2017 · 3 comments

mhaghdad commented Apr 18, 2017

Hi

I am running into something strange and would sincerely appreciate any feedback. I am working with the 1000 Genomes chromosome 22 (chr22) data, a .vcf file of about 11 GB. To reduce its size and speed things up, I am trying to convert it to .adam. I ran adam-submit vcf2adam, and it creates the .adam files successfully. The problem is that when I then run adam-submit flagstat or popstrat, I get the following error message:

java.io.FileNotFoundException: Couldn't find any files matching maprfs:///thelocation

But the files are there, and you can see them by running hadoop fs -ls maprfs:///thelocation.
I initially thought this was due to the large size of the chromosome 22 file (11 GB), so I tried the following:

adam-submit transform maprfs:///thelocation/small.sam maprfs:///thelocation/small.adam
This creates the small.adam files successfully
adam-submit flagstat maprfs:///thelocation/small.adam
This works beautifully no problem

adam-submit vcf2adam maprfs:///thelocation/small.vcf maprfs:///thelocation/small.adam
This creates the small.adam files successfully
adam-submit flagstat maprfs:///thelocation/small.adam
This does NOT work; I get the same error as with chromosome 22:

java.io.FileNotFoundException: Couldn't find any files matching maprfs:///thelocation

I looked at the structure of the two small.adam outputs and they are different!
My question is: am I missing something with vcf2adam? Is this the right way to convert VCF to ADAM? Why do I get the error even though the files are there?

Thank you

heuermh Apr 18, 2017

Member

.adam files are Avro-formatted records stored in Parquet format.

For BAM/CRAM/SAM files converted to "ADAM format" that means AlignmentRecords stored in Parquet. For VCF files converted to "ADAM format" it will be Genotypes or Variants stored in Parquet.

The flagstat command works on AlignmentRecords stored in Parquet, not Genotypes or Variants stored in Parquet. Thus an error is expected; a FileNotFoundException is probably not the correct one, however. We should see if a more user-friendly error could be thrown.

parquet-tools allows you to inspect the structure of Parquet files on disk.

$ parquet-tools schema foo.adam
message org.bdgenomics.formats.avro.AlignmentRecord {
  optional group contig {
    optional binary contigName (UTF8);
    optional int64 contigLength;
    optional binary contigMD5 (UTF8);
    optional binary referenceURL (UTF8);
    optional binary assembly (UTF8);
    optional binary species (UTF8);
  }
  optional int64 start;
  optional int64 oldPosition;
  optional int64 end;
  optional int32 mapq;
  optional binary readName (UTF8);
  optional binary sequence (UTF8);
  optional binary qual (UTF8);
  optional binary cigar (UTF8);
...

$ parquet-tools head foo.adam
contig:
.contigName = 1
.contigLength = 249250621
start = 26472783
end = 26472858
mapq = 60
readName = simread:1:26472783:false
sequence = GTATAAGAGCAGCCTTATTCCTATTTATAATCAGGGTGAAACACCTGTGCCAATGCCAAGACAGGGGTGCCAAGA
cigar = 75M
...
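
The schema check above can be scripted so that flagstat is only attempted on alignment data. The following is a minimal sketch, not part of ADAM: the function names adam_record_type, is_alignment_schema, and safe_flagstat are hypothetical, and it assumes that parquet-tools and adam-submit are on the PATH and that parquet-tools can read the path you pass it. It relies only on the first line of the schema output shown above, which names the record type.

```shell
#!/bin/sh
# Print the record type named on the "message" line of
# `parquet-tools schema` output, read from stdin.
adam_record_type() {
  awk '$1 == "message" { print $2; exit }'
}

# Succeed only if the schema on stdin describes AlignmentRecords,
# the record type that flagstat works on.
is_alignment_schema() {
  [ "$(adam_record_type)" = "org.bdgenomics.formats.avro.AlignmentRecord" ]
}

# Hypothetical wrapper: run flagstat only when the schema check passes.
# Assumes parquet-tools and adam-submit are installed and on the PATH.
safe_flagstat() {
  if parquet-tools schema "$1" | is_alignment_schema; then
    adam-submit flagstat "$1"
  else
    echo "$1 does not contain AlignmentRecords; flagstat would fail" >&2
    return 1
  fi
}
```

With these functions sourced into a shell, safe_flagstat foo.adam would refuse to run on output produced by vcf2adam and print a clearer message than the FileNotFoundException.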

mhaghdad Apr 19, 2017

Thank you, heuermh, for clarifying that :)
I also found why popstrat was giving the error: it had been written for an older version of Spark, and I was able to modify the Scala code. Two similar-looking but unrelated errors made things even more confusing. Thanks

heuermh Apr 19, 2017

Member

No problem!

@heuermh heuermh closed this Apr 19, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
