adam2vcf -sort_on_save flag broken #940

Closed
andrewmchen opened this Issue Feb 12, 2016 · 14 comments

@andrewmchen
Member

andrewmchen commented Feb 12, 2016

Hi all. I tried to run adam2vcf with the -sort_on_save flag and got this error:

16/02/11 20:28:01 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 202, amp-bdg-57.amp): com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
attributes (htsjdk.variant.variantcontext.CommonInfo)
commonInfo (htsjdk.variant.variantcontext.VariantContext)
vc (org.seqdoop.hadoop_bam.VariantContextWritable)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:102)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at scala.collection.convert.Wrappers$MutableMapWrapper.put(Wrappers.scala:217)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    ... 31 more

adam2vcf worked without the flag, so I suspect the failure only happens when sorting. I've attached the full log as well.
log.txt
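The serialization trace above points at Kryo's FieldSerializer populating CommonInfo's attributes map during deserialization. A rough Python analogy of that failure mode follows; the class and field names mirror the trace, but the mechanics are purely illustrative and are not htsjdk's or ADAM's actual code:

```python
class CommonInfo:
    """Stand-in for htsjdk.variant.variantcontext.CommonInfo, whose
    constructor normally initializes the 'attributes' map."""
    def __init__(self):
        self.attributes = {}

def field_deserialize(cls, fields):
    # Field-based serializers (like Kryo's FieldSerializer) typically
    # reconstruct objects field by field, bypassing the constructor.
    obj = cls.__new__(cls)   # constructor is NOT run
    for name, value in fields.items():
        setattr(obj, name, value)
    return obj

# If the serialized form left 'attributes' null, the first insertion
# fails -- the Python analogue of the NullPointerException thrown in
# MutableMapWrapper.put:
info = field_deserialize(CommonInfo, {"attributes": None})
try:
    info.attributes["DP"] = "30"
except TypeError as exc:
    print("put on a null map:", exc)
```

Under this analogy, a record whose map field was recorded as absent deserializes fine, and the error only surfaces later when something tries to put into the map, which matches the crash appearing mid-shuffle rather than at read time.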

@fnothaft

fnothaft commented Feb 12, 2016

I think I know a fix (and the fix should be related to #933). Do you have a VCF on the cluster that reproduces this?

@andrewmchen

andrewmchen commented Feb 12, 2016

Yup, it's in HDFS in my home directory: hdfs:///user/amchen/NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.bam.unified.raw.SNP.gatk.vcf.adam.filtered

@andrewmchen

andrewmchen commented Feb 15, 2016

I pulled down your change in #933, and it still doesn't seem to work, at least for this ADAM file. To reproduce, you could just use bash /home/eecs/amchen/scripts/adamToVCF.sh

Here's the log:
log2.txt

@heuermh heuermh added this to the 0.19.0 milestone Feb 16, 2016

@fnothaft

fnothaft commented Feb 16, 2016

@massie is looking at this.

@massie

massie commented Feb 17, 2016

@andrewmchen I'm looking at this now. Thanks for sending the script and files for your job.

When running the script, I get the following warning from Parquet:

Feb 16, 2016 4:00:58 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)

Looking at the metadata for the file, I see

creator:                   parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)

which is Parquet version 1.7.0, whereas the other ADAM file in the directory (minus the ".filtered" suffix) has the following creator:

creator:                   parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)

We switched from Parquet 1.7.0 to 1.8.1 in July last year.

How hard would it be to regenerate that ADAM file using a newer version of ADAM? It might be worth a try while I debug the root cause of the exception.
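For context on the warning: parquet-mr ignores a file's min/max statistics when it cannot parse the created_by footer field (the PARQUET-251 safeguard), and the 1.7.0-era string above has no "version x.y.z" segment for it to parse. A rough sketch of that distinction; the regex below is my own approximation for illustration, not parquet-mr's actual VersionParser:

```python
import re

# Approximate shape of a parseable created_by string:
#   "<app> version <ver> (build <hash>)"
CREATED_BY = re.compile(
    r"(?P<app>\S+)\s+version\s+(?P<ver>[^(\s]+)\s*\(build\s+(?P<build>[^)]+)\)"
)

def parse_created_by(s):
    """Return the parsed fields, or None when the string is unparseable
    (the case where statistics get ignored with a warning)."""
    m = CREATED_BY.match(s)
    return m.groupdict() if m else None

old = "parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)"
new = "parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)"

print(parse_created_by(old))   # None: no version segment, stats ignored
print(parse_created_by(new))   # parses: app, ver 1.8.1, build hash
```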

@massie

massie commented Feb 17, 2016

@andrewmchen I just checked the Avro and Parquet schemas, and they are identical, so there's likely little use in recreating that file (unless it's trivial to do).

@andrewmchen

andrewmchen commented Feb 17, 2016

By "that file", do you mean .filtered? I can recreate it without any hassle, and I'll do it when I get a chance.

It seems very peculiar that they'd have different Parquet versions, because I built the .filtered file about a month ago. Could it be because avocado is on a different version of ADAM/Parquet?

@massie

massie commented Feb 17, 2016

Sorry, I can see why that wasn't clear. Yes, the "*.filtered" file was created using Parquet 1.7.x.

That's odd. As long as Avocado is using ADAM version 0.17.1 or newer, it should be writing Parquet 1.8.x files. Avocado started using ADAM 0.17.1 in August of last year, so as long as you have a relatively recent version of Avocado, you should be fine.

@massie

massie commented Feb 17, 2016

@andrewmchen Can you verify the version of Avocado that you're using? If it is less than six months old, it shouldn't be saving in Parquet 1.7.x format, as far as I can tell from looking at the pom files.

@andrewmchen

andrewmchen commented Feb 17, 2016

That makes a ton of sense. I should probably rebase my Avocado. The commit hash I branched off of was 2e6504f01004cd13c22f36198e6aea490bb94130.

@massie

massie commented Feb 18, 2016

@andrewmchen I just submitted pull request #949, which fixes this issue. When you have a moment, can you verify that it fixes your problem? I've run your test case, but it's always good to have more than one set of eyes.

@andrewmchen

andrewmchen commented Feb 18, 2016

Sure, I'll do it later tonight. Thanks for resolving this issue so quickly!

@andrewmchen

andrewmchen commented Feb 18, 2016

This seems to have solved it. Just curious, how did this line work in the past anyway? https://github.com/bigdatagenomics/adam/pull/949/files#diff-514d6d86034c4dd8aa9ee737c8637a7eL130

@heuermh

heuermh commented Feb 24, 2016

Fixed by commit 0975e30

@heuermh heuermh closed this Feb 24, 2016
