
[ADAM-1100] Resolve Sample Not Serializable exception #1101

Merged

merged 1 commit into bigdatagenomics:master on Aug 10, 2016

fnothaft commented Aug 6, 2016

Resolves #1100. Registered the Sample class with the AvroSerializer in ADAMKryoRegistrator.

I've tested this locally. @jpdna can you verify this works on your side as well?
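For readers following along, the registration described above has roughly the following shape. This is a sketch rather than the actual diff: only `Sample` and the class names `AvroSerializer` and `ADAMKryoRegistrator` come from the PR description; the package paths and surrounding code are assumptions.

```scala
// Sketch: registering an Avro-generated class with an Avro-backed Kryo
// serializer. Package names are assumptions, not quoted from the diff.
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.bdgenomics.adam.serialization.AvroSerializer
import org.bdgenomics.formats.avro.Sample

class ADAMKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Avro specific records do not implement java.io.Serializable, so
    // Kryo needs an explicit serializer registered for them.
    kryo.register(classOf[Sample], new AvroSerializer[Sample]())
  }
}
```

Note that this only helps on code paths that actually serialize with Kryo; closures are serialized by Spark's JavaSerializer, which is relevant to the discussion below.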


AmplabJenkins commented Aug 6, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1370/
Test PASSed.


jpdna commented Aug 6, 2016

Weirdly, I am still getting the same error as described in #1100 even with this PR's code, and I am pretty sure that I built and am running this PR branch issues/1100-sample-serializable as intended.

Are any other additions needed to make a new Avro object Kryo registered and serializable?

[jp@jp1 adam]$ pwd
/jpr1/work/Hbase_July22/adam_sample_fix/adam
[jp@jp1 adam]$ git status
On branch issues/1100-sample-serializable
Your branch is up-to-date with 'origin/issues/1100-sample-serializable'.
nothing to commit, working directory clean

[jp@jp1 adam]$ cd ../../run7_fix_sample_serialize/
[jp@jp1 run7_fix_sample_serialize]$ ../adam_sample_fix/adam/bin/adam-submit adam2vcf HG00096.var.adam outFromAdamHG00096.vcf
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/jpr1/work/Hbase_July22/spark1.6.1/spark-1.6.1-bin-hadoop2.6/bin/spark-submit
Command body threw exception:
org.apache.spark.SparkException: Task not serializable
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.bdgenomics.adam.rdd.variation.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:117)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:83)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:59)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:131)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:71)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.bdgenomics.formats.avro.Sample
Serialization stack:
    - object not serializable (class: org.bdgenomics.formats.avro.Sample, value: {"sampleId": "HG00096", "name": null, "attributes": {}})
    - writeObject data (class: scala.collection.immutable.$colon$colon)
    - object (class scala.collection.immutable.$colon$colon, List({"sampleId": "HG00096", "name": null, "attributes": {}}))
    - field (class: org.bdgenomics.adam.rdd.variation.VariantContextRDD, name: samples, type: interface scala.collection.Seq)
    - object (class org.bdgenomics.adam.rdd.variation.VariantContextRDD, VariantContextRDD(MapPartitionsRDD[4] at map at GenotypeRDD.scala:62,SequenceDictionary{
1->249250621, 0
2->243199373, 1
3->198022430, 2
4->191154276, 3
5->180915260, 4
6->171115067, 5
7->159138663, 6
8->146364022, 7
9->141213431, 8
10->135534747, 9
11->135006516, 10
12->133851895, 11
13->115169878, 12
14->107349540, 13
15->102531392, 14
16->90354753, 15
17->81195210, 16
18->78077248, 17
19->59128983, 18
20->63025520, 19
21->48129895, 20
22->51304566, 21
GL000191.1->106433, 22
GL000192.1->547496, 23
GL000193.1->189789, 24
GL000194.1->191469, 25
GL000195.1->182896, 26
GL000196.1->38914, 27
GL000197.1->37175, 28
GL000198.1->90085, 29
GL000199.1->169874, 30
GL000200.1->187035, 31
GL000201.1->36148, 32
GL000202.1->40103, 33
GL000203.1->37498, 34
GL000204.1->81310, 35
GL000205.1->174588, 36
GL000206.1->41001, 37
GL000207.1->4262, 38
GL000208.1->92689, 39
GL000209.1->159169, 40
GL000210.1->27682, 41
GL000211.1->166566, 42
GL000212.1->186858, 43
GL000213.1->164239, 44
GL000214.1->137718, 45
GL000215.1->172545, 46
GL000216.1->172294, 47
GL000217.1->172149, 48
GL000218.1->161147, 49
GL000219.1->179198, 50
GL000220.1->161802, 51
GL000221.1->155397, 52
GL000222.1->186861, 53
GL000223.1->180455, 54
GL000224.1->179693, 55
GL000225.1->211173, 56
GL000226.1->15008, 57
GL000227.1->128374, 58
GL000228.1->129120, 59
GL000229.1->19913, 60
GL000230.1->43691, 61
GL000231.1->27386, 62
GL000232.1->40652, 63
GL000233.1->45941, 64
GL000234.1->40531, 65
GL000235.1->34474, 66
GL000236.1->41934, 67
GL000237.1->45867, 68
GL000238.1->39939, 69
GL000239.1->33824, 70
GL000240.1->41933, 71
GL000241.1->42152, 72
GL000242.1->43523, 73
GL000243.1->43341, 74
GL000244.1->39929, 75
GL000245.1->36651, 76
GL000246.1->38154, 77
GL000247.1->36422, 78
GL000248.1->39786, 79
GL000249.1->38502, 80
MT->16569, 81
NC_007605->171823, 82
X->155270560, 83
Y->59373566, 84
hs37d5->35477943, 85},List({"sampleId": "HG00096", "name": null, "attributes": {}})))
    - field (class: org.bdgenomics.adam.rdd.variation.VariantContextRDD$$anonfun$4, name: $outer, type: class org.bdgenomics.adam.rdd.variation.VariantContextRDD)
    - object (class org.bdgenomics.adam.rdd.variation.VariantContextRDD$$anonfun$4, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 25 more
Aug 6, 2016 12:50:24 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 6


fnothaft commented Aug 6, 2016

@jpdna OOC, can you do a clean build and retry?


jpdna commented Aug 6, 2016

I re-cloned, checked out the PR branch, and ran mvn clean install as below, but I still see the same error. I checked ADAMKryoRegistrator.scala after checking out the branch and it definitely has the new code from this PR.

git clone https://github.com/fnothaft/adam.git
cd adam/
git checkout issues/1100-sample-serializable 
mvn clean install
mkdir test-1100
cd test-1100
cp /jpr1/work/Hbase_July22/run7_fix_sample_serialize/HG00096.vcf .
../bin/adam-submit vcf2adam HG00096.vcf HG00096.var.adam

../bin/adam-submit adam2vcf HG00096.var.adam roundtrip.vcf
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/jpr1/work/Hbase_July22/spark1.6.1/spark-1.6.1-bin-hadoop2.6/bin/spark-submit
Command body threw exception:
org.apache.spark.SparkException: Task not serializable
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.bdgenomics.adam.rdd.variation.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:117)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:83)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:59)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:131)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:71)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.bdgenomics.formats.avro.Sample
Serialization stack:
    - object not serializable (class: org.bdgenomics.formats.avro.Sample, value: {"sampleId": "HG00096", "name": null, "attributes": {}})
    - writeObject data (class: scala.collection.immutable.$colon$colon)
    - object (class scala.collection.immutable.$colon$colon, List({"sampleId": "HG00096", "name": null, "attributes": {}}))
    - field (class: org.bdgenomics.adam.rdd.variation.VariantContextRDD, name: samples, type: interface scala.collection.Seq)
    - object (class org.bdgenomics.adam.rdd.variation.VariantContextRDD, VariantContextRDD(MapPartitionsRDD[4] at map at GenotypeRDD.scala:62,SequenceDictionary{
1->249250621, 0
2->243199373, 1
3->198022430, 2
4->191154276, 3
5->180915260, 4
6->171115067, 5
7->159138663, 6
8->146364022, 7
9->141213431, 8
10->135534747, 9
11->135006516, 10
12->133851895, 11
13->115169878, 12
14->107349540, 13
15->102531392, 14
16->90354753, 15
17->81195210, 16
18->78077248, 17
19->59128983, 18
20->63025520, 19
21->48129895, 20
22->51304566, 21
GL000191.1->106433, 22
GL000192.1->547496, 23
GL000193.1->189789, 24
GL000194.1->191469, 25
GL000195.1->182896, 26
GL000196.1->38914, 27
GL000197.1->37175, 28
GL000198.1->90085, 29
GL000199.1->169874, 30
GL000200.1->187035, 31
GL000201.1->36148, 32
GL000202.1->40103, 33
GL000203.1->37498, 34
GL000204.1->81310, 35
GL000205.1->174588, 36
GL000206.1->41001, 37
GL000207.1->4262, 38
GL000208.1->92689, 39
GL000209.1->159169, 40
GL000210.1->27682, 41
GL000211.1->166566, 42
GL000212.1->186858, 43
GL000213.1->164239, 44
GL000214.1->137718, 45
GL000215.1->172545, 46
GL000216.1->172294, 47
GL000217.1->172149, 48
GL000218.1->161147, 49
GL000219.1->179198, 50
GL000220.1->161802, 51
GL000221.1->155397, 52
GL000222.1->186861, 53
GL000223.1->180455, 54
GL000224.1->179693, 55
GL000225.1->211173, 56
GL000226.1->15008, 57
GL000227.1->128374, 58
GL000228.1->129120, 59
GL000229.1->19913, 60
GL000230.1->43691, 61
GL000231.1->27386, 62
GL000232.1->40652, 63
GL000233.1->45941, 64
GL000234.1->40531, 65
GL000235.1->34474, 66
GL000236.1->41934, 67
GL000237.1->45867, 68
GL000238.1->39939, 69
GL000239.1->33824, 70
GL000240.1->41933, 71
GL000241.1->42152, 72
GL000242.1->43523, 73
GL000243.1->43341, 74
GL000244.1->39929, 75
GL000245.1->36651, 76
GL000246.1->38154, 77
GL000247.1->36422, 78
GL000248.1->39786, 79
GL000249.1->38502, 80
MT->16569, 81
NC_007605->171823, 82
X->155270560, 83
Y->59373566, 84
hs37d5->35477943, 85},List({"sampleId": "HG00096", "name": null, "attributes": {}})))
    - field (class: org.bdgenomics.adam.rdd.variation.VariantContextRDD$$anonfun$4, name: $outer, type: class org.bdgenomics.adam.rdd.variation.VariantContextRDD)
    - object (class org.bdgenomics.adam.rdd.variation.VariantContextRDD$$anonfun$4, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 25 more
Aug 6, 2016 1:16:10 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 6



fnothaft commented Aug 6, 2016

Ah, seems like our only recourse here is street justice.

I'll look into this more later. Right now, this runs a-OK on my side... ?


fnothaft commented Aug 6, 2016

Oh actually, a quick question. What version of Spark are you running? Locally, I am on 1.5.2.


jpdna commented Aug 6, 2016

spark-1.6.1


fnothaft commented Aug 6, 2016

Thanks; I'll pull 1.6.1 down as well and test.


jpdna commented Aug 6, 2016

I tried Spark 1.5.2 and I get the same error as with Spark 1.6.1, so something is up.
I made a smaller test VCF file that produces the same error, which you might try:
https://drive.google.com/file/d/0B6jh69Ugixwpem1tQWpCbDd3b2s/view?usp=sharing

If you can run vcf2adam and then adam2vcf successfully with that input file, let me know.


fnothaft commented Aug 6, 2016

Ah, yes! That file does repro the error on my side. I will continue looking into this. Thanks for putting together a small test file, @jpdna. The file I was running on was a sites only file—no samples—d'oh!


jpdna commented Aug 6, 2016

As suggested by the error trace, and confirmed by some extra logging I added, the problem is definitely happening when we reach this code:
val mp = rdd.mapPartitionsWithIndex((idx, iter) => {
https://github.com/fnothaft/adam/blob/issues/1100-sample-serializable/adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation/VariantContextRDD.scala#L117

I'm not sure what to make of the "serialization stack" output, though:

Serialization stack:
    - object not serializable (class: org.bdgenomics.formats.avro.Sample, value: {"sampleId": "HG00096", "name": null, "attributes": {}})
    - writeObject data (class: scala.collection.immutable.$colon$colon)
    - object (class scala.collection.immutable.$colon$colon, List({"sampleId": "HG00096", "name": null, "attributes": {}}))
    - field (class: org.bdgenomics.adam.rdd.variation.VariantContextRDD, name: samples, type: interface scala.collection.Seq)
    - object (class org.bdgenomics.adam.rdd.variation.VariantContextRDD, VariantContextRDD(MapPartitionsRDD[4] at map at GenotypeRDD.scala:62,SequenceDictionary{
1->249250621, 0
2->243199373, 1
3->198022430, 2

I'm thinking the enclosing VariantContextRDD is being sucked into the closure, but my attempts at creating local vals for the rdd, or for the sequences field referenced in the mapPartitionsWithIndex function, don't seem to help. I'd like to get a "clean" copy of just the RDD[VariantContext], but I'm not sure how.

Or... maybe there is still some simple fix with respect to the Sample serializability specifically.
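The capture behavior described above can be reproduced without Spark at all. Below is a minimal, self-contained Scala sketch (FakeSample and VariantContextLike are hypothetical stand-ins, not ADAM's actual classes) showing why a lambda that reads a field drags the enclosing object's $outer reference, and its non-serializable members, into Java serialization, and why copying the field into a local val first avoids it:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for an Avro record lacking java.io.Serializable.
class FakeSample(val sampleId: String)

// Hypothetical stand-in for VariantContextRDD: serializable itself, but
// holding non-serializable samples, like the real class in the trace.
class VariantContextLike(val samples: Seq[FakeSample]) extends Serializable {
  // BAD: the lambda reads this.samples, so it captures `this` ($outer);
  // Java-serializing the closure then walks into the FakeSample list
  // and fails with NotSerializableException.
  def badClosure: Int => Int = i => i + samples.size

  // GOOD: copy what the closure needs into a local val; only the Int is
  // captured, and the closure serializes cleanly.
  def goodClosure: Int => Int = {
    val nSamples = samples.size
    i => i + nSamples
  }
}

// Mimics what Spark's ClosureCleaner.ensureSerializable does.
def serializes(obj: AnyRef): Boolean = {
  val oos = new ObjectOutputStream(new ByteArrayOutputStream())
  try { oos.writeObject(obj); true }
  catch { case _: NotSerializableException => false }
  finally { oos.close() }
}
```

Under these assumptions, `badClosure` fails to serialize while `goodClosure` succeeds, which is the usual motivation for the local-val workaround (registering the class with Kryo alone does not help here, since Spark serializes closures with Java serialization).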


fnothaft commented Aug 7, 2016

Just pushed an update; give this a try.

[ADAM-1100] Resolve Sample Not Serializable exception
Resolves #1100. Registered `Sample` class with the `AvroSerializer` in
`ADAMKryoRegistrator`.

AmplabJenkins commented Aug 7, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1371/
Test PASSed.


jpdna commented Aug 7, 2016

+1, this worked for me. Thanks @fnothaft!
I'll merge tomorrow if no other comments are made.


fnothaft commented Aug 9, 2016

@jpdna ping for merge


fnothaft commented Aug 10, 2016

Ping for merge.

@jpdna merged commit 273d57f into bigdatagenomics:master Aug 10, 2016

1 check passed: default (Merged build finished.)

jpdna commented Aug 10, 2016

merged - thanks @fnothaft
