ADAM compatibility with Spark 2.0 #1021

Closed
jpdna opened this Issue Apr 27, 2016 · 9 comments

@jpdna
Member

jpdna commented Apr 27, 2016

I'd like to start a discussion here about the strategy for building ADAM against Spark 2.0 (which I understand is now essentially the current master at https://github.com/apache/spark).

There are Dataset API operations that don't exist in Spark 1.6 which I want to play with, and which will be needed to move more of ADAM's functionality to the Dataset API, such as Dataset.sortWithinPartitions, so I'm motivated to help make this happen ASAP.

At the moment, the first error I see in my attempt to build ADAM against the current Spark master (after changing the obvious dependencies in the POM and building/installing the latest Spark locally) is that logging was made private in 2.0, as per apache/spark@8ef3399.

So our import of:
org.apache.spark.Logging

fails.

Some questions:

  1. Is there a compelling reason to wait until Spark 2.0 is released in the next month(s), or does it make sense to start making a version of ADAM that builds against the current Apache Spark master on GitHub?

  2. Has any work been done so far towards ADAM on Spark 2.0? What are the known issues, and are there any we expect to require substantial effort?

  3. @heuermh - you manage these wizardly build issues. If you have a suggestion for how I can best help here, or if we can work together, I'd like to benefit from your experience.
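For anyone hitting the same wall: since org.apache.spark.Logging became private[spark] in 2.0, downstream projects need their own logging trait. The sketch below is only illustrative of that pattern (it is not ADAM's actual replacement, which wraps a different backend); it uses java.util.logging so it has no external dependencies.

```scala
// Minimal stand-in for the removed org.apache.spark.Logging trait.
// Hypothetical names; ADAM's real logger differs.
import java.util.logging.{Level, Logger}

trait Logging {
  // One logger per concrete class, named after the class itself.
  // @transient so the logger is not dragged into closures when serialized.
  @transient private lazy val logger: Logger =
    Logger.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit =
    if (logger.isLoggable(Level.INFO)) logger.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (logger.isLoggable(Level.WARNING)) logger.warning(msg)
}

// Example consumer, mirroring how Spark classes mixed in Logging.
class SortRunner extends Logging {
  def run(): String = {
    logInfo("starting sort")
    "done"
  }
}
```

Classes that previously mixed in Spark's trait can then switch their import without any other code changes.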

Member

jpdna commented Apr 29, 2016

Update: I've solved the problems related to logging by using an ADAM logger rather than the now-private Spark one.
I'm now dealing with various deprecated functions which have been removed in Spark 2.0, such as this.

Member

jpdna commented Apr 29, 2016

Continuing from the previous comment...
The error I am now getting, due to the deprecated (now gone in Spark 2.0) function, is:

```
[ERROR] /home/jp/Berk/work/issues/ADAM_on_Spark2/v1/utils/utils/utils-metrics/src/main/scala/org/apache/spark/rdd/InstrumentedOrderedRDDFunctions.scala:21: error: value rddToOrderedRDDFunctions is not a member of object org.apache.spark.SparkContext
[ERROR] import org.apache.spark.SparkContext.rddToOrderedRDDFunctions
```

I suspect I can work around this by using the implicit which apparently replaces it.

As an aside, I was surprised to see at
https://github.com/bigdatagenomics/utils/blob/master/utils-metrics/src/main/scala/org/apache/spark/rdd/InstrumentedOrderedRDDFunctions.scala#L18

that in ADAM we have code that we actually place in the package org.apache.spark.rdd.
I'm just not aware: is it a common pattern to add code to an existing external package like that?

Member

heuermh commented Apr 29, 2016

I'm just not aware: is it a common pattern to add code to an existing external package like that?

No, that is bad practice. It is most likely that way because we're using some private or package-private Spark API, which is sometimes the only way to get things done.

Member

jpdna commented May 2, 2016

There are several places where the ADAM InstrumentedRDD code in bdg-utils uses Spark API functions which no longer exist in Spark 2.0, such as:
https://github.com/bigdatagenomics/utils/blob/master/utils-metrics/src/main/scala/org/apache/spark/rdd/InstrumentedRDD.scala#L161

I have a version of the bdg-utils code which comments out the overridden functions in these ADAM instrumentation extensions, which allows utils to compile. When ADAM is then built with that version of utils and Spark 2.0, ADAM passes all of its tests, and thus seems to work with Spark 2.0.

However, some bdg-utils tests still fail, such as the one below, which reports an error coming from deep within Spark itself.

From https://github.com/bigdatagenomics/utils/blob/master/utils-metrics/src/test/scala/org/bdgenomics/utils/instrumentation/InstrumentedPairRDDFunctionsSuite.scala#L43

- Persisting to Hadoop file is instrumented correctly *** FAILED ***
  java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy
  at scala.Predef$.assert(Predef.scala:165)
  at org.apache.spark.NewAccumulator.writeReplace(NewAccumulator.scala:157)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  ...

I'm interested in discussing the history and future plans for the InstrumentedRDD ADAM code, which we shoehorn into Spark packages, as it reaches deep into Spark and will have to be maintained in parallel over time, as this issue demonstrates. What use cases does ADAM have that require our own private instrumentation codebase? Why is it not part of Spark, where it could be tested and maintained, assuming such instrumentation would be useful to projects beyond ADAM? Do other projects use bdg-utils and this InstrumentedRDD?

Member

jpdna commented May 2, 2016

Note: as of 20 minutes ago, with apache/spark@44da8d8, Spark's NewAccumulator class was renamed to AccumulatorV2. The assertion that is failing above ("copyAndReset must return a zero value copy") is now at:
https://github.com/apache/spark/blob/44da8d8eabeccc12bfed0d43b37d54e0da845c66/core/src/main/scala/org/apache/spark/AccumulatorV2.scala#L155

Member

jpdna commented May 2, 2016

Note: I believe I have a working (if hacky) solution to this copyAndReset problem: adding an equals function that always returns true to the ServoTimers class, here. After this, all tests pass. It seems equals is only used for testing the assertion that the accumulator was set to its initial value, which is still the case.

I'll highlight this in review, with the goal of replacing it with a proper equals function when I make the PR.
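To sketch why the always-true equals is enough to get past the assertion: the "zero value copy" check effectively compares the fresh copy's value against a zero value with ==, which delegates to equals. The toy below illustrates that mechanism only; all names are hypothetical stand-ins, not the real Spark or ServoTimers classes.

```scala
// Toy illustration of why an equals override can satisfy a
// "copyAndReset must return a zero value copy" style assertion.
class ServoTimersLike(var count: Long = 0L) {
  // The hack from the comment above: report equality regardless of contents.
  // A proper implementation would compare the timers field by field.
  override def equals(other: Any): Boolean = other.isInstanceOf[ServoTimersLike]
  override def hashCode: Int = 0
}

class TimerAccumulator(initial: ServoTimersLike) {
  private var value: ServoTimersLike = initial

  // Fresh accumulator holding a zero value, as the contract requires.
  def copyAndReset(): TimerAccumulator = new TimerAccumulator(new ServoTimersLike())

  // Stand-in for the check in Spark's serialization path: the fresh copy's
  // value must == a zero value. With equals always true, this always holds.
  def assertZeroCopy(): Unit = {
    val copy = copyAndReset()
    assert(copy.value == new ServoTimersLike(),
      "copyAndReset must return a zero value copy")
  }
}
```

The risk, of course, is that any other code comparing two ServoTimers instances now gets a false positive, which is why a field-by-field equals is the right long-term fix.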

Member

jpdna commented May 19, 2016

Here you can find branches of bdg-utils and ADAM which compile and pass tests against Spark 2.0.0-SNAPSHOT 4987f39:

https://github.com/jpdna/adam/tree/spark2.0_scala_2.11
https://github.com/jpdna/utils/tree/spark2.0_scala_2.11

These branches are rebased on the current master of adam and utils as of 5/18/2016.

Once Spark 2.0.0 is officially released, we should make a plan for moving long-term Scala 2.11 / Spark 2.0 branches into our main bdg repos, and for keeping them in sync with the Scala 2.10 / Spark 1.6 version, which I imagine we will maintain and release for some time as well (ideally applying new PRs and Jenkins testing to both automatically).

@heuermh heuermh referenced this issue Jun 7, 2016

Closed

Release ADAM version 0.20.0 #1048

47 of 61 tasks complete
Member

heuermh commented Sep 8, 2016

Per #1123 we are going to keep Spark 1.x and 2.x in the same development tree. New scripts will modify the pom files in place to create artifacts with -spark2 suffix. Spark 2.0 has been added to our Jenkins build matrix.

@jpdna is there anything left on your branch(es) not included in #1123?

Member

heuermh commented Sep 8, 2016

Closed by #1123
