Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not Serializable #2805

Closed
wants to merge 4 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Oct 14, 2014

Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are

@SparkQA
Copy link

SparkQA commented Oct 14, 2014

QA tests have started for PR 2805 at commit 51c26c2.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 14, 2014

QA tests have finished for PR 2805 at commit 51c26c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SerializableMapWrapper(underlying: collection.Map[K, V])

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21735/
Test PASSed.

@JoshRosen
Copy link
Contributor

mapAsJavaMap is used in a few other places, too, so we should make sure to update those:

  • reduceByKeyLocally
  • countByKey
  • countByKeyApprox
  • countByValue
  • countByValueApprox

I think it's also used in SparkSQL's Java API Row class and a few other places.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have started for PR 2805 at commit ae1b36f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2805 at commit ae1b36f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21763/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have started for PR 2805 at commit f4717f9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2805 at commit f4717f9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21766/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have started for PR 2805 at commit ecb78ee.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2805 at commit ecb78ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21769/
Test PASSed.

@JoshRosen
Copy link
Contributor

I can confirm that this seems to have fixed the serialization issue; here's my test-case:

import org.apache.spark.api.java._
val pairs = sc.parallelize(1 to 10).map(x => (x, x))
val map = new JavaPairRDD(pairs).collectAsMap()
def ser(a: AnyRef) =
    (new java.io.ObjectOutputStream(new java.io.ByteArrayOutputStream())).writeObject(a)
ser(map)

It looks like there's one more case in sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala that needs to be addressed: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/java/Row.scala#L117. This is a private method, but its return value flows to user-code. I'll fix this up myself on merge.

There still might be some other corner-cases with serializability of results that we haven't tested yet. The result of collect() is serializable, so perhaps this issue only affected our use of MapWrapper. Long term, it would be great to add a fuzz-test that runs random Java API workloads and attempts to serialize their results.

I mentioned this over on JIRA, but for GitHub readers: I've opened an issue to fix this upstream in Scala: https://issues.scala-lang.org/browse/SI-8911

I'll merge this now with my fixup. Thanks!

@@ -587,4 +587,11 @@ trait JavaRDDLike[T, This <: JavaRDDLike[T, This]] extends Serializable {
rdd.foreachAsync(x => f.call(x))
}

private[java] def mapAsSerializableJavaMap[A, B](underlying: collection.Map[A, B]) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that these methods should live in JavaUtils rather than in this trait so that they can be used by the Streaming and SQL Java APIs, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, private[java] methods here will become public from Java's POV. We can't use Scala package-private access modifiers when implementing Java API classes.

asfgit pushed a commit that referenced this pull request Oct 18, 2014
Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are

Author: Sean Owen <sowen@cloudera.com>

Closes #2805 from srowen/SPARK-3926 and squashes the following commits:

ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable
f4717f9 [Sean Owen] Oops, fix compile problem
ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well
51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
@asfgit asfgit closed this in f406a83 Oct 18, 2014
@JoshRosen
Copy link
Contributor

I made those minor fixups and committed this as f406a83. I also cherry-picked it into branch-1.1. Thanks!

@srowen
Copy link
Member Author

srowen commented Oct 19, 2014

Yes, all SGTM.

@srowen srowen deleted the SPARK-3926 branch October 19, 2014 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants