[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency #7403
Conversation
Jenkins, test this please.
There was a timeout fetching from the git repo. Having Jenkins try again.
```diff
@@ -65,7 +67,7 @@ abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
  */
 @DeveloperApi
-class ShuffleDependency[K, V, C](
+class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
```
FYI, this is a binary-incompatible change to a DeveloperApi.
According to the source for DeveloperApi, a Developer API is unstable and can change between minor releases.
I don't think that we can make this change as-is since we can't break binary compatibility for stable public APIs like PairRDDFunctions.
@JoshRosen What suggestions do you have for making the key, value and combiner class names available in the ShuffleDependency in a way that is binary compatible? I had assumed that this would be part of the next major release where binary changes might be allowed.
Also, the comment in the …
Luckily, the original PairRDDFunctions has ClassTags for the key and value; however, the …
Test build #37268 has finished for PR 7403 at commit
@massie I'm in favor of collecting more type information (e.g. even type tags), but no labeling means it should be stable across the entire Spark 1.x release...
@JoshRosen @rxin I just pushed an update that reverts the … The only remaining sticking point is the three …
Test build #37380 has finished for PR 7403 at commit
To answer my own question, adding a …
I just pushed an update (5c58b4df9b) which ensures that we keep binary compatibility in … If this is an approach that is agreeable, I'll expand the tests and finalize the work. Any suggestions on a better name for the new combiner methods? I just called it …

Thanks for the review help, @JoshRosen and @rxin.
Test build #37423 has finished for PR 7403 at commit
There is another approach that would work here too. Currently, the implicit conversion is defined as:

```scala
object RDD {
  def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }
}
```

Code compiled with this looks like the following, …

To keep binary compatibility, we just need to keep the old method and add a new implicit conversion:

```scala
object RDD {
  def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
    new PairRDDFunctions(rdd)
  }

  implicit def rddToNewPairRDDFunctions[K, V](rdd: RDD[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): NewPairRDDFunctions[K, V] = {
    new NewPairRDDFunctions(rdd)
  }

  ...
}
```

Old code will still run since the … As a small side note, the … The advantage of this approach is that the method names remain the same (no need for a …).
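To make the idea concrete, here is a self-contained sketch of that pattern using simplified stand-ins (`PairFns`, `NewPairFns` and `Conversions` are hypothetical names, not Spark's real classes): the old conversion method keeps its exact signature so already-compiled bytecode still links against it, while newly compiled code resolves the implicit conversion to the ClassTag-carrying class.

```scala
import scala.language.implicitConversions
import scala.reflect.ClassTag

object ImplicitCompatSketch {
  // Old enrichment class: no ClassTags available.
  class PairFns[K, V](val pairs: Seq[(K, V)]) {
    def keys: Seq[K] = pairs.map(_._1)
  }

  // New enrichment class: carries a ClassTag for the key.
  class NewPairFns[K, V](val pairs: Seq[(K, V)])(implicit kt: ClassTag[K]) {
    def keys: Seq[K] = pairs.map(_._1)
    def keyClassName: String = kt.runtimeClass.getName
  }

  object Conversions {
    // Kept with its original signature so previously compiled callers still
    // link against it, but it is no longer marked implicit.
    def toPairFns[K, V](p: Seq[(K, V)]): PairFns[K, V] = new PairFns(p)

    // Newly compiled code resolves this implicit conversion instead.
    implicit def toNewPairFns[K: ClassTag, V](p: Seq[(K, V)]): NewPairFns[K, V] =
      new NewPairFns(p)
  }

  def main(args: Array[String]): Unit = {
    import Conversions._
    val data = Seq(("a", 1), ("b", 2))
    println(data.keys)          // the implicit conversion supplies `keys`
    println(data.keyClassName)  // java.lang.String
  }
}
```

The key point is that removing the `implicit` modifier changes only compile-time resolution, not the method's JVM signature, which is all that binary compatibility cares about.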
(force-pushed from 5c58b4d to 5aba066)
Test build #37988 has finished for PR 7403 at commit
This is ready for review when someone has the time. All unit tests pass.
```scala
/**
 * The key, value and combiner classes are serialized so that shuffle manager
 * implementations can use the information to build …
 */
val keyClassName: String = reflect.classTag[K].runtimeClass.getName
```
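As a standalone illustration of the mechanism (the `Dep` class below is a hypothetical stand-in, not Spark's `ShuffleDependency`), a ClassTag context bound is all that is needed to capture runtime class names at construction time:

```scala
import scala.reflect.{classTag, ClassTag}

// Hypothetical stand-in for a dependency class that records the runtime
// class names of its type parameters via ClassTag context bounds.
class Dep[K: ClassTag, V: ClassTag] {
  val keyClassName: String = classTag[K].runtimeClass.getName
  val valueClassName: String = classTag[V].runtimeClass.getName
}

object DepDemo {
  def main(args: Array[String]): Unit = {
    val d = new Dep[String, Int]
    println(d.keyClassName)   // java.lang.String
    println(d.valueClassName) // int (primitive types report their primitive class)
  }
}
```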
how is this used? it might require the key class to have a 0-arg ctor right?
Here's an example of how I use this in the Parquet shuffle manager to create a schema for the (key, value) or (key, combiner) pairs for the shuffle files.
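A hedged sketch of the general pattern (stand-in names only; the real Parquet shuffle manager code differs): the serialized class name lets a shuffle manager reload the class and instantiate it through a zero-arg constructor, which is exactly what Avro, Thrift and Protobuf generated classes provide.

```scala
// Given only a class name (as a serialized ShuffleDependency field would
// carry), load the class and create an instance through its zero-arg
// constructor: the pattern generated record classes support.
object SchemaFromNameSketch {
  def instantiate(className: String): AnyRef = {
    val cls = Class.forName(className)
    // This call is where a zero-arg constructor becomes a hard requirement.
    cls.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
  }

  def main(args: Array[String]): Unit = {
    // java.lang.StringBuilder stands in for a generated record class here.
    val record = instantiate("java.lang.StringBuilder")
    println(record.getClass.getName) // java.lang.StringBuilder
  }
}
```

This also answers the reviewer's question above: yes, reflective instantiation of this kind does require the key class to have a zero-arg constructor.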
Test build #38310 has finished for PR 7403 at commit
Jenkins, test this please.
Jenkins, test this please.
Test build #38367 has finished for PR 7403 at commit
Jenkins tests pass. The other failures were Jenkins hiccups.
```diff
@@ -48,7 +48,7 @@ import org.apache.spark.serializer.Serializer
  * you can use `rdd1`'s partitioner/partition size and not worry about running
  * out of memory because of the size of `rdd2`.
  */
-private[spark] class SubtractedRDD[K: ClassTag, V: ClassTag, W: ClassTag](
```
These ClassTags should not be removed. I just pushed an update that reverts this line change.
Test build #38403 has finished for PR 7403 at commit
Test build #38406 has finished for PR 7403 at commit
The error isn't related to this PR... it looks like something related to Kinesis-backed streaming. Maybe related to …
Maybe @tdas knows what's causing the failure?
```diff
@@ -75,7 +76,8 @@ private[spark] class CoGroupPartition(
  * @param part partitioner used to partition the shuffle output
  */
 @DeveloperApi
-class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)
+class CoGroupedRDD[K: ClassTag](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]],
+    part: Partitioner)
```
style
I missed this in my style changes. Fixing now.
@andrewor14 I noticed that I missed a few of the style fixes that you recommended. I just pushed 41d2a3c which fixes them. Thanks for reviewing this PR. I appreciate it.
Test build #41897 has finished for PR 7403 at commit
Test build #41893 has finished for PR 7403 at commit
@rxin and @andrewor14 - Is there anything more that needs to be done before this PR is ready to be merged? I've made all the recommended changes. There is an open question about having the class names be (a) … This PR is ready to go for Option A. Let me know your thoughts.
[SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency

ShuffleManager implementations are currently not given type information for the key, value and combiner classes. Serialization of shuffle objects relies on objects being JavaSerializable, with methods defined for reading/writing the object or, alternatively, serialization via Kryo which uses reflection. Serialization systems like Avro, Thrift and Protobuf generate classes with zero argument constructors and explicit schema information (e.g. IndexedRecords in Avro have get, put and getSchema methods). By serializing the key, value and combiner class names in ShuffleDependency, shuffle implementations will have access to schema information when registerShuffle() is called.
(force-pushed from 41d2a3c to ed1afac)
I just rebased this on top of …
@rxin @andrewor14 This PR has been open for almost two months. Can you please let me know if you see any remaining work that needs to be done before merging?
```scala
/**
 * Simplified version of combineByKey that hash-partitions the output RDD.
 * This method is here for backward compatibility. It
 * does not provide combiner classtag information to
 * the shuffle.
 */
```
The java doc should start with a description of what the method does. We should use the old one and add that it exists for backward compatibility after the first sentence.
@massie The changes here look fine. The only thing I'm still not sure about is the fact that everything is public. You pointed out that …

Other than that, I don't have strong opinions one way or the other about this patch. I think it's a good addition, but I'm wary of the many public API changes in this patch. If @rxin thinks it's in good shape then we should merge it.
Test build #42210 has finished for PR 7403 at commit
Thanks for the response, @andrewor14. I'll update this PR to make the class names private. As you say, we can make them public in a future PR, if needed. You're right that #7265 doesn't need them to be …
Test build #42220 has finished for PR 7403 at commit
The class names are … now. To provide a little more explanation of why the new methods are necessary, this:

```scala
class PlaySuite extends SparkFunSuite {
  def combine[C](c: C): Unit = {
    println("Running combine without ClassTag")
  }
  def combine[C](c: C)(implicit ct: ClassTag[C]): Unit = {
    println("Running combine with ClassTag")
  }
  test("Example") {
    combine(42)
  }
}
```

causes the compiler to throw the error:

…

There's unfortunately no way to add ClassTags and not break compatibility without having these new methods. I hope that you find this PR ready to merge now. Please let me know if you see anything else that needs to be done.
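For contrast, here is a minimal self-contained sketch showing that giving the ClassTag-taking variant a distinct name resolves cleanly (the name `combineWithClassTag` below is illustrative only, not necessarily the name used in the PR):

```scala
import scala.reflect.ClassTag

object PlayFixed {
  def combine[C](c: C): String =
    "without ClassTag"

  // A distinct method name sidesteps the overload ambiguity entirely:
  // the compiler never has to choose between two applicable `combine`s.
  def combineWithClassTag[C](c: C)(implicit ct: ClassTag[C]): String =
    "with ClassTag: " + ct.runtimeClass.getName

  def main(args: Array[String]): Unit = {
    println(combine(42))              // without ClassTag
    println(combineWithClassTag(42))  // with ClassTag: int
  }
}
```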
@massie This looks OK to me. The only thing is that I find the name …

I understand that it won't compile if you just call it …

I'll defer the judgment to @rxin.
@andrewor14 I agree the name … We can't make the … I hope that in Spark 2.0, when we're able to fix this API, we can simply add the …
Let's add an experimental annotation to it, and then merge this.
Test build #42297 has finished for PR 7403 at commit
LGTM.
I've merged this. Thanks @massie