[SPARK-22367][WIP][CORE] Separate the serialization of class and object for iterator #19586
Conversation
Hi, @cloud-fan @jiangxb1987 @chenghao-intel. Would you mind taking a look? Thanks a lot.
pom.xml (Outdated)

```diff
@@ -133,7 +133,7 @@
     <hive.parquet.version>1.6.0</hive.parquet.version>
     <jetty.version>9.3.20.v20170531</jetty.version>
     <javaxservlet.version>3.1.0</javaxservlet.version>
-    <chill.version>0.8.4</chill.version>
+    <chill.version>0.9.2</chill.version>
```
I am not sure whether it should be changed. If it is unreasonable, I can change it back.
Why do you need to update it?
Not necessary. Chill 0.9.2 uses kryo 4.0. I can change it back.
please change it back.
A library upgrade deserves a separate PR.
One executor, with the configuration as follows:

```
${SPARK_HOME}/bin/spark-submit \
  --class com.intel.KryoTest \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=50g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --driver-memory 5G \
  --driver-cores 10 \
  --executor-memory 40G \
  --executor-cores 20 \
  --num-executors 1
```
I'm also slightly concerned about compatibility here. I'm trying to think if there's any case where we intend to support kryo/java serialized objects from 2.x in 2.y.
Also, doesn't Kryo registration solve the problem of writing the class name every time? That's why it compresses it to an identifier.
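For reference, a minimal sketch of what that registration looks like with Spark's standard API (`MyRecord` is a stand-in class for illustration, not something from this PR):

```scala
import org.apache.spark.SparkConf

// MyRecord is a hypothetical user class, used only to show the API.
case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("KryoRegistrationExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registered classes are encoded as a small varint ID instead of the full
  // class name string -- the compression referred to above.
  .registerKryoClasses(Array(classOf[MyRecord]))
```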
```diff
@@ -205,11 +205,45 @@ class KryoSerializationStream(

   private[this] var kryo: Kryo = serInstance.borrowKryo()

+  // This is only used when we write object and class separately.
+  var classWrote = false
```
I don't see why you need this state and need to repeat the logic about writing / not writing classes everywhere. Surely this just goes in one writeAll / asIterator pair?
Yeah, it is used for writeAll / asIterator. But for MemoryStore.putIteratorAsBytes, we don't use writeAll; instead we use this state to indicate whether we have already written the class.
Why not write that state as an iterator of stuff, if that's how it behaves, rather than duplicate code? 'values' is already an iterator there. Either way there's no need for state here: this state is local to the writing process. This seems like a recipe for a thread-safety bug later.
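As a rough sketch of that suggestion (a hypothetical helper written against plain Kryo, not the PR's actual code), the "class header first" logic can stay local to one function instead of living as mutable stream state:

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Write the class header exactly once, then only object payloads; no
// classWrote flag has to live on the stream itself.
def writeAllSameClass[T](kryo: Kryo, output: Output, values: Iterator[T]): Unit = {
  if (values.hasNext) {
    val first = values.next()
    kryo.writeClass(output, first.getClass)        // class info, written once
    (Iterator.single(first) ++ values).foreach { v =>
      kryo.writeObject(output, v)                  // object payload only
    }
  }
}
```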
@srowen Thanks for the review. Even after you register a class, Kryo still writes the class (not the full class name, but just a class ID) every time you call `writeClassAndObject`.
```scala
val value = values.next()
if (kryoSerializationStream != null) {
  if (!kryoSerializationStream.classWrote) {
    kryoSerializationStream.writeClass(value.getClass)
```
@srowen you can see it here: we don't use writeAll, because we need to acquire memory according to the written size.
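To illustrate why a record-at-a-time loop exists here (a hedged sketch in plain Kryo, not the PR's code): the caller can check how many bytes have been written after each record and acquire memory before writing the next one:

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Writes records one at a time; the class header goes out before the first
// record only. output.total() reports bytes written so far, which a memory
// store could use to decide whether to acquire more memory.
def writeIncrementally[T](kryo: Kryo, output: Output, values: Iterator[T]): Unit = {
  var classWrote = false
  while (values.hasNext) {
    val value = values.next()
    if (!classWrote) {
      kryo.writeClass(output, value.getClass)
      classWrote = true
    }
    kryo.writeObject(output, value)
    val bytesSoFar = output.total() // hook point for memory accounting
  }
  output.flush()
}
```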
```diff
@@ -205,11 +205,45 @@ class KryoSerializationStream(

   private[this] var kryo: Kryo = serInstance.borrowKryo()

+  // This is only used when we write object and class separately.
+  var classWrote = false
+
   override def writeObject[T: ClassTag](t: T): SerializationStream = {
     kryo.writeClassAndObject(output, t)
```
I was expecting kryo to buffer the distinct classes and only store an identifier/pointer for duplicated classes. Even if we write object and class every time, the overhead should be small. Is this not true?
From the code, it just writes a varInt if the class has been registered, and there is also some computation needed to get the varInt. But from the test, the overhead looks more serious than I expected.
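A quick way to see the size overhead being discussed (a sketch using plain Kryo; `Point` and the record count are illustrative, and exact byte counts will vary by Kryo version and defaults):

```scala
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// A hypothetical record type, chosen only for illustration.
case class Point(x: Double, y: Double)

def serializedSize(register: Boolean): Int = {
  val kryo = new Kryo()
  if (register) kryo.register(classOf[Point]) // class becomes a small varint ID
  val bytes = new ByteArrayOutputStream()
  val output = new Output(bytes)
  // With autoReset on (the default), each top-level write repeats the class
  // info: the full class name when unregistered, a varint when registered.
  (1 to 1000).foreach(i => kryo.writeClassAndObject(output, Point(i, i)))
  output.close()
  bytes.size()
}

println(s"unregistered: ${serializedSize(false)} B, registered: ${serializedSize(true)} B")
```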
OK to test
Hi @cloud-fan, thanks for reviewing. There are some errors about ...
@ConeyLiu what about the below example, does your implementation support this?

```scala
trait Base { val name: String }
case class A(name: String) extends Base
case class B(name: String) extends Base

sc.parallelize(Seq(A("a"), B("b"))).map { i => (i, 1) }.reduceByKey(_ + _).collect()
```

Here not all the elements have the same class type; does your PR support such a scenario?
Hi @jerryshao, thanks for the reminder; it doesn't support that. I'm sorry I did not take it into account. How about using a configuration to determine whether we should use this optimization? Can you give some advice? Also cc @cloud-fan @srowen
Using configurations seems not so elegant; also, configuration is application-based, so how would you turn this feature on/off at runtime? Sorry I cannot give you good advice; maybe Kryo's solution is the best option for the general case.
Currently, I use it directly. Maybe this is suitable for some special cases where all the data has the same type, such as ML.
For these cases, they can write their own serializer and set it via `spark.serializer`.
Hi @cloud-fan, in most cases the data types should be the same, so I think this optimization is valuable: it can save considerable space and CPU resources. What about setting a flag on the RDD that indicates whether the RDD only contains a single type? If that's not viable, could we put a special serializer in the ml package that users could configure? But in that case, the exact ClassTag of the RDD must be provided for serialization, because of the record relocation in UnsafeShuffleWriter.
I tend to agree with @cloud-fan. I think you can implement your own serializer outside of Spark to be more specialized for your application; that will definitely be more efficient than the built-in one. But Spark's default solution should be general enough to cover all cases. Setting a flag or a configuration is not intuitive enough, from my understanding. And for ML, can you please provide an example of how this could be improved with your approach? From my understanding, your approach is more useful when leveraging custom class definitions.
I think this problem will go away after mllib migrates to Spark SQL completely. For now, I think we can make the serializer config job-wise and set this special serializer for ML jobs.
Hi @cloud-fan, @jerryshao. The problem can be reproduced with the following test:

```scala
// Imports reconstructed for completeness; VectorWritable is assumed to be
// Mahout's org.apache.mahout.math.VectorWritable, matching the get()/size()
// API used below.
import org.apache.hadoop.io.LongWritable
import org.apache.mahout.math.VectorWritable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("Vector Register Test")
conf.registerKryoClasses(Array(classOf[Vector], classOf[DenseVector], classOf[SparseVector]))
val sc = new SparkContext(conf)

val sourceData = sc.sequenceFile[LongWritable, VectorWritable](args(0))
  .map { case (k, v) =>
    val vector = v.get()
    val tmpVector = new Array[Double](vector.size())
    for (i <- 0 until vector.size()) {
      tmpVector(i) = vector.get(i)
    }
    Vectors.dense(tmpVector)
  }

sourceData.persist(StorageLevel.OFF_HEAP)

var start = System.currentTimeMillis()
sourceData.count()
println("First: " + (System.currentTimeMillis() - start))

start = System.currentTimeMillis()
sourceData.count()
println("Second: " + (System.currentTimeMillis() - start))

sc.stop()
```

Results:

Those classes are very common for ML. The reason should be a problem in Kryo: it writes the full class name instead of the class ID if the class is not registered.
also cc @WeichenXu123
We can configure the classes to register via the `spark.kryo.classesToRegister` config.
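For instance, with the standard Spark settings (the class names shown are the real mllib ones; adjust the list to your job):

```scala
import org.apache.spark.SparkConf

// Equivalent to calling conf.registerKryoClasses(...), but driven by config
// values, so it can be changed per job submission without code changes.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.classesToRegister",
    "org.apache.spark.mllib.linalg.DenseVector,org.apache.spark.mllib.linalg.SparseVector")
```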
Thanks for the suggestion. I raised a new PR to solve this problem; closing this one now.
## What changes were proposed in this pull request?

There are still some algorithms based on mllib, such as KMeans. Currently, many common mllib classes (such as Vector, DenseVector, SparseVector, Matrix, DenseMatrix, SparseMatrix) are not registered with Kryo, so there are performance issues when serializing or deserializing those objects. Previously discussed: #19586

## How was this patch tested?

New test case.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #19661 from ConeyLiu/register_vector.
What changes were proposed in this pull request?
Because all the records in an iterator have the same class, there is no need to write the class information for every record. We only need to write the class information once at the beginning of serialization, and read it only once during deserialization.

In this patch, we separate the serialization of class and object for an iterator serialized with Kryo. This improves the performance of serialization and deserialization, and saves space.
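As a sketch of the read side of this idea (plain Kryo, hedged; not the exact code in the patch): the class is read once up front, then each record is deserialized without per-record class info:

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input

// Reads the class header once, then yields records until the input is
// exhausted, mirroring a write side that wrote the class only once.
def readAllSameClass[T](kryo: Kryo, input: Input): Iterator[T] = {
  val clazz = kryo.readClass(input).getType.asInstanceOf[Class[T]]
  new Iterator[T] {
    override def hasNext: Boolean = !input.end()
    override def next(): T = kryo.readObject(input, clazz)
  }
}
```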
Test case:
Test result:
The size of the serialized data:
before: 34.3GB
after: 17.5GB
How was this patch tested?
Existing UT.