[SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused #15512
witgo wants to merge 2 commits into apache:master from witgo/SPARK-17930
Conversation
Hm, if the benchmark you give generalizes much, that is certainly compelling. I guess I'm surprised that instantiating the object can be so expensive relative to deserialization, since it just happens once per task. But it is a fairly simple change.
Test build #67063 has finished for PR 15512 at commit
serializing will create buffers, but since these are only used for deserializing, I don't think there should even be any buffers created. I guess the time saved is all the registration which can be skipped? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L85

I suppose in this case, this is the result of

The only wrinkle I can see here is if reference-tracking is turned on (which it is, by default). But I think this is taken care of anyway by the way
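For context, a rough micro-benchmark sketch of the cost under discussion (hypothetical and standalone: it drives `KryoSerializer` directly rather than going through `DirectTaskResult`, and `NewInstanceCost` is an invented name):
```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object NewInstanceCost {
  def main(args: Array[String]): Unit = {
    val ser = new KryoSerializer(new SparkConf(false))
    val payload = ser.newInstance().serialize(Seq.fill(100)("task-result"))

    // Fresh SerializerInstance per call: pays newKryo() (registration
    // of all the default classes) on every deserialization.
    var t0 = System.nanoTime()
    for (_ <- 1 to 10000) {
      ser.newInstance().deserialize[Seq[String]](payload.duplicate())
    }
    val freshMs = (System.nanoTime() - t0) / 1e6

    // One reused instance: the registration cost is paid exactly once.
    val reused = ser.newInstance()
    t0 = System.nanoTime()
    for (_ <- 1 to 10000) {
      reused.deserialize[Seq[String]](payload.duplicate())
    }
    val reusedMs = (System.nanoTime() - t0) / 1e6

    println(s"fresh instance per call: $freshMs ms, reused instance: $reusedMs ms")
  }
}
```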
Jenkins, retest this please
nit: I prefer the following code, because Option(...).getOrElse doesn't improve readability but creates unnecessary objects:
```scala
val resultSer = if (resultSer == null) SparkEnv.get.serializer.newInstance() else resultSer
valueObject = resultSer.deserialize(valueBytes)
```
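For illustration, the two idioms side by side (a sketch; here `resultSer` stands for a possibly-null shared instance supplied by the caller, and `ser`/`ser2` are illustrative names):
```scala
// Explicit null check, as preferred above: no wrapper allocation.
val ser = if (resultSer == null) SparkEnv.get.serializer.newInstance() else resultSer

// Option-based alternative: allocates a Some(resultSer) on every call
// only to unwrap it immediately.
val ser2 = Option(resultSer).getOrElse(SparkEnv.get.serializer.newInstance())
```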
Test build #67065 has finished for PR 15512 at commit
nit: Would be nice to add a comment here saying "force deserialization of referenced value" or some such
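If the nit were applied, the call site might look something like this hypothetical sketch (the `taskResultSerializer` thread-local and the surrounding names are assumptions, not necessarily the merged code):
```scala
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

// One SerializerInstance per result-fetching thread, reused for every
// task result that thread deserializes.
val taskResultSerializer = new ThreadLocal[SerializerInstance] {
  override def initialValue(): SerializerInstance =
    SparkEnv.get.serializer.newInstance()
}

// force deserialization of referenced value
// (directResult is the DirectTaskResult fetched from the executor)
val result = directResult.value(taskResultSerializer.get())
```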
@squito I also think that the time saved is all the registration which can be skipped, but I did not verify it.
Test build #67102 has finished for PR 15512 at commit
LGTM. Merging to master. Thanks!
[SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused apache#15512
## What changes were proposed in this pull request?
The following code is called when a `DirectTaskResult` instance is deserialized:
```scala
def value(): T = {
  if (valueObjectDeserialized) {
    valueObject
  } else {
    // Each deserialization creates a new SerializerInstance, which is very time-consuming.
    val resultSer = SparkEnv.get.serializer.newInstance()
    valueObject = resultSer.deserialize(valueBytes)
    valueObjectDeserialized = true
    valueObject
  }
}
```
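For comparison, a minimal sketch of the reuse this PR proposes, assuming the caller (e.g. `TaskResultGetter`) passes in a shared instance; the parameter name is illustrative:
```scala
def value(resultSer: SerializerInstance = null): T = {
  if (valueObjectDeserialized) {
    valueObject
  } else {
    // Fall back to a fresh instance only when no shared one is supplied,
    // so existing callers keep working unchanged.
    val ser = if (resultSer == null) SparkEnv.get.serializer.newInstance() else resultSer
    valueObject = ser.deserialize(valueBytes)
    valueObjectDeserialized = true
    valueObject
  }
}
```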
When a stage has many tasks, reusing the `SerializerInstance` improves scheduling performance by roughly 3x.
The test data is TPC-DS at 2 TB scale (Parquet), and the SQL statement (query 2) is as follows:
```sql
select i_item_id,
       avg(ss_quantity) agg1,
       avg(ss_list_price) agg2,
       avg(ss_coupon_amt) agg3,
       avg(ss_sales_price) agg4
from store_sales, customer_demographics, date_dim, item, promotion
where ss_sold_date_sk = d_date_sk and
      ss_item_sk = i_item_sk and
      ss_cdemo_sk = cd_demo_sk and
      ss_promo_sk = p_promo_sk and
      cd_gender = 'M' and
      cd_marital_status = 'M' and
      cd_education_status = '4 yr Degree' and
      (p_channel_email = 'N' or p_channel_event = 'N') and
      d_year = 2001
group by i_item_id
order by i_item_id
limit 100;
```
`spark-defaults.conf` file:
```
spark.master yarn-client
spark.executor.instances 20
spark.driver.memory 16g
spark.executor.memory 30g
spark.executor.cores 5
spark.default.parallelism 100
spark.sql.shuffle.partitions 100000
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 0
spark.rpc.netty.dispatcher.numThreads 8
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true
```
Performance test results are as follows:

[SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930) (patched) | witgo@ed14633 (baseline)
------------ | -------------
54.5 s | 231.7 s
## How was this patch tested?
Existing tests.
Author: Guoqiang Li <witgo@qq.com>
Closes apache#15512 from witgo/SPARK-17930.