[SPARK-23004][SS] Ensure StateStore.commit is called only once in a streaming aggregation task #21124

tdas · 2018-04-23T01:29:16Z

What changes were proposed in this pull request?

A structured streaming query with a streaming aggregation can throw the following error in rare cases.

java.lang.IllegalStateException: Cannot commit after already committed or aborted
	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:643)
	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:135)
	at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2$$anonfun$hasNext$2.apply$mcV$sp(statefulOperators.scala:359)
	at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:102)
	at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:251)
	at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.hasNext(statefulOperators.scala:359)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:188)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:114)
	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:42)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:336)

This can happen when all of the following conditions are hit together.

- Streaming aggregation with an aggregation function that is a subclass of TypedImperativeAggregate (for example, collect_set, collect_list, or percentile).
- Query running in update mode.
- After the shuffle, a partition has exactly 128 records.

Together, these conditions cause StateStore.commit to be called twice within the same task. See the JIRA for a more detailed explanation. The solution is to use NextIterator or CompletionIterator, each of which has a flag to prevent the "onCompletion" task from being called more than once. In this PR, I chose to implement the fix using NextIterator.
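To illustrate the idea of a completion flag, here is a minimal sketch in plain Scala. It is not the code from this PR and not Spark's NextIterator or CompletionIterator; `FakeStore`, `UnguardedIterator`, and `GuardedIterator` are hypothetical names invented for the example. It assumes the commit is triggered from `hasNext` once the input is exhausted (as the stack trace above suggests) and that a consumer may call `hasNext` again after it has already returned false.

```scala
// Hypothetical stand-in for a state store; NOT Spark's StateStore API.
final class FakeStore {
  private var committed = false
  var commitCalls = 0

  def commit(): Unit = {
    commitCalls += 1
    if (committed) {
      throw new IllegalStateException("Cannot commit after already committed or aborted")
    }
    committed = true
  }
}

// Unguarded: every hasNext call on an exhausted input triggers commit() again.
final class UnguardedIterator[A](underlying: Iterator[A], store: FakeStore)
    extends Iterator[A] {
  override def hasNext: Boolean = {
    if (underlying.hasNext) {
      true
    } else {
      store.commit()
      false
    }
  }
  override def next(): A = underlying.next()
}

// Guarded: a flag ensures the completion callback runs at most once.
final class GuardedIterator[A](underlying: Iterator[A], onCompletion: () => Unit)
    extends Iterator[A] {
  private var completed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onCompletion()
    }
    more
  }
  override def next(): A = underlying.next()
}

object CommitOnceSketch {
  def main(args: Array[String]): Unit = {
    val store = new FakeStore
    val it = new GuardedIterator(Iterator(1, 2, 3), () => store.commit())
    while (it.hasNext) it.next()
    it.hasNext // extra call after exhaustion; the flag suppresses a second commit
    assert(store.commitCalls == 1)
  }
}
```

With `UnguardedIterator`, the same extra `hasNext` call would reach `commit()` a second time and raise the IllegalStateException shown above.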

How was this patch tested?

Added a unit test that I have confirmed fails without the fix.

tdas · 2018-04-23T01:38:34Z

@brkyvz PTAL.

tedyu · 2018-04-23T01:42:54Z

+1

ConcurrencyPractitioner · 2018-04-23T03:06:10Z

+1

SparkQA · 2018-04-23T05:15:21Z

Test build #89694 has finished for PR 21124 at commit 304498e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2018-04-23T15:54:12Z

LGTM!

…treaming aggregation task

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21124 from tdas/SPARK-23004.

(cherry picked from commit 770add8)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

tdas added 2 commits April 22, 2018 18:22

SPARK-23004

a78ba37

Removed unnecessary change

304498e

tdas changed the title ~~SPARK-23004~~ [SPARK-23004][SS] Ensure StateStore.commit is called only once in a streaming aggregation task Apr 23, 2018

asfgit closed this in 770add8 Apr 23, 2018
