[SPARK-18234][SS] Made update mode public #16360
Conversation
@marmbrus Can you take a look?
keyExpressions.find(_.metadata.contains(EventTimeWatermark.delayKey))

val optionalPredicate = optionalWatermarkAttribute.map { watermarkAttribute =>
  // If we are evicting based on a window, use the end of the window. Otherwise just
Any reason why you wouldn't want to make this a method? It seems exactly the same as the code above under append mode.
sure.
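For illustration only, here is a minimal sketch of the kind of shared helper the comment is asking for. It uses simplified stand-in types rather than Spark's internal `Attribute`/`Predicate` classes, and the helper name `buildWatermarkPredicate` is hypothetical, not something the patch necessarily introduces:

```scala
// Simplified stand-ins; not Spark's catalyst classes.
case class Attribute(name: String, metadata: Map[String, String])

object WatermarkHelperSketch {
  // Stand-in for EventTimeWatermark.delayKey.
  val delayKey = "spark.watermarkDelayMs"

  // Shared by both the Append and Update branches: find the key column carrying
  // watermark metadata and, if present, build a row-level "is not late" predicate.
  def buildWatermarkPredicate(
      keyExpressions: Seq[Attribute],
      currentWatermarkMs: Long): Option[Map[String, Long] => Boolean] = {
    keyExpressions
      .find(_.metadata.contains(delayKey))
      .map { watermarkAttribute => (row: Map[String, Long]) =>
        // Keep the row only if its event time is at or past the current watermark.
        row.getOrElse(watermarkAttribute.name, Long.MaxValue) >= currentWatermarkMs
      }
  }
}
```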
Test build #70428 has finished for PR 16360 at commit

@tdas Can you also please update pyspark docs:
private[this] val baseIterator = iter

// Filter late date using watermark if specified
private[this] val baseIterator = watermarkPredicate match {
Is it a bug that append mode doesn't do similar filtering at the moment?
This is a very good question. I had to go back and double-check this with an improved test, but the behavior is correct in append mode even without the filter. This is because append mode and update mode are implemented differently.
- Append mode: consume the iterator, push all changes into the state store, remove the keys that should be removed, and then create a new iterator from the removed elements. Late data (whose state was already dropped in the past) gets added and then immediately removed within the same batch, and the new iterator is smart enough to filter out exactly those rows that were added and immediately removed. Note that this is blocking: the parent iterator is fully consumed first, and only then is the new iterator created.
- Update mode: in contrast, update mode is not really blocking. A new iterator is immediately created that wraps the parent iterator; it consumes from the parent one row at a time, updates the state store, and returns the updated rows. So we need to filter out rows carrying late data so that we don't emit anything for them (see the sketch below).
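To make the difference concrete, here is a small self-contained sketch using plain Scala collections as a stand-in for the state store. It is not the actual `StateStoreSaveExec` code, just the shape of the two strategies described above:

```scala
import scala.collection.mutable

object OutputModeSketch {
  // Toy "state store": event-time key -> running count.
  type State = mutable.Map[Long, Int]

  // Append mode (blocking): consume the whole input and update the store first,
  // then evict everything below the watermark and emit only the evicted keys that
  // already had state before this batch. Late rows are added and immediately
  // evicted within the same batch and are skipped here, so no up-front filter on
  // the input is needed.
  def appendMode(input: Iterator[Long], state: State, watermark: Long): Iterator[Long] = {
    val preExisting = state.keySet.toSet
    input.foreach(t => state(t) = state.getOrElse(t, 0) + 1)
    val evicted = state.keys.filter(_ < watermark).toList
    evicted.foreach(state.remove)
    evicted.iterator.filter(preExisting.contains)
  }

  // Update mode (non-blocking): wrap the parent iterator and emit each updated key
  // as it is consumed. Late rows must be filtered out up front, otherwise their
  // already-dropped state would be re-created and emitted again.
  def updateMode(input: Iterator[Long], state: State, watermark: Long): Iterator[Long] = {
    input.filter(_ >= watermark).map { t =>
      state(t) = state.getOrElse(t, 0) + 1
      t
    }
  }
}
```

In the real operator the store is the per-partition StateStore and the predicate comes from the watermark expression, but the blocking-versus-wrapping distinction is the same.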
Test build #70429 has finished for PR 16360 at commit
Test build #70430 has finished for PR 16360 at commit
Test build #70441 has finished for PR 16360 at commit
Test build #70450 has finished for PR 16360 at commit
Generally LGTM!
@@ -15,7 +15,7 @@
  * limitations under the License.
  */

-package org.apache.spark.sql
+package org.apache.spark.sql.catalyst
maybe in catalyst.streaming or something?
outputMode match {
  case Append | Complete => // allowed
  case Update =>
    throw new AnalysisException("Update ouptut mode is not supported for memory format")
"memory sink" for consistency with the above error
also "output". Thinking a little more, I wonder if it would be better to say what is supported?
Test build #70492 has finished for PR 16360 at commit
## What changes were proposed in this pull request?

Made update mode public. As part of that, here are the changes:
- Updated DataStreamWriter to accept "update"
- Changed the package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
- Added update-mode state removal with watermark to StateStoreSaveExec

## How was this patch tested?

Added new tests in the changed modules.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16360 from tdas/SPARK-18234.

(cherry picked from commit 83a6ace)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
What changes were proposed in this pull request?
Made update mode public. As part of that, here are the changes:
- Updated DataStreamWriter to accept "update"
- Changed the package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
- Added update-mode state removal with watermark to StateStoreSaveExec

How was this patch tested?
Added new tests in the changed modules.
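For reference, a minimal end-user sketch of what the now-public mode looks like from `DataStreamWriter`; the socket source and console sink here are just placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

object UpdateModeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UpdateModeExample").getOrCreate()
    import spark.implicits._

    // Placeholder streaming source for the sake of a runnable example.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Running word count. In "update" mode each trigger writes only the rows whose
    // counts changed, instead of appending finalized rows or rewriting the complete
    // result table.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = wordCounts.writeStream
      .outputMode("update")   // accepted by DataStreamWriter as of this change
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```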