[SPARK-18234][SS] Made update mode public #16360

Closed

@tdas tdas commented Dec 20, 2016

What changes were proposed in this pull request?

Made update mode public. As part of that, here are the changes:

  • Updated DataStreamWriter to accept "update"
  • Changed the package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
  • Added watermark-based state removal for update mode to StateStoreSaveExec
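
For illustration, the first change amounts to accepting one more mode name; with it in place, a streaming query selects the mode by calling `.outputMode("update")` on the writer. The following is a minimal, self-contained sketch of mapping a mode string to a mode value. The `OutputModeSketch` object, its `Mode` type, the `parse` helper, and the error text are all hypothetical stand-ins, not Spark's actual classes or messages.

```scala
object OutputModeSketch {
  // Hypothetical stand-ins for the three output modes; not Spark's classes.
  sealed trait Mode
  case object Append extends Mode
  case object Complete extends Mode
  case object Update extends Mode

  /** Maps a user-supplied mode name to a Mode, ignoring case. */
  def parse(name: String): Mode =
    name.toLowerCase(java.util.Locale.ROOT) match {
      case "append"   => Append
      case "complete" => Complete
      case "update"   => Update // the newly accepted name
      case other =>
        throw new IllegalArgumentException(
          s"Unknown output mode '$other'. Accepted modes are 'append', 'complete', 'update'")
    }
}
```

Lower-casing with an explicit locale keeps the match independent of the JVM's default locale.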

How was this patch tested?

Added new tests in changed modules

@tdas tdas changed the title Made update mode public [SPARK-18234][SS] Made update mode public Dec 20, 2016

tdas commented Dec 20, 2016

@marmbrus Can you take a look?

keyExpressions.find(_.metadata.contains(EventTimeWatermark.delayKey))

val optionalPredicate = optionalWatermarkAttribute.map { watermarkAttribute =>
// If we are evicting based on a window, use the end of the window. Otherwise just

Contributor

any reason why you wouldn't want to make this a method? Seems exactly the same as above under append mode

Contributor Author

sure.

SparkQA commented Dec 21, 2016

Test build #70428 has finished for PR 16360 at commit b5f216e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

brkyvz commented Dec 21, 2016

@tdas Can you also please update pyspark docs:
https://github.com/apache/spark/blob/master/python/pyspark/sql/streaming.py#L659

-private[this] val baseIterator = iter

+// Filter late data using watermark if specified
+private[this] val baseIterator = watermarkPredicate match {

Contributor

is it a bug that append mode doesn't do a similar filtering at the moment?

Contributor Author

This is a very, very good question. I had to go back and double-check this with an improved test, but the behavior is correct in append mode even without the filter. This is because append mode and update mode are implemented differently.

  • append mode: Consume the iterator and push all changes into the state store, remove the things that should be removed, and then create a new iterator from the removed elements. So late data (whose state has been dropped in the past) will get added and then immediately removed. The new iterator is smart enough to filter out only those rows that were added and immediately removed. Note that this is blocking: the parent iterator is consumed first, and only then is a new iterator created.

  • update mode: In contrast, update mode is not really blocking. A new iterator is immediately created that wraps the parent iterator. It consumes rows from the parent iterator one by one, updates the state store, and returns the updated rows. So we need to filter out the late rows so that we don't emit anything for them.
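
The difference between the two modes can be sketched in plain Scala without Spark. This is a simplified model under stated assumptions, not the code in StateStoreSaveExec: the state store is a mutable map from a window end time to a count, each input element is one event's window end, a window counts as expired once its end is at or below the watermark, and the names `OutputModeIterators`, `appendMode`, and `updateMode` are hypothetical. State eviction in update mode is left out for brevity.

```scala
import scala.collection.mutable

object OutputModeIterators {
  // A row is (windowEnd, count); expired means windowEnd <= watermark.
  type Row = (Long, Long)

  /** Append mode as described above: drain the input, then emit evicted state. */
  def appendMode(input: Iterator[Long],
                 state: mutable.Map[Long, Long],
                 watermark: Long): Iterator[Row] = {
    val addedThisBatch = mutable.Set.empty[Long]
    // Blocking step: the parent iterator is fully consumed before any output exists.
    input.foreach { windowEnd =>
      if (!state.contains(windowEnd)) addedThisBatch += windowEnd
      state(windowEnd) = state.getOrElse(windowEnd, 0L) + 1L
    }
    // Evict every expired window from the state.
    val expired = state.filter { case (windowEnd, _) => windowEnd <= watermark }.toList
    expired.foreach { case (windowEnd, _) => state.remove(windowEnd) }
    // Windows added and evicted within the same batch (late data) net out to nothing.
    expired.filterNot { case (windowEnd, _) => addedThisBatch(windowEnd) }
      .sortBy(_._1)
      .iterator
  }

  /** Update mode as described above: lazily wrap the input, emit each updated row. */
  def updateMode(input: Iterator[Long],
                 state: mutable.Map[Long, Long],
                 watermark: Long): Iterator[Row] =
    input
      .filter(_ > watermark) // drop late rows up front, otherwise they would be emitted
      .map { windowEnd =>
        val updated = state.getOrElse(windowEnd, 0L) + 1L
        state(windowEnd) = updated
        (windowEnd, updated)
      }
}
```

In this model, a late row passed to `appendMode` is added and evicted in the same call and never reaches the output, while `updateMode` would emit it immediately if the explicit filter were removed.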

SparkQA commented Dec 21, 2016

Test build #70429 has finished for PR 16360 at commit 9b858d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 21, 2016

Test build #70430 has finished for PR 16360 at commit b764a94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 21, 2016

Test build #70441 has finished for PR 16360 at commit 8887340.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 21, 2016

Test build #70450 has finished for PR 16360 at commit 628c6c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus marmbrus left a comment

Generally LGTM!

@@ -15,7 +15,7 @@
  * limitations under the License.
  */

-package org.apache.spark.sql
+package org.apache.spark.sql.catalyst

Contributor

maybe in catalyst.streaming or something?

outputMode match {
case Append | Complete => // allowed
case Update =>
throw new AnalysisException("Update ouptut mode is not supported for memory format")

Contributor

"memory sink" for consistency with the above error

Contributor

also "output". Thinking a little more, I wonder if it would be better to say what is supported?
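
One way to read the suggestion above is an error message that names the supported modes alongside the rejected one. The sketch below is only an illustration of that idea, assuming the memory sink accepts Append and Complete as in the quoted snippet. `MemorySinkCheck`, its `Mode` type, and `validate` are hypothetical, and `IllegalArgumentException` stands in for Spark's `AnalysisException`.

```scala
object MemorySinkCheck {
  // Hypothetical stand-ins for the output modes; not Spark's classes.
  sealed trait Mode
  case object Append extends Mode
  case object Complete extends Mode
  case object Update extends Mode

  // Assumed for illustration: the modes the memory sink accepts.
  private val supported: Seq[Mode] = Seq(Append, Complete)

  /** Throws if the mode is unsupported, naming the supported modes in the message. */
  def validate(mode: Mode): Unit =
    if (!supported.contains(mode)) {
      throw new IllegalArgumentException(
        s"$mode output mode is not supported for memory sink; " +
          s"supported output modes are: ${supported.mkString(", ")}")
    }
}
```

Building the message from the `supported` list keeps the text accurate if the set of accepted modes changes later.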

SparkQA commented Dec 22, 2016

Test build #70492 has finished for PR 16360 at commit 663f73a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TaskContext(object):

@asfgit asfgit closed this in 83a6ace Dec 22, 2016
asfgit pushed a commit that referenced this pull request Dec 22, 2016
## What changes were proposed in this pull request?

Made update mode public. As part of that here are the changes.
- Update DataStreamWriter to accept "update"
- Changed package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
- Added update mode state removing with watermark to StateStoreSaveExec

## How was this patch tested?

Added new tests in changed modules

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16360 from tdas/SPARK-18234.

(cherry picked from commit 83a6ace)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017