[DOCS] Added important updateStateByKey details #7229

mvogiatzis · 2015-07-05T21:52:43Z

Runs for *all* existing keys and returning "None" will remove the key-value pair.

srowen · 2015-07-06T00:45:19Z

It's kind of implied by the docs above, since it must run on the current state of every key. This doesn't look like the right place; the doc of this method is above.

mvogiatzis · 2015-07-06T08:38:01Z

The documentation above states: "This can be used to maintain arbitrary state data for each key".

I would expect that each key means each incoming key in the batch, as the absence of new values for a key would mean no change of the existing state (e.g. for better performance). I found out the hard way (stackoverflow, mailing list, local testing).

I can move the description into the method doc above (although it feels a bit lengthy), or better, into the code, but I feel this information is necessary.
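To make the behavior under discussion concrete, here is a minimal pure-Python sketch of the per-batch semantics. It is not Spark's implementation: `simulate_update_state_by_key` is a hypothetical helper written only for illustration, and `update_function` mirrors the shape of the update function in the guide's Python word count example (new values first, previous state second).

```python
from collections import defaultdict


def simulate_update_state_by_key(state, batch, update_func):
    """Hypothetical stand-in for one batch of updateStateByKey.

    The update function is applied to every key that already has state,
    plus every key that appears in the batch. Keys for which it returns
    None are dropped from the resulting state.
    """
    new_values = defaultdict(list)
    for key, value in batch:
        new_values[key].append(value)

    next_state = {}
    for key in set(state) | set(new_values):
        # Keys with no new data still get a call, with an empty list.
        updated = update_func(new_values.get(key, []), state.get(key))
        if updated is not None:
            next_state[key] = updated
    return next_state


def update_function(new_values, running_count):
    # running_count is None the first time a key is seen.
    return sum(new_values) + (running_count or 0)


state = {}
state = simulate_update_state_by_key(
    state, [("a", 1), ("b", 1), ("a", 1)], update_function)
# Second batch has no data for "a", yet "a" is still updated (unchanged).
state = simulate_update_state_by_key(state, [("b", 1)], update_function)
print(state)
```

In the second batch the key `"a"` has no new values, but the update function is still called for it with an empty list, which is the point the proposed doc text is trying to make explicit.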

srowen · 2015-07-06T08:46:32Z

No, I mean farther up in "UpdateStateByKey Operation" where the operation is explained, rather than putting this as a coda at the end after the examples. You can weave a comment into the examples too, I suppose, to clarify which keys are getting updated. I personally assumed the current behavior, since otherwise there is no way for a key's value to change unless a new value arrives, and that can't cover all use cases. Clarification never hurts, though.

mvogiatzis · 2015-07-06T09:40:28Z

I agree, this is better. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by updateStateByKey until they return None"

srowen · 2015-07-08T10:15:00Z

docs/streaming-programming-guide.md

@@ -928,7 +930,7 @@ runningCounts = pairs.updateStateByKey(updateFunction)
 The update function will be called for each word, with `newValues` having a sequence of 1's (from
 the `(word, 1)` pairs) and the `runningCount` having the previous count. For the complete
 Python code, take a look at the example
-[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py).
+[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py). 


You ended up with a stray space here, but if you don't have a moment to zap that today, it's no big deal, I'll merge. The text looks OK.

Oops, removed.

SparkQA · 2015-07-08T13:45:46Z

Test build #1011 has finished for PR 7229 at commit c2656f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-07-08T15:09:54Z

docs/streaming-programming-guide.md

@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
 1. Define the state update function - Specify with a function how to update the state using the
 previous state and the new values from an input stream.

+Spark will run the `updateStateByKey` operation for all existing keys, regardless of whether they have new data in a batch or not. If `updateStateByKey` returns None then the key-value pair will be eliminated.


Sorry to go around one more time on this, but I think the text should be sharpened slightly. "operation" might be better as "update function", and it's a question of what the update function returns, not updateStateByKey. (None can be code font too.)

Agreed. Please update.

andrewor14 · 2015-07-08T18:24:59Z

@tdas

Replaced operation with update function and None with code-font None

mvogiatzis · 2015-07-09T08:31:49Z

Is this OK now, or should I take off the ~~updateStateByKey~~ update function?

srowen · 2015-07-09T09:50:53Z

LGTM

tdas · 2015-07-09T22:28:49Z

docs/streaming-programming-guide.md

@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
 1. Define the state update function - Specify with a function how to update the state using the
 previous state and the new values from an input stream.

+Spark will run the `updateStateByKey` update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns `None` then the key-value pair will be eliminated.


This whole section is about updateStateByKey, so saying it again here is superfluous. Just "run the update function". Also, I would clarify further: "In every batch, Spark will apply the update function for all... ". It wasn't clear whether it was for every batch or overall.
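One use case that depends on the update function running in every batch is expiring idle keys. The following is a sketch only, not code from this PR or from the guide: `count_with_expiry` and `MAX_IDLE_BATCHES` are hypothetical names, and the state is assumed to be a `(count, idle_batches)` tuple. A function of this shape could be passed to `updateStateByKey` in PySpark, where the previous state is `None` for a newly seen key.

```python
# Hypothetical threshold: drop a key after this many batches without new data.
MAX_IDLE_BATCHES = 2


def count_with_expiry(new_values, state):
    """Update function sketch: running count that expires idle keys.

    state is assumed to be a (count, idle_batches) tuple, or None when the
    key is seen for the first time. Returning None removes the key.
    """
    count, idle = state if state is not None else (0, 0)
    if new_values:
        # New data arrived in this batch: add it and reset the idle counter.
        return (count + sum(new_values), 0)
    idle += 1
    if idle >= MAX_IDLE_BATCHES:
        # No data for too long: returning None eliminates the key-value pair.
        return None
    return (count, idle)
```

Because the function is applied to existing keys even when a batch carries no data for them, the idle counter can advance and the key eventually disappears from the state.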

Runs for *all* existing keys and returning "None" will remove the key-value pair.

Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>

Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:

e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
00283ed [Michael Vogiatzis] Removed space
c2656f9 [Michael Vogiatzis] Moved description farther up
0a42551 [Michael Vogiatzis] Added important updateStateByKey details

(cherry picked from commit d538919)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

tdas · 2015-07-10T02:55:50Z

No worries, I have made the change while merging it. Thanks @mvogiatzis, merged it to master and 1.4.

Added important updateStateByKey details

0a42551

Runs for *all* existing keys and returning "None" will remove the key-value pair.

mvogiatzis changed the title ~~Added important updateStateByKey details~~ [DOCS] Added important updateStateByKey details Jul 5, 2015

Moved description farther up

c2656f9

Moved the update description up, before the example. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by updateStateByKey until they return None"

srowen reviewed Jul 8, 2015

Removed space

00283ed

srowen reviewed Jul 8, 2015

Updated updateStateByKey text

e7a2946

Replaced operation with update function and None with code-font None

tdas reviewed Jul 9, 2015

asfgit closed this in d538919 Jul 10, 2015
