[DOCS] Added important updateStateByKey details #7229
Runs for *all* existing keys and returning "None" will remove the key-value pair.
It's kind of implied by the docs above, since it must run on the current state of every key. This doesn't look like the right place; the doc of this method is above.
The documentation above states: "This can be used to maintain arbitrary state data for each key". I would expect "each key" to mean each incoming key in the batch, since the absence of new values for a key would mean no change to the existing state (e.g. for better performance). I found out the hard way (Stack Overflow, mailing list, local testing). I can move the description into the method doc above (although it feels a bit lengthy) or, better, into the code, but I feel this information is necessary.
No, I mean farther up in "UpdateStateByKey Operation" where the operation is explained, rather than putting this as a coda at the end after the examples. You can weave a comment into the examples too, I suppose, to clarify which keys are getting updated. I personally assumed the current behavior, since otherwise there's no way a key's value changes unless a new value arrives, and that can't cover all use cases. Clarification never hurts, though.
Moved the update description up, before the example. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by updateStateByKey until they return None"
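To make the suggested example comment concrete, here is a minimal sketch of an update function in the style of the guide's Python word count. The name `update_function_dropping_idle_words` and the policy of dropping a word as soon as a batch brings no new occurrences are assumptions chosen for illustration; they are not part of the guide's example.

```python
def update_function_dropping_idle_words(new_values, running_count):
    # The function is called for every word that already has state, so
    # new_values can be an empty list when the word is absent from a batch.
    if not new_values:
        # Returning None removes the (word, count) pair from the state.
        return None
    # running_count is None the first time a word is seen.
    return sum(new_values) + (running_count or 0)
```

With this policy a word's count restarts from zero if it reappears later, which is rarely what a running count wants; the point is only that the empty-`new_values` call happens and that `None` ends the key's state.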
I agree, this is better. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by |
@@ -928,7 +930,7 @@ runningCounts = pairs.updateStateByKey(updateFunction)
 The update function will be called for each word, with `newValues` having a sequence of 1's (from
 the `(word, 1)` pairs) and the `runningCount` having the previous count. For the complete
 Python code, take a look at the example
-[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py).
+[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py).
You ended up with a stray space here, but if you don't have a moment to zap that today, it's no big deal, I'll merge. The text looks OK
Oops, removed.
Test build #1011 has finished for PR 7229 at commit
@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
 1. Define the state update function - Specify with a function how to update the state using the
 previous state and the new values from an input stream.

+Spark will run the `updateStateByKey` operation for all existing keys, regardless of whether they have new data in a batch or not. If `updateStateByKey` returns None then the key-value pair will be eliminated.
Sorry to go around one more time on this, but I think the text should be sharpened slightly. "operation" might be better as "update function", and it's a question of what the update function returns, not `updateStateByKey`. (`None` can be code font too.)
Agreed. Please update.
Replaced operation with update function and None with code-font None
Is this ok now or should I take off the |
LGTM
@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
 1. Define the state update function - Specify with a function how to update the state using the
 previous state and the new values from an input stream.

+Spark will run the `updateStateByKey` update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns `None` then the key-value pair will be eliminated.
This whole section is about `updateStateByKey`, so saying it again here is superfluous. Just "run the update function". Also I would clarify further: "In every batch, Spark will apply the update function for all... ". It wasn't clear whether it was for every batch or overall.
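The per-batch behavior described here can be sketched without a cluster. The helper `apply_batch` below is a hypothetical plain-Python stand-in for one batch of `updateStateByKey`, not Spark's implementation. It only mirrors the two rules under discussion: in every batch the update function is applied to every key that has state or new data, and a `None` result drops the key.

```python
def apply_batch(state, batch, update_func):
    """Hypothetical model of one batch: `state` is a dict, `batch` a list of (key, value)."""
    new_values_by_key = {}
    for key, value in batch:
        new_values_by_key.setdefault(key, []).append(value)

    next_state = {}
    # Every key with existing state is visited, even if the batch has no data for it.
    # Sorting just makes the call order deterministic for this sketch.
    for key in sorted(set(state) | set(new_values_by_key)):
        result = update_func(new_values_by_key.get(key, []), state.get(key))
        if result is not None:  # a None result eliminates the key-value pair
            next_state[key] = result
    return next_state


calls = []  # records (new_values, previous_state) for each invocation


def counting_update(new_values, running_count):
    calls.append((list(new_values), running_count))
    return sum(new_values) + (running_count or 0)


state = apply_batch({}, [("a", 1), ("b", 1), ("a", 1)], counting_update)
calls.clear()
# Second batch: only "b" has new data, yet "a" is still passed to the function.
state = apply_batch(state, [("b", 1)], counting_update)
print(calls)
print(state)
```

In the second batch the recorded calls include one for "a" with an empty list of new values, which is the behavior the added sentence documents. Passing an update function that returns `None` for a key leaves that key out of the next state.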
Runs for *all* existing keys and returning "None" will remove the key-value pair.

Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>

Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:

e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
00283ed [Michael Vogiatzis] Removed space
c2656f9 [Michael Vogiatzis] Moved description farther up
0a42551 [Michael Vogiatzis] Added important updateStateByKey details

(cherry picked from commit d538919)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
No worries, I have made the change while merging it. Thanks @mvogiatzis, merged it to master and 1.4