Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS] Added important updateStateByKey details #7229

Closed
wants to merge 4 commits into from

Conversation

mvogiatzis
Copy link
Contributor

Runs for all existing keys and returning "None" will remove the key-value pair.

Runs for *all* existing keys and returning "None" will remove the key-value pair.
@mvogiatzis mvogiatzis changed the title Added important updateStateByKey details [DOCS] Added important updateStateByKey details Jul 5, 2015
@srowen
Copy link
Member

srowen commented Jul 6, 2015

It's kind of implied by the docs above, since it must run on the current state of every key. This doesn't look like the right place; the doc of this method is above.

@mvogiatzis
Copy link
Contributor Author

The documentation above states: "This can be used to maintain arbitrary state data for each key" .

I would expect that each key means each incoming key in the batch, as the absence of new values for a key would mean no change of the existing state (e.g. for better performance). I found out the hard way (stackoverflow, mailing list, local testing).

I can move the description in the method doc above (although feels a big lengthy) or better in code, but I feel this information is necessary.

@srowen
Copy link
Member

srowen commented Jul 6, 2015

No I mean farther up in "UpdateStateByKey Operation" where the operation is explained, rather than putting this as a coda at the end after examples. You can weave a comment into the examples too I suppose, to clarify what keys are getting updated. I personally assumed the current behavior since there's otherwise no way key's values change unless a new value arrives, and that can't cover all use cases. Clarification never hurt though.

Moved the update description up, before the example. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by updateStateByKey until they return None"
@mvogiatzis
Copy link
Contributor Author

I agree, this is better. Maybe it's worth mentioning in the example that "Existing words with no new values will also be called by updateStateByKey until they return None"

@@ -928,7 +930,7 @@ runningCounts = pairs.updateStateByKey(updateFunction)
The update function will be called for each word, with `newValues` having a sequence of 1's (from
the `(word, 1)` pairs) and the `runningCount` having the previous count. For the complete
Python code, take a look at the example
[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py).
[stateful_network_wordcount.py]({{site.SPARK_GITHUB_URL}}/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You ended up with a stray space here, but if you don't have a moment to zap that today, it's no big deal, I'll merge. The text looks OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, removed.

@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #1011 has finished for PR 7229 at commit c2656f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
1. Define the state update function - Specify with a function how to update the state using the
previous state and the new values from an input stream.

Spark will run the `updateStateByKey` operation for all existing keys, regardless of whether they have new data in a batch or not. If `updateStateByKey` returns None then the key-value pair will be eliminated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to go around one more time on this, but I think the text should be sharpened slightly. "operation" might be better as "update function", and it's a question of what the update function returns, not updateStateByKey. (None can be code font too.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Please update.

@andrewor14
Copy link
Contributor

@tdas

Replaced operation with update function and None with code-font None
@mvogiatzis
Copy link
Contributor Author

Is this ok now or should I take off the updateStateByKey update function ?

@srowen
Copy link
Member

srowen commented Jul 9, 2015

LGTM

@@ -854,6 +854,8 @@ it with new information. To use this, you will have to do two steps.
1. Define the state update function - Specify with a function how to update the state using the
previous state and the new values from an input stream.

Spark will run the `updateStateByKey` update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns `None` then the key-value pair will be eliminated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole section is about updateStateByKey so saying it again here is superfluous. Just "run the update function". Also I would clarify further: "In every batch, Spark will apply the update function for all... ". It wasnt clear that whether it was for every batch or overall.

asfgit pushed a commit that referenced this pull request Jul 10, 2015
Runs for *all* existing keys and returning "None" will remove the key-value pair.

Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>

Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:

e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
00283ed [Michael Vogiatzis] Removed space
c2656f9 [Michael Vogiatzis] Moved description farther up
0a42551 [Michael Vogiatzis] Added important updateStateByKey details

(cherry picked from commit d538919)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
@tdas
Copy link
Contributor

tdas commented Jul 10, 2015

No worries, I have the made the change while merging it. Thanks @mvogiatzis, merged it to master and 1.4

@asfgit asfgit closed this in d538919 Jul 10, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants