Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-2039: backpressure refactoring in worker and executor #1627

Conversation

abellina
Copy link
Contributor

@abellina abellina commented Aug 15, 2016

For 1.x branch, will submit a PR for 2.x branch.

@zhuoliu worked on the initial refactor. I added some due code review comments.

@abellina abellina closed this Aug 15, 2016
@abellina abellina reopened this Aug 15, 2016
@@ -743,7 +736,7 @@
(fn [& args]
(check-credentials-changed)
(if ((:storm-conf worker) TOPOLOGY-BACKPRESSURE-ENABLE)
(check-throttle-changed))))
(topology-backpressure-callback))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like the wrong place for this.
For some reason the backpressure is tacked onto the credentials timer?

I know this was already the way it worked, but maybe there's a better place for it now? I don't think it's good that the value of the task.credentials.poll.secs config affects the frequency of backpressure polling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, agreed. I'll add another timer for this.

@knusbaum
Copy link
Contributor

+1 for the changes. My comments are suggestions for a couple more.
The travis failures look unrelated. Storm Core passes.

@abellina abellina force-pushed the STORM-2039_backpressure_refactoring_in_worker_and_executor branch from a4fa289 to 24990d3 Compare August 17, 2016 03:58
@abellina
Copy link
Contributor Author

@knusbaum I made the changes suggested. Also removed an extra fn wrapper on that function triggered by the timer.

@knusbaum
Copy link
Contributor

+1

@zhuoliu
Copy link

zhuoliu commented Aug 19, 2016

@abellina looks good to me, just one extra suggestion:
In
https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L156
To reduce the traffic to ZK, we checked the previous flag to update to Zookeeper only when it is necessary (the flag has changed), but still there may some weird case we have inconsistency between local "prev-backpressure-flag" and the worker backpressure node at Zookeeper (see STORM-1949).
(One such weird case I would think of is: the worker-backpressure call does not update the zk flag successfully but no exception was caught and the local flag updating still succeeds)

So to strike a tradeoff and avoid the possible wrong stall of a topology, could you add do a random number generation here, e.g., generate a number between 0-9, if it is zero, we will call the try block no matter "prev-backpressure-flag" and "curr-backpressure-flag" is equal or not.
This will make sure the local flag and remote flag is synchronized and the ZK traffic would be once per second per worker. (the cyclic function is called every 100ms). Or one every 5-10 second is still acceptable since this is only needed rarely.
Using (currentTimeStampMs / 1000 % intervalInSeconds == 0) as the update condition might make more sense than using random number. For intervalInSeconds, we can use 5 or 10. Most of the ZK traffic would be read. Assume ZK (3 nodes) can sustain ~50K req/second. 5% of its capacity can satisfy 25K workers if setting the intervalInSeconds as 10.

Thanks a lot!

@abellina
Copy link
Contributor Author

@zhuoliu thanks for the comment. I am taking a look at STORM-1949 to try and reproduce it, and will comment there. I think that is a separate issue from this PR so we should see a potential fix to STORM-1949 in a different PR. Are you ok with this going in as is?

@zhuoliu
Copy link

zhuoliu commented Aug 20, 2016

sure, Alex. Really appreciate if you can submit a separate PR for 1949.
+1 to this one!

@zhuoliu
Copy link

zhuoliu commented Aug 24, 2016

Since with the refactoring we directly access the disruptor queue to get the backpressure status rather than having an additional flag in executor data which is error-prone when flipping (which is very possibly the root cause of the halt in STORM-1949), we may NOT need to add the ADDITIONAL ZK pressure which I mentioned in the above.

@HeartSaVioR
Copy link
Contributor

+1

@asfgit asfgit merged commit 050e68d into apache:1.x-branch Aug 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants