STORM-433: Give users visibility to the depth of queues at each bolt #236
Conversation
(let [ret (async-loop
            (fn [] (consume-batch-when-available queue handler) 0)
            (fn []
This touches the critical path. We can't add this until we quantify its performance cost.
-1. Since this touches the critical path, it needs performance testing to measure the before-and-after impact before we can even consider merging it.
I see your point. This probably doesn't belong on the critical path anyway, so I'll move it off.
Yeah, if you can find a thread-safe way to do that, that's ideal. One way is to place a special event on the disruptor queue; when the read thread sees that special event, it can write the stat out somewhere (similar to how the INTERRUPT event works in DisruptorQueue.java).
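The sentinel-event pattern suggested above can be sketched as follows. This is a minimal illustration, not Storm's actual implementation: the names (`QueueStatSketch`, `FLUSH_STATS`, `consumeOne`) are hypothetical, and an `ArrayBlockingQueue` stands in for the disruptor ring buffer. The key idea is that only the consumer thread touches the stat, so no extra locking lands on the hot path.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of the marker-event idea: a sentinel "flush stats" event is
// handled on the consumer thread, analogous to how DisruptorQueue handles its
// INTERRUPT marker. Names and structure are illustrative, not Storm's.
public class QueueStatSketch {
    private static final Object FLUSH_STATS = new Object(); // sentinel event

    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
    private final AtomicLong lastObservedDepth = new AtomicLong();

    // Called from any thread (e.g. a heartbeat thread): asks the consumer
    // to publish the queue depth without racing on consumer-side state.
    public void requestDepthSnapshot() {
        queue.offer(FLUSH_STATS);
    }

    public void publish(Object tuple) {
        queue.offer(tuple);
    }

    // Consumer loop step (non-blocking here for simplicity; real code would
    // block). Only this thread handles the sentinel, so writing the stat
    // needs no synchronization beyond the atomic publish.
    public Object consumeOne() {
        Object event = queue.poll();
        if (event == FLUSH_STATS) {
            lastObservedDepth.set(queue.size()); // write the stat out
            return null;
        }
        return event; // hand off to the real handler
    }

    public long lastDepth() {
        return lastObservedDepth.get();
    }
}
```

Because the snapshot request rides through the queue itself, the reading is taken at a well-defined point in the event stream rather than mid-batch.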
Alright, I've moved the updates into the heartbeat thread.
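One way to keep the hot path this cheap is to have the executor bump plain counters and let the heartbeat thread derive the depth by subtraction. The sketch below is an assumption about the approach, not Storm's code; the class and method names (`DepthCounters`, `sampleDepth`) are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of the "off the critical path" approach: the hot loop only
// increments two counters, and the heartbeat thread computes the depth as
// published - consumed. A slightly stale reading is fine for a UI metric.
public class DepthCounters {
    private final AtomicLong published = new AtomicLong();
    private final AtomicLong consumed = new AtomicLong();

    public void onPublish() { published.incrementAndGet(); }
    public void onConsume() { consumed.incrementAndGet(); }

    // Called periodically from the heartbeat thread. The two reads are not
    // atomic together, so clamp to zero in case consume is counted first.
    public long sampleDepth() {
        return Math.max(0, published.get() - consumed.get());
    }
}
```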
@@ -133,11 +133,11 @@
    </script>
    <script id="bolt-executor-template" type="text/html">
      <h2>Executors</h2>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." class="tip above">Process latency (ms)</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." 
class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." 
class="tip above">Process latency (ms)</span></th><th class="header"><span class="tip right" title="The length of the executor's queue.">Queue Length</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
Can you split this line up? Everything looks fine, but it is so long that it is really hard to tell what is happening on this line.
Yeah, I'd like to reformat these templates in a few places; there are quite a few spots like this in the HTML. I can fix this one and file a separate JIRA for the rest.
OK, file a separate JIRA and then I am +1. But @nathanmarz was also reviewing this, so I am going to wait for a +1 from him before committing.
Pull request and JIRA are up: #243
A few things:
@nathanmarz, @revans2, @d2r
Yes, I think this would be a tremendous help to users while debugging "slowness" issues.
Conflicts:
    storm-core/src/clj/backtype/storm/daemon/executor.clj
    storm-core/src/clj/backtype/storm/ui/core.clj
    storm-core/src/ui/public/templates/component-page-template.html
Regenerating Thrift code. Ready for review.
This is ready for another look.
Hi @knusbaum, it has been a while, and it looks like this needs an up-merge again. It might also be good to remove changes to the Thrift-generated code that amount to nothing more than timestamp changes.
@knusbaum: I see you closed this. Are you planning to open a new PR for publishing these stats? It would be a big help to us. We often wonder about various behaviors we witness in our users' topologies, and the only mechanism we have right now for getting visibility into queue depths is to get every topology owner to use a custom metrics consumer to publish the metrics, on top of which we would then need to provide fancy aggregation. Having this in the Storm UI, and also in the API stats, would be very helpful.
I am preparing a patch for publishing the in-backlog (receive-queue) and out-backlog (send-queue) metrics. I am able to see the average value of these metrics in the UI over each time-window period, but only in the executors section (no aggregation at the component level, etc.). Users may also be interested in the instantaneous value of these metrics, and I don't know how I will fit that into the UI. Any suggestions are welcome.
@abhishekagarwal87: awesome news! Any way you can post a screenshot of the UI you are currently proposing? At least please include one with the PR when you send it; that could help others brainstorm how to present such values in the UI. Or maybe you're instead asking for suggestions on how to obtain the instantaneous values?
I will do that, Eric. I am using https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/metric/internal/LatencyStatAndMetric.java to store the windowed values. It is easy to add the instantaneous values to the result map, so that is not a problem. I will put up a screenshot and the PR soon; maybe that will clear up the confusion.
@abhishekagarwal87 I am excited to see your screenshot as well. Also, I am curious as to why you chose LatencyStatAndMetric.java over CountStatAndMetric.java?
Because we need the average queue length over each time window, and if you look at LatencyStatAndMetric, it does exactly that. I record the queue length on each operation and return the average values with the executor heartbeat.
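The record-then-average behavior described above can be sketched as below. This is a simplified stand-in for LatencyStatAndMetric, not Storm's class: the name `WindowedQueueLength` and the single-window reset semantics are assumptions for illustration (the real class maintains multiple rotating time windows).

```java
// Hedged sketch of windowed averaging: record the queue length at each
// consume, report the mean once per reporting window. Mirrors the
// record/get pattern of LatencyStatAndMetric in miniature.
public class WindowedQueueLength {
    private long sum = 0;
    private long count = 0;

    // Called on each operation with the observed queue length.
    public synchronized void record(long queueLength) {
        sum += queueLength;
        count++;
    }

    // Returns the average for the window and resets, as a heartbeat
    // rollup would; returns 0.0 when nothing was recorded.
    public synchronized double getAndReset() {
        double avg = (count == 0) ? 0.0 : (double) sum / count;
        sum = 0;
        count = 0;
        return avg;
    }
}
```

Averaging is the reason a latency-style stat fits better than a count-style one here: a count would only tell you how many samples were taken, not how deep the queue was.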
Guys, follow-up discussion is happening on #1406. Can you give me your feedback there?
This pull request adds a column to the executors table on the components page showing the average length of the tuple queue when the executor consumes a chunk.