
STORM-433: Give users visibility to the depth of queues at each bolt #236

Closed
wants to merge 13 commits

Conversation

knusbaum
Contributor

This pull request adds a column to the executors table on the component page showing the average length of the executor's tuple queue, sampled each time the executor consumes a chunk.
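For context, here is a minimal sketch of the idea (hypothetical Java names, not the actual patch, which lives in Storm's Clojure executor code): sample the queue's population each time a batch is consumed and keep a running average for the UI to display.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sampler illustrating the stat this PR exposes: the average
// queue population observed each time a batch is consumed.
public class QueueDepthSampler {
    private final AtomicLong totalObserved = new AtomicLong();
    private final AtomicLong samples = new AtomicLong();

    // Called once per consumed batch with the queue's current population.
    public void onConsume(long queuePopulation) {
        totalObserved.addAndGet(queuePopulation);
        samples.incrementAndGet();
    }

    // Average queue length over all samples so far (0 if nothing sampled yet).
    public double averageQueueLength() {
        long n = samples.get();
        return n == 0 ? 0.0 : (double) totalObserved.get() / n;
    }
}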

(let [ret (async-loop
            (fn [] (consume-batch-when-available queue handler) 0)
            (fn []
Contributor

This touches the critical path. We can't add this until we've quantified its performance cost.

@nathanmarz
Contributor

-1

As this touches the critical path, it needs performance testing to measure the before-and-after impact before we can even consider merging it in.

@knusbaum
Contributor Author

I see your point. This probably doesn't belong on the critical path anyway, so I'll move it off.

@nathanmarz
Contributor

Yea, if you can find a way to do that that's thread-safe, that's ideal. One way to do that is to place a special event on the disruptor queue; when the read thread sees that special event, it can write the stat out somewhere (this would be similar to how the INTERRUPT event works in DisruptorQueue.java).
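For illustration only, a rough sketch of the marker-event pattern being suggested, using a plain BlockingQueue and hypothetical names rather than Storm's actual DisruptorQueue internals:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a special marker object is enqueued; when the single consumer
// thread dequeues it, that thread (and only that thread) publishes the stat,
// so no extra synchronization is needed on the hot path.
public class MarkerEventSketch {
    private static final Object FLUSH_STATS_MARKER = new Object();

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final AtomicLong lastObservedDepth = new AtomicLong();

    // Any thread (e.g. a heartbeat timer) may request a stats flush.
    public void requestStatsFlush() {
        queue.offer(FLUSH_STATS_MARKER);
    }

    // Runs on the single consumer thread.
    public void consumeLoop() throws InterruptedException {
        while (true) {
            Object event = queue.take();
            if (event == FLUSH_STATS_MARKER) {
                // The consumer thread owns the queue state, so reading the depth here is safe.
                lastObservedDepth.set(queue.size());
            } else {
                handle(event);
            }
        }
    }

    private void handle(Object tuple) { /* process the tuple */ }
}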

@knusbaum
Contributor Author

Alright, I've moved the updates into the heartbeat thread.

@knusbaum knusbaum changed the title Storm-433: Give users visibility to the depth of queues at each bolt STORM-433: Give users visibility to the depth of queues at each bolt Aug 21, 2014
@@ -133,11 +133,11 @@
</script>
<script id="bolt-executor-template" type="text/html">
<h2>Executors</h2>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." class="tip above">Process latency (ms)</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." class="tip above">Process latency (ms)</span></th><th class="header"><span class="tip right" title="The length of the executor's queue.">Queue Length</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
Contributor

Can you split this line up? Everything looks fine, but it is so long it is really hard to tell what is happening with this line.

Contributor Author

Yeah, I'd like to reformat these templates in a few places. There are quite a few spots like this in the HTML. I can fix this and file a separate JIRA for that.

Contributor

OK, file a separate JIRA and then I am +1, but @nathanmarz was also reviewing this, so I am going to wait for a +1 from him before committing this.

Contributor Author

Pull request and JIRA are up: #243

@nathanmarz
Contributor

A few things:

  1. I made a comment about inserting a type hint to avoid reflection
  2. To be clear, the only queue you're monitoring here is the receive queue for executors?
  3. I need to be convinced more that the interactions with the DisruptorQueue are thread-safe. This is critical code, so we have to be extra careful.

@knusbaum
Contributor Author

knusbaum commented Feb 5, 2015

@nathanmarz, @revans2, @d2r
Do we still want this? If so I'll upmerge and fix all the problems:

  1. easy, I'll take care of this.
  2. I can also monitor the batch-transfer-queue.
  3. This one's trickier. It depends on what you mean by "thread-safe." They are certainly thread-safe in the sense that nothing will get corrupted. I've read the disruptor code for 2.10.1, and the additional code only ever does reads. It's not thread-safe in the sense that the queue population values may be slightly off if a write happens concurrently, but it won't corrupt any values (see the sketch below).
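To make point 3 concrete, here is a simplified analogue (not the actual DisruptorQueue code) of what a read-only population check looks like: the monitoring code only reads the producer and consumer sequences, so a concurrent write can make the reported depth slightly stale but cannot corrupt any state.

import java.util.concurrent.atomic.AtomicLong;

// Simplified analogue of a ring-buffer queue exposing an approximate population.
public class ApproxPopulationQueue {
    private final AtomicLong writeSequence = new AtomicLong(); // advanced by producers
    private final AtomicLong readSequence = new AtomicLong();  // advanced by the consumer

    public void onPublish() { writeSequence.incrementAndGet(); }
    public void onConsume() { readSequence.incrementAndGet(); }

    // Read-only: safe to call from a monitoring thread. If a publish happens
    // between the two reads, the result may be off by a small amount, but no
    // queue state is ever modified or corrupted by this call.
    public long approximatePopulation() {
        return writeSequence.get() - readSequence.get();
    }
}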

@d2r

d2r commented Feb 5, 2015

Do we still want this?

Yes, I think this would be a tremendous help to users while debugging "slowness" issues.

Kyle Nusbaum added 2 commits February 5, 2015 10:17
Conflicts:
	storm-core/src/clj/backtype/storm/daemon/executor.clj
	storm-core/src/clj/backtype/storm/ui/core.clj
	storm-core/src/ui/public/templates/component-page-template.html
Regenerating Thrift code.

Ready for review.
@knusbaum
Contributor Author

knusbaum commented Feb 6, 2015

This is ready for another look.

@d2r

d2r commented Jun 24, 2015

Hi @knusbaum, it has been a while, and it looks like this needs an up-merge again. It might also be good to remove changes to the Thrift-generated code that result only in timestamp changes.

@knusbaum knusbaum closed this Mar 7, 2016
@erikdw
Contributor

erikdw commented May 4, 2016

@knusbaum: I see you closed this. Are you planning to open a new PR for publishing these stats? It would be a big help to us. We often wonder about various behaviors we see in our users' topologies, and the only mechanism we have right now for getting visibility into the queue depths seems to be getting every topology owner to use a custom metrics consumer to publish the metrics, which we would then need to provide fancy aggregation on top of. Having it in the Storm UI and also in the API stats would be very, very helpful.
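For reference, the workaround being described is for each topology owner to register a metrics consumer in the topology config, roughly along these lines (using Storm's built-in LoggingMetricsConsumer as an example; the choice of consumer and parallelism is up to the topology owner):

import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class RegisterMetricsConsumerExample {
    public static Config buildConf() {
        Config conf = new Config();
        // Each topology has to opt in; the consumer then receives the built-in
        // metrics, which still need to be aggregated externally.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}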

@abhishekagarwal87
Contributor

abhishekagarwal87 commented May 5, 2016

I am preparing a patch for publishing in-backlog (receive-queue) and out-backlog (send-queue) metrics. I am able to see the average value of these metrics in the UI over each time window, but only in the executors section (no aggregation at the component level, etc.). Also, users may be interested in the instantaneous value of these metrics, and I don't know how I will fit that into the UI. Any suggestions are welcome.

@erikdw
Contributor

erikdw commented May 5, 2016

@abhishekagarwal87: awesome news!

Any way you can post a screenshot of the UI you are currently proposing? At least please do so with the PR when you send it. That could help others to brainstorm how to put such values into the UI. Maybe you're instead asking for suggestions on how to handle obtaining the instantaneous values?

@abhishekagarwal87
Contributor

I will do that, Eric. I am using https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/metric/internal/LatencyStatAndMetric.java to store the windowed values. It is easy to add the instantaneous values to the result map, so that is not a problem. I will put up a screenshot and the PR soon. Maybe that will clear up the confusion.

@dsKarthick

dsKarthick commented May 5, 2016

@abhishekagarwal87 I am excited to see your screenshot as well. Also, I am curious as to why you chose to use LatencyStatAndMetric.java over CountStatAndMetric.java?

@abhishekagarwal87
Contributor

Because we need the average queue length over each time window. If you look at LatencyStatAndMetric, it does exactly that. I record the queue length on each operation and return the average values with the executor heartbeat.
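A simplified sketch of that approach (not LatencyStatAndMetric's actual API, just the same bucketed sum-and-count idea): record the queue length on each operation and report the per-window average with the heartbeat.

// Simplified analogue of a time-windowed average: keep a sum and a count per
// bucket, roll buckets as time passes, and report sum/count for the window.
public class WindowedAverage {
    private final long bucketMillis;
    private final long[] sums;
    private final long[] counts;
    private long currentBucketStart;
    private int currentIndex;

    public WindowedAverage(int numBuckets, long bucketMillis) {
        this.bucketMillis = bucketMillis;
        this.sums = new long[numBuckets];
        this.counts = new long[numBuckets];
        this.currentBucketStart = System.currentTimeMillis();
    }

    // Called on each queue operation with the current queue length.
    public synchronized void record(long queueLength) {
        rotateIfNeeded();
        sums[currentIndex] += queueLength;
        counts[currentIndex] += 1;
    }

    // Average over all buckets in the window (e.g. reported with the heartbeat).
    public synchronized double windowAverage() {
        rotateIfNeeded();
        long sum = 0, count = 0;
        for (int i = 0; i < sums.length; i++) {
            sum += sums[i];
            count += counts[i];
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    // Advance to a fresh bucket whenever the current bucket's time slice has elapsed.
    private void rotateIfNeeded() {
        long now = System.currentTimeMillis();
        while (now - currentBucketStart >= bucketMillis) {
            currentIndex = (currentIndex + 1) % sums.length;
            sums[currentIndex] = 0;
            counts[currentIndex] = 0;
            currentBucketStart += bucketMillis;
        }
    }
}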

@abhishekagarwal87
Contributor

Guys, follow-up discussion is happening on #1406. Can you let me know your feedback there?
