
STORM-433: Give users visibility to the depth of queues at each bolt #236

Closed
wants to merge 13 commits

Conversation

knusbaum
Contributor

This pull request adds a column to the executors table on the component page showing the average length of the executor's tuple queue, sampled each time the executor consumes a chunk.
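For context, here is a minimal sketch of the idea (hypothetical Java names, not the actual patch, which lives in Storm's Clojure executor code): sample the queue's population each time a batch is consumed and keep a running average for the UI to display.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sampler illustrating the stat this PR exposes: the average
// queue population observed each time a batch is consumed.
public class QueueDepthSampler {
    private final AtomicLong totalObserved = new AtomicLong();
    private final AtomicLong samples = new AtomicLong();

    // Called once per consumed batch with the queue's current population.
    public void onConsume(long queuePopulation) {
        totalObserved.addAndGet(queuePopulation);
        samples.incrementAndGet();
    }

    // Average queue length over all samples so far (0 if nothing sampled yet).
    public double averageQueueLength() {
        long n = samples.get();
        return n == 0 ? 0.0 : (double) totalObserved.get() / n;
    }
}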

(let [ret (async-loop
            (fn [] (consume-batch-when-available queue handler) 0)
            (fn []
Contributor

This touches the critical path. We can't add this until we've quantified its performance cost.

@nathanmarz
Contributor

-1

As this touches the critical path, it needs performance testing to measure the before-and-after impact before we can even consider merging it in.

@knusbaum
Contributor Author

I see your point. This probably doesn't belong on the critical path anyway, so I'll move it off.

@nathanmarz
Contributor

Yea, if you can find a way to do that that's thread-safe, that's ideal. One way to do that is to place a special event on the disruptor queue; when the read thread sees that special event, it can write the stat out somewhere (this would be similar to how the INTERRUPT event works in DisruptorQueue.java).
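For illustration only, a rough sketch of the marker-event pattern being suggested, using a plain BlockingQueue and hypothetical names rather than Storm's actual DisruptorQueue internals:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a special marker object is enqueued; when the single consumer
// thread dequeues it, that thread (and only that thread) publishes the stat,
// so no extra synchronization is needed on the hot path.
public class MarkerEventSketch {
    private static final Object FLUSH_STATS_MARKER = new Object();

    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final AtomicLong lastObservedDepth = new AtomicLong();

    // Any thread (e.g. a heartbeat timer) may request a stats flush.
    public void requestStatsFlush() {
        queue.offer(FLUSH_STATS_MARKER);
    }

    // Runs on the single consumer thread.
    public void consumeLoop() throws InterruptedException {
        while (true) {
            Object event = queue.take();
            if (event == FLUSH_STATS_MARKER) {
                // The consumer thread owns the queue state, so reading the depth here is safe.
                lastObservedDepth.set(queue.size());
            } else {
                handle(event);
            }
        }
    }

    private void handle(Object tuple) { /* process the tuple */ }
}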

@knusbaum
Contributor Author

Alright, I've moved the updates into the heartbeat thread.

@knusbaum knusbaum changed the title Storm-433: Give users visibility to the depth of queues at each bolt STORM-433: Give users visibility to the depth of queues at each bolt Aug 21, 2014
@@ -133,11 +133,11 @@
</script>
<script id="bolt-executor-template" type="text/html">
<h2>Executors</h2>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." class="tip above">Process latency (ms)</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
<table class="zebra-striped" id="bolt-executor-table"><thead><tr><th class="header headerSortDown"><span class="tip right" title="The unique executor ID.">Id</span></th><th class="header"><span data-original-title="The length of time an Executor (thread) has been alive." class="tip right">Uptime</span></th><th class="header"><span class="tip above" title="The hostname reported by the remote host. (Note that this hostname is not the result of a reverse lookup at the Nimbus node.)">Host</span></th><th class="header"><span class="tip above" title="The port number used by the Worker to which an Executor is assigned. Click on the port number to open the logviewer page for this Worker.">Port</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted.">Emitted</span></th><th class="header"><span class="tip above" title="The number of Tuples emitted that sent to one or more bolts.">Transferred</span></th><th class="header"><span class="tip above" title="If this is around 1.0, the corresponding Bolt is running as fast as it can, so you may want to increase the Bolt's parallelism. This is (number executed * average execute latency) / measurement time.">Capacity (last 10m)</span></th><th class="header"><span data-original-title="The average time a Tuple spends in the execute method. The execute method may complete without sending an Ack for the tuple." class="tip above">Execute latency (ms)</span></th><th class="header"><span class="tip above" title="The number of incoming Tuples processed.">Executed</span></th><th class="header"><span data-original-title="The average time it takes to Ack a Tuple after it is first received. Bolts that join, aggregate or batch may not Ack a tuple until a number of other Tuples have been received." class="tip above">Process latency (ms)</span></th><th class="header"><span class="tip right" title="The length of the executor's queue.">Queue Length</span></th><th class="header"><span data-original-title="The number of Tuples acknowledged by this Bolt." class="tip above">Acked</span></th><th class="header"><span data-original-title="The number of tuples Failed by this Bolt." class="tip left">Failed</span></th></tr></thead>
Contributor

Can you split this line up? Everything looks fine, but it is so long it is really hard to tell what is happening with this line.

Contributor Author

Yeah, I'd like to reformat these templates in a few places. There are quite a few spots like this in the HTML. I can fix this and file a separate JIRA for that.

Contributor

OK, file a separate JIRA and then I am +1, but @nathanmarz was also reviewing this, so I am going to wait for a +1 from him before committing this.

Contributor Author

Pull request and JIRA are up: #243

@nathanmarz
Contributor

A few things:

  1. I made a comment about inserting a type hint to avoid reflection
  2. To be clear, the only queue you're monitoring here is the receive queue for executors?
  3. I need to be convinced more that the interactions with the DisruptorQueue are thread-safe. This is critical code, so we have to be extra careful.

@knusbaum
Contributor Author

knusbaum commented Feb 5, 2015

@nathanmarz, @revans2, @d2r
Do we still want this? If so I'll upmerge and fix all the problems:

  1. easy, I'll take care of this.
  2. I can also monitor the batch-transfer-queue.
  3. This one's trickier. It depends on what you mean by "thread-safe." They are certainly thread-safe in the sense that nothing will get corrupted. I've read the disruptor code for 2.10.1, and the additional code only ever does reads. It's not thread-safe in the sense that the queue population values may be slightly off if a write happens concurrently, but it won't corrupt any values (see the sketch below).
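To make point 3 concrete, here is a simplified analogue (not the actual DisruptorQueue code) of what a read-only population check looks like: the monitoring code only reads the producer and consumer sequences, so a concurrent write can make the reported depth slightly stale but cannot corrupt any state.

import java.util.concurrent.atomic.AtomicLong;

// Simplified analogue of a ring-buffer queue exposing an approximate population.
public class ApproxPopulationQueue {
    private final AtomicLong writeSequence = new AtomicLong(); // advanced by producers
    private final AtomicLong readSequence = new AtomicLong();  // advanced by the consumer

    public void onPublish() { writeSequence.incrementAndGet(); }
    public void onConsume() { readSequence.incrementAndGet(); }

    // Read-only: safe to call from a monitoring thread. If a publish happens
    // between the two reads, the result may be off by a small amount, but no
    // queue state is ever modified or corrupted by this call.
    public long approximatePopulation() {
        return writeSequence.get() - readSequence.get();
    }
}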

@d2r

d2r commented Feb 5, 2015

Do we still want this?

Yes, I think this would be a tremendous help to users while debugging "slowness" issues.

Kyle Nusbaum added 2 commits February 5, 2015 10:17
Conflicts:
	storm-core/src/clj/backtype/storm/daemon/executor.clj
	storm-core/src/clj/backtype/storm/ui/core.clj
	storm-core/src/ui/public/templates/component-page-template.html
Regenerating Thrift code.

Ready for review.
@knusbaum
Contributor Author

knusbaum commented Feb 6, 2015

This is ready for another look.

@d2r

d2r commented Jun 24, 2015

Hi @knusbaum, it has been a while, and it looks like this needs an up-merge again. It might also be good to remove changes to the Thrift-generated code that result only in timestamp changes.

@knusbaum knusbaum closed this Mar 7, 2016
@erikdw
Contributor

erikdw commented May 4, 2016

@knusbaum: I see you closed this. Are you planning to open a new PR for publishing these stats? It would be a big help to us. We often wonder about various behaviors we see in our users' topologies, and the only mechanism we have right now for getting visibility into the queue depths seems to be getting every topology owner to use a custom metrics consumer to publish the metrics, which we would then need to provide fancy aggregation on top of. Having it in the Storm UI and also in the API stats would be very, very helpful.
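For reference, the workaround being described is for each topology owner to register a metrics consumer in the topology config, roughly along these lines (using Storm's built-in LoggingMetricsConsumer as an example; the choice of consumer and parallelism is up to the topology owner):

import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class RegisterMetricsConsumerExample {
    public static Config buildConf() {
        Config conf = new Config();
        // Each topology has to opt in; the consumer then receives the built-in
        // metrics, which still need to be aggregated externally.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}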

@abhishekagarwal87
Contributor

abhishekagarwal87 commented May 5, 2016

I am preparing a patch for publishing in-backlog (receive-queue) and out-backlog (send-queue) metrics. I am able to see the average value of these metrics in the UI over each time window, but only in the executors section (no aggregation at the component level, etc.). Also, users may be interested in the instantaneous value of these metrics, and I don't know how I will fit that into the UI. Any suggestions are welcome.

@erikdw
Contributor

erikdw commented May 5, 2016

@abhishekagarwal87: awesome news!

Any way you can post a screenshot of the UI you are currently proposing? At least please do so with the PR when you send it. That could help others to brainstorm how to put such values into the UI. Maybe you're instead asking for suggestions on how to handle obtaining the instantaneous values?

@abhishekagarwal87
Contributor

I will do that, Eric. I am using https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/metric/internal/LatencyStatAndMetric.java to store the windowed values. It is easy to add the instantaneous values to the result map, so that is not a problem. I will put up a screenshot and the PR soon. Maybe that will clear up the confusion.

@dsKarthick

dsKarthick commented May 5, 2016

@abhishekagarwal87 I am excited to see your screenshot as well. Also, I am curious as to why you chose to use LatencyStatAndMetric.java over CountStatAndMetric.java?

@abhishekagarwal87
Contributor

Because we need the average queue length over each time window. If you look at LatencyStatAndMetric, it does exactly that. I record the queue length on each operation and return the average values with the executor heartbeat.
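A simplified sketch of that approach (not LatencyStatAndMetric's actual API, just the same bucketed sum-and-count idea): record the queue length on each operation and report the per-window average with the heartbeat.

// Simplified analogue of a time-windowed average: keep a sum and a count per
// bucket, roll buckets as time passes, and report sum/count for the window.
public class WindowedAverage {
    private final long bucketMillis;
    private final long[] sums;
    private final long[] counts;
    private long currentBucketStart;
    private int currentIndex;

    public WindowedAverage(int numBuckets, long bucketMillis) {
        this.bucketMillis = bucketMillis;
        this.sums = new long[numBuckets];
        this.counts = new long[numBuckets];
        this.currentBucketStart = System.currentTimeMillis();
    }

    // Called on each queue operation with the current queue length.
    public synchronized void record(long queueLength) {
        rotateIfNeeded();
        sums[currentIndex] += queueLength;
        counts[currentIndex] += 1;
    }

    // Average over all buckets in the window (e.g. reported with the heartbeat).
    public synchronized double windowAverage() {
        rotateIfNeeded();
        long sum = 0, count = 0;
        for (int i = 0; i < sums.length; i++) {
            sum += sums[i];
            count += counts[i];
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    // Advance to a fresh bucket whenever the current bucket's time slice has elapsed.
    private void rotateIfNeeded() {
        long now = System.currentTimeMillis();
        while (now - currentBucketStart >= bucketMillis) {
            currentIndex = (currentIndex + 1) % sums.length;
            sums[currentIndex] = 0;
            counts[currentIndex] = 0;
            currentBucketStart += bucketMillis;
        }
    }
}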

@abhishekagarwal87
Contributor

Guys, follow-up discussion is happening on #1406. Can you let me know your feedback there?
