
[STORM-433] [WIP] Executor queue backlog metric #1406

Closed

Conversation

@abhishekagarwal87 (Contributor)

This is the first set of changes to show the average queue population in the UI. It is a work in progress. Right now, the population is shown only for executors and is not aggregated at the component level.

@abhishekagarwal87 (Contributor, Author)

[Screenshot attached: screen shot 2016-05-08 at 9 48 27 pm]

@HeartSaVioR (Contributor) commented May 9, 2016

@abhishekagarwal87
Thanks for the improvement. :)

Btw, I have some opinions on this change.

A. Some concerns about adding payload to the task heartbeat

As you may know, the reason Storm needs the Pacemaker daemon on large clusters is that Storm includes task metrics in the heartbeat message and stores it to ZK at a short interval (task.heartbeat.frequency.secs, default value 3), which puts heavy pressure on ZK.

So we would like to have some discussion about whether to keep expanding the heartbeat message in the current way, or to change how metrics are sent to Nimbus (as JStorm does). If we can make more room for metrics, we can generate ideas around metrics and add them to enrich what we report. For example, spout tasks could have optional metrics, such as partition information and lag for KafkaSpout.
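
For illustration, here is a minimal sketch of how a spout could expose such an optional metric through Storm's existing IMetric API; the fetchLagFromKafka() helper is hypothetical, and this is not code from this PR.

```java
import java.util.Map;

import org.apache.storm.metric.api.IMetric;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;

public class LagReportingSpout extends BaseRichSpout {
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Publish lag once per 60-second metrics bucket instead of per tuple,
        // keeping the measurement off the critical emit path.
        context.registerMetric("kafka-partition-lag", new IMetric() {
            @Override
            public Object getValueAndReset() {
                return fetchLagFromKafka(); // hypothetical: latest offset minus committed offset
            }
        }, 60);
    }

    private long fetchLagFromKafka() {
        return 0L; // placeholder for a real lag lookup
    }

    @Override
    public void nextTuple() { /* emit logic omitted */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no output fields needed for the sketch */ }
}
```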

B. Metrics for the queue

I guess sojourn time for the queue is one of the most wanted queue metrics, since many users have said that they see very short execute/process latencies for each task but also see a very high complete latency.
(@wangli1426 addresses sojourn time for the disruptor queue, but as he stated in a code comment, it's based on a precondition that is sometimes not true for a problematic task. If we can make it stable, it would be really helpful.)
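
One low-cost way to approximate sojourn time without timestamping every tuple is Little's Law: average sojourn time equals average queue population divided by the arrival rate. A minimal sketch under that assumption (the names are illustrative, not Storm API), with the same steady-state caveat as the precondition mentioned above:

```java
// Little's Law: W = L / lambda, where L is the average queue population
// and lambda is the tuple arrival rate. Valid only in steady state, which
// matches the caveat about problematic tasks above.
public final class SojournEstimate {
    private SojournEstimate() {}

    /**
     * @param avgPopulation  average number of tuples observed in the queue
     * @param arrivalsPerSec tuples entering the queue per second
     * @return estimated average sojourn time in milliseconds
     */
    public static double estimateSojournMs(double avgPopulation, double arrivalsPerSec) {
        if (arrivalsPerSec <= 0) {
            return 0.0; // no traffic, nothing waits in the queue
        }
        return (avgPopulation / arrivalsPerSec) * 1000.0;
    }
}
```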

STORM-1742 covers the accuracy of 'complete latency', but many parts of a tuple's lifecycle are still hidden, for example the average queue sojourn time, serde latency, transfer latency, etc. I don't think we want to add measurements that degrade overall performance, but this is genuinely meaningful information, so I would like to address it as long as it doesn't hurt at all.

@abhishekagarwal87 (Contributor, Author)

Regarding A - That's a good point, and I'm glad to know that this issue is being addressed. The current executor stats take up a good amount of space (0.5 MB in my topologies). I earlier ran into an issue where I had packed too many executors into a single worker and the ZooKeeper write failed. In fact, I was quite surprised to learn that such a heavy write is being done to ZooKeeper :) This PR can wait if we want to solve that problem first.

Regarding B - Having queue depth can give users very good visibility, though I think it would have been more useful before backpressure was available. If you look at the code now, the (queue + overflow buffer) is actually unbounded. Previously, a slow bolt would stall the whole topology, and queue depth would have helped in zeroing in on that bolt.

complete latency - I haven't gone through the discussion closely, so I can't comment right now.
sojourn time - Looks like a good metric; I missed it completely. Though it still remains to be shown in the UI.

So to summarize, there are four major points -

  1. A more efficient way to periodically update executor stats (Metric Producer)
  2. Zeroing in on which queue metrics are useful
  3. Packing queue metrics into the executor stats (simple enough)
  4. Moving Nimbus to read the metrics from a new metric store or some other place

@HeartSaVioR (Contributor)

For the big picture of metrics improvement, yes, we want to address all four major points.

But given that we have already released the major version Storm 1.0.0, we want to keep backward compatibility for 1.x. That's why I'm addressing small improvements first instead of working on long-term, huge improvements. That larger work is also planned for phase 2 of the JStorm merger in 2.0.0.

But if we have many ideas that rely on those improvements, I think we can discuss how to address them and work on that right now (evaluating the options, including porting the metrics feature of JStorm).

Btw, since we now have an alternative (Pacemaker), we may be OK with adding a small amount of payload to the heartbeat message.

If that doesn't make sense, we could even separate the heartbeat message from the metrics message and set the frequency of the latter to 10s or even longer; I guess that would reduce the ZK write load greatly. (Currently Storm aggregates built-in topology metrics and writes them to ZK every 3 seconds by default.)
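
As a rough illustration (the numbers are hypothetical): a topology whose workers each write metric-bearing heartbeats every 3 s issues about 0.33 such ZK writes per worker per second; moving the metrics onto a separate 10 s channel drops that to 0.1 writes per worker per second, roughly a 70% reduction in metric write traffic, while the now-lightweight heartbeats can stay at 3 s.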

@hustfxj (Contributor) commented May 10, 2016

In fact, JStorm can show all queue populations in the UI, and we separate the heartbeat message from the metrics message. I agree with @HeartSaVioR. The work is a long-term, huge effort, so I hope we can plan it for phase 2 in 2.0.0.

@abhishekagarwal87 (Contributor, Author)

Two parts of this PR -
Metric collection - From what I have seen, Storm users are very interested in queue length statistics. The current queue metrics do not include a histogram or average of queue length, etc. There must be a way to add them without touching the critical path.
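
To make "without touching the critical path" concrete, here is a minimal sketch (my illustration, not this PR's implementation): sample the queue population on a background timer rather than instrumenting every enqueue/dequeue, then expose the window's average and maximum.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class QueueDepthSampler {
    private final LongSupplier populationSource; // e.g. a cheap queue-size snapshot
    private long samples;
    private long sum;
    private long max;

    public QueueDepthSampler(LongSupplier populationSource) {
        this.populationSource = populationSource;
    }

    public void start() {
        // Sampling every 100 ms keeps the hot enqueue/dequeue path untouched.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::sample, 0, 100, TimeUnit.MILLISECONDS);
    }

    private synchronized void sample() {
        long population = populationSource.getAsLong();
        samples++;
        sum += population;
        max = Math.max(max, population);
    }

    /** Returns {average, max} for the window and resets, mirroring IMetric.getValueAndReset(). */
    public synchronized long[] snapshotAndReset() {
        long[] result = {samples == 0 ? 0 : sum / samples, max};
        samples = 0;
        sum = 0;
        max = 0;
        return result;
    }
}
```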

Metric reporting - Reporting additional metrics to the UI adds ZooKeeper overhead, but the additional payload is not significant. The other alternative is to let users rely on external components such as Graphite by using an appropriate MetricConsumer.
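
For reference, a minimal sketch of that alternative: register a metrics consumer so metrics flow to an external system instead of (or in addition to) the UI/ZK path. LoggingMetricsConsumer ships with Storm; a Graphite consumer would be a custom or third-party IMetricsConsumer implementation.

```java
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class MetricsConsumerExample {
    public static Config buildConfig() {
        Config conf = new Config();
        // One consumer task receives all built-in metrics, including any
        // queue metrics this PR adds, without enlarging the ZK heartbeat.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```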

@knusbaum (Contributor)

I think it's fine if this goes in. It doesn't add much to the metrics load, and we already have a working solution for clusters whose ZK instances get overloaded. Ultimately, we do want to move all the metrics somewhere else, but until then, we can continue doing them as we have been.

When we make the larger change, it won't be more difficult because of this than it would have been.

@HeartSaVioR (Contributor)

@knusbaum Yeah, agreed. I was looking at the larger view, but when we narrow the view, this PR would add only a tiny amount of load. Seeing is believing, so I would like to see a performance test result.

@HeartSaVioR (Contributor)

@abhishekagarwal87 @knusbaum
Not at all. It shouldn't hurt much, so let's apply this as it is. We can address the broader considerations in other issues.
One thing I'd like to see is the status of backpressure for each queue. We could show the percentage, or just show whether the queue meets the condition to trigger backpressure or not.
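
A minimal sketch of how that status could be derived from a queue's population and capacity; the method names are illustrative, not Storm internals, and the 0.9 threshold is only the documented default for the 1.x backpressure.disruptor.high.watermark setting.

```java
public final class BackpressureStatus {
    private BackpressureStatus() {}

    /** Fill level of a bounded queue as a percentage, for display in the UI. */
    public static double fillPercentage(long population, long capacity) {
        return capacity == 0 ? 0.0 : (100.0 * population) / capacity;
    }

    /** True if the fill level meets the backpressure trigger threshold (e.g. 0.9). */
    public static boolean wouldTriggerBackpressure(long population, long capacity,
                                                   double highWatermark) {
        return capacity > 0 && ((double) population / capacity) >= highWatermark;
    }
}
```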

@HeartSaVioR (Contributor)

@abhishekagarwal87 Any updates on this? I'm supportive of this patch, but you marked it WIP, so I'm waiting for you to finalize it.

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR - I have not been able to find the time. I would also like to do some performance benchmarking for these changes. Can you point me to any references for that?

@HeartSaVioR (Contributor)

@abhishekagarwal87 No need to worry about performance benchmarking; I think the impact is tiny for now. If we want to capture more values about the queue, we can measure then.

@HeartSaVioR (Contributor)

@abhishekagarwal87
Do you have more work in mind, or do you think we can remove [WIP]?
(Input backlog and output backlog seem like a good start. It would be better to have component-level aggregation, but I think I can help if you don't have time to go on.)

If you think it's done, I'll take a deep look.

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR - If we are OK with showing these metrics only at the executor level (and not aggregating them at the component level), it is almost done. Aggregation at the component level can be taken up in a separate JIRA as well.

@HeartSaVioR (Contributor)

@abhishekagarwal87 Yeah, I think we can do that later, and anyone can do it once it's added to the metrics.

@HeartSaVioR (Contributor)

@abhishekagarwal87 Any updates here?

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR Unfortunately, I haven't done any further work. This feature needs some guaranteed time, which I am not able to give. If anyone else wants to take it up and reuse my work, please do so.

@HeartSaVioR (Contributor)

@abhishekagarwal87
Could I ask a question about this: which points still need further work? Anyone who takes this up and reuses your work, including me, would need that information.

@erikdw (Contributor) commented Dec 7, 2016

@abhishekagarwal87 & @HeartSaVioR: any update on this? Getting visibility into queue depth in Storm is a major reason for us to even consider upgrading off of the storm-0.9.x branch. Do you need help? We can potentially dedicate time next quarter to helping.

@HeartSaVioR (Contributor)

@erikdw Yes, I'd really appreciate it if you can help. It might depend on the metrics renewal, so if you want to address this later, please stay tuned for the metrics renewal by @ptgoetz and @abellina.

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
asfgit closed this in #2880 on Oct 22, 2018
bipinprasad pushed a commit to bipinprasad/storm that referenced this pull request Oct 17, 2019
YSTORM-4621: Removes conflicting jar