
[STORM-433] [WIP] Executor queue backlog metric #1406

Closed

Conversation

@abhishekagarwal87 (Contributor)

This is the first set of changes to show the average queue population in the UI. It is a work in progress. Right now, the population is shown only for executors and is not aggregated at the component level.

@abhishekagarwal87 (Contributor, Author)

[Screenshot attached: screen shot 2016-05-08 at 9 48 27 pm]

@HeartSaVioR (Contributor) commented May 9, 2016

@abhishekagarwal87
Thanks for the improvement. :)

Btw, I have some opinions on this change.

A. Some concerns about adding payload to the task heartbeat

As you may know, the reason Storm needs the Pacemaker daemon on large clusters is that Storm includes task metrics in the heartbeat message and stores it to ZK at a short interval (task.heartbeat.frequency.secs, default value 3), which puts heavy pressure on ZK.

So we would like to have some discussion about whether to keep expanding the heartbeat message in the current way, or to change how metrics are sent to Nimbus (as JStorm does). If we can make more room for metrics, we can generate ideas around metrics and add them to enrich what we report. For example, spout tasks could have optional metrics, such as partition information and lag for KafkaSpout.
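
For illustration, here is a minimal sketch of how a spout could expose such an optional metric through Storm's existing IMetric API; the fetchLagFromKafka() helper is hypothetical, and this is not code from this PR.

```java
import java.util.Map;

import org.apache.storm.metric.api.IMetric;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;

public class LagReportingSpout extends BaseRichSpout {
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Publish lag once per 60-second metrics bucket instead of per tuple,
        // keeping the measurement off the critical emit path.
        context.registerMetric("kafka-partition-lag", new IMetric() {
            @Override
            public Object getValueAndReset() {
                return fetchLagFromKafka(); // hypothetical: latest offset minus committed offset
            }
        }, 60);
    }

    private long fetchLagFromKafka() {
        return 0L; // placeholder for a real lag lookup
    }

    @Override
    public void nextTuple() { /* emit logic omitted */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { /* no output fields needed for the sketch */ }
}
```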

B. Metrics for the queue

I guess sojourn time for the queue is one of the most wanted queue metrics, since many users have said that they see very short execute/process latencies for each task but also see a very high complete latency.
(@wangli1426 addresses sojourn time for the disruptor queue, but as he stated in a code comment, it's based on a precondition that is sometimes not true for a problematic task. If we can make it stable, it would be really helpful.)
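
One low-cost way to approximate sojourn time without timestamping every tuple is Little's Law: average sojourn time equals average queue population divided by the arrival rate. A minimal sketch under that assumption (the names are illustrative, not Storm API), with the same steady-state caveat as the precondition mentioned above:

```java
// Little's Law: W = L / lambda, where L is the average queue population
// and lambda is the tuple arrival rate. Valid only in steady state, which
// matches the caveat about problematic tasks above.
public final class SojournEstimate {
    private SojournEstimate() {}

    /**
     * @param avgPopulation  average number of tuples observed in the queue
     * @param arrivalsPerSec tuples entering the queue per second
     * @return estimated average sojourn time in milliseconds
     */
    public static double estimateSojournMs(double avgPopulation, double arrivalsPerSec) {
        if (arrivalsPerSec <= 0) {
            return 0.0; // no traffic, nothing waits in the queue
        }
        return (avgPopulation / arrivalsPerSec) * 1000.0;
    }
}
```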

STORM-1742 covers the accuracy of 'complete latency', but many parts of a tuple's lifecycle are still hidden, for example the average queue sojourn time, serde latency, transfer latency, etc. I don't think we want to add measurements that degrade overall performance, but this is genuinely meaningful information, so I would like to address it as long as it doesn't hurt at all.

@abhishekagarwal87 (Contributor, Author)

Regarding A - That's a good point, and I'm glad to know that this issue is being addressed. The current executor stats take up a good amount of space (0.5 MB in my topologies). I earlier ran into an issue where I had packed too many executors into a single worker and the ZooKeeper write failed. In fact, I was quite surprised to learn that such a heavy write is being done to ZooKeeper :) This PR can wait if we want to solve that problem first.

Regarding B - Having queue depth can give users very good visibility, though I think it would have been more useful before backpressure was available. If you look at the code now, the (queue + overflow buffer) is actually unbounded. Previously, a slow bolt would stall the whole topology, and queue depth would have helped in zeroing in on that bolt.

complete latency - I haven't gone through the discussion closely, so I can't comment right now.
sojourn time - Looks like a good metric; I missed it completely. Though it still remains to be shown in the UI.

So to summarize, there are four major points -

  1. A more efficient way to periodically update executor stats (Metric Producer)
  2. Zeroing in on which queue metrics are useful
  3. Packing queue metrics into the executor stats (simple enough)
  4. Moving Nimbus to read the metrics from a new metric store or some other place

@HeartSaVioR (Contributor)

For the big picture of metrics improvement, yes, we want to address all four major points.

But given that we have already released the major version Storm 1.0.0, we want to keep backward compatibility for 1.x. That's why I'm addressing small improvements first instead of working on long-term, huge improvements. That larger work is also planned for phase 2 of the JStorm merger in 2.0.0.

But if we have many ideas that rely on those improvements, I think we can discuss how to address them and work on that right now (evaluating the options, including porting the metrics feature of JStorm).

Btw, since we now have an alternative (Pacemaker), we may be OK with adding a small amount of payload to the heartbeat message.

If that doesn't make sense, we could even separate the heartbeat message from the metrics message and set the frequency of the latter to 10s or even longer; I guess that would reduce the ZK write load greatly. (Currently Storm aggregates built-in topology metrics and writes them to ZK every 3 seconds by default.)
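
As a rough illustration (the numbers are hypothetical): a topology whose workers each write metric-bearing heartbeats every 3 s issues about 0.33 such ZK writes per worker per second; moving the metrics onto a separate 10 s channel drops that to 0.1 writes per worker per second, roughly a 70% reduction in metric write traffic, while the now-lightweight heartbeats can stay at 3 s.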

@hustfxj (Contributor) commented May 10, 2016

In fact, JStorm can show all queue populations in the UI, and we separate the heartbeat message from the metrics message. I agree with @HeartSaVioR. The work is a long-term, huge effort, so I hope we can plan it for phase 2 in 2.0.0.

@abhishekagarwal87 (Contributor, Author)

Two parts of this PR -
Metric collection - From what I have seen, Storm users are very interested in queue length statistics. The current queue metrics do not include a histogram or average of queue length, etc. There must be a way to add them without touching the critical path.
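
To make "without touching the critical path" concrete, here is a minimal sketch (my illustration, not this PR's implementation): sample the queue population on a background timer rather than instrumenting every enqueue/dequeue, then expose the window's average and maximum.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class QueueDepthSampler {
    private final LongSupplier populationSource; // e.g. a cheap queue-size snapshot
    private long samples;
    private long sum;
    private long max;

    public QueueDepthSampler(LongSupplier populationSource) {
        this.populationSource = populationSource;
    }

    public void start() {
        // Sampling every 100 ms keeps the hot enqueue/dequeue path untouched.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::sample, 0, 100, TimeUnit.MILLISECONDS);
    }

    private synchronized void sample() {
        long population = populationSource.getAsLong();
        samples++;
        sum += population;
        max = Math.max(max, population);
    }

    /** Returns {average, max} for the window and resets, mirroring IMetric.getValueAndReset(). */
    public synchronized long[] snapshotAndReset() {
        long[] result = {samples == 0 ? 0 : sum / samples, max};
        samples = 0;
        sum = 0;
        max = 0;
        return result;
    }
}
```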

Metric reporting - Reporting additional metrics to the UI adds ZooKeeper overhead, but the additional payload is not significant. The other alternative is to let users rely on external components such as Graphite by using an appropriate MetricConsumer.
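
For reference, a minimal sketch of that alternative: register a metrics consumer so metrics flow to an external system instead of (or in addition to) the UI/ZK path. LoggingMetricsConsumer ships with Storm; a Graphite consumer would be a custom or third-party IMetricsConsumer implementation.

```java
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class MetricsConsumerExample {
    public static Config buildConfig() {
        Config conf = new Config();
        // One consumer task receives all built-in metrics, including any
        // queue metrics this PR adds, without enlarging the ZK heartbeat.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```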

@knusbaum (Contributor)

I think it's fine if this goes in. It doesn't add much to the metrics load, and we already have a working solution for clusters whose ZK instances get overloaded. Ultimately, we do want to move all the metrics somewhere else, but until then, we can continue doing them as we have been.

When we make the larger change, it won't be more difficult because of this than it would have been.

@HeartSaVioR (Contributor)

@knusbaum Yeah, agreed. I was looking at the larger view, but when we narrow the view, this PR would add only a tiny amount of load. Seeing is believing, so I would like to see a performance test result.

@HeartSaVioR (Contributor)

@abhishekagarwal87 @knusbaum
Not at all. It shouldn't hurt much, so let's apply this as it is. We can address the broader considerations in other issues.
One thing I'd like to see is the status of backpressure for each queue. We could show the percentage, or just show whether the queue meets the condition to trigger backpressure or not.
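
A minimal sketch of how that status could be derived from a queue's population and capacity; the method names are illustrative, not Storm internals, and the 0.9 threshold is only the documented default for the 1.x backpressure.disruptor.high.watermark setting.

```java
public final class BackpressureStatus {
    private BackpressureStatus() {}

    /** Fill level of a bounded queue as a percentage, for display in the UI. */
    public static double fillPercentage(long population, long capacity) {
        return capacity == 0 ? 0.0 : (100.0 * population) / capacity;
    }

    /** True if the fill level meets the backpressure trigger threshold (e.g. 0.9). */
    public static boolean wouldTriggerBackpressure(long population, long capacity,
                                                   double highWatermark) {
        return capacity > 0 && ((double) population / capacity) >= highWatermark;
    }
}
```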

@HeartSaVioR (Contributor)

@abhishekagarwal87 Any updates on this? I'm supportive of this patch, but you marked it WIP, so I'm waiting for you to finalize it.

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR - I have not been able to find the time. I would also like to do some performance benchmarking for these changes. Can you point me to any references for that?

@HeartSaVioR (Contributor)

@abhishekagarwal87 No need to worry about performance benchmarking; I think the impact is tiny for now. If we want to capture more values about the queue, we can measure then.

@HeartSaVioR (Contributor)

@abhishekagarwal87
Do you have more work in mind, or do you think we can remove [WIP]?
(Input backlog and output backlog seem like a good start. It would be better to have component-level aggregation, but I think I can help if you don't have time to go on.)

If you think it's done, I'll take a deep look.

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR - If we are OK with showing these metrics only at the executor level (and not aggregating them at the component level), it is almost done. Aggregation at the component level can be taken up in a separate JIRA as well.

@HeartSaVioR (Contributor)

@abhishekagarwal87 Yeah, I think we can do that later, and anyone can do it once it's added to the metrics.

@HeartSaVioR (Contributor)

@abhishekagarwal87 Any updates here?

@abhishekagarwal87 (Contributor, Author)

@HeartSaVioR Unfortunately, I haven't done any further work. This feature needs some guaranteed time, which I am not able to give. If anyone else wants to take it up and reuse my work, please do so.

@HeartSaVioR (Contributor)

@abhishekagarwal87
Could I ask a question about this: which points still need further work? Anyone who takes this up and reuses your work, including me, would need that information.

@erikdw (Contributor) commented Dec 7, 2016

@abhishekagarwal87 & @HeartSaVioR: any update on this? Getting visibility into queue depth in Storm is a major reason for us to even consider upgrading off of the storm-0.9.x branch. Do you need help? We can potentially dedicate time next quarter to helping.

@HeartSaVioR (Contributor)

@erikdw Yes, I'd really appreciate it if you can help. It might depend on the metrics renewal, so if you want to address this later, please stay tuned for the metrics renewal by @ptgoetz and @abellina.

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
asfgit closed this in #2880 on Oct 22, 2018
bipinprasad pushed a commit to bipinprasad/storm that referenced this pull request Oct 17, 2019
YSTORM-4621: Removes conflicting jar