[STORM-433] [WIP] Executor queue backlog metric #1406
Conversation
@abhishekagarwal87 Btw, I have some opinions on this change.

A. Some concerns about adding payload to the task heartbeat. As you may know, the reason Storm needs the Pacemaker daemon on large clusters is that Storm includes task metrics in the heartbeat message and stores them in ZK at a short interval (task.heartbeat.frequency.secs, default 3), which puts a lot of pressure on ZK. So we would like to have some discussion about whether to keep expanding the heartbeat message the current way, or to change how metrics are sent to Nimbus (like JStorm). If we can make more room for metrics, we can come up with ideas to enrich them. For example, spout tasks could have optional metrics, such as partition information and lag for KafkaSpout.

B. Metrics for the queue. I guess sojourn time is one of the most wanted queue metrics, since many users say they see very short execute/process latencies for each task but a very high complete latency. STORM-1742 covers the accuracy of 'complete latency', but many parts of a tuple's lifecycle are still hidden, for example average queue sojourn time, serde latency, transfer latency, etc. We don't want to hurt overall performance in order to measure, but this is genuinely meaningful information, so I would like to address it if it doesn't hurt at all.
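For context on the "optional spout metrics" idea above, here is a minimal sketch (not part of this PR) of how a spout could already expose an extra per-task metric, such as Kafka partition lag, through Storm's existing metrics API, assuming the 1.x `org.apache.storm` packages. The lag computation (`computeCurrentLag`) is a hypothetical placeholder.

```java
import java.util.Map;
import org.apache.storm.metric.api.IMetric;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;

public class LagReportingSpout extends BaseRichSpout {

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Register an optional metric; the time bucket (60s here, illustrative)
        // keeps the extra reporting payload small.
        context.registerMetric("partition-lag", new IMetric() {
            @Override
            public Object getValueAndReset() {
                return computeCurrentLag(); // hypothetical helper, see below
            }
        }, 60);
    }

    private long computeCurrentLag() {
        // Placeholder: a real spout would compare committed offsets against the
        // latest available offsets for the partitions it owns.
        return 0L;
    }

    @Override
    public void nextTuple() { /* emit tuples here */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```

Metrics registered this way flow to any configured metrics consumer rather than adding to the heartbeat payload, which is one reason the thread keeps distinguishing the two paths.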
Regarding A - That's a good point, and I'm glad to know this issue is being addressed. The current executor stats take up a good amount of space (0.5 MB in my topologies). I earlier ran into an issue where I had packed too many executors into a single worker and the ZooKeeper write failed. In fact, I was quite surprised to learn that such a heavy write is being done to ZooKeeper :) This PR can wait if we want to solve that problem first. Regarding B - Having queue depth gives users very good visibility, though I think it would have been more useful before backpressure was available. If you look at the code now, the (queue + overflow buffer) is actually unbounded. Previously, a slow bolt would stall the whole topology, and queue depth would have helped in zeroing in on that bolt. Complete latency - I haven't gone through that discussion closely, so I can't comment right now. So to summarize, there are four major points -
For addressing the big picture of metrics improvement, yes, we want to address the four major points. But given that we have already released the major version Storm 1.0.0, we want to keep backward compatibility for 1.x. That's why I'm addressing small improvements first instead of working on long-term, large improvements. Phase 2 of the JStorm merger is also planned for 2.0.0. But if we have many ideas that rely on those improvements, I think we can discuss how to address them and work on that right now (evaluating several options, including porting the metrics feature of JStorm). Btw, since we have an alternative (Pacemaker) now, we may feel OK adding a small amount of payload to the heartbeat message. If that doesn't make sense, we can even separate the heartbeat message from the metrics message and set the frequency of the latter to 10s or even longer; I guess we could reduce the ZK write load greatly (since Storm aggregates built-in topology metrics and writes them to ZK every 3 seconds by default).
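For reference, Storm already exposes separate knobs for the two intervals discussed here: the heartbeat interval (the ZK path) and the built-in metrics bucket (the MetricsConsumer path). A minimal sketch of the relevant keys follows; the values (10s/60s) are illustrative only, and task.heartbeat.frequency.secs is normally set cluster-wide in storm.yaml rather than per topology.

```java
import org.apache.storm.Config;

public class MetricsFrequencyExample {
    public static Config buildConf() {
        Config conf = new Config();
        // Built-in metrics are bucketed and emitted to metrics consumers on this
        // interval; a longer bucket means fewer, larger reports.
        conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);
        // Task heartbeats, which currently carry the stats payload to ZK, default
        // to 3 seconds; raising this (usually in storm.yaml) trades freshness for
        // ZK write load.
        conf.put(Config.TASK_HEARTBEAT_FREQUENCY_SECS, 10);
        return conf;
    }
}
```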
In fact, JStorm can show all queue populations in the UI, and we separate the heartbeat message from the metrics message. I agree with @HeartSaVioR. The work is a long-term, large effort, so I hope we can plan it for phase 2 in 2.0.0.
Two parts of this PR - Metric reporting: reporting additional metrics to the UI adds ZooKeeper overhead, but the additional payload is not significant. The other alternative is to let users rely on external components such as Graphite by using an appropriate MetricsConsumer.
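As a concrete illustration of the MetricsConsumer alternative mentioned above, here is a minimal sketch of wiring one up. `LoggingMetricsConsumer` ships with Storm; a Graphite-backed `IMetricsConsumer` implementation would be a third-party or custom class and is only referenced hypothetically here.

```java
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class MetricsConsumerExample {
    public static Config buildConf() {
        Config conf = new Config();
        // One consumer task receives all built-in (and custom) metrics emitted on
        // the metrics stream; swap in a Graphite-backed consumer class to ship
        // metrics off-cluster instead of relying on the UI/ZK path.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```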
I think it's fine if this goes in. It doesn't add much to the metrics load, and we already have a working solution for clusters whose ZK instances get overloaded. Ultimately, we do want to move all the metrics somewhere else, but until then we can continue doing them as we have been. When we make the larger change, it won't be more difficult because of this than it would have been.
@knusbaum Yeah, agreed. I was taking the larger view, but when we narrow it down, this PR would add only a tiny amount of load. Seeing is believing, so I would like to see a performance test result.
@abhishekagarwal87 @knusbaum
@abhishekagarwal87 Any updates on this? I'm supportive of this patch, but you marked it WIP, so I'm waiting for you to finalize.
@HeartSaVioR - I am not able to find the time. I would also like to do some performance benchmarking for these changes. Can you point me to any references for that?
@abhishekagarwal87 No worries about performance benchmarking. I think the impact is tiny for now. If we want to collect more values about the queue, we can measure them then.
@abhishekagarwal87 If you think it's done, I'll take a deep look.
@HeartSaVioR - If we are OK with showing these metrics only at the executor level (and not aggregating them at the component level), it is almost done. Aggregation at the component level can be taken up in a separate JIRA as well.
@abhishekagarwal87 Yeah, I think we can do that later, and anyone can do it once it's added to the metrics.
@abhishekagarwal87 Any updates here?
@HeartSaVioR Unfortunately I haven't done any further work. This feature needs some guaranteed time, which I am not able to give. If anyone else wants to take it up and reuse my work, please do so.
@abhishekagarwal87
@abhishekagarwal87 & @HeartSaVioR: any update on this? Getting visibility into queue depth in Storm is a major reason for us to even consider upgrading off of the storm-0.9.x branch. Do you need help? We can potentially dedicate time in our next quarter towards helping.
We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469
This is the first set of changes to show the average queue population in the UI. It is a work in progress. Right now, the population is shown only for executors and is not aggregated at the component level.
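To illustrate the kind of metric described here, below is a simplified, hypothetical sketch of an "average queue population" gauge. The actual PR instruments Storm's internal executor receive/send queues directly, which is not shown; this generic sampler only illustrates the averaging-per-bucket idea, assuming the 1.x `IMetric` interface.

```java
import org.apache.storm.metric.api.IMetric;

public class AvgQueuePopulationMetric implements IMetric {
    private long sampleCount = 0;
    private long populationSum = 0;

    // Called periodically (e.g., from the executor loop in this hypothetical setup)
    // with the queue population observed at that instant.
    public synchronized void addSample(long currentPopulation) {
        sampleCount++;
        populationSum += currentPopulation;
    }

    // Storm calls this once per metrics bucket; report the average population over
    // the bucket and reset the counters for the next interval.
    @Override
    public synchronized Object getValueAndReset() {
        double avg = (sampleCount == 0) ? 0.0 : ((double) populationSum) / sampleCount;
        sampleCount = 0;
        populationSum = 0;
        return avg;
    }
}
```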