[task manager][meta] improve task manager performance logging #109941
I should mention, it's possible that some of the changes made in PR #109741 will end up improving the situation - for instance, cutting down on how often the message is generated (because we relaxed the conditions considered problematic). But I think we'll need to watch it over time.
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I think this is going to become a meta issue; I realized I needed a place to vent my other concerns regarding the task manager perf logging:
I was also thinking yesterday it might be useful to make use of the event log. But I think it would have to be conditional, otherwise it's going to get REAL busy.
What would we add? Of that, I'm not sure. It might be a good place to put the health documents, but I think they would have to be a new `object`/`enabled: false` field, or perhaps `flattened`; maybe a different shape would be better for KQL queries. Task start/end documents might be good too. Wondering if we could use this to do a better estimation of the number of active Kibanas.
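To make the `object`/`enabled: false` vs `flattened` trade-off concrete, here's a minimal sketch of what that mapping fragment could look like. The `task_manager_health` field name is hypothetical, not an existing event log field:

```ts
// A minimal sketch (hypothetical field name) of embedding a health document
// in the event log mapping. `enabled: false` stores the object without
// indexing any of it, so it can't be queried; `flattened` instead indexes
// every leaf value as a keyword, which keeps the document reachable from KQL.
const eventLogMappingAddition = {
  properties: {
    task_manager_health: {
      type: 'object',
      enabled: false,
      // alternative, KQL-queryable shape:
      // type: 'flattened',
    },
  },
};
```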
PR #109741 removes some of the "observability of task manager" by "hiding" the "potential performance problem" log warning: it is now logged at debug level instead of warn.
The original issue that spawned the PR is #109095, which contains references to other issues where this message has appeared and caused undue alarm.
It would be nice to "promote" this message back to a warn, but I think we need to feel pretty confident that the message is only logged when we really know we have a problem.
Some specific problems we've seen:
- That message was logged every 10 seconds, a number of times. We need to apply some throttling (see the throttling sketch after this list). It's especially bad when there are multiple Kibana instances, as they typically all come to the same "conclusion" and generate the same log message.
- The calculation of some of the values used in the health report is suspect. TM guesses how many active Kibana instances there are based on UUIDs found in the task manager documents, but those UUIDs can change when Kibana instances are rebooted. We've seen cases where TM guesses ~2x the actual number of instances, which we suspect happens when a cluster is rebooted and each instance gets a new server UUID. Likewise, we think there are cases where it can undercount. We should see if we can find a more precise way of estimating this (see the estimation sketch after this list).
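On the throttling point, a minimal sketch of one way to do it, not task manager's actual implementation: a hypothetical wrapper that emits a given warning at most once per window, no matter how many monitoring cycles flag the same condition in between.

```ts
// Assumption: at most one warning per hour per distinct warning key.
const WARN_THROTTLE_MS = 60 * 60 * 1000;

const lastWarnedAt = new Map<string, number>();

function warnThrottled(
  logger: { warn(message: string): void },
  key: string,
  message: string
): void {
  const now = Date.now();
  const last = lastWarnedAt.get(key) ?? 0;
  if (now - last >= WARN_THROTTLE_MS) {
    lastWarnedAt.set(key, now);
    logger.warn(message);
  }
}

// Usage: the health check can call this every cycle; the log line only
// appears once per window per key.
// warnThrottled(logger, 'potential-performance-problem', 'Detecting potential performance issue...');
```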
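And on the instance-count estimation, a rough sketch of the idea, assuming task documents carry an `ownerId` (the server UUID of the claiming Kibana) and a `retryAt` timestamp; the windowing logic here is hypothetical. Counting only owners of unexpired claims lets UUIDs orphaned by a reboot age out instead of inflating the estimate.

```ts
interface ClaimedTaskDoc {
  ownerId: string; // UUID of the Kibana instance that claimed the task
  retryAt: string; // ISO timestamp; the claim is considered live until this passes
}

function estimateActiveKibanaInstances(
  tasks: ClaimedTaskDoc[],
  now: number = Date.now()
): number {
  const activeOwners = new Set<string>();
  for (const task of tasks) {
    // Skip expired claims, so a UUID abandoned by a restarted instance
    // stops counting toward the estimate once its claim lapses.
    if (Date.parse(task.retryAt) > now) {
      activeOwners.add(task.ownerId);
    }
  }
  return activeOwners.size;
}
```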