[task manager][meta] improve task manager performance logging #109941

pmuellr · 2021-08-24T20:01:57Z

PR #109741 reduces the observability of task manager: it "hides" the "potential performance problem" log warning by demoting it to a debug-level message.

The original issue that spawned the PR is #109095; it contains references to other issues where this message has appeared and caused undue alarm.

It would be nice to "promote" this message back to a "warn", but I think we need to feel pretty confident that the message is only logged when we really know we have a problem.

Some specific problems we've seen:

that message was logged every 10 seconds, a number of times. We need to apply some throttling. It's especially bad when there are multiple Kibana instances, as typically they all come to the same "conclusion" about generating the log message
the calculation of some of the values used in the health report is suspect. TM guesses at how many active Kibana instances there are, based on UUIDs found in the task manager documents, but these can change when Kibana instances are rebooted. We've seen cases where TM guesses ~2x the number of instances, which we suspect happens when a cluster is rebooted and each instance gets a new server UUID. Likewise, we think there are cases where it can undercount. We should see if we can find a more precise way of estimating this.
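
To make the throttling idea concrete, here is a minimal sketch of a time-based throttle around a log call. `createThrottledWarner`, its parameters, and the one-minute interval are all hypothetical and not taken from the task manager code. It only throttles within a single process, so it would not stop several Kibana instances from each logging the same message.

```typescript
// Hypothetical helper, not part of Kibana: lets a message through at most
// once per `intervalMs`. `now` is injectable so the behavior can be tested.
function createThrottledWarner(
  intervalMs: number,
  log: (message: string) => void,
  now: () => number = () => Date.now()
): (message: string) => boolean {
  let lastLoggedAt: number | undefined;

  // Returns true if the message was logged, false if it was suppressed.
  return (message: string): boolean => {
    const t = now();
    if (lastLoggedAt !== undefined && t - lastLoggedAt < intervalMs) {
      return false;
    }
    lastLoggedAt = t;
    log(message);
    return true;
  };
}

// Usage: at most one warning per minute (the interval is an assumption).
const warnPerfProblem = createThrottledWarner(60_000, (m) => console.warn(m));
warnPerfProblem('potential performance problem detected');
```

The suppressed calls are simply dropped here; counting them and reporting the count on the next logged message would be a reasonable variation.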

pmuellr · 2021-08-24T20:04:28Z

I should mention, it's possible that some of the changes made in PR #109741 will end up improving the situation - for instance, cutting down on the number of times the message is generated over time (because we relaxed the conditions considered problematic). But I think we'll need to see over time.

elasticmachine · 2021-08-24T20:04:56Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr · 2021-08-25T13:18:58Z

I think this is going to become a meta issue; I realized I needed a place to vent my other concerns regarding the task manager perf logging:

we dump the health record too often, especially if you just have debug logging set. I think we probably should NOT dump it if the error came from capacity concerns, and probably not if the reason hasn't changed since the last time it was dumped.
the health record we're dumping is not useful for diagnostic purposes, as it's stringified JSON (with escaped " chars). I think we can add this to the log via a meta object field, which would be easier to access. But I think there's a down-side: it then won't work with log appenders that don't deal directly with JSON (which I assume is the default, and how the logger is set up at dev time).
the warning message with the doc link (from the latest PR) doesn't get printed on capacity sizing issues, but I think it should
the warning message with the doc link has a fairly naive throttle - it only prints on a transition from an "OK" to "not OK" state. I think it should be time-based - every minute even sounds like too much, if I had to look through a day's worth of logs. Maybe an hour?
we need to plumb the "reason" through when we set the Kibana status, along with the actual level; today we're just logging it, so it won't show up in the status UI or report
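
The first two points could be sketched roughly as follows: skip the dump for capacity-related reasons, skip it when the reason is unchanged since the last dump, and pass the record as a structured meta object rather than a stringified blob. The `MetaLogger` interface, the `HealthDump` shape, and the `'capacity'` reason string are assumptions made for illustration; they are not the actual Kibana logger or health types.

```typescript
// Hypothetical shapes, for illustration only.
interface MetaLogger {
  debug(message: string, meta?: Record<string, unknown>): void;
}

interface HealthDump {
  reason: string; // why the health status is not OK
  record: Record<string, unknown>; // the health record itself
}

function createHealthDumper(
  logger: MetaLogger,
  skipReasons: ReadonlySet<string> = new Set(['capacity'])
): (dump: HealthDump) => boolean {
  let lastDumpedReason: string | undefined;

  // Returns true if the record was dumped, false if it was skipped.
  return (dump: HealthDump): boolean => {
    if (skipReasons.has(dump.reason)) return false; // e.g. capacity concerns
    if (dump.reason === lastDumpedReason) return false; // nothing new to report
    lastDumpedReason = dump.reason;
    // The record travels as a meta field, so no escaped quotes in the message.
    logger.debug(`task manager health: ${dump.reason}`, { health: dump.record });
    return true;
  };
}
```

As noted above, a meta object only helps if the configured appender emits JSON; with a plain-text appender the record may not be rendered at all.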

pmuellr · 2021-08-25T13:29:47Z

I was also thinking yesterday it might be useful to make use of the event log. But I think it would have to be conditional, otherwise it's going to get REAL busy.

API to enable/disable logging TM events
API to return event log docs for TM over a range of time, maybe KQL filtering

What would we add? Of that, I'm not sure. It might be a good place to put the health documents, but I think they would have to be a new object/enabled: false field. Or perhaps flattened. Maybe a different shape that would be better for KQL queries. Task start/end documents might be good. Wondering if we could use this to do better estimation of the number of active Kibanas.
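
As a rough illustration of that last thought, the sketch below counts distinct server UUIDs seen in task run events within a recent time window, so UUIDs from before a reboot age out instead of inflating the count. The `TaskRunEvent` shape, the function name, and the window size are hypothetical; nothing here reflects an existing event log schema.

```typescript
// Hypothetical event shape: one entry per task start/end written by a Kibana instance.
interface TaskRunEvent {
  serverUuid: string;
  timestamp: number; // epoch milliseconds
}

// Counts distinct server UUIDs with at least one event in [nowMs - windowMs, nowMs].
function estimateActiveInstances(
  events: readonly TaskRunEvent[],
  nowMs: number,
  windowMs: number
): number {
  const recent = new Set<string>();
  for (const event of events) {
    const age = nowMs - event.timestamp;
    if (age >= 0 && age <= windowMs) {
      recent.add(event.serverUuid);
    }
  }
  return recent.size;
}
```

The window is the main trade-off: too long and stale UUIDs from a reboot are still counted, too short and an idle but healthy instance is missed.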

pmuellr added the bug Fixes for quality problems that affect the customer experience label Aug 24, 2021

botelastic bot added the needs-team Issues missing a team label label Aug 24, 2021

pmuellr added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed needs-team Issues missing a team label labels Aug 24, 2021

pmuellr added Feature:Task Manager research labels Aug 24, 2021

pmuellr mentioned this issue Aug 24, 2021

[task manager] provide better diagnostics when task manager performance is degraded #109741

Merged

7 tasks

pmuellr changed the title ~~[task manager] investigate false positives, spamming of task manager health warning~~ [task manager][meta] improve task manager performance logging Aug 25, 2021

pmuellr added the Meta label Aug 25, 2021

mikecote added technical debt Improvement of the software architecture and operational architecture discuss labels Aug 25, 2021

mikecote added this to Backlog in Kibana Alerting Aug 25, 2021

XavierM removed this from Backlog in Kibana Alerting Jan 6, 2022

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
