[task manager][meta] improve task manager performance logging #109941

Open
pmuellr opened this issue Aug 24, 2021 · 4 comments
Labels
bug Fixes for quality problems that affect the customer experience discuss Feature:Task Manager Meta research Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) technical debt Improvement of the software architecture and operational architecture

Comments

@pmuellr
Member

pmuellr commented Aug 24, 2021

PR #109741 removes some of the "observability of task manager" by "hiding" the "potential performance problem" log warning: the message is now logged at debug level instead of warn.

The original issue that spawned the PR is #109095; it contains references to other issues where this message has appeared and caused undue alarm.

It would be nice to "promote" this message back to a "warn", but I think we need to feel pretty confident that the message is only logged when we really know we have a problem.

Some specific problems we've seen:

  • that message was logged every 10 seconds, a number of times. We need to apply some throttling. It's especially bad when there are multiple Kibana instances, as typically they all come to the same "conclusion" about generating the log message

  • the calculation of some of the values used in the health report is suspect. TM guesses at how many active Kibana instances there are, based on UUIDs found in the task manager documents, but these can change when Kibana instances are rebooted. We've seen cases where TM guesses ~2x the number of instances, which we're guessing happens when a cluster is rebooted and each instance gets a new server UUID. Likewise, we think there are cases where it can undercount. We should see if we can find a more precise way of guessing this.
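One way to picture the overcounting problem, and a possible mitigation, is to count only UUIDs that have shown recent activity, so that UUIDs from before a reboot age out. This is a minimal sketch, not how task manager computes the value today; the `TaskDoc` shape, the `ownerId` and `lastSeenMs` field names, and the window size are all assumptions made for illustration.

```typescript
// Hypothetical document shape: only the fields this sketch needs.
interface TaskDoc {
  ownerId: string | null; // UUID of the instance that last claimed the task (assumed field name)
  lastSeenMs: number; // epoch millis of the most recent activity on the task (assumed field name)
}

// Count distinct owner UUIDs that were active inside a recency window.
function estimateActiveInstances(tasks: TaskDoc[], nowMs: number, windowMs: number): number {
  const active = new Set<string>();
  for (const task of tasks) {
    if (task.ownerId !== null && nowMs - task.lastSeenMs <= windowMs) {
      active.add(task.ownerId);
    }
  }
  // Assume at least one instance exists: the one running this code.
  return Math.max(1, active.size);
}

// Two instances rebooted and came back with new UUIDs.
const sampleNowMs = 600_000;
const sampleDocs: TaskDoc[] = [
  { ownerId: 'a-old', lastSeenMs: 0 },
  { ownerId: 'b-old', lastSeenMs: 0 },
  { ownerId: 'a-new', lastSeenMs: 590_000 },
  { ownerId: 'b-new', lastSeenMs: 590_000 },
];
console.log(estimateActiveInstances(sampleDocs, sampleNowMs, 60_000)); // 2
```

A recency window only helps with the overcount. An instance that has claimed nothing inside the window would still be missed, so the undercount case needs a different signal.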

@pmuellr pmuellr added the bug Fixes for quality problems that affect the customer experience label Aug 24, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Aug 24, 2021
@pmuellr
Member Author

pmuellr commented Aug 24, 2021

I should mention, it's possible that some of the changes made in PR #109741 will end up improving the situation - for instance, cutting down on the number of times the message is generated over time (because we relaxed the conditions considered problematic). But I think we'll need to see over time.

@pmuellr pmuellr added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed needs-team Issues missing a team label labels Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member Author

pmuellr commented Aug 25, 2021

I think this is going to become a meta issue; I realized I needed a place to vent my other concerns regarding the task manager perf logging:

  • we dump the health record too often, especially if you just have debug logging set. I think we probably should NOT dump it if the error came from capacity concerns, and probably not if the reason hasn't changed since the last time it was dumped.

  • the health record we're dumping is not useful for diagnostics, as it's stringified JSON (with escaped " chars). I think we can add this to the log via a meta object field, which would be easier to access. The downside is that it then won't work with log appenders that don't deal directly with JSON (which I assume is the default, and how the logger is set up at dev time).

  • the warning message with the doc link (from the latest PR) doesn't get printed on capacity sizing issues, but I think it should

  • the warning message with the doc link has a fairly naive throttle - it only prints on a transition from an "OK" to "not OK" state. I think it should be time-based - every minute even sounds like too much, if I had to look through a day's worth of logs. Maybe an hour?

  • we need to plumb the "reason" through when we set the Kibana status, along with the actual level; today we're just logging it, so it won't show up in the status UI or report
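The throttling ideas in the list above (skip the log when the reason hasn't changed, and otherwise repeat it at most once per interval) could be combined roughly as follows. This is a sketch under assumptions: `ThrottledWarning` is a hypothetical helper, not part of Kibana's logging API, and the one-hour interval is just the value floated above.

```typescript
// Hypothetical helper: decides whether a warning should be emitted right now.
class ThrottledWarning {
  private readonly intervalMs: number;
  private lastLoggedMs: number | null = null;
  private lastReason: string | null = null;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  // Returns true when the reason changed or the interval has elapsed
  // since the last emitted warning; records the emission when it does.
  shouldLog(reason: string, nowMs: number): boolean {
    const reasonChanged = reason !== this.lastReason;
    const intervalElapsed =
      this.lastLoggedMs === null || nowMs - this.lastLoggedMs >= this.intervalMs;
    if (!reasonChanged && !intervalElapsed) {
      return false;
    }
    this.lastLoggedMs = nowMs;
    this.lastReason = reason;
    return true;
  }
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const healthWarning = new ThrottledWarning(ONE_HOUR_MS);

// The caller checks the throttle on every health evaluation (every 10 seconds, say).
if (healthWarning.shouldLog('capacity', Date.now())) {
  console.warn('task manager health degraded: capacity');
}
```

The state here lives in a single process, so with several Kibana instances each one would still log once per interval. Cutting that down further would need shared state, which this sketch does not attempt.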

@pmuellr pmuellr changed the title [task manager] investigate false positives, spamming of task manager health warning [task manager][meta] improve task manager performance logging Aug 25, 2021
@pmuellr pmuellr added the Meta label Aug 25, 2021
@pmuellr
Member Author

pmuellr commented Aug 25, 2021

I was also thinking yesterday it might be useful to make use of the event log. But I think it would have to be conditional, otherwise it's going to get REAL busy.

  • API to enable/disable logging TM events
  • API to return event log docs for TM over a range of time, maybe KQL filtering

What would we add? Of that, I'm not sure. It might be a good place to put the health documents, but I think they would have to be a new object/enabled: false field. Or perhaps flattened. Maybe a different shape that would be better for KQL queries. Task start/end documents might be good. Wondering if we could use this to do better estimation of the number of active Kibanas.
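As one illustration of "a different shape that would be better for KQL queries", a nested health document could be collapsed into dotted keys before it is written. This is only a sketch of that idea, done on the client side; it is not the Elasticsearch `flattened` field type, and `flattenDoc` and the sample document are made up for the example.

```typescript
type FlatDoc = Record<string, string | number | boolean | null>;

// Collapse nested objects into dotted keys; arrays are kept as JSON strings.
function flattenDoc(value: unknown, prefix = '', out: FlatDoc = {}): FlatDoc {
  if (Array.isArray(value)) {
    out[prefix] = JSON.stringify(value);
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      flattenDoc(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (
    value === null ||
    typeof value === 'string' ||
    typeof value === 'number' ||
    typeof value === 'boolean'
  ) {
    out[prefix] = value;
  }
  // Other types (undefined, functions) are skipped.
  return out;
}

// Made-up health document, not the real health API shape.
const sampleHealth = {
  status: 'warn',
  stats: { capacity: { instances: 2 } },
};
console.log(flattenDoc(sampleHealth));
```

Each leaf then becomes a top-level field such as `stats.capacity.instances`, which is the kind of name a KQL filter could target directly.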

@mikecote mikecote added technical debt Improvement of the software architecture and operational architecture discuss labels Aug 25, 2021
@mikecote mikecote added this to Backlog in Kibana Alerting Aug 25, 2021
@XavierM XavierM removed this from Backlog in Kibana Alerting Jan 6, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
4 participants