Add debugging guide #6425
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| # Debugging | ||
|
|
||
| ## Goals of this guide | ||
|
|
||
| This is NOT meant to be an exhaustive list of possible issues in Fleet and how to solve them. | ||
|
|
||
| This is a guide for going from a vague statement such as "things are not working correctly" to a narrower, more | ||
| specific assessment. This doesn't necessarily mean a solution, but with a more specific assessment, it'll be easier for | ||
| the Engineering team to help. | ||
|
|
||
| Note that even if you follow all these steps, the Engineering team might have follow-up questions. | ||
|
|
||
| ## Basic data that is needed | ||
|
|
||
| While it's not strictly needed 100% of the time, in most cases it's extremely useful to have a clear understanding of | ||
| the basic characteristics of the Fleet deployment that has the issue: | ||
|
|
||
| - Total number of hosts. | ||
| - Number of online hosts. | ||
| - Number of scheduled queries. | ||
| - Number and size (CPU/Mem) of the Fleet instances. | ||
| - Fleet instances CPU and Memory usage while the issue has been happening. | ||
| - MySQL flavor/version in use. | ||
| - MySQL server capacity (CPU/Mem). | ||
| - MySQL CPU and Memory usage while the issue has been happening. | ||
| - Are MySQL read replicas configured? If so, how many? | ||
| - Redis version and server capacity (CPU/Mem). | ||
| - Is Redis running in cluster mode? | ||
| - Redis CPU and Memory usage while the issue has been happening. | ||
| - The output of `fleetctl debug archive`. | ||
|
|
||
| ## Triaging the issue | ||
|
|
||
| The first step in understanding an issue better is figuring out in what area of the system the issue is happening. There | ||
| are two main areas an issue might fall in: server side or client side. | ||
|
|
||
| A server side issue is one where one of the few pieces of infrastructure on the server encounters an issue. Some of | ||
| these pieces are: the MySQL database, the load balancer, a Fleet server instance. | ||
|
|
||
| A client side issue is one where the issue occurs on the software that runs on the hosts (i.e. the machine that runs | ||
| osquery/orbit/Fleet Desktop). | ||
|
|
||
| There are issues that span both areas, but in most cases the issue happens in one area and the other is more of | ||
| a symptom than the issue itself. So we'll continue this text with the assumption that multi-area issues are rare | ||
| and that, even when facing them, the following should help narrow things down. | ||
|
|
||
| While classifying issues as client side or server side is easy, it's also not realistic. So let's expand the | ||
| categories a bit more and "mark" each one with a keyword: | ||
|
|
||
| 1. Fleet itself (the binary/docker image running the Fleet API): `SERVER` | ||
|    1. A specific part of the Fleet UI is slow: `PARTIALSERVER` | ||
| 2. MySQL: `MYSQL` | ||
| 3. Redis: `REDIS` | ||
| 4. Infrastructure: `INFRA` | ||
| 5. osquery / orbit / Fleet Desktop: `OSQUERY` | ||
|
|
||
| With these areas in mind, here's a list of possible issues and the area you should look into for each: | ||
|
|
||
| - A specific device (or a handful of devices) is not behaving as expected -> `OSQUERY` | ||
| - A specific device appears online but its last fetch time is old -> `OSQUERY` | ||
| - The Fleet UI is slow overall -> `SERVER` | ||
| - A specific page (or a handful of pages, but not all) in the Fleet UI is slow -> `PARTIALSERVER` | ||
| - New devices cannot enroll -> `OSQUERY` | ||
| - Live query results come in very slowly -> `REDIS` or `SERVER` | ||
| - osquery Extensions are not working correctly -> `OSQUERY` | ||
| - fleetctl is getting errors when applying YAML files -> `SERVER` | ||
| - Migrations are taking too long -> `MYSQL` | ||
| - I see connection/network errors in the fleetctl or osquery logs, but not in my Fleet logs -> `INFRA` | ||
|
|
||
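The mapping above can also be written as a small lookup table. This is only an illustrative sketch: the symptom labels and the `areas_for` helper are hypothetical names made up for this example and are not part of Fleet or fleetctl.

```python
# Symptom-to-area lookup taken from the list above. The dictionary keys are
# shortened labels chosen for this sketch; they are not Fleet identifiers.
TRIAGE = {
    "device misbehaving": ["OSQUERY"],
    "device online but stale last fetch": ["OSQUERY"],
    "ui slow overall": ["SERVER"],
    "specific ui page slow": ["PARTIALSERVER"],
    "new devices cannot enroll": ["OSQUERY"],
    "live query results slow": ["REDIS", "SERVER"],
    "osquery extensions failing": ["OSQUERY"],
    "fleetctl apply errors": ["SERVER"],
    "migrations slow": ["MYSQL"],
    "network errors only in client logs": ["INFRA"],
}


def areas_for(symptom: str) -> list[str]:
    """Return the areas to investigate, or an empty list if the symptom is unknown."""
    return TRIAGE.get(symptom.strip().lower(), [])
```

For example, `areas_for("Live query results slow")` returns both `REDIS` and `SERVER`, which matches the one entry in the list that points at two areas.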
| ### SERVER | ||
|
|
||
| Whenever diagnosing a server side issue, one of the first steps is to look at Fleet itself. In particular, that means | ||
| looking at the logs across all instances that are running. How to look at these logs varies depending on your | ||
| deployment. If, for instance, it's an AWS deployment, and you're using our Terraform files as guidance, you'd use | ||
| CloudWatch. | ||
|
|
||
| Fleet by default will log errors, and those are the first thing to look for. If you have debug logging enabled, you can | ||
| find the errors by filtering on the keyword `err`. | ||
|
|
||
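As a rough sketch of that filtering step, the snippet below keeps only the lines that contain the keyword `err`. The `filter_errors` helper and the sample lines are made up for illustration; real Fleet log output may be formatted differently, and in practice your log platform's own search will usually do this for you.

```python
from typing import Iterable, Iterator


def filter_errors(lines: Iterable[str], keyword: str = "err") -> Iterator[str]:
    """Yield only the lines that contain the keyword, without the trailing newline."""
    for line in lines:
        if keyword in line:
            yield line.rstrip("\n")


# Made-up sample lines; real Fleet log output will look different.
sample = [
    'level=debug msg="request ok"\n',
    'level=error err="timeout talking to database"\n',
]
matches = list(filter_errors(sample))
```

A plain substring match is deliberately loose: it also catches words such as `error`, which is usually what you want during a first pass.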
| These logs will be the first way to triage a server side error. For example, if there are timeouts happening in APIs, | ||
| you should continue by looking at `MYSQL` and then `REDIS`. Otherwise, if it looks like a more illustrative error, this | ||
| would be a good point to reach out with all the information gathered. | ||
|
|
||
| If there are no errors in the logs and everything looks normal, check `INFRA`. | ||
|
|
||
| ### PARTIALSERVER | ||
|
|
||
| Sometimes Fleet operates without any errors but accessing a specific part of the web UI is slow. As a starting point it | ||
| would be good to get a screenshot of the Network tab in the Developer Tools of your browser. The main data that needs to | ||
| be visible are: Name, Status, and Time (in Chrome's terms). | ||
| [Here's how to accomplish this using Google Chrome](https://developer.chrome.com/docs/devtools/network/). | ||
|
|
||
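If a screenshot is impractical, the same three columns can be pulled from a HAR export of the Network tab. This is an optional sketch that is not part of the guide itself: it assumes the export follows the HAR 1.2 layout (`log.entries[]` with `request.url`, `response.status`, and `time` in milliseconds), and `slowest_requests` is a hypothetical helper name.

```python
import json


def slowest_requests(har_text: str, top: int = 5) -> list[tuple[str, int, float]]:
    """Return (url, status, time_ms) for the slowest requests in a HAR export."""
    har = json.loads(har_text)
    rows = [
        (entry["request"]["url"], entry["response"]["status"], entry["time"])
        for entry in har["log"]["entries"]
    ]
    # Sort by elapsed time, slowest first, and keep the top N.
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]
```

Note that a HAR file can contain cookies and tokens, so review it before sharing it with anyone.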
| Besides this, it might be good to continue with `MYSQL` and `REDIS`. | ||
|
|
||
| Depending on the API, there will likely be followup questions about amount of data, but this would be a good point to | ||
| check in with Engineering. | ||
|
|
||
| ### MYSQL | ||
|
|
||
| Most of the data needed to understand an issue in MySQL should've been gathered already by the basic data specified at | ||
| the beginning of this document. However, there is a chance that Fleet is running with a database user that is not | ||
| capable of querying the information needed. So here are the queries that would output a good first step in information | ||
| gathering: | ||
|
|
||
| ```sql | ||
| show engine innodb status; | ||
| show processlist; | ||
| ``` | ||
|
|
||
| If read replicas are configured, another piece of important data is whether there has been any replication lag | ||
| registered. | ||
|
|
||
| With all this gathered, it's a good time to reach out to the Engineering team. | ||
|
|
||
| ### REDIS | ||
|
|
||
| In most cases, the data gathered at the beginning of this document should be enough to understand what might be | ||
| happening with Redis. However, if more details are needed, running the | ||
| [monitor command](https://redis.io/commands/monitor/) should shed more light on the issue. | ||
|
|
||
|
|
||
| **WARNING**: if Redis is suffering from performance issues, running monitor will only increase the problem. | ||
|
|
||
| A less invasive way to get more stats, if ElastiCache (or another system with more reporting) is being used, is to | ||
| look at other metrics: current connections, replication lag (if applicable), whether one instance is used much more | ||
| heavily than the others in cluster mode, and the number of commands per key type. These could help identify what is wrong. | ||
|
|
||
| ### OSQUERY | ||
|
|
||
| Just like with the Fleet server, the best way to understand issues on the client side is to look at logs. | ||
|
|
||
|
|
||
| If you are running vanilla osquery on the host, please restart osquery with `--tls_dump` and `--verbose`. This will | ||
| allow us to see more details as to what's happening in the communication with Fleet (or lack thereof). Check the | ||
| [official documentation](https://osquery.readthedocs.io/en/stable/deployment/logging/) for details as to how to locate | ||
| the logs and other configurations. | ||
|
|
||
| If you are running Orbit, you should add `--debug` to the command line options. This will get debug logs for Orbit and | ||
| also for osquery automatically. Check the [Orbit README](https://github.com/fleetdm/fleet/blob/main/orbit/README.md#logs) | ||
| for more details as to where to find Orbit specific logs. | ||
|
|
||
| If you are running Fleet Desktop, no change is needed. You should see the log file in one of the following directories, | ||
| depending on the platform: | ||
|
|
||
| - Linux: `$XDG_STATE_HOME/Fleet` or `$HOME/.local/state/Fleet` | ||
| - macOS: `$HOME/Library/Logs/Fleet` | ||
| - Windows: `%LocalAppData%/Fleet` | ||
|
|
||
| The log file name is `fleet-desktop.log`. | ||
|
|
||
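The directory list above can be turned into a small path helper. This is a minimal sketch rather than anything Fleet ships: `fleet_desktop_log_path` is a hypothetical name, and it assumes the usual `HOME`, `XDG_STATE_HOME`, and `LOCALAPPDATA` environment variables are set.

```python
import os
import sys
from pathlib import Path

LOG_FILE_NAME = "fleet-desktop.log"


def fleet_desktop_log_path(platform: str = sys.platform, env=None) -> Path:
    """Build the expected Fleet Desktop log path for the given platform."""
    env = os.environ if env is None else env
    if platform.startswith("linux"):
        # Prefer $XDG_STATE_HOME, fall back to $HOME/.local/state.
        state_home = env.get("XDG_STATE_HOME")
        base = Path(state_home) if state_home else Path(env["HOME"]) / ".local" / "state"
    elif platform == "darwin":
        base = Path(env["HOME"]) / "Library" / "Logs"
    elif platform == "win32":
        base = Path(env["LOCALAPPDATA"])
    else:
        raise ValueError(f"unsupported platform: {platform}")
    return base / "Fleet" / LOG_FILE_NAME
```

Calling `fleet_desktop_log_path()` with no arguments uses the current platform and environment. The function only builds the path and does not check that the file exists.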
| If the issue is related to osquery extensions, the following data would be needed: | ||
|
|
||
| - osquery version | ||
| - OS it's running on | ||
| - What does the extension do? | ||
| - How is the extension queried/deployed? | ||
| - What language is the extension implemented in? | ||
| - What's the nature of the problem? (i.e. whether the extension is respawning, or whether the extension can’t connect, | ||
| or extension is up/working and then dies and can’t reconnect) | ||
|
|
||
| With this data, it's time to reach out to Engineering. | ||
|
|
||
| ### INFRA | ||
|
|
||
| At this level, what you want to look into are Load Balancer logs, errors, and configurations. For instance, does the LB | ||
|
|
||
| have a request size limit? If the LB is not terminating TLS, is that configured properly on the Fleet side? | ||
|
|
||
| Make sure as well that your cloud provider is not having issues of its own. For instance, | ||
| [here](https://health.aws.amazon.com/health/status) is where to check for status for AWS. | ||
|
|
||
| <meta name="pageOrderInSection" value="600"> | ||
How about approximate amount of scheduled queries and policies.
Also, I wonder if we can make `fleetctl debug archive` answer all these questions for us... (have it generate a report.txt or something with number of users, etc.)
We certainly can for some of them. We could check the size of all tables.