Ubuntu hosts locking up after Fleet agent deployment #38632
Description
Fleet version: 4.79.0
Web browser and operating system: Ubuntu 24.04.03
💥 Actual behavior
Ubuntu hosts lock up or become unresponsive after full deployment of the Fleet agent. The issue may have been present before but went unreported while users could still remove the agent. Now that the agent is deployed as non-removable, the problem is blocking.
Customer has also opened a case with Canonical.
🛠️ To fix
Investigate root cause of Ubuntu host lockups. Timebox TBD at estimation.
🧑💻 Steps to reproduce
These steps:
- Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
- Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.
- Deploy Fleet agent to Ubuntu hosts with the configuration below.
- Enable all audit flags (audit_allow_config, audit_allow_sockets, audit_allow_fim_events, audit_allow_user_events, audit_allow_process_events).
- Run queries including process_events, software inventory with npm/python packages, and container process queries with wildcard patterns.
- Set distributed_interval: 10 and logger_tls_period: 10.
- Observe host lockup over time.
🕯️ More info
Customer configuration
Agent options
config:
  options:
    disable_audit: false
    disable_events: false
    pack_delimiter: /
    logger_tls_period: 10
    distributed_plugin: tls
    enable_file_events: true
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    distributed_tls_max_attempts: 3
  decorators:
    load:
      - SELECT uuid AS host_uuid FROM system_info;
      - SELECT hostname AS hostname FROM system_info;
command_line_flags:
  disable_audit: false
  disable_events: false
  audit_allow_config: true
  audit_allow_sockets: true
  audit_allow_fim_events: true
  audit_allow_user_events: true
  audit_allow_process_events: true
Queries
Files:
SELECT * FROM file WHERE directory = '/tmp/' and filename not like '%apple%';
Check for listening port to 22 (SSH):
SELECT * FROM listening_ports WHERE port = 22 AND family = 2;
Container_Processes_Execution:
SELECT dc.name AS container_name, dcp.pid, dcp.name AS process_name, dcp.cmdline
FROM docker_containers dc
JOIN docker_container_processes dcp ON dc.id = dcp.id
WHERE (dcp.cmdline LIKE '%/dev/tcp/%'
OR dcp.cmdline LIKE '%bash -i%'
OR dcp.cmdline LIKE '%bash%/dev/tcp%'
OR dcp.cmdline LIKE '%nc%'
OR dcp.cmdline LIKE '%ncat%'
OR dcp.cmdline LIKE '%python%'
OR dcp.cmdline LIKE '%perl%'
OR dcp.cmdline LIKE '%php%'
OR dcp.cmdline LIKE '%ruby%'
OR dcp.cmdline LIKE '%socat%'
OR dcp.cmdline LIKE '%telnet%'
OR dcp.cmdline LIKE '%mkfifo%'
OR dcp.cmdline LIKE '%docker.sock%'
OR dcp.cmdline LIKE '%go run%'
OR dcp.cmdline LIKE '%fsockopen%'
OR dcp.cmdline LIKE '%dup2(%'
OR dcp.cmdline LIKE '%exec 5<>%'
OR dcp.cmdline LIKE '%0<&1%'
OR dcp.cmdline LIKE '%2>&1%'
OR dcp.cmdline LIKE '%>&%'
OR dcp.cmdline LIKE '%tmate%'
OR dcp.name IN ('bash','sh','nc','ncat','python','perl','php','ruby','socat','telnet','tmate'))
AND dcp.cmdline NOT LIKE '%gunicorn -b 0.0.0.0:%';
Docker Processes:
SELECT dc.id AS container_id, dc.name AS container_name, dc.image, dc.state, dcp.pid, dcp.name AS process_name, dcp.cmdline
FROM docker_containers dc
JOIN docker_container_processes dcp ON dc.id = dcp.id;
Get applications hogging memory:
SELECT pid, name, ROUND((total_size * '10e-7'), 2) AS memory_used FROM processes ORDER BY total_size DESC LIMIT 10;
Get crashes:
SELECT uid, datetime, responsible, exception_type, identifier, version, crash_path FROM users CROSS JOIN crashes USING (uid);
Get installed Linux software:
SELECT name AS name, version AS version, 'Package (APT)' AS type, 'apt_sources' AS source FROM apt_sources
UNION SELECT name AS name, version AS version, 'Package (deb)' AS type, 'deb_packages' AS source FROM deb_packages
UNION SELECT package AS name, version AS version, 'Package (Portage)' AS type, 'portage_packages' AS source FROM portage_packages
UNION SELECT name AS name, version AS version, 'Package (RPM)' AS type, 'rpm_packages' AS source FROM rpm_packages
UNION SELECT name AS name, '' AS version, 'Package (YUM)' AS type, 'yum_sources' AS source FROM yum_sources
UNION SELECT name AS name, version AS version, 'Package (NPM)' AS type, 'npm_packages' AS source FROM npm_packages
UNION SELECT name AS name, version AS version, 'Package (Atom)' AS type, 'atom_packages' AS source FROM atom_packages
UNION SELECT name AS name, version AS version, 'Package (Python)' AS type, 'python_packages' AS source FROM python_packages;
Get_Docker_Mounts:
SELECT * FROM docker_container_mounts WHERE source LIKE '/host_mnt/Users/%' OR source LIKE '/home/%';
Is it Ubuntu that needs upgrades:
SELECT 1 WHERE NOT EXISTS (
SELECT 1 FROM deb_packages AS dp
JOIN (SELECT name AS latest_name, MAX(version) AS latest_version FROM deb_packages GROUP BY name) lv
ON dp.name = lv.latest_name
WHERE dp.version < lv.latest_version
);
SecOps-Curl:
SELECT round_trip_time FROM curl WHERE URL='https://fleetdm.com';
SecOps-Process_events:
SELECT * FROM process_events;
SecOps-process_open_sockets:
SELECT pos.local_port, pos.remote_port, pos.remote_address, p.pid, p.path
FROM process_open_sockets pos
JOIN processes p ON pos.pid = p.pid
WHERE remote_address NOT LIKE '192.168%'
AND remote_address NOT LIKE '10.%'
AND remote_address NOT LIKE '172.16.%'
AND remote_address NOT LIKE '127.%'
AND remote_address!='0.0.0.0'
AND remote_address NOT LIKE 'fe80%'
AND remote_port!='0';
SecOps-Processes:
SELECT l.port, l.pid, p.name, p.path FROM listening_ports l JOIN processes p USING (pid);
Shell History:
SELECT * FROM users CROSS JOIN shell_history USING (uid);
Scripts
apt-update.sh:
#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Running apt update..."
        apt-get update -y
        apt-get update --fix-missing
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT update completed successfully."
        else
            echo "APT update failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "Not an Ubuntu system. Skipping apt update."
        exit 0
    fi
else
    echo "/etc/os-release not found. Cannot determine OS."
    exit 1
fi
apt-upgrade.sh:
#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Updating package lists..."
        apt-get update -y
        echo "Upgrading all packages..."
        DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
        DEBIAN_FRONTEND=noninteractive apt-get autoremove --purge -y
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT upgrade completed successfully."
        else
            echo "APT upgrade failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "This is not an Ubuntu system. Skipping upgrade."
        exit 0
    fi
else
    echo "Unable to determine OS. /etc/os-release not found."
    exit 1
fi
Next steps
- Waiting for confirmation on watchdog status (command_line_flags: disable_watchdog: true). If watchdog is enabled, it should prevent hangs by killing resource-heavy queries. If disabled, that could explain why hosts hang instead of osquery terminating problematic queries.
- Collect diagnostic data:
  - Output of SELECT * FROM osquery_schedule to check if queries are being denylisted by watchdog
  - Output of SELECT * FROM osquery_events to see event counts per table
  - Output of SELECT * FROM osquery_flags to confirm actual flag settings
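The three diagnostic queries above could be collected as Fleet live queries so that they run inside the agent's osquery process on the affected host. The following is a minimal sketch, not a verified procedure: it assumes fleetctl is installed and already logged in to the Fleet server, the function name diag_commands is hypothetical, and the host identifier is a placeholder. It only prints the commands (a dry run) so they can be reviewed before anything is executed.

```shell
#!/bin/bash
# Sketch: print one `fleetctl query` command per diagnostic table listed above.
# Assumptions: fleetctl is installed and logged in; the host identifier passed
# as the first argument is a placeholder supplied by the operator.

diag_commands() {
  local host="$1"
  local table
  for table in osquery_schedule osquery_events osquery_flags; do
    # --exit asks fleetctl to return once the online hosts have responded.
    # Printing instead of executing keeps this function free of side effects.
    printf "fleetctl query --hosts '%s' --exit --query 'SELECT * FROM %s;'\n" "$host" "$table"
  done
}

# Example (hypothetical host name):
# diag_commands ubuntu-host-01
```

Reviewing the printed commands first, then running them one at a time, keeps extra load off a host that is already struggling.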
Additional context (December 2024)
Customer previously asked about running long-running scripts (2-3 hours) without getting killed.
- script_execution_timeout is configurable but capped at 3600 seconds (1 hour).
- Customer was advised to use detached/child processes via setsid or nohup for scripts exceeding the timeout.
- As long as the child process is detached from orbit, it will keep running.
- The primary process needs to exit successfully before the timeout.
Could long scripts that spawn detached processes be contributing to system resource exhaustion combined with the heavy evented table queries?
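For reference, the detached-process pattern described in the bullets above might look like the following. This is a sketch under assumptions, not the customer's actual script: the function name launch_detached, the job path, and the log path are all hypothetical, and it assumes setsid and nohup are available on the host.

```shell
#!/bin/bash
# Sketch: start a long-running job in its own session so that it keeps running
# after the launching script exits. Job command and log path are placeholders.

launch_detached() {
  local log_file="$1"
  shift
  # setsid runs the command in a new session; nohup makes it ignore SIGHUP.
  # Redirecting stdin, stdout, and stderr means the child holds none of the
  # parent script's file descriptors.
  setsid nohup "$@" >"$log_file" 2>&1 </dev/null &
  echo "started detached job; output goes to $log_file"
}

# Example (hypothetical job path):
# launch_detached /var/log/long-job.log /usr/local/bin/long-job.sh
# exit 0   # the primary script exits well before the script timeout
```

Because nothing tracks or limits jobs launched this way, repeated runs could leave several long-lived processes on the host at once, which is relevant to the resource-exhaustion question above.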
Questions for engineering
- Customer is running SELECT * FROM process_events with all audit flags enabled (audit_allow_config, audit_allow_sockets, audit_allow_fim_events, audit_allow_user_events, audit_allow_process_events). Could enabling all audit flags simultaneously cause issues at the OS/kernel level that occur before osquery's watchdog can intervene?
- Are there any known issues with osquery's Linux audit framework on Ubuntu 24.04 causing system hangs?
- If watchdog is enabled and working correctly, should the system be protected from hangs? Or can the audit framework cause kernel-level issues that watchdog cannot prevent?
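One way to gather evidence for the kernel-level questions above is to look at the Linux audit subsystem's backlog and lost-event counters on an affected host. The sketch below is an assumption-laden starting point, not a confirmed diagnostic: it assumes the audit userspace tools (auditctl) are installed, that auditctl -s prints one "key value" pair per line, and the function name audit_counters is hypothetical.

```shell
#!/bin/bash
# Sketch: pull the backlog and lost-event counters out of `auditctl -s` output.
# Assumption: the status output is one "key value" pair per line.

audit_counters() {
  # Reads `auditctl -s` output on stdin and prints only the fields of interest.
  awk '$1 == "backlog" || $1 == "backlog_limit" || $1 == "lost" { print $1 "=" $2 }'
}

# Example (requires root):
# auditctl -s | audit_counters
```

A lost counter that keeps climbing, or a backlog that sits near backlog_limit, would suggest that audit events are being produced faster than they are consumed, and would be worth attaching to this issue.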
Feature request
I will create a separate FR for expanded Linux software inventory. The customer's "Get installed Linux software" query includes tables not currently collected by Fleet's built-in software-linux vital.
Customer is querying:
- apt_sources
- deb_packages (Fleet collects)
- portage_packages (Fleet collects)
- rpm_packages (Fleet collects)
- yum_sources
- npm_packages (Fleet collects as of v4.76.0)
- atom_packages
- python_packages (Fleet has separate vital)
Not in Fleet's built-in Linux software vital:
- apt_sources
- yum_sources
- atom_packages