Ubuntu hosts locking up after Fleet agent deployment #38632

@AdamBaali

Description

Fleet version: 4.79.0
Web browser and operating system: Ubuntu 24.04.3


💥 Actual behavior

Ubuntu hosts lock up or become unresponsive after full deployment of the Fleet agent. The issue may have been present earlier but gone unreported while users could still remove the agent. Now that the agent is deployed as non-removable, the problem is blocking.

Customer has also opened a case with Canonical.

🛠️ To fix

Investigate root cause of Ubuntu host lockups. Timebox TBD at estimation.

🧑‍💻 Steps to reproduce

These steps:

  • Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
  • Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.
  1. Deploy Fleet agent to Ubuntu hosts with the configuration below.
  2. Enable all audit flags (audit_allow_config, audit_allow_sockets, audit_allow_fim_events, audit_allow_user_events, audit_allow_process_events).
  3. Run queries including process_events, software inventory with npm/python packages, and container process queries with wildcard patterns.
  4. Set distributed_interval: 10 and logger_tls_period: 10.
  5. Observe host lockup over time.
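
To put step 4 in numbers, the sketch below estimates how many requests each host makes per day at a given interval. It assumes roughly one request per interval for each setting, which is a simplification and not a statement of how osquery actually schedules these requests. `requests_per_day` is a hypothetical helper, not part of Fleet or osquery.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(interval_seconds: int) -> int:
    """Approximate requests per host per day, assuming one request per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return SECONDS_PER_DAY // interval_seconds


# Interval values taken from the customer's agent options.
distributed = requests_per_day(10)  # distributed_interval: 10
logger = requests_per_day(10)       # logger_tls_period: 10

print(distributed, logger, distributed + logger)  # 8640 8640 17280
```

Under that assumption, the two settings together account for roughly 17,000 requests per host per day, which may be worth keeping in mind alongside the query load.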

🕯️ More info


Customer configuration

Agent options

config:
  options:
    disable_audit: false
    disable_events: false
    pack_delimiter: /
    logger_tls_period: 10
    distributed_plugin: tls
    enable_file_events: true
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    distributed_tls_max_attempts: 3
  decorators:
    load:
      - SELECT uuid AS host_uuid FROM system_info;
      - SELECT hostname AS hostname FROM system_info;
command_line_flags:
  disable_audit: false
  disable_events: false
  audit_allow_config: true
  audit_allow_sockets: true
  audit_allow_fim_events: true
  audit_allow_user_events: true
  audit_allow_process_events: true

Queries

Files:

SELECT * FROM file WHERE directory = '/tmp/' and filename not like '%apple%';

Check for listening port to 22 (SSH):

SELECT * FROM listening_ports WHERE port = 22 AND family = 2;

Container_Processes_Execution:

SELECT dc.name AS container_name, dcp.pid, dcp.name AS process_name, dcp.cmdline
FROM docker_containers dc
JOIN docker_container_processes dcp ON dc.id = dcp.id
WHERE (dcp.cmdline LIKE '%/dev/tcp/%'
  OR dcp.cmdline LIKE '%bash -i%'
  OR dcp.cmdline LIKE '%bash%/dev/tcp%'
  OR dcp.cmdline LIKE '%nc%'
  OR dcp.cmdline LIKE '%ncat%'
  OR dcp.cmdline LIKE '%python%'
  OR dcp.cmdline LIKE '%perl%'
  OR dcp.cmdline LIKE '%php%'
  OR dcp.cmdline LIKE '%ruby%'
  OR dcp.cmdline LIKE '%socat%'
  OR dcp.cmdline LIKE '%telnet%'
  OR dcp.cmdline LIKE '%mkfifo%'
  OR dcp.cmdline LIKE '%docker.sock%'
  OR dcp.cmdline LIKE '%go run%'
  OR dcp.cmdline LIKE '%fsockopen%'
  OR dcp.cmdline LIKE '%dup2(%'
  OR dcp.cmdline LIKE '%exec 5<>%'
  OR dcp.cmdline LIKE '%0<&1%'
  OR dcp.cmdline LIKE '%2>&1%'
  OR dcp.cmdline LIKE '%>&%'
  OR dcp.cmdline LIKE '%tmate%'
  OR dcp.name IN ('bash','sh','nc','ncat','python','perl','php','ruby','socat','telnet','tmate'))
AND dcp.cmdline NOT LIKE '%gunicorn -b 0.0.0.0:%';

Docker Processes:

SELECT dc.id AS container_id, dc.name AS container_name, dc.image, dc.state, dcp.pid, dcp.name AS process_name, dcp.cmdline 
FROM docker_containers dc 
JOIN docker_container_processes dcp ON dc.id = dcp.id;

Get applications hogging memory:

SELECT pid, name, ROUND((total_size * '10e-7'), 2) AS memory_used FROM processes ORDER BY total_size DESC LIMIT 10;

Get crashes:

SELECT uid, datetime, responsible, exception_type, identifier, version, crash_path FROM users CROSS JOIN crashes USING (uid);

Get installed Linux software:

SELECT name AS name, version AS version, 'Package (APT)' AS type, 'apt_sources' AS source FROM apt_sources 
UNION SELECT name AS name, version AS version, 'Package (deb)' AS type, 'deb_packages' AS source FROM deb_packages 
UNION SELECT package AS name, version AS version, 'Package (Portage)' AS type, 'portage_packages' AS source FROM portage_packages 
UNION SELECT name AS name, version AS version, 'Package (RPM)' AS type, 'rpm_packages' AS source FROM rpm_packages 
UNION SELECT name AS name, '' AS version, 'Package (YUM)' AS type, 'yum_sources' AS source FROM yum_sources 
UNION SELECT name AS name, version AS version, 'Package (NPM)' AS type, 'npm_packages' AS source FROM npm_packages 
UNION SELECT name AS name, version AS version, 'Package (Atom)' AS type, 'atom_packages' AS source FROM atom_packages 
UNION SELECT name AS name, version AS version, 'Package (Python)' AS type, 'python_packages' AS source FROM python_packages;

Get_Docker_Mounts:

SELECT * FROM docker_container_mounts WHERE source LIKE '/host_mnt/Users/%' OR source LIKE '/home/%';

Is it Ubuntu that needs upgrades:

SELECT 1 WHERE NOT EXISTS (
  SELECT 1 FROM deb_packages AS dp
  JOIN (SELECT name AS latest_name, MAX(version) AS latest_version FROM deb_packages GROUP BY name) lv
  ON dp.name = lv.latest_name
  WHERE dp.version < lv.latest_version
);

SecOps-Curl:

SELECT round_trip_time FROM curl WHERE URL='https://fleetdm.com';

SecOps-Process_events:

SELECT * FROM process_events;

SecOps-process_open_sockets:

SELECT pos.local_port, pos.remote_port, pos.remote_address, p.pid, p.path 
FROM process_open_sockets pos 
JOIN processes p ON pos.pid = p.pid 
WHERE remote_address NOT LIKE '192.168%' 
  AND remote_address NOT LIKE '10.%' 
  AND remote_address NOT LIKE '172.16.%' 
  AND remote_address NOT LIKE '127.%' 
  AND remote_address!='0.0.0.0' 
  AND remote_address NOT LIKE 'fe80%' 
  AND remote_port!='0';

SecOps-Processes:

SELECT l.port, l.pid, p.name, p.path FROM listening_ports l JOIN processes p USING (pid);

Shell History:

SELECT * FROM users CROSS JOIN shell_history USING (uid);

Scripts

apt-update.sh:

#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Running apt update..."
        apt-get update -y
        apt-get update --fix-missing
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT update completed successfully."
        else
            echo "APT update failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "Not an Ubuntu system. Skipping apt update."
        exit 0
    fi
else
    echo "/etc/os-release not found. Cannot determine OS."
    exit 1
fi

apt-upgrade.sh:

#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Updating package lists..."
        apt-get update -y
        echo "Upgrading all packages..."
        DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
        DEBIAN_FRONTEND=noninteractive apt-get autoremove --purge -y
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT upgrade completed successfully."
        else
            echo "APT upgrade failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "This is not an Ubuntu system. Skipping upgrade."
        exit 0
    fi
else
    echo "Unable to determine OS. /etc/os-release not found."
    exit 1
fi

Next steps

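  1. Waiting for confirmation of the watchdog status, i.e. whether command_line_flags sets disable_watchdog: true. If the watchdog is enabled, it should help prevent hangs by restarting the osquery worker process when it exceeds its CPU or memory limits and denylisting the queries that were running at the time. If it is disabled, that could explain why hosts hang instead of osquery shedding the problematic queries.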
  2. Collect diagnostic data:
    • Output of SELECT * FROM osquery_schedule to check if queries are being denylisted by watchdog
    • Output of SELECT * FROM osquery_events to see event counts per table
    • Output of SELECT * FROM osquery_flags to confirm actual flag settings
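
Once the three outputs above are available as JSON, they could be summarized with something like the following sketch. It assumes each result is a list of row objects with string values and relies on the `name` and `denylisted` columns of osquery_schedule, `name` and `events` of osquery_events, and `name` and `value` of osquery_flags. The helper names and all sample rows are invented for illustration and are not data from the customer's hosts.

```python
import json


def denylisted_queries(schedule_rows):
    """Names of scheduled queries whose 'denylisted' column is set to 1."""
    return sorted(
        row["name"]
        for row in schedule_rows
        if str(row.get("denylisted", "0")) == "1"
    )


def busiest_event_sources(event_rows, top=3):
    """The 'top' rows of osquery_events ranked by event count, highest first."""
    ranked = sorted(event_rows, key=lambda row: int(row["events"]), reverse=True)
    return [(row["name"], int(row["events"])) for row in ranked[:top]]


def flag_values(flag_rows, names):
    """Current value of each requested flag, or None if the flag is not listed."""
    current = {row["name"]: row["value"] for row in flag_rows}
    return {name: current.get(name) for name in names}


# Hypothetical sample data (NOT from the customer's environment).
SAMPLE_SCHEDULE = json.loads(
    '[{"name": "example_query_a", "denylisted": "0"},'
    ' {"name": "example_query_b", "denylisted": "1"}]'
)
SAMPLE_EVENTS = json.loads(
    '[{"name": "example_subscriber_a", "events": "15"},'
    ' {"name": "example_subscriber_b", "events": "1200"}]'
)
SAMPLE_FLAGS = json.loads('[{"name": "disable_watchdog", "value": "false"}]')

print(denylisted_queries(SAMPLE_SCHEDULE))
print(busiest_event_sources(SAMPLE_EVENTS, top=1))
print(flag_values(SAMPLE_FLAGS, ["disable_watchdog", "disable_audit"]))
```

A flag that comes back as None here only means it was absent from the supplied rows, not that it is unset on the host.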

Additional context (December 2024)

Customer previously asked about running long-running scripts (2-3 hours) without getting killed.

  • script_execution_timeout is configurable but capped at 3600 seconds (1 hour).
  • Customer was advised to use detached/child processes via setsid or nohup for scripts exceeding the timeout.
  • As long as the child process is detached from orbit, it will keep running.
  • The primary process needs to exit successfully before the timeout.
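
The detach-and-exit pattern described above can be sketched as follows. It is shown in Python for illustration only; a shell script would call setsid or nohup directly, and this is not a description of how orbit runs scripts. `launch_detached`, the stand-in command and the log path are all hypothetical, and the sketch is POSIX-only.

```python
import os
import subprocess
import sys
import tempfile


def launch_detached(cmd, log_path):
    """Start cmd in its own session (the effect of setsid) and return its PID."""
    with open(log_path, "ab") as log:
        child = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no terminal input for the child
            stdout=log,                 # child keeps its own copy of the log handle
            stderr=subprocess.STDOUT,
            start_new_session=True,     # child calls setsid() before exec
        )
    return child.pid


# Stand-in for a long-running job (hypothetical command and log path).
log_file = os.path.join(tempfile.gettempdir(), "long_job_example.log")
pid = launch_detached([sys.executable, "-c", "import time; time.sleep(2)"], log_file)
print("started detached job with PID", pid)
```

The launcher returns as soon as the child has started, so the primary process can exit well inside the script timeout while the job continues in its own session.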

Could long scripts that spawn detached processes be contributing to system resource exhaustion combined with the heavy evented table queries?


Questions for engineering

  1. Customer is running SELECT * FROM process_events with all audit flags enabled (audit_allow_config, audit_allow_sockets, audit_allow_fim_events, audit_allow_user_events, audit_allow_process_events). Could enabling all audit flags simultaneously cause issues at the OS/kernel level that occur before osquery's watchdog can intervene?

  2. Are there any known issues with osquery's Linux audit framework on Ubuntu 24.04 causing system hangs?

  3. If watchdog is enabled and working correctly, should the system be protected from hangs? Or can the audit framework cause kernel-level issues that watchdog cannot prevent?
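
One possible data point for the questions above is the kernel audit status on an affected host, for example the output of `auditctl -s`. The sketch below parses that output under the assumption that it consists of one `key value` pair per line; the exact layout can differ between audit versions, and the sample text is invented, not captured from a customer host. `parse_audit_status` and `backlog_summary` are hypothetical helpers.

```python
def parse_audit_status(text):
    """Parse 'key value' lines (assumed auditctl -s layout) into a dict of ints."""
    status = {}
    for line in text.splitlines():
        parts = line.split()
        # Lines that are not exactly 'key <integer>' are skipped.
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            status[parts[0]] = int(parts[1])
    return status


def backlog_summary(status):
    """Report lost events and how full the audit backlog is (0.0 to 1.0)."""
    limit = status.get("backlog_limit", 0)
    backlog = status.get("backlog", 0)
    return {
        "lost": status.get("lost", 0),
        "backlog_fill": (backlog / limit) if limit else None,
    }


# Invented sample output for illustration only.
SAMPLE_STATUS = """\
enabled 1
failure 1
pid 1234
rate_limit 0
backlog_limit 8192
lost 0
backlog 4096
"""

summary = backlog_summary(parse_audit_status(SAMPLE_STATUS))
print(summary)
```

A persistently full backlog or a growing lost count would not prove that the audit subsystem causes the hangs, but it could help narrow down whether the kernel side is under pressure.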


Feature request

I will create a separate FR for expanded Linux software inventory. The customer's "Get installed Linux software" query includes tables not currently collected by Fleet's built-in software-linux vital.

Customer is querying:

  • apt_sources
  • deb_packages (Fleet collects)
  • portage_packages (Fleet collects)
  • rpm_packages (Fleet collects)
  • yum_sources
  • npm_packages (Fleet collects as of v4.76.0)
  • atom_packages
  • python_packages (Fleet has separate vital)

Not in Fleet's built-in Linux software vital:

  • apt_sources
  • yum_sources
  • atom_packages

Metadata

Assignees

Labels

  • #g-orchestration: Orchestration product group
  • :release: Ready to write code. Scheduled in a release. See "Making changes" in handbook.
  • bug: Something isn't working as documented
  • customer-firenze
  • ~timebox: A task that is completed in a predetermined amount of time.

Type

No type

Projects

Status

Done

Milestone

Relationships

None yet

Development

No branches or pull requests
