Ubuntu hosts locking up after Fleet agent deployment #38632


**Fleet version**: 4.79.0
**Web browser and operating system**: Ubuntu 24.04.3

---

### 💥 Actual behavior

Ubuntu hosts lock up or become unresponsive after the Fleet agent is fully deployed. The issue may have been present earlier but gone unreported while users could still remove the agent. Now that the agent is deployed as non-removable, the problem is blocking.

Customer has also opened a case with Canonical.

### 🛠️ To fix

Investigate the root cause of the Ubuntu host lockups. The timebox will be set at estimation.

### 🧑‍💻 Steps to reproduce

These steps:
- [ ] Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
- [x] Describe the workflow that led to the error, but have not yet been reproduced in multiple Fleet instances.

1. Deploy the Fleet agent to Ubuntu hosts with the configuration below.
2. Enable all audit flags (`audit_allow_config`, `audit_allow_sockets`, `audit_allow_fim_events`, `audit_allow_user_events`, `audit_allow_process_events`).
3. Run the customer's queries, including `process_events`, the software inventory query with npm/python packages, and the container process queries with wildcard patterns.
4. Set `distributed_interval: 10` and `logger_tls_period: 10`.
5. Observe that the host locks up over time.

### 🕯️ More info

---

## Customer configuration

### Agent options

```yaml
config:
  options:
    disable_audit: false
    disable_events: false
    pack_delimiter: /
    logger_tls_period: 10
    distributed_plugin: tls
    enable_file_events: true
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    distributed_tls_max_attempts: 3
  decorators:
    load:
      - SELECT uuid AS host_uuid FROM system_info;
      - SELECT hostname AS hostname FROM system_info;
command_line_flags:
  disable_audit: false
  disable_events: false
  audit_allow_config: true
  audit_allow_sockets: true
  audit_allow_fim_events: true
  audit_allow_user_events: true
  audit_allow_process_events: true
```

### Queries

**Files:**
```sql
SELECT * FROM file WHERE directory = '/tmp/' and filename not like '%apple%';
```

**Check for listening port to 22 (SSH):**
```sql
SELECT * FROM listening_ports WHERE port = 22 AND family = 2;
```

**Container_Processes_Execution:**
```sql
SELECT dc.name AS container_name, dcp.pid, dcp.name AS process_name, dcp.cmdline
FROM docker_containers dc
JOIN docker_container_processes dcp ON dc.id = dcp.id
WHERE (dcp.cmdline LIKE '%/dev/tcp/%'
  OR dcp.cmdline LIKE '%bash -i%'
  OR dcp.cmdline LIKE '%bash%/dev/tcp%'
  OR dcp.cmdline LIKE '%nc%'
  OR dcp.cmdline LIKE '%ncat%'
  OR dcp.cmdline LIKE '%python%'
  OR dcp.cmdline LIKE '%perl%'
  OR dcp.cmdline LIKE '%php%'
  OR dcp.cmdline LIKE '%ruby%'
  OR dcp.cmdline LIKE '%socat%'
  OR dcp.cmdline LIKE '%telnet%'
  OR dcp.cmdline LIKE '%mkfifo%'
  OR dcp.cmdline LIKE '%docker.sock%'
  OR dcp.cmdline LIKE '%go run%'
  OR dcp.cmdline LIKE '%fsockopen%'
  OR dcp.cmdline LIKE '%dup2(%'
  OR dcp.cmdline LIKE '%exec 5<>%'
  OR dcp.cmdline LIKE '%0<&1%'
  OR dcp.cmdline LIKE '%2>&1%'
  OR dcp.cmdline LIKE '%>&%'
  OR dcp.cmdline LIKE '%tmate%'
  OR dcp.name IN ('bash','sh','nc','ncat','python','perl','php','ruby','socat','telnet','tmate'))
AND dcp.cmdline NOT LIKE '%gunicorn -b 0.0.0.0:%';
```

**Docker Processes:**
```sql
SELECT dc.id AS container_id, dc.name AS container_name, dc.image, dc.state, dcp.pid, dcp.name AS process_name, dcp.cmdline 
FROM docker_containers dc 
JOIN docker_container_processes dcp ON dc.id = dcp.id;
```

**Get applications hogging memory:**
```sql
SELECT pid, name, ROUND((total_size * '10e-7'), 2) AS memory_used FROM processes ORDER BY total_size DESC LIMIT 10;
```

**Get crashes:**
```sql
SELECT uid, datetime, responsible, exception_type, identifier, version, crash_path FROM users CROSS JOIN crashes USING (uid);
```

**Get installed Linux software:**
```sql
SELECT name AS name, version AS version, 'Package (APT)' AS type, 'apt_sources' AS source FROM apt_sources 
UNION SELECT name AS name, version AS version, 'Package (deb)' AS type, 'deb_packages' AS source FROM deb_packages 
UNION SELECT package AS name, version AS version, 'Package (Portage)' AS type, 'portage_packages' AS source FROM portage_packages 
UNION SELECT name AS name, version AS version, 'Package (RPM)' AS type, 'rpm_packages' AS source FROM rpm_packages 
UNION SELECT name AS name, '' AS version, 'Package (YUM)' AS type, 'yum_sources' AS source FROM yum_sources 
UNION SELECT name AS name, version AS version, 'Package (NPM)' AS type, 'npm_packages' AS source FROM npm_packages 
UNION SELECT name AS name, version AS version, 'Package (Atom)' AS type, 'atom_packages' AS source FROM atom_packages 
UNION SELECT name AS name, version AS version, 'Package (Python)' AS type, 'python_packages' AS source FROM python_packages;
```

**Get_Docker_Mounts:**
```sql
SELECT * FROM docker_container_mounts WHERE source LIKE '/host_mnt/Users/%' OR source LIKE '/home/%';
```

**Is it Ubuntu that needs upgrades:**
```sql
SELECT 1 WHERE NOT EXISTS (
  SELECT 1 FROM deb_packages AS dp
  JOIN (SELECT name AS latest_name, MAX(version) AS latest_version FROM deb_packages GROUP BY name) lv
  ON dp.name = lv.latest_name
  WHERE dp.version < lv.latest_version
);
```

**SecOps-Curl:**
```sql
SELECT round_trip_time FROM curl WHERE URL='https://fleetdm.com';
```

**SecOps-Process_events:**
```sql
SELECT * FROM process_events;
```

**SecOps-process_open_sockets:**
```sql
SELECT pos.local_port, pos.remote_port, pos.remote_address, p.pid, p.path 
FROM process_open_sockets pos 
JOIN processes p ON pos.pid = p.pid 
WHERE remote_address NOT LIKE '192.168%' 
  AND remote_address NOT LIKE '10.%' 
  AND remote_address NOT LIKE '172.16.%' 
  AND remote_address NOT LIKE '127.%' 
  AND remote_address!='0.0.0.0' 
  AND remote_address NOT LIKE 'fe80%' 
  AND remote_port!='0';
```

**SecOps-Processes:**
```sql
SELECT l.port, l.pid, p.name, p.path FROM listening_ports l JOIN processes p USING (pid);
```

**Shell History:**
```sql
SELECT * FROM users CROSS JOIN shell_history USING (uid);
```

### Scripts

**apt-update.sh:**
```bash
#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Running apt update..."
        apt-get update -y
        apt-get update --fix-missing
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT update completed successfully."
        else
            echo "APT update failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "Not an Ubuntu system. Skipping apt update."
        exit 0
    fi
else
    echo "/etc/os-release not found. Cannot determine OS."
    exit 1
fi
```

**apt-upgrade.sh:**
```bash
#!/bin/bash
if [ -f /etc/os-release ]; then
    . /etc/os-release
    if [[ "$ID" == "ubuntu" ]]; then
        echo "Updating package lists..."
        apt-get update -y
        echo "Upgrading all packages..."
        DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
        DEBIAN_FRONTEND=noninteractive apt-get autoremove --purge -y
        exit_code=$?
        if [[ $exit_code -eq 0 ]]; then
            echo "APT upgrade completed successfully."
        else
            echo "APT upgrade failed with exit code $exit_code."
        fi
        exit $exit_code
    else
        echo "This is not an Ubuntu system. Skipping upgrade."
        exit 0
    fi
else
    echo "Unable to determine OS. /etc/os-release not found."
    exit 1
fi
```

---

## Next steps

1. Waiting for the customer to confirm whether the osquery watchdog is disabled (`disable_watchdog: true` under `command_line_flags`). If the watchdog is enabled, it should stop resource-heavy queries, by killing the osquery worker that runs them, before they hang the host. If it is disabled, that could explain why hosts hang instead of osquery terminating the problematic queries.
2. Collect diagnostic data:
   - Output of `SELECT * FROM osquery_schedule` to check whether queries are being denylisted by the watchdog
   - Output of `SELECT * FROM osquery_events` to see event counts per table
   - Output of `SELECT * FROM osquery_flags` to confirm the actual flag settings
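
The three outputs above could be gathered in one pass as Fleet live queries. The following is a minimal sketch, assuming `fleetctl` is installed and already logged in to the customer's Fleet instance. The `collect_diagnostics` function, the `FLEETCTL` override variable, and the output file layout are illustrative names, not Fleet conventions.

```bash
#!/bin/bash
# Sketch: run the three diagnostic queries as Fleet live queries and save
# each result to its own file. Assumes `fleetctl` is installed and logged in.
# FLEETCTL and the output layout are illustrative, not Fleet conventions.

FLEETCTL="${FLEETCTL:-fleetctl}"

collect_diagnostics() {
    local host="$1"      # host identifier as known to Fleet
    local out_dir="$2"   # directory that receives one file per table
    local table
    mkdir -p "$out_dir" || return 1
    for table in osquery_schedule osquery_events osquery_flags; do
        # --exit ends the live query once the targeted online hosts have responded
        "$FLEETCTL" query --hosts "$host" --exit \
            --query "SELECT * FROM ${table};" > "${out_dir}/${table}.out" || return 1
    done
}
```

For example, `collect_diagnostics ubuntu-host-01 ./diagnostics` (the hostname is hypothetical) would leave `osquery_schedule.out`, `osquery_events.out`, and `osquery_flags.out` in `./diagnostics`. A live query only returns results while the host is still responsive, so this would need to run before a lockup occurs.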

---

## Additional context (December 2024)

The customer previously asked how to run long scripts (2-3 hours) without them being killed.

- `script_execution_timeout` is configurable but capped at 3600 seconds (1 hour).
- Customer was advised to use detached/child processes via `setsid` or `nohup` for scripts exceeding the timeout.
- As long as the child process is detached from orbit, it will keep running.
- The primary process needs to exit successfully before the timeout.
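
The detach pattern described in the points above can be sketched as follows. This is an illustration only, not the customer's actual script: `launch_detached` and the paths in the usage note are hypothetical, and the sketch assumes `setsid` and `nohup` are available on the host, as they are on a standard Ubuntu install.

```bash
#!/bin/bash
# Sketch of the detach pattern: start the long-running work in its own
# session, then let the parent script exit well before the timeout.
# `launch_detached` is a hypothetical helper name.

launch_detached() {
    local log_file="$1"
    shift
    # setsid starts the command in a new session; nohup makes it ignore SIGHUP.
    # Redirecting stdin, stdout, and stderr keeps the child from holding the
    # parent's pipes open, so the parent can finish independently.
    setsid nohup "$@" > "$log_file" 2>&1 < /dev/null &
}
```

A script run through Fleet could call `launch_detached /var/log/long-task.log /usr/local/bin/long-task.sh` and then `exit 0`. Because the detached job is no longer tied to the script that started it, nothing in this pattern limits how long it runs or how many copies accumulate if the script is triggered repeatedly.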

Could long scripts that spawn detached processes be contributing to system resource exhaustion combined with the heavy evented table queries?

---

## Questions for engineering

1. Customer is running `SELECT * FROM process_events` with all audit flags enabled (`audit_allow_config`, `audit_allow_sockets`, `audit_allow_fim_events`, `audit_allow_user_events`, `audit_allow_process_events`). Could enabling all audit flags simultaneously cause issues at the OS/kernel level that occur before osquery's watchdog can intervene?

2. Are there any known issues with osquery's Linux audit framework on Ubuntu 24.04 causing system hangs?

3. If watchdog is enabled and working correctly, should the system be protected from hangs? Or can the audit framework cause kernel-level issues that watchdog cannot prevent?
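
While these questions are open, the kernel audit subsystem's own counters could be sampled on an affected host to see whether audit records are backing up. The sketch below is an assumption-laden starting point: it assumes `auditctl` (from the `auditd` package) is installed and that `auditctl -s` prints one `key value` pair per line, which should be verified on the affected hosts. `audit_status_field` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: extract one counter from `auditctl -s` output read on stdin.
# Assumes the output is one "key value" pair per line (verify on the host).
# Exits non-zero if the requested field is not present.

audit_status_field() {
    local field="$1"
    awk -v key="$field" '$1 == key { print $2; found = 1 } END { exit !found }'
}
```

For example, `auditctl -s | audit_status_field lost` and `auditctl -s | audit_status_field backlog`, run as root, would print the current counts. A `lost` count that keeps rising, or a `backlog` close to `backlog_limit`, would suggest the kernel is producing audit records faster than they are being consumed, which would be worth including in the data sent to engineering.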

---

## Feature request

I will create a separate FR for expanded Linux software inventory. The customer's "Get installed Linux software" query includes tables not currently collected by Fleet's built-in [software-linux vital](https://fleetdm.com/vitals/software-linux#linux).

Customer is querying:
- `apt_sources`
- `deb_packages` (Fleet collects)
- `portage_packages` (Fleet collects)
- `rpm_packages` (Fleet collects)
- `yum_sources`
- `npm_packages` (Fleet collects as of v4.76.0)
- `atom_packages`
- `python_packages` (Fleet has separate vital)

Not in Fleet's built-in Linux software vital:
- `apt_sources`
- `yum_sources`  
- `atom_packages`

