Software -> OS UI (and fleet/os_versions endpoint) unusable when large numbers of Linux hosts present #34500

@getvictor

Description

Fleet version:
4.75.0


💥  Actual behavior

Requesting /api/latest/fleet/os_versions?order_key=hosts_count&order_direction=desc&page=0&per_page=20 takes ~1 minute and occasionally times out (5XX server error).

🛠️ To fix

TBD

🧑‍💻  Steps to reproduce

  1. Create a DB with an Ubuntu OS version that has >30 kernels with vulnerabilities (in operating_systems)
  2. Hit the endpoint

🕯️ More info (optional)

We return only 20 OS versions by default. The problem is that, for each OS version, we look up all kernel vulnerabilities for every kernel that the OS version uses in the fleet. In theory, a single OS version may use 1,000 kernel versions.

The customer has Ubuntu 24.04.3 LTS with 33 kernels across their fleet (33 rows in operating_systems).

When fetching kernel vulnerabilities, we run the following query which takes ~10 seconds (on a standalone DB without traffic) and returns ~192,000 rows (5,800 distinct CVEs):

SELECT
    os.id AS os_id,
    sc.cve,
    sc.resolved_in_version,
    MIN(sc.created_at) AS created_at
FROM software_cve sc
    JOIN kernel_host_counts khc ON khc.software_id = sc.software_id
    JOIN operating_systems os ON os.os_version_id = khc.os_version_id
WHERE os.id IN (
        ?, ?, ...
    )
  AND khc.hosts_count > 0
GROUP BY os.id, sc.cve, sc.resolved_in_version;
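
To illustrate why the result set grows so quickly, here is a minimal sketch that runs the same query shape against a toy SQLite database. The schemas contain only the columns the query references, and the IDs, CVE names, and versions are made up for illustration; the real Fleet tables have more columns and run on MySQL. The sketch assumes that all operating_systems rows for one OS version share the same os_version_id, which is what the join condition suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schemas: only the columns referenced by the query (an assumption).
conn.executescript("""
CREATE TABLE operating_systems (id INTEGER PRIMARY KEY, os_version_id INTEGER);
CREATE TABLE kernel_host_counts (software_id INTEGER, os_version_id INTEGER, hosts_count INTEGER);
CREATE TABLE software_cve (software_id INTEGER, cve TEXT, resolved_in_version TEXT, created_at TEXT);

-- Three operating_systems rows that share one OS version.
INSERT INTO operating_systems VALUES (1, 10), (2, 10), (3, 10);
-- Three kernels; the last one is no longer installed on any host.
INSERT INTO kernel_host_counts VALUES (100, 10, 5), (101, 10, 2), (102, 10, 0);
INSERT INTO software_cve VALUES
    (100, 'CVE-A', '6.8.1', '2024-01-03'),
    (100, 'CVE-B', '6.8.2', '2024-01-02'),
    (101, 'CVE-B', '6.8.2', '2024-01-01'),
    (101, 'CVE-C', '6.8.3', '2024-01-04'),
    (102, 'CVE-D', '6.8.4', '2024-01-05');
""")

os_ids = [1, 2, 3]
placeholders = ", ".join("?" for _ in os_ids)
query = f"""
SELECT os.id AS os_id, sc.cve, sc.resolved_in_version, MIN(sc.created_at) AS created_at
FROM software_cve sc
    JOIN kernel_host_counts khc ON khc.software_id = sc.software_id
    JOIN operating_systems os ON os.os_version_id = khc.os_version_id
WHERE os.id IN ({placeholders})
  AND khc.hosts_count > 0
GROUP BY os.id, sc.cve, sc.resolved_in_version
"""
rows = conn.execute(query, os_ids).fetchall()
distinct_cves = {row[1] for row in rows}

# Every operating_systems row is paired with every CVE of the shared OS version.
print(len(rows), len(distinct_cves))
```

In the toy data, 3 operating_systems rows and 3 distinct CVEs yield 9 result rows. Under the same assumption, the customer's numbers are roughly consistent: 33 × 5,800 = 191,400, close to the ~192,000 rows reported.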

For each unique CVE above, we get CVE metadata, in batches of 500 CVEs. This also takes a long time.

SELECT
    cve,
    cvss_score,
    epss_probability,
    cisa_known_exploit,
    published,
    description
FROM cve_meta
WHERE cve IN (?, ?, ?, ...)
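
The batching described above can be sketched as follows. This is an illustration only, not Fleet's implementation: the `batched` helper, the toy cve_meta table in SQLite, and the synthetic CVE IDs are assumptions. Only the batch size of 500 and the column list come from the issue.

```python
import sqlite3

BATCH_SIZE = 500  # batch size stated in the issue


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cve_meta (
    cve TEXT PRIMARY KEY, cvss_score REAL, epss_probability REAL,
    cisa_known_exploit INTEGER, published TEXT, description TEXT
)
""")

# Synthetic CVE IDs standing in for the ~5,800 distinct CVEs in the issue.
cves = [f"CVE-2024-{n:05d}" for n in range(5800)]
conn.executemany("INSERT INTO cve_meta (cve) VALUES (?)", [(c,) for c in cves])

meta = {}
query_count = 0
for batch in batched(cves, BATCH_SIZE):
    # One placeholder per CVE in this batch.
    placeholders = ", ".join("?" for _ in batch)
    sql = (
        "SELECT cve, cvss_score, epss_probability, cisa_known_exploit, "
        f"published, description FROM cve_meta WHERE cve IN ({placeholders})"
    )
    for row in conn.execute(sql, batch):
        meta[row[0]] = row
    query_count += 1

print(query_count, len(meta))
```

With 5,800 CVEs and batches of 500, the loop issues 12 queries (11 full batches and one of 300), so the metadata lookup adds a dozen round trips on top of the ~10 second vulnerability query.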

Video demo of the fix

https://www.youtube.com/watch?v=4HZlKG0G1B0

QA

  • The migration populates a new table based on existing Linux kernel vulnerability data. It may take a while to run (1+ minutes) for large customer deployments. We need to test the migration on a customer DB (like numa) before release.
  • The loadtest should include Ubuntu hosts. The osquery perf updates (on main and the 4.76.0 branch) will automatically add Ubuntu patch versions and kernels to more closely represent real environments.
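
The issue does not describe the new table, so the following is only a hypothetical sketch of what a migration that precomputes kernel vulnerability data could look like. The table name `kernel_cves_by_os_version`, its columns, and the choice to key it on os_version_id are all assumptions, and the toy data runs on SQLite rather than Fleet's MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kernel_host_counts (software_id INTEGER, os_version_id INTEGER, hosts_count INTEGER);
CREATE TABLE software_cve (software_id INTEGER, cve TEXT, resolved_in_version TEXT, created_at TEXT);

-- Hypothetical precomputed table; name and columns are assumptions.
CREATE TABLE kernel_cves_by_os_version (
    os_version_id INTEGER,
    cve TEXT,
    resolved_in_version TEXT,
    created_at TEXT,
    PRIMARY KEY (os_version_id, cve, resolved_in_version)
);

INSERT INTO kernel_host_counts VALUES (100, 10, 5), (101, 10, 2), (102, 10, 0);
INSERT INTO software_cve VALUES
    (100, 'CVE-A', '6.8.1', '2024-01-03'),
    (100, 'CVE-B', '6.8.2', '2024-01-02'),
    (101, 'CVE-B', '6.8.2', '2024-01-01'),
    (101, 'CVE-C', '6.8.3', '2024-01-04'),
    (102, 'CVE-D', '6.8.4', '2024-01-05');
""")

# One-time backfill: aggregate once per OS version instead of on every request.
with conn:
    conn.execute("""
        INSERT INTO kernel_cves_by_os_version (os_version_id, cve, resolved_in_version, created_at)
        SELECT khc.os_version_id, sc.cve, sc.resolved_in_version, MIN(sc.created_at)
        FROM software_cve sc
            JOIN kernel_host_counts khc ON khc.software_id = sc.software_id
        WHERE khc.hosts_count > 0
        GROUP BY khc.os_version_id, sc.cve, sc.resolved_in_version
    """)

# Reads become a single-table lookup by OS version.
precomputed = conn.execute(
    "SELECT cve, resolved_in_version, created_at "
    "FROM kernel_cves_by_os_version WHERE os_version_id = ? ORDER BY cve",
    (10,),
).fetchall()
print(precomputed)
```

A backfill of this shape scans the full join once, which would be consistent with a migration that takes a minute or more on a large deployment; timing it against a copy of a customer DB is the only reliable check.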

Metadata

Assignees

Labels

  • #g-security-compliance (Security & Compliance product group)
  • :release (Ready to write code. Scheduled in a release. See "Making changes" in handbook.)
  • P1 (Critical: Broken workflow (critical bug), potential vuln, new feature for immediate Fleet need)
  • bug (Something isn't working as documented)
  • customer-numa
  • customer-rialto
  • ~released bug (This bug was found in a stable release.)

Projects

Status

Done
