Orbit clients hitting /device/{token}/desktop with expired tokens causes high DB usage #44816

@rfairburn

Fleet versions

  • Discovered:
  • Reproduced:

Web browser and operating system:


💥 Actual behavior

Orbit clients that repeatedly call GET /api/latest/fleet/device/{token}/desktop with expired tokens drive high database usage, because an expensive SQL query runs on each request before rate limiting is applied. The DB load monitor shows sustained Average Active Sessions (AAS) climbing from ~1 to ~7+ during the affected period, driven primarily by the wait/io/table/sql/handler wait event.

Clients receive 401 errors while the Fleet server sees high resource consumption. Every request triggers a large multi-table JOIN in LoadHostByDeviceAuthToken:

SELECT
  h.id,
  h.osquery_host_id,
  h.created_at,
  h.updated_at,
  h.detail_updated_at,
  h.node_key,
  h.hostname,
  h.uuid,
  h.platform,
  h.osquery_version,
  h.os_version,
  h.build,
  h.platform_like,
  h.code_name,
  h.uptime,
  h.memory,
  h.cpu_type,
  h.cpu_subtype,
  h.cpu_brand,
  h.cpu_physical_cores,
  h.cpu_logical_cores,
  h.hardware_vendor,
  h.hardware_model,
  h.hardware_version,
  h.hardware_serial,
  h.computer_name,
  h.primary_ip_id,
  h.distributed_interval,
  h.logger_tls_period,
  h.config_tls_refresh,
  h.primary_ip,
  h.primary_mac,
  h.label_updated_at,
  h.last_enrolled_at,
  h.refetch_requested,
  h.refetch_critical_queries_until,
  h.team_id,
  h.policy_updated_at,
  h.public_ip,
  COALESCE(hd.gigs_disk_space_available, 0) as gigs_disk_space_available,
  COALESCE(hd.percent_disk_space_available, 0) as percent_disk_space_available,
  COALESCE(hd.gigs_total_disk_space, 0) as gigs_total_disk_space,
  hd.encrypted as disk_encryption_enabled,
  IF(hdep.host_id AND ISNULL(hdep.deleted_at), true, false) AS dep_assigned_to_fleet,
  EXISTS(SELECT 1 FROM host_identity_scep_certificates hisc WHERE hisc.host_id = h.id AND hisc.revoked = 0) as has_host_identity_cert
FROM
  host_device_auth hda
INNER JOIN
  hosts h ON hda.host_id = h.id
LEFT OUTER JOIN
  host_disks hd ON hd.host_id = hda.host_id
LEFT OUTER JOIN
  host_mdm hm ON hm.host_id = h.id
LEFT OUTER JOIN
  host_dep_assignments hdep ON hdep.host_id = h.id AND hdep.deleted_at IS NULL
WHERE
  (hda.token = ? OR hda.previous_token = ?) AND
  hda.updated_at >= DATE_SUB(NOW(), INTERVAL ? SECOND)

🛠️ Expected behavior

When Orbit clients send requests with expired device tokens, the server should:

  1. Fast-fail the authentication without executing expensive multi-table JOIN queries when the token has already expired
  2. Apply rate limiting BEFORE running the authentication query, so that burst traffic from invalid/expired tokens is throttled early
  3. Return a lightweight 401/404 response with minimal database impact

🧑‍💻 Steps to reproduce

These steps:

  • Have been confirmed to consistently lead to reproduction in multiple Fleet instances.
  • Describe the workflow that led to the issue, but have not yet been reproduced in multiple Fleet instances.
  1. Deploy a Fleet instance with many managed hosts (Fleet Desktop enabled)
  2. Configure Orbit/Fleet Desktop on endpoints (the token rotates every hour, per server/service/devices.go:263)
  3. Simulate Orbit clients repeatedly calling GET /api/latest/fleet/device/{token}/desktop with expired tokens (e.g., via the desktop-rate-limit tool)
  4. Observe high DB CPU/IO and increased Average Active Sessions
  5. Note that 401 errors are returned, but only after the expensive queries have already executed

🕯️ More info (optional)

N/A

Metadata

Labels

  • #g-orchestration: Orchestration product group
  • :release: Ready to write code. Scheduled in a release. See "Making changes" in handbook.
  • P0: Emergency: Customer outage, confirmed vuln (critical bug), new feature for immediate Fleet emergency
  • bug: Something isn't working as documented
  • customer-rosner
  • ~released bug: This bug was found in a stable release.

Status

Done
