Skip to content

Periodically clean up nano_ tables so that they do not grow unreasonably large #34950

@JordanMontgomery

Description

@JordanMontgomery

Goal

User story
As a fleet developer,
I want to periodically clean old refetch entries from nano_commands, nano_enrollment_queue and nano_command_results
so that I can trust that querying those tables stays performant.

Roadmap item

Original requests

Related to #34797 but no specific customer request

Resources

Just some back of the envelope math:
A deployment with iOS/iPadOS devices will generate:

3 nano_commands entries every 10 minutes for REFETCH-* commands which will be sent to any devices that haven't had their details updated in 1 hour and don't have outstanding REFETCH commands.

Which works out to 3 entries in nano_enrollment_queue and nano_command_results every hour for online and responding devices.

Assuming devices are online enough to respond for 12 hours a day(optimistically low perhaps) we get:

3 commands * 500 devices * 12 hours = 18000 entries per day = 6,570,000 entries per year. And we never delete them which leads to significant table bloat. For customer-fairbank 95% of their entries in nano_enrollment_queue and nano_command_results were REFETCH-* entries and this data is of low use(it all gets pumped into things like host details and softwre inventory anyways) but slows down queries like the "which devices have active commands" query pretty badly.

Current thinking is that unless a user disables it we could delete all previous(or perhaps some previous - maybe 3 at a time) REFETCH-* nano_enrollment_queue and nano_command_results entries of a given type for a device. This should spread out the deletes enough not to impact performance too badly while also shrinking the table quite a bit over time

Changes

Nano commands deletion: Whenever a REFETCH command is successfully acknowledged by a host, delete up to 3(ths was originally 100 but changed to 3 for performance/load reasons) nano_command_results and nano_enrollment_queue entries, of the same type of commands, of commands sent and acknowledged(or errored) at least 1 month prior. The current command should never be deleted, only old ones, but they need not be the oldest. Data ingestion based on the REFETCH command will proceed as expected with no changes.

As a technical note, you should be absolutely sure that corresponding nano_enrollment_queue and nano_command_results entries are deleted in the same transaction so as not to risk re-sending or any other issues

Add a cron job that runs hourly, SELECTs up to 100 REFETCH- prefixed commands older than 30 days from nano_commands. Delete any which have no references in nano_enrollment_queue and nano_command_results.

The only user visible change, listed below, will be that admins will over time see less historic entries for iOS/iPadOS host refetch commands

Product

  • UI changes: None
  • CLI (fleetctl) usage changes: None
  • YAML changes: None
  • REST API changes: None
  • Fleet's agent (fleetd) changes: None
  • GitOps mode UI changes:None
  • GitOps generation changes: None
  • Activity changes: None
  • Permissions changes: None
  • Changes to paid features or tiers: None
  • My device and fleetdm.com/better changes: None
  • Usage statistics: None
  • Other reference documentation changes: We likely need to call out that we will no longer be listing old results for the regularly-scheduled ListInstalledApplication, ListCertificates and DeviceInformation
  • First draft of test plan added
  • Once shipped, requester has been notified
  • Once shipped, dogfooding issue has been filed

Engineering

  • Test plan is finalized
  • Contributor API changes: None
  • Feature guide changes: Update https://fleetdm.com/guides/mdm-commands#troubleshooting to callout this behavior(lack of old refetch commands) and also add callout for the host filtering parameter that can be used
  • Database schema migrations: None
  • Load testing: Build out a test with 1500 simulated iOS/iPadOS devices an an instrumented that sends refetch queries every 5 minutes to iOS/iPadOS host. Let this run for at least a few days. Run a query so that all of those commands are now > 30 days in the past in terms of created_at timestamps, then upgrade to this build and verify that there is not significantly increased load during deletion(and also that the number of commands in the table shrinks)
    This branch can be used to load up the DB prior to deploying the fix branch: DO NOT MERGE/FOR QA ONLY: Make refetches happen way faster #42644 . If you run it for a couple of days you will have very large nano tables. Note that if the DB is already under heavy load/at 99% you may still see some contention since the tables are already under contention but if the DB is not at that point yet you should not see heavy load associated with deletes
  • Load testing/osquery-perf improvements: None
  • This is a premium only feature: No

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: Yes, see above
  • Risk level: Low

Test plan

Make sure to go through the list and consider all events that might be related to this story, so we catch edge cases earlier.

  1. Verify that iOS/iPhone refetching operates as expected(e.g. hourly cadence) including software and certificate lists updating for all enrollment types(ADE, profile-based manual enrollment, personal enrollment)
  2. Verify that the mdm command list in the UI returns at least the result of the most recent REFETCH- commands
  3. Verify that nano_enrollment_queue and nano_command_results DB tables decrease in size when a system is "Steady state"(i.e. no new profiles/VPP apps being pushed) for several days and all of the refetch commands are set to > 30 days old
  4. Verify that if a load test is done as described in the Engineering section above the size of nano_enrollment_queue and nano_command_results is reduced
  5. Send a manual InstalledApplicationList, CertificateList and DeviceInformation query via the commands API(e.g. not in the refetch flow). Use DB queries to update their nano_commands, nano_enrollment_queue and nano_command_results table creatd_at and updated_at values to be greater than 30 days in the past. Verify that they do not get deleted, even if the device has less than 100 prior commands in the list

Testing notes

It may help for engineering to provide an instrumented build of a prior release which runs the refetcher much more often than usual to help fill up these tables so that the impact of the deletion can be observed. Engineering can also provide the queries discussed above

Confirmation

  1. Engineer: Added comment to user story confirming successful completion of test plan.
  2. QA: Added comment to user story confirming successful completion of test plan.

Metadata

Metadata

Labels

#g-mdmMDM product group:releaseReady to write code. Scheduled in a release. See "Making changes" in handbook.P2Urgent: Supported workflow not functioning as intended, newly drafted feature with urgent Fleet needcustomer-deebradelcustomer-fairbankcustomer-hawkingstoryA user story defining an entire feature~engineering-initiatedEngineering-initiated story, such as a bug, refactor, or contributor experience improvement.

Type

No type

Projects

Status

✅ Ready for release

Status

✅ Ready for release

Status

No status

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions