fc-agent: improve maintenance scheduling #671

Merged
merged 1 commit into from
Jul 19, 2023

Conversation

dpausp
Member

@dpausp dpausp commented Feb 28, 2023

  • New requests can now be merged with existing ones if their activities have the same type. Significant changes to activities cause postponing of the updated request.
  • Requests can be cancelled by requesting an activity which nullifies the original activity (for example, resetting the channel back to the current system channel cancels the planned update).
  • UpdateActivity with metadata and better comment generation replaces dumb shell scripts for planned system updates.
  • VMChangeActivity with metadata replaces RebootActivity for mem and core changes.
  • All activities can request a reboot which will be done after all due requests have been executed.
  • Continuously scheduled requests will be executed in one go if at least the first request is due, avoiding repeated switching to maintenance mode in a short time frame and possibly unnecessary reboots.
  • Overdue requests (more than 30 minutes after scheduled start time) will be postponed to avoid overrunning the planned maintenance window or interfering with other machines going into maintenance mode.
  • Maintenance preparation time and request execution time are different concepts now. Execution of requests is typically quite fast but there may be commands delaying the execution of all requests. Directory doesn't support this yet so we just report the sum of preparation time and estimated execution time (but at least 15min).
  • Un-tangled maintenance code and manage.py: all maintenance requests are now generated in maintenance.py.
  • Fix handling of postponed requests and clean up state updates in the process. tempfail and retrylimit don't exist anymore as dedicated states.
  • Time-saving shortcut for updates: if the new channel of an UpdateActivity results in the same system, just set the system channel and forget about the update.
  • Explicitly exit after calling the reboot command.
  • Reduce the number of channel URL resolve calls (which impact Hydra): UpdateActivity now expects a resolved URL.
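
The merge and cancel rules from the first two points can be illustrated with a minimal sketch. Every name here (`MergeResult`, `ChannelUpdate`, `current_channel`, `next_channel`) is hypothetical and chosen for illustration only; this is not the fc-agent API, and treating any change of target channel as "significant" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergeResult:
    """Outcome of merging two activities (hypothetical type)."""

    merged: Optional["ChannelUpdate"] = None  # None: types differ, nothing merged
    is_effective: bool = False  # False: the merged activity would change nothing
    is_significant: bool = False  # True: the updated request should be postponed


@dataclass
class ChannelUpdate:
    """Stand-in for an update activity; the field names are assumptions."""

    current_channel: str
    next_channel: str

    def merge(self, other: object) -> MergeResult:
        # Only activities of the same type can be merged.
        if not isinstance(other, ChannelUpdate):
            return MergeResult()
        # The newer request wins: keep our starting point, take its target.
        merged = ChannelUpdate(self.current_channel, other.next_channel)
        # Going back to the current channel nullifies the planned update.
        effective = merged.next_channel != merged.current_channel
        # Assumption: a different target channel counts as a significant change.
        significant = effective and other.next_channel != self.next_channel
        return MergeResult(merged, effective, significant)
```

With this sketch, a caller would drop the request when `is_effective` is false and postpone it when `is_significant` is true.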

@flyingcircusio/release-managers

Release process

Impact:

Changelog:

  • agent: improve scheduling of maintenance activities for system updates and VM property changes (memory, CPU cores). The main change is that activities can be merged/updated now. As a result, the number of reboots is reduced and multiple pending updates can be applied faster. Activities can be cancelled if they are no longer effective, for example if a memory change is requested in error and reset to the previous value some time later.
    Reboots for kernel updates now happen directly after system updates, avoiding scheduling another maintenance for the reboot. This also fixes the long-standing bug that delayed activities could be executed outside of maintenance windows. Activities that are overdue (more than 30min after planned time) are postponed for at least 8 hours and scheduled again (PL-129777).
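
As a rough sketch of the timing rules mentioned here and in the description (30 minutes overdue threshold, postponing by at least 8 hours, reporting at least 15 minutes), the arithmetic might look as follows. The function and constant names are made up for illustration, and counting the 8 hours from the current time is an assumption.

```python
from datetime import datetime, timedelta

# Values taken from the PR text; the names are hypothetical.
OVERDUE_AFTER = timedelta(minutes=30)
MIN_POSTPONE = timedelta(hours=8)
MIN_REPORTED_DURATION = timedelta(minutes=15)


def is_overdue(scheduled_start: datetime, now: datetime) -> bool:
    """A request is overdue more than 30 minutes after its scheduled start."""
    return now - scheduled_start > OVERDUE_AFTER


def earliest_retry(now: datetime) -> datetime:
    """Assumption: the 8-hour postponement is counted from the current time."""
    return now + MIN_POSTPONE


def reported_duration(preparation: timedelta, execution: timedelta) -> timedelta:
    """Sum of preparation and estimated execution time, but at least 15 minutes."""
    return max(preparation + execution, MIN_REPORTED_DURATION)
```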

Security implications

  • Security requirements defined? (WHERE)
    • nothing new, only affects internal handling of how we prepare and request updates and VM property changes.
  • Security requirements tested? (EVIDENCE)
    • Python tests cover new and old functionality
    • extensive manual tests on a dev VM and one with production flag set
      • updates, reboots and VM property (mem, cpu) changes merge properly
      • significant updates postpone the request and are announced via mail
      • serialized legacy requests still load and run
      • updates are not prepared multiple times
      • executing an UpdateActivity properly registers and switches to the new system, reboots are done
      • continuously scheduled requests are executed in one go
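
One possible reading of "executed in one go" is sketched below: if the earliest request is due, every request that starts no later than the estimated end of the previous ones joins the same batch. The `Request` type, its fields and `runnable_batch` are hypothetical, and the interpretation of "continuously scheduled" as back-to-back start times is an assumption, not something taken from the agent's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Request:
    """Hypothetical maintenance request with a start time and a duration estimate."""

    name: str
    start: datetime
    estimate: timedelta


def runnable_batch(requests: List[Request], now: datetime) -> List[Request]:
    """Return the requests to run together, or [] if the first one is not due."""
    ordered = sorted(requests, key=lambda r: r.start)
    if not ordered or ordered[0].start > now:
        return []
    batch = [ordered[0]]
    window_end = ordered[0].start + ordered[0].estimate
    for req in ordered[1:]:
        # A gap after the estimated end of the batch breaks the chain.
        if req.start > window_end:
            break
        batch.append(req)
        window_end = max(window_end, req.start + req.estimate)
    return batch
```

Requests left out of the batch would simply wait for their own scheduled time.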

@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 2 times, most recently from 51d2578 to e0cf4ce Compare February 28, 2023 23:47
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 6 times, most recently from 375794f to 72e29bb Compare March 15, 2023 20:30
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 6 times, most recently from a4deaaa to 4b18896 Compare March 30, 2023 08:13
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 3 times, most recently from f9a58a8 to 57a2902 Compare April 12, 2023 22:38
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch from 111d1e6 to dd95e15 Compare April 18, 2023 22:06
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch from dd95e15 to 382bc17 Compare May 16, 2023 10:06
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 6 times, most recently from 13cf40f to 1d2d361 Compare June 6, 2023 17:33
@dpausp dpausp changed the base branch from fc-22.11-dev to fc-23.05-dev June 6, 2023 18:21
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch from 1d2d361 to 94a4689 Compare June 7, 2023 21:57
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch from 94a4689 to b4b66e9 Compare June 30, 2023 10:58
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 2 times, most recently from a4d6ae5 to 5a88908 Compare July 8, 2023 00:54
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 3 times, most recently from 282869c to f2e1cad Compare July 11, 2023 21:33
@dpausp dpausp changed the title wip fc-agent maintenance scheduling fc-agent: improve maintenance scheduling Jul 13, 2023
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 2 times, most recently from bd1b9a2 to fc35e15 Compare July 18, 2023 22:02
@dpausp dpausp requested a review from osnyx July 18, 2023 22:40
@dpausp dpausp marked this pull request as ready for review July 18, 2023 22:40
Review comment on pkgs/default.nix (outdated, resolved)
Review comment on pkgs/fc/agent/default.nix (resolved)
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch 3 times, most recently from dc96955 to de29cf5 Compare July 19, 2023 15:49
- New requests can now be merged with existing ones if their activities have
  the same type. Significant changes to activities cause postponing of the
  updated request.
- Requests can be cancelled by requesting an activity which nullifies the
  original activity (for example, reset channel back to current system
  channel => planned update will be cancelled)
- UpdateActivity with metadata and better comment generation replaces dumb
  shell scripts for planned system updates.
- VMChangeActivity with metadata replaces RebootActivity for mem and core
  changes.
- All activities can request a reboot which will be done after all due
  requests have been executed.
- Continuously scheduled requests will be executed in one go if at least the
  first request is due, avoiding repeated switching to maintenance mode in a
  short time frame and possibly unnecessary reboots.
- Overdue requests (more than 30 minutes after scheduled start time) will be
  postponed to avoid overrunning the planned maintenance window or
  interfering with other machines going into maintenance mode.
- Maintenance preparation time and request execution time are different
  concepts now. Execution of requests is typically quite fast but there may
  be commands delaying the execution of all requests. Directory doesn't
  support this yet so we just report the sum of preparation time and
  estimated execution time (but at least 15min).
- Un-tangled maintenance code and manage.py: all maintenance requests are now
  generated in maintenance.py.
- Fix handling of postponed requests and clean up state updates in the
  process. tempfail and retrylimit don't exist anymore as dedicated states.
- Time-saving shortcut for updates: if the new channel of an UpdateActivity
  results in the same system, just set the system channel and forget about
  the update.
- Explicitly exit after calling the reboot command.
- Reduce number of channel URL resolve calls (which impact Hydra),
  UpdateActivity expects a resolved URL now.

PL-129777
@dpausp dpausp force-pushed the PL-129777-agent-maintenance-scheduling branch from de29cf5 to c3abf47 Compare July 19, 2023 15:50
@osnyx osnyx self-requested a review July 19, 2023 15:54
@osnyx osnyx merged commit 9556b59 into fc-23.05-dev Jul 19, 2023
1 check passed
@osnyx osnyx deleted the PL-129777-agent-maintenance-scheduling branch July 19, 2023 15:57