
[V2] Add TaskAction garbage collector for terminal CRDs #6994

Merged
pingsutw merged 7 commits into v2 from gc
Mar 11, 2026

Conversation

Member

@pingsutw pingsutw commented Mar 9, 2026

Tracking issue

Closes #6995

Why are the changes needed?

Terminal TaskAction CRDs (Succeeded/Failed) remain in the cluster indefinitely after completion. This wastes etcd storage and slows down list operations. We need a garbage collector that periodically deletes terminal TaskActions after a configurable TTL.

What changes were proposed in this pull request?

Adds a label-based TTL garbage collector for terminal TaskActions, modeled after the propeller FlyteWorkflow GC (flytepropeller/pkg/controller/garbage_collector.go):

  • executor/pkg/config/config.go — Added GCConfig struct with Interval and MaxTTL fields.
  • executor/pkg/controller/taskaction_controller.go — Added ensureTerminalLabels() method that stamps terminal TaskActions with flyte.org/termination-status=terminated and flyte.org/completed-time=<UTC hour> labels. Called from both the terminal short-circuit path and after status updates that result in terminal state. Idempotent — no-op if labels already set.
  • executor/pkg/controller/garbage_collector.go (NEW) — GarbageCollector implementing manager.Runnable. Runs on a ticker, lists TaskActions with terminated label, filters by completed-time using lexicographic string comparison, deletes expired ones.
  • executor/setup.go — Conditionally adds GC to the controller manager when cfg.GC.Interval.Duration > 0
  • executor/pkg/config/config_flags.go / config_flags_test.go — Regenerated via go generate

How was this patch tested?

  • executor/pkg/controller/garbage_collector_test.go (NEW) — Ginkgo/envtest tests for: expired deletion, recent retention, non-terminated retention, empty list handling, and ensureTerminalLabels idempotency
  • executor/pkg/controller/taskaction_controller_test.go — Extended with test verifying GC labels are set when reconciling a terminal TaskAction
  • go build ./executor/... compiles
  • go test ./executor/pkg/config/... passes

Labels

  • added: New TaskAction garbage collector feature

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Copilot AI left a comment

Pull request overview

This PR introduces a TTL-based garbage collector for terminal TaskAction CRDs to prevent indefinite accumulation in the cluster, reducing etcd/storage bloat and improving list performance.

Changes:

  • Stamp terminal TaskActions with GC discovery labels (termination-status and completed-time) during reconcile.
  • Add a controller-runtime manager.Runnable garbage collector that periodically deletes expired terminal TaskActions.
  • Add GC config (interval + TTL) and wire the runnable into executor manager startup.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| manager/cmd/main.go | Initializes SetupContext.Scope for unified mode metrics. |
| gen/rust/Cargo.lock | Removes the Rust Cargo.lock from generated crate directory. |
| executor/setup.go | Conditionally registers the new GC runnable with the controller manager. |
| executor/pkg/controller/taskaction_controller.go | Adds terminal labeling logic (ensureTerminalLabels) and label constants. |
| executor/pkg/controller/garbage_collector.go | New GC runnable that lists terminated TaskActions and deletes expired ones. |
| executor/pkg/controller/garbage_collector_test.go | New envtest coverage for GC behavior and ensureTerminalLabels idempotency. |
| executor/pkg/controller/taskaction_controller_test.go | Adds a reconcile test verifying GC labels are applied to terminal resources. |
| executor/pkg/config/config.go | Adds GCConfig and default GC settings. |
| executor/pkg/config/config_flags.go | Adds CLI flags for gc.interval and gc.maxTTL (generated). |
| executor/pkg/config/config_flags_test.go | Adds generated flag wiring tests for GC config. |


Comment on lines +116 to +121
if cfg.GC.Interval.Duration > 0 {
    gc := controller.NewGarbageCollector(mgr.GetClient(), cfg.GC.Interval.Duration, cfg.GC.MaxTTL.Duration)
    if err := mgr.Add(gc); err != nil {
        return fmt.Errorf("executor: failed to add garbage collector: %w", err)
    }
}

Copilot AI Mar 9, 2026

GC is enabled based solely on cfg.GC.Interval > 0. If cfg.GC.MaxTTL is set to 0 or a negative duration, collect() will treat the cutoff as now/future and may delete essentially all terminated TaskActions. Consider validating MaxTTL > 0 (and returning an error or disabling GC) before adding the runnable.

Member

I think we could support immediate clean-up when MaxTTL is set to <= 0. Should we?

Member Author

yup, can be a follow-up

Comment on lines +233 to +236
labels := taskAction.GetLabels()
if labels != nil && labels[LabelTerminationStatus] == LabelValueTerminated {
    return nil // already labeled
}

Copilot AI Mar 9, 2026

ensureTerminalLabels returns early when termination-status=terminated is present, but it doesn't verify that completed-time is set. If a TaskAction has the terminated label but is missing/empty completed-time (manual edits, partial migrations, etc.), it will never become GC-eligible. Consider checking both labels and patching any missing/empty one(s).

Member

^ I think this is valid?

Member Author

fixed
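
One way the suggested check could look is sketched below, operating on a plain label map so it runs without a cluster. The label keys and the "terminated" value come from the PR description; terminalLabelPatch is a hypothetical helper and the actual fix in the PR may be structured differently.

```go
package main

import "fmt"

const (
	labelTerminationStatus = "flyte.org/termination-status"
	labelCompletedTime     = "flyte.org/completed-time"
	labelValueTerminated   = "terminated"
)

// terminalLabelPatch is a hypothetical helper: it returns only the labels
// that are missing or empty, so an empty result means "nothing to patch".
// Checking both keys avoids the case where termination-status is present
// but completed-time is not, which would leave the object ineligible for GC.
func terminalLabelPatch(labels map[string]string, completed string) map[string]string {
	patch := map[string]string{}
	// Reading from a nil map is safe in Go and yields the zero value.
	if labels[labelTerminationStatus] != labelValueTerminated {
		patch[labelTerminationStatus] = labelValueTerminated
	}
	if labels[labelCompletedTime] == "" {
		patch[labelCompletedTime] = completed
	}
	return patch
}

func main() {
	// The completed-time value format here is illustrative only.
	partial := map[string]string{labelTerminationStatus: labelValueTerminated}
	fmt.Println(terminalLabelPatch(nil, "2026-03-09.08"))
	fmt.Println(terminalLabelPatch(partial, "2026-03-09.08"))
}
```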

Comment on lines +243 to +245
labels[LabelTerminationStatus] = LabelValueTerminated
labels[LabelCompletedTime] = time.Now().UTC().Format(labelHourTimeFormat)
taskAction.SetLabels(labels)

Copilot AI Mar 9, 2026

LabelCompletedTime is described as “when the TaskAction became terminal”, but it’s set using time.Now() rather than the terminal condition’s LastTransitionTime. This means restarted controllers / pre-existing terminal TaskActions will get a much newer completed-time and be retained longer than the configured TTL. Consider deriving the label from the succeeded/failed condition transition time when available (falling back to time.Now() if missing).

Member

^ and this

Member Author

fixed

Comment on lines +66 to +68
// Initialize metrics scope
sc.Scope = promutils.NewScope("flyte")

Copilot AI Mar 9, 2026

sc.Scope is initialized here, but the storage DataStore is still constructed with promutils.NewTestScope(), so storage metrics won't be emitted under the configured scope in unified mode. Consider passing sc.Scope (or a child scope) into storage.NewDataStore for consistency with other services that use sc.Scope.

Member

^ Should we fix this?

Member Author

fixed

@pingsutw pingsutw linked an issue Mar 10, 2026 that may be closed by this pull request

    client.Limit(gcPageSize),
}
if continueToken != "" {
    listOpts = append(listOpts, client.Continue(continueToken))

Member

Could we add the check for whether LabelCompletedTime exists into the listOpts?

Member Author

fixed
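
The request above amounts to a selector that requires one label to equal a value and another label to exist. The sketch below models those semantics on plain maps so it runs without a cluster; the selector type is hypothetical. With controller-runtime, the equivalent would plausibly combine client.MatchingLabels with client.HasLabels in listOpts, though the exact options used in the PR are not shown in this conversation.

```go
package main

import "fmt"

// selector is a hypothetical model of a label selector with two kinds of
// requirement: exact key=value matches and key-exists checks.
type selector struct {
	equals map[string]string
	exists []string
}

// matches reports whether labels satisfy every requirement in the selector.
func (s selector) matches(labels map[string]string) bool {
	for k, want := range s.equals {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	for _, k := range s.exists {
		if _, ok := labels[k]; !ok {
			return false
		}
	}
	return true
}

func main() {
	sel := selector{
		equals: map[string]string{"flyte.org/termination-status": "terminated"},
		exists: []string{"flyte.org/completed-time"},
	}
	objects := []map[string]string{
		{"flyte.org/termination-status": "terminated", "flyte.org/completed-time": "2026-03-09.08"},
		{"flyte.org/termination-status": "terminated"}, // no completed-time label
		{}, // not terminal
	}
	for i, labels := range objects {
		fmt.Printf("object %d selected=%v\n", i, sel.matches(labels))
	}
}
```

Pushing the existence check into the list call means objects without a completed-time label are filtered by the API server and never reach the collector's in-memory comparison.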

pingsutw and others added 7 commits March 10, 2026 20:16
Terminal TaskAction CRDs (Succeeded/Failed) remain in the cluster
indefinitely, wasting etcd storage and slowing list operations. This adds
a garbage collector that periodically deletes terminal TaskActions after
a configurable TTL, modeled after the propeller FlyteWorkflow GC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Kevin Su <pingsutw@apache.org>
…e fields, regenerate pflags

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kevin Su <pingsutw@apache.org>

Member

@machichima machichima left a comment

LGTM!

@pingsutw pingsutw merged commit df8959c into v2 Mar 11, 2026
14 checks passed
@pingsutw pingsutw deleted the gc branch March 11, 2026 03:27