
Support garbage-collecting stale AnalysisRuns #3285

Open
kevinqian-db opened this issue Jan 2, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@kevinqian-db
Contributor

kevinqian-db commented Jan 2, 2024

Summary

Motivation:

Right now Argo Rollouts supports capping AnalysisRun history retention for Rollouts through successfulRunHistoryLimit and unsuccessfulRunHistoryLimit. However, it does not appear to support auto-cleanup for one-off AnalysisRuns.

In our case, we have been using independent AnalysisRuns for other workload types, e.g. StatefulSets, to evaluate their health and decide whether to trigger a rollback. This leaves a huge number of completed (mostly Successful) AnalysisRuns not cleaned up. After one year of such usage, their total count has surpassed 10k.

This causes two major problems for us right now, aside from increased CPU and memory usage:

  1. For reasons that need further investigation, on both v1.2 and v1.6.3 we have seen completed AnalysisRuns keep being reconciled with skipped patches. This floods our work queue and causes occasional high latency between two passes of metrics collection (likely causing overdues).
  • Identified the cause to be DefaultRolloutResyncPeriod = 15 * 60, which is applied to all types of informers for Argo resources and thus triggers at least one update call (even when there is no actual update) every 15 minutes, as long as the resource is not deleted.
  2. The metrics endpoint has been overburdened by such a huge number of AnalysisRuns that a GET to the endpoint takes >30s to respond (and, even when it responds, generates a >200MB response due to the sheer number of historical AnalysisRuns accompanying completed releases). This hangs monitoring tools, e.g. Grafana, and requires us to unnecessarily set the readiness check delay to a high number.
  • Our pprof shows that about half of the CPU time is spent on collecting and processing metrics.

Therefore, we hope to be able to configure Argo so that it automatically issues deletes for AnalysisRuns that are stale "enough".

Proposal:

  1. Define the following field on the AnalysisRun manifest:
spec:
  ttlStrategy:
    secondsAfterCompletion: <number-of-seconds>
    secondsAfterFailure: <number-of-seconds>
    secondsAfterSuccess: <number-of-seconds>
  • Prior art in Argo Workflows.
  • Similarly, there is prior art in Jobs' ttlSecondsAfterFinished since Kubernetes 1.23.
  • Making it a subfield of ttlStrategy makes it future-proof for when more complicated TTL settings are added.
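As a sketch, the proposed field could map to a Go API type along these lines (the struct and field names here are illustrative, not the final API):

```go
package main

import "fmt"

// TTLStrategy mirrors the proposed ttlStrategy manifest field.
// All fields are optional pointers so "unset" is distinguishable from 0.
type TTLStrategy struct {
	SecondsAfterCompletion *int32 `json:"secondsAfterCompletion,omitempty"`
	SecondsAfterFailure    *int32 `json:"secondsAfterFailure,omitempty"`
	SecondsAfterSuccess    *int32 `json:"secondsAfterSuccess,omitempty"`
}

func main() {
	ttl := int32(3600)
	s := TTLStrategy{SecondsAfterCompletion: &ttl}
	fmt.Println(*s.SecondsAfterCompletion)
}
```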
  2. In analysis::Controller.reconcileAnalysisRun (this is already invoked every resync period for all completed AnalysisRuns in the current Argo version):
  • If the run's status is completed:
    • Check whether it has ttlStrategy set. If not, return.
    • Compute its completion time. It can be either
      • the max timestamp of all measurements, or
      • (Recommended) a new field AnalysisRun.Status.CompletedAt (similar to the existing AnalysisRun.Status.StartedAt). This should make extracting the completion timestamp much faster when there are many measurements.
    • If now - CompletedAt > secondsAfter(Completion|Failure|Success) and the corresponding completion status matches, delete the AnalysisRun.
      • secondsAfterFailure and secondsAfterSuccess override secondsAfterCompletion if both are set.
  3. (Optional) Add the following to the AnalysisTemplate manifest:
spec:
  runTtlStrategy:
    secondsAfterCompletion: <number-of-seconds>
    secondsAfterFailure: <number-of-seconds>
    secondsAfterSuccess: <number-of-seconds>
  • Prefixed with run to avoid confusion, since this TTL does not delete the AnalysisTemplate itself.
  • In NewAnalysisRunFromTemplates, which translates an AnalysisTemplate into an AnalysisRun, copy runTtlStrategy to the ttlStrategy of the generated AnalysisRun if it is present.

Alternatives (Not Recommended):

  • On the Argo controller binary, add a new flag --analysis-run-max-alive-duration=<max-alive-duration> (e.g. 30d).
    • There is prior art in setting leader-election-related durations using a similar approach.
  • If this flag is provided (with a non-zero value), it is passed as an argument to analysis::NewController(...). An extra thread is started in analysis::Controller.Run(...) that does the following every fixed period of time (e.g. 15 min), waiting for the previous pass to complete first:
    • List all current AnalysisRuns.
    • If an AnalysisRun is not deleted, is in a completed state (e.g. Successful), and its creation timestamp is older than now - max-alive-duration, issue a deletion request for it.
      • The deletion request is issued through the argoProjClientset API client, which is accessible in the AnalysisRun controller.
    • This thread exits, like the other analysis threads, when the outer context is canceled.

Note that with this alternative:

  • Running GC on a single thread sequentially is intentional, to make it a gradual process and avoid resource usage spikes that interfere with normal release processes.
  • The corresponding metrics of a deleted AnalysisRun also disappear after the deletion. This helps improve the responsiveness of the /metrics call (which can currently go as high as >30s with a 200MB response, timing out most of the time for Grafana).

Use Cases

See above motivation.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@kevinqian-db kevinqian-db added the enhancement New feature or request label Jan 2, 2024
@kevinqian-db kevinqian-db changed the title Support deleting stale AnalysisRuns not associated to Rollout Support garbage-collecting stale AnalysisRuns not associated to Rollout Jan 2, 2024
@kevinqian-db kevinqian-db changed the title Support garbage-collecting stale AnalysisRuns not associated to Rollout Support garbage-collecting stale AnalysisRuns Jan 9, 2024
@jessesuen
Member

One consideration is that because runs can be synthesized from multiple templates, we would need to determine what to do if the TTLs disagree. I guess we would need to take the largest value.

@zachaller
Collaborator

zachaller commented Jan 10, 2024

I like the proposal. I would like to define exactly how AnalysisRun.Status.CompletedAt is determined, but overall the idea sounds good.

I think the "(Optional) Add the following to the AnalysisTemplate manifest" part might be slightly hard to define because analysis templates get merged together. We could maybe do some heuristic, but I am also OK with it just being on the AnalysisRun itself, especially as a first pass.

@kevinqian-db
Contributor Author

kevinqian-db commented Jan 10, 2024

I think that (Optional) Add the following to AnalysisTemplate manifest: might be slightly hard to define because analysis templates get merged together, we could maybe do some heuristic but I am also ok with it just being on the AnalysisRun itself, especially as a first pass.

Sounds good. For our immediate usage I don't think we need to support it for now.

So I like the proposal I would like to define exactly how AnalysisRun.Status.CompletedAt is determined but overall the idea sounds good.

I believe we can make CompletedAt the now timestamp of the AnalysisRun reconciliation pass. Right now, StartedAt is populated with now during assessRunStatus at

if run.Status.StartedAt == nil {
	now := timeutil.MetaNow()
	run.Status.StartedAt = &now
}

I think for CompletedAt we can similarly take the now value at

newStatus, newMessage := c.assessRunStatus(run, resolvedMetrics, dryRunMetricsMap)
if newStatus != run.Status.Phase {
	run.Status.Phase = newStatus
	run.Status.Message = newMessage
	if newStatus.Completed() {
		c.recordAnalysisRunCompletionEvent(run)
	}
}

i.e., when the new phase differs from the existing phase and the new phase is a completed phase.
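Mirroring the StartedAt pattern, stamping CompletedAt on the phase transition could look like this minimal sketch (types are stubbed here; the real code would use metav1.Time and timeutil.MetaNow(), and setCompletedAt is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"time"
)

type AnalysisPhase string

// Completed mirrors the existing AnalysisPhase.Completed() semantics.
func (p AnalysisPhase) Completed() bool {
	switch p {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	}
	return false
}

type AnalysisRunStatus struct {
	Phase       AnalysisPhase
	StartedAt   *time.Time
	CompletedAt *time.Time
}

// setCompletedAt stamps CompletedAt exactly once, when the phase first
// transitions to a terminal phase, mirroring how StartedAt is set.
func setCompletedAt(status *AnalysisRunStatus, newPhase AnalysisPhase) {
	if newPhase.Completed() && status.CompletedAt == nil {
		now := time.Now()
		status.CompletedAt = &now
	}
	status.Phase = newPhase
}

func main() {
	st := &AnalysisRunStatus{Phase: "Running"}
	setCompletedAt(st, "Successful")
	fmt.Println(st.CompletedAt != nil)
}
```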

@zachaller
Collaborator

zachaller commented Jan 11, 2024

Yup, that makes sense. I did not know off the top of my head whether there was an actual completed phase and whether that phase is also set on things like errors or failures; if it is not set on errors or failures, do we still want to GC? Just things to double-check and confirm.

@kevinqian-db
Contributor Author

Yup, that makes sense. I did not know off the top of my head whether there was an actual completed phase and whether that phase is also set on things like errors or failures; if it is not set on errors or failures, do we still want to GC? Just things to double-check and confirm.

According to

// Completed returns whether or not the analysis status is considered completed
func (as AnalysisPhase) Completed() bool {
	switch as {
	case AnalysisPhaseSuccessful, AnalysisPhaseFailed, AnalysisPhaseError, AnalysisPhaseInconclusive:
		return true
	}
	return false
}

All these phases should be considered complete, so IMO it makes sense to set the value for all four of these phase types.

@zachaller
Collaborator

Yeah, agreed. I am open to a PR that implements option 2, along with the addition of the CompletedAt status field for use in determining when to GC based on the TTL settings.

@jessesuen
Member

jessesuen commented Jan 12, 2024

we could maybe do some heuristic but I am also ok with it just being on the AnalysisRun itself, especially as a first pass.

Given the two options, I feel the GC feature becomes much less useful without the ability to define it in the template. In the common case, users will have a 1:1 match between templates and runs. Allowing GC to be defined in the template further enables using Analysis without pairing it with another object (e.g. a Rollout), which would otherwise have to introduce a place to define the TTL and carry it over to the Run.

I think an acceptable and unsurprising heuristic is to take the max of all TTLs from all Templates, for the synthesized run.

@kevinqian-db
Contributor Author

I think an acceptable and unsurprising heuristic is to take the max of all TTLs from all Templates, for the synthesized run.

Will definitely need to document this well to avoid surprises, since some might expect the min instead of the max (in the sense of "overriding"). Also, I think we can implement the template support in a separate PR (I believe measurementRetention was similarly split into 3 PRs).
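The max-of-all-TTLs merge heuristic discussed above could be sketched as follows (maxTTL is a hypothetical helper; nil represents a template that sets no TTL):

```go
package main

import "fmt"

// maxTTL picks the largest TTL (in seconds) among the templates a run
// is synthesized from. A nil result means no template set a TTL, so the
// generated AnalysisRun gets no ttlStrategy and is never GC'd.
func maxTTL(ttls []*int32) *int32 {
	var max *int32
	for _, t := range ttls {
		if t == nil {
			continue
		}
		if max == nil || *t > *max {
			max = t
		}
	}
	return max
}

func main() {
	a, b := int32(300), int32(3600)
	got := maxTTL([]*int32{&a, nil, &b})
	fmt.Println(*got)
}
```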
