
Support garbage-collecting stale AnalysisRuns #3285

Open
kevinqian-db opened this issue Jan 2, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@kevinqian-db
Contributor

kevinqian-db commented Jan 2, 2024

Summary

Motivation:

Right now Argo Rollouts supports capping AnalysisRun history retention for Rollouts through successfulRunHistoryLimit and unsuccessfulRunHistoryLimit. However, it does not appear to support auto-cleanup for one-off AnalysisRuns.

In our case, we have been using independent AnalysisRuns for other workload types, e.g. StatefulSets, to evaluate their health and decide whether to trigger a rollback. This leaves a huge number of completed (mostly Successful) AnalysisRuns not cleaned up. After one year of such usage, their total count has surpassed 10k.

This causes two major problems for us right now, aside from increased CPU and memory usage:

  1. For reasons that need further investigation, on both v1.2 and v1.6.3 we have seen completed AnalysisRuns keep being reconciled with skipped patches. This floods our work queue and causes occasional high latency between two passes of metrics collection (likely causing overdues).
  • Identified the cause to be DefaultRolloutResyncPeriod = 15 * 60, which is applied to all types of informers for Argo resources and thus triggers at least one update call (even when there is no actual update) every 15 minutes, as long as the resource is not deleted.
  2. The metrics endpoint has been overburdened by such a huge number of AnalysisRuns that a GET to the endpoint takes >30s to respond (and, even when it responds, generates a >200MB response due to the sheer number of historical AnalysisRuns accompanying completed releases). This hangs monitoring tools, e.g. Grafana, and requires us to unnecessarily set the readiness check delay to a high number.
  • Our pprof shows that about half of the CPU time is spent on collecting and processing metrics.

Therefore, we hope to be able to configure Argo so that it automatically issues deletes for AnalysisRuns that are stale "enough".

Proposal:

  1. Define the following field on the AnalysisRun manifest:
spec:
  ttlStrategy:
    secondsAfterCompletion: <number-of-seconds>
    secondsAfterFailure: <number-of-seconds>
    secondsAfterSuccess: <number-of-seconds>
  • Prior art in Argo Workflows.
  • Similarly, there is prior art in Jobs' ttlSecondsAfterFinished since Kubernetes 1.23.
  • Making it a subfield of ttlStrategy makes it future-proof for when more complicated TTL settings are added.
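As a sketch, the proposed field could map to a Go API type along these lines (the struct and field names here are illustrative, not the final API):

```go
package main

import "fmt"

// TTLStrategy mirrors the proposed ttlStrategy manifest field.
// All fields are optional pointers so "unset" is distinguishable from 0.
type TTLStrategy struct {
	SecondsAfterCompletion *int32 `json:"secondsAfterCompletion,omitempty"`
	SecondsAfterFailure    *int32 `json:"secondsAfterFailure,omitempty"`
	SecondsAfterSuccess    *int32 `json:"secondsAfterSuccess,omitempty"`
}

func main() {
	ttl := int32(3600)
	s := TTLStrategy{SecondsAfterCompletion: &ttl}
	fmt.Println(*s.SecondsAfterCompletion)
}
```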
  2. In analysis::Controller.reconcileAnalysisRun (this is already invoked every resync period for all completed AnalysisRuns in the current Argo version):
  • If the run's status is completed:
    • Check whether it has ttlStrategy set. If not, return.
    • Compute its completion time. It can be either
      • the max timestamp of all measurements, or
      • (Recommended) a new field AnalysisRun.Status.CompletedAt (similar to the existing AnalysisRun.Status.StartedAt). This should make extracting the completion timestamp much faster when there are many measurements.
    • If now - CompletedAt > secondsAfter(Completion|Failure|Success) and the corresponding completion status matches, delete the AnalysisRun.
      • secondsAfterFailure and secondsAfterSuccess override secondsAfterCompletion if both are set.
  3. (Optional) Add the following to the AnalysisTemplate manifest:
spec:
  runTtlStrategy:
    secondsAfterCompletion: <number-of-seconds>
    secondsAfterFailure: <number-of-seconds>
    secondsAfterSuccess: <number-of-seconds>
  • Prefixed with run to avoid confusion, since this TTL does not delete the AnalysisTemplate itself.
  • In NewAnalysisRunFromTemplates, which translates an AnalysisTemplate into an AnalysisRun, copy runTtlStrategy to the ttlStrategy of the generated AnalysisRun if it is present.

Alternatives (Not Recommended):

  • On the Argo controller binary, add a new flag --analysis-run-max-alive-duration=<max-alive-duration> (e.g. 30d).
    • There is prior art in setting leader-election-related durations using a similar approach.
  • If this flag is provided (with a non-zero value), it is passed as an argument to analysis::NewController(...). An extra thread is started in analysis::Controller.Run(...) that does the following every fixed period of time (e.g. 15 min), waiting for the previous pass to complete first:
    • List all current AnalysisRuns.
    • If an AnalysisRun is not deleted, is in a completed state (e.g. Successful), and its creation timestamp is older than now - max-alive-duration, issue a deletion request for it.
      • The deletion request is issued through the argoProjClientset API client, which is accessible in the AnalysisRun controller.
    • This thread exits, like the other analysis threads, when the outer context is canceled.

Note that with this alternative:

  • Running GC on a single thread sequentially is intentional, to make it a gradual process and avoid resource usage spikes that interfere with normal release processes.
  • The corresponding metrics of a deleted AnalysisRun also disappear after the deletion. This helps improve the responsiveness of the /metrics call (which can currently go as high as >30s with a 200MB response, timing out most of the time for Grafana).

Use Cases

See above motivation.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@kevinqian-db kevinqian-db added the enhancement New feature or request label Jan 2, 2024
@kevinqian-db kevinqian-db changed the title Support deleting stale AnalysisRuns not associated to Rollout Support garbage-collecting stale AnalysisRuns not associated to Rollout Jan 2, 2024
@kevinqian-db kevinqian-db changed the title Support garbage-collecting stale AnalysisRuns not associated to Rollout Support garbage-collecting stale AnalysisRuns Jan 9, 2024
@jessesuen
Member

One consideration is that because runs can be synthesized from multiple templates, we would need to determine what to do if the TTLs disagree. I guess we would need to take the largest value.

@zachaller
Collaborator

zachaller commented Jan 10, 2024

I like the proposal. I would like to define exactly how AnalysisRun.Status.CompletedAt is determined, but overall the idea sounds good.

I think the "(Optional) Add the following to the AnalysisTemplate manifest" part might be slightly hard to define because analysis templates get merged together. We could maybe do some heuristic, but I am also OK with it just being on the AnalysisRun itself, especially as a first pass.

@kevinqian-db
Contributor Author

kevinqian-db commented Jan 10, 2024

I think that (Optional) Add the following to AnalysisTemplate manifest: might be slightly hard to define because analysis templates get merged together, we could maybe do some heuristic but I am also ok with it just being on the AnalysisRun itself, especially as a first pass.

Sounds good. For our immediate usage I don't think we need to support it for now.

So I like the proposal I would like to define exactly how AnalysisRun.Status.CompletedAt is determined but overall the idea sounds good.

I believe we can make CompletedAt the now timestamp of the AnalysisRun reconciliation pass. Right now, StartedAt is populated with now during assessRunStatus at

if run.Status.StartedAt == nil {
	now := timeutil.MetaNow()
	run.Status.StartedAt = &now
}

I think for CompletedAt we can similarly take the now value at

newStatus, newMessage := c.assessRunStatus(run, resolvedMetrics, dryRunMetricsMap)
if newStatus != run.Status.Phase {
	run.Status.Phase = newStatus
	run.Status.Message = newMessage
	if newStatus.Completed() {
		c.recordAnalysisRunCompletionEvent(run)
	}
}

i.e., when the new phase differs from the existing phase and the new phase is a completed phase.
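Mirroring the StartedAt pattern, stamping CompletedAt on the phase transition could look like this minimal sketch (types are stubbed here; the real code would use metav1.Time and timeutil.MetaNow(), and setCompletedAt is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"time"
)

type AnalysisPhase string

// Completed mirrors the existing AnalysisPhase.Completed() semantics.
func (p AnalysisPhase) Completed() bool {
	switch p {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	}
	return false
}

type AnalysisRunStatus struct {
	Phase       AnalysisPhase
	StartedAt   *time.Time
	CompletedAt *time.Time
}

// setCompletedAt stamps CompletedAt exactly once, when the phase first
// transitions to a terminal phase, mirroring how StartedAt is set.
func setCompletedAt(status *AnalysisRunStatus, newPhase AnalysisPhase) {
	if newPhase.Completed() && status.CompletedAt == nil {
		now := time.Now()
		status.CompletedAt = &now
	}
	status.Phase = newPhase
}

func main() {
	st := &AnalysisRunStatus{Phase: "Running"}
	setCompletedAt(st, "Successful")
	fmt.Println(st.CompletedAt != nil)
}
```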

@zachaller
Collaborator

zachaller commented Jan 11, 2024

Yup, that makes sense. I did not know off the top of my head whether there was an actual completed phase and whether that phase is also set on things like errors or failures; if it is not set on errors or failures, do we still want to GC? Just things to double-check and confirm.

@kevinqian-db
Contributor Author

Yup, that makes sense. I did not know off the top of my head whether there was an actual completed phase and whether that phase is also set on things like errors or failures; if it is not set on errors or failures, do we still want to GC? Just things to double-check and confirm.

According to

// Completed returns whether or not the analysis status is considered completed
func (as AnalysisPhase) Completed() bool {
	switch as {
	case AnalysisPhaseSuccessful, AnalysisPhaseFailed, AnalysisPhaseError, AnalysisPhaseInconclusive:
		return true
	}
	return false
}

All these phases should be considered complete, so IMO it makes sense to set the value for all four of these phase types.

@zachaller
Collaborator

Yeah, agreed. I am open to a PR that implements option 2, along with the addition of the CompletedAt status field for use in determining when to GC based on the TTL settings.

@jessesuen
Member

jessesuen commented Jan 12, 2024

we could maybe do some heuristic but I am also ok with it just being on the AnalysisRun itself, especially as a first pass.

Given the two options, I feel the GC feature becomes much less useful without the ability to define it in the template. In the common case, users will have a 1:1 match between templates and runs. Allowing GC to be defined in the template further enables using Analysis without pairing it with another object (e.g. a Rollout), which would otherwise have to introduce a place to define the TTL and carry it over to the Run.

I think an acceptable and unsurprising heuristic is to take the max of all TTLs from all Templates, for the synthesized run.

@kevinqian-db
Contributor Author

I think an acceptable and unsurprising heuristic is to take the max of all TTLs from all Templates, for the synthesized run.

Will definitely need to document this well to avoid surprises, since some might expect the min instead of the max (in the sense of "overriding"). Also, I think we can implement the template support in a separate PR (I believe measurementRetention was similarly split into 3 PRs).
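The max-of-all-TTLs merge heuristic discussed above could be sketched as follows (maxTTL is a hypothetical helper; nil represents a template that sets no TTL):

```go
package main

import "fmt"

// maxTTL picks the largest TTL (in seconds) among the templates a run
// is synthesized from. A nil result means no template set a TTL, so the
// generated AnalysisRun gets no ttlStrategy and is never GC'd.
func maxTTL(ttls []*int32) *int32 {
	var max *int32
	for _, t := range ttls {
		if t == nil {
			continue
		}
		if max == nil || *t > *max {
			max = t
		}
	}
	return max
}

func main() {
	a, b := int32(300), int32(3600)
	got := maxTTL([]*int32{&a, nil, &b})
	fmt.Println(*got)
}
```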
