Support garbage-collecting stale AnalysisRuns #3285
Comments
One consideration is that because runs can be synthesized from multiple templates, we would need to determine what to do if the TTLs disagreed. I guess we would need to take the largest value.
I like the proposal. I would like to define exactly how I think that
Sounds good. For our immediate usage I don't think we need to support it for now.
I believe we can make argo-rollouts/analysis/analysis.go Lines 428 to 431 in 99ae20c
I think for argo-rollouts/analysis/analysis.go Lines 105 to 112 in 99ae20c
Yup, that makes sense. I did not know off the top of my head whether there was an actual completed phase, or whether that phase was also set on errors, failures, etc. If that phase is not set on errors or failures, do we still want to GC? Just things to double-check and confirm.
According to argo-rollouts/pkg/apis/rollouts/v1alpha1/analysis_types.go Lines 198 to 205 in 99ae20c
All these phases should be considered complete, so IMO it makes sense to just apply the value to all 4 of these phases.
Yeah, agreed. I am open to a PR that implements option 2, along with the addition of the
Given the two options, I feel the GC feature becomes much less useful without the ability to define it in the template. In the common case, users will have a 1:1 match between templates and runs. Allowing GC to be defined in the template further enables the use of Analysis without the need to pair it with another object (e.g. Rollout), which would have to introduce a place to define the TTL and carry it over to the Run. I think an acceptable and unsurprising heuristic is to take the max of all TTLs from all Templates for the synthesized run.
We will definitely need to document this well to avoid surprises, since some might expect the min instead of the max (in the sense of "overriding"). Also, I think we can implement this on the template in a separate PR (I believe measurementRetention was similarly split into 3 PRs).
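To make the "take the max" heuristic from the comments above concrete, here is a minimal Go sketch. The type `TemplateTTL`, its fields, and the functions `mergedTTL` and `seconds` are hypothetical names used only for illustration; they are not part of the Argo Rollouts API, and the sketch assumes each template carries at most one TTL value in seconds.

```go
package main

import "fmt"

// TemplateTTL is a hypothetical stand-in for the TTL carried by one
// AnalysisTemplate. A nil SecondsAfterCompletion means the template sets no TTL.
type TemplateTTL struct {
	Name                   string
	SecondsAfterCompletion *int32
}

// mergedTTL returns the largest TTL found across the templates, or nil when
// no template sets one (in which case the synthesized run is never GC'd).
func mergedTTL(templates []TemplateTTL) *int32 {
	var largest *int32
	for _, t := range templates {
		ttl := t.SecondsAfterCompletion
		if ttl == nil {
			continue // this template does not constrain the TTL
		}
		if largest == nil || *ttl > *largest {
			v := *ttl // copy so the result does not alias the input
			largest = &v
		}
	}
	return largest
}

// seconds is a small helper for building optional TTL values.
func seconds(n int32) *int32 { return &n }

func main() {
	merged := mergedTTL([]TemplateTTL{
		{Name: "latency", SecondsAfterCompletion: seconds(3600)},
		{Name: "error-rate"}, // no TTL set
		{Name: "saturation", SecondsAfterCompletion: seconds(86400)},
	})
	if merged == nil {
		fmt.Println("no TTL set")
		return
	}
	fmt.Println(*merged) // prints 86400
}
```

Taking the max keeps the run for as long as the most conservative template asks, which is why it would need to be documented for anyone who expects the smallest value to win.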
Summary
Motivation:
Right now Argo Rollouts supports max AnalysisRun history retention for rollouts through successfulRunHistoryLimit and unsuccessfulRunHistoryLimit. However, it seems that it does not support auto-cleanup for one-off AnalysisRuns.
In our case, we have been using independent AnalysisRuns for other workload types, e.g. StatefulSets, to evaluate their health and decide whether to trigger a rollback. This leaves a huge number of completed (mostly Successful) AnalysisRuns not cleaned up. After one year of such usage, their total number surpasses 10k.
This causes two major problems for us right now, aside from increased CPU and memory usage:
For an unclear reason that needs further investigation, on both v1.2 and v1.6.3 we have seen completed AnalysisRuns keep being reconciled with skipped patches. This floods our work queue and causes occasional high latency between two passes of metrics collection (likely causing overdues).
argo-rollouts/controller/controller.go Line 64 in feda353
Therefore, we hope to be able to configure Argo so that it automatically issues deletes for AnalysisRuns that are stale "enough".
Proposal:
AnalysisRun manifest:
(Similar to Jobs' ttlSecondsAfterFinished since 1.23. ttlStrategy makes it future-proof when more complicated TTL settings are added.)
In analysis::Controller.reconcileAnalysisRun (this is already being invoked every resync period for all completed AnalysisRuns in the current Argo version):
Check whether ttlStrategy is set. If not, return.
AnalysisRun.Status.CompletedAt (similar to the existing AnalysisRun.Status.StartedAt). This should make the extraction of the completion timestamp much faster (if there are too many measurements).
If now - CompletedAt > secondsAfter(Completion|Failure|Success) and the corresponding completion status matches it, delete the AnalysisRun.
secondsAfterFailure and secondsAfterSuccess will override secondsAfterCompletion if both are set.
AnalysisTemplate manifest:
The field carries the run prefix to avoid confusion, since this TTL is not for deleting the AnalysisTemplate.
In NewAnalysisRunFromTemplates, which translates AnalysisTemplate to AnalysisRun, copy runTtlStrategy to the ttlStrategy of the generated AnalysisRun if it is present.

Alternatives (Not Recommended):
--analysis-run-max-alive-duration=<max-alive-duration> (e.g. 30d).
analysis::NewController(...). An extra thread will be started in analysis::Controller.Run(...) so that it will do the following every fixed period of time (e.g. 15min) (or until the previous pass is completed):
argo-rollouts/controller/metrics/analysis.go Line 42 in feda353
now - max alive duration, then issue a deletion request for the given AR.
argoProjClientset API client, which is accessible in the AnalysisRun controller.
Note that with this alternative:
Use Cases
See above motivation.
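For reference, the TTL check outlined in the Proposal section can be sketched in Go as follows. This is only an illustration under stated assumptions, not the controller's actual code: the types `Phase` and `TTLStrategy` and the functions `effectiveTTL`, `shouldDelete`, `seconds`, and `unix` are simplified, hypothetical stand-ins. It also assumes that secondsAfterSuccess applies to the Successful phase, that secondsAfterFailure applies to the Failed phase, and that the other completed phases (Error, Inconclusive) fall back to secondsAfterCompletion.

```go
package main

import (
	"fmt"
	"time"
)

// Phase is a simplified, hypothetical copy of the AnalysisRun phase.
type Phase string

const (
	PhaseRunning      Phase = "Running"
	PhaseSuccessful   Phase = "Successful"
	PhaseFailed       Phase = "Failed"
	PhaseError        Phase = "Error"
	PhaseInconclusive Phase = "Inconclusive"
)

// TTLStrategy mirrors the fields named in the proposal; nil means "not set".
type TTLStrategy struct {
	SecondsAfterCompletion *int32
	SecondsAfterSuccess    *int32
	SecondsAfterFailure    *int32
}

// completed reports whether the phase is one of the four terminal phases.
func completed(p Phase) bool {
	switch p {
	case PhaseSuccessful, PhaseFailed, PhaseError, PhaseInconclusive:
		return true
	}
	return false
}

// effectiveTTL picks the TTL that applies to the phase. The success and
// failure values take precedence over secondsAfterCompletion when set.
func effectiveTTL(p Phase, s *TTLStrategy) *int32 {
	if s == nil || !completed(p) {
		return nil
	}
	if p == PhaseSuccessful && s.SecondsAfterSuccess != nil {
		return s.SecondsAfterSuccess
	}
	if p == PhaseFailed && s.SecondsAfterFailure != nil {
		return s.SecondsAfterFailure
	}
	return s.SecondsAfterCompletion
}

// shouldDelete implements "now - CompletedAt > TTL" for a completed run.
func shouldDelete(p Phase, completedAt *time.Time, s *TTLStrategy, now time.Time) bool {
	ttl := effectiveTTL(p, s)
	if ttl == nil || completedAt == nil {
		return false
	}
	return now.Sub(*completedAt) > time.Duration(*ttl)*time.Second
}

// seconds and unix are small helpers for building example values.
func seconds(n int32) *int32    { return &n }
func unix(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	done := unix(0)
	now := unix(600) // ten minutes after completion
	ttl := &TTLStrategy{SecondsAfterCompletion: seconds(3600), SecondsAfterSuccess: seconds(60)}

	fmt.Println(shouldDelete(PhaseSuccessful, &done, ttl, now)) // true: 600s > 60s
	fmt.Println(shouldDelete(PhaseError, &done, ttl, now))      // false: 600s <= 3600s
	fmt.Println(shouldDelete(PhaseRunning, nil, ttl, now))      // false: not completed
}
```

Keeping the decision in a pure function of phase, completion time, strategy, and the current time makes it straightforward to unit-test separately from the reconcile loop and the delete call.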
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.