-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Timestamp for Health Status (#16972) #18660
base: master
Are you sure you want to change the base?
Conversation
11944ca
to
6d65cda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I have a couple of initial things, PTAL :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mkieweg for the PR. I have left a suggestion. I think below section also needs an update.
argo-cd/controller/appcontroller.go
Lines 1777 to 1791 in 029b5ac
if orig.Status.Health.Status != newStatus.Health.Status { | |
message := fmt.Sprintf("Updated health status: %s -> %s", orig.Status.Health.Status, newStatus.Health.Status) | |
ctrl.logAppEvent(orig, argo.EventInfo{Reason: argo.EventReasonResourceUpdated, Type: v1.EventTypeNormal}, message, context.TODO()) | |
} | |
var newAnnotations map[string]string | |
if orig.GetAnnotations() != nil { | |
newAnnotations = make(map[string]string) | |
for k, v := range orig.GetAnnotations() { | |
newAnnotations[k] = v | |
} | |
delete(newAnnotations, appv1.AnnotationKeyRefresh) | |
} | |
patch, modified, err := diff.CreateTwoWayMergePatch( | |
&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: orig.GetAnnotations()}, Status: orig.Status}, | |
&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: newAnnotations}, Status: *newStatus}, appv1.Application{}) |
Looks like Update: not required.CreateTwoWayMergePatch
will always detect a drift here due to the new timestamp field and try to patch the status when there is no actual change in app health status. We might need to explicitly set it to old/orig value if there is not change in app health status.
Thanks for pointing that out, I'll give it a look |
Hi @mkieweg, I think you can ignore my previous suggestion regarding changes due to argo-cd/controller/appcontroller.go Line 1504 in 029b5ac
and will differ only when explicitly updated, which happens at 2 places. We need to ensure timestamp is updated at these places (looks like 2. is already handled in your change).
|
Hi @svghadi, I've added setting the timestamp in case we set the health status to unknown. The second one is indeed already handled by |
We actually used to have a timestamp, but the updating of the timestamp as part of every app reconciliation would cause a ton of K8s API server pressure. I need to find the PR that removed this, but this could actually cause a performance problem. |
I found the old issue. It is here: #1340 Basically, we used to update |
I took a closer look at this implementation, and noticed that we don't needlessly update timestamps (as we did with the problematic // If the status didn't change, we don't want to update the timestamp
if healthStatus.Status == statuses[i].Health.Status {
now = lastTransitionTime
} So my performance concerns may be alleviated. |
In the E2E test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran some tests and found some corner cases that need to be handled:
- Timestamp for individual resources is updated to Application's LastTransitionTime even when there is no change in the resource status.
- The timestamp for Application's health is not updated when the status changes from Non Healthy to Healthy state.
controller/health.go
Outdated
// If the status didn't change, we don't want to update the timestamp | ||
if healthStatus.Status == statuses[i].Health.Status { | ||
now = lastTransitionTime | ||
} | ||
|
||
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section sets health of individual resources of the application. If there is no status change, I feel it should be set to lastTransitionTime of that particular resource instead of setting it to application's lastTransitionTime, as the values may not necessarily be the same.
// If the status didn't change, we don't want to update the timestamp | |
if healthStatus.Status == statuses[i].Health.Status { | |
now = lastTransitionTime | |
} | |
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} | |
// If the status didn't change, we don't want to update the timestamp | |
if healthStatus.Status == statuses[i].Health.Status { | |
now = statuses[i].Health.LastTransitionTime | |
} | |
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me have a look and get back to you on this. I vaguely remember, that the current implementation only persists the timestamp on an application level. So while setting it on the resource level here might make sense, but I'm not sure if it has any benefit if the rest of the implementation remains unchanged. Which I would advocate for at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the code. The resource-level timestamp is indeed difficult to achieve due to the current implementation.
Since the resource timestamp will contain the app health timestamp, I feel it may not be that useful. How about we skip it at the resource level? Just have it at the application health level, i.e. .status.health
. If you change the timestamp to a pointer type, I think we can ignore it at resource level and it won't show up in .status.resources
.
controller/health.go
Outdated
if persistResourceHealth { | ||
appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime | ||
} else { | ||
appHealth.LastTransitionTime = now | ||
} | ||
} else if healthStatus.Status == health.HealthStatusHealthy { | ||
appHealth.LastTransitionTime = lastTransitionTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to handle case where status transition happens from Non Healthy -> Healthy. Since we have access to old status values, would it make sense to set the time value outside the for
loop after the new status of application is calculated? Something like
diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
if health.IsWorse(appHealth.Status, healthStatus.Status) {
appHealth.Status = healthStatus.Status
- if persistResourceHealth {
- appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
- } else {
- appHealth.LastTransitionTime = now
- }
- } else if healthStatus.Status == health.HealthStatusHealthy {
- appHealth.LastTransitionTime = lastTransitionTime
}
}
+ if persistResourceHealth {
+ if app.Status.Health.Status == appHealth.Status {
+ appHealth.LastTransitionTime = lastTransitionTime
+ } else {
+ appHealth.LastTransitionTime = metav1.Now()
+ }
+ }
if persistResourceHealth {
app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
} else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed to repro the issue. Unfortunately it persists with your suggestion. I'll look into it deeper tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added another test to appcontroller_test.go
where the issue you're describing doesn't happen. I think the code works as expected, but maybe you can have another look :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran a manual test by creating a pod with delayed start. The time is updated for Missing->Progressing
state change but it doesn't get updated when state moves from Progressing->Healthy
.
timestamp-no-update.mov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. Let me have a closer look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some more extensive testing. Currently they break because I couldn't find a way to make them work with the way we're using mocks. Any suggestions on how to make TestUpdateHealthStatusProgression
work are welcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed to repro the issue. Unfortunately it persists with your suggestion.
Hmm, I haven't been able to repro this after applying @svghadi's suggestion.
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:06Z","status":"Progressing"}
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:31Z","status":"Healthy"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkieweg I added some changes which should fix your tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hiddeco wrote in #16972 that this doesn't quite account for all edge cases.
While I am super supportive of this being added, I have an edge case that I think falls into this category but is not covered by just this addition.
Within Argo CD, the sync operations are completely separated from the application health. However, in certain scenarios, you do want to know if the current state of the health check is the outcome of a previous sync operation.
With the proposal, this would become possible by comparing the .finishedAt from the .status.operationState against the .lastUpdateTime. This, however, only works when you sync workload resources, which means that it will not help in scenarios where you only have non-workload resources, which will always end up Healthy without causing a transition.
...
I wonder if it would be an idea to introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.
By modifying @svghadi's suggestion slightly I think that would be achievable by doing this (haven't tested this so caveat emptor):
diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
if health.IsWorse(appHealth.Status, healthStatus.Status) {
appHealth.Status = healthStatus.Status
- if persistResourceHealth {
- appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
- } else {
- appHealth.LastTransitionTime = now
- }
- } else if healthStatus.Status == health.HealthStatusHealthy {
- appHealth.LastTransitionTime = lastTransitionTime
}
}
if persistResourceHealth {
+ if app.Status.Health.Status == appHealth.Status && time.Since(lastTransitionTime.Time) < 30*time.Second {
+ appHealth.LastTransitionTime = lastTransitionTime
+ } else {
+ appHealth.LastTransitionTime = metav1.Now()
+ }
app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
} else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion however would be to have the heartbeat timestamp as a separate field (as e.g. "last observed" or "heartbeat"), as otherwise it could create a false impression that a transition did take place.
8d4030b
to
1be8574
Compare
Hi @mkieweg, did you get a chance to work on this again? If you're busy with other tasks, I'm happy to jump in and continue with it. |
Hi @svghadi, sorry I had to pause working on this due to shifting priorities. I was planning on picking this back up this week. I'm not sure how much time I can actually commit, so if you're happy to take over that would be okay with me |
@svghadi I can also help out if needed |
@mkieweg - Sure. Could I get collaborator access to your argo-cd fork, or should I cherry-pick the changes and continue in a new PR? @blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR? |
@svghadi I've added you as a contributor to my fork. Can do the same for @blakepettersson if needed |
Ideally I'd like to see his use case addressed in this PR as well, since we have a particular use case in mind for it (it's related to Kargo). |
Sounds good to me, thanks! |
Hi @blakepettersson , I was thinking about how to incorporate @hiddeco 's use case into this PR. It seems a bit challenging without introducing a new field, as @mkieweg's use case requires a timestamp that doesn't update periodically.
In contrast, @hiddeco 's use case needs a periodically updating status timestamp.
Here are my thoughts:
The health status would look something like this: status:
health:
lastTransitionTime: "2024-09-03T08:51:32Z"
observedAt: "2024-09-03T08:51:32Z" # configurable
status: Progressing Thoughts/suggestions? |
Hi @svghadi, I agree that we'll need two separate fields, one for the health status and one to indicate "freshness".
👍
I think it's fine for now to keep it at max 30 secs - if there is a need to configure this at a later stage we can address that in another PR. |
e62a31f
to
1036899
Compare
1036899
to
f83469b
Compare
Hi @blakepettersson, the PR is ready for review. Changes:
The screen grab below shows that Screen.Recording.2024-09-12.at.6.53.06.PM.mov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (pending the minor lint fixes), thanks @svghadi!
f83469b
to
bfe15ee
Compare
0375ccb
to
b4fd9fa
Compare
controller/appcontroller.go
Outdated
} | ||
|
||
// We are interested only in refreshing the status, not performing a resource-level comparison. | ||
ctrl.requestAppRefresh(newApp.QualifiedName(), ComparisonWithNothing.Pointer(), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to perhaps only do the refresh while a sync is taking place for the app? As it stands now, this will refresh every 30 seconds throughout the app's lifecycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry, I am not quite following. Are you suggesting updating the .status.health.observedAt
field only when the app is syncing? The refresh done here(every 30s) uses ComparisonWithNothing
option, and AFAIU, it just rebuilds the app's resource tree without doing any manifest level comparison. However, I’m not sure how expensive that operation is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting updating the
.status.health.observedAt
field only when the app is syncing?
Yes exactly 😄
However, I’m not sure how expensive that operation is.
For a single app it's likely negligible, but @crenshaw-dev raised the concern that if there were say 10k apps, every 30 secs there would be 10k requests hammering the K8s API server (updating the .status.health.observedAt
field). It's unclear if that is an actual concern or not, but something to keep in mind. The hope is to sidestep that concern by only performing this refresh during app sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a fair concern.
The hope is to sidestep that concern by only performing this refresh during app sync.
I believe we can achieve this without significant changes to the current implementation by removing the new AddEventHandlerWithResyncPeriod
event handler. However, without a periodic refresh, I am not sure how much value this new field provides. Existing fields already provide the same information that .status.health.observedAt
would, if updated only during syncs.
Another approach could be to enable updates to this field during app sync by default and put the periodic update behind a flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without a periodic refresh, I am not sure how much value this new field provides
My understanding of the 30s refresh is that it's just a "safety net." My assumption would be that the health is constantly being refreshed based changes in the cluster cache and that the 30s refresh would only catch updates that were somehow missed by the cluster cache, a case which I'd expect to be rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I read above that the intent is to bump the observedAt
timestamp to prove that the health has been evaluated more recently than the last sync operation.
I worry about accruing tech debt (with a feature behind a flag) or incurring a performance penalty (with a feature without a flag) to provide a service answering the question "has this app been evaluated to be healthy since the app was synced?" There are already a few ways to do this:
- Wait a reasonable amount of time after the app has synced to observe the health (makes it very likely the health was evaluated)
- Request an app refresh and then observe the health (guarantees the health was evaluated)
- Add a post-sync resource hook to the app (ties the sync success/failure to resource health)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From both the perspective of a user and as an engineer trying to reliably extract information from ArgoCD based on the status it reflects, I do not see how any of these solve the problem reliably or without introducing new actions to take (defeating the whole purpose of the status of a Kubernetes object?).
They feel more like workarounds for something that's an observability issue at its core due to design choices made in both the API and controller, which my suggestion was attempting to address. If the primary concern is API pressure, I am happy with any alternative that would be updated frequently even if no transition occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure... I think it's nice for Argo CD to produce a fairly-recent timestamp for the most recent observed health. I just wanna make sure it's worth the performance and maintenance costs as compared to other options. Do we have some sense for a typical comparison_with_nothing_ms
distribution on a fairly-busy Argo instance? Is there reason to believe we might need the heartbeat time to be configurable or to include jitter? I'm in favor of spending CPU cycles on useful stuff, I just think we should first have a clear sense for how many CPU cycles we're spending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, looks like that checkpoint was only added in 2.13, so I don't have any stats from Intuit.
I'm not super worried about it tbh. I'd just recommend trying it out on some relatively busy Argo instance and watching the effect on both the k8s API and Argo controller performance.
Hi @blakepettersson , did you get a chance to look into @crenshaw-dev suggestion on performance test? Should we consider splitting the PR to unblock the other feature which doesn't need performance testing. |
Hi @svghadi sadly no-one in our team has had the time to do so - I would support splitting this PR into two parts so we could at least move forward with the least contentious part, and then we will need to find some time to do performance testing of the |
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com> Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Due to implementation limitations, setting LastTransitionTime at the resource level is challenging. Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known use case right now. Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
b4fd9fa
to
eb80af0
Compare
Hi @blakepettersson, @crenshaw-dev, I have removed the concerning |
hs.status = "Degraded" | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to override statuses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can use a modernized way of specifying healthchecks (not that it matters here), e.g. https://argo-cd.readthedocs.io/en/latest/operator-manual/health/#argocd-app
for _, tc := range testCases { | ||
t.Run(tc.name, func(t *testing.T) { | ||
deployment.SetLabels(map[string]string{"status": tc.initialStatus}) | ||
ctrl.processAppRefreshQueueItem() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to trigger it manually instead of requesting app refresh?
This PR contains an implementation of the enhancement proposal #16972.
Closes #16972
Checklist: