-
Notifications
You must be signed in to change notification settings - Fork 939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
One Pager: Resource Status Improvements #5453
Conversation
@TerjeLafton Thanks for opening this. I'll do a review pass ASAP - probably later this week at this stage. |
Just a friendly reminder, @negz, in case it got lost in other tasks 😄 |
@TerjeLafton Apologies, I got pretty sidetracked with Kubecon EU last week. This is near the top of my todo list. |
@TerjeLafton Here's a quick brain dump of some things that I'd like to see discussed/considered in this design. Apologies if some of this is already covered, this is a bit write-only until I get time to load up all the context again:
|
Thank you, the points you mentioned are very insightful! I'll just reply to your thoughts as a reply now, but please
I did look into Crossplane has quite good conditions implemented already, so adopting
As conditions are part of the CRD's API, would this be considered a breaking change? It would probably be a very minor one, but I know some of the systems we have build would need changing to support this. Implementing this would comply with my wish for the
As stated in the issue, by you and others, I see The users should primarily be focused on conditions, and we should give them everything they need to understand the status of their resource here. The
Initially I didn't consider XRs part of this, as their status is only based on the status of their referenced resources. With the addition of composition functions, I am even responsible for updating their condition myself using your But if we consider adopting
For public clouds, I think some of their resource APIs might indicate this. I know for example the VirtualNetworkGateway in Azure has a status field which indicates if it is But for most APIs, I think it would be impossible to indicate from their API fields if they are For me, a successful call to |
UpToDate -->|No| Update | ||
``` | ||
|
||
#### Solution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The canonical Kube way is to have observedGeneration
as proposed for Crossplane in crossplane/crossplane-runtime#633. If it does not match the generation in metadata, the reconcile has not finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantage here: it reduces writes when an update is requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what is the cost of a write? I imagine it is quite small, especially with updates which in my experience is one of the rarer operations.
My point is that I think it is worth it to have more precise statuses that can fully represent the state of the resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool @TerjeLafton, thank you for taking the time to think through this scenario and write down your thoughts in this proposal. That's really helpful for community members to articulate and drive the changes that are important to them. Thank you! 🙇♂️
I've thought about this more deeply this morning, and while I may not have some of the specific implementation details for crossplane-runtime's reconciler, I do feel like I've formed an opinion on this topic.
In general, I would agree with the high level problem statement, which I would summarize as "How do we know when a managed resource exactly matches its desired state?". A new status condition could possibly be added that captures this need. To majorly simplify the purpose of each status condition, I think they would look like:
Ready
: resource is ready to use / availableSynced
: we were able to apply the desired to the systemUpToDate
: the observed state of the system matches the desired state we applied
The observedGeneration
changes in crossplane/crossplane-runtime#633 are going to be helpful, but I don't think they fully cover this UpToDate
scenario you've been pursuing in this PR. IIRC, observedGeneration
is more about "we saw up to these desired changes and we have acted on them", as described in kstatus
:
If the generation and the observedGeneration of a resource does not match, it means there are changes that the controller has not yet seen, and therefore not acted upon.
I don't think that says anything about "and now the observed state of the external system matches that desired state", especially for systems that take a long time to update themselves internally to the state we requested. As we all seem to agree on, external APIs from say the cloud providers generally won't tell us this state directly for all their resources, so the best we can likely do is "does observed actually match desired"? This 3rd condition would mean exactly that and probably be useful.
That is what I would support, as opposed to changing the meaning of Ready
.
A few thoughts about kstatus
:
- it could definitely be useful for us to be conformant with a broader standard in use by others in the ecosystem
- we'll be closer to being able to support it with Add ability to expose resource reconciliation progress crossplane-runtime#633 (comment)
- It would probably be a requirement to do it in an "additive" way, as opposed to changing any meaning of our long standing conditions with breaking changes, e.g. our existing conditions don't go away or change.
``` | ||
|
||
#### Solution | ||
Implement an `Updating` status, mimicking the `Creating` status. Setting this status every time an update request either succeeds or fails, informing a user that an update request has been requested, but we have not verified if the resource is compliant with the updated desired state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we were to add such a status, I think according to the API conventions, it would be better to not surface as a transient state:
Condition type names should describe the current observed state of the resource, rather than describing the current state transitions. This typically means that the name should be an adjective ("Ready", "OutOfDisk") or a past-tense verb ("Succeeded", "Failed") rather than a present-tense verb ("Deploying").
Perhaps it would be better to expose it as UpToDate
or something that is an adjective or past-tense verb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this part too, and I agree.
As I was basing my proposal on updating the existing Ready
condition, and I wanted it to be more about showing that it wasn't ready yet, I went with Updating
, so it would match the Creating
reason.
If the solution ends up being creating another condition, then I don`t mind following the API conventions linked.
#### Solution | ||
Implement an `Updating` status, mimicking the `Creating` status. Setting this status every time an update request either succeeds or fails, informing a user that an update request has been requested, but we have not verified if the resource is compliant with the updated desired state. | ||
|
||
This also requires updating the reconciler to immediately requeue a reconcile after a successfuly requested update. This ensures that an updated resource is immediatly checked for compliance, to give user quick feedback on if the resource is `Available` again after the update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected this to already be the case - when we perform an apply/update to an external resource, that we would requeue, because we know there's likely to be an external change soon in response to our update request.
If we're not doing that, then I would support a code change so we reconcile again soon after applying a change to the external system, instead of waiting until the next poll internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! This was my main motivation for looking into this, and it started with this issue. A suggested fix was also proposed in this PR.
I do believe there are reasons for why the reconciler wasn't designed to reconcile after an update though, having the amount of reconciles in mind. If we assume that a successful update request is enough to say that this resource is ready and compliant, then we don't need to reconcile.
My view has always been that this isn't optimal, and that a successful observe of the resource and checking that it matches our desired state is the only way to really say it is ready for use. An extra reconcile seems worth it to me.
|
||
This also requires updating the reconciler to immediately requeue a reconcile after a successfuly requested update. This ensures that an updated resource is immediatly checked for compliance, to give user quick feedback on if the resource is `Available` again after the update. | ||
|
||
Note that this would technically change the definition of the `Ready` condition. However, this definition is not very well documented, so I am not sure how many know about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should change the semantics or meaning of our long standing Ready
condition. Way too many folks have made (codified) assumptions about the meaning across the ecosystem, that it's not an option IMO to change its meaning. As you already mentioned above from the API conventions:
Once defined, the meaning of a Condition can not be changed arbitrarily - it becomes part of the API, and has the same backwards- and forwards-compatibility concerns of any other part of the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally get what you mean! As you pointed out, I commented on this myself, so it was always on my mind.
You know much better than me what others in the community have created, so if you think this is too much of a breaking change, I can understand that.
But do you think there are that many that have created things with a strong assumption that Ready
is not about desired state matching observed state? My point is just that, while I agree this would be a breaking change, I think it might be worth looking into if it would actually break anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure people have yes - unfortunately it's impossible to ever know that fully within our ecosystem and it's many thousands of adopters. We can never hear from everyone and therefore we always have to stick with the fundamental principle of no breaking changes or changes to semantics in anything that is v1 API.
#### Solution | ||
Modify the existing condition helper functions in Crossplane to optionally accept a string argument intended for the `Message` field of a condition. This would empower provider authors to provide human-readable summaries when setting condition states. | ||
|
||
Crossplane would retain the ability to overwrite a provider-set `Reason` for the `Ready` condition if it's incorrect. This ensures consistency within the API, while providing more nuance in status reporting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure Crossplane should take on the burden of assessing whether a provider has set an accurate Reason
. The provider has more information about the external system than crossplane-runtime does, and I suspect we should either trust the providers fully or not allow them to set a Reason
at all. Having Crossplane assess the accuracy of a provider statement feels like it would not be reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! My preferred solution would be for providers to be fully responsible for the Ready
condition. I would like there to be guidelines for which reasons you can use for a condition though, so the API is consistent across providers.
But as the crossplane-runtime does update the Ready
condition, for example with Reason: Creating
, it wouldn't work to make that change. That is just why I suggested this, but it is not the most important thing to fix.
Crossplane currently has primary control over the `Ready` condition, limiting the ability of providers to communicate detailed, human-readable status information to users via the `Message` field. While Crossplane's consistent conditions create predictable expectations for users, this generic nature leaves valuable context underutilized. Important status information for users and developers is often relegated to Kubernetes events and logs, which can be less convenient to access and interpret. | ||
|
||
#### Solution | ||
Modify the existing condition helper functions in Crossplane to optionally accept a string argument intended for the `Message` field of a condition. This would empower provider authors to provide human-readable summaries when setting condition states. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement seems to conflict with the previous "Problem" statement:
limiting the ability of providers to communicate detailed, human-readable status information to users via the
Message
field
Providers already can set the Message
field, but we should add the ability to set the Message
field? Did you perhaps mean Reason
field here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, the first sentence in the solution wasn't optimal 😄
The limitation stems from the crossplane-runtime setting the Ready
condition most of the time, without a message.
Say something went wrong when creating an external resource. I would then want to create something like, to give good feedback to my users in a human readable format.
Type: Ready
Status: False
Reason: Creating
Message: This and that went wrong, etc.
If I create this condition and set it on my MR, it would immediately be overwritten by crossplane-runtime with a similar condition, without the message.
So my intent with this was basically to respect the message field, but it could have been explained better!
Co-authored-by: Pete Lumbis <pete@upbound.io> Signed-off-by: Terje Lafton <terje@lafton.io>
Thank you for the review, @jbw976! I understand the reluctance to change the Adding an extra condition would solve the issue too. My only problem with this is that the I definitely support trying to be compliant with I am new to this, so please tell me what I should do with this PR as reviews come. Do we discuss it until we reach a point where we generally agree, and then rewrite the one-pager to match the final design decision, if any, or how does this work? |
related PR: crossplane/crossplane-runtime#633 +1 on being kstatus compliant. |
Thanks for your continued collaboration @TerjeLafton! I think this is a tricky issue to work through because it has a lot of nuances and impacts every resource in the crossplane ecosystem, which isn't a small surface area 😉
Typically if we reach alignment during a PR discussion, it's reasonable to push a commit that captures that newly aligned state. When things are still in active discussion, it's reasonable to hold off on making any updates. Try to find a balance between avoiding churning your doc and pushing work that you may need to change again. It's tough to find a perfect balance, but thanks for the efforts always! I'll share my current thoughts here (influenced by some feedback from @negz), but I still have some lingering questions that I don't have answers for personally - perhaps others in the community will be able to add their knowledge as well.
|
@TerjeLafton note the further discussion on |
I think going all in on But is there a way to be fully compliant with I'll just provide a little context for what inspired this PR in the first place, as you asked for some relevant examples. We were building a Crossplane provider for use in-house against network devices. Because of security related reasons, we couldn't let a resource that is not compliant with the desired state be seen as As I can implement custom status for conditions, my issues would be solved only be reconcileing after updates. It also seems like you agreed that this is a good change to implement here. It is mostly the discussions in the issue and PR I made that inspired me to look more into if there could be made improvements to the API to make it more complete. I tried to think of ways to improve it without breaking changes, or at least minor ones, and A minor extra comment: We also implemented a |
I just read the So to be This sounds good, and should be something we could implement quite easily. The But one concern I have is about how users experience the API directly. The API convention states that the conditions should ideally be a complete view for the user to get a good idea of the status of their resource. I feel like a complete |
Realizing I need to consume the rest of these comments and links everyone is sharing, but for what it's worth We currently use the ready status on a resource to determine whether or not we can access status properties on the resource, which usually includes some form of ID / ARN. From our perspective, a resource that is updating is still "ready" even if the update fails because its existing status properties remain valid and usable. For example we may request to expand a disk on a VM which fails, but other resources consuming that VM's ID should consider the VM ready. |
Crossplane does not currently have enough maintainers to address every issue and pull request. This pull request has been automatically marked as |
Description of your changes
This PR introduces a one-pager with some potential improvements to the current statuses and how they are set.
The idea was initially brought up in this
crossplane-runtime
issue: crossplane/crossplane-runtime#583It was then discussed further in this PR, also in
crossplane-runtime
, alongside a potential implementation to solve the issue: crossplane/crossplane-runtime#654This was a big enough change that we agreed to document it in a one-pager. I took a while to work through this thoroughly, but it is still only based on my experiences working with Crossplane at work. Any feedback or discussion is very welcome.
I have:
- [ ] Runmake reviewable
to ensure this PR is ready for review.- [ ] Added or updated unit tests.- [ ] Added or updated e2e tests.- [ ] Linked a PR or a [docs tracking issue] to [document this change].- [ ] Addedbackport release-x.y
labels to auto-backport this PR.