-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not requeue on successful revision reconcile #2503
Conversation
This is a fairly substantial change in the package revision controller behavior, but should drastically reduce the number of calls we make to the API Server. We previously reconciled package revisions every minute after a successful reconcile, but we now opt for only requeing if a reconcile is not successful. This means that installed resources could drift from what is specified in the package, but the odds of this causing negative impact on an environment are somewhat mininal. We also update to watch owned Deployments for a ProviderRevision as we need to be notified immediately if a provider Deployment is unhealthy. This does have the consequence that we will update all installed resources on a regular cadence in the case that a provider Deployment remains unhealthy. That being said, an unhealthy provider Deployment should call for urgent intervention, either by the controller or the cluster administrator. A potential negative side-effect of not regularly reconciling ProviderRevisions is that an updated ControllerConfig will not flow through to the provider Deployment within a minute, which was the previous guarantee. That being said, any update to the Provider itself will trigger a reconcile, which would cause ControllerConfig changes to also be picked up. Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates all package revision tests to not requeue in the event that we successfully reconcile. Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some general concerns around the fact that we will still hit the API server a bunch of times in the event that the Deployment
is unhealthy or a case where we consistently had issues with "Object has been modified" errors -- i.e. I don't love that we have to check all of the installed resources for a package revision every time we reconcile as it means any reconcile is fairly expensive (depending on the number of resources in the package). That being said, I am also not a fan of conditionally checking resources on reconcile because I want to preserve our guarantee that we won't create a Deployment
for a Provider
unless all resources can be controlled, and conditionally checking resources would mean there could be a case in which we don't check, but also may create or update a Deployment
.
@hasheddan Isn't it an option to watch ControllerConfig objects as well to fix this?
In such a case, would we hit API server with exponential backoff? If that is the case, I wouldn't be concerned too much given it will be less frequent than the current case. |
@turkenh good point -- we could and use a predicate to only enqueue for
We actually won't hit backoff here because we aren't returning an error. We could opt to return an error, but would likely want to tune backoff because we don't want to retry a bunch of times immediately and reach max backoff in the case that a |
Adds an event handler for enqueing requests for providerrevision when an event is observed on a controllerconfig it references. Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
@@ -0,0 +1,87 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @negz for basically writing all of this code for me :)
(copied with slight modifications from EnqueueRequestForAllRevisionsWithRequests
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@turkenh updated here if you want to take another look 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍
Updates providerrevision controller to watch for events on controllerconfigs. Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Fixes a small typo in the providerrevision controllerconfig handler docstring. Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @hasheddan , I think the tradeoff makes sense here. IMO, it's fine to rely on apiserver eventing for resources that do not get actively changed by the user, like managed resources or composition.
// EnqueueRequestForReferencingProviderRevisions enqueues a request for all | ||
// provider revisions that reference a ControllerConfig when the given | ||
// ControllerConfig changes. | ||
type EnqueueRequestForReferencingProviderRevisions struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can come up with something like this (maybe one that accepts add
function?) which would allow us not duplicate code but it's definitely non-blocking.
Backport failed for Please cherry-pick the changes locally. git fetch origin release-1.2
git worktree add -d .worktree/backport-2503-to-release-1.2 origin/release-1.2
cd .worktree/backport-2503-to-release-1.2
git checkout -b backport-2503-to-release-1.2
ancref=$(git merge-base 19571ad84bdbe15e36409449c4927275c05803ba f3cdf6af356bda9f59040687669a31da5a756c81)
git cherry-pick -x $ancref..f3cdf6af356bda9f59040687669a31da5a756c81 |
Backport failed for Please cherry-pick the changes locally. git fetch origin release-1.1
git worktree add -d .worktree/backport-2503-to-release-1.1 origin/release-1.1
cd .worktree/backport-2503-to-release-1.1
git checkout -b backport-2503-to-release-1.1
ancref=$(git merge-base 19571ad84bdbe15e36409449c4927275c05803ba f3cdf6af356bda9f59040687669a31da5a756c81)
git cherry-pick -x $ancref..f3cdf6af356bda9f59040687669a31da5a756c81 |
Successfully created backport PR #2509 for |
Description of your changes
This is a fairly substantial change in the package revision controller
behavior, but should drastically reduce the number of calls we make to
the API Server. We previously reconciled package revisions every minute
after a successful reconcile, but we now opt for only requeing if a
reconcile is not successful. This means that installed resources could
drift from what is specified in the package, but the odds of this
causing negative impact on an environment are somewhat mininal.
We also update to watch owned Deployments for a ProviderRevision as we
need to be notified immediately if a provider Deployment is unhealthy.
This does have the consequence that we will update all installed
resources on a regular cadence in the case that a provider Deployment
remains unhealthy. That being said, an unhealthy provider Deployment
should call for urgent intervention, either by the controller or the
cluster administrator.
A potential negative side-effect of not regularly reconcilingProviderRevisions is that an updated ControllerConfig will not flow
through to the provider Deployment within a minute, which was the
previous guarantee. That being said, any update to the Provider itself
will trigger a reconcile, which would cause ControllerConfig changes to
also be picked up.
Update: we are now enqueing reconciles for ProviderRevisions when
an event on a ControllerConfig it references is observed.
Signed-off-by: hasheddan georgedanielmangum@gmail.com
Fixes #2449
Note: I could go either way on this being backported. Typically I would not want to backport a behavior change, but this feels like it could be considered a "bug fix" given that the excessive API calls are causing issues for folks. Because I don't foresee if having negative impact on any existing users, I have opted to backport.
I have:
make reviewable
to ensure this PR is ready for review.backport release-x.y
labels to auto-backport this PR if necessary.How has this code been tested
Tested this by installing
provider-aws
and thegetting-started-with-gcp
Configuration. Once resources were established andDeployment
was healthy, I observed no further reconciles taking place. I also observed that when aDeployment
became unhealthy, we reconcile on our short wait cadence (i.e.30s
) until it becomes healthy again, at which point we cease requeing.