Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not requeue on successful revision reconcile #2503

Merged
merged 5 commits into from
Aug 20, 2021

Conversation

hasheddan
Copy link
Member

@hasheddan hasheddan commented Aug 19, 2021

Description of your changes

This is a fairly substantial change in the package revision controller
behavior, but should drastically reduce the number of calls we make to
the API Server. We previously reconciled package revisions every minute
after a successful reconcile, but we now opt for only requeing if a
reconcile is not successful. This means that installed resources could
drift from what is specified in the package, but the odds of this
causing negative impact on an environment are somewhat mininal.

We also update to watch owned Deployments for a ProviderRevision as we
need to be notified immediately if a provider Deployment is unhealthy.
This does have the consequence that we will update all installed
resources on a regular cadence in the case that a provider Deployment
remains unhealthy. That being said, an unhealthy provider Deployment
should call for urgent intervention, either by the controller or the
cluster administrator.

A potential negative side-effect of not regularly reconciling
ProviderRevisions is that an updated ControllerConfig will not flow
through to the provider Deployment within a minute, which was the
previous guarantee. That being said, any update to the Provider itself
will trigger a reconcile, which would cause ControllerConfig changes to
also be picked up.

Update: we are now enqueing reconciles for ProviderRevisions when
an event on a ControllerConfig it references is observed.

Signed-off-by: hasheddan georgedanielmangum@gmail.com

Fixes #2449

Note: I could go either way on this being backported. Typically I would not want to backport a behavior change, but this feels like it could be considered a "bug fix" given that the excessive API calls are causing issues for folks. Because I don't foresee if having negative impact on any existing users, I have opted to backport.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested this by installing provider-aws and the getting-started-with-gcp Configuration. Once resources were established and Deployment was healthy, I observed no further reconciles taking place. I also observed that when a Deployment became unhealthy, we reconcile on our short wait cadence (i.e. 30s) until it becomes healthy again, at which point we cease requeing.

This is a fairly substantial change in the package revision controller
behavior, but should drastically reduce the number of calls we make to
the API Server. We previously reconciled package revisions every minute
after a successful reconcile, but we now opt for only requeing if a
reconcile is not successful. This means that installed resources could
drift from what is specified in the package, but the odds of this
causing negative impact on an environment are somewhat mininal.

We also update to watch owned Deployments for a ProviderRevision as we
need to be notified immediately if a provider Deployment is unhealthy.
This does have the consequence that we will update all installed
resources on a regular cadence in the case that a provider Deployment
remains unhealthy. That being said, an unhealthy provider Deployment
should call for urgent intervention, either by the controller or the
cluster administrator.

A potential negative side-effect of not regularly reconciling
ProviderRevisions is that an updated ControllerConfig will not flow
through to the provider Deployment within a minute, which was the
previous guarantee. That being said, any update to the Provider itself
will trigger a reconcile, which would cause ControllerConfig changes to
also be picked up.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates all package revision tests to not requeue in the event that we
successfully reconcile.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Copy link
Member Author

@hasheddan hasheddan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some general concerns around the fact that we will still hit the API server a bunch of times in the event that the Deployment is unhealthy or a case where we consistently had issues with "Object has been modified" errors -- i.e. I don't love that we have to check all of the installed resources for a package revision every time we reconcile as it means any reconcile is fairly expensive (depending on the number of resources in the package). That being said, I am also not a fan of conditionally checking resources on reconcile because I want to preserve our guarantee that we won't create a Deployment for a Provider unless all resources can be controlled, and conditionally checking resources would mean there could be a case in which we don't check, but also may create or update a Deployment.

@turkenh
Copy link
Member

turkenh commented Aug 19, 2021

A potential negative side-effect of not regularly reconciling
ProviderRevisions is that an updated ControllerConfig will not flow
through to the provider Deployment within a minute, which was the
previous guarantee. That being said, any update to the Provider itself
will trigger a reconcile, which would cause ControllerConfig changes to
also be picked up.

@hasheddan Isn't it an option to watch ControllerConfig objects as well to fix this?

I have some general concerns around the fact that we will still hit the API server a bunch of times in the event that the Deployment is unhealthy or a case where we consistently had issues with "Object has been modified" errors -- i.e. I don't love that we have to check all of the installed resources for a package revision every time we reconcile as it means any reconcile is fairly expensive (depending on the number of resources in the package). That being said, I am also not a fan of conditionally checking resources on reconcile because I want to preserve our guarantee that we won't create a Deployment for a Provider unless all resources can be controlled, and conditionally checking resources would mean there could be a case in which we don't check, but also may create or update a Deployment.

In such a case, would we hit API server with exponential backoff? If that is the case, I wouldn't be concerned too much given it will be less frequent than the current case.

@hasheddan
Copy link
Member Author

Isn't it an option to watch ControllerConfig objects as well to fix this?

@turkenh good point -- we could and use a predicate to only enqueue for ProviderConfig that reference a given ControllerConfig -- Will update 👍🏻

In such a case, would we hit API server with exponential backoff? If that is the case, I wouldn't be concerned too much given it will be less frequent than the current case.

We actually won't hit backoff here because we aren't returning an error. We could opt to return an error, but would likely want to tune backoff because we don't want to retry a bunch of times immediately and reach max backoff in the case that a Deployment is restarting a Pod or something. I agree that this change is an improvement over status quo though. I am planning on looking into the backoff we use across all controllers as a separate effort.

Adds an event handler for enqueing requests for providerrevision when an
event is observed on a controllerconfig it references.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
@@ -0,0 +1,87 @@
/*
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @negz for basically writing all of this code for me :)

(copied with slight modifications from EnqueueRequestForAllRevisionsWithRequests)

Copy link
Member Author

@hasheddan hasheddan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turkenh updated here if you want to take another look 👍🏻

Copy link
Member

@turkenh turkenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍

internal/controller/pkg/revision/watch.go Outdated Show resolved Hide resolved
Updates providerrevision controller to watch for events on
controllerconfigs.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Fixes a small typo in the providerrevision controllerconfig handler
docstring.

Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @hasheddan , I think the tradeoff makes sense here. IMO, it's fine to rely on apiserver eventing for resources that do not get actively changed by the user, like managed resources or composition.

// EnqueueRequestForReferencingProviderRevisions enqueues a request for all
// provider revisions that reference a ControllerConfig when the given
// ControllerConfig changes.
type EnqueueRequestForReferencingProviderRevisions struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can come up with something like this (maybe one that accepts add function?) which would allow us not duplicate code but it's definitely non-blocking.

@hasheddan hasheddan merged commit 45a7678 into crossplane:master Aug 20, 2021
@github-actions
Copy link

Backport failed for release-1.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin release-1.2
git worktree add -d .worktree/backport-2503-to-release-1.2 origin/release-1.2
cd .worktree/backport-2503-to-release-1.2
git checkout -b backport-2503-to-release-1.2
ancref=$(git merge-base 19571ad84bdbe15e36409449c4927275c05803ba f3cdf6af356bda9f59040687669a31da5a756c81)
git cherry-pick -x $ancref..f3cdf6af356bda9f59040687669a31da5a756c81

@github-actions
Copy link

Backport failed for release-1.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin release-1.1
git worktree add -d .worktree/backport-2503-to-release-1.1 origin/release-1.1
cd .worktree/backport-2503-to-release-1.1
git checkout -b backport-2503-to-release-1.1
ancref=$(git merge-base 19571ad84bdbe15e36409449c4927275c05803ba f3cdf6af356bda9f59040687669a31da5a756c81)
git cherry-pick -x $ancref..f3cdf6af356bda9f59040687669a31da5a756c81

@github-actions
Copy link

Successfully created backport PR #2509 for release-1.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Too many update calls for resource customresourcedefinitions
3 participants