Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for two different kinds of consistency issues #283

Merged
merged 2 commits into from
Sep 3, 2021

Conversation

negz
Copy link
Member

@negz negz commented Aug 31, 2021

Description of your changes

Fixes crossplane-contrib/provider-aws#802
Closes #279
Closes #280

This PR is intended to address two issues that we diagnosed while investigating crossplane-contrib/provider-aws#802. It addresses feedback from and supercedes #279 and #280. Both issues are known to cause 'leaked' and duplicate managed resources. I have only been able to personally reproduce the second issue but we have evidence of the first happening in the wild.

The first issue is that controller-runtime does not guarantee reads from cache will return the freshest version of a resource. It's possible we could create an external resource in one reconcile, then shortly after trigger another in which it appears that the managed resource was never created because we didn't record its external-name. This only affects the subset of managed resources with non-deterministic external-names that are assigned during creation.

The second issue is that some external APIs are eventually consistent. A newly created external resource may take some time before our ExternalClient's observe call can confirm it exists. AWS EC2 is an example of one such API.

This PR attempts to address the first issue by making an Update to a managed resource immediately before Create it called. This Update call will be rejected by the API server if the managed resource we read from cache was not the latest version.

It attempts to address the second issue by allowing managed resource controller authors to configure an optional grace period that begins when an external resource is successfully created. During this grace period we'll requeue and keep waiting if Observe determines that the external resource doesn't exist, rather than (re)creating it.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

https://gist.github.com/negz/e1f2e74f18802d15440214a1a1abc981

I've run the script from the above gist for about 12 hours against provider-aws built with this PR and observed zero leaked resources.

@negz negz requested review from chlunde, muvaf and turkenh August 31, 2021 07:12
This commit is intended to address two issues that we diagnosed while
investigating crossplane-contrib/provider-aws#802.

The first issue is that controller-runtime does not guarantee reads from cache
will return the freshest version of a resource. It's possible we could create an
external resource in one reconcile, then shortly after trigger another in which
it appears that the managed resource was never created because we didn't record
its external-name. This only affects the subset of managed resources with
non-deterministic external-names that are assigned during creation.

The second issue is that some external APIs are eventually consistent. A newly
created external resource may take some time before our ExternalClient's observe
call can confirm it exists. AWS EC2 is an example of one such API.

This commit attempts to address the first issue by making an Update to a managed
resource immediately before Create it called. This Update call will be rejected
by the API server if the managed resource we read from cache was not the latest
version.

It attempts to address the second issue by allowing managed resource controller
authors to configure an optional grace period that begins when an external
resource is successfully created. During this grace period we'll requeue and
keep waiting if Observe determines that the external resource doesn't exist,
rather than (re)creating it.

Signed-off-by: Nic Cope <negz@rk0n.org>
@negz negz marked this pull request as ready for review September 1, 2021 22:55
@negz
Copy link
Member Author

negz commented Sep 2, 2021

2021-09-01T13:18:18.202Z        DEBUG   provider-aws    cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed {"controller": "managed/routetable.ec2.aws.crossplane.io", "request": "/test-2-ff7wr", "uid": "1afdaa6a-82ec-4dfd-9d44-4f269f9d5eaa", "version": "444726", "external-name": ""}

Interestingly I did see the above case once in the logs from my 12-13 hours of testing. It seems to have resolved itself (the RouteTable was cleaned up successfully. I'm currently experiencing some login issues so I can't actually confirm whether or not the RouteTable was leaked, but I imagine it wasn't because if it was it would have blocked deletion of the VPC managed resource it existed in.

Copy link
Member

@muvaf muvaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this hard problem with the least code change necessary in providers!

pkg/reconciler/managed/reconciler.go Outdated Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Outdated Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Outdated Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Show resolved Hide resolved
pkg/meta/meta.go Show resolved Hide resolved
pkg/reconciler/managed/reconciler.go Show resolved Hide resolved
The retry logic we use to persist critical annotations makes it difficult to
delete an annotation without potentially also deleting annotations added by
another controller (e.g. the composition logic). This commit therefore changes
the way we detect whether we might have created an external resource but not
recorded the result. Previously we relied on the presence of the 'pending'
annotation to detect this state. Now we check whether the 'pending' annotation
is newer than any 'succeeded' or 'failed' annotation.

Signed-off-by: Nic Cope <negz@rk0n.org>
@negz negz merged commit 1d64e22 into crossplane:master Sep 3, 2021
@negz negz deleted the creat branch September 3, 2021 20:16
@chlunde
Copy link

chlunde commented Sep 3, 2021

Nice, looks good!

@negz
Copy link
Member Author

negz commented Sep 7, 2021

/backport

@negz
Copy link
Member Author

negz commented Sep 7, 2021

/backport

(Now that the commands workflow actually exists in this repo.)

@github-actions
Copy link

github-actions bot commented Sep 7, 2021

Successfully created backport PR #288 for release-0.15.

@negz
Copy link
Member Author

negz commented Sep 7, 2021

/backport

This time for the v0.13 and v0.14 releases.

@github-actions
Copy link

github-actions bot commented Sep 7, 2021

Backport failed for release-0.13, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin release-0.13
git worktree add -d .worktree/backport-283-to-release-0.13 origin/release-0.13
cd .worktree/backport-283-to-release-0.13
git checkout -b backport-283-to-release-0.13
ancref=$(git merge-base 589072e678f8d9b4db515bbc7c7f1e129305d6ce 8e780ecd6d30f0a5e024f047311c3ededa915b8d)
git cherry-pick -x $ancref..8e780ecd6d30f0a5e024f047311c3ededa915b8d

@github-actions
Copy link

github-actions bot commented Sep 7, 2021

Successfully created backport PR #289 for release-0.14.

negz added a commit to negz/docs that referenced this pull request Jan 30, 2024
These annotations were introduced in crossplane/crossplane-runtime#283.

Per crossplane/crossplane#3037 folks find
these annotations hard to reason about. That's understandable, because
they're doing a lot of subtle things.

This section ended up super long, but I think this is an area where
folks really need to understand what's happening in order to make good
decisions when Crossplane refuses to proceed.

Signed-off-by: Nic Cope <nicc@rk0n.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leaked RouteTable due to eventually consistent DescribeRouteTables API
3 participants