Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource claim reconciler only propagates connection secrets at binding time #35

Closed
negz opened this issue Sep 27, 2019 · 6 comments · Fixed by #42 or crossplane/crossplane#924
Closed
Assignees
Labels
bug Something isn't working

Comments

@negz
Copy link
Member

negz commented Sep 27, 2019

What happened?

https://github.com/crossplaneio/crossplane-runtime/blob/d805043/pkg/resource/claim_reconciler.go#L401

The resource claim reconciler currently propagates connection secrets from a managed resource to the resource claim it is bound once, at binding time. This is insufficient for some managed resource kinds, including at least EKS clusters, that constantly refresh and update their connection secrets. Crossplane must notice changes to the managed resource's connection secret (even if the managed resource itself does not change), and propagate those changes to the resource claim.

This is a must-fix bug for the next stack-aws release. Currently we expect we are not able to deploy workloads to EKS clusters more than 60 minutes after their creation time.

How can we reproduce it?

  1. Dynamically provision an EKS cluster via a KubernetesCluster claim.
  2. Wait for the token granted at creation time to expire - this should take 60 minutes.
  3. Try to deploy a KubernetesApplication to the aforementioned KubernetesCluster

You should find that the Kubernetes bearer token stored in the EKS cluster connection secret has been refreshed, but the KubernetesCluster claim's connection secret's bearer token has not.

What environment did it happen in?

Crossplane version:

@negz negz added the bug Something isn't working label Sep 27, 2019
@negz
Copy link
Member Author

negz commented Sep 27, 2019

One thing to note here is that we currently only invoke the resource claim reconcile loop when the managed resource or its resource claim changes. We might need to either:

  1. Require the managed resource be changed in some way when its secret changes (pretty gross).
  2. Have the resource claim controller watch for changes to secrets owned by the managed resource it pertains to.
  3. Consider breaking connection secret propagation logic out into a separate controller?

@muvaf
Copy link
Member

muvaf commented Sep 27, 2019

Have the resource claim controller watch for changes to secrets owned by the managed resource it pertains to.

I like that idea. We can filter the secrets by label like claim-controller=kubernetes-cluster via a predicate.

BTW, there is a nice function for that which we can use in claim controller setup for managed resource watch call as well https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/builder/controller.go#L78

@negz
Copy link
Member Author

negz commented Oct 1, 2019

I'm a little hesitant about the additional complexity of having resource claim controllers (which already watch resource claims and managed resources) also watch and propagate secrets. The micro-design I have in mind goes something like:

  1. Keep the resource claim reconciler's Reconcile logic exactly as it is today, i.e. call r.managed.PropagateConnection() at binding time only.
  2. Introduce a new ManagedConnectionPropagator variant that annotates the newly created resource claim's secret with annotations like propagate-from.crossplane.io/secret-name and propagate-from.crossplane.io/secret-namespace, and the managed resource's secret with the equivalent propagate-to annotations.
  3. Introduce a new controller (probably in Crossplane core) that watches for any Secret with the propagate-from annotation and copies the data of the named secret (as long as it is annotated with consenting propagate-to values).

One concern with this design is that it would mean the Crossplane binary would need to be running in order for secrets to stay in sync. Currently from an infrastructure stack perspective the only functionality we lose the Crossplane pods are offline is non-portable resource class defaulting, which is a lot less mission critical than being able to authenticate to your infrastructure.

@negz
Copy link
Member Author

negz commented Oct 1, 2019

One concern with this design

This could be alleviated by instantiating a "secret propagation controller" per managed resource kind, which would use a predicate to filter out any secret whose controller reference wasn't the ManagedKind with which it was concerned.

@negz
Copy link
Member Author

negz commented Oct 1, 2019

Actually the diff to the current logic wouldn't be so bad. It would pretty much only involve moving the r.managed.PropagateConnection() call out of the if IsBindable block so that it gets called for any newly bound or already bound (managed, claim) tuple. The bigger concern would be the increase to the already-quite-imposing set of watches and predicates resource claim controllers need to establish.

@negz
Copy link
Member Author

negz commented Oct 9, 2019

Reopening to track documenting this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants