
Static provisioning requires nonsensical managed resources #28

Closed
negz opened this issue Sep 18, 2019 · 10 comments · Fixed by #31 or #55
Labels: bug (Something isn't working), good first issue (Good for newcomers)

negz commented Sep 18, 2019

What happened?

Changes were recently made to our resource claim controller watch predicates in #18 and #24. The latter refactored the watch predicates used by resource claim controllers with the intention of supporting both statically and dynamically provisioned managed resources. It does this by accepting resources that satisfy any of three watch predicates, applied in the following order (see the sketch after this list):

  1. resource.HasManagedResourceReferenceKind accepts resource claims with a .spec.resourceRef to the kind of managed resource the reconciler is concerned with.
  2. resource.HasDirectClassReferenceKind accepts managed resources with a .spec.classRef to the kind of non-portable class the reconciler is concerned with.
  3. resource.HasIndirectClassReferenceKind accepts resource claims with a .spec.classRef to a portable class with a .classRef to the kind of non-portable class the reconciler is concerned with.
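
To be clear about what "satisfy any of" means here: this composition boils down to OR-ing predicate functions together. A minimal, self-contained sketch of the idea (the names below are illustrative, not the exact crossplane-runtime signatures):

import "k8s.io/apimachinery/pkg/runtime"

// PredicateFn returns true if the supplied object should be reconciled.
type PredicateFn func(obj runtime.Object) bool

// AnyOf accepts objects that satisfy at least one of the supplied predicates.
func AnyOf(fns ...PredicateFn) PredicateFn {
	return func(obj runtime.Object) bool {
		for _, fn := range fns {
			if fn(obj) {
				return true
			}
		}
		return false
	}
}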

In retrospect, resource.HasDirectClassReferenceKind only makes sense in a dynamic provisioning context, because statically provisioned managed resources don't use a resource class and thus typically don't set their .spec.classRef. Based on my reading of the current predicate logic (I have not confirmed this), I expect our resource claim reconcilers would never notice changes to statically provisioned managed resources.

This is not too big an issue given the resource claim reconciler only watches managed resources to wait for them to become bindable, which they typically do before the resource claim is created in a static provisioning scenario. I'm guessing that's why I didn't encounter this case when I tested static provisioning per #24 (comment).

How can we reproduce it?

  1. Create a resource claim that references a managed resource that does not yet exist.
  2. Create the referenced managed resource.

My guess is that the claim will never bind.

There are two workarounds for this:

  1. Create the managed resource and wait for its .status.bindingPhase to become Unbound before creating a resource claim that references it.
  2. Set the managed resource's .spec.classRef.kind and .spec.classRef.apiVersion to its corresponding non-portable resource class kind.

What environment did it happen in?

crossplane-runtime version: 26a458d

negz added the bug label Sep 18, 2019
negz commented Sep 18, 2019

We can fix this pretty easily by deprecating resource.HasDirectClassReferenceKind and replacing it with a new resource.IsManagedKind predicate that accepts all resources of the supplied kind:

func IsManagedKind(k ManagedKind, ot runtime.ObjectTyper) PredicateFn {
	return func(obj runtime.Object) bool {
		// We probably want a GetKind variant of MustGetKind that returns
	// an error instead of panicking.
		gvk, err := runtime.GetKind(obj, ot)
		if err != nil {
			return false
		}
		return gvk == schema.GroupVersionKind(k)
	}
}
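
For illustration, the GetKind variant wished for in the comment above might look roughly like this sketch, built only on apimachinery's ObjectTyper; it isn't part of crossplane-runtime at the time of writing:

import (
	"errors"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// GetKind returns the GroupVersionKind of the supplied object, or an error if
// the object does not map to exactly one versioned kind.
func GetKind(obj runtime.Object, ot runtime.ObjectTyper) (schema.GroupVersionKind, error) {
	gvks, unversioned, err := ot.ObjectKinds(obj)
	if err != nil {
		return schema.GroupVersionKind{}, err
	}
	if unversioned || len(gvks) != 1 {
		return schema.GroupVersionKind{}, errors.New("object did not map to exactly one versioned kind")
	}
	return gvks[0], nil
}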

negz commented Oct 1, 2019

I'd like to reopen this issue in the spirit of integrating early. I've raised crossplane/crossplane#860 to update the developer guide to match the change. @hasheddan did you happen to test whether this fix works as expected per the reproduction steps in this issue?

P.S. I noticed during the doc update that IsManagedKind takes its arguments as kind, scheme while the existing HasIndirectClassReferenceKind takes its arguments as scheme, kinds. I don't think it's worth fixing, but I do wish I could unsee the lack of symmetry. 😭

negz reopened this Oct 1, 2019
hasheddan commented
@negz it appears that the claim stays in the "Managed claim is waiting for managed resource to become bindable" state and never binds to the managed resource.

Managed Resource:

  Binding Phase:  Unbound
  Conditions:
    Last Transition Time:  2019-10-01T19:41:05Z
    Reason:                Managed resource is available for use
    Status:                True
    Type:                  Ready
    Last Transition Time:  2019-10-01T19:38:05Z
    Reason:                Successfully reconciled managed resource
    Status:                True
    Type:                  Synced

Claim:

Conditions:
    Last Transition Time:  2019-10-01T19:38:02Z
    Reason:                Managed claim is waiting for managed resource to become bindable
    Status:                False
    Type:                  Ready
    Last Transition Time:  2019-10-01T19:38:02Z
    Reason:                Successfully reconciled managed resource
    Status:                True
    Type:                  Synced

negz commented Oct 1, 2019

@hasheddan Thanks for testing that out!

There are three scenarios we need to make sure the claim reconciler handles:

  1. Dynamic provisioning. No managed resource exists. A resource claim is created with a portable class reference and without a managed resource reference. In this case the claim reconciler should dynamically provision and then bind to a managed resource.
  2. Resource-first static provisioning. A managed resource is created explicitly, without using a resource class. Later (or simultaneously), a resource claim that references (via its resource reference) that managed resource is created. We should skip dynamic provisioning and go straight to binding.
  3. Claim-first static provisioning. A resource claim is created that references (via its resource reference) a managed resource that does not yet exist. We should fail to reconcile that resource claim until such time as the referenced managed resource is created (out of band) and becomes bindable, at which point we should bind to it.

I'd like to be sure scenarios 1 and 2 still work as expected, because I think they're more common and important than case 3. As a lower priority, we should figure out why case 3 isn't working and fix that.
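
To make that concrete, here's a minimal sketch of the branching the claim reconciler needs in order to cover all three scenarios (a hypothetical helper, not the actual ClaimReconciler code):

// decide maps the three scenarios above onto the action the reconciler
// should take. It is purely illustrative.
func decide(hasResourceRef, managedExists, managedBindable bool) string {
	switch {
	case !hasResourceRef:
		// Scenario 1: no resource reference, so dynamically provision a
		// managed resource from the referenced class, then bind to it.
		return "provision and bind"
	case managedExists && managedBindable:
		// Scenario 2, and scenario 3 once the referenced resource has been
		// created out of band and become bindable: skip provisioning and bind.
		return "bind"
	default:
		// Scenario 3 before the referenced resource exists or is bindable:
		// do nothing yet and requeue after a short wait.
		return "requeue"
	}
}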

negz commented Oct 1, 2019

FWIW - based only on following the reconciler code and not actually testing it - I'd expect scenario three to result in a nil pointer exception in the claim reconciler:

  1. We get the claim.
  2. We try to get the referenced managed resource and find that it does not exist. We ignore this error and proceed.
  3. We determine that the managed resource was not created, so we try to create it. This involves getting the name of the claim's portable class reference, which is nil, so I'd expect a panic here.

My theory is that instead of using IgnoreNotFound at claim reconciler line 305, we should handle NotFound by requeueing after a short wait. This way we'd only attempt to dynamically provision a managed resource if the resource claim did not already explicitly reference one.
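
Roughly, the handling I have in mind looks like the following sketch (made-up helper and variable names, not a diff against the actual reconciler):

import (
	"time"

	kerrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueIfNotFound shows the proposed handling of the Get error: when the
// referenced managed resource is not found, requeue after a short wait
// instead of ignoring the error and falling through to dynamic provisioning.
func requeueIfNotFound(err error, wait time.Duration) (reconcile.Result, error) {
	if kerrors.IsNotFound(err) {
		return reconcile.Result{RequeueAfter: wait}, nil
	}
	return reconcile.Result{}, err
}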

negz commented Oct 1, 2019

Another possible issue with static provisioning is that as far as I can tell we never actually set the managed resource's claim reference during the static provisioning codepath (we only set it when we dynamically provision new managed resources). The EnqueueRequestForClaim watch handler will only enqueue reconciles for managed resources that have a claim reference set, i.e. only dynamically provisioned managed resources. :( This might explain the behaviour you're seeing.
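
For illustration, the handler behaviour I'm describing amounts to something like this sketch (hypothetical names and a minimal stand-in interface, not the real EnqueueRequestForClaim code):

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// claimReferrer is a minimal stand-in for a managed resource that exposes its
// claim reference.
type claimReferrer interface {
	GetClaimReference() *corev1.ObjectReference
}

// claimRequestFor only produces a reconcile request for a managed resource's
// claim when the claim reference is set, so statically provisioned resources
// that never had it set will never requeue their claim.
func claimRequestFor(mg claimReferrer) []reconcile.Request {
	ref := mg.GetClaimReference()
	if ref == nil {
		return nil
	}
	return []reconcile.Request{{NamespacedName: types.NamespacedName{
		Namespace: ref.Namespace,
		Name:      ref.Name,
	}}}
}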

negz self-assigned this Oct 8, 2019
negz commented Oct 8, 2019

It sounds like there's a chance static provisioning is currently broken in master - I'm assigning this to myself to investigate and fix before we ship 0.4.

negz commented Oct 24, 2019

Confirming that static provisioning seems broken. When I submit the following (at the same time):

---
apiVersion: database.gcp.crossplane.io/v1beta1
kind: CloudsqlInstance
metadata:
  name: example
spec:
  writeConnectionSecretToRef:
    namespace: crossplane-system
    name: cloudsqlinstance-example
  forProvider:
    databaseVersion: POSTGRES_9_6
    region: us-west2
    settings:
      tier: db-custom-1-3840
      dataDiskType: PD_SSD
      dataDiskSizeGb: 10
  providerRef:
    name: example
  reclaimPolicy: Delete
---
apiVersion: database.crossplane.io/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: app-postgresql
spec:
  resourceRef:
    apiVersion: database.gcp.crossplane.io/v1beta1
    kind: CloudsqlInstance
    name: example
  writeConnectionSecretToRef:
    name: postgresqlconn
  engineVersion: "9.6"

I eventually see the managed resource become available for binding (i.e. binding phase Unbound):

Status:
  # ...
  Binding Phase:                    Unbound
  Conditions:
    Last Transition Time:  2019-10-24T08:15:03Z
    Reason:                Successfully resolved managed resource references to other resources
    Status:                True
    Type:                  ReferencesResolved
    Last Transition Time:  2019-10-24T08:15:09Z
    Reason:                Successfully reconciled managed resource
    Status:                True
    Type:                  Synced
    Last Transition Time:  2019-10-24T08:19:10Z
    Reason:                Managed resource is available for use
    Status:                True
    Type:                  Ready
Events:                    <none>

Meanwhile the PostgreSQLInstance says:

Status:
  Conditions:
    Last Transition Time:  2019-10-24T08:15:53Z
    Reason:                Managed claim is waiting for managed resource to become bindable
    Status:                False
    Type:                  Ready
    Last Transition Time:  2019-10-24T08:15:53Z
    Reason:                Successfully reconciled managed resource
    Status:                True
    Type:                  Synced

negz commented Oct 24, 2019

However, static binding does work if you:

  1. Wait for the managed resource's binding phase to reach Unbound.
  2. Submit the resource claim.

In this case, with the exact same managed resource, I see the following on the claim:

Status:
  Binding Phase:  Bound
  Conditions:
    Last Transition Time:  2019-10-24T08:26:15Z
    Reason:                Managed resource is available for use
    Status:                True
    Type:                  Ready
    Last Transition Time:  2019-10-24T08:26:15Z
    Reason:                Successfully reconciled managed resource
    Status:                True
    Type:                  Synced

negz commented Oct 24, 2019

"I'd expect scenario three to result in a nil pointer exception in the claim reconciler"

Yuuup. If I submit the same resource claim before the managed resource exists I see:

E1024 01:30:33.462044   16337 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/runtime/runtime.go:76
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/runtime/runtime.go:65
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/runtime/runtime.go:51
/usr/local/Cellar/go@1.12/1.12.10/libexec/src/runtime/panic.go:522
/usr/local/Cellar/go@1.12/1.12.10/libexec/src/runtime/panic.go:82
/usr/local/Cellar/go@1.12/1.12.10/libexec/src/runtime/signal_unix.go:390
/Users/negz/control/go/pkg/mod/github.com/crossplaneio/crossplane-runtime@v0.0.0-20191023215652-0f37bea5496e/pkg/resource/claim_binding_reconciler.go:348
/Users/negz/control/go/pkg/mod/github.com/crossplaneio/crossplane-runtime@v0.0.0-20191023215652-0f37bea5496e/pkg/resource/claim_binding_reconciler.go:348
/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216
/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192
/Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
/Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
/usr/local/Cellar/go@1.12/1.12.10/libexec/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1f1b66e]

goroutine 712 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/runtime/runtime.go:58 +0x105
panic(0x248a6a0, 0x38f45d0)
        /usr/local/Cellar/go@1.12/1.12.10/libexec/src/runtime/panic.go:522 +0x1b5
github.com/crossplaneio/crossplane-runtime/pkg/meta.NamespacedNameOf(...)
        /Users/negz/control/go/pkg/mod/github.com/crossplaneio/crossplane-runtime@v0.0.0-20191023215652-0f37bea5496e/pkg/meta/meta.go:90
github.com/crossplaneio/crossplane-runtime/pkg/resource.(*ClaimReconciler).Reconcile(0xc000286000, 0xc000040be9, 0x7, 0xc000040bb0, 0xe, 0x390ee00, 0x0, 0x0, 0x0)
        /Users/negz/control/go/pkg/mod/github.com/crossplaneio/crossplane-runtime@v0.0.0-20191023215652-0f37bea5496e/pkg/resource/claim_binding_reconciler.go:348 +0x242e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0002b4000, 0x24f2620, 0xc000c10420, 0x24f2600)
        /Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:216 +0x146
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0002b4000, 0xc000116e00)
        /Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:192 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0002b4000)
        /Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00085c060)
        /Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00085c060, 0x3b9aca00, 0x0, 0x1, 0xc0001381e0)
        /Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00085c060, 0x3b9aca00, 0xc0001381e0)
        /Users/negz/control/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /Users/negz/control/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0/pkg/internal/controller/controller.go:157 +0x311
