
CloudSQLInstance stuck in FAILED if using private connectivity and Connection does not exist #84

Closed
hasheddan opened this issue Nov 5, 2019 · 7 comments · Fixed by #89
Assignees
Labels
bug Something isn't working

Comments

@hasheddan
Member

What happened?

When creating a GlobalAddress, Connection, and CloudSQLInstance, I noticed the CloudSQLInstance would enter a FAILED state and never come out of it. The error was:

Failed to create subnetwork. Please create Service Networking connection with service 'servicenetworking.googleapis.com' from consumer project '283222062215' network 'argo' again.

Upon reaching this state, the instance had to be deleted and recreated because it was observed as existing but could not be updated.

How can we reproduce it?

  1. Create a GlobalAddress, a Connection (that references it), and a CloudSQLInstance with secure connectivity enabled, all at the same time in the same network.
  2. The GlobalAddress should become available almost immediately, but the Connection will wait a short time before creation because its first reconciliation will result in unresolved references.
  3. The CloudSQLInstance will begin creation because it does not reference either the GlobalAddress or the Connection, but it will fail if it reaches the subnetwork creation step before the Connection is created, because it will be unable to peer the network.

Note: The need to delete and recreate the resource in this scenario is not a major problem because it is somewhat unlikely that this scenario will be exercised that frequently, and the instance will never be created so there is no risk of losing data. However, it does somewhat hamper the immediate bootstrap of a full environment that includes a database.

What environment did it happen in?

Crossplane version: v0.4.0
stack-gcp version: v0.2.0

@hasheddan added the bug label on Nov 5, 2019
@jbw976
Member

jbw976 commented Nov 6, 2019

it is somewhat unlikely that this scenario will be exercised that frequently

Could you clarify how the scenario described in this issue differs from the mainline scenarios that our user guides walk people through?

@muvaf
Member

muvaf commented Nov 6, 2019

Could you clarify how the scenario described in this issue differs from the mainline scenarios that our user guides walk people through?

As long as you allow the Connection resource to be created before the CloudSQL resource, it's fine. Previously, network resources were created separately from the claims, so there was a short period in between and we didn't see this problem. But now if you just do kubectl apply -f <dir> and that dir includes both MySQLInstance and Connection resources, you might end up in this situation.

There are two things here:

  • For whatever reason, the GCP API doesn't return an error for the Insert call in this case, but when you go to the console you see the error given above and a creation failure. This needs more investigation, as there might be classes of errors that we have to account for specifically.
  • GCP explicitly tells you Please create Service Networking connection with service 'servicenetworking.googleapis.com' from consumer project '283222062215' network 'argo' again., but doing that doesn't help. CloudSQL doesn't reconcile on its own even if you create the connection as they suggest. You have to delete the CloudSQL instance and create a new one.

For the first problem, we probably need more investigation of the CloudSQL API. It might be that this is not treated as an error but rather as a warning. Nevertheless, our CR should show it.
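
For illustration only, here is a minimal sketch of what surfacing that state on the CR could look like; the types below are simplified stand-ins, not the actual stack-gcp or crossplane-runtime types:

// Rough sketch: simplified stand-ins for the Cloud SQL Admin API response
// and a Crossplane status condition, not the actual stack-gcp types.
package sketch

// databaseInstance stands in for the Cloud SQL Admin API's instance
// representation; the real API reports states such as RUNNABLE and FAILED.
type databaseInstance struct {
    State string
}

// condition is a simplified stand-in for a status condition on the CR.
type condition struct {
    Type    string
    Status  string
    Message string
}

// readyCondition maps the reported instance state to a condition, so a
// FAILED instance is visible on the CR instead of only in the console.
func readyCondition(in databaseInstance) condition {
    if in.State == "FAILED" {
        return condition{
            Type:    "Ready",
            Status:  "False",
            Message: "CloudSQL instance is in state FAILED",
        }
    }
    return condition{Type: "Ready", Status: "True"}
}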

For the second one, what we could do is have a reference that exists only to block creation. For instance, CloudSQL would have a connectionRef, but it would be used only to wait for the referenced object to be created, not to resolve a value used in the CloudSQL spec. cc'ing @negz

@hasheddan
Member Author

For the second one, what we could do is have a reference that exists only to block creation.

This was my initial thought as well. However, after talking with @negz, he expressed some concerns around building a dependency graph in an eventually consistent system. We may have to get creative with how we solve this issue, and maybe the first pass does just look like a connectionRef that resolves without setting a configuration value on the CloudSQLInstance. These kinds of scenarios make me glad that we now have the forProvider section of the spec, because I like the thought of having "depends on" references in the spec but not in forProvider.
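
To illustrate the split (a hypothetical sketch; the type and field contents are illustrative, not the actual stack-gcp API):

// Hypothetical sketch of a "blocking only" reference: connectionRef sits in
// the spec next to forProvider, so it gates creation without resolving to a
// value inside the provider configuration.
package sketch

// ConnectionReference names the Connection whose existence the
// CloudSQLInstance would wait for before creation proceeds.
type ConnectionReference struct {
    Name string `json:"name"`
}

// CloudSQLInstanceParameters elides the real provider fields for brevity.
type CloudSQLInstanceParameters struct{}

// CloudSQLInstanceSpec sketches the split: provider configuration stays
// under ForProvider, while the blocking reference lives beside it.
type CloudSQLInstanceSpec struct {
    ForProvider   CloudSQLInstanceParameters `json:"forProvider"`
    ConnectionRef *ConnectionReference       `json:"connectionRef,omitempty"`
}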

But now if you just do kubectl apply -f <dir> and that dir includes both MySQLInstance and Connection resources, you might end up in this situation.

@jbw976 To expound on what @muvaf said here: the creation in the Services guides is sequential, so this is not an issue there, and in the Stacks guide I am guessing it is not an issue because the CloudSQLInstance waits for the Network to resolve via this reference, and it must be the common case that this introduces enough of a delay that the Connection is ready.

@negz
Member

negz commented Nov 6, 2019

it is somewhat unlikely that this scenario will be exercised that frequently

Could you clarify how the scenario described in this issue differs from the mainline scenarios that our user guides walk people through?

@jbw976 Frankly I'm pretty skeptical that the scenario our guides walk folks through is a use case that exists in the real world. To clarify:

  • I believe using GitOps to manage network infrastructure is a real scenario
  • I believe using GitOps to manage CloudSQL resources is a real scenario
  • I believe the way our guides frame all of this (create everything at once) is a powerful demonstration

...but I strongly suspect folks wanting to create a VPC (typically the base network primitive that all other infrastructure lives in) in the exact same step as the infrastructure that runs in that VPC is a rare case, at best. It's certainly not a pattern I have ever encountered during my time in SRE. More commonly I would imagine creating the network infrastructure (a rare event) and creating infrastructure like databases that live in that network (a much more common event) would be distinct actions and thus either distinct stages in a GitOps pipeline, or distinct pipelines altogether.

I do think it's worth looking for a fix for this issue, but I strongly suggest that we weigh the complexity and effort required to fix it against how potentially serious this shortcoming is to folks trying to use Crossplane in the real world.

@negz
Member

negz commented Nov 6, 2019

So two options come to mind here. Neither would bulletproof this experience, but both would reduce the likelihood of a user encountering it. If a CloudSQLInstance references a Network, the network isn't "ready" as far as the CloudSQLInstance is concerned unless it has an active service connection. So:

  1. We can infer that a service connection exists by inspecting the Network managed resource's peerings, which are already reflected in its status (see below). So the GetStatus phase of reference resolution could block and only consider the Network ready when its Ready condition is True and it has a peering named servicenetworking-googleapis-com.
  2. We can determine that a Crossplane-managed service connection exists by (again during GetStatus) listing all Connection resources (at most one exists per Network, so I can't imagine there would ever be many) and determining whether any of them reference the Network that the CloudSQLInstance uses.
status:
    peerings:
    - autoCreateRoutes: true
      exchangeSubnetRoutes: true
      name: cloudsql-mysql-googleapis-com
      network: https://www.googleapis.com/compute/v1/projects/speckle-umbrella-30/global/networks/cloud-sql-network-283222062215-b2a39e7ea055b996
      state: ACTIVE
      stateDetails: '[2019-11-06T13:04:01.073-08:00]: Connected.'
    - autoCreateRoutes: true
      exchangeSubnetRoutes: true
      name: servicenetworking-googleapis-com
      network: https://www.googleapis.com/compute/v1/projects/c064626fb0df236dc-tp/global/networks/servicenetworking
      state: ACTIVE
      stateDetails: '[2019-11-06T13:03:51.961-08:00]: Connected.'
    routingConfig:
      routingMode: REGIONAL
    selfLink: https://www.googleapis.com/compute/v1/projects/REDACTED/global/networks/my-cool-network

My inclination is to go with option 1, because it's easier to do, and better supports the case in which you manage your Network with Crossplane, but not your service networking connection.
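
For illustration, the check option 1 implies could look roughly like this; it is a self-contained sketch with stand-in types mirroring the Network status fields shown above, not the actual reference resolution code:

// Rough sketch of option 1: the Network is only considered ready for a
// CloudSQLInstance once it has an ACTIVE peering named
// servicenetworking-googleapis-com.
package main

import "fmt"

// peering mirrors the relevant fields of the Network status above.
type peering struct {
    Name  string
    State string
}

// hasServiceNetworkingPeering reports whether an active service networking
// peering exists, i.e. whether a service connection is in place.
func hasServiceNetworkingPeering(peerings []peering) bool {
    for _, p := range peerings {
        if p.Name == "servicenetworking-googleapis-com" && p.State == "ACTIVE" {
            return true
        }
    }
    return false
}

func main() {
    peerings := []peering{
        {Name: "cloudsql-mysql-googleapis-com", State: "ACTIVE"},
        {Name: "servicenetworking-googleapis-com", State: "ACTIVE"},
    }
    fmt.Println(hasServiceNetworkingPeering(peerings)) // true
}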

@negz self-assigned this on Nov 7, 2019
@muvaf
Member

muvaf commented Nov 7, 2019

the GetStatus phase of reference resolution could block and only consider the Network ready when its Ready condition is True and it has a peering named servicenetworking-googleapis-com.

Option 1 makes sense to me, too. If the name is constant, having it show up on the CR doesn't make a lot of sense. So, I like that option more than the one that adds a reference just to block. Handling the dependency implicitly in the NetworkURIReferencerForCloudSQLInstance sounds right.

Though this is still a bug on the GCP side. They say to try creating the connection and we eventually do, but CloudSQL doesn't reconcile and it doesn't return an error on creation. I hope there isn't a class of errors that show up only in the console like this one.

@negz
Member

negz commented Nov 7, 2019

I hope there isn't a class of errors that show up only in the console like this one.

FWIW I think this does show up on the API too - at least the CloudSQL instance transitions to state FAILED and remains there.
