Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Discuss filesystem locks and the need for repeated calls to get the operation status #36

Closed
muvaf opened this issue Aug 26, 2021 · 4 comments
Assignees
Labels
alpha enhancement New feature or request

Comments

@muvaf
Copy link
Member

muvaf commented Aug 26, 2021

What problem are you facing?

During the initial implementation, in order to move fast and see our limits we deferred some of the discussion about the implementation and how we should do the bookkeeping was one of them. Currently, we manage the state and bookkeeping using filesystem. However, we also interact with Terraform's lock and state files, which makes the code complex to understand for readers and we don't really need to keep that information out of memory since the crash of the process means a restart of the Pod anyway. Additionally, there are some opportunities to merge some of the functionality in the code to get a simpler flow that is easier to read and change.

How could Terrajet help solve your problem?

I suggest that we use a global map that does the bookkeeping of the running CLI executions. Roughly, I think we need the following improvements in the initial implementation. Please feel free to add more things that we had deferred to after PoC phase. cc @ulucinar @turkenh

  • A global async-safe map that holds information about existing Terraform workspaces, like map[string]Workspace whose key is metadata.uuid.
    • Operation, which is invoked with a method on Workspace, should update the map before it starts and after it completes, similar to today.
  • Describe method on that Workspace object that will tell you the current status, i.e. creating, destroying or no-op.
    • For example, we report that the resource does not exist during creation because we need apply pipeline to be run repeatedly to check the status. But no other provider does that except some of Azure APIs, which is not user friendly at all in a controller, so we run into possibility of violating generic reconciler assumptions.
    • Today, we make controller call Create repeatedly because we need to make the same Apply call to finalize it but since Account for two different kinds of consistency issues crossplane-runtime#283 , this won't be allowed because generic reconciler won't call Create again in the grace period.
  • This map map[string]Workspace can be treated just like a cloud provider SDK client. We don't need to know Terraform CLI, pipeline/operation details or propagating status as error etc. We can use map[string]interface{} to pass down spec configuration and receive StateV4 and information about whether it's creating or destroying.
    • We could possibly remove the conversion layer completely and have the ExternalClient implementation directly call a single interface. Because if you look at it today, it's merely a shim on top of conversion layer. Since we use similar names, it makes it really hard to follow the layers.
    • When we merge the layers and there is a single global map, the only interface we'd need could be the one proposed in the design doc (except ApplyResult part, reporting current status should be the job of Describe) which is very similar to the set of functions we use in native provider ExternalClient implementations.

Note that some of the technical debt is accrued on purpose because we needed to be able to work in parallel during PoC and discover things from scratch. Before declaring terrajet as prod-ready and solidify the implementation with more tests, I think it's time to revisit the parts that we completed during that phase and refactor into a more cohesive structure because thanks to that initial effort we know what works, how Terraform behaves and the rough performance implications of the decisions.

@muvaf muvaf added the enhancement New feature or request label Aug 26, 2021
@muvaf muvaf self-assigned this Aug 26, 2021
@muvaf muvaf changed the title Discuss filesystem locsk vs memory for bookkeeping Discuss filesystem locks vs memory for bookkeeping Aug 27, 2021
@muvaf
Copy link
Member Author

muvaf commented Sep 22, 2021

With crossplane/crossplane-runtime#283 , generic reconciler doesn't call Create even if we say the resource does not exist for given grace period. I've reduced the grace period to 0 but then it re-queues immediately after every Create call until we report that resource exists since it assumes that you'd never call Create again. This causes excessive re-queueing until the resource is ready, which could take more than a minute.

Until we lift the requirement that the controller needs to call Apply again to get the results, we can't update to the new controller-runtime. Observe should be able to say that the resource exists until the creation is completed, similar to RDS DB Instance and other resources whose creation is a long-running operation.

@muvaf muvaf changed the title Discuss filesystem locks vs memory for bookkeeping Discuss filesystem locks and need for repeated calls to get the operation status Sep 22, 2021
@muvaf muvaf changed the title Discuss filesystem locks and need for repeated calls to get the operation status Discuss filesystem locks and the need for repeated calls to get the operation status Sep 22, 2021
@turkenh
Copy link
Member

turkenh commented Sep 23, 2021

I agree that we should simplify the flow considering well-known environment we know our controllers running:

  • We know that inside the controller container, no one else would try running terraform cli other than the controller process.
  • We know that if we crash, all filesystem locks would be gone just like the global in-memory map.

So, I don't see much benefit in trying to keep things synced with filesystem locks.

@muvaf
Copy link
Member Author

muvaf commented Sep 23, 2021

Since it'll affect our planning for the next sprint, I created #74 after getting the thumbs-up from @turkenh

@muvaf
Copy link
Member Author

muvaf commented Sep 24, 2021

Closing this, let's add comments to #74 for further discussion.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
alpha enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants