[ISSUE] Issue with databricks_cluster immediately after workspace creation #3452

Open

jiropardo (Contributor) opened this issue Apr 11, 2024 · 0 comments

Configuration

module "workspace" {
source = "./modules/workspace"
databricks_account_id = var.databricks_account_id
region = var.region
}

module "catalogs_binding" {
source = "./modules/catalog_bin"
workspace_id = module.workspace.workspace_id

depends_on = [ module.workspace ]

}

module "proxy" {

source = "./modules/proxy_cluster"
depends_on = [ module.workspace ]

}
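
For reference, the databricks.workspace provider alias used by the cluster resource below is presumably configured against the newly created workspace. A minimal sketch (the module output name workspace_url and the use of account credentials are assumptions, not taken from the issue):

provider "databricks" {
  alias      = "workspace"
  host       = module.workspace.workspace_url   # assumed module output
  account_id = var.databricks_account_id
  # authentication details (token, client_id/client_secret, etc.) omitted
}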

The cluster resource within the proxy_cluster module is:

resource "databricks_cluster" "git_proxy" {
autotermination_minutes = 0
aws_attributes {
ebs_volume_count = 1
ebs_volume_size = 32
first_on_demand = 1
}
cluster_name = var.git_proxy_name
custom_tags = {
"ResourceClass" = "SingleNode"
}
provider = databricks.workspace
spark_version = data.databricks_spark_version.latest_lts.id
node_type_id = data.databricks_node_type.smallest.id
num_workers = 0
spark_conf = {
"spark.databricks.cluster.profile" : "singleNode",
"spark.master" : "local[*]",
}
spark_env_vars = {
"GIT_PROXY_ENABLE_SSL_VERIFICATION" : "False"
"GIT_PROXY_HTTP_PROXY" : "git_URL"
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
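
The two data sources referenced above are not shown in the issue; a typical definition (a sketch assuming standard usage of the provider's databricks_spark_version and databricks_node_type data sources, not copied from the reporter's module) would be:

data "databricks_node_type" "smallest" {
  provider   = databricks.workspace
  local_disk = true   # pick the smallest node type with a local disk
}

data "databricks_spark_version" "latest_lts" {
  provider          = databricks.workspace
  long_term_support = true   # latest long-term-support Spark runtime
}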

Expected Behavior

The cluster should be created, or a more verbose error should be returned explaining that, even though the API reports the workspace as RUNNING, it is not yet fully operational.

Actual Behavior

The workspace was reported as RUNNING:

2024-04-10T19:35:44.640-0600 [DEBUG] provider.terraform-provider-databricks_v1.39.0: GET /api/2.0/accounts/XXX/workspaces/XXX
< HTTP/2.0 200 OK
< {
...
< "workspace_status": "RUNNING",

but the worker environment is not yet recognized, so cluster creation fails:

2024-04-10T19:36:17.232-0600 [ERROR] provider.terraform-provider-databricks_v1.39.0: Response contains error diagnostic: @module=sdk.proto diagnostic_detail="" diagnostic_severity=ERROR tf_proto_version=5.4 tf_req_id=XXX tf_rpc=ApplyResourceChange @caller=/home/runner/work/terraform-provider-databricks/terraform-provider-databricks/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 diagnostic_summary="cannot create cluster: XXX is not able to transition from TERMINATED to RUNNING: worker env WorkerEnvId(workerenv-XXXXX) not found

Note that the actual values in the logs above have been replaced with placeholders (XXX).

Steps to Reproduce

Using provider version 1.39 (terraform-provider-databricks_v1.39.0), apply a configuration that creates the workspace and then immediately creates a cluster in it, as in the configuration above.


Would you like to implement a fix?

The dependency flow below works around this behavior; the extra time spent applying the catalogs_binding module appears to be enough for the workspace's worker environment to fully propagate.

module "proxy" {

source = "./modules/proxy_cluster"

depends_on = [ module.workspace, module.catalogs_binding ] # fixes issue

}
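
A more explicit alternative (not part of the original report, just a sketch using the hashicorp/time provider's time_sleep resource) is to insert a fixed delay between workspace creation and cluster creation instead of relying on another module's apply duration:

resource "time_sleep" "wait_for_workspace" {
  depends_on      = [module.workspace]
  create_duration = "120s"   # assumed delay; long enough for the worker environment to propagate
}

module "proxy" {
  source = "./modules/proxy_cluster"

  depends_on = [time_sleep.wait_for_workspace]
}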

github-merge-queue bot pushed a commit to databricks/databricks-sdk-go that referenced this issue Apr 22, 2024
…890)

## Changes
Due to the eventually consistent nature of worker environment creation
as a part of workspace setup, certain API requests can fail when made
right after workspace creation. We have some exception handling for
these in the SDKs, but a new case has appeared: "worker env
WorkerEnvId(workerenv-XXXXX) not found" (see
databricks/terraform-provider-databricks#3452).

This PR addresses this issue. Furthermore, it moves the transient error
messages into autogeneration so that there is a single source of truth
that applies to all SDKs.

One other small change: I removed running unit tests from the
autogeneration flow. It removes a small amount of convenience (users
need to run `make test` after regenerating the SDK), but it speeds up
the devloop when iterating on code generation, and it allows the release
flow to continue to make a PR before failing. This PR can be modified as
needed to fix any test failures or compilation failures, as per usual.

## Tests
Added a unit test to cover this.

- [ ] `make test` passing
- [ ] `make fmt` applied
- [ ] relevant integration tests applied