Cache /.well-known/databricks-config lookups in the CLI#5011
Open
simonfaltum wants to merge 25 commits intomainfrom
Open
Cache /.well-known/databricks-config lookups in the CLI#5011simonfaltum wants to merge 25 commits intomainfrom
simonfaltum wants to merge 25 commits intomainfrom
Conversation
Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only one /.well-known/databricks-config GET: the first populates the on-disk cache, the second reads from it. Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI invocations now, so recorded request logs drop duplicate GETs and debug output shows the new host-metadata cache keys. Silenced SDK warnings on failed well-known fetches (the resolver returns nil,nil) also remove a couple of Warn lines from auth test outputs. Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported NewResolver(fetch) that takes an injected fetch function. Attach stays as a one-liner convenience. Unit tests for the caching logic no longer need httptest servers or PAT-authed configs; one integration test retains the end-to-end SDK wiring. Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache wraps the miss flow, which now probes negative and falls through to fetch only on true miss. Context cancellation and deadline errors are no longer written to the negative cache because they say nothing about the host's long-term availability. Regenerates cache/telemetry acceptance outputs — the synthetic negative-cache probe no longer runs on cache hits. Co-authored-by: Isaac
Approval status: pending
|
GetSanitizedVersion replaces + with - in build version metadata for filesystem safety, but the [DEV_VERSION] replacement regex only covered the + form. Cache paths use the sanitized form, so telemetry tests failed across machines with different git HEAD SHAs. Regex now accepts either + or - before the SHA suffix. Co-authored-by: Isaac
os.Stat on a missing cache file returns an OS-specific error message (Unix: "no such file or directory"; Windows: "The system cannot find the file specified."), causing acceptance-test goldens to diverge between platforms. The error is also pure noise — the follow-up "cache miss, computing" line conveys the same information. Drop the log for fs.ErrNotExist; keep it for genuine stat failures (permissions, corruption). Co-authored-by: Isaac
github-merge-queue bot
pushed a commit
to databricks/databricks-sdk-go
that referenced
this pull request
Apr 20, 2026
## Changes [PR #1572](#1572) added `Config.HostMetadataResolver` so callers could override the SDK's `/.well-known/databricks-config` fetch on a per-Config basis. That covers "I have one Config and I want to wrap it." The gap: programs that construct many Configs across their command surface (e.g. the Databricks CLI) end up copying the same `cfg.HostMetadataResolver = ...` assignment at every construction site, in the CLI roughly 10 sites across 7 files plus a guardrail test to catch drift. This PR adds a package-level default consulted when a Config has no explicit resolver set. Callers set a factory once during startup; every subsequent Config gets the same resolver without per-site wiring. The Config-level field still takes precedence, so PR #1572's contract is unchanged. ### API ```go // config/host_metadata.go var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver ``` Plain public variable, set once at init. Matches the stdlib pattern for single-default hooks: `http.DefaultClient`, `http.DefaultTransport`, `log.Default`. Callers needing per-Config or dynamic behaviour should use `Config.HostMetadataResolver` instead. ### Resolution order inside `Config.EnsureResolved` 1. If `Config.HostMetadataResolver` is set, use it. 2. Else, if `DefaultHostMetadataResolverFactory` is non-nil, invoke it with the resolving Config and use its return value. If it returns nil, fall through. 3. Else, SDK's default HTTP fetch (unchanged behavior for all existing callers). ## How the Databricks CLI will use this The canonical Go idiom for "library A registers itself with library B" is a blank import that triggers an `init()` in A. This is how `database/sql` drivers (`_ "github.com/lib/pq"`), image codecs (`_ "image/png"`), and encoding formats register themselves. After this PR lands and is bumped into the CLI, [CLI PR #5011](databricks/cli#5011) will collapse from ~10 wired-in `hostmetadata.Attach(cfg)` calls + a guardrail test down to two small pieces: **`repos/cli/libs/hostmetadata/resolver.go`** — set the caching factory at package init: ```go func init() { config.DefaultHostMetadataResolverFactory = func(cfg *config.Config) config.HostMetadataResolver { return NewResolver(cfg.DefaultHostMetadataResolver()) } } ``` **`repos/cli/cmd/databricks/main.go`** — one blank import to pull the package in at startup: ```go import ( // Registers a disk-cached HostMetadataResolver with the SDK so every // Config the CLI constructs reuses the cached /.well-known lookup. _ "github.com/databricks/cli/libs/hostmetadata" ) ``` That's the full integration. Every Config the CLI creates, now and in the future from any new command a developer adds, automatically gets caching. No per-site `Attach` call to remember, no guardrail test to maintain, no new developer ever has to learn this mechanism exists to benefit from it. ### Experimental Marked experimental to match the existing `HostMetadataResolver` field. No default behavior change for callers that never set `DefaultHostMetadataResolverFactory`. ## Tests Three new tests in `config/config_test.go`, each using a small `withDefaultHostMetadataResolverFactory(t, factory)` helper that captures and restores the prior value, so tests never clobber each other via the package-level default: - Factory is invoked when Config has no resolver; back-fill works end-to-end. - Config-level resolver takes precedence (factory not consulted). - Factory returning nil falls through to the SDK's HTTP fetch. - `make fmt test lint` clean - `go test ./config/... -count=1 -race` clean Signed-off-by: simon <simon.faltum@databricks.com> --------- Signed-off-by: simon <simon.faltum@databricks.com> Co-authored-by: Renaud Hartert <renaud.hartert@databricks.com>
SDK v0.128.0 (databricks/databricks-sdk-go#1636) adds config.DefaultHostMetadataResolverFactory so a package can install a single hook that every Config picks up on EnsureResolved, without per-site wiring. Replaces ten hostmetadata.Attach(cfg) call sites across seven files and the injection guardrail test with two pieces: - libs/hostmetadata/resolver.go: init() sets config.DefaultHostMetadataResolverFactory to wrap cfg.DefaultHostMetadataResolver() in the caching resolver. - main.go: blank import of libs/hostmetadata triggers that init() at startup so every *config.Config the CLI constructs picks up the cached lookup automatically. Co-authored-by: Isaac
bfe2d05 to
fae41c6
Compare
TestFactory_EndToEnd_CacheHitSkipsSDKFetch already covers the same case: if the init() factory weren't installed, the second EnsureResolved would hit the server (2 fetches, not 1) and that test would fail. Co-authored-by: Isaac
…ext on disk Two issues in the host-metadata resolver flagged by codex: 1. The negative-cache probe used GetOrCompute with a sentinel errNotCached value in the compute callback. That tripped the cache's "error while computing" debug log and local.cache.error telemetry metric on every positive-cache miss — even though the miss itself is not an error. Adds cache.Get[T] as a read-only lookup that never computes or writes, and uses it for the negative probe. Positive writes still go through GetOrCompute so concurrent resolves are still serialized by the cache mutex. 2. The negative sentinel persisted raw err.Error() to disk under Message, which was only read back into a debug log. Network errors can contain proxy URLs, internal hostnames, and other environment-sensitive text. Drop the Message field; only the existence of the sentinel matters. Regenerates acceptance outputs that captured the now-gone "error while computing: not cached" debug line. Co-authored-by: Isaac
- Drop TestNewResolver_DifferentHosts_SeparateEntries: the assertion that two hosts get separate cache entries just restates the fingerprint's Host-keyed design. - Collapse the four-line context.Background() comment into the existing //nolint line; drop the one-line hostFingerprint comment that restated the struct; shorten the negativeSentinel comment. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Every CLI command (
databricks auth profiles,bundle validate, every workspace or account call) goes throughConfig.EnsureResolved, which triggers an unauthenticated GET to{host}/.well-known/databricks-configto populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.Changes
Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under
~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except forcontext.Canceled/context.DeadlineExceeded, which are transient and never cached).The integration hooks into SDK
v0.128.0'sconfig.DefaultHostMetadataResolverFactory(added in databricks/databricks-sdk-go#1636) via two pieces:libs/hostmetadata/resolver.go:init()registers a factory that wrapscfg.DefaultHostMetadataResolver()in the caching resolver.NewResolver(fetch)remains the unit-testable primary API.main.go: a blank import oflibs/hostmetadatatriggers thatinit()at startup, so every*config.Configthe CLI constructs now and in the future picks up the cached lookup automatically. No per-site wiring, no guardrail test.Positive cache wraps the miss path, so the hit path is a single disk read; negative cache is only consulted when positive misses.
internal/testutil/env.gopinsDATABRICKS_CACHE_DIRto a temp dir in test cleanup so tests don't leak cache files intoHOME.Collateral cleanups
libs/cache/file_cache.go: drop thefailed to stat cache filedebug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line,cache miss, computing, conveys the same info) and its OS-specific error text diverged between Unix (no such file or directory) and Windows (The system cannot find the file specified.), breaking cross-platform acceptance goldens. Genuine stat failures (permission, corruption) still log.libs/testdiff/replacement.go:devVersionRegexnow accepts either+SHAor-SHAafter0.0.0-dev.build.GetSanitizedVersion()swaps+to-for filesystem safety when the version is used in cache paths, and the old regex only covered the+form.Test plan
make checkscleanmake lintclean (0 issues)go test ./libs/hostmetadata/... -raceclean (factory-installed assertion + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration)go test ./cmd/root/... ./bundle/config/... ./cmd/auth/... ./libs/auth/... -racecleanacceptance/auth/host-metadata-cache/asserts exactly ONE/.well-known/databricks-configGET across twoauth profilesinvocations sharing aDATABRICKS_CACHE_DIRout.requests.txt(caching works), new[Local Cache]debug lines in cache/telemetry tests, twoWarn: Failed to resolve host metadatalines removed (intentional: the resolver returns(nil, nil)on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)Live validation against dogfood (from previous push)
Built locally and ran
databricks -p e2-dogfood current-user mewith and without a warm cache:DATABRICKS_CACHE_DIR)cache miss, computing->GET /.well-known/databricks-config->computed and stored result[Local Cache] cache hitlineNet per-command savings: ~700ms, matching the Why.