
Cache /.well-known/databricks-config lookups in the CLI#5011

Open
simonfaltum wants to merge 25 commits into main from simonfaltum/hostmetadata-cache

Conversation


@simonfaltum simonfaltum commented Apr 17, 2026

Why

Every CLI command (databricks auth profiles, bundle validate, every workspace or account call) goes through Config.EnsureResolved, which triggers an unauthenticated GET to {host}/.well-known/databricks-config to populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.

Changes

Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under ~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except for context.Canceled / context.DeadlineExceeded, which are transient and never cached).

The integration hooks into SDK v0.128.0's config.DefaultHostMetadataResolverFactory (added in databricks/databricks-sdk-go#1636) via two pieces:

  • libs/hostmetadata/resolver.go: init() registers a factory that wraps cfg.DefaultHostMetadataResolver() in the caching resolver. NewResolver(fetch) remains the unit-testable primary API.
  • main.go: a blank import of libs/hostmetadata triggers that init() at startup, so every *config.Config the CLI constructs now and in the future picks up the cached lookup automatically. No per-site wiring, no guardrail test.

The positive cache wraps the miss path, so the hit path is a single disk read; the negative cache is only consulted when the positive cache misses.

internal/testutil/env.go pins DATABRICKS_CACHE_DIR to a temp dir in test cleanup so tests don't leak cache files into HOME.

Collateral cleanups

  • libs/cache/file_cache.go: drop the "failed to stat cache file" debug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line, "cache miss, computing", conveys the same info), and its OS-specific error text diverged between Unix ("no such file or directory") and Windows ("The system cannot find the file specified."), breaking cross-platform acceptance goldens. Genuine stat failures (permissions, corruption) still log.
  • libs/testdiff/replacement.go: devVersionRegex now accepts either +SHA or -SHA after 0.0.0-dev. build.GetSanitizedVersion() swaps + to - for filesystem safety when the version is used in cache paths, and the old regex only covered the + form.
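The file_cache.go cleanup above hinges on distinguishing a missing file from a genuine stat failure. A minimal sketch of the Go idiom, assuming a hypothetical helper name and logging stand-in (the CLI's actual function differs):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// statForCache mimics the cleaned-up behavior: a missing file is an
// ordinary cache miss and logs nothing; any other stat error is logged.
// The function name and return shape are illustrative, not the CLI's API.
func statForCache(path string) (exists bool, err error) {
	_, statErr := os.Stat(path)
	switch {
	case statErr == nil:
		return true, nil
	case errors.Is(statErr, fs.ErrNotExist):
		// Plain miss: stay silent, so acceptance goldens never capture
		// the OS-specific "not found" message text.
		return false, nil
	default:
		// Genuine failure (permissions, corruption): surface it.
		fmt.Println("failed to stat cache file:", statErr)
		return false, statErr
	}
}
```

errors.Is matches fs.ErrNotExist through the *fs.PathError wrapping that os.Stat produces on every platform, which is what makes the check portable where string comparison was not.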

Test plan

  • make checks clean
  • make lint clean (0 issues)
  • go test ./libs/hostmetadata/... -race clean (factory-installed assertion + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration)
  • go test ./cmd/root/... ./bundle/config/... ./cmd/auth/... ./libs/auth/... -race clean
  • End-to-end acceptance test acceptance/auth/host-metadata-cache/ asserts exactly ONE /.well-known/databricks-config GET across two auth profiles invocations sharing a DATABRICKS_CACHE_DIR
  • Existing acceptance tests regenerated: fewer well-known GETs in out.requests.txt (caching works), new [Local Cache] debug lines in cache/telemetry tests, two Warn: Failed to resolve host metadata lines removed (intentional: the resolver returns (nil, nil) on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)

Live validation against dogfood (from previous push)

Built locally and ran databricks -p e2-dogfood current-user me with and without a warm cache:

| Scenario | Elapsed well-known time | Cache log output |
| --- | --- | --- |
| Cold cache (fresh DATABRICKS_CACHE_DIR) | ~713ms | "fetch cache miss, computing" -> GET /.well-known/databricks-config -> "computed and stored result" |
| Warm cache (second invocation) | ~1ms | single "[Local Cache] cache hit" line |

Net per-command savings: ~700ms, matching the Why.

Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only
one /.well-known/databricks-config GET: the first populates the on-disk
cache, the second reads from it.

Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI
invocations now, so recorded request logs drop duplicate GETs and debug
output shows the new host-metadata cache keys. Silenced SDK warnings on
failed well-known fetches (the resolver returns nil,nil) also remove a
couple of Warn lines from auth test outputs.

Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported
NewResolver(fetch) that takes an injected fetch function. Attach stays
as a one-liner convenience. Unit tests for the caching logic no longer
need httptest servers or PAT-authed configs; one integration test
retains the end-to-end SDK wiring.

Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache
wraps the miss flow, which now probes negative and falls through to
fetch only on true miss. Context cancellation and deadline errors are
no longer written to the negative cache because they say nothing about
the host's long-term availability.

Regenerates cache/telemetry acceptance outputs — the synthetic
negative-cache probe no longer runs on cache hits.

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as ready for review April 17, 2026 12:06

github-actions bot commented Apr 17, 2026

Approval status: pending

/acceptance/auth/ - needs approval

6 files changed
Suggested: @mihaimitrea-db
Also eligible: @tanmay-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @tejaskochar-db, @chrisst, @rauchy

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @denik
Also eligible: @pietern, @andrewnester, @anton-107, @shreyas-goenka, @lennartkats-db, @janniklasrose

/internal/ - needs approval

Files: internal/testutil/env.go
Suggested: @mihaimitrea-db
Also eligible: @tanmay-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @tejaskochar-db, @chrisst, @rauchy

General files (require maintainer)

23 files changed
Based on git history:

  • @denik -- recent work in ./, libs/testdiff/, acceptance/telemetry/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

GetSanitizedVersion replaces + with - in build version metadata for
filesystem safety, but the [DEV_VERSION] replacement regex only
covered the + form. Cache paths use the sanitized form, so telemetry
tests failed across machines with different git HEAD SHAs. Regex now
accepts either + or - before the SHA suffix.
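The fix in the commit above can be sketched with a simplified pattern. The exact regex and sanitizer in libs/testdiff and the build package differ; this only illustrates the `[+-]` character class that accepts both the raw and the sanitized separator:

```go
package main

import (
	"regexp"
	"strings"
)

// Illustrative pattern: match a dev version with either separator before
// the SHA suffix, e.g. "0.0.0-dev+a1b2c3d" or "0.0.0-dev-a1b2c3d".
var devVersionRegex = regexp.MustCompile(`0\.0\.0-dev[+-][0-9a-f]+`)

// sanitizeVersion mirrors the described behavior of GetSanitizedVersion:
// swap + for - so the version is safe to use in filesystem paths.
func sanitizeVersion(v string) string {
	return strings.ReplaceAll(v, "+", "-")
}
```

With the old `+`-only pattern, the sanitized form used in cache paths would not be replaced by [DEV_VERSION], so goldens diverged across machines with different git HEAD SHAs.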

Co-authored-by: Isaac
os.Stat on a missing cache file returns an OS-specific error message
(Unix: "no such file or directory"; Windows: "The system cannot find
the file specified."), causing acceptance-test goldens to diverge
between platforms. The error is also pure noise — the follow-up
"cache miss, computing" line conveys the same information. Drop the
log for fs.ErrNotExist; keep it for genuine stat failures (permissions,
corruption).

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as draft April 17, 2026 14:54
@simonfaltum simonfaltum marked this pull request as ready for review April 20, 2026 06:00
@simonfaltum simonfaltum removed the request for review from andrewnester April 20, 2026 06:01
github-merge-queue bot pushed a commit to databricks/databricks-sdk-go that referenced this pull request Apr 20, 2026
## Changes

PR #1572
added `Config.HostMetadataResolver` so callers could override the SDK's
`/.well-known/databricks-config` fetch on a per-Config basis. That
covers "I have one Config and I want to wrap it."

The gap: programs that construct many Configs across their command
surface (e.g. the Databricks CLI) end up copying the same
`cfg.HostMetadataResolver = ...` assignment at every construction site;
in the CLI that meant roughly 10 sites across 7 files, plus a guardrail
test to catch drift.

This PR adds a package-level default consulted when a Config has no
explicit resolver set. Callers set a factory once during startup; every
subsequent Config gets the same resolver without per-site wiring. The
Config-level field still takes precedence, so PR #1572's contract is
unchanged.

### API

```go
// config/host_metadata.go
var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver
```

Plain public variable, set once at init. Matches the stdlib pattern for
single-default hooks: `http.DefaultClient`, `http.DefaultTransport`,
`log.Default`. Callers needing per-Config or dynamic behaviour should
use `Config.HostMetadataResolver` instead.

### Resolution order inside `Config.EnsureResolved`

1. If `Config.HostMetadataResolver` is set, use it.
2. Else, if `DefaultHostMetadataResolverFactory` is non-nil, invoke it
with the resolving Config and use its return value. If it returns nil,
fall through.
3. Else, SDK's default HTTP fetch (unchanged behavior for all existing
callers).
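The three-step precedence above can be sketched as follows. Field and variable names follow the SDK's config package, but the types are simplified stand-ins and `pickResolver` is a hypothetical name, not the SDK's code:

```go
package main

// HostMetadataResolver is a simplified stand-in for the SDK interface.
type HostMetadataResolver interface{ kind() string }

type httpResolver struct{} // stands in for the SDK's default HTTP fetch

func (httpResolver) kind() string { return "http" }

type cachedResolver struct{} // stands in for a caller-installed caching resolver

func (cachedResolver) kind() string { return "cached" }

type Config struct {
	HostMetadataResolver HostMetadataResolver
}

var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver

func pickResolver(cfg *Config) HostMetadataResolver {
	// 1. A per-Config resolver always wins.
	if cfg.HostMetadataResolver != nil {
		return cfg.HostMetadataResolver
	}
	// 2. Otherwise consult the package-level factory, if set and non-nil result.
	if DefaultHostMetadataResolverFactory != nil {
		if r := DefaultHostMetadataResolverFactory(cfg); r != nil {
			return r
		}
	}
	// 3. Fall through to the SDK's default HTTP fetch.
	return httpResolver{}
}
```

Because step 1 short-circuits before the factory is consulted, installing a global default cannot change behavior for any caller that already sets `Config.HostMetadataResolver`.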

## How the Databricks CLI will use this

The canonical Go idiom for "library A registers itself with library B"
is a blank import that triggers an `init()` in A. This is how
`database/sql` drivers (`_ "github.com/lib/pq"`), image codecs (`_
"image/png"`), and encoding formats register themselves.

After this PR lands and is bumped into the CLI, [CLI PR
#5011](databricks/cli#5011) will collapse from
~10 wired-in `hostmetadata.Attach(cfg)` calls + a guardrail test down to
two small pieces:

**`repos/cli/libs/hostmetadata/resolver.go`** — set the caching factory
at package init:

```go
func init() {
    config.DefaultHostMetadataResolverFactory = func(cfg *config.Config) config.HostMetadataResolver {
        return NewResolver(cfg.DefaultHostMetadataResolver())
    }
}
```

**`repos/cli/cmd/databricks/main.go`** — one blank import to pull the
package in at startup:

```go
import (
    // Registers a disk-cached HostMetadataResolver with the SDK so every
    // Config the CLI constructs reuses the cached /.well-known lookup.
    _ "github.com/databricks/cli/libs/hostmetadata"
)
```

That's the full integration. Every Config the CLI creates, now and in
the future from any new command a developer adds, automatically gets
caching. No per-site `Attach` call to remember, no guardrail test to
maintain, no new developer ever has to learn this mechanism exists to
benefit from it.

### Experimental

Marked experimental to match the existing `HostMetadataResolver` field.
No default behavior change for callers that never set
`DefaultHostMetadataResolverFactory`.

## Tests

Three new tests in `config/config_test.go`, each using a small
`withDefaultHostMetadataResolverFactory(t, factory)` helper that
captures and restores the prior value, so tests never clobber each other
via the package-level default:

- Factory is invoked when Config has no resolver; back-fill works
end-to-end.
- Config-level resolver takes precedence (factory not consulted).
- Factory returning nil falls through to the SDK's HTTP fetch.

- `make fmt test lint` clean
- `go test ./config/... -count=1 -race` clean
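The capture-and-restore helper described above can be sketched like this. The real helper lives in config/config_test.go and uses t.Cleanup; this stand-alone version takes an explicit body callback instead, and the surrounding types are simplified:

```go
package main

type Config struct{}
type HostMetadataResolver interface{}

var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver

// withDefaultHostMetadataResolverFactory swaps in a factory for the
// duration of body and restores the previous value afterwards, so tests
// cannot clobber each other through the package-level default.
func withDefaultHostMetadataResolverFactory(factory func(*Config) HostMetadataResolver, body func()) {
	prev := DefaultHostMetadataResolverFactory
	DefaultHostMetadataResolverFactory = factory
	defer func() { DefaultHostMetadataResolverFactory = prev }()
	body()
}
```

Restoring the prior value (rather than resetting to nil) matters when helpers nest or when some other test has already installed its own default.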

Signed-off-by: simon <simon.faltum@databricks.com>

---------

Signed-off-by: simon <simon.faltum@databricks.com>
Co-authored-by: Renaud Hartert <renaud.hartert@databricks.com>
SDK v0.128.0 (databricks/databricks-sdk-go#1636) adds
config.DefaultHostMetadataResolverFactory so a package can install a
single hook that every Config picks up on EnsureResolved, without
per-site wiring.

Replaces ten hostmetadata.Attach(cfg) call sites across seven files
and the injection guardrail test with two pieces:

- libs/hostmetadata/resolver.go: init() sets
  config.DefaultHostMetadataResolverFactory to wrap cfg.DefaultHostMetadataResolver()
  in the caching resolver.
- main.go: blank import of libs/hostmetadata triggers that init() at
  startup so every *config.Config the CLI constructs picks up the
  cached lookup automatically.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/hostmetadata-cache branch from bfe2d05 to fae41c6 on April 21, 2026 06:50
TestFactory_EndToEnd_CacheHitSkipsSDKFetch already covers the same case:
if the init() factory weren't installed, the second EnsureResolved
would hit the server (2 fetches, not 1) and that test would fail.

Co-authored-by: Isaac

Two issues in the host-metadata resolver flagged by codex:

1. The negative-cache probe used GetOrCompute with a sentinel errNotCached
   value in the compute callback. That tripped the cache's "error while
   computing" debug log and local.cache.error telemetry metric on every
   positive-cache miss — even though the miss itself is not an error.

   Adds cache.Get[T] as a read-only lookup that never computes or writes,
   and uses it for the negative probe. Positive writes still go through
   GetOrCompute so concurrent resolves are still serialized by the cache
   mutex.

2. The negative sentinel persisted raw err.Error() to disk under Message,
   which was only read back into a debug log. Network errors can contain
   proxy URLs, internal hostnames, and other environment-sensitive text.
   Drop the Message field; only the existence of the sentinel matters.

Regenerates acceptance outputs that captured the now-gone "error while
computing: not cached" debug line.
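The split described in point 1 above can be sketched with a minimal in-memory cache: Get is a pure read-only probe whose miss is not an error, while GetOrCompute computes and stores under the cache mutex. Signatures and the Cache type are illustrative, not libs/cache's actual API:

```go
package main

import "sync"

// Cache is a simplified stand-in for the CLI's file cache.
type Cache[T any] struct {
	mu      sync.Mutex
	entries map[string]T
}

func NewCache[T any]() *Cache[T] {
	return &Cache[T]{entries: map[string]T{}}
}

// Get never computes or writes; a miss is reported as (zero, false),
// so it triggers no "error while computing" log or telemetry.
func (c *Cache[T]) Get(key string) (T, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	return v, ok
}

// GetOrCompute computes and stores on miss; concurrent resolves are
// serialized by the cache mutex.
func (c *Cache[T]) GetOrCompute(key string, compute func() (T, error)) (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil
	}
	v, err := compute()
	if err != nil {
		var zero T
		return zero, err
	}
	c.entries[key] = v
	return v, nil
}
```

Using Get for the negative probe removes the need for a sentinel error in a compute callback, which is what was tripping the spurious error log and local.cache.error metric on every positive-cache miss.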

Co-authored-by: Isaac
- Drop TestNewResolver_DifferentHosts_SeparateEntries: the assertion that
  two hosts get separate cache entries just restates the fingerprint's
  Host-keyed design.
- Collapse the four-line context.Background() comment into the existing
  //nolint line; drop the one-line hostFingerprint comment that restated
  the struct; shorten the negativeSentinel comment.

Co-authored-by: Isaac