
Cache /.well-known/databricks-config lookups in the CLI#5011

Open
simonfaltum wants to merge 25 commits into main from simonfaltum/hostmetadata-cache

Conversation


@simonfaltum simonfaltum commented Apr 17, 2026

Why

Every CLI command (databricks auth profiles, bundle validate, every workspace or account call) goes through Config.EnsureResolved, which triggers an unauthenticated GET to {host}/.well-known/databricks-config to populate host metadata. That round trip is ~700ms against production and gets paid on every invocation, doubling the latency of otherwise single-request commands.

Changes

Before: every CLI invocation hits the well-known endpoint once (or more when multiple configs get constructed).
Now: the first invocation populates a local disk cache under ~/.cache/databricks/<version>/host-metadata/; subsequent invocations read from it. Failures are negatively cached for 60s (except for context.Canceled / context.DeadlineExceeded, which are transient and never cached).

The integration hooks into SDK v0.128.0's config.DefaultHostMetadataResolverFactory (added in databricks/databricks-sdk-go#1636) via two pieces:

  • libs/hostmetadata/resolver.go: init() registers a factory that wraps cfg.DefaultHostMetadataResolver() in the caching resolver. NewResolver(fetch) remains the unit-testable primary API.
  • main.go: a blank import of libs/hostmetadata triggers that init() at startup, so every *config.Config the CLI constructs now and in the future picks up the cached lookup automatically. No per-site wiring, no guardrail test.

The positive cache wraps the miss path, so the hit path is a single disk read; the negative cache is only consulted when the positive cache misses.

internal/testutil/env.go pins DATABRICKS_CACHE_DIR to a temp dir in test cleanup so tests don't leak cache files into HOME.

Collateral cleanups

  • libs/cache/file_cache.go: drop the "failed to stat cache file" debug log when the file is simply missing (fs.ErrNotExist). It was pure noise (the next line, "cache miss, computing", conveys the same info), and its OS-specific error text diverged between Unix ("no such file or directory") and Windows ("The system cannot find the file specified."), breaking cross-platform acceptance goldens. Genuine stat failures (permissions, corruption) still log.
  • libs/testdiff/replacement.go: devVersionRegex now accepts either +SHA or -SHA after 0.0.0-dev. build.GetSanitizedVersion() swaps + to - for filesystem safety when the version is used in cache paths, and the old regex only covered the + form.
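The file_cache.go cleanup above hinges on distinguishing a missing file from a genuine stat failure. A minimal sketch of the Go idiom, assuming a hypothetical helper name and logging stand-in (the CLI's actual function differs):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// statForCache mimics the cleaned-up behavior: a missing file is an
// ordinary cache miss and logs nothing; any other stat error is logged.
// The function name and return shape are illustrative, not the CLI's API.
func statForCache(path string) (exists bool, err error) {
	_, statErr := os.Stat(path)
	switch {
	case statErr == nil:
		return true, nil
	case errors.Is(statErr, fs.ErrNotExist):
		// Plain miss: stay silent, so acceptance goldens never capture
		// the OS-specific "not found" message text.
		return false, nil
	default:
		// Genuine failure (permissions, corruption): surface it.
		fmt.Println("failed to stat cache file:", statErr)
		return false, statErr
	}
}
```

errors.Is matches fs.ErrNotExist through the *fs.PathError wrapping that os.Stat produces on every platform, which is what makes the check portable where string comparison was not.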

Test plan

  • make checks clean
  • make lint clean (0 issues)
  • go test ./libs/hostmetadata/... -race clean (factory-installed assertion + cache hit + fetch error + cancellation-not-cached + host isolation + end-to-end integration)
  • go test ./cmd/root/... ./bundle/config/... ./cmd/auth/... ./libs/auth/... -race clean
  • End-to-end acceptance test acceptance/auth/host-metadata-cache/ asserts exactly ONE /.well-known/databricks-config GET across two auth profiles invocations sharing a DATABRICKS_CACHE_DIR
  • Existing acceptance tests regenerated: fewer well-known GETs in out.requests.txt (caching works), new [Local Cache] debug lines in cache/telemetry tests, two Warn: Failed to resolve host metadata lines removed (intentional: the resolver returns (nil, nil) on fetch errors, which is how the SDK interprets "no metadata available"), stat-not-found lines removed (see Collateral cleanups)

Live validation against dogfood (from previous push)

Built locally and ran databricks -p e2-dogfood current-user me with and without a warm cache:

| Scenario | Elapsed well-known time | Cache log output |
| --- | --- | --- |
| Cold cache (fresh DATABRICKS_CACHE_DIR) | ~713ms | "fetch cache miss, computing" -> GET /.well-known/databricks-config -> "computed and stored result" |
| Warm cache (second invocation) | ~1ms | single "[Local Cache] cache hit" line |

Net per-command savings: ~700ms, matching the Why.

Verifies that two CLI invocations sharing DATABRICKS_CACHE_DIR produce only
one /.well-known/databricks-config GET: the first populates the on-disk
cache, the second reads from it.

Co-authored-by: Isaac
Cached /.well-known/databricks-config lookups persist across CLI
invocations now, so recorded request logs drop duplicate GETs and debug
output shows the new host-metadata cache keys. Silenced SDK warnings on
failed well-known fetches (the resolver returns nil,nil) also remove a
couple of Warn lines from auth test outputs.

Co-authored-by: Isaac
Inverts the internal newResolver(cfg, ...) into an exported
NewResolver(fetch) that takes an injected fetch function. Attach stays
as a one-liner convenience. Unit tests for the caching logic no longer
need httptest servers or PAT-authed configs; one integration test
retains the end-to-end SDK wiring.

Co-authored-by: Isaac
Flips the resolver so the happy path is one disk read: positive cache
wraps the miss flow, which now probes negative and falls through to
fetch only on true miss. Context cancellation and deadline errors are
no longer written to the negative cache because they say nothing about
the host's long-term availability.

Regenerates cache/telemetry acceptance outputs — the synthetic
negative-cache probe no longer runs on cache hits.

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as ready for review April 17, 2026 12:06

github-actions bot commented Apr 17, 2026

Approval status: pending

/acceptance/auth/ - needs approval

6 files changed
Suggested: @mihaimitrea-db
Also eligible: @tanmay-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @tejaskochar-db, @chrisst, @rauchy

/acceptance/bundle/ - needs approval

4 files changed
Suggested: @denik
Also eligible: @pietern, @andrewnester, @anton-107, @shreyas-goenka, @lennartkats-db, @janniklasrose

/internal/ - needs approval

Files: internal/testutil/env.go
Suggested: @mihaimitrea-db
Also eligible: @tanmay-db, @renaudhartert-db, @hectorcast-db, @parthban-db, @Divyansh-db, @tejaskochar-db, @chrisst, @rauchy

General files (require maintainer)

23 files changed
Based on git history:

  • @denik -- recent work in ./, libs/testdiff/, acceptance/telemetry/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

GetSanitizedVersion replaces + with - in build version metadata for
filesystem safety, but the [DEV_VERSION] replacement regex only
covered the + form. Cache paths use the sanitized form, so telemetry
tests failed across machines with different git HEAD SHAs. Regex now
accepts either + or - before the SHA suffix.
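The fix in the commit above can be sketched with a simplified pattern. The exact regex and sanitizer in libs/testdiff and the build package differ; this only illustrates the `[+-]` character class that accepts both the raw and the sanitized separator:

```go
package main

import (
	"regexp"
	"strings"
)

// Illustrative pattern: match a dev version with either separator before
// the SHA suffix, e.g. "0.0.0-dev+a1b2c3d" or "0.0.0-dev-a1b2c3d".
var devVersionRegex = regexp.MustCompile(`0\.0\.0-dev[+-][0-9a-f]+`)

// sanitizeVersion mirrors the described behavior of GetSanitizedVersion:
// swap + for - so the version is safe to use in filesystem paths.
func sanitizeVersion(v string) string {
	return strings.ReplaceAll(v, "+", "-")
}
```

With the old `+`-only pattern, the sanitized form used in cache paths would not be replaced by [DEV_VERSION], so goldens diverged across machines with different git HEAD SHAs.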

Co-authored-by: Isaac
os.Stat on a missing cache file returns an OS-specific error message
(Unix: "no such file or directory"; Windows: "The system cannot find
the file specified."), causing acceptance-test goldens to diverge
between platforms. The error is also pure noise — the follow-up
"cache miss, computing" line conveys the same information. Drop the
log for fs.ErrNotExist; keep it for genuine stat failures (permissions,
corruption).

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as draft April 17, 2026 14:54
@simonfaltum simonfaltum marked this pull request as ready for review April 20, 2026 06:00
@simonfaltum simonfaltum removed the request for review from andrewnester April 20, 2026 06:01
github-merge-queue bot pushed a commit to databricks/databricks-sdk-go that referenced this pull request Apr 20, 2026
## Changes

PR #1572
added `Config.HostMetadataResolver` so callers could override the SDK's
`/.well-known/databricks-config` fetch on a per-Config basis. That
covers "I have one Config and I want to wrap it."

The gap: programs that construct many Configs across their command
surface (e.g. the Databricks CLI) end up copying the same
`cfg.HostMetadataResolver = ...` assignment at every construction site;
in the CLI that meant roughly 10 sites across 7 files, plus a guardrail
test to catch drift.

This PR adds a package-level default consulted when a Config has no
explicit resolver set. Callers set a factory once during startup; every
subsequent Config gets the same resolver without per-site wiring. The
Config-level field still takes precedence, so PR #1572's contract is
unchanged.

### API

```go
// config/host_metadata.go
var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver
```

Plain public variable, set once at init. Matches the stdlib pattern for
single-default hooks: `http.DefaultClient`, `http.DefaultTransport`,
`log.Default`. Callers needing per-Config or dynamic behaviour should
use `Config.HostMetadataResolver` instead.

### Resolution order inside `Config.EnsureResolved`

1. If `Config.HostMetadataResolver` is set, use it.
2. Else, if `DefaultHostMetadataResolverFactory` is non-nil, invoke it
with the resolving Config and use its return value. If it returns nil,
fall through.
3. Else, SDK's default HTTP fetch (unchanged behavior for all existing
callers).
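The three-step precedence above can be sketched as follows. Field and variable names follow the SDK's config package, but the types are simplified stand-ins and `pickResolver` is a hypothetical name, not the SDK's code:

```go
package main

// HostMetadataResolver is a simplified stand-in for the SDK interface.
type HostMetadataResolver interface{ kind() string }

type httpResolver struct{} // stands in for the SDK's default HTTP fetch

func (httpResolver) kind() string { return "http" }

type cachedResolver struct{} // stands in for a caller-installed caching resolver

func (cachedResolver) kind() string { return "cached" }

type Config struct {
	HostMetadataResolver HostMetadataResolver
}

var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver

func pickResolver(cfg *Config) HostMetadataResolver {
	// 1. A per-Config resolver always wins.
	if cfg.HostMetadataResolver != nil {
		return cfg.HostMetadataResolver
	}
	// 2. Otherwise consult the package-level factory, if set and non-nil result.
	if DefaultHostMetadataResolverFactory != nil {
		if r := DefaultHostMetadataResolverFactory(cfg); r != nil {
			return r
		}
	}
	// 3. Fall through to the SDK's default HTTP fetch.
	return httpResolver{}
}
```

Because step 1 short-circuits before the factory is consulted, installing a global default cannot change behavior for any caller that already sets `Config.HostMetadataResolver`.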

## How the Databricks CLI will use this

The canonical Go idiom for "library A registers itself with library B"
is a blank import that triggers an `init()` in A. This is how
`database/sql` drivers (`_ "github.com/lib/pq"`), image codecs (`_
"image/png"`), and encoding formats register themselves.

After this PR lands and is bumped into the CLI, [CLI PR
#5011](databricks/cli#5011) will collapse from
~10 wired-in `hostmetadata.Attach(cfg)` calls + a guardrail test down to
two small pieces:

**`repos/cli/libs/hostmetadata/resolver.go`** — set the caching factory
at package init:

```go
func init() {
    config.DefaultHostMetadataResolverFactory = func(cfg *config.Config) config.HostMetadataResolver {
        return NewResolver(cfg.DefaultHostMetadataResolver())
    }
}
```

**`repos/cli/cmd/databricks/main.go`** — one blank import to pull the
package in at startup:

```go
import (
    // Registers a disk-cached HostMetadataResolver with the SDK so every
    // Config the CLI constructs reuses the cached /.well-known lookup.
    _ "github.com/databricks/cli/libs/hostmetadata"
)
```

That's the full integration. Every Config the CLI creates, now and in
the future from any new command a developer adds, automatically gets
caching. No per-site `Attach` call to remember, no guardrail test to
maintain, no new developer ever has to learn this mechanism exists to
benefit from it.

### Experimental

Marked experimental to match the existing `HostMetadataResolver` field.
No default behavior change for callers that never set
`DefaultHostMetadataResolverFactory`.

## Tests

Three new tests in `config/config_test.go`, each using a small
`withDefaultHostMetadataResolverFactory(t, factory)` helper that
captures and restores the prior value, so tests never clobber each other
via the package-level default:

- Factory is invoked when Config has no resolver; back-fill works
end-to-end.
- Config-level resolver takes precedence (factory not consulted).
- Factory returning nil falls through to the SDK's HTTP fetch.

- `make fmt test lint` clean
- `go test ./config/... -count=1 -race` clean
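The capture-and-restore helper described above can be sketched like this. The real helper lives in config/config_test.go and uses t.Cleanup; this stand-alone version takes an explicit body callback instead, and the surrounding types are simplified:

```go
package main

type Config struct{}
type HostMetadataResolver interface{}

var DefaultHostMetadataResolverFactory func(*Config) HostMetadataResolver

// withDefaultHostMetadataResolverFactory swaps in a factory for the
// duration of body and restores the previous value afterwards, so tests
// cannot clobber each other through the package-level default.
func withDefaultHostMetadataResolverFactory(factory func(*Config) HostMetadataResolver, body func()) {
	prev := DefaultHostMetadataResolverFactory
	DefaultHostMetadataResolverFactory = factory
	defer func() { DefaultHostMetadataResolverFactory = prev }()
	body()
}
```

Restoring the prior value (rather than resetting to nil) matters when helpers nest or when some other test has already installed its own default.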

Signed-off-by: simon <simon.faltum@databricks.com>

---------

Signed-off-by: simon <simon.faltum@databricks.com>
Co-authored-by: Renaud Hartert <renaud.hartert@databricks.com>
SDK v0.128.0 (databricks/databricks-sdk-go#1636) adds
config.DefaultHostMetadataResolverFactory so a package can install a
single hook that every Config picks up on EnsureResolved, without
per-site wiring.

Replaces ten hostmetadata.Attach(cfg) call sites across seven files
and the injection guardrail test with two pieces:

- libs/hostmetadata/resolver.go: init() sets
  config.DefaultHostMetadataResolverFactory to wrap cfg.DefaultHostMetadataResolver()
  in the caching resolver.
- main.go: blank import of libs/hostmetadata triggers that init() at
  startup so every *config.Config the CLI constructs picks up the
  cached lookup automatically.

Co-authored-by: Isaac
@simonfaltum simonfaltum force-pushed the simonfaltum/hostmetadata-cache branch from bfe2d05 to fae41c6 on April 21, 2026 06:50
TestFactory_EndToEnd_CacheHitSkipsSDKFetch already covers the same case:
if the init() factory weren't installed, the second EnsureResolved
would hit the server (2 fetches, not 1) and that test would fail.

Co-authored-by: Isaac

Two issues in the host-metadata resolver flagged by codex:

1. The negative-cache probe used GetOrCompute with a sentinel errNotCached
   value in the compute callback. That tripped the cache's "error while
   computing" debug log and local.cache.error telemetry metric on every
   positive-cache miss — even though the miss itself is not an error.

   Adds cache.Get[T] as a read-only lookup that never computes or writes,
   and uses it for the negative probe. Positive writes still go through
   GetOrCompute so concurrent resolves are still serialized by the cache
   mutex.

2. The negative sentinel persisted raw err.Error() to disk under Message,
   which was only read back into a debug log. Network errors can contain
   proxy URLs, internal hostnames, and other environment-sensitive text.
   Drop the Message field; only the existence of the sentinel matters.

Regenerates acceptance outputs that captured the now-gone "error while
computing: not cached" debug line.
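The split described in point 1 above can be sketched with a minimal in-memory cache: Get is a pure read-only probe whose miss is not an error, while GetOrCompute computes and stores under the cache mutex. Signatures and the Cache type are illustrative, not libs/cache's actual API:

```go
package main

import "sync"

// Cache is a simplified stand-in for the CLI's file cache.
type Cache[T any] struct {
	mu      sync.Mutex
	entries map[string]T
}

func NewCache[T any]() *Cache[T] {
	return &Cache[T]{entries: map[string]T{}}
}

// Get never computes or writes; a miss is reported as (zero, false),
// so it triggers no "error while computing" log or telemetry.
func (c *Cache[T]) Get(key string) (T, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	return v, ok
}

// GetOrCompute computes and stores on miss; concurrent resolves are
// serialized by the cache mutex.
func (c *Cache[T]) GetOrCompute(key string, compute func() (T, error)) (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok {
		return v, nil
	}
	v, err := compute()
	if err != nil {
		var zero T
		return zero, err
	}
	c.entries[key] = v
	return v, nil
}
```

Using Get for the negative probe removes the need for a sentinel error in a compute callback, which is what was tripping the spurious error log and local.cache.error metric on every positive-cache miss.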

Co-authored-by: Isaac
- Drop TestNewResolver_DifferentHosts_SeparateEntries: the assertion that
  two hosts get separate cache entries just restates the fingerprint's
  Host-keyed design.
- Collapse the four-line context.Background() comment into the existing
  //nolint line; drop the one-line hostFingerprint comment that restated
  the struct; shorten the negativeSentinel comment.

Co-authored-by: Isaac