feat(api-gateway): CORE-418 readiness probe data model check by igorlukanin · Pull Request #10773 · cube-js/cube

igorlukanin · 2026-04-29T12:53:58Z

Summary

A user reported that merging a data model with two definitions of the same view caused Cube to successfully promote the build to production: the API process stayed alive and existing health checks passed, but every request that needed the schema (querying, Playground, SQL Runner, meta) broke with a compile error like:

some_view cube: Included member 'some_key' conflicts with existing member of 'some_view'.

The existing /readyz probe only verifies orchestrator connectivity in standalone mode; in multi-tenant deployments the schema compiles lazily on first request per tenant, so a broken model passes deploy-time checks and silently breaks the deployment.

This PR adds an opt-in readiness check that compiles the data model for every tenant returned by scheduledRefreshContexts (or once with an empty security context for single-tenant deployments). When enabled, /readyz returns 500 if any tenant fails to compile, so the deployment platform refuses to promote the bad build and the previous (working) version keeps serving traffic.

New env var: CUBEJS_READINESS_CHECK_DATA_MODEL (default false).
/livez is intentionally not changed — failing it would crash-loop the pod on a bad model.
Sequential per-tenant compile with early-exit on first failure. CompilerApi.getCompilers already caches by compilerVersion, so steady-state probe cost is dominated by cache hits. The existing cachedHandler (1s TTL) coalesces concurrent probes.

Files changed

packages/cubejs-backend-shared/src/env.ts — register readinessCheckDataModel.
packages/cubejs-api-gateway/src/gateway.ts — extend readiness handler with checkDataModelCompiles().
packages/cubejs-api-gateway/test/index.test.ts — 5 new jest tests covering: env disabled, single-tenant healthy, single-tenant broken, multi-tenant healthy, multi-tenant one-broken.
docs-mintlify/reference/configuration/environment-variables.mdx — new env var section.
docs-mintlify/cube-core/deployment.mdx — paragraph in the existing health-checks section pointing to the new env var.

Test plan

yarn build — TS compiles (no new errors; pre-existing sql-server.ts errors unrelated).
yarn lint — 0 errors.
cubejs-backend-shared unit tests — 308 passing.
cubejs-api-gateway unit tests — 185 passing, including 5 new healthcheck tests.
Local Docker (cubejs/cube:dev built from this branch) end-to-end scenarios:
- env var unset (default), broken model → /readyz 200 (existing behavior preserved)
- env var false, broken model → /readyz 200
- env var true, healthy single-tenant → /readyz 200
- env var true, broken single-tenant → /readyz 500, expected error reproduced
- env var true, multi-tenant all healthy → /readyz 200, three compilations executed
- env var true, multi-tenant one broken → /readyz 500
- live recovery: broken → fix in place → /readyz 500 → 200 within cachedHandler TTL, no restart
- /livez returns 200 in every scenario regardless of compile state

…fail readiness probe on broken data model

mintlify · 2026-04-29T12:54:05Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
cubed3	🟢 Ready	View Preview	Apr 29, 2026, 12:54 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

claude · 2026-04-29T12:54:27Z

Claude finished @igorlukanin's task in 2m 27s —— View job

Code Review: CORE-418 Readiness Probe Data Model Check

Read and analyze all changed files
Review code quality and error handling
Review security considerations
Review performance implications
Review test coverage
Review documentation changes
Post final review

Overall Assessment

This is a well-scoped, well-tested feature that solves a real production problem. The opt-in design, early-exit on failure, and clear separation from /livez are all the right calls. The code integrates cleanly into the existing readiness handler pattern. A few observations below, mostly minor.

Code Quality

gateway.ts:2796-2823 — checkDataModelCompiles()

The implementation is clean and follows the existing patterns in the codebase. A few observations:

scheduledRefreshContexts() error is unhandled — If this.scheduledRefreshContexts() itself throws (e.g., the user's config function hits a database to enumerate tenants and the DB is down), the error will propagate uncaught out of readiness, pass through cachedHandler's catch block (which resets the cache and calls next(e)), and likely result in Express's default 500 response rather than the structured { health: 'DOWN' } JSON. Consider wrapping the scheduledRefreshContexts() call in the try/catch, or adding its own try/catch that returns 'DOWN':
```
let backgroundContexts: any[];
try {
  backgroundContexts = this.scheduledRefreshContexts
    ? await this.scheduledRefreshContexts()
    : [];
} catch (e: any) {
  this.logProbeError(e, 'Internal Server Error on readiness probe (scheduledRefreshContexts)');
  return 'DOWN';
}
```
Fix this →
Log message on line 2802 has a subtle edge case — The condition backgroundContexts.length === 0 && this.scheduledRefreshContexts logs a "skipped" message when scheduledRefreshContexts is defined but returns an empty array. However, the code then falls through to compile with [{ securityContext: {} }], so it doesn't actually skip — it compiles once with an empty context. The log message is misleading. Consider either:
- Changing the message to something like "scheduledRefreshContexts returned no contexts, checking with empty security context", or
- Removing the log, since the behavior is correct and the message creates confusion.
Fix this →
authInfo backward compat on line 2809 — Good to see backgroundContext?.authInfo as a fallback. This matches the pattern at line 544.

Security

No concerns. The feature:

Is opt-in via env var (default false).
Doesn't expose error details in the HTTP response — errors are logged server-side, and the response only contains { health: 'DOWN' }.
Doesn't accept any user input — the tenant contexts come from the server-side scheduledRefreshContexts config.
Reuses the existing guestMiddlewares on /readyz, so no auth bypass.

Performance

The design is sound:

cachedHandler (1s TTL) coalesces concurrent probes, so Kubernetes-style 10s interval probes won't cause redundant compilations.
CompilerApi.getCompilers caching means steady-state probes are cache hits.
Sequential + early-exit is the right tradeoff — no unnecessary parallelism for a health probe, and it fails fast.
Potential concern for large multi-tenant deployments: For deployments with hundreds of tenants via scheduledRefreshContexts, the first probe after a deploy (cold cache) could be slow as it sequentially compiles every tenant's model. If this ever becomes an issue, you could consider adding a configurable concurrency limit or a timeout, but that's premature for a v1 — the 1s cache TTL means only one probe at a time actually does the work. Worth noting in the docs if it comes up in practice.

Testing

The 5 new tests are well-structured and cover the key scenarios:

Scenario	Covered
Disabled by default	✅
Single tenant healthy	✅
Single tenant broken	✅
Multi-tenant all healthy	✅
Multi-tenant one broken	✅

Possible additions (nice-to-have, not blocking):

scheduledRefreshContexts itself throws → verify probe returns 500 (relates to observation ReferenceError: self is not defined - Issue with whatwg-fetch when using the core lib with Node.js #1 above).
scheduledRefreshContexts returns empty array → verify single-tenant fallback still compiles.
A test verifying the env var is restored correctly could use beforeAll/afterAll to make the pattern more resilient. The current afterEach approach works but captures originalEnv eagerly at describe-time — if another test suite in the same process mutates the env var before this suite runs, originalEnv would capture the wrong value. This is a theoretical concern given Jest's typical isolation, but the env-var library's caching via getEnv could interact unexpectedly.

Documentation

Both docs changes are clear and accurate:

deployment.mdx — well-placed in the existing health checks section with a direct link to the env var reference.
environment-variables.mdx — consistent format with surrounding entries. The description correctly mentions scheduled_refresh_contexts and single-tenant fallback.

Minor doc note: The deployment.mdx text says "compile the data model for every tenant on each probe" — this is technically true, but readers might worry about performance. Worth noting (even parenthetically) that compilations are cached and the probe response itself has a 1s cache TTL, so steady-state cost is minimal.

Summary

This is a solid, well-tested feature that addresses a real production gap. The code follows existing patterns, the tests cover the important scenarios, and the documentation is clear.

Blocking: None.

Non-blocking suggestions:

Wrap scheduledRefreshContexts() in a try/catch to handle failures from the user's context provider gracefully.
Fix the misleading "skipped" log message on line 2802.
Consider adding a test for when scheduledRefreshContexts itself throws.

…catch and warn on empty contexts

codecov · 2026-04-29T13:31:51Z

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.99%. Comparing base (df30aad) to head (435742a).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
packages/cubejs-api-gateway/src/gateway.ts	83.33%	1 Missing and 2 partials ⚠️
packages/cubejs-backend-shared/src/env.ts	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10773      +/-   ##
==========================================
+ Coverage   57.96%   57.99%   +0.02%     
==========================================
  Files         215      215              
  Lines       16776    16795      +19     
  Branches     3389     3394       +5     
==========================================
+ Hits         9725     9740      +15     
- Misses       6559     6561       +2     
- Partials      492      494       +2

Flag	Coverage Δ
cube-backend	`57.99% <78.94%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

feat(api-gateway): CORE-418 add CUBEJS_READINESS_CHECK_DATA_MODEL to …

2f5c828

…fail readiness probe on broken data model

igorlukanin requested a review from a team as a code owner April 29, 2026 12:53

github-actions Bot added the javascript Pull requests that update Javascript code label Apr 29, 2026

mintlify Bot deployed to staging - docs-mintlify April 29, 2026 12:54 View deployment

vercel Bot deployed to Preview April 29, 2026 12:55 View deployment

fix(api-gateway): CORE-418 wrap scheduledRefreshContexts call in try/…

435742a

…catch and warn on empty contexts

mintlify Bot deployed to staging - docs-mintlify April 29, 2026 13:13 View deployment

ovr approved these changes Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(api-gateway): CORE-418 readiness probe data model check#10773

feat(api-gateway): CORE-418 readiness probe data model check#10773
igorlukanin wants to merge 2 commits intomasterfrom
igor/core-418-duplicate-view-definitions-break-deployment-compilation

igorlukanin commented Apr 29, 2026 •

edited

Loading

Uh oh!

mintlify Bot commented Apr 29, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 29, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

igorlukanin commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files changed

Test plan

Uh oh!

mintlify Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: CORE-418 Readiness Probe Data Model Check

Overall Assessment

Code Quality

Security

Performance

Testing

Documentation

Summary

Uh oh!

codecov Bot commented Apr 29, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

igorlukanin commented Apr 29, 2026 •

edited

Loading

mintlify Bot commented Apr 29, 2026 •

edited

Loading

claude Bot commented Apr 29, 2026 •

edited

Loading