Skip to content

feat(api-gateway): CORE-418 readiness probe data model check#10773

Open
igorlukanin wants to merge 2 commits intomasterfrom
igor/core-418-duplicate-view-definitions-break-deployment-compilation
Open

feat(api-gateway): CORE-418 readiness probe data model check#10773
igorlukanin wants to merge 2 commits intomasterfrom
igor/core-418-duplicate-view-definitions-break-deployment-compilation

Conversation

@igorlukanin
Copy link
Copy Markdown
Member

@igorlukanin igorlukanin commented Apr 29, 2026

Summary

Linear: CORE-418

A user reported that merging a data model with two definitions of the same view caused Cube to successfully promote the build to production: the API process stayed alive and existing health checks passed, but every request that needed the schema (querying, Playground, SQL Runner, meta) broke with a compile error like:

some_view cube: Included member 'some_key' conflicts with existing member of 'some_view'.

The existing /readyz probe only verifies orchestrator connectivity in standalone mode; in multi-tenant deployments the schema compiles lazily on first request per tenant, so a broken model passes deploy-time checks and silently breaks the deployment.

This PR adds an opt-in readiness check that compiles the data model for every tenant returned by scheduledRefreshContexts (or once with an empty security context for single-tenant deployments). When enabled, /readyz returns 500 if any tenant fails to compile, so the deployment platform refuses to promote the bad build and the previous (working) version keeps serving traffic.

  • New env var: CUBEJS_READINESS_CHECK_DATA_MODEL (default false).
  • /livez is intentionally not changed — failing it would crash-loop the pod on a bad model.
  • Sequential per-tenant compile with early-exit on first failure. CompilerApi.getCompilers already caches by compilerVersion, so steady-state probe cost is dominated by cache hits. The existing cachedHandler (1s TTL) coalesces concurrent probes.

Files changed

  • packages/cubejs-backend-shared/src/env.ts — register readinessCheckDataModel.
  • packages/cubejs-api-gateway/src/gateway.ts — extend readiness handler with checkDataModelCompiles().
  • packages/cubejs-api-gateway/test/index.test.ts — 5 new jest tests covering: env disabled, single-tenant healthy, single-tenant broken, multi-tenant healthy, multi-tenant one-broken.
  • docs-mintlify/reference/configuration/environment-variables.mdx — new env var section.
  • docs-mintlify/cube-core/deployment.mdx — paragraph in the existing health-checks section pointing to the new env var.

Test plan

  • yarn build — TS compiles (no new errors; pre-existing sql-server.ts errors unrelated).
  • yarn lint — 0 errors.
  • cubejs-backend-shared unit tests — 308 passing.
  • cubejs-api-gateway unit tests — 185 passing, including 5 new healthcheck tests.
  • Local Docker (cubejs/cube:dev built from this branch) end-to-end scenarios:
    • env var unset (default), broken model → /readyz 200 (existing behavior preserved)
    • env var false, broken model → /readyz 200
    • env var true, healthy single-tenant → /readyz 200
    • env var true, broken single-tenant → /readyz 500, expected error reproduced
    • env var true, multi-tenant all healthy → /readyz 200, three compilations executed
    • env var true, multi-tenant one broken → /readyz 500
    • live recovery: broken → fix in place → /readyz 500 → 200 within cachedHandler TTL, no restart
    • /livez returns 200 in every scenario regardless of compile state

@igorlukanin igorlukanin requested a review from a team as a code owner April 29, 2026 12:53
@mintlify
Copy link
Copy Markdown
Contributor

mintlify Bot commented Apr 29, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
cubed3 🟢 Ready View Preview Apr 29, 2026, 12:54 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

@github-actions github-actions Bot added the javascript Pull requests that update Javascript code label Apr 29, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 29, 2026

Claude finished @igorlukanin's task in 2m 27s —— View job


Code Review: CORE-418 Readiness Probe Data Model Check

  • Read and analyze all changed files
  • Review code quality and error handling
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Review documentation changes
  • Post final review

Overall Assessment

This is a well-scoped, well-tested feature that solves a real production problem. The opt-in design, early-exit on failure, and clear separation from /livez are all the right calls. The code integrates cleanly into the existing readiness handler pattern. A few observations below, mostly minor.

Code Quality

gateway.ts:2796-2823checkDataModelCompiles()

The implementation is clean and follows the existing patterns in the codebase. A few observations:

  1. scheduledRefreshContexts() error is unhandled — If this.scheduledRefreshContexts() itself throws (e.g., the user's config function hits a database to enumerate tenants and the DB is down), the error will propagate uncaught out of readiness, pass through cachedHandler's catch block (which resets the cache and calls next(e)), and likely result in Express's default 500 response rather than the structured { health: 'DOWN' } JSON. Consider wrapping the scheduledRefreshContexts() call in the try/catch, or adding its own try/catch that returns 'DOWN':

    let backgroundContexts: any[];
    try {
      backgroundContexts = this.scheduledRefreshContexts
        ? await this.scheduledRefreshContexts()
        : [];
    } catch (e: any) {
      this.logProbeError(e, 'Internal Server Error on readiness probe (scheduledRefreshContexts)');
      return 'DOWN';
    }

    Fix this →

  2. Log message on line 2802 has a subtle edge case — The condition backgroundContexts.length === 0 && this.scheduledRefreshContexts logs a "skipped" message when scheduledRefreshContexts is defined but returns an empty array. However, the code then falls through to compile with [{ securityContext: {} }], so it doesn't actually skip — it compiles once with an empty context. The log message is misleading. Consider either:

    • Changing the message to something like "scheduledRefreshContexts returned no contexts, checking with empty security context", or
    • Removing the log, since the behavior is correct and the message creates confusion.

    Fix this →

  3. authInfo backward compat on line 2809 — Good to see backgroundContext?.authInfo as a fallback. This matches the pattern at line 544.

Security

No concerns. The feature:

  • Is opt-in via env var (default false).
  • Doesn't expose error details in the HTTP response — errors are logged server-side, and the response only contains { health: 'DOWN' }.
  • Doesn't accept any user input — the tenant contexts come from the server-side scheduledRefreshContexts config.
  • Reuses the existing guestMiddlewares on /readyz, so no auth bypass.

Performance

The design is sound:

  • cachedHandler (1s TTL) coalesces concurrent probes, so Kubernetes-style 10s interval probes won't cause redundant compilations.
  • CompilerApi.getCompilers caching means steady-state probes are cache hits.
  • Sequential + early-exit is the right tradeoff — no unnecessary parallelism for a health probe, and it fails fast.
  • Potential concern for large multi-tenant deployments: For deployments with hundreds of tenants via scheduledRefreshContexts, the first probe after a deploy (cold cache) could be slow as it sequentially compiles every tenant's model. If this ever becomes an issue, you could consider adding a configurable concurrency limit or a timeout, but that's premature for a v1 — the 1s cache TTL means only one probe at a time actually does the work. Worth noting in the docs if it comes up in practice.

Testing

The 5 new tests are well-structured and cover the key scenarios:

Scenario Covered
Disabled by default
Single tenant healthy
Single tenant broken
Multi-tenant all healthy
Multi-tenant one broken

Possible additions (nice-to-have, not blocking):

  • scheduledRefreshContexts itself throws → verify probe returns 500 (relates to observation ReferenceError: self is not defined - Issue with whatwg-fetch when using the core lib with Node.js #1 above).
  • scheduledRefreshContexts returns empty array → verify single-tenant fallback still compiles.
  • A test verifying the env var is restored correctly could use beforeAll/afterAll to make the pattern more resilient. The current afterEach approach works but captures originalEnv eagerly at describe-time — if another test suite in the same process mutates the env var before this suite runs, originalEnv would capture the wrong value. This is a theoretical concern given Jest's typical isolation, but the env-var library's caching via getEnv could interact unexpectedly.

Documentation

Both docs changes are clear and accurate:

  • deployment.mdx — well-placed in the existing health checks section with a direct link to the env var reference.
  • environment-variables.mdx — consistent format with surrounding entries. The description correctly mentions scheduled_refresh_contexts and single-tenant fallback.

Minor doc note: The deployment.mdx text says "compile the data model for every tenant on each probe" — this is technically true, but readers might worry about performance. Worth noting (even parenthetically) that compilations are cached and the probe response itself has a 1s cache TTL, so steady-state cost is minimal.

Summary

This is a solid, well-tested feature that addresses a real production gap. The code follows existing patterns, the tests cover the important scenarios, and the documentation is clear.

Blocking: None.

Non-blocking suggestions:

  1. Wrap scheduledRefreshContexts() in a try/catch to handle failures from the user's context provider gracefully.
  2. Fix the misleading "skipped" log message on line 2802.
  3. Consider adding a test for when scheduledRefreshContexts itself throws.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.99%. Comparing base (df30aad) to head (435742a).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
packages/cubejs-api-gateway/src/gateway.ts 83.33% 1 Missing and 2 partials ⚠️
packages/cubejs-backend-shared/src/env.ts 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10773      +/-   ##
==========================================
+ Coverage   57.96%   57.99%   +0.02%     
==========================================
  Files         215      215              
  Lines       16776    16795      +19     
  Branches     3389     3394       +5     
==========================================
+ Hits         9725     9740      +15     
- Misses       6559     6561       +2     
- Partials      492      494       +2     
Flag Coverage Δ
cube-backend 57.99% <78.94%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

javascript Pull requests that update Javascript code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants