
docs: document Kafka streams mode for ksqlDB integration #10797

Merged
paveltiunov merged 11 commits into master from cursor/docs-ksqldb-kafka-streams-mode-623b on May 1, 2026

Conversation

@paveltiunov (Member) commented May 1, 2026

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

Description of Changes Made

Documents the Kafka streams mode for the ksqlDB integration. In this mode, Cube reads data directly from the underlying Kafka topics instead of going through the ksqlDB REST API for data streaming. Cube does not create any tables or streams in ksqlDB when this mode is enabled.

Changes

Mintlify docs (docs-mintlify/admin/connect-to-data/data-sources/ksqldb.mdx):

  • Added CUBEJS_DB_KAFKA_HOST, CUBEJS_DB_KAFKA_USER, CUBEJS_DB_KAFKA_PASS, and CUBEJS_DB_KAFKA_USE_SSL to the environment variables table
  • Added new "Kafka streams mode" section with:
    • Overview of what the mode does and how it differs from the default
    • "When to use" guidance (read-only ksqlDB, higher throughput, restricted permissions)
    • "Enabling Kafka streams mode" with .env examples including multi-broker setup
    • Confluent Cloud note about separate Kafka vs ksqlDB credentials
    • "How it works" numbered walkthrough
    • "Data modeling" subsection with a full lambda pre-aggregation example (batch + streaming cubes with rollup_lambda)
    • "Primary key and ungrouped queries" — explains that at least one dimension must have primary_key: true for the streaming pre-aggregation to work, because Cube Store's stream engine doesn't support GROUP BY; Cube auto-omits grouping when a primary key is present
    • "Unique key columns and deduplication" — __seq column, last-row-per-key semantics, Kafka message key fallback
    • "Stream format" — expected JSON format, case-sensitive field names, with Timestamp handling subsection covering string (ISO 8601/RFC 3339) and numeric (epoch ms) formats, PARSE_TIMESTAMP for non-standard formats, time dimension truncation via granularity, and date_trunc availability
    • "Filtering on the stream" — Cube Store applies SQL projections/filters on micro-batches
    • "Supported SQL syntax" — strict plan shape, supported/unsupported clauses, alias requirements
  • Fixed incorrect "Possible Values" for CUBEJS_DB_USER and CUBEJS_DB_PASS

Legacy Nextra docs (docs/content/product/configuration/data-sources/ksqldb.mdx):

  • Same content additions and fixes, adapted to Nextra component conventions
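
The variables listed above can be sketched in a .env file. This is a minimal illustration, assuming the comma-separated broker list format of Kafka bootstrap servers; all host names and credentials are hypothetical:

```ini
# Kafka streams mode: Cube reads topics directly from the brokers.
# All values below are placeholders.
CUBEJS_DB_TYPE=ksql
CUBEJS_DB_URL=https://ksqldb.example.com:8088
CUBEJS_DB_USER=ksql_user
CUBEJS_DB_PASS=ksql_password

# Multi-broker setup: comma-separated host:port pairs
CUBEJS_DB_KAFKA_HOST=broker-1.example.com:9092,broker-2.example.com:9092
CUBEJS_DB_KAFKA_USER=kafka_user
CUBEJS_DB_KAFKA_PASS=kafka_password
CUBEJS_DB_KAFKA_USE_SSL=true
```

On Confluent Cloud the Kafka credentials are separate from the ksqlDB credentials, which is why the CUBEJS_DB_KAFKA_* variables exist alongside CUBEJS_DB_USER/CUBEJS_DB_PASS.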

Add documentation for the Kafka streams mode, where Cube reads data
directly from Kafka topics instead of going through the ksqlDB REST API
for data streaming. In this mode, Cube does not create any tables or
streams in ksqlDB.

The documentation covers:
- What Kafka streams mode is and how it differs from the default mode
- When to use it (read-only ksqlDB, higher throughput, restricted perms)
- How to enable it via CUBEJS_DB_KAFKA_* environment variables
- How it works under the hood (metadata from ksqlDB, data from Kafka)
- Configuration via driverFactory for programmatic setup

Also fixes incorrect 'Possible Values' descriptions for CUBEJS_DB_USER
and CUBEJS_DB_PASS in the env vars table.

Updated both the Mintlify docs (docs-mintlify/) and the legacy Nextra
docs (docs/) for consistency.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
mintlify Bot commented May 1, 2026

Preview deployment for your docs:
cubed3: 🟢 Ready, updated May 1, 2026, 3:00 AM (UTC)


claude Bot commented May 1, 2026

Claude finished @paveltiunov's task in 2m 27s.


PR Review: docs: document Kafka streams mode for ksqlDB integration

  • Gather context and read changed files
  • Review mintlify docs changes
  • Review legacy Nextra docs changes
  • Check consistency between both doc versions
  • Verify documentation against driver implementation
  • Post review feedback

Overall

Good documentation addition. The content is well-structured, clearly explains the feature, and the driverFactory example matches the actual driver implementation. The bug fix for CUBEJS_DB_USER/CUBEJS_DB_PASS "Possible Values" (were incorrectly showing "A valid port number" / "A valid database name") is a nice catch.

I verified all constructor options (url, username, password, kafkaHost, kafkaUser, kafkaPassword, kafkaUseSsl) and environment variables against packages/cubejs-ksql-driver/src/KsqlDriver.ts — they all match.

Issues

1. Broken link anchors for new Kafka env vars (mintlify)
docs-mintlify/admin/connect-to-data/data-sources/ksqldb.mdx lines 77–80

The new Kafka env vars link to anchors like /reference/configuration/environment-variables#cubejs_db_kafka_host, but these anchors do not exist in the environment variables reference page — there are no CUBEJS_DB_KAFKA_* entries there. These links will 404 or land at the top of the page. Either:

  • Add the Kafka env vars to the environment variables reference page, or
  • Remove the links (use plain backtick formatting like the Nextra version does)


2. Inconsistent <EnvVar> component usage (Nextra)
docs/content/product/configuration/data-sources/ksqldb.mdx lines 69–72

The existing env vars use the <EnvVar> component (which auto-links to the reference page), but the new Kafka env vars use plain backtick formatting. This is visually inconsistent within the same table. If the Kafka env vars aren't in the reference (and shouldn't be linked), this might be intentional — but it would be good to add a brief note or make the formatting consistent.

3. Unrelated ref link changes (mintlify)
docs-mintlify/admin/connect-to-data/data-sources/ksqldb.mdx lines 190–191

The PR changed two reference links that are unrelated to the Kafka streams mode feature:

  • ref-powerbi: /admin/connect-to-data/visualization-tools/powerbi → /docs/integrations/power-bi
  • ref-dax-api: /reference/dax-api → /reference/core-data-apis/dax-api

The original ref-powerbi path (/admin/connect-to-data/visualization-tools/powerbi) has a valid file on disk, while the new path (/docs/integrations/power-bi) points to a directory. For ref-dax-api, the original path (/reference/dax-api) has no file while the new one (/reference/core-data-apis/dax-api) does exist. The powerbi change may be introducing a regression — worth verifying that Mintlify resolves directory paths to index pages correctly.

4. Missing trailing newline
Both files end without a trailing newline. Minor, but POSIX convention.

Nit

The driverFactory example hardcodes kafkaUseSsl: true instead of reading from process.env.CUBEJS_DB_KAFKA_USE_SSL. For consistency with the other properties (which all read from env vars), consider:

kafkaUseSsl: process.env.CUBEJS_DB_KAFKA_USE_SSL === "true",
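
The suggested fix can be sketched as a small helper that assembles the driver options from environment variables. The helper name ksqlDriverOptions and the sample values are hypothetical; the option keys match those listed in the review above:

```javascript
// Sketch: building KsqlDriver constructor options from env vars, parsing
// CUBEJS_DB_KAFKA_USE_SSL as a boolean as the review suggests.
// Helper name and sample values are hypothetical.
function ksqlDriverOptions(env) {
  return {
    url: env.CUBEJS_DB_URL,
    username: env.CUBEJS_DB_USER,
    password: env.CUBEJS_DB_PASS,
    kafkaHost: env.CUBEJS_DB_KAFKA_HOST,
    kafkaUser: env.CUBEJS_DB_KAFKA_USER,
    kafkaPassword: env.CUBEJS_DB_KAFKA_PASS,
    // "true" → true; unset or anything else → false,
    // matching the env var's documented false default
    kafkaUseSsl: env.CUBEJS_DB_KAFKA_USE_SSL === "true",
  };
}

const opts = ksqlDriverOptions({
  CUBEJS_DB_URL: "https://ksqldb.example.com:8088",
  CUBEJS_DB_KAFKA_HOST: "broker-1:9092,broker-2:9092",
  CUBEJS_DB_KAFKA_USE_SSL: "true",
});
console.log(opts.kafkaUseSsl); // true
```

Reading every property from process.env keeps the example consistent and avoids silently forcing SSL on for users who copy the snippet.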

claude Bot commented May 1, 2026

Claude finished @paveltiunov's task in 1m 27s.


PR Review: docs: document Kafka streams mode for ksqlDB integration

  • Gather context and read changed files
  • Review mintlify docs changes
  • Review legacy Nextra docs changes
  • Verify documentation against driver implementation
  • Check consistency between both doc versions
  • Post review feedback

Overall

Well-structured documentation addition. The content clearly explains the Kafka streams mode feature, when to use it, and how to configure it. The bug fix for CUBEJS_DB_USER/CUBEJS_DB_PASS "Possible Values" (which incorrectly showed "A valid port number" / "A valid database name") is a good catch.

I verified all constructor options (url, username, password, kafkaHost, kafkaUser, kafkaPassword, kafkaUseSsl) and environment variables against packages/cubejs-ksql-driver/src/KsqlDriver.ts and packages/cubejs-backend-shared/src/env.ts — they all match. The "How it works" section accurately reflects the getStreamingTableData method in the driver.

Both doc versions (mintlify and Nextra) are consistent in content, with appropriate component differences (<Info> vs <InfoBox>, <EnvVar> vs links, etc.).


Issues

1. Broken link anchors for new Kafka env vars (mintlify)
docs-mintlify/.../ksqldb.mdx lines 77–80

The new Kafka env vars link to anchors like /reference/configuration/environment-variables#cubejs_db_kafka_host, but these anchors do not exist in the environment variables reference page — I confirmed via grep that there are no CUBEJS_DB_KAFKA_* entries there. These links will 404 or land at the top of the page. Either:

  • Add the Kafka env vars to the environment variables reference page, or
  • Remove the links and use plain backtick formatting (like the Nextra version already does)


2. driverFactory example hardcodes kafkaUseSsl: true — both files

The driverFactory code example hardcodes kafkaUseSsl: true while every other property reads from process.env. For consistency and to match how users would actually configure this:

kafkaUseSsl: process.env.CUBEJS_DB_KAFKA_USE_SSL === "true",

This also matches the env var behavior defined in packages/cubejs-backend-shared/src/env.ts:489-493 where it defaults to false and parses as a boolean.


3. Missing trailing newline — both files

Both files end without a trailing newline (POSIX convention). Minor.


Nits

Inconsistent <EnvVar> component usage in Nextra
docs/.../ksqldb.mdx lines 69–72

The existing env vars use the <EnvVar> component (which auto-links to the reference), but the new Kafka env vars use plain backtick formatting. This is likely intentional since the Kafka vars aren't in the reference page, but it's visually inconsistent within the same table. If these vars are added to the reference page (per issue #1), they could also use <EnvVar>.

ref-powerbi path change in mintlify — line 190

Changed from /admin/connect-to-data/visualization-tools/powerbi to /docs/integrations/power-bi. The new path is a directory (with index.mdx inside). This should resolve correctly in Mintlify, but worth verifying the preview deployment to confirm.

ref-dax-api path change in mintlify — line 191

Changed from /reference/dax-api (which has no file on disk) to /reference/core-data-apis/dax-api (which exists) — this is a good fix.


…ams mode

Remove the driverFactory configuration section and replace it with a
Data modeling section that shows:
- How to configure ksqlDB as a named data source using decorated
  environment variables (CUBEJS_DS_KSQL_DB_*)
- How to create a cube with data_source: ksql that references an
  existing ksqlDB stream or table
- A complete cube definition with measures, dimensions, and a streaming
  pre-aggregation in both YAML and JavaScript

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Replace the simple single-cube example with a full lambda pre-aggregation
pattern showing:
- A batch cube (order_events) querying a warehouse with FILTER_PARAMS,
  incremental daily partitions, and a rollup_lambda that merges batch
  and streaming rollups
- A streaming cube (order_events_stream) with data_source: ksql pointing
  at an existing ksqlDB stream, using read_only: true, stream_offset,
  unique_key_columns, and incremental refresh
- Documentation of key streaming pre-aggregation properties (read_only,
  stream_offset, unique_key_columns)
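
The lambda pattern above can be sketched in Cube's YAML data model syntax. The cube names come from the commit message; the warehouse table, column names, and granularity choices are hypothetical:

```yaml
cubes:
  # Batch cube querying the warehouse (table/column names hypothetical)
  - name: order_events
    sql: SELECT * FROM warehouse.order_events
    dimensions:
      - name: order_id
        sql: order_id
        type: string
        primary_key: true
      - name: created_at
        sql: created_at
        type: time
    pre_aggregations:
      - name: orders_batch
        dimensions: [CUBE.order_id]
        time_dimension: CUBE.created_at
        granularity: day
        partition_granularity: day
      # Lambda rollup merging the batch and streaming rollups
      - name: orders_lambda
        type: rollup_lambda
        rollups:
          - orders_batch
          - order_events_stream.orders_stream

  # Streaming cube pointing at an existing ksqlDB stream
  - name: order_events_stream
    data_source: ksql
    sql: SELECT * FROM ORDER_EVENTS_STREAM
    dimensions:
      - name: order_id
        sql: order_id
        type: string
        primary_key: true
      - name: created_at
        sql: created_at
        type: time
    pre_aggregations:
      - name: orders_stream
        dimensions: [CUBE.order_id]
        time_dimension: CUBE.created_at
        granularity: day
        stream_offset: earliest
        unique_key_columns: [order_id]
```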

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
It is true by default, so specifying it is redundant.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add three new subsections to the Kafka streams mode documentation:

- Unique key columns and deduplication: explains how __seq column is
  appended from Kafka offset, deduplication happens at read/compaction
  time (last row per key wins), and key values can fall back to Kafka
  message key when missing from payload
- Stream format: documents the expected JSON object format for Kafka
  message values, case-sensitivity of field names, and optional message
  key parsing
- Filtering on the stream: explains that Cube Store applies SELECT
  projections and WHERE filters from the cube's sql property directly
  on each micro-batch of Kafka messages, without creating any objects
  in ksqlDB

Also expands stream_offset documentation to explain defaults and
automatic resume behavior on subsequent refreshes.
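
The expected message value shape can be illustrated with a hypothetical order event. Field names are examples only; per the docs described above, Cube matches them case-sensitively and appends a __seq column derived from the Kafka offset:

```json
{
  "order_id": "o-1001",
  "status": "shipped",
  "amount": 49.95,
  "created_at": "2026-05-01T02:59:59.000Z"
}
```

If a unique key field is missing from the payload, the value can fall back to the Kafka message key, as the deduplication subsection explains.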

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add a 'Supported SQL syntax' subsection under 'Filtering on the stream'
that documents:
- The strict plan shape requirement (Projection > Filter > TableScan)
- Supported clauses: SELECT, WHERE with comparisons/boolean logic,
  IS NULL, IN, BETWEEN, CASE, CAST, EXTRACT, SUBSTRING, scalar
  functions, CONVERT_TZ, nested expressions
- Unsupported clauses: JOIN, subqueries, GROUP BY, HAVING, aggregates,
  ORDER BY, LIMIT/OFFSET, UNION/INTERSECT/EXCEPT, window functions,
  multiple FROM/WHERE, CTEs
- Alias requirement for non-column expressions
- Unique key column expression constraints
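
Under those constraints, a streaming cube's sql stays a single projection plus filter over one source. A hypothetical example (stream and column names are placeholders) that uses only the supported clauses:

```sql
SELECT
  order_id,
  status,
  -- non-column expressions must carry an alias
  CASE WHEN status = 'shipped' THEN 1 ELSE 0 END AS is_shipped,
  CAST(amount AS DOUBLE) AS amount
FROM ORDER_EVENTS_STREAM
WHERE status IS NOT NULL
  AND amount BETWEEN 0 AND 10000
-- no JOIN, GROUP BY, ORDER BY, LIMIT, subqueries, or CTEs
```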

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add PARSE_TIMESTAMP and FORMAT_TIMESTAMP to the supported functions
list, and add a paragraph explaining that time dimension truncation
(granularity) is fully supported via the
PARSE_TIMESTAMP(FORMAT_TIMESTAMP(CONVERT_TZ(...))) expression chain
that Cube generates automatically and Cube Store evaluates natively
as custom UDFs in its post-processing engine.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add a 'Timestamp handling' subsection under 'Stream format' that covers:
- String timestamps: ISO 8601 / RFC 3339 formats with examples
- Numeric timestamps: epoch milliseconds (not seconds/microseconds)
- PARSE_TIMESTAMP for converting non-standard timestamp formats
- Time dimension truncation via granularity (auto-generated
  PARSE_TIMESTAMP/FORMAT_TIMESTAMP/CONVERT_TZ chain)
- date_trunc availability as a standard SQL function

Move time dimension truncation docs from the Supported SQL syntax
section into the Timestamp handling section where it fits better.
Add date_trunc to the supported functions list.
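
As a sketch of the two accepted input forms and a PARSE_TIMESTAMP conversion (column names and the format pattern are hypothetical illustrations, not verified syntax):

```sql
-- Accepted as-is in the payload: ISO 8601 / RFC 3339 strings
-- such as "2026-05-01T03:00:00.000Z", or epoch milliseconds
-- (not seconds or microseconds).

-- Converting a non-standard string timestamp
SELECT
  order_id,
  PARSE_TIMESTAMP(created_at_raw, 'yyyy/MM/dd HH:mm:ss') AS created_at
FROM ORDER_EVENTS_STREAM
```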

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add 'Primary key and ungrouped queries' subsection explaining:
- Cube Store's stream post-processing does not support GROUP BY
- Cube omits GROUP BY when at least one dimension has primary_key: true
- This makes the query a simple SELECT ... FROM ... eligible for
  read-only streaming
- Without a primary key dimension, GROUP BY is generated and the
  pre-aggregation cannot use the streaming path

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Update the ungrouped query requirement to state that all primary key
columns must be present in the streaming pre-aggregation's dimensions
list, not just one.
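
The requirement can be sketched as a YAML fragment (names hypothetical): every primary key column appears both as a primary_key dimension and in the pre-aggregation's dimensions list, so Cube can omit GROUP BY and keep the query a plain SELECT:

```yaml
# Fragment of a streaming cube definition
dimensions:
  - name: order_id
    sql: order_id
    type: string
    primary_key: true   # lets Cube omit GROUP BY for the streaming query
pre_aggregations:
  - name: orders_stream
    dimensions: [CUBE.order_id]   # all primary key columns must be listed
    time_dimension: CUBE.created_at
    granularity: day
```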

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Wrap the data modeling example in CodeGroup/CodeTabs with both YAML and
JavaScript versions of the batch cube (order_events) and streaming cube
(order_events_stream) including all pre-aggregation configuration.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
paveltiunov merged commit 98128af into master on May 1, 2026. 7 checks passed.
paveltiunov deleted the cursor/docs-ksqldb-kafka-streams-mode-623b branch on May 1, 2026 at 15:51.