docs: document Kafka streams mode for ksqlDB integration #10797

paveltiunov merged 11 commits into `master`
Conversation
Add documentation for the Kafka streams mode, where Cube reads data directly from Kafka topics instead of going through the ksqlDB REST API for data streaming. In this mode, Cube does not create any tables or streams in ksqlDB. The documentation covers:

- What Kafka streams mode is and how it differs from the default mode
- When to use it (read-only ksqlDB, higher throughput, restricted permissions)
- How to enable it via `CUBEJS_DB_KAFKA_*` environment variables
- How it works under the hood (metadata from ksqlDB, data from Kafka)
- Configuration via `driverFactory` for programmatic setup

Also fixes incorrect "Possible Values" descriptions for `CUBEJS_DB_USER` and `CUBEJS_DB_PASS` in the env vars table.

Updated both the Mintlify docs (`docs-mintlify/`) and the legacy Nextra docs (`docs/`) for consistency.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
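A minimal sketch of how the mode could be enabled via environment variables, shown here as a Docker Compose fragment. Only the `CUBEJS_DB_KAFKA_*` variable names come from this PR; the service name, image tag, hostnames, and credentials are illustrative assumptions.

```yaml
# Hypothetical deployment sketch -- only the CUBEJS_DB_KAFKA_* names
# are from this PR; everything else is illustrative.
services:
  cube:
    image: cubejs/cube:latest
    environment:
      - CUBEJS_DB_TYPE=ksql
      - CUBEJS_DB_URL=https://ksqldb.example.com:8088
      # Kafka streams mode: read topic data directly from the brokers
      # instead of streaming through the ksqlDB REST API
      - CUBEJS_DB_KAFKA_HOST=broker-1.example.com:9092
      - CUBEJS_DB_KAFKA_USER=cube
      - CUBEJS_DB_KAFKA_PASS=secret
      - CUBEJS_DB_KAFKA_USE_SSL=true
```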
**Claude** finished @paveltiunov's task in 2m 27s

PR Review: docs: document Kafka streams mode for ksqlDB integration
**Overall**

Good documentation addition. The content is well-structured and clearly explains the feature. I verified all constructor options (…).

**Issues**

1. **Broken link anchors for new Kafka env vars (mintlify)** — The new Kafka env vars link to anchors like …
2. **Inconsistent …** — The existing env vars use the …
3. **Unrelated ref link changes (mintlify)** — The PR changed two reference links that are unrelated to the Kafka streams mode feature: … The original …
4. **Missing trailing newline**

**Nit**

`kafkaUseSsl: process.env.CUBEJS_DB_KAFKA_USE_SSL === "true",`
**Claude** finished @paveltiunov's task in 1m 27s

PR Review: docs: document Kafka streams mode for ksqlDB integration
**Overall**

Well-structured documentation addition. The content clearly explains the Kafka streams mode feature, when to use it, and how to configure it. The bug fix for … I verified all constructor options (…). Both doc versions (mintlify and Nextra) are consistent in content, with appropriate component differences (…).

**Issues**

1. **Broken link anchors for new Kafka env vars (mintlify)** — The new Kafka env vars link to anchors like …
2. **…** — `kafkaUseSsl: process.env.CUBEJS_DB_KAFKA_USE_SSL === "true",` — this also matches the env var behavior defined in …
3. **Missing trailing newline — both files** — Both files end without a trailing newline (POSIX convention). Minor.

**Nits**

- Inconsistent … The existing env vars use the …
- Changed from …
- Changed from …
…ams mode

Remove the driverFactory configuration section and replace it with a Data modeling section that shows:

- How to configure ksqlDB as a named data source using decorated environment variables (`CUBEJS_DS_KSQL_DB_*`)
- How to create a cube with `data_source: ksql` that references an existing ksqlDB stream or table
- A complete cube definition with measures, dimensions, and a streaming pre-aggregation in both YAML and JavaScript

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
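The named-data-source pattern this commit describes could be sketched as follows. All cube, stream, and column names are hypothetical, and the decorated-variable names beyond the `CUBEJS_DS_KSQL_DB_*` prefix are assumptions following Cube's multiple-data-sources convention.

```yaml
# Decorated env vars configure a data source named `ksql`, e.g.:
#   CUBEJS_DS_KSQL_DB_TYPE=ksql
#   CUBEJS_DS_KSQL_DB_URL=https://ksqldb.example.com:8088
# (exact variable names are illustrative)
cubes:
  - name: order_events_stream
    data_source: ksql
    # References an existing ksqlDB stream; Cube creates nothing here
    sql: SELECT * FROM ORDER_EVENTS_STREAM

    measures:
      - name: count
        type: count

    dimensions:
      - name: timestamp
        sql: TIMESTAMP
        type: time
```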
Replace the simple single-cube example with a full lambda pre-aggregation pattern showing:

- A batch cube (`order_events`) querying a warehouse with `FILTER_PARAMS`, incremental daily partitions, and a `rollup_lambda` that merges batch and streaming rollups
- A streaming cube (`order_events_stream`) with `data_source: ksql` pointing at an existing ksqlDB stream, using `read_only: true`, `stream_offset`, `unique_key_columns`, and incremental refresh
- Documentation of key streaming pre-aggregation properties (`read_only`, `stream_offset`, `unique_key_columns`)

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
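A hedged sketch of the lambda pattern this commit describes, using the property names the commit message itself lists. Table, column, and rollup names are hypothetical.

```yaml
# Hedged sketch -- cube, table, and column names are hypothetical.
cubes:
  # Batch cube querying the warehouse for historical data
  - name: order_events
    sql: >
      SELECT * FROM orders
      WHERE {FILTER_PARAMS.order_events.timestamp.filter('timestamp')}

    measures:
      - name: count
        type: count

    dimensions:
      - name: status
        sql: status
        type: string
      - name: timestamp
        sql: timestamp
        type: time

    pre_aggregations:
      # Lambda rollup merging the batch and streaming rollups
      - name: orders_lambda
        type: rollup_lambda
        rollups:
          - orders
          - order_events_stream.orders_stream

      # Batch rollup with incremental daily partitions
      - name: orders
        measures: [count]
        dimensions: [status]
        time_dimension: timestamp
        granularity: day
        partition_granularity: day
        incremental: true

  # Streaming cube pointing at an existing ksqlDB stream
  - name: order_events_stream
    data_source: ksql
    sql: SELECT * FROM ORDER_EVENTS_STREAM

    measures:
      - name: count
        type: count

    dimensions:
      - name: order_id
        sql: ORDER_ID
        type: string
        primary_key: true
      - name: status
        sql: STATUS
        type: string
      - name: timestamp
        sql: TIMESTAMP
        type: time

    pre_aggregations:
      - name: orders_stream
        measures: [count]
        dimensions: [order_id, status]
        time_dimension: timestamp
        granularity: day
        partition_granularity: day
        incremental: true
        # Streaming-specific properties named in the commit message
        read_only: true
        stream_offset: earliest
        unique_key_columns:
          - order_id
```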
It is true by default, so specifying it is redundant. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add three new subsections to the Kafka streams mode documentation:

- Unique key columns and deduplication: explains how the `__seq` column is appended from the Kafka offset, deduplication happens at read/compaction time (last row per key wins), and key values can fall back to the Kafka message key when missing from the payload
- Stream format: documents the expected JSON object format for Kafka message values, case sensitivity of field names, and optional message key parsing
- Filtering on the stream: explains that Cube Store applies `SELECT` projections and `WHERE` filters from the cube's `sql` property directly on each micro-batch of Kafka messages, without creating any objects in ksqlDB

Also expands the `stream_offset` documentation to explain defaults and automatic resume behavior on subsequent refreshes.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
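The stream-format and deduplication rules above can be illustrated with a hypothetical Kafka message value (the field names and key are invented for illustration):

```yaml
# Hypothetical Kafka message value: a plain JSON object, one per message.
# Field names are case-sensitive and must match the cube's `sql` projection.
# With unique_key_columns: [ORDER_ID], Cube Store appends a __seq column
# derived from the Kafka offset; at read/compaction time the last row per
# key wins. If ORDER_ID were missing from the payload, the key value could
# fall back to the Kafka message key.
{
  "ORDER_ID": "o-1001",
  "STATUS": "shipped",
  "TIMESTAMP": "2024-05-01T12:34:56.000Z"
}
```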
Add a 'Supported SQL syntax' subsection under 'Filtering on the stream' that documents:

- The strict plan shape requirement (Projection > Filter > TableScan)
- Supported clauses: `SELECT`, `WHERE` with comparisons/boolean logic, `IS NULL`, `IN`, `BETWEEN`, `CASE`, `CAST`, `EXTRACT`, `SUBSTRING`, scalar functions, `CONVERT_TZ`, nested expressions
- Unsupported clauses: `JOIN`, subqueries, `GROUP BY`, `HAVING`, aggregates, `ORDER BY`, `LIMIT`/`OFFSET`, `UNION`/`INTERSECT`/`EXCEPT`, window functions, multiple `FROM`/`WHERE`, CTEs
- The alias requirement for non-column expressions
- Unique key column expression constraints

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
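To make the plan-shape constraint concrete, here is a hedged example of a cube `sql` that stays inside the supported Projection > Filter > TableScan shape, using only clauses the commit lists as supported. Stream and column names are hypothetical.

```yaml
cubes:
  - name: filtered_stream
    data_source: ksql
    # One SELECT over one stream: projections plus a WHERE filter.
    # Non-column expressions carry an alias, per the documented rule.
    sql: >
      SELECT
        ORDER_ID,
        CAST(AMOUNT AS DOUBLE) AS amount,
        CASE WHEN STATUS = 'shipped' THEN 1 ELSE 0 END AS is_shipped
      FROM ORDER_EVENTS_STREAM
      WHERE STATUS IS NOT NULL
        AND AMOUNT BETWEEN 0 AND 10000
        AND REGION IN ('us-east', 'eu-west')
```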
Add PARSE_TIMESTAMP and FORMAT_TIMESTAMP to the supported functions list, and add a paragraph explaining that time dimension truncation (granularity) is fully supported via the PARSE_TIMESTAMP(FORMAT_TIMESTAMP(CONVERT_TZ(...))) expression chain that Cube generates automatically and Cube Store evaluates natively as custom UDFs in its post-processing engine. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Add a 'Timestamp handling' subsection under 'Stream format' that covers:

- String timestamps: ISO 8601 / RFC 3339 formats with examples
- Numeric timestamps: epoch milliseconds (not seconds/microseconds)
- `PARSE_TIMESTAMP` for converting non-standard timestamp formats
- Time dimension truncation via `granularity` (auto-generated `PARSE_TIMESTAMP`/`FORMAT_TIMESTAMP`/`CONVERT_TZ` chain)
- `date_trunc` availability as a standard SQL function

Move the time dimension truncation docs from the 'Supported SQL syntax' section into the 'Timestamp handling' section, where they fit better. Add `date_trunc` to the supported functions list.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
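A hedged illustration of the timestamp rules above. Field names are hypothetical, and the exact `PARSE_TIMESTAMP` format-string syntax shown is an assumption to be checked against the Cube Store documentation.

```yaml
# Message values may carry timestamps as ISO 8601 / RFC 3339 strings,
#   e.g. "2024-05-01T12:34:56.000Z"
# or as epoch *milliseconds* (not seconds or microseconds),
#   e.g. 1714566896000
cubes:
  - name: events_stream
    data_source: ksql
    sql: >
      SELECT
        ID,
        -- Assumed usage: convert a non-standard format to a timestamp;
        -- format-string syntax is illustrative
        PARSE_TIMESTAMP(CREATED_AT, 'yyyy-MM-dd HH:mm:ss') AS created_at
      FROM EVENTS_STREAM
```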
Add 'Primary key and ungrouped queries' subsection explaining: - Cube Store's stream post-processing does not support GROUP BY - Cube omits GROUP BY when at least one dimension has primary_key: true - This makes the query a simple SELECT ... FROM ... eligible for read-only streaming - Without a primary key dimension, GROUP BY is generated and the pre-aggregation cannot use the streaming path Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Update the ungrouped query requirement to state that all primary key columns must be present in the streaming pre-aggregation's dimensions list, not just one. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
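A hedged sketch of the requirement: a `primary_key: true` dimension lets Cube omit `GROUP BY`, and all primary key columns must appear in the streaming pre-aggregation's `dimensions` list. Cube, stream, and column names are hypothetical.

```yaml
cubes:
  - name: order_events_stream
    data_source: ksql
    sql: SELECT * FROM ORDER_EVENTS_STREAM

    dimensions:
      # primary_key: true lets Cube omit GROUP BY, producing a plain
      # SELECT ... FROM ... eligible for the streaming path
      - name: order_id
        sql: ORDER_ID
        type: string
        primary_key: true

      - name: status
        sql: STATUS
        type: string

    pre_aggregations:
      - name: orders_stream
        # All primary key columns must be listed among the dimensions
        dimensions: [order_id, status]
```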
Wrap the data modeling example in CodeGroup/CodeTabs with both YAML and JavaScript versions of the batch cube (order_events) and streaming cube (order_events_stream) including all pre-aggregation configuration. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Check List
Description of Changes Made
Documents the Kafka streams mode for the ksqlDB integration. In this mode, Cube reads data directly from the underlying Kafka topics instead of going through the ksqlDB REST API for data streaming. Cube does not create any tables or streams in ksqlDB when this mode is enabled.
Changes
Mintlify docs (`docs-mintlify/admin/connect-to-data/data-sources/ksqldb.mdx`):

- Added `CUBEJS_DB_KAFKA_HOST`, `CUBEJS_DB_KAFKA_USER`, `CUBEJS_DB_KAFKA_PASS`, and `CUBEJS_DB_KAFKA_USE_SSL` to the environment variables table
- Added `.env` examples, including a multi-broker setup
- Added a lambda pre-aggregation example (`rollup_lambda`)
- Documented that a dimension marked `primary_key: true` is required for the streaming pre-aggregation to work, because Cube Store's stream engine doesn't support `GROUP BY`; Cube auto-omits grouping when a primary key is present
- Documented deduplication: the `__seq` column, last-row-per-key semantics, and the Kafka message key fallback
- Documented timestamp handling: `PARSE_TIMESTAMP` for non-standard formats, time dimension truncation via `granularity`, and `date_trunc` availability
- Fixed the incorrect "Possible Values" descriptions for `CUBEJS_DB_USER` and `CUBEJS_DB_PASS`

Legacy Nextra docs (`docs/content/product/configuration/data-sources/ksqldb.mdx`): …