[docs] Add Real-Time User Profile quickstart tutorial #2669

Open
Prajwal-banakar wants to merge 6 commits into apache:main from Prajwal-banakar:docs-newQuickstart

@Prajwal-banakar
Contributor

@Prajwal-banakar Prajwal-banakar commented Feb 12, 2026

Purpose

Linked issue: close #2659

The purpose of this change is to add a new quickstart tutorial, "Real-Time User Profile," to the Apache Fluss documentation. This tutorial demonstrates a realistic, production-grade business scenario by combining the Auto-Increment Column and Aggregation Merge Engine features. It specifically addresses the need for guidance on mapping high-cardinality string identifiers to compact integers for efficient real-time analytics.

Brief change log

This pull request introduces a comprehensive tutorial located at website/docs/quickstartUuser-Profile.md. Key changes include:

Realistic Use Case: Developed a scenario focused on identity mapping (Email to UID) and real-time metric aggregation (Total Clicks and Unique Visitors).

Feature Integration: Showcases the synergy between FIP-16 (Auto-Increment) for dictionary management and FIP-21 (Aggregation Merge Engine) for storage-level pre-aggregation.

Technical Optimization: Implemented the maintainer's recommendation to use INT for the generated uid column to maximize storage efficiency and performance for RoaringBitmap (rbm64) operations.

Reliability Section: Added documentation on Undo Recovery to explain how Fluss ensures exactly-once accuracy for aggregations during Flink failovers.

Visual Guidance: Included an architectural diagram to illustrate the data flow from raw event ingestion to the final pre-aggregated profile storage.
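The identity-mapping and pre-aggregation idea described in the change log can be sketched in plain Python. This is only an illustrative simulation, not the Fluss API or the tutorial's SQL: the class names, the `page` grouping key, the sample events, and the choice to start uids at 1 are all assumptions made for the example. A Python `set` stands in for the bitmap of unique visitors.

```python
from dataclasses import dataclass, field


@dataclass
class UidDictionary:
    """Toy stand-in for a dictionary table with an auto-increment uid column."""
    next_uid: int = 1  # assumption: uids start at 1
    uids: dict = field(default_factory=dict)

    def uid_for(self, email: str) -> int:
        # The first sighting of an email allocates the next integer;
        # later sightings reuse the uid that was already assigned.
        if email not in self.uids:
            self.uids[email] = self.next_uid
            self.next_uid += 1
        return self.uids[email]


@dataclass
class Profile:
    """Toy stand-in for one pre-aggregated row."""
    total_clicks: int = 0
    visitors: set = field(default_factory=set)

    def merge(self, uid: int, clicks: int) -> None:
        self.total_clicks += clicks  # sum-style aggregation
        self.visitors.add(uid)       # union-style aggregation (bitmap stand-in)


def build_profiles(events):
    """Map each email to a compact uid, then fold events into per-key rows."""
    dictionary = UidDictionary()
    profiles = {}
    for email, page, clicks in events:
        uid = dictionary.uid_for(email)
        profiles.setdefault(page, Profile()).merge(uid, clicks)
    return dictionary, profiles


# Hypothetical sample events: (email, page, clicks)
events = [
    ("a@example.com", "home", 2),
    ("b@example.com", "home", 1),
    ("a@example.com", "home", 3),
    ("a@example.com", "cart", 1),
]
dictionary, profiles = build_profiles(events)
print(dictionary.uids)
print(profiles["home"].total_clicks, len(profiles["home"].visitors))
```

In this sketch the repeated email reuses its uid, so the "home" row ends up with the summed clicks and only two distinct visitors. The point of mapping to small integers is that the visitor set can then be stored as a compact bitmap rather than as a collection of strings.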

Tests

Documentation Build: Verified that the documentation builds correctly using the local Docusaurus environment and that the new page is correctly linked in the sidebar.

SQL Verification: Manually verified the Flink SQL syntax against the Apache Fluss 0.9 connector specifications.

API and Format

This change is documentation-only and does not affect the Java API or the underlying storage format.

Documentation

Yes, this change introduces a new documentation feature (a new quickstart tutorial) aimed at guiding users through the adoption of Fluss's advanced streaming storage capabilities.

@Prajwal-banakar
Contributor Author

Hi @wuchong PTAL!

@wuchong wuchong requested review from polyzos and xx789633 February 12, 2026 15:06
@wuchong
Member

wuchong commented Feb 12, 2026

@polyzos @xx789633 could you help review this?

Member

@wuchong wuchong left a comment

Hi @Prajwal-banakar, thank you for your contribution! However, quickstart documentation typically needs to be fully reproducible—readers should be able to follow it step by step and achieve the same results, just like in our existing quickstarts:
https://fluss.apache.org/docs/quickstart/flink/ and https://fluss.apache.org/docs/quickstart/lakehouse/.

Could you please enhance the guide by adding the environment setup (e.g., using Docker Compose), clear instructions on how to run the queries, and guidance on how to visualize or verify the results? This will greatly improve usability and consistency with our documentation standards.

@xx789633
Contributor

Hi @Prajwal-banakar, thanks for the PR. I have the same suggestion as @wuchong: we need to make the example fully reproducible. For example, to ingest the raw data for the source data stream, we can provide a CSV file as sample data, just like this example: https://github.com/aliyun/alibabacloud-hologres-connectors/blob/master/hologres-connector-examples/hologres-connector-flink-examples/src/main/java/com/alibaba/hologres/connector/flink/example/FlinkRoaringBitmapAggJob.java

@Prajwal-banakar
Contributor Author

Hi @wuchong @xx789633, thanks for the quick feedback! I have enhanced the quickstart guide, PTAL!

@wuchong
Member

wuchong commented Feb 13, 2026

Hi @Prajwal-banakar, please ensure the quickstart can run successfully. Additionally, the image appears to be AI-generated. I don’t object to AI-generated content in principle, but please make sure the text in the image contains no garbled characters and all content makes sense.

@Prajwal-banakar
Contributor Author

Prajwal-banakar commented Feb 13, 2026

Hi @wuchong, I verified the guide locally and it is working perfectly. I also fixed the diagram format:
Screenshot 2026-02-13 185004

d.uid,
-- Convert INT to BYTES for rbm64.
-- Note: In a real production job, you might use a UDF to ensure correct bitmap initialization.
CAST(CAST(d.uid AS STRING) AS BYTES),
Member

I believe we need the to_rbm and from_rbm Flink UDFs to process the data correctly. Without these functions, the results would be meaningless and users would not understand the purpose of this feature.

However, shipping Flink UDFs falls outside the scope of the Fluss project. I will coordinate with members of the Flink community to contribute these UDFs and identify an appropriate location to open source and publish the UDF JARs. Once available, we can reference these functions in our documentation and examples.

That said, the Lunar New Year holiday is approaching in China, so we likely will not be able to start this work until March. Until then, we may need to suspend this PR. Thank you for the quick update.

Contributor

I believe we need the to_rbm and from_rbm Flink UDFs to process the data correctly. Without these functions, the results would be meaningless and users would not understand the purpose of this feature.

Exactly. We also need functions such as RB_CARDINALITY and RB_OR_AGG to aggregate the resulting bitmaps.
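To illustrate the concern raised in this thread, here is a minimal Python sketch contrasting the string-to-bytes cast with a real bitmap. It is a toy model under stated assumptions: it assumes the SQL cast yields the UTF-8 bytes of the uid's decimal digits, it uses a Python integer as a bitset instead of a serialized RoaringBitmap, and the function names are hypothetical helpers that only loosely mirror the UDF names mentioned above.

```python
def cast_via_string(uid: int) -> bytes:
    # Assumed behaviour of CAST(CAST(uid AS STRING) AS BYTES):
    # the UTF-8 bytes of the decimal digits, not a bitmap.
    return str(uid).encode("utf-8")


def to_bitmap(uid: int) -> int:
    # Toy bitmap: a Python int with bit number `uid` set.
    return 1 << uid


def or_agg(bitmaps) -> int:
    # Union of bitmaps, the role an OR-style aggregate would play.
    result = 0
    for bitmap in bitmaps:
        result |= bitmap
    return result


def cardinality(bitmap: int) -> int:
    # Number of set bits, i.e. the number of distinct uids.
    return bin(bitmap).count("1")


def from_bitmap(bitmap: int) -> list:
    # Recover the uids stored in the toy bitmap, in ascending order.
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


uids = [3, 7, 3, 12]  # hypothetical uids; 3 appears twice
merged = or_agg(to_bitmap(u) for u in uids)
print(cardinality(merged), from_bitmap(merged))
print(cast_via_string(12))
```

The bitmap path deduplicates the repeated uid and supports union and counting, whereas the cast path only produces the digit characters as bytes, which a bitmap aggregation cannot interpret as a set of users.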

@Prajwal-banakar
Contributor Author

Hi @wuchong, @xx789633,
Thank you for the detailed explanation.
I understand that this work will be coordinated with the Flink community. I am happy to keep this PR open and suspended until those JARs are published. Once the UDFs are available, I will update the guide to include the proper function calls and verification steps.
Happy Lunar New Year to the team! I look forward to finalizing this. Thanks!

Development

Successfully merging this pull request may close these issues.

[docs] Add a new quickstart tutorial for auto-incremental column and Agg Merge Engine

3 participants