[doc] Update index documentation to not only realtime data lake by JingsongLi · Pull Request #7519 · apache/paimon

JingsongLi · 2026-03-25T03:25:46Z

Apache Paimon is a lake format for building a Lakehouse architecture that serves both streaming and batch
operations. Paimon provides large-scale data lake storage for analytics, realtime streaming updates
powered by an LSM (log-structured merge-tree) structure, and multimodal data management for AI workloads,
all in a single unified format.

Large-Scale Data Lake

Paimon is built for huge analytic datasets. A single table can contain tens of petabytes of data, and even
these huge tables can be read efficiently without a distributed SQL engine.

Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily
examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
Scan planning is fast — data files are pruned with partition and column-level stats, using table metadata.
File Index (BloomFilter, Bitmap, Range Bitmap) and aggregate push-down further accelerate queries.
Schema evolution supports adding, dropping, updating, or renaming columns, with no side effects.
Rich ecosystem — adds tables to compute engines including Flink, Spark, Hive, Trino, Presto, StarRocks, and
Doris, working just like a SQL table.
Incremental Clustering uses z-order, Hilbert, or order sorting to optimize data layout at low cost.
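
As an illustration of the scan-planning point above, here is a minimal sketch of partition and min/max pruning. It is a toy model, not Paimon's implementation: the `DataFile` class, its fields, and the `prune` function are hypothetical names invented for this example, and real table metadata carries far more than one column's range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata: a partition value plus min/max stats for one column."""
    path: str
    partition: str
    min_value: int
    max_value: int


def prune(files, partition, lo, hi):
    """Keep only the files that may contain rows in `partition` with lo <= value <= hi."""
    kept = []
    for f in files:
        if f.partition != partition:
            continue  # partition pruning: wrong partition, skip without opening the file
        if f.max_value < lo or f.min_value > hi:
            continue  # stats pruning: the file's value range cannot overlap [lo, hi]
        kept.append(f)
    return kept


files = [
    DataFile("f1", "2026-03-24", 0, 99),
    DataFile("f2", "2026-03-25", 0, 99),
    DataFile("f3", "2026-03-25", 100, 199),
    DataFile("f4", "2026-03-25", 200, 299),
]

# Query: partition = '2026-03-25' AND value BETWEEN 150 AND 220
selected = [f.path for f in prune(files, "2026-03-25", 150, 220)]
print(selected)  # ['f3', 'f4']
```

Only two of the four files need to be read. Sorting or clustering data on the filtered column narrows each file's min/max range, which is why layout optimization tends to make this kind of pruning more effective.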

Realtime Data Lake

Paimon's Primary Key Table brings realtime streaming updates into the lake architecture, powered by the LSM
(Log-structured merge-tree) structure.

Large-scale streaming updates with very high performance, typically through Flink Streaming.
Multiple Merge Engines: Deduplicate to keep last row, Partial Update to progressively complete records,
Aggregation to aggregate values, or First Row to keep the earliest record — update records however you like.
Multiple Table Modes: Merge On Read (MOR), Copy On Write (COW), and Merge On Write (MOW) with Deletion Vectors
for flexible read/write trade-offs.
Changelog Producers (None, Input, Lookup, Full Compaction) produce a correct and complete changelog for each
merge engine, simplifying your streaming analytics.
CDC Ingestion from MySQL, Kafka, MongoDB, Pulsar, PostgreSQL, and Flink CDC with schema evolution support.
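
The four merge engines listed above can be sketched as different ways of folding rows that share a primary key. This is a simplified in-memory model under stated assumptions, not Paimon code: the `merge` function and its arguments are hypothetical, rows arrive in write order, `None` stands for a field the writer did not provide, and the per-field aggregate functions are chosen only for illustration.

```python
from operator import add

ENGINES = {"deduplicate", "first-row", "partial-update", "aggregation"}


def merge(rows, engine, agg=None):
    """Fold (key, row) pairs, given in write order, into one row per key."""
    if engine not in ENGINES:
        raise ValueError(f"unknown merge engine: {engine}")
    state = {}
    for key, row in rows:
        if key not in state:
            state[key] = dict(row)            # first row seen for this key
        elif engine == "deduplicate":
            state[key] = dict(row)            # the last row wins
        elif engine == "first-row":
            pass                              # keep the earliest row
        elif engine == "partial-update":
            # Only fields that carry a value overwrite the stored row.
            state[key].update({k: v for k, v in row.items() if v is not None})
        else:  # aggregation
            for field, value in row.items():
                state[key][field] = agg[field](state[key][field], value)
    return state


updates = [
    (1, {"name": "a", "city": None}),
    (1, {"name": None, "city": "x"}),
    (2, {"name": "b", "city": "y"}),
]
dedup = merge(updates, "deduplicate")
first = merge(updates, "first-row")
partial = merge(updates, "partial-update")

sales = [
    (1, {"total": 10, "peak": 10}),
    (1, {"total": 5, "peak": 3}),
    (1, {"total": 2, "peak": 12}),
]
aggregated = merge(sales, "aggregation", agg={"total": add, "peak": max})

print(dedup[1])       # {'name': None, 'city': 'x'}
print(first[1])       # {'name': 'a', 'city': None}
print(partial[1])     # {'name': 'a', 'city': 'x'}
print(aggregated[1])  # {'total': 17, 'peak': 12}
```

The same three input rows yield three different results for key 1, which is the trade-off the merge engine selects: last write, first write, or a record completed progressively from several writers.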

Multimodal Data Lake

Paimon is a multimodal lakehouse for AI. Keep multimodal data, metadata, and embeddings in the same table and query
them via vector search, full-text search, or SQL.

Data Evolution for efficient row-level updates and partial column changes without rewriting entire files — add
new features (columns) as your application evolves, without copying existing data.
Blob Table for storing multimodal data (images, videos, audio, documents, model weights) with separated storage
layout — blob data is stored in dedicated .blob files while metadata stays in standard columnar files.
Global Index with BTree Index for high-performance scalar lookups and Vector Index (DiskANN) for approximate
nearest neighbor search.
PyPaimon is a native Python SDK with no JDK dependency. It integrates with the Python AI ecosystem,
including Ray, PyTorch, Pandas, and PyArrow, for data loading, training, and inference workflows.
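
To make the idea of keeping metadata and embeddings in one table concrete, here is a small sketch of an exact (brute-force) nearest neighbor search over in-memory rows. It does not use Paimon or PyPaimon and is not how a DiskANN index works: an approximate index exists to avoid exactly this full scan. The row layout, the `cosine` and `top_k` helpers, and the sample vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """Return the ids of the k rows whose embedding is most similar to `query`."""
    ranked = sorted(rows, key=lambda r: cosine(query, r["embedding"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


# Each row keeps scalar metadata and an embedding side by side.
rows = [
    {"id": "img-1", "caption": "cat", "embedding": [1.0, 0.0]},
    {"id": "img-2", "caption": "dog", "embedding": [0.0, 1.0]},
    {"id": "img-3", "caption": "kitten", "embedding": [0.9, 0.1]},
]

nearest = top_k([1.0, 0.0], rows, k=2)
print(nearest)  # ['img-1', 'img-3']
```

The scan above costs one similarity computation per row, which is acceptable for a handful of rows and impractical at lake scale. That cost is the motivation for an approximate index.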

XiaoHongbo-Hope · 2026-03-25T04:00:05Z

+1

[doc] Update index documentation to not only realtime data lake

ec29416

JingsongLi merged commit e109ccd into apache:master Mar 25, 2026
2 checks passed

JingsongLi merged 1 commit into apache:master from JingsongLi:update_index
