
feat(blob): add blob descriptor write support for append-only tables #270

Merged
JingsongLi merged 2 commits into apache:main from JingsongLi:blob_descriptor
Apr 21, 2026

Conversation

@JingsongLi
Contributor

Purpose

Introduce BlobDescriptor serialization/deserialization, AppendBlobFileWriter for writing blob format files, blob format writer implementation, and descriptor mode in the reader that returns BlobDescriptor references instead of inline data. Update row tracking in commit to use per-column counters for blob files. Allow BlobType in append-only tables (still unsupported with primary keys).

Brief change log

Tests

API and Format

Documentation


/// Comma-separated BLOB field names stored as serialized BlobDescriptor
/// bytes inline in normal data files (no .blob files for these fields).
pub fn blob_descriptor_fields(&self) -> HashSet<String> {
Contributor


The blob-descriptor-field option needs schema-level validation before we use this set. In Java, every configured field must exist and must be a top-level BLOB field. Here we only parse the option string, so typos and nested or non-BLOB fields are silently accepted, and we diverge from the Java behavior. Can we validate this during schema construction or table initialization instead of letting it flow into the writer path?
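A minimal sketch of the kind of check being asked for here. `DataType`, `Field`, and `validate_blob_descriptor_fields` are simplified, hypothetical stand-ins, not the crate's actual types or API; the real schema types and error type will differ.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the crate's schema types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int,
    Blob,
    Row(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Checks that every configured descriptor field exists as a top-level
/// column and that the column's type is exactly BLOB (not nested, not other).
fn validate_blob_descriptor_fields(
    fields: &[Field],
    configured: &HashSet<String>,
) -> Result<(), String> {
    // Sort the names so the first reported error is deterministic.
    let mut names: Vec<&String> = configured.iter().collect();
    names.sort();

    for name in names {
        match fields.iter().find(|f| &f.name == name) {
            None => {
                return Err(format!(
                    "blob descriptor field '{name}' does not exist in the schema"
                ));
            }
            Some(f) if f.data_type != DataType::Blob => {
                return Err(format!(
                    "blob descriptor field '{name}' must be a top-level BLOB field"
                ));
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

Running a check like this once at table initialization would surface a typo or a nested field as a configuration error, rather than letting it reach the writer path.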

Comment thread crates/paimon/src/table/table_write.rs Outdated
};

let has_blob_fields = schema.fields().iter().any(|f| {
f.data_type().contains_blob_type() && !blob_descriptor_fields.contains(f.name())
Contributor


This is broader than the Java blob contract. contains_blob_type() recurses into nested types, so a column like ROW<blob BLOB> now sets has_blob_fields and routes the whole top-level column into AppendBlobFileWriter, but the blob writer still only accepts a single top-level BinaryArray. That makes the new nested-blob acceptance path a false positive: the schema is accepted, then the write path fails at runtime. I think this needs to be tightened to top-level DataType::Blob(_) only (matching Java BlobType.fieldsInBlobFile) or explicitly rejected during schema validation.
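To illustrate the difference, here is a small sketch contrasting the recursive check with a top-level-only check. The types are simplified, hypothetical stand-ins (the crate's real `DataType::Blob` carries a payload, and `is_top_level_blob` and `has_blob_file_fields` are names invented for this example).

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the crate's schema types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int,
    Blob,
    Row(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

impl DataType {
    /// Recursive check: true for BLOB and for any ROW that contains a BLOB
    /// at any depth. This is the behavior the comment describes as too broad.
    fn contains_blob_type(&self) -> bool {
        match self {
            DataType::Blob => true,
            DataType::Row(fields) => fields.iter().any(|f| f.data_type.contains_blob_type()),
            DataType::Int => false,
        }
    }

    /// Top-level check: true only when the column type itself is BLOB.
    fn is_top_level_blob(&self) -> bool {
        matches!(self, DataType::Blob)
    }
}

/// Decides whether any column should be routed to the blob file writer:
/// only top-level BLOB columns that are not configured as descriptor fields.
fn has_blob_file_fields(fields: &[Field], descriptor_fields: &HashSet<String>) -> bool {
    fields
        .iter()
        .any(|f| f.data_type.is_top_level_blob() && !descriptor_fields.contains(&f.name))
}
```

With the top-level check, a ROW<blob BLOB> column no longer triggers the blob writer path. Whether such a schema should then be written as an ordinary column or rejected outright during validation is a separate decision.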

Contributor

@jerry-024 jerry-024 left a comment


+1

@JingsongLi JingsongLi merged commit 87671fa into apache:main Apr 21, 2026
8 checks passed

2 participants