[core] Support external-storage BLOB columns and MERGE INTO updates for descriptor-based BLOB columns #7328

Merged
JingsongLi merged 3 commits into apache:master from JunRuiLee:support-blob-des-copy
Mar 3, 2026

@JunRuiLee JunRuiLee commented Mar 2, 2026

Purpose

This PR adds support for descriptor-based BLOB fields backed by external storage. For these fields, Paimon writes raw BLOB data to a configured external storage path at write time and stores only serialized BlobDescriptors inline in data files. The change also adds validation for the new external-storage BLOB options and verifies that raw-data BLOB fields, descriptor-based BLOB fields, and descriptor-based BLOB fields backed by external storage can coexist in the same table.
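
As an illustration of the write path described above, here is a minimal sketch, not Paimon's implementation. The `Descriptor` record, the `writeExternal` and `read` helpers, and the local temp directory standing in for the external storage path are all hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.UUID;

public class ExternalBlobSketch {

    /** Hypothetical stand-in for a blob descriptor: a location plus a byte range. */
    record Descriptor(String uri, long offset, long length) {}

    /** Writes the raw bytes under externalDir and returns a descriptor pointing at them. */
    static Descriptor writeExternal(Path externalDir, byte[] raw) {
        try {
            Files.createDirectories(externalDir);
            Path target = externalDir.resolve(UUID.randomUUID() + ".bin");
            Files.write(target, raw);
            // Only this small descriptor would be stored inline in the row.
            return new Descriptor(target.toUri().toString(), 0L, raw.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves a descriptor back to the bytes it points at. */
    static byte[] read(Descriptor d) {
        try {
            byte[] all = Files.readAllBytes(Path.of(URI.create(d.uri())));
            return Arrays.copyOfRange(all, (int) d.offset(), (int) (d.offset() + d.length()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a local temp directory that plays the role of the external storage path. */
    static Path tempDir() {
        try {
            return Files.createTempDirectory("external-blobs");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Descriptor d = writeExternal(tempDir(), "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(d.length() + " bytes stored at " + d.uri());
    }
}
```

The point of the sketch is the split in responsibilities: the payload goes to a separate location at write time, and the row carries only enough information to find it again.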

This PR also refines MERGE INTO validation for BLOB columns in Flink and Spark. Updates are still rejected for raw-data BLOB columns, but are now allowed for descriptor-based BLOB columns, including those backed by external storage.

Tests

UT:

  • BlobTableTest#testExternalStorageBlobField
  • BlobTableTest#testThreeTypeBlobCoexistence
  • BlobTableTest#testExternalStorageFieldValidationRequiresPath
  • BlobTableTest#testExternalStorageFieldMustBeSubsetOfDescriptorField
  • BlobTestBase: Blob: merge-into rejects updating raw-data BLOB column
  • BlobTestBase: Blob: merge-into updates non-blob column on descriptor blob table
  • BlobTestBase: Blob: merge-into updates descriptor blob column with external storage end-to-end

IT:

  • BlobTableITCase#testExternalStorageBlob
  • BlobTableITCase#testThreeTypeBlobCoexistence
  • BlobTableITCase#testExternalStorageBlobMultipleWrites
  • DataEvolutionMergeIntoActionITCase#testUpdateRawBlobColumnThrowsError
  • DataEvolutionMergeIntoActionITCase#testUpdateNonBlobColumnOnDescriptorBlobTableSucceeds
  • DataEvolutionMergeIntoActionITCase#testUpdateExternalStorageBlobColumnSucceeds

API and Format

Documentation

Generative AI tooling

@JunRuiLee JunRuiLee force-pushed the support-blob-des-copy branch 2 times, most recently from fc2f6cf to 807fd1d March 3, 2026 02:58

@JingsongLi JingsongLi left a comment

Commented on the options.


@Immutable
public static final ConfigOption<String> BLOB_COPIED_DESCRIPTOR_FIELD =
key("blob-copied-descriptor-field")

blob-external-storage-field


@Immutable
public static final ConfigOption<String> BLOB_COPIED_DESCRIPTOR_TARGET_DIR =
key("blob-copied-descriptor-target-dir")

blob-external-storage-path

Support a third type of BLOB column: external-storage. When configured,
raw BLOB data is written to a configured external storage path at write time
and the BLOB value in the row is replaced with a BlobDescriptor pointing
to the external data.

This enables a BLOB hierarchy in a single table:
- Raw BLOB: data stored in .blob files within the table
- Descriptor BLOB: data stored externally, descriptor inline
- External-storage BLOB: data written to external storage, descriptor inline

New options:
- blob-external-storage-field: comma-separated field names
- blob-external-storage-path: external storage path for raw data
External-storage fields must be a subset of blob-descriptor-field.
Orphan file cleanup is not applied to the external storage path.
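
The two constraints stated above (external-storage fields need a path, and must be a subset of `blob-descriptor-field`) can be sketched as a standalone check. This is an assumption-laden illustration: the class `BlobOptionsCheck`, its methods, and the error messages are hypothetical and are not the validation code added by this PR. Only the three option keys come from the text.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class BlobOptionsCheck {

    /** Splits a comma-separated option value into trimmed, non-empty field names. */
    static Set<String> fields(String value) {
        Set<String> out = new LinkedHashSet<>();
        if (value != null) {
            for (String s : value.split(",")) {
                if (!s.isBlank()) {
                    out.add(s.trim());
                }
            }
        }
        return out;
    }

    /** Hypothetical validation mirroring the two rules described in the commit message. */
    static void validate(Map<String, String> options) {
        Set<String> descriptor = fields(options.get("blob-descriptor-field"));
        Set<String> external = fields(options.get("blob-external-storage-field"));
        if (external.isEmpty()) {
            return; // nothing to check when no external-storage field is configured
        }
        String path = options.get("blob-external-storage-path");
        if (path == null || path.isBlank()) {
            throw new IllegalArgumentException(
                    "blob-external-storage-path is required when blob-external-storage-field is set");
        }
        if (!descriptor.containsAll(external)) {
            throw new IllegalArgumentException(
                    "blob-external-storage-field must be a subset of blob-descriptor-field");
        }
    }

    public static void main(String[] args) {
        // Field names and the path are made up for the example.
        validate(Map.of(
                "blob-descriptor-field", "thumbnail,video",
                "blob-external-storage-field", "video",
                "blob-external-storage-path", "/tmp/external-blobs"));
        System.out.println("options accepted");
    }
}
```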
…-into.

Previously, any table with BLOB columns
was rejected by DataEvolution merge-into.
Now the validation checks whether the BLOB columns being written are
raw-data or descriptor-based:
- Raw-data BLOB columns: update is still rejected
- Descriptor-based BLOB columns, including external-storage fields: update is allowed
Supported in both Flink (DataEvolutionMergeIntoAction) and
Spark (DataEvolutionPaimonWriter).
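
The refined rule can be read as a per-column check on the columns being updated. The following is a minimal sketch under that reading. `MergeIntoBlobCheck` and `checkUpdate` are hypothetical names, not the code in `DataEvolutionMergeIntoAction` or `DataEvolutionPaimonWriter`. Because external-storage fields are a subset of the descriptor fields, they need no separate branch.

```java
import java.util.Set;

public class MergeIntoBlobCheck {

    /**
     * Rejects an update if any updated column is a BLOB column that is not descriptor-based.
     * Non-BLOB columns and descriptor-based BLOB columns pass.
     */
    static void checkUpdate(
            Set<String> updatedColumns, Set<String> blobColumns, Set<String> descriptorFields) {
        for (String column : updatedColumns) {
            boolean rawBlob = blobColumns.contains(column) && !descriptorFields.contains(column);
            if (rawBlob) {
                throw new UnsupportedOperationException(
                        "Updating raw-data BLOB column is not supported: " + column);
            }
        }
    }

    public static void main(String[] args) {
        // Column names are made up for the example.
        Set<String> blobColumns = Set.of("raw_img", "desc_img");
        Set<String> descriptorFields = Set.of("desc_img");
        checkUpdate(Set.of("name", "desc_img"), blobColumns, descriptorFields);
        System.out.println("update allowed");
    }
}
```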
@JunRuiLee JunRuiLee force-pushed the support-blob-des-copy branch from 807fd1d to 5b49724 March 3, 2026 03:59
@JunRuiLee JunRuiLee changed the title from "[core] Support descriptor-based BLOB columns with copied raw data and allow MERGE INTO updates" to "[core] Support external-storage BLOB columns and MERGE INTO updates for descriptor-based BLOB columns" Mar 3, 2026
@JunRuiLee
Contributor Author

@JingsongLi Thanks for the suggestion! I've updated the names for the two configs you mentioned, along with the related code, to ensure consistency. Additionally, I've updated the PR title, description, and commit message. PTAL~

@JingsongLi JingsongLi left a comment

Can you also update documentation? blob.md.

@JunRuiLee
Contributor Author

Can you also update documentation? blob.md.

Thanks @JingsongLi, the documentation in blob.md has been updated.

@JingsongLi JingsongLi left a comment

Thanks @JunRuiLee, looks good to me!

@JingsongLi JingsongLi merged commit 2c7bb26 into apache:master Mar 3, 2026
13 checks passed
@JunRuiLee JunRuiLee deleted the support-blob-des-copy branch March 3, 2026 12:34