Skip to content

fix(transformer): preserve existing BYTES value when transform yields null#18503

Merged
xiangfu0 merged 3 commits into
apache:masterfrom
xiangfu0:claude/elastic-mccarthy-5dc648
May 15, 2026
Merged

fix(transformer): preserve existing BYTES value when transform yields null#18503
xiangfu0 merged 3 commits into
apache:masterfrom
xiangfu0:claude/elastic-mccarthy-5dc648

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

Summary

  • Fix ExpressionTransformer overriding an existing non-null byte[] (BYTES column) with null when a transform expression evaluates to null. byte[] is logically a scalar in Pinot, but it accidentally falls into the isArray() branch intended for nested fields, so the null-result path called applyTransformedValue(record, column, null) and cleared the BYTES column.
  • Extract a shouldPreserveExistingValue helper that skips the apply step when the transform yields null AND an existing non-null value is present, for two cases:
    1. Implicit map transforms (mapField__KEYS/__VALUES) — generalizes the prior guard introduced in Add post-partial-upsert transforms for derived columns #17308 so it also covers the shouldApplyTransform branch (e.g. columns flagged null via putDefaultNullValue).
    2. BYTES single-value columns — treat byte[] as the scalar it represents.
  • Guard applies in both the populate-missing/overwrite branch and the existing array/Collection/Map branch, including the post-upsert overwrite path (_overwriteExistingValues=true).

Related: #17308.

Test plan

  • Added testImplicitMapTransformDoesNotOverrideExistingBytesValueWhenSourceMissing — implicit map + BYTES with default-null marker
  • Added testTransformReturningNullDoesNotOverrideExistingBytesValue — explicit transform yielding null, regular ingestion path
  • Added testPostUpsertTransformReturningNullDoesNotOverrideExistingBytesValue — explicit transform yielding null, post-upsert overwrite path
  • All 18 tests in ExpressionTransformerTest pass
  • ./mvnw spotless:apply license:format checkstyle:check license:check -pl pinot-segment-local clean

🤖 Generated with Claude Code

… null

`ExpressionTransformer` was overwriting an existing non-null `byte[]` value
with null when a transform expression evaluated to null. `byte[]` is logically
a scalar but is caught by the `isArray()` branch intended for nested fields,
so the null-result path applied `applyTransformedValue(record, column, null)`
and cleared the BYTES column.

Extract a `shouldPreserveExistingValue` helper that skips the apply step in
two cases when the transform yields null and an existing non-null value is
present:
1. Implicit map transforms (`mapField__KEYS`/`__VALUES`) — generalizes the
   prior guard so it also covers the `shouldApplyTransform` branch (e.g.
   columns flagged null via `putDefaultNullValue`).
2. BYTES single-value columns — treat `byte[]` as the scalar it represents.

Guard applies in both the populate-missing/overwrite branch and the existing
array/Collection/Map branch, including the post-upsert overwrite path
(`_overwriteExistingValues=true`).

Adds three regression tests covering implicit-map + BYTES, explicit
null-returning transform on BYTES, and the post-upsert overwrite variant.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@xiangfu0 xiangfu0 added bug Something is not working as expected ingestion Related to data ingestion pipeline labels May 14, 2026
@xiangfu0 xiangfu0 requested review from Jackie-Jiang and Copilot May 14, 2026 22:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a null-handling edge case in ExpressionTransformer where BYTES single-value columns (represented as byte[]) were incorrectly treated as “nested/array” fields and could be cleared when a transform expression evaluated to null. The change ensures existing non-null byte[] values are preserved when the transform yields null, including in the post-upsert overwrite mode.

Changes:

  • Add a shouldPreserveExistingValue(...) helper to skip applying a null transform result when an existing non-null value must be retained (implicit MAP-derived columns and BYTES byte[]).
  • Apply the preservation guard in both the “shouldApplyTransform” path and the “array/collection/map compatibility” path.
  • Add targeted unit tests covering implicit MAP + BYTES, explicit null transforms on BYTES, and post-upsert overwrite mode.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/ExpressionTransformer.java Prevents null transform results from clobbering existing non-null BYTES (byte[]) values; reuses the same guard for implicit MAP-derived columns.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/recordtransformer/ExpressionTransformerTest.java Adds regression tests for BYTES preservation behavior in normal ingestion and post-upsert overwrite paths.

The `shouldApplyTransform` branch is reached when `existingValue == null`,
the column is null-flagged, or `_overwriteExistingValues == true`. In all
three cases the transform is meant to apply — for the post-upsert overwrite
path in particular, a null-returning transform is the configured way to
clear a derived column. Keeping the preservation guard there blocks that
intent.

Restrict the `shouldPreserveExistingValue` guard to the array/Collection/Map
branch, which only fires when the existing value is genuinely non-null and
not null-flagged. Drop the two tests that exercised the misplaced guard
(implicit-map + putDefaultNullValue, and post-upsert overwrite).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.69%. Comparing base (4863cdc) to head (4a9d066).
⚠️ Report is 37 commits behind head on master.

Files with missing lines Patch % Lines
...local/recordtransformer/ExpressionTransformer.java 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18503      +/-   ##
============================================
+ Coverage     63.65%   63.69%   +0.04%     
+ Complexity     1735     1684      -51     
============================================
  Files          3254     3266      +12     
  Lines        199446   199898     +452     
  Branches      30977    31048      +71     
============================================
+ Hits         126953   127333     +380     
- Misses        62356    62403      +47     
- Partials      10137    10162      +25     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 63.69% <0.00%> (+0.04%) ⬆️
temurin 63.69% <0.00%> (+0.04%) ⬆️
unittests 63.69% <0.00%> (+0.04%) ⬆️
unittests1 55.81% <0.00%> (+0.10%) ⬆️
unittests2 34.96% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Cleaner shape: filter `byte[]` out of the else-if predicate itself so BYTES
single-value columns fall through as scalars (matching STRING/INT/etc.).
The transform is never evaluated when the column already has a value, which
makes the rule self-evident at the branch test. Drop the
`shouldPreserveExistingValue` helper since it's no longer needed.

Multi-value BYTES (`byte[][]`) is still caught by `isArray()` and continues
to be treated as a nested array. The implicit-map null guard inside the
branch is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@xiangfu0 xiangfu0 merged commit 7542d22 into apache:master May 15, 2026
11 checks passed
@xiangfu0 xiangfu0 deleted the claude/elastic-mccarthy-5dc648 branch May 15, 2026 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected ingestion Related to data ingestion pipeline

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants