Enforce UTF8 when decoding byte[] to string in ValueReader #16608
Jackie-Jiang merged 2 commits into apache:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             master   #16608      +/-  ##
============================================
+ Coverage     63.34%   63.35%   +0.01%
  Complexity     1379     1379
============================================
  Files          3019     3019
  Lines        175600   175600
  Branches      26918    26918
============================================
+ Hits         111236   111256      +20
+ Misses        55874    55852      -22
- Partials       8490     8492       +2
Can you add a unit test? The test should pass on Mac, but GA will run it on Linux.
For some reason, I didn't push the changes for the test. PTAL now. |
@Jackie-Jiang this needs to be hotfixed for us ASAP. Please take a look. |
  default String getUnpaddedString(int index, int numBytesPerValue, byte[] buffer) {
    int length = readUnpaddedBytes(index, numBytesPerValue, buffer);
-   return new String(buffer, 0, length);
+   return new String(buffer, 0, length, StandardCharsets.UTF_8);
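As a sketch of why the one-argument constructor was a problem: the two helpers below are hypothetical stand-ins for the old and patched code paths (they are not the actual ValueReader methods). The one-argument constructor uses the platform default charset, so the same bytes can decode differently on different machines; passing `StandardCharsets.UTF_8` makes the result deterministic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
  // Hypothetical stand-in for the old code path: uses the platform default charset.
  static String decodeDefault(byte[] buffer, int length) {
    return new String(buffer, 0, length);
  }

  // Hypothetical stand-in for the patched code path: always decodes as UTF-8.
  static String decodeUtf8(byte[] buffer, int length) {
    return new String(buffer, 0, length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] utf8 = "こんにちは".getBytes(StandardCharsets.UTF_8);

    // With the explicit charset, the round trip is lossless on every platform.
    System.out.println(decodeUtf8(utf8, utf8.length).equals("こんにちは")); // true

    // Without it, the result depends on the JVM default charset: correct when
    // the default is UTF-8 (typical on macOS), garbled when it is not.
    System.out.println("default charset: " + Charset.defaultCharset());
  }
}
```

This is also why a test can pass locally on a Mac yet fail on a CI runner with a different default charset.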
(not blocking)
Strings in most JDKs are stored internally as either LATIN1 or UTF-16 (compact strings). My understanding is that most Latin/English character strings end up using LATIN1, which is reasonably compact (e.g. a UUID is 36 characters).
Would passing UTF_8 switch the coder value to UTF-16 for some strings that could otherwise have been served with LATIN1? That could lead to quite a big increase in heap utilization in such cases, since UTF-16 is almost double the size of LATIN1.
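For what it's worth, the charset argument only affects how bytes are decoded into chars; the compact-string coder (LATIN1 vs UTF-16) is then chosen from the resulting char values, not from the decoder. A small self-contained sketch illustrating this (the class name is made up for the example):

```java
import java.nio.charset.StandardCharsets;

public class CompactStringSketch {
  public static void main(String[] args) {
    // Pure-ASCII content: UTF-8 and Latin-1 decode to identical chars, so the
    // JVM's compact-string coder choice is the same either way.
    byte[] ascii = "a3f0c1d2-4b5e-6789-abcd-ef0123456789".getBytes(StandardCharsets.UTF_8);
    String viaUtf8 = new String(ascii, StandardCharsets.UTF_8);
    String viaLatin1 = new String(ascii, StandardCharsets.ISO_8859_1);
    System.out.println(viaUtf8.equals(viaLatin1)); // true: same chars either way

    // Non-ASCII content is where the charsets diverge: UTF-8 decodes the
    // two-byte sequence for 'é' into one char, while Latin-1 decoding would
    // turn those two bytes into two separate (garbled) chars.
    byte[] accented = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes
    System.out.println(new String(accented, StandardCharsets.UTF_8).length());     // 4
    System.out.println(new String(accented, StandardCharsets.ISO_8859_1).length()); // 5
  }
}
```

So passing UTF_8 should not by itself push Latin-1-representable strings to UTF-16; only strings whose decoded chars fall outside Latin-1 use the wider coder, and those were being decoded incorrectly before the fix anyway.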
Fix bug in reading Unicode strings in ValueReader, as described in #16607