Skip to content

Enforce UTF8 when decoding byte[] to string in ValueReader#16608

Merged
Jackie-Jiang merged 2 commits intoapache:masterfrom
jtao15:unicode
Aug 15, 2025
Merged

Enforce UTF8 when decoding byte[] to string in ValueReader#16608
Jackie-Jiang merged 2 commits intoapache:masterfrom
jtao15:unicode

Conversation

@jtao15
Copy link
Contributor

@jtao15 jtao15 commented Aug 15, 2025

Fix bug in reading unicode string in ValueReader as mentioned in #16607

@jtao15 jtao15 added the bugfix label Aug 15, 2025
@codecov-commenter
Copy link

codecov-commenter commented Aug 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.35%. Comparing base (d9e7317) to head (0da58b6).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16608      +/-   ##
============================================
+ Coverage     63.34%   63.35%   +0.01%     
  Complexity     1379     1379              
============================================
  Files          3019     3019              
  Lines        175600   175600              
  Branches      26918    26918              
============================================
+ Hits         111236   111256      +20     
+ Misses        55874    55852      -22     
- Partials       8490     8492       +2     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.29% <100.00%> (-0.01%) ⬇️
java-21 63.33% <100.00%> (+0.02%) ⬆️
temurin 63.35% <100.00%> (+0.01%) ⬆️
unittests 63.35% <100.00%> (+0.01%) ⬆️
unittests1 56.44% <100.00%> (+<0.01%) ⬆️
unittests2 33.27% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sajjad-moradi
Copy link
Contributor

Can you add a unit test? The test should pass on Mac, but GA will run it on Linux.

@jtao15
Copy link
Contributor Author

jtao15 commented Aug 15, 2025

Can you add a unit test? The test should pass on Mac, but GA will run it on Linux.

For some reason, I didn't push the changes for the test. PTAL now.

Copy link
Contributor

@sajjad-moradi sajjad-moradi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@sajjad-moradi
Copy link
Contributor

@Jackie-Jiang this needs to be hotfixed for us ASAP. Please take a look.

default String getUnpaddedString(int index, int numBytesPerValue, byte[] buffer) {
int length = readUnpaddedBytes(index, numBytesPerValue, buffer);
return new String(buffer, 0, length);
return new String(buffer, 0, length, StandardCharsets.UTF_8);
Copy link
Contributor

@ankitsultana ankitsultana Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not blocking)

String in most JDKs support LATIN1 and UTF-16. My understanding is that most Latin/English character strings end up using LATIN1 which is reasonably compact (e.g. UUID is 36 characters).

Would passing UTF_8 switch the coder value to UTF-16 for some other strings that could have been served with Latin-1? It could lead to a quite a big increase in heap utilization for such cases, since UTF-16 is almost double the size of Latin-1.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is UTF_8 decoding should not force UTF-16 storage, and how Java stores it only depends on the characters in the string.
Screenshot 2025-08-15 at 1 27 54 PM

@Jackie-Jiang Jackie-Jiang merged commit 88190cc into apache:master Aug 15, 2025
66 of 72 checks passed
@xiangfu0 xiangfu0 added the bug Something is not working as expected label Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants