Enforce UTF8 when decoding byte[] to string in ValueReader #16608
Jackie-Jiang merged 2 commits into apache:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             master   #16608      +/-  ##
============================================
+ Coverage     63.34%   63.35%   +0.01%
  Complexity     1379     1379
============================================
  Files          3019     3019
  Lines        175600   175600
  Branches      26918    26918
============================================
+ Hits         111236   111256      +20
+ Misses        55874    55852      -22
- Partials       8490     8492       +2
Can you add a unit test? The test should pass on Mac, but GA will run it on Linux.
For some reason, I didn't push the changes for the test. PTAL now. |
@Jackie-Jiang this needs to be hotfixed for us ASAP. Please take a look. |
  default String getUnpaddedString(int index, int numBytesPerValue, byte[] buffer) {
    int length = readUnpaddedBytes(index, numBytesPerValue, buffer);
-   return new String(buffer, 0, length);
+   return new String(buffer, 0, length, StandardCharsets.UTF_8);
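As a sketch of why the one-argument constructor was a problem: the two helpers below are hypothetical stand-ins for the old and patched code paths (they are not the actual ValueReader methods). The one-argument constructor uses the platform default charset, so the same bytes can decode differently on different machines; passing `StandardCharsets.UTF_8` makes the result deterministic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8DecodeSketch {
  // Hypothetical stand-in for the old code path: uses the platform default charset.
  static String decodeDefault(byte[] buffer, int length) {
    return new String(buffer, 0, length);
  }

  // Hypothetical stand-in for the patched code path: always decodes as UTF-8.
  static String decodeUtf8(byte[] buffer, int length) {
    return new String(buffer, 0, length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] utf8 = "こんにちは".getBytes(StandardCharsets.UTF_8);

    // With the explicit charset, the round trip is lossless on every platform.
    System.out.println(decodeUtf8(utf8, utf8.length).equals("こんにちは")); // true

    // Without it, the result depends on the JVM default charset: correct when
    // the default is UTF-8 (typical on macOS), garbled when it is not.
    System.out.println("default charset: " + Charset.defaultCharset());
  }
}
```

This is also why a test can pass locally on a Mac yet fail on a CI runner with a different default charset.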
(not blocking)
Strings in most JDKs are stored internally as either LATIN1 or UTF-16 (compact strings). My understanding is that most Latin/English character strings end up using LATIN1, which is reasonably compact (e.g. a UUID is 36 characters).
Would passing UTF_8 switch the coder value to UTF-16 for some strings that could otherwise have been served with LATIN1? That could lead to quite a big increase in heap utilization in such cases, since UTF-16 is almost double the size of LATIN1.
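For what it's worth, the charset argument only affects how bytes are decoded into chars; the compact-string coder (LATIN1 vs UTF-16) is then chosen from the resulting char values, not from the decoder. A small self-contained sketch illustrating this (the class name is made up for the example):

```java
import java.nio.charset.StandardCharsets;

public class CompactStringSketch {
  public static void main(String[] args) {
    // Pure-ASCII content: UTF-8 and Latin-1 decode to identical chars, so the
    // JVM's compact-string coder choice is the same either way.
    byte[] ascii = "a3f0c1d2-4b5e-6789-abcd-ef0123456789".getBytes(StandardCharsets.UTF_8);
    String viaUtf8 = new String(ascii, StandardCharsets.UTF_8);
    String viaLatin1 = new String(ascii, StandardCharsets.ISO_8859_1);
    System.out.println(viaUtf8.equals(viaLatin1)); // true: same chars either way

    // Non-ASCII content is where the charsets diverge: UTF-8 decodes the
    // two-byte sequence for 'é' into one char, while Latin-1 decoding would
    // turn those two bytes into two separate (garbled) chars.
    byte[] accented = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes
    System.out.println(new String(accented, StandardCharsets.UTF_8).length());     // 4
    System.out.println(new String(accented, StandardCharsets.ISO_8859_1).length()); // 5
  }
}
```

So passing UTF_8 should not by itself push Latin-1-representable strings to UTF-16; only strings whose decoded chars fall outside Latin-1 use the wider coder, and those were being decoded incorrectly before the fix anyway.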
Fix bug in reading Unicode strings in ValueReader, as described in #16607