Fix reading UTF-8 strings (from OPC UA nodes) #832

Planet-X · 2023-03-07T15:56:27Z

When reading string values via the OPC UA protocol a StringIndexOutOfBoundsException was thrown whenever the string contained non-ASCII characters. Upon further inspection, the problem could be pinned down to the logic in readString() ofReadBufferByteBased.

The readString() function in ReadBufferByteBased could result in a StringIndexOutOfBoundsException upon calling substring(). Reason is that the calculated realLength is in bytes and the length of the created string is the number of characters. For UTF encodings, multiple bytes can be just one character. This causes realLength to be longer than the actual string and thus the substring()-call to fail.

This PR fixes the issue by applying realLength to the byte-length instead of the string length: The byte array is sliced to length realLength and then converted to the final string. substring() isn't used anymore.
The fix can be verified via the included regression test and has additionally been tested locally using the OPC UA protocol.
Maybe other protocols were also affected?

Additionally I've included some other minor fixes for the logic. Check the commits for these.

Regards,
Marc

The previous logic could result in a `StringIndexOutOfBoundsException` upon calling `substring()`. Reason is that the calculated `realLength` is in bytes and the length of the created string is the number of characters. For UTF encodings, multiple bytes can be just one character. This causes `realLength` to be longer than the actual string and thus the `substring()`-call to fail. Fix this by applying `realLength` to the byte-array instead of the string. This is done by utilizing a different String constructor. Note that on the UTF16 cases the `realLength` needs to be incremented by 2 to account for the two added bytes. This commit contains a regression test that fails without these adaptations. Signed-off-by: Marc Aurel Fritz <Marc-Aurel.Fritz@plansee.com>

Previously, the inner switch statement at the UTF-16 cases would not convert encoding to upper case. This prevented matching from working properly. Do so directly at the beginning in order to prevent future bugs. Signed-off-by: Marc Aurel Fritz <Marc-Aurel.Fritz@plansee.com>

Prevent having the same logic twice. Signed-off-by: Marc Aurel Fritz <Marc-Aurel.Fritz@plansee.com>

sruehl

Thanks for the PR: LGTM, although @chrisdutz should check the ASCII part regarding the padding

Planet-X added 3 commits March 7, 2023 16:11

plc4j/spi: Merge ASCII and UTF-8 cases on readString()

b348518

Prevent having the same logic twice. Signed-off-by: Marc Aurel Fritz <Marc-Aurel.Fritz@plansee.com>

sruehl requested a review from chrisdutz March 7, 2023 16:28

sruehl approved these changes Mar 7, 2023

View reviewed changes

ottlukas requested a review from hutcheb March 7, 2023 17:40

sruehl merged commit 93341ca into apache:develop Mar 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reading UTF-8 strings (from OPC UA nodes) #832

Fix reading UTF-8 strings (from OPC UA nodes) #832

Planet-X commented Mar 7, 2023

sruehl left a comment

Fix reading UTF-8 strings (from OPC UA nodes) #832

Fix reading UTF-8 strings (from OPC UA nodes) #832

Conversation

Planet-X commented Mar 7, 2023

sruehl left a comment

Choose a reason for hiding this comment