read builtin command doesn't work as expected in Japanese locale #1186

sowmya573 · 2019-03-01T07:07:28Z

Description of problem:

The builtin 'read' command in ksh93t ignores the byte '0x5c' when it occurs as part of a Japanese multibyte character under the locale "Ja_JP", treating it as a backslash ('\').

Ksh version:
It exists in all versions of ksh, even on the latest ksh93 u+
version sh (AT&T Research) 93u+ 2012-08-01

How reproducible:
when LANG=Ja_JP

Steps to reproduce:

# locale
LANG=Ja_JP
LC_COLLATE="Ja_JP"
LC_CTYPE="Ja_JP"
LC_MONETARY="Ja_JP"
LC_NUMERIC="Ja_JP"
LC_TIME="Ja_JP"
LC_MESSAGES="Ja_JP"
LC_ALL=

(0) root @ mem68: 7.2.0.0: /ksh93_local
# perl -e 'print "\x94\x5c\x8e\x67"' | read char1

(1) root @ mem68: 7.2.0.0: /ksh93_local
# echo $char1 | od -ax
0000000  dc4  so   g  lf
            948e    670a <------ 5c disappeared 
0000004

Actual results:
0x5c, which occurs as part of a Japanese multibyte character, is ignored.

Expected results:
Data should be processed correctly: the 0x5c byte should be kept as part of the multibyte character.

Additional info:
NA

krader1961 · 2019-03-03T04:10:44Z

The first thing that occurred to me to try was how this behaves in the bash and zsh shells. Bash outputs nothing other than the newline char. Zsh outputs the same sequence of bytes as ksh.

Where did the sequence of bytes in your perl command come from? What encoding does that stream of bytes utilize?

ASCII 0x5C is the backslash character which has special meaning for the read command. So I changed the reproduction test to use read -r to perform a "raw" read that does not recognize the backslash character. That produces the expected output in ksh and zsh (bash still outputs only a newline). This suggests the read implementation is checking for an ASCII backslash before an entire character has been assembled. Note that this does not affect UTF-8 or ISO 8859 encodings since a bare 0x5C byte is never part of a longer sequence and always represents a backslash.
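
The difference between read and read -r can be seen without a Japanese locale, because the backslash handling is applied to the 0x5C byte on its own. The following is a minimal sketch, assuming a POSIX shell with printf, od and tr available. The helper name hex_after_read is hypothetical, printf with octal escapes stands in for the perl command from the report (\224\134\216\147 is 0x94 0x5c 0x8e 0x67), and the read is wrapped in a group so it also works in shells that run the last pipeline stage in a subshell.

```shell
#!/bin/sh
# Byte-oriented locale, so the result does not depend on the host's locale.
LC_ALL=C; export LC_ALL

# Hypothetical helper: feed the four bytes from the report to `read`
# (with any options passed as arguments) and print what was stored, as hex.
hex_after_read() {
    printf '\224\134\216\147\n' | {
        read "$@" value
        printf '%s' "$value" | od -An -tx1 | tr -d ' \n'
    }
}

cooked=$(hex_after_read)     # plain read: 0x5c is treated as an escape character
raw=$(hex_after_read -r)     # raw read: 0x5c is kept as data

echo "read    : $cooked"
echo "read -r : $raw"
```

With these bytes the first line should show 948e67, which matches the od output in the original report, and the second should show 945c8e67.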

I am ambivalent about supporting non-UTF-8 encodings now that Unicode has been a standard for almost three decades. Which means that even though the current behavior is wrong for non-UTF-8 and non-ISO-8859 encodings it is not obvious we should expend any effort fixing this bug.

krader1961 · 2019-03-04T04:36:52Z

This issue is a variation on issue #43.

krader1961 · 2019-03-05T05:59:17Z

@sowmya573 I personally do not intend to expend any effort to fix this bug because I only care about Unicode (specifically the UTF-8 encoding), and for that encoding this problem does not occur. But if you, or anyone else, create a change to fix this bug we will be more than happy to merge it.

sowmya573 · 2019-03-06T15:45:46Z

The use case is from one of our Japanese customers. The customer is seeing a difference between ksh88 and ksh93. On ksh93, using read -n 2 (mb_cur_max for Ja_JP) worked, but this cannot be generalised in the application because each locale has a different mb_cur_max.

============================================
#!/bin/ksh93
LANG=Ja_JP
perl -e 'print "\x94\x5c\x8e\x67"' | read char1 char2
perl -e 'print "\x94\x5c\x8e\x67"' | read -n2 char3 char4
print $char1 $char2 $char3 $char4 | od -xc

============================================
88.ksh
#!/bin/ksh
LANG=Ja_JP
perl -e 'print "\x94\x5c\x8e\x67"' | read char1 char2
perl -e 'print "\x94\x5c\x8e\x67"' | read char3 char4
print $char1 $char2 $char3 $char4 | od -xc

RESULTS:
88.ksh: 945c 8e67 20 945c 8e67

93.ksh: 948e 67 20 945c 8e67
BAD "g" "space" OK OK

============================================

So the question is can ksh88 code be brought into ksh93? Or how do we get the same behaviour on both ksh88 and ksh93?

krader1961 · 2019-03-06T21:59:42Z

"So the question is can ksh88 code be brought into ksh93?"

No. Not least because ksh88 was never open sourced so we don't have access to it. But even if we did have the source code it is almost a certainty that it is radically different from the current code. Which would make it impractical to just "bring it into ksh93."

It is unlikely your customer actually requires the special-casing of a backslash before a newline. In which case they can simply use read -r.

The current behavior is definitely broken. The code should be checking for a backslash only on fully formed chars, not individual bytes. Over the past couple of years @siteshwar and I have invested a huge amount of effort to clean up the code, fix unit tests, add interactive unit tests, and switch to a modern build system. We would love to see vendors like IBM contribute fixes for problems like this one. The fix will probably have to come from the CJK community since this doesn't affect UTF-8 or legacy encodings like ISO 8859 which are ASCII compatible.

krader1961 added the needs-more-info label Mar 3, 2019

krader1961 added the bug label Mar 3, 2019

krader1961 mentioned this issue Mar 4, 2019

Active locale/character set is not properly applied when parsing C-style strings #43

Open

krader1961 removed the needs-more-info label Mar 10, 2019

krader1961 mentioned this issue Jun 20, 2019

ksh93: random behaviour of read -n <nchar> for multi-byte characters. #22

Open
