read builtin command doesn't work as expected in Japanese locale #1186

Open
sowmya573 opened this issue Mar 1, 2019 · 5 comments
sowmya573 commented Mar 1, 2019

Description of problem:

The ksh93t builtin 'read' command drops the byte 0x5c when it occurs as part of a Japanese multibyte character under the locale "Ja_JP", treating it as a backslash ('\').

Ksh version:
It exists in all versions of ksh, even on the latest ksh93 u+
version sh (AT&T Research) 93u+ 2012-08-01

How reproducible:
when LANG=Ja_JP

Steps to reproduce:

# locale
LANG=Ja_JP
LC_COLLATE="Ja_JP"
LC_CTYPE="Ja_JP"
LC_MONETARY="Ja_JP"
LC_NUMERIC="Ja_JP"
LC_TIME="Ja_JP"
LC_MESSAGES="Ja_JP"
LC_ALL=

(0) root @ mem68: 7.2.0.0: /ksh93_local
# perl -e 'print "\x94\x5c\x8e\x67"' | read char1

(1) root @ mem68: 7.2.0.0: /ksh93_local
# echo $char1 | od -ax
0000000  dc4  so   g  lf
            948e    670a <------ 5c disappeared 
0000004

Actual results:
The 0x5c byte, which is part of a Japanese multibyte character, is dropped.

Expected results:
The data should be read unchanged, with the 0x5c byte preserved as part of the multibyte character.

Additional info:
NA

@krader1961
Contributor

The first thing that occurred to me to try was how this behaves in the bash and zsh shells. Bash outputs nothing other than the newline char. Zsh outputs the same sequence of bytes as ksh.

Where did the sequence of bytes in your perl command come from? What encoding does that stream of bytes utilize?

ASCII 0x5C is the backslash character which has special meaning for the read command. So I changed the reproduction test to use read -r to perform a "raw" read that does not recognize the backslash character. That produces the expected output in ksh and zsh (bash still outputs only a newline). This suggests the read implementation is checking for an ASCII backslash before an entire character has been assembled. Note that this does not affect UTF-8 or ISO 8859 encodings since a bare 0x5C byte is never part of a longer sequence and always represents a backslash.
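The difference between read and read -r can be seen at the byte level without a Japanese locale installed. The following is a minimal sketch, not the reporter's test case: it forces LC_ALL=C so every byte is its own character, and the helper names (hex, sample, read_cooked, read_raw) are made up for illustration. It feeds the same four bytes as the perl one-liner to both forms of read and hex-dumps what ends up in the variable.

```sh
#!/bin/sh
# Print stdin as lowercase hex with no separators.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# The same four bytes as the perl one-liner in the report (octal escapes),
# followed by a newline so that read sees a complete line.
sample() { printf '\224\134\216\147\n'; }

# Each reader runs in a subshell (parentheses) so the locale change stays local.
# LC_ALL=C is an assumption made for portability: it shows the byte-wise
# backslash handling that the report observes under Ja_JP.
read_cooked() ( LC_ALL=C; export LC_ALL; IFS= read line; printf '%s' "$line" | hex )
read_raw()    ( LC_ALL=C; export LC_ALL; IFS= read -r line; printf '%s' "$line" | hex )

printf 'read:    %s\n' "$(sample | read_cooked)"   # 0x5c consumed as an escape
printf 'read -r: %s\n' "$(sample | read_raw)"      # all four bytes preserved
```

Plain read should show 948e67, matching the od output in the report, while read -r should show 945c8e67.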

I am ambivalent about supporting non-UTF-8 encodings now that Unicode has been a standard for almost three decades. Which means that even though the current behavior is wrong for non-UTF-8 and non-ISO-8859 encodings it is not obvious we should expend any effort fixing this bug.

@krader1961 krader1961 added the bug label Mar 3, 2019
@krader1961
Contributor

This issue is a variation on issue #43.

@krader1961
Contributor

@sowmya573 I personally do not intend to expend any effort to fix this bug because I only care about Unicode (specifically the UTF-8 encoding). And for those encodings this problem does not occur. But if you, or anyone else, creates a change to fix this bug we will be more than happy to merge it.

@sowmya573
Author

sowmya573 commented Mar 6, 2019

The use case comes from one of our Japanese customers. The customer sees a difference in behavior between ksh88 and ksh93. On ksh93, using read -n 2 (mb_cur_max for Ja_JP) worked, but this cannot be generalized in the application because each locale has a different mb_cur_max.

============================================
93.ksh
#!/bin/ksh93
LANG=Ja_JP
perl -e 'print "\x94\x5c\x8e\x67"' | read char1 char2
perl -e 'print "\x94\x5c\x8e\x67"' | read -n2 char3 char4
print $char1 $char2 $char3 $char4 | od -xc

============================================
88.ksh
#!/bin/ksh
LANG=Ja_JP
perl -e 'print "\x94\x5c\x8e\x67"' | read char1 char2
perl -e 'print "\x94\x5c\x8e\x67"' | read char3 char4
print $char1 $char2 $char3 $char4 | od -xc

RESULTS:
88.ksh: 945c 8e67 20 945c 8e67

93.ksh: 948e 67 20 945c 8e67
BAD "g" "space" OK OK

============================================

So the question is: can the ksh88 code be brought into ksh93? If not, how can we get the same behavior on both ksh88 and ksh93?

@krader1961
Contributor

> So the question is can ksh88 code be brought into ksh93?

No. Not least because ksh88 was never open sourced so we don't have access to it. But even if we did have the source code it is almost a certainty that it is radically different from the current code. Which would make it impractical to just "bring it into ksh93."

It is unlikely your customer actually requires the special-casing of a backslash before a newline. In which case they can simply use read -r.

The current behavior is definitely broken. The code should be checking for a backslash only on fully formed chars, not individual bytes. Over the past couple of years @siteshwar and I have invested a huge amount of effort to clean up the code, fix unit tests, add interactive unit tests, and switch to a modern build system. We would love to see vendors like IBM contribute fixes for problems like this one. The fix will probably have to come from the CJK community since this doesn't affect UTF-8 or legacy encodings like ISO 8859 which are ASCII compatible.
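To illustrate what "checking only fully formed chars" means, here is a rough sketch in shell. It is not ksh code and not a proposed patch. It assumes the Ja_JP locale uses a Shift-JIS style encoding, which this thread never confirms, with lead bytes in the ranges 0x81-0x9F and 0xE0-0xFC followed by exactly one trail byte. The function names is_lead and classify_5c are hypothetical. The scanner walks the bytes, skips the trail byte after every lead byte, and reports whether each 0x5c is a real backslash or just the second half of a two-byte character.

```sh
#!/bin/sh
# Succeeds if byte value $1 (decimal) is a lead byte under the ASSUMED
# Shift-JIS style ranges 0x81-0x9F and 0xE0-0xFC.
is_lead() {
    { [ "$1" -ge 129 ] && [ "$1" -le 159 ]; } ||
    { [ "$1" -ge 224 ] && [ "$1" -le 252 ]; }
}

# Read bytes from stdin and print one line per 0x5c byte found:
# "trail byte" if it completes a two-byte character, "backslash" otherwise.
classify_5c() {
    after_lead=0
    od -An -v -tx1 | tr -s ' ' '\n' | while read -r b; do
        [ -n "$b" ] || continue          # skip the empty line left by od's padding
        n=$((0x$b))                      # hex text -> decimal value
        if [ "$after_lead" -eq 1 ]; then
            # This byte belongs to the previous lead byte; never treat it as an escape.
            if [ "$n" -eq 92 ]; then echo "trail byte"; fi
            after_lead=0
        elif is_lead "$n"; then
            after_lead=1
        elif [ "$n" -eq 92 ]; then
            echo "backslash"
        fi
    done
}

printf '\224\134\216\147' | classify_5c   # bytes from the report
printf 'a\134b' | classify_5c             # a genuine ASCII backslash
```

A real fix would presumably use the C library's multibyte functions for the active locale rather than hard-coded ranges. The ranges here only keep the sketch self-contained.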
