ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

stephane-chazelas · 2016-08-04T15:27:25Z

Reproduced with version sh (AT&T Research) 93u+ 2012-08-01 and version sh (AT&T Research) 93v- 2014-12-24 on Debian GNU/Linux amd64:

According to the man page read -n reads a number of bytes, while read --help says characters.

Tests are inconsistent: here testing in a UTF-8 locale with the 3-byte € (EURO U+20AC) character:

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rn "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]'

The 1 case suggests a number of characters, the 3 case a number of bytes, the rest doesn't seem to make any sense.

read -N doesn't have the issue (and seems to take a number of characters):

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rN "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'

The text was updated successfully, but these errors were encountered:

zakukai · 2017-05-22T16:46:34Z

I seem to be getting some rather odd behavior here.
I wanted to turn this example into a test I could use on different shells. Bash posed a problem because "shopt lastpipe" was not set. So I rearranged the test to take input from a "here string":

file read-n_bytes_or_characters.sh

for ((i=1;i<=6;i++)); do
	IFS= read -rn "$i" a <<<'€€€€€€€€'
	printf "$i %q\n" "$a"
done

In this case, ksh seems to get it right:

$ ksh read-n_bytes_or_characters.sh
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'

But if I switch it back to "echo '€€€€€€€€' | read ..." then it fails again.
The "here string" feature is implemented using an unlinked temporary file, while the pipeline is (of course) a socket pair in ksh. It seems to go wrong when the input is a socket, tty, or pipe, but it does OK when the input is a file.

stephane-chazelas · 2017-05-22T21:09:16Z

2017-05-22 09:46:36 -0700, zakukai: [...]

It seems to go wrong when the input is a socket, tty, or pipe, but it does OK when the input is a file.

[...] The fact that we get a different code path when the input goes from a regular file doesn't surprise me. When getting input from a seekable file (like a regular file and some device files), it's easier. To read the requested amount of characters, ksh93 can read a large block, find out how many bytes those characters occupy and seek back after that. While for a non-seekable file like a pipe/socket/tty device, it needs to read one byte at a time to make sure it doesn't read past the requested amount of characters. I suppose that's when it gets confused.

…

-- Stephane

zakukai · 2017-05-23T09:35:11Z

I haven't quite nailed it down yet but it looks like it comes down to this line

ast/src/cmd/ksh93/bltins/read.c : 649
if((size -= x) > 0 && (up >= cur || z < 0) && ((flags & NN_FLAG) || z < 0 || m > c))
    continue;   // Otherwise, end the read loop

Basically, when using -n (that's N_FLAG, as opposed to -N which is NN_FLAG), the code loops: On each iteration there are (x) characters remaining, so the implementation reads (x) bytes (since each character must take at least one byte). Then the newly-completed multi-byte characters are counted (starting from the pointer (up) and extending to the pointer (cur)) to determine how many characters will be left on the next iteration. Usually there will be some number of extra bytes read in that are carried over to the next iteration. But when the end of the last read coincides with a character boundary, it hits an edge case:
(up == cur): the end of the last complete character read coincides with the last byte read from the stream
(z > 0): The call to mbsize(up) on line 644 returned nonzero, because *up = 0 (an end-of-string marker set on (cur) to prevent reading beyond the buffered data) - and 0 is a single-byte character in UTF-8.
So the back half of the "if" condition ((flags & NN_FLAG) || z < 0 || m > c) fails on this edge case, because the NN flag isn't set and z>0 (I don't really understand (m > c) yet..)

I'll try to work out the parts I'm not quite understanding yet and put a patch together.

siteshwar · 2018-04-06T17:12:35Z

Ksh uses sockets instead of real pipes to implement pipes in shell. This has caused issues on multiple occasions. For e.g.

$ cat /etc/passwd | head -1 /dev/stdin
head: cannot open '/dev/stdin' for reading: No such device or address

I tried removing this code. It fixes above issue, but it caused some of the io tests to fail. A number of failing tests seems to be incorrect. For e.g. consider this test. It expects different outputs on different locales. This command:

$ (print -n foo; sleep 1; print -n bar) |read -n6 temp; echo $temp

should always output foobar. With current development version, it gives foo. And the output may change with locale. Switching to real pipes fixes this issue, however there are going to be other regressions if we switch to real pipes. We should do a deeper analysis of how this is going to affect scripts.

Regarding fix for this bug, you can see my experiments in this branch.

DavidMorano · 2018-04-06T18:03:35Z

Of course some of you remember why KSH had to use sockets on some platforms: KSH needs to occasionally read only up to a new-line character and so it needs to PEEK into the input byte stream. Although older SysV type UNIXi allow for PEEKs on pipes, many newer systems (Linux?) do not allow for that (PEEK's on pipes). So KSH has to resort to sockets to get the PEEK capability on platforms that do not support PEEK for pipes. What KSH does now-a-days, exactly, on each platform, I do not know. I would like to think that it only resorts to sockets for shell "pipes" when needed, but I do not know if this is the case any longer (it might be using sockets on all platforms now, for all I know).

krader1961 · 2019-06-20T05:36:58Z

See also issue #1186 which is almost certainly the same problem as this issue.

src/cmd/ksh93/sh/name.c: - Correct the check for when a function is currently running to fix a segmentation fault that occurred when a POSIX function tries to unset itself while it is running. This bug fix was backported from ksh93v-. src/cmd/ksh93/sh/xec.c: - If a function tries to unset itself, unset the function with '_nv_unset(np, NV_RDONLY)' to fix a silent failure. This fix was also backported from ksh93v-. src/cmd/ksh93/tests/functions.sh: - Add four regression tests for when a function unsets itself. Resolves att#21

The fix in sh/xec.c, which was backported from the ksh 93v- beta to delay the actual removal of a running function that unsets itself, caused a segfault in the variables.sh regression tests (see att#23). src/cmd/ksh93/sh/xec.c: - Comment out the backported code pending a correct fix for att#21. Now both types of functions silently fail to unset themselves (unless they're discipline functions). src/cmd/ksh93/tests/functions.sh: - Disable regression tests checking that the function was actually unset, pending a correct fix for att#21. Resolves: att#23 Reopens: att#21

Applying the fix for 'unset -f' exposed a crashing bug in lookup() in sh/nvdisc.c, which is the function for looking up discipline functions. This is what caused tests/variables.sh to crash. Ref.: ksh93#23 (comment) src/cmd/ksh93/sh/nvdisc.c: lookup(): - To avoid segfault, check that the function pointer nq->nvalue.rp is actually set before checking if nq->nvalue.rp->running==1. src/cmd/ksh93/sh/xec.c, src/cmd/ksh93/tests/functions.sh: - Uncomment the 'unset -f' fix from b7932e8. Resolves att#21 (again).

dannyweldon added the bug label Mar 6, 2017

siteshwar mentioned this issue Nov 30, 2017

path.sh test fails on macOS with a call to abort() #169

Closed

siteshwar mentioned this issue Apr 6, 2018

Use real pipes instead of sockets #462

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

stephane-chazelas commented Aug 4, 2016

zakukai commented May 22, 2017

stephane-chazelas commented May 22, 2017 via email

zakukai commented May 23, 2017 •

edited

Loading

siteshwar commented Apr 6, 2018

DavidMorano commented Apr 6, 2018

krader1961 commented Jun 20, 2019

ksh93: random behaviour of read -n <nchar> for multi-byte characters. #22

ksh93: random behaviour of read -n <nchar> for multi-byte characters. #22

Comments

stephane-chazelas commented Aug 4, 2016

zakukai commented May 22, 2017

stephane-chazelas commented May 22, 2017 via email

zakukai commented May 23, 2017 • edited Loading

siteshwar commented Apr 6, 2018

DavidMorano commented Apr 6, 2018

krader1961 commented Jun 20, 2019

ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

ksh93: random behaviour of `read -n <nchar>` for multi-byte characters. #22

zakukai commented May 23, 2017 •

edited

Loading