Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ksh93: random behaviour of read -n <nchar> for multi-byte characters. #22

Open
stephane-chazelas opened this issue Aug 4, 2016 · 6 comments
Labels

Comments

@stephane-chazelas
Copy link

Reproduced with version sh (AT&T Research) 93u+ 2012-08-01 and version sh (AT&T Research) 93v- 2014-12-24 on Debian GNU/Linux amd64:

According to the man page read -n reads a number of bytes, while read --help says characters.

Tests are inconsistent: here testing in a UTF-8 locale with the 3-byte € (EURO U+20AC) character:

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rn "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]'

The 1 case suggests a number of characters, the 3 case a number of bytes, the rest doesn't seem to make any sense.

read -N doesn't have the issue (and seems to take a number of characters):

$ ksh -c 'for ((i=1;i<=6;i++)); do echo €€€€€€€€ | IFS= read -rN "$i" a; printf "$i %q\n" "$a"; done'
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
@dannyweldon dannyweldon added the bug label Mar 6, 2017
@zakukai
Copy link

zakukai commented May 22, 2017

I seem to be getting some rather odd behavior here.
I wanted to turn this example into a test I could use on different shells. Bash posed a problem because "shopt lastpipe" was not set. So I rearranged the test to take input from a "here string":

file read-n_bytes_or_characters.sh

for ((i=1;i<=6;i++)); do
	IFS= read -rn "$i" a <<<'€€€€€€€€'
	printf "$i %q\n" "$a"
done

In this case, ksh seems to get it right:

$ ksh read-n_bytes_or_characters.sh
1 $'\u[20ac]'
2 $'\u[20ac]\u[20ac]'
3 $'\u[20ac]\u[20ac]\u[20ac]'
4 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
5 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'
6 $'\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]\u[20ac]'

But if I switch it back to "echo '€€€€€€€€' | read ..." then it fails again.
The "here string" feature is implemented using an unlinked temporary file, while the pipeline is (of course) a socket pair in ksh. It seems to go wrong when the input is a socket, tty, or pipe, but it does OK when the input is a file.

@stephane-chazelas
Copy link
Author

stephane-chazelas commented May 22, 2017 via email

@zakukai
Copy link

zakukai commented May 23, 2017

I haven't quite nailed it down yet but it looks like it comes down to this line

ast/src/cmd/ksh93/bltins/read.c : 649
if((size -= x) > 0 && (up >= cur || z < 0) && ((flags & NN_FLAG) || z < 0 || m > c))
    continue;   // Otherwise, end the read loop

Basically, when using -n (that's N_FLAG, as opposed to -N which is NN_FLAG), the code loops: On each iteration there are (x) characters remaining, so the implementation reads (x) bytes (since each character must take at least one byte). Then the newly-completed multi-byte characters are counted (starting from the pointer (up) and extending to the pointer (cur)) to determine how many characters will be left on the next iteration. Usually there will be some number of extra bytes read in that are carried over to the next iteration. But when the end of the last read coincides with a character boundary, it hits an edge case:
(up == cur): the end of the last complete character read coincides with the last byte read from the stream
(z > 0): The call to mbsize(up) on line 644 returned nonzero, because *up = 0 (an end-of-string marker set on (cur) to prevent reading beyond the buffered data) - and 0 is a single-byte character in UTF-8.
So the back half of the "if" condition ((flags & NN_FLAG) || z < 0 || m > c) fails on this edge case, because the NN flag isn't set and z>0 (I don't really understand (m > c) yet..)

I'll try to work out the parts I'm not quite understanding yet and put a patch together.

@siteshwar
Copy link
Contributor

Ksh uses sockets instead of real pipes to implement pipes in shell. This has caused issues on multiple occasions. For e.g.

$ cat /etc/passwd | head -1 /dev/stdin
head: cannot open '/dev/stdin' for reading: No such device or address

I tried removing this code. It fixes above issue, but it caused some of the io tests to fail. A number of failing tests seems to be incorrect. For e.g. consider this test. It expects different outputs on different locales. This command:

$ (print -n foo; sleep 1; print -n bar) |read -n6 temp; echo $temp

should always output foobar. With current development version, it gives foo. And the output may change with locale. Switching to real pipes fixes this issue, however there are going to be other regressions if we switch to real pipes. We should do a deeper analysis of how this is going to affect scripts.

Regarding fix for this bug, you can see my experiments in this branch.

@DavidMorano
Copy link

Of course some of you remember why KSH had to use sockets on some platforms: KSH needs to occasionally read only up to a new-line character and so it needs to PEEK into the input byte stream. Although older SysV type UNIXi allow for PEEKs on pipes, many newer systems (Linux?) do not allow for that (PEEK's on pipes). So KSH has to resort to sockets to get the PEEK capability on platforms that do not support PEEK for pipes. What KSH does now-a-days, exactly, on each platform, I do not know. I would like to think that it only resorts to sockets for shell "pipes" when needed, but I do not know if this is the case any longer (it might be using sockets on all platforms now, for all I know).

@krader1961
Copy link
Contributor

See also issue #1186 which is almost certainly the same problem as this issue.

citrus-it pushed a commit to citrus-it/ast that referenced this issue Apr 15, 2021
src/cmd/ksh93/sh/name.c:
 - Correct the check for when a function is currently running
   to fix a segmentation fault that occurred when a POSIX
   function tries to unset itself while it is running.
   This bug fix was backported from ksh93v-.

src/cmd/ksh93/sh/xec.c:
 - If a function tries to unset itself, unset the function
   with '_nv_unset(np, NV_RDONLY)' to fix a silent failure.
   This fix was also backported from ksh93v-.

src/cmd/ksh93/tests/functions.sh:
 - Add four regression tests for when a function unsets itself.

Resolves att#21
citrus-it pushed a commit to citrus-it/ast that referenced this issue Apr 15, 2021
The fix in sh/xec.c, which was backported from the ksh 93v- beta to
delay the actual removal of a running function that unsets itself,
caused a segfault in the variables.sh regression tests (see att#23).

src/cmd/ksh93/sh/xec.c:
- Comment out the backported code pending a correct fix for att#21.
  Now both types of functions silently fail to unset themselves
  (unless they're discipline functions).

src/cmd/ksh93/tests/functions.sh:
- Disable regression tests checking that the function was actually
  unset, pending a correct fix for att#21.

Resolves: att#23
Reopens: att#21
citrus-it pushed a commit to citrus-it/ast that referenced this issue Apr 15, 2021
Applying the fix for 'unset -f' exposed a crashing bug in lookup()
in sh/nvdisc.c, which is the function for looking up discipline
functions. This is what caused tests/variables.sh to crash.
Ref.: ksh93#23 (comment)

src/cmd/ksh93/sh/nvdisc.c: lookup():
- To avoid segfault, check that the function pointer nq->nvalue.rp
  is actually set before checking if nq->nvalue.rp->running==1.

src/cmd/ksh93/sh/xec.c,
src/cmd/ksh93/tests/functions.sh:
- Uncomment the 'unset -f' fix from b7932e8.

Resolves att#21 (again).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants