command-line-arguments can't read umlauts with utf-8 encoding #81

Closed
atticus0 opened this issue Jul 17, 2016 · 9 comments

@atticus0

Chez Scheme doesn't recognize umlauts read by 'command-line' and 'command-line-arguments'. Tested with version 9.4 and 9.4.1 commit a664335.

Link to mailing list discussion.

@akeep
Contributor

akeep commented Jul 17, 2016

I've tested this out on my Mac using the standard terminal, and it also fails here if the expression editor is enabled:

% scheme
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (define ;; <option-u>u which should produce ü, produces nothing.

However, when I turn the expeditor off, it seems to work fine:

[akeep@hawkeye ~]% scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (define ü 5)
> ü
5

Additionally, piping the expressions in from the terminal worked in both cases:

% echo '(define ü 5) ü' | scheme
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> > 5
> 

and with --eedisable:

% echo '(define ü 5) ü' | scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> > 5
> 

However, command-line arguments (which allow us to name files to load) also fail. For instance, suppose we have two files, u.ss and ü.ss, with the contents:

u.ss:

(define u 9)
(pretty-print u)

and ü.ss:

(define ü 5)
(pretty-print ü)

Then Chez can load u.ss:

% scheme u.ss
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

9
> 

But not ü.ss:

% scheme ü.ss
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

Exception in load: failed for ????????.ss: no such file or directory

However, Chez can load ü.ss from the REPL and execute it correctly if we disable the expression editor:

% scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (load "ü.ss")
5
> 

So, anyway, it seems like there are two problems here:

  1. The expression editor isn't able to take in Unicode characters.
  2. The command-line arguments aren't being processed in a way that accepts Unicode characters.

I think in both cases OS X is probably providing the characters in UTF-8, but I was a little surprised by the number of ? characters in the load error report.
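To illustrate the suspected failure mode, here is a minimal C sketch that assumes the conversion widens each byte to its own character. The helper name `widen_bytes` is hypothetical and is not taken from the Chez Scheme sources. The precomposed form of ü (U+00FC) is encoded in UTF-8 as the two bytes 0xC3 0xBC, so a byte-per-character conversion cannot produce a single ü.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Chez Scheme code): widen each byte of a C
   string to its own code point, the way a byte-per-character
   conversion would.  Returns the number of code points written. */
size_t widen_bytes(const char *src, uint32_t *dst, size_t cap) {
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    while (src[n] != '\0' && n < cap) {
        dst[n] = (unsigned char)src[n];  /* one byte becomes one "character" */
        n++;
    }
    return n;
}
```

Given the bytes C3 BC 2E 73 73 (the UTF-8 encoding of "ü.ss"), this yields five code points beginning with U+00C3 and U+00BC, rather than four beginning with U+00FC, so the resulting name no longer matches the file on disk.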

So there are some workarounds (though not being able to use the expression editor is a pretty big bummer). It is worth noting that file and console I/O seem to do the right thing when the expression editor isn't involved.

I'll also try to take a look into this and see what I can figure out.

@jltaylor-us
Contributor

jltaylor-us commented Jul 17, 2016

The inability to enter non-latin characters in the expression editor is #32.

This issue would be more accurately titled "Command line arguments always treated as bytes". The C spec (at least as of C99) says that argv contains "strings", which by definition are made up of characters, not wide characters. It is silent on whether those strings can be "multi-byte strings" (which of course it would be, since they're the same pointer type). It appears that it's up to the OS (or maybe even the particular shell) to fill in argv however it sees fit, and it's not always consistent. See, e.g., http://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv

In any case, if you want to take a stab at making things better, it looks like scheme.c:1112 is the place to start, possibly using the mbstowcs C library function and a new function like Sstring that takes wide-character strings instead.
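A rough sketch of the mbstowcs approach might look like the following. The helper names `init_locale` and `arg_to_wide` are hypothetical, and passing a null destination to mbstowcs to get the required length is a POSIX behavior rather than something ISO C guarantees. The conversion depends on the process locale, so it only decodes UTF-8 if the user's LC_CTYPE is a UTF-8 locale.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the user's locale so that mbstowcs knows the argument
   encoding.  Without this call the program stays in the "C" locale.
   Returns nonzero on success. */
int init_locale(void) {
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: convert one multibyte argument to a freshly
   allocated wide string using the current LC_CTYPE locale.  Returns
   NULL if the bytes are invalid in that locale or if allocation
   fails.  The caller frees the result. */
wchar_t *arg_to_wide(const char *arg) {
    assert(arg != NULL);
    size_t len = mbstowcs(NULL, arg, 0);   /* POSIX: count wide chars needed */
    if (len == (size_t)-1) return NULL;    /* invalid multibyte sequence */
    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL) return NULL;
    mbstowcs(wide, arg, len + 1);          /* convert, including the terminator */
    return wide;
}
```

A program would call `init_locale()` once at startup and then `arg_to_wide(argv[i])` for each argument. A new Sstring-like function would still be needed to turn the wide string into a Scheme string.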

@akeep
Contributor

akeep commented Jul 17, 2016

Yes, I was just looking at that file and that Stack Overflow article.

The pertinent code for the expression editor is in expeditor.c at lines 639, 644, 653, 657, and 662, where we use read (through the READ macro) to read a single byte (lines 639, 644, 653, and 657) and then convert it into a character with Schar (line 662).

The mbstowcs and mbrtowc functions seem like they may help with this and #32.
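Since the expression editor reads one byte at a time, mbrtowc fits that loop naturally because it keeps partial sequences in an mbstate_t between calls. The following is only a sketch under that assumption. The function `feed_byte` is a hypothetical name, not code from expeditor.c, and it relies on the program having selected a suitable locale with setlocale beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical incremental decoder: feed it one byte at a time, as a
   byte-oriented read loop would.  Returns 1 when *out holds a complete
   character, 0 when more bytes are needed, and -1 on an invalid
   sequence (the state is reset so decoding can resume). */
int feed_byte(mbstate_t *st, unsigned char byte, wchar_t *out) {
    assert(st != NULL && out != NULL);
    char c = (char)byte;
    size_t r = mbrtowc(out, &c, 1, st);
    if (r == (size_t)-2) return 0;   /* incomplete: keep reading */
    if (r == (size_t)-1) {           /* invalid in the current locale */
        memset(st, 0, sizeof *st);   /* return to the initial state */
        return -1;
    }
    return 1;                        /* complete character (r is 0 or 1) */
}
```

The caller zero-initializes the mbstate_t once, then calls `feed_byte` after each single-byte read and only builds a character when it returns 1.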

eraserhd added a commit to eraserhd/ChezScheme that referenced this issue Jul 21, 2016
… work

  except that combining characters are not treated correctly for
  line-wrapping.  this addresses github issue cisco#32 and part of issue cisco#81.
    c/expeditor.c, s/expeditor.ss
@eraserhd
Contributor

eraserhd commented Jul 21, 2016

@akeep You will probably also have to use newlocale() and uselocale() in a way similar to PR #83.

@burgerrg
Contributor

burgerrg commented Apr 2, 2018

The command-line arguments are converted to Scheme strings using Sstring. Sstring does not process UTF-8, which explains the behavior you're experiencing.
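For reference, processing UTF-8 means combining each lead byte with its continuation bytes into one code point. The sketch below is a locale-independent illustration of that step, not the Chez Scheme implementation. The name `utf8_decode` and the choice to emit U+FFFD for malformed input are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode a NUL-terminated UTF-8 string into code
   points.  Malformed, overlong, surrogate, or out-of-range sequences
   yield U+FFFD.  Returns the number of code points written (at most
   cap). */
size_t utf8_decode(const char *src, uint32_t *dst, size_t cap) {
    static const uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
    assert(src != NULL && dst != NULL);
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    while (*p != 0 && n < cap) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = -1; }
        p++;
        int ok = extra >= 0;             /* false for a stray or invalid lead */
        for (int i = 0; ok && i < extra; i++) {
            if ((*p & 0xC0) != 0x80) { ok = 0; break; }  /* also stops at NUL */
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (ok && (cp < min_cp[extra] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = 0;
        dst[n++] = ok ? cp : 0xFFFD;
    }
    return n;
}
```

With this decoder the bytes C3 BC 2E 73 73 become four code points, the first of which is U+00FC (ü).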

The command-line argument handling should account for the encoding used by the operating system. For unix-like systems, it is UTF-8. For Windows, it's UTF-16LE when the arguments are obtained from CommandLineToArgvW.
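On the Windows side, the corresponding step is decoding UTF-16LE, which adds surrogate pairs for code points above U+FFFF. The following is a hedged sketch of that decoding only. The function name `utf16le_decode` is hypothetical, the input is taken as raw bytes so the example does not depend on the platform's wchar_t, and replacing unpaired surrogates with U+FFFD is an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: decode nbytes of UTF-16LE into code points,
   combining surrogate pairs and replacing unpaired surrogates with
   U+FFFD.  A trailing odd byte is ignored.  Returns the number of
   code points written (at most cap). */
size_t utf16le_decode(const unsigned char *src, size_t nbytes,
                      uint32_t *dst, size_t cap) {
    assert(src != NULL && dst != NULL);
    size_t i = 0, n = 0;
    while (i + 1 < nbytes && n < cap) {
        /* Little-endian: low byte first. */
        uint32_t unit = src[i] | ((uint32_t)src[i + 1] << 8);
        i += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < nbytes) {
            uint32_t low = src[i] | ((uint32_t)src[i + 1] << 8);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i += 2;                  /* consume the low surrogate */
                dst[n++] = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                continue;
            }
        }
        dst[n++] = (unit >= 0xD800 && unit <= 0xDFFF) ? 0xFFFD : unit;
    }
    return n;
}
```

For "ü.ss" the input bytes are FC 00 2E 00 73 00 73 00, which decode to the same four code points as the UTF-8 form.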

@burgerrg
Contributor

burgerrg commented Apr 2, 2018

It would be helpful to add Sstring_utf8 and Sstring_utf16le to scheme.h.

@burgerrg
Contributor

burgerrg commented Jun 8, 2018

@dybvig, do you think we should add Sstring_utf8, or update S_string to process UTF-8? I don't see any cases in the C code that use an 8-bit encoding other than UTF-8.

@burgerrg
Contributor

Commit aa1c2c4 addresses this issue. The command-line arguments and environment variables are now processed for Unicode. I added Sstring_utf8 to the scheme.h interface to make it convenient to create Unicode Scheme strings from C.

burgerrg removed their assignment Jun 14, 2018
@atticus0
Author

Thank you for fixing this issue. Closed.

mflatt pushed a commit to racket/ChezScheme that referenced this issue Mar 24, 2021
mflatt pushed a commit to mflatt/ChezScheme that referenced this issue Oct 10, 2023

Original commit: 87d4811