command-line-arguments can't read umlauts with utf-8 encoding #81

Closed
atticus0 opened this Issue Jul 17, 2016 · 9 comments

atticus0 commented Jul 17, 2016

Chez Scheme doesn't recognize umlauts read by 'command-line' and 'command-line-arguments'. Tested with versions 9.4 and 9.4.1 (commit a664335).

Link to mailing list discussion.

akeep Jul 17, 2016

Contributor

I've tested this out on my Mac using the standard terminal, and it also fails here if the expression editor is enabled:

% scheme
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (define ;; <option-u>u which should produce ü, produces nothing.

However, when I turn the expeditor off, it seems to work fine:

[akeep@hawkeye ~]% scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (define ü 5)
> ü
5

Additionally, piping the expressions in from the terminal worked in both cases:

% echo '(define ü 5) ü' | scheme
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> > 5
> 

and with --eedisable:

% echo '(define ü 5) ü' | scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> > 5
> 

However, command-line arguments (which allow us to name files to load) also fail. So, for instance, if we have two files, u.ss and ü.ss, with the contents:

u.ss:

(define u 9)
(pretty-print u)

and ü.ss:

(define ü 5)
(pretty-print ü)

Then Chez can load u.ss:

% scheme u.ss
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

9
> 

But not ü.ss:

% scheme ü.ss
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

Exception in load: failed for ????????.ss: no such file or directory

However, Chez can load ü.ss and execute it correctly if we disable the expression editor and load it from the REPL:

% scheme --eedisable
Chez Scheme Version 9.4.1
Copyright 1984-2016 Cisco Systems, Inc.

> (load "ü.ss")
5
> 

So, anyway, it seems like there are two problems here:

  1. The expression editor isn't able to take in unicode characters.
  2. The command line arguments aren't being processed in a way that accepts unicode characters.

I think in both cases OS X is probably providing the characters in UTF-8, but I was a little surprised by the number of ? characters in the load error report.

So, there are some workarounds (though not being able to use the expression editor is a pretty big bummer). It is worth noting, though, that file and console I/O seem to do the right thing when the expression editor isn't involved.

I'll also try to take a look into this and see what I can figure out.

jltaylor-us Jul 17, 2016

Contributor

The inability to enter non-latin characters in the expression editor is #32.

This issue would be more accurately titled "Command line arguments always treated as bytes". The C spec (at least as of C99) says that argv contains "strings", which by definition are made up of characters, not wide characters. It is silent on whether those strings can be "multi-byte strings" (which of course it would be, since they're the same pointer type). It appears that it's up to the OS (or maybe even the particular shell) to fill in argv however it sees fit, and it's not always consistent. See, e.g., http://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv

In any case, if you want to take a stab at making things better, it looks like scheme.c:1112 is the place to start, possibly using the mbstowcs C library function and a new function like Sstring that takes wide-character strings instead.
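As a rough illustration of the mbstowcs idea, here is a minimal sketch of a helper that decodes one argv-style multibyte string into a wide-character string. The name `mb_to_wide` is hypothetical and is not part of Chez Scheme. The sketch also relies on the POSIX behavior that passing a NULL destination to mbstowcs returns the required length.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not from the Chez Scheme sources): decode a
   multibyte string according to the current LC_CTYPE locale.
   Returns a malloc'd, NUL-terminated wide string, or NULL on failure. */
wchar_t *mb_to_wide(const char *s) {
    /* With a NULL destination, POSIX mbstowcs reports how many wide
       characters the conversion would produce (excluding the NUL). */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;               /* invalid sequence for this locale */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;               /* out of memory */

    mbstowcs(w, s, n + 1);         /* convert, including the terminator */
    return w;
}
```

A caller would normally run `setlocale(LC_CTYPE, "")` first so that the user's environment decides the encoding. In the default "C" locale, only ASCII input is guaranteed to convert.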

akeep Jul 17, 2016

Contributor

Yes, I was just looking at that file and that Stack Overflow article.

The pertinent code for the expression editor is in expeditor.c at lines 639, 644, 653, 657, and 662, where we use read (through the READ macro) to read a single byte (lines 639, 644, 653, and 657) and then convert it into a character with Schar (line 662).

The mbstowcs and mbrtowc functions seem like they may help for this and #32.
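To make the byte-at-a-time case concrete, here is a minimal sketch of how mbrtowc could accumulate single bytes (as the expression editor reads them) into one wide character. The function `feed_byte` is a hypothetical name, not code from expeditor.c, and the decoding depends on the current LC_CTYPE locale.

```c
#include <wchar.h>

/* Hypothetical helper (not from expeditor.c): feed one input byte to
   mbrtowc, which keeps partial-sequence state in *st.
   Returns  1 when *out holds a complete character,
            0 when more bytes are needed,
           -1 when the byte sequence is invalid in the current locale. */
int feed_byte(mbstate_t *st, unsigned char byte, wchar_t *out) {
    char c = (char)byte;
    size_t r = mbrtowc(out, &c, 1, st);

    if (r == (size_t)-2)
        return 0;    /* incomplete multibyte sequence so far */
    if (r == (size_t)-1)
        return -1;   /* encoding error; caller should reset *st */
    return 1;        /* a full character (possibly L'\0') was produced */
}
```

The caller zero-initializes the `mbstate_t` (for example with memset) before the first byte and keeps calling `feed_byte` until it returns 1. In a UTF-8 locale, a two-byte character such as ü would be expected to yield 0 for the first byte and 1 for the second.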

eraserhd added a commit to eraserhd/ChezScheme that referenced this issue Jul 21, 2016

- add unicode support to the expression editor. entry and display now work
  except that combining characters are not treated correctly for
  line-wrapping.  this addresses github issue cisco#32 and part of issue cisco#81.
    c/expeditor.c, s/expeditor.ss

eraserhd Jul 21, 2016

Contributor

@akeep You will probably also have to use newlocale() and uselocale() in a way similar to PR #83.

burgerrg Apr 2, 2018

Contributor

The command-line arguments are converted to Scheme strings using Sstring. Sstring does not process UTF-8, which explains the behavior you're experiencing.

The command-line argument handling should account for the encoding used by the operating system. For unix-like systems, it is UTF-8. For Windows, it's UTF-16LE when the arguments are obtained from CommandLineToArgvW.
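For readers unfamiliar with what "processing UTF-8" involves, here is a minimal, locale-independent sketch of decoding one UTF-8 sequence into a Unicode code point. The function `utf8_decode` is a hypothetical name for illustration and is not the Chez Scheme implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the Chez Scheme sources): decode one
   UTF-8 sequence starting at s, reading at most len bytes.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp) {
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;      /* total bytes in this sequence */
    uint32_t c, min;  /* accumulated value, smallest legal value */

    if (b0 < 0x80) { *cp = b0; return 1; }                 /* ASCII */
    else if ((b0 & 0xE0) == 0xC0) { need = 2; c = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; c = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; c = b0 & 0x07; min = 0x10000; }
    else return 0;                        /* stray continuation or bad lead */

    if (len < need)
        return 0;                         /* truncated sequence */

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return need;
}
```

For the precomposed ü (U+00FC), the UTF-8 bytes are 0xC3 0xBC, which this sketch decodes to 0xFC while consuming two bytes. Treating those two bytes as two separate characters is the kind of mismatch described above.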

burgerrg Apr 2, 2018

Contributor

It would be helpful to add Sstring_utf8 and Sstring_utf16le to scheme.h.
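The UTF-16LE side differs mainly in byte order and surrogate pairs. Here is a minimal sketch of decoding one code point from UTF-16LE bytes. The function `utf16le_decode` is a hypothetical name for illustration and says nothing about how an Sstring_utf16le would actually be written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the Chez Scheme sources): decode one
   code point from UTF-16LE bytes at s, reading at most len bytes.
   Returns the number of bytes consumed (2 or 4), or 0 if the input is
   truncated or contains an unpaired surrogate. */
size_t utf16le_decode(const unsigned char *s, size_t len, uint32_t *cp) {
    if (len < 2)
        return 0;

    /* Little-endian: low byte first. */
    uint32_t u0 = (uint32_t)s[0] | ((uint32_t)s[1] << 8);

    if (u0 < 0xD800 || u0 > 0xDFFF) {   /* ordinary BMP character */
        *cp = u0;
        return 2;
    }
    if (u0 >= 0xDC00)
        return 0;                       /* lone low surrogate */
    if (len < 4)
        return 0;                       /* high surrogate without a partner */

    uint32_t u1 = (uint32_t)s[2] | ((uint32_t)s[3] << 8);
    if (u1 < 0xDC00 || u1 > 0xDFFF)
        return 0;                       /* partner is not a low surrogate */

    /* Combine the two 10-bit halves and add the supplementary offset. */
    *cp = 0x10000 + ((u0 - 0xD800) << 10) + (u1 - 0xDC00);
    return 4;
}
```

With this sketch, the bytes 0xFC 0x00 decode to U+00FC (ü) in two bytes, and the surrogate pair bytes 0x3D 0xD8 0x00 0xDE decode to U+1F600 in four bytes.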

burgerrg Jun 8, 2018

Contributor

@dybvig, do you think we should add Sstring_utf8, or update S_string to process UTF-8? I don't see any cases in the C code that use an 8-bit encoding other than UTF-8.

burgerrg Jun 14, 2018

Contributor

Commit aa1c2c4 addresses this issue. The command-line arguments and environment variables are now processed for Unicode. I added Sstring_utf8 to the scheme.h interface to make it convenient to create Unicode Scheme strings from C.

@burgerrg burgerrg removed their assignment Jun 14, 2018

atticus0 Sep 11, 2018

Thank you for fixing this issue. Closed.

@atticus0 atticus0 closed this Sep 11, 2018
