Missing writeln Unicode normalization #9920

Open
dlangBugzillaToGithub opened this issue Dec 9, 2011 · 5 comments

Comments

@dlangBugzillaToGithub

bearophile_hugs reported this on 2011-12-09T01:12:59Z

Transferred from https://issues.dlang.org/show_bug.cgi?id=7084

Description

In the program below, the string 'txt1' contains two code points: LATIN CAPITAL LETTER A and COMBINING DIAERESIS.

I think a good printing function has to perform Unicode normalization and show a single \U000000C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) glyph. But with DMD 2.057beta it shows two glyphs (on Windows), an 'A' followed by a diaeresis.

writeln(txt2) shows what I think is the correct output for writeln(txt1) too:


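import std.stdio;
void main() {
    // Decomposed form: 'A' followed by COMBINING DIAERESIS
    dstring txt1 = "\U00000041\U00000308"d;
    writeln(txt1);
    // Precomposed form: LATIN CAPITAL LETTER A WITH DIAERESIS
    dstring txt2 = "\U000000C4"d;
    writeln(txt2);
}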
@dlangBugzillaToGithub

hsteoh commented on 2012-02-25T17:57:14Z

IMO this should be an enhancement request. As I understand, Unicode normalization is non-trivial, so we probably should think over how we want to do it.

@dlangBugzillaToGithub

bearophile_hugs commented on 2012-02-26T14:59:46Z

(In reply to comment #1)
> IMO this should be an enhancement request. As I understand, Unicode
> normalization is non-trivial, so we probably should think over how we want to
> do it.

OK, now it's an enhancement.

@dlangBugzillaToGithub

hsteoh commented on 2012-02-26T22:22:24Z

Here's a link to the relevant part of the Unicode standard for whoever wants to implement normalization:

http://unicode.org/reports/tr15/

Note that there are several different normalizations, with NFC probably being the closest to what this bug requires.

After scanning through the standard, it seems to me that rather than putting this in std.stdio (or the prospective std.io), we really should put it in std.uni or std.utf, and have different algorithms available for programs to choose the normalization form. The algorithms involved are not trivial, and some people may not want std.stdio to automatically normalize to a particular form when they want specifically to use a different form or a non-normalized output for whatever reason.
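As a rough illustration of the opt-in approach suggested above, here is a minimal sketch. It assumes a later Phobos release in which std.uni provides `normalize` together with the `NFC` and `NFD` form selectors; that API did not exist when this comment was written. The helper name `composeForOutput` is hypothetical and exists only for this example.

```d
import std.stdio;
import std.uni; // assumed to provide normalize, NFC, NFD

// Hypothetical helper: the caller opts in to NFC before printing,
// instead of writeln normalizing behind the caller's back.
dstring composeForOutput(dstring text) {
    return normalize!NFC(text);
}

void main() {
    dstring txt1 = "\U00000041\U00000308"d; // A + COMBINING DIAERESIS
    dstring txt2 = "\U000000C4"d;           // precomposed form

    assert(composeForOutput(txt1) == txt2); // NFC composes the pair
    assert(normalize!NFD(txt2) == txt1);    // NFD decomposes it again
    writeln(composeForOutput(txt1));
}
```

With normalization kept in std.uni, a program that wants decomposed or untouched output simply does not call the helper, and writeln itself stays unchanged.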

@dlangBugzillaToGithub

hsteoh commented on 2016-10-15T05:12:28Z

@andralex: Are you sure this bug qualifies for 'bootcamp'? Unicode normalization is highly nontrivial, requires significant effort to support correctly, and will probably involve multiple modules (at least std.uni and std.stdio, perhaps also std.utf). Plus, deciding which normalization scheme(s) to default to is a decision that can only be made with more experience with the language and community.

@dlangBugzillaToGithub

dfj1esp02 commented on 2016-10-17T17:01:13Z

This can have the same problem as issue 2742: always normalizing may not be what one wants, and detecting whether the output is a console is problematic. Also, AFAIK not all characters have precomposed variants.
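The last point can be illustrated with a small sketch, again assuming a later Phobos release whose std.uni provides `normalize` and `NFC`. The variable names are made up for this example. Unicode defines a precomposed code point for A with diaeresis but, as far as I know, none for q with diaeresis, so NFC can shorten the first string but not the second.

```d
import std.stdio;
import std.uni; // assumed to provide normalize and NFC

void main() {
    dstring hasPrecomposed = "A\u0308"d; // A + COMBINING DIAERESIS
    dstring noPrecomposed  = "q\u0308"d; // q + COMBINING DIAERESIS

    // For a dstring, .length is the number of code points.
    writeln(normalize!NFC(hasPrecomposed).length); // composed into U+00C4
    writeln(normalize!NFC(noPrecomposed).length);  // stays as two code points
}
```

So even an output function that always normalized to NFC would still hand some base-plus-combining sequences to the console unchanged.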

@thewilsonator removed the OS:Windows, Arch:x86, and P4 labels on Dec 5, 2024