Unicode characters are mangled in JavaScript kata output #307

@paul-calvelage

Unicode characters in the JavaScript kata output are being mangled.

Each byte of the UTF-8 encoding seems to be printing as a separate Unicode character. So the Chinese greeting 你好 displays as ä½ å¥½.

This is very bad for anyone needing more than 7-bit ASCII.

Now for some examples of what I think may be happening. This code in a JavaScript kata:

console.log("£");

displays as the two characters Â£ ("\u00c2\u00a3"). The Unicode code point for £ (U+00A3, "\u00a3") is normally encoded in UTF-8 as the two bytes 0xc2 0xa3. But Codewars apparently treats each byte as a code point of its own, 0xc2 and 0xa3, to get Â£.

This:

console.log("\uffff")

is displayed as the three characters ï¿¿ ("\u00ef\u00bf\u00bf"). The Unicode code point U+FFFF is normally encoded in UTF-8 as the three bytes 0xef 0xbf 0xbf. But as above, Codewars then seems to treat 0xef, 0xbf and 0xbf as separate code points to get ï¿¿.

I could give as many examples as there are multi-byte UTF-8 encodings, but this suffices to show the pattern for a single character. Longer strings just repeat the problem: for example, console.log("£££££"); displays as Â£Â£Â£Â£Â£.
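The pattern above can be reproduced locally with a minimal sketch, assuming Node.js and its built-in Buffer. The helper names mangle and unmangle are hypothetical, and this only imitates the symptom (UTF-8 bytes read back as Latin-1 characters); it is not a claim about how Codewars actually produces it.

```javascript
// Hypothetical helper: encode the string as UTF-8, then read every byte
// back as its own Latin-1 character (U+0000..U+00FF). This imitates the
// mangling described in this issue.
function mangle(text) {
  return Buffer.from(text, "utf8").toString("latin1");
}

// Hypothetical helper: undo the mangling by turning each character back
// into a single byte and decoding the bytes as UTF-8.
// Assumes every character of the input is in U+0000..U+00FF.
function unmangle(text) {
  return Buffer.from(text, "latin1").toString("utf8");
}

console.log(mangle("£"));               // Â£
console.log(mangle("£££££"));           // Â£Â£Â£Â£Â£
console.log(mangle("\uffff"));          // ï¿¿
console.log(unmangle(mangle("你好")));  // 你好
```

If the real cause matches this sketch, the output is still recoverable, which suggests the bytes themselves are intact and only the decoding step is wrong.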

As I said, this seems to be pretty serious for anyone needing Unicode.


Note: I discovered this while completing Simple Change Machine, which uses the pound symbol.
