Wrong formatting of string arguments containing umlauts #2888

ToolsDevler · 2022-05-09T15:07:22Z

Issue

We switched from libfmt 6 to 8 on our Red Hat Linux development environments and noticed a change in formatting behavior. When formatting strings with a precision (e.g. %-5.5s or {0:<5.5}; see the repro below), fmt adds spaces at the end of the output (the more umlauts, the more spaces).

fmt::sprintf("%-5.5s", "ö-----") // Returns  'ö--- ' <-- Space at the end

fmt::format("{0:<5.5}", "ö-----") // Returns 'ö--- ' <-- Again

// And as a reference, the standard library (buf is a sufficiently large char array):
snprintf(buf, sizeof(buf), "%-5.5s", "ö-----") // buf contains 'ö---' <-- No space

Details

I tried to debug the library to find where the problem comes from. It looks like the extra spaces originate in the function 'code_point_length' in core.h. The umlaut causes this function to return 4. If I hard-code the result to 1, I get the expected result.

This happens on our RedHat Linux Servers with LC_CTYPE=en_US.iso885915

I will keep investigating, but I have to admit that I am struggling a bit with this one.

ToolsDevler · 2022-05-10T11:29:08Z

I think I made some progress. I had somehow assumed that both UTF-8 and ISO-8859-15 encode the letter 'ö' as 0xF6, but that is wrong:

UTF-8: 0xC3 0xB6
ISO-8859-15: 0xF6

0xF6 is 11110110 in binary, which starts with four 1s. In UTF-8 this means that the character is encoded in 4 bytes. (In the UTF-8 version, 0xC3 is 11000011, whose two leading 1s indicate a two-byte character.)
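
To make the bit-pattern argument concrete, here is a minimal sketch of a length lookup that looks only at the leading 1 bits of the first byte. `lead_byte_length` is a hypothetical helper written for illustration; it is not fmt's `code_point_length`, and it does no validation of the following bytes.

```cpp
// Hypothetical helper (not part of fmt): sequence length implied by the
// first byte of a UTF-8 sequence, judged only by its leading 1 bits.
int lead_byte_length(unsigned char b) {
    if (b < 0x80) return 1;           // 0xxxxxxx: ASCII, one byte
    if ((b >> 5) == 0x06) return 2;   // 110xxxxx: two-byte sequence
    if ((b >> 4) == 0x0E) return 3;   // 1110xxxx: three-byte sequence
    if ((b >> 3) == 0x1E) return 4;   // 11110xxx: four-byte sequence
    return 1;                         // continuation or invalid byte
}
```

With this sketch, `lead_byte_length(0xC3)` gives 2 (the UTF-8 'ö'), while `lead_byte_length(0xF6)` gives 4 (the ISO-8859-15 'ö' misread as a UTF-8 lead byte), which matches the value observed in the debugger.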

I have never worked deeply with encodings, so I'm not sure this is really correct, but from what I see it makes sense. Can somebody confirm that?

ToolsDevler · 2022-05-10T11:31:31Z

If my comment above is correct, this would mean that libfmt has UTF-8-specific functions that cannot work with other encodings. Since other popular non-UTF-8 encodings are in use, I think this will cause a lot of issues for many users of libfmt.

vitaut · 2022-05-11T13:46:09Z

Fixed in 358f5a7, thanks for reporting. Now fmt::format("{0:<5.5}", "ö-----") returns "ö----" (5 code points) as expected.
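
For readers wondering how a precision of 5 can yield "ö----" for Latin-1 input, one possible approach is sketched below. This is an assumption for illustration only and not necessarily what commit 358f5a7 does: `prefix_bytes` is a hypothetical helper that counts a well-formed UTF-8 sequence as one unit and falls back to counting a single byte when the expected continuation bytes are missing. The check is deliberately simplified (it does not reject overlong forms or out-of-range values).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of fmt): number of bytes occupied by the
// first `max_units` units of `s`, where a unit is a lead byte followed by
// the expected 10xxxxxx continuation bytes or, failing that, a single byte.
std::size_t prefix_bytes(std::string_view s, std::size_t max_units) {
    std::size_t i = 0;
    for (std::size_t units = 0; units < max_units && i < s.size(); ++units) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if ((b >> 5) == 0x06) len = 2;
        else if ((b >> 4) == 0x0E) len = 3;
        else if ((b >> 3) == 0x1E) len = 4;
        // Accept the multi-byte length only if enough continuation bytes follow.
        bool ok = i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k)
            ok = (static_cast<unsigned char>(s[i + k]) >> 6) == 0x02;
        i += ok ? len : 1;
    }
    return i;
}
```

Under these assumptions, the ISO-8859-15 bytes `"\xF6-----"` with a limit of 5 give a 5-byte prefix (the umlaut plus four dashes), and the UTF-8 bytes `"\xC3\xB6-----"` give a 6-byte prefix, which is also the umlaut plus four dashes.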

ToolsDevler · 2022-05-12T06:56:02Z

Thanks for the fast fix!

ToolsDevler changed the title ~~Wrong formatting of Latin1 encoded strings~~ Wrong formatting of string arguments containing umlauts May 9, 2022

vitaut closed this as completed May 11, 2022
