Wrong formatting of string arguments containing umlauts #2888

Closed
ToolsDevler opened this issue May 9, 2022 · 4 comments
@ToolsDevler
Contributor

repro: https://godbolt.org/z/T5PWoEhdW

Issue

We switched from libfmt 6 to 8 on our RedHat Linux development environments and noticed a change in formatting behavior. When a precision is applied to strings (e.g. %-5.5s or {0:<5.5}; see the repro for a test case), fmt adds spaces at the end of the output (the more umlauts, the more spaces).

fmt::sprintf("%-5.5s", "ö-----") // Returns 'ö--- ' <-- Space at the end

fmt::format("{0:<5.5}", "ö-----") // Returns 'ö--- ' <-- Again

// And as a reference, the standard library (buf is a char array large enough for the result):
snprintf(buf, sizeof(buf), "%-5.5s", "ö-----") // Writes 'ö---' <-- No space

Details

I tried to debug the library to find where the problem comes from. It looks like the extra spaces originate in the function 'code_point_length' in core.h. The umlaut causes this function to return 4. If I hard-code the result to 1, I get the expected result.

This happens on our RedHat Linux Servers with LC_CTYPE=en_US.iso885915

I will keep investigating, but I have to admit that I am struggling a bit with this one.

ToolsDevler changed the title from "Wrong formatting of Latin1 encoded strings" to "Wrong formatting of string arguments containing umlauts" on May 9, 2022
@ToolsDevler
Contributor Author

ToolsDevler commented May 10, 2022

I think I made some progress. I somehow got the idea that both UTF-8 and ISO-8859-15 encode the letter 'ö' as 0xF6. But this is wrong.

UTF-8: 0xC3 0xB6
ISO-8859-15: 0xF6

0xF6 is 11110110 in binary, which starts with four 1 bits. In UTF-8, a lead byte of that form indicates that the character is encoded in 4 bytes. (For the UTF-8 version, 0xC3 is 11000011, whose two leading 1 bits indicate a two-byte character.)

I have never worked deeply with encodings, so I'm not sure this is really correct, but from what I see it makes sense. Can somebody confirm?

@ToolsDevler
Contributor Author

If my comment above is correct, this would mean that libfmt has UTF-8-specific functions that cannot work with other encodings. Since other popular non-UTF-8 encodings are in use, I think this will cause a lot of issues for many users of libfmt.

@vitaut
Contributor

vitaut commented May 11, 2022

Fixed in 358f5a7, thanks for reporting. Now fmt::format("{0:<5.5}", "ö-----") returns "ö----" (5 code points) as expected.

vitaut closed this as completed on May 11, 2022
@ToolsDevler
Contributor Author

ToolsDevler commented May 12, 2022

Thanks for the fast fix!
