printf %ls support - C++ standard compatibility #572

vgalka-sl · 2017-09-24T16:01:11Z

Hi,

The current library version does not allow mixing wchar_t* arguments into char* format strings. The following code does not compile (at least not on MS VS2017):

fmt::printf("%ls", L"foo");

However, according to the C++11 standard, it should be valid for printf-like functions.

C++11 includes the C library as described by the 1999 ISO C standard and its Technical Corrigenda 1, 2 and 3 (ISO/IEC 9899:1999 and ISO/IEC 9899:1999/Cor.1,2,3), plus <uchar.h> (as by ISO/IEC 19769:2004).

Looking into ISO/IEC 9899:TC2, it describes the %ls format specifier as follows (page 279):

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte).
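As an illustration of the quoted wording, the conversion that %ls performs can be sketched with the standard library alone. The helper name narrow_like_percent_ls is hypothetical and not part of fmt. The sketch assumes a stateless multibyte encoding, so it does not emit the final shift sequence that converting the terminating null would produce in a stateful one.

```cpp
#include <cassert>
#include <climits>    // MB_LEN_MAX
#include <cstddef>
#include <cwchar>     // std::wcrtomb, std::mbstate_t
#include <stdexcept>
#include <string>

// Hypothetical helper (not a fmt API): converts a wide string to the
// multibyte encoding of the current C locale, one character at a time.
std::string narrow_like_percent_ls(const wchar_t* ws) {
  assert(ws != nullptr);
  std::mbstate_t state = std::mbstate_t();  // zero-initialized conversion state
  std::string out;
  char buf[MB_LEN_MAX];
  for (; *ws != L'\0'; ++ws) {
    const std::size_t n = std::wcrtomb(buf, *ws, &state);
    if (n == static_cast<std::size_t>(-1))
      throw std::runtime_error("wide character not representable in current locale");
    out.append(buf, n);
  }
  return out;  // the terminating null is not part of the result
}
```

In the default "C" locale this handles ASCII input such as L"foo"; other characters need a suitable std::setlocale call first.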

It would be nice to have this standard compatibility :-)

Best regards,
Vasili


vitaut · 2017-09-27T13:26:26Z

Thanks for raising this, I agree that it should be supported. Would you be willing to contribute a fix by any chance?

vgalka-sl · 2017-09-27T15:30:52Z

I haven't dived into the internals yet. Made a workaround for now...
But I'll try when I find time :-)

pulsa · 2017-12-23T20:59:01Z

This sounds very useful. I started looking at how to do it.
My ideal implementation would ignore the 'l' in "%ls" and automatically convert all string arguments to match the format string encoding, for both (s)printf() and format(). This simplifies the format string requirements and avoids a source of run-time exceptions, which is important because...

If narrow format strings can accept wide arguments (and vice versa), we lose compile-time safety checks like #606. But if we can always do the right thing with the input, this is not a problem.

However, conversion errors can happen if the input is invalid UTF-8, or if it contains characters that aren't in the output code set (non-Unicode only). We could throw an exception, replace with some other character or truncate the string. Should this be configurable somehow?
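The three options (throw, replace, truncate) could be exposed as a policy argument. The following is only a sketch under assumptions: the enum on_error and the function to_utf8 are hypothetical names, the input is taken as UTF-32 to sidestep the wchar_t width question, and "invalid" here means a surrogate or a value above U+10FFFF.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical policy names; fmt does not define this enum.
enum class on_error { throw_exception, replace, truncate };

// Encodes UTF-32 code points as UTF-8, applying `policy` to invalid input
// (surrogates and values above U+10FFFF).
std::string to_utf8(const std::u32string& in, on_error policy) {
  std::string out;
  for (char32_t c : in) {
    const bool valid = c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
    if (!valid) {
      if (policy == on_error::throw_exception)
        throw std::invalid_argument("invalid code point");
      if (policy == on_error::truncate) return out;  // keep what was converted
      c = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER
    }
    assert(c <= 0x10FFFF);  // c is a valid scalar value from here on
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else if (c < 0x800) {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      out += static_cast<char>(0xE0 | (c >> 12));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (c >> 18));
      out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```

A run-time argument keeps the sketch short; a template parameter or an error-handler object would be alternatives if the choice should be fixed at compile time.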

One more complication: setlocale() can't use UTF-8 on Windows/MSVC, so std::wcrtomb/mbrtowc() won't work here. I'm guessing most users want UTF-8, but console output uses the system code page by default so if they do not explicitly set up for UTF-8 they will get mojibake or worse when writing non-ASCII. This points to another new option: use the locale setting or force UTF-8. Or we could say fmtlib always uses UTF-8 and if you want some other char encoding you have to convert it yourself. I would be happy with this restriction.

BasicWriter::write_str() currently uses std::uninitialized_copy() to convert from char to wchar_t. This only works if the input is ASCII. I think this would have to change.
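To see why an element-wise copy is only correct for ASCII, consider this small sketch. widen_bytewise is a hypothetical helper that mimics such a copy; the detour through unsigned char is an addition of the sketch so that bytes above 0x7F do not sign-extend.

```cpp
#include <cassert>
#include <string>

// Widens each byte on its own, which is what an element-wise copy from
// char to wchar_t amounts to. Hypothetical helper for illustration only.
std::wstring widen_bytewise(const std::string& in) {
  std::wstring out;
  out.reserve(in.size());
  for (char ch : in) {
    // Go through unsigned char so bytes >= 0x80 do not sign-extend.
    out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(ch)));
  }
  assert(out.size() == in.size());  // always one wide char per input byte
  return out;
}
```

For "abc" the result is L"abc", but the two UTF-8 bytes of "é" (0xC3 0xA9) come out as two separate wide characters instead of the single character U+00E9.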

I'm assuming wchar_t* always means UTF-16 on Windows and UTF-32 elsewhere. Hope that's OK.
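Under that assumption, code that inspects the wide data directly has to treat the two widths differently. A minimal sketch, with decode_wide as a hypothetical name: it combines surrogate pairs only when wchar_t is 16 bits wide and otherwise passes each unit through as a code point. Unpaired surrogates are passed through unchanged rather than reported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turns a wide string into UTF-32 code points,
// assuming UTF-16 when wchar_t is 2 bytes and UTF-32 otherwise.
std::u32string decode_wide(const std::wstring& in) {
  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t c = static_cast<char32_t>(in[i]);
    if (sizeof(wchar_t) == 2 && c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
      const char32_t low = static_cast<char32_t>(in[i + 1]);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        // Combine a high/low surrogate pair into one code point.
        c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
        ++i;
      }
    }
    out.push_back(c);
  }
  assert(out.size() <= in.size());  // decoding never adds code points
  return out;
}
```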

@vitaut, are you ready to dive into Unicode madness? 😃

vitaut · 2017-12-24T17:11:57Z

> We could throw an exception, replace with some other character or truncate the string. Should this be configurable somehow?

I'd go with an exception by default, but it would be nice to make this configurable. The std branch introduced error handlers to make error behavior configurable, maybe they can be used here.

> This points to another new option: use the locale setting or force UTF-8.

fmt::*printf should probably use the locale setting for compatibility with system printf. For the new format functions I'd go with UTF-8 since it has become the de-facto standard pretty much everywhere.

> BasicWriter::write_str() currently uses std::uninitialized_copy() to convert from char to wchar_t. This only works if the input is ASCII. I think this would have to change.

Yes, this is a pre-#606 artefact.

> I'm assuming wchar_t* always means UTF-16 on Windows and UTF-32 elsewhere. Hope that's OK.

Why is it necessary? If we use wcrtomb then it shouldn't matter.

> @vitaut, are you ready to dive into Unicode madness?

Yes, sounds exciting =).

pulsa · 2017-12-29T07:52:16Z

I have a very rough proof of concept using wcrtomb/mbrtowc which works great on Linux, but with MSVC we get ANSI code pages and UCS-2 instead of UTF-8 and UTF-16. I'm working on an alternate method for proper Unicode support, using nowide::utf.

This work overlaps with #628, but if I tried to handle that at the same time I would never finish anything, so I'm ignoring it for now. This modern C++ stuff is still new to me.

aetchevarne · 2018-06-29T20:22:38Z

Just commenting that boost.locale and https://tzlaine.github.io/text/doc/html/index.html provide tools to work with Unicode; maybe they are useful for fmt?

vitaut · 2018-07-01T21:22:45Z

Thanks, @aetchevarne.

matt77hias · 2018-09-16T13:46:30Z

Is there a cheap workaround to use wchar_t*/std::wstring/std::wstring_view arguments for std::string_view format strings (similar to the use of %ls in std::printf)? Or the dual: using char*/std::string/std::string_view arguments for std::wstring_view format strings (similar to the use of %s in std::wprintf). Converting std::wstring to std::string and vice versa on the fly is pretty expensive (i.e. allocations).

On Windows these ANSI<>UTF-16 differences are really an issue. std::filesystem::path for instance uses wchar_t (UTF-16) and is quite useful to output in addition to errors while parsing files.

vitaut · 2018-09-16T21:42:47Z

You could use utf16_to_utf8 which will not do dynamic allocations for strings smaller than inline_buffer_size (500 chars).

matt77hias · 2018-09-17T08:26:25Z

But does that mean that I need to link against Microsoft's complete C++ Rest SDK?

_ASYNCRTIMP std::string __cdecl utf16_to_utf8(const utf16string &w);

How does this returned string's content outlive the call to utf16_to_utf8?

As a side note, sometimes I need to perform conversions between std::string<>std::wstring (e.g., reading a std::string from a file and using it as a filename for another file), then I use the following:

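#include <string>

#include <AtlBase.h>
#include <atlconv.h>

// Converts a narrow string to a wide string with ATL's CA2W,
// which uses the ANSI code page by default (not UTF-8).
[[nodiscard]]
std::wstring StringToWString(const std::string& str) {
    return std::wstring(CA2W(str.c_str()));
}

// Converts a wide string to a narrow string with ATL's CW2A;
// characters outside the ANSI code page may be lost.
[[nodiscard]]
std::string WStringToString(const std::wstring& str) {
    return std::string(CW2A(str.c_str()));
}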

matt77hias · 2018-09-17T08:45:39Z

CA2W and CW2A use fixed-size buffers of length 128. In most cases, however, this is insufficient for a complete file path while developing in Visual Studio. The MAX_PATH macro is set to 260.

In the above example, however, an allocation always happens, since the CA2W and CW2A temporaries are destroyed on return. Alternatively, one can keep the CA2W and CW2A objects alive while using fmt, but this is not very transparent: the boilerplate has to be written at every occurrence, and it is not always needed, for example when redefining assert using fmt (NDEBUG).

A more transparent way consists of partially specializing fmt::formatter, but fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx) returns the format and destroys all local buffers upon doing so.

Or alternatively why does fmt not just perform the char<>wchar_t conversion on the fly (Windows: WideCharToMultiByte and MultiByteToWideChar)? There is always the possibility of data loss, but fmt's output is pretty visible to the programmer ;-)

vitaut · 2018-09-17T15:43:15Z

> But does that mean that I need to link against Microsoft's complete C++ Rest SDK?

No, I was talking about fmt's utf16_to_utf8: https://github.com/fmtlib/fmt/blob/master/include/fmt/format.h#L1131. It gives you a temporary buffer and only allocates on large strings.
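The "temporary buffer that only allocates on large strings" idea can be illustrated without fmt. This is a sketch under assumptions, not fmt's implementation: the class name narrowed is hypothetical, it converts through the current C locale with std::wcsrtombs rather than from UTF-16 to UTF-8, and the inline size of 500 is borrowed from the inline_buffer_size figure mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>     // std::wcsrtombs, std::mbstate_t
#include <stdexcept>
#include <vector>

// Hypothetical conversion buffer: short results live in inline storage,
// longer ones fall back to a heap allocation.
class narrowed {
 public:
  explicit narrowed(const wchar_t* ws) {
    assert(ws != nullptr);
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = ws;
    // First pass: a null destination only measures the output length.
    const std::size_t n = std::wcsrtombs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1))
      throw std::runtime_error("wide string not representable in current locale");
    char* dst = inline_;
    if (n + 1 > sizeof(inline_)) {  // does not fit: allocate
      heap_.resize(n + 1);
      dst = heap_.data();
    }
    state = std::mbstate_t();
    src = ws;
    std::wcsrtombs(dst, &src, n + 1, &state);  // second pass: convert
    data_ = dst;
    size_ = n;
  }
  narrowed(const narrowed&) = delete;  // data_ may point into inline_
  narrowed& operator=(const narrowed&) = delete;

  const char* c_str() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  char inline_[500];
  std::vector<char> heap_;
  const char* data_;
  std::size_t size_;
};
```

Used as a temporary inside a call, for example std::printf("%s\n", narrowed(L"foo").c_str()), the buffer lives until the end of the full expression, so the pointer stays valid for the duration of the call.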

matt77hias · 2018-09-17T16:28:41Z

> No, I was talking about fmt's utf16_to_utf8: https://github.com/fmtlib/fmt/blob/master/include/fmt/format.h#L1131. It gives you a temporary buffer and only allocates on large strings.

Ah, ok that seems like a more appropriate choice. Seems like ATL's CA2W and CW2A, but with exceptions.

> CA2W and CW2A use fixed-size buffers of length 128. In most cases, however, this is insufficient for a complete file path while developing in Visual Studio. The MAX_PATH macro is set to 260.

Correction: the size of the buffer is a template argument set to 128 by default. So I can increase this to 256 or 512 as well.

So can I call these utf16_to_utf8/utf8_to_utf16 inside a partial specialization of fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx)?

Sample: Godbolt
Update

Without knowing much about the internals of fmt, this could only work if fmt::format_to is not lazy and directly evaluates the arguments (i.e. no capture beyond the lifetime of the fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx) call).

Thanks for the support.

vitaut · 2018-09-19T00:08:39Z

> So can I call these utf16_to_utf8/utf8_to_utf16 inside a partial specialization of fmt::formatter<T>::format<FormatContextT>(const T& a, FormatContextT& ctx)?

Sure, if it compiles =).

You can safely pass temporaries to fmt::format_to. It is not "lazy".

vitaut · 2019-03-15T02:42:08Z

Looks like there is not enough interest in this feature so closing, but PRs are still welcome.

jovibor · 2020-01-05T01:25:50Z

Could you please reconsider implementing this?
This is a very useful feature, and it's very inconvenient to first convert char*<->wchar_t* to be able to use them with format.
Thanks.

vitaut mentioned this issue Nov 10, 2017

Disallow strings as input for wstring templates #606

Closed

pulsa mentioned this issue Jan 7, 2018

WIP: Automatically convert string/wstring arguments to match format string. #635

Closed

matt77hias mentioned this issue Sep 16, 2018

Mixing char and wchar_t matt77hias/MAGE-v0#68

Closed

vitaut added the help wanted label Feb 10, 2019

vitaut closed this as completed Mar 15, 2019
