Feature request: better support for UTF-8 localized number formatting #1861

Closed
jwtowner opened this issue Sep 5, 2020 · 7 comments

Comments

@jwtowner

jwtowner commented Sep 5, 2020

Hi! Here's the problem. Given that char8_t is kind of broken currently and still lacks a good standardized transcoding library, we're treating char-based strings as if they were UTF-8. We want to format localized numbers with these strings. The problem is that std::numpunct<char>::decimal_point() and thousands_sep() only return a char, so they cannot represent UTF-8 characters beyond the ASCII subset. Some locales use non-ASCII characters for these; one example is de-CH, which uses U+2019 for the digit separator. What we want to do is somehow transcode the values from std::numpunct<wchar_t> to UTF-8 and have libfmt use these instead. Using a custom formatter specialization and a wrapper type isn't really an option for us, since we want this to be less intrusive and not something users of the library need to be concerned about. Fortunately, we do have a facade class for formatting localized strings, and this class owns the std::locale object, so it's possible for us to do some pre-processing or post-processing on the input and output arguments respectively. However, pre-processing doesn't work that well for vformat, since there's no way to convert format_args to wformat_args (and there probably shouldn't be, since that would end up increasing code bloat by creating a link dependency between the char and wchar_t formatters due to the type erasure that is involved). So we're kind of stuck with post-processing.
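To make the transcoding step concrete, here is a minimal sketch using only the standard library. The names `encode_utf8`, `SwissWidePunct` and `thousands_sep_utf8` are hypothetical and not part of {fmt} or the standard. `SwissWidePunct` is a stand-in for a real de-CH wide facet, since a named de-CH locale may not be installed, and the sketch assumes the wchar_t value is a single code point (true for U+2019 on all common platforms).

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Sketch only: it does not reject surrogates or values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical stand-in for the wide numpunct facet of a de-CH locale.
class SwissWidePunct : public std::numpunct<wchar_t> {
protected:
    wchar_t do_thousands_sep() const override { return L'\u2019'; }
    std::string do_grouping() const override { return "\3"; }
};

// Read the wide separator from the locale and transcode it to UTF-8.
// Assumes the wchar_t value is a complete code point.
std::string thousands_sep_utf8(const std::locale& loc) {
    const auto& punct = std::use_facet<std::numpunct<wchar_t>>(loc);
    return encode_utf8(static_cast<char32_t>(punct.thousands_sep()));
}
```

With a locale built as `std::locale(std::locale::classic(), new SwissWidePunct)`, `thousands_sep_utf8` yields the three-byte sequence \xe2\x80\x99, which is exactly what a single char returned by std::numpunct<char>::thousands_sep() cannot hold.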

Ideally, if libfmt (and perhaps eventually the standardized version) had an option to use the std::num_put<char> facet instead of std::numpunct<char>, we could solve the problem that way by providing a custom std::num_put<char> facet that outputs the correct UTF-8 sequence. libfmt would then need to recognize a different localized number format specifier, perhaps uppercase N instead of L, to indicate that it should use std::num_put instead of std::numpunct.
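As a rough illustration of the custom std::num_put<char> idea, here is a sketch that works with iostreams today. The class name `Utf8GroupingNumPut` and the helper `format_grouped` are hypothetical; the facet only overrides the `long` overload, groups by threes unconditionally, and ignores width, fill and format flags, so it is a demonstration under those assumptions rather than a complete facet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical facet: writes integers with a multi-byte UTF-8 group separator.
class Utf8GroupingNumPut : public std::num_put<char> {
public:
    explicit Utf8GroupingNumPut(std::string separator)
        : separator_(std::move(separator)) {}

protected:
    using std::num_put<char>::do_put;  // keep the other overloads visible

    iter_type do_put(iter_type out, std::ios_base&, char_type,
                     long value) const override {
        const bool negative = value < 0;
        // Compute the magnitude in unsigned arithmetic to avoid overflow on LONG_MIN.
        const unsigned long magnitude =
            negative ? 0UL - static_cast<unsigned long>(value)
                     : static_cast<unsigned long>(value);
        const std::string digits = std::to_string(magnitude);

        std::string text;
        if (negative) text += '-';
        for (std::size_t i = 0; i < digits.size(); ++i) {
            // Insert the separator before every group of three digits.
            if (i != 0 && (digits.size() - i) % 3 == 0) text += separator_;
            text += digits[i];
        }
        return std::copy(text.begin(), text.end(), out);
    }

private:
    std::string separator_;
};

// Hypothetical helper: format one value through a stream using the facet.
std::string format_grouped(long value, const std::string& separator) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(),
                         new Utf8GroupingNumPut(separator)));
    os << value;
    return os.str();
}
```

Because the facet produces the whole digit string itself, the separator can be any byte sequence, which is the property std::numpunct<char> lacks.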

Another option would be to eventually support char8_t, char16_t and char32_t string formatting, and automatically transcode the values from std::numpunct<wchar_t> to the target character encoding.

It also looks like it would be possible to specialize the internal detail::int_writer or detail::arg_formatter template classes for each of the integral and floating point types to override the default formatting behavior, but this isn't a solution that would be portable to other implementations of std::format. So not really a valid solution for us.

Our current workaround, which should also work with the standardized std::format, is to detect when the decimal point or thousands separator is a non-ASCII character and override the std::locale object with a custom std::numpunct<char> facet. This facet uses the ASCII control characters \x01 and \x02 for the decimal point and digit separator, since these aren't found in strings in any of our use cases. We then do a post-processing pass on the formatted string to replace \x01 or \x02 with the correct UTF-8 octet sequence. It's definitely a hack, but it works.

It looks something like this:

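#include <array>
#include <locale>
#include <string>
#include <string_view>
#include <utility>

#include <fmt/format.h>  // older {fmt} releases declare the locale overloads in <fmt/locale.h>

class Localizer
{
public:
    std::string VFormat(std::string_view fmtstr, fmt::format_args args) const
    {
        // locale_ carries the sentinel numpunct facet when postProcess_ is set
        std::string result = fmt::vformat(locale_, fmtstr, args);
        if (postProcess_)
            DoPostProcess(result);
        return result;
    }

    template <typename... Args>
    std::string Format(std::string_view fmtstr, const Args&... args) const
    {
        return VFormat(fmtstr, fmt::make_format_args(args...));
    }

    // implementation also provides VFormatTo and FormatTo member functions similar to above

private:
    void DoPostProcess(std::string& result) const
    {
        // substitute \x01 and \x02 in result with values from the replacements_ array
        // (assumed order: [0] = decimal point, [1] = digit separator)
        std::string patched;
        patched.reserve(result.size());
        for (char c : result)
        {
            if (c == '\x01')
                patched += replacements_[0];
            else if (c == '\x02')
                patched += replacements_[1];
            else
                patched += c;
        }
        result = std::move(patched);
    }

    std::locale locale_;
    bool postProcess_;
    std::array<std::string, 2> replacements_;
    // localized string tables, etc. etc.
};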
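For readers who want to try the sentinel trick without {fmt}, the same idea can be sketched with iostreams alone. The names `SentinelPunct`, `replace_sentinels` and `format_localized` are hypothetical, and the fixed two-digit precision and grouping by threes are assumptions chosen only to keep the example short.

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical facet: emits control characters in place of the real punctuation.
class SentinelPunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return '\x01'; }
    char do_thousands_sep() const override { return '\x02'; }
    std::string do_grouping() const override { return "\3"; }
};

// Post-processing pass: swap each sentinel for its UTF-8 replacement.
std::string replace_sentinels(std::string_view in,
                              std::string_view decimal_point,
                              std::string_view thousands_sep) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '\x01')
            out.append(decimal_point);
        else if (c == '\x02')
            out.append(thousands_sep);
        else
            out += c;
    }
    return out;
}

// Format with the sentinel locale, then patch the result.
std::string format_localized(double value,
                             std::string_view decimal_point,
                             std::string_view thousands_sep) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new SentinelPunct));
    os << std::fixed << std::setprecision(2) << value;
    return replace_sentinels(os.str(), decimal_point, thousands_sep);
}
```

The approach only holds as long as \x01 and \x02 never occur in the format string or in formatted arguments, which is the same caveat stated above for the original workaround.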

What are your thoughts? Any better ideas? Is there any good way out of this quagmire?

@jwtowner
Author

jwtowner commented Sep 5, 2020

On second thought, a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea. It would still be nice if there were a standard, out-of-band way to tell it to use std::num_put instead, though.

@DanielaE
Contributor

DanielaE commented Sep 6, 2020

I can certainly relate to your pain. Proper localization without proper UTF support is still a mirage for the most part. Sooner or later you will get bitten by reality and by implicit assumptions, such as a single character equating to a single code unit. I noticed this with the de_CH locale just recently during my attempt to serve our Swiss customers better. Experiences like these are my main motivation to refuse any meaningful string handling using char-based strings. So there you go: either do string conversions between char-based and wchar_t-based strings wherever possibly needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.

@foonathan
Contributor

Just using wchar_t for the notion of thousands separator is still wrong. The type to represent a UTF character is string, as it can span arbitrarily many code points.

@jwtowner
Author

jwtowner commented Sep 7, 2020

Just using wchar_t for the notion of thousands separator is still wrong. The type to represent a UTF character is string, as it can span arbitrarily many code points.

Yeah exactly, the real problem is the Standard library facets that return char_type for certain fields rather than string_type. Namely std::numpunct and std::moneypunct. That part of the library hasn't aged nearly as well. Perhaps what is needed is for someone to write a proposal to modernize those while maintaining backwards compatibility. I think it should be possible. Getting it approved, well that's a different story.
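The split between char_type and string_type fields can be checked at compile time. The following sketch only inspects the return types of the standard facet members; the alias and constant names are hypothetical, and it assumes C++17 for `std::is_same_v`.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <type_traits>
#include <utility>

using NumPunct = std::numpunct<char>;
using MoneyPunct = std::moneypunct<char>;

// Separators are a single char_type, so they cannot hold a multi-byte UTF-8 sequence.
constexpr bool kNumSeparatorIsChar =
    std::is_same_v<decltype(std::declval<const NumPunct&>().thousands_sep()), char> &&
    std::is_same_v<decltype(std::declval<const NumPunct&>().decimal_point()), char>;

constexpr bool kMoneySeparatorIsChar =
    std::is_same_v<decltype(std::declval<const MoneyPunct&>().thousands_sep()), char> &&
    std::is_same_v<decltype(std::declval<const MoneyPunct&>().decimal_point()), char>;

// Other fields of the same facets already return string_type.
constexpr bool kNamesAreStrings =
    std::is_same_v<decltype(std::declval<const NumPunct&>().truename()), std::string> &&
    std::is_same_v<decltype(std::declval<const MoneyPunct&>().curr_symbol()), std::string>;

static_assert(kNumSeparatorIsChar, "numpunct separators are char_type");
static_assert(kMoneySeparatorIsChar, "moneypunct separators are char_type");
static_assert(kNamesAreStrings, "truename and curr_symbol are string_type");
```

So a currency symbol such as a multi-byte UTF-8 sequence fits in curr_symbol(), while the separators in the very same facet are limited to one code unit.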

@jwtowner
Author

jwtowner commented Sep 7, 2020

@DanielaE

either do string conversions between char-based and wchar_t-based strings wherever possibly needed (leaving litter all around) or simply stick with wchar_t-based strings and live with the size and performance impact.

Definitely the case today, but would be nice to get this fixed for the future.

@vitaut
Contributor

vitaut commented Sep 20, 2020

a different format specifier to indicate preference of std::num_put instead of std::numpunct is probably a bad idea.

It is.

As Jonathan correctly pointed out wchar_t doesn't solve the problem (and should generally be avoided for other reasons).

It might be possible to replace numpunct with num_put for locale-specific formatting in {fmt}, although it had better use something less trashy than ostreambuf_iterator. A PR is welcome.

@vitaut
Contributor

vitaut commented Sep 3, 2022

{fmt} now supports the UTF-8 format_facet locale facet which, among other things, makes using multi-code-unit digit separators possible. For example:

#include <fmt/format.h>
#include <locale>

int main() {
  std::locale::global(std::locale({}, new fmt::format_facet<std::locale>("’")));
  fmt::print("{:L}\n", 1000);
}

prints:

1’000

Here ’ is U+2019 (\xe2\x80\x99 in UTF-8).


4 participants