to_utf8 truncates characters instead of performing the conversion #395

isnullxbh · 2018-08-28T08:28:28Z

spirit/include/boost/spirit/home/support/utf8.hpp

Line 66 in 925f40a

*utf8_iter++ = (UChar)ch;

Is it permissible to convert wchar_t to char here (when the input parameter is a std::wstring)?

Kojoley · 2018-10-25T19:34:36Z

I have looked more into to_utf8 and cannot find any problem there. It casts wchar_t to unsigned short and then assigns the value to an iterator. Can you provide an example?

This one works as expected:

#include <boost/spirit/home/support/utf8.hpp>
#include <string>

int main()
{
    std::wstring s = L"привет";
    return "\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82" != boost::spirit::to_utf8(s);
}

isnullxbh · 2018-10-26T07:34:37Z

Can you try running the following code on a Windows machine?

#include <boost/spirit/home/support/utf8.hpp>
#include <string>

int main()
{
    std::wstring s = L"𠼭";
    return "\xf0\xa0\xbc\xad" == boost::spirit::to_utf8(s);
}

Kojoley · 2018-10-26T11:21:34Z

I think the problem is in your string literal. On Windows wchar_t is 16 bits wide, and the character in your literal, U+20F2D, does not fit into 16 bits (134957 > 65535).

Try to run:

#include <boost/spirit/home/support/utf8.hpp>
#include <string>
#include <iostream>

int main()
{
    std::wstring s = L"𠼭";
    for (auto c : s) std::wcout << +c << '\n';
    std::wcout << L"'" << s << L"'\n";
    return "\xf0\xa0\xbc\xad" == boost::spirit::to_utf8(s);
}

https://wandbox.org/permlink/AGJqDmp1kngplN0X
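To make the arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into two 16-bit code units, assuming the compiler stores the wide literal as UTF-16. The helper names high_surrogate and low_surrogate are hypothetical and are not part of Boost.Spirit.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of Boost.Spirit): split a code point in the
// range U+10000..U+10FFFF into its UTF-16 surrogate pair.
constexpr std::uint16_t high_surrogate(std::uint32_t cp)
{
    return static_cast<std::uint16_t>(0xD800u + ((cp - 0x10000u) >> 10));
}

constexpr std::uint16_t low_surrogate(std::uint32_t cp)
{
    return static_cast<std::uint16_t>(0xDC00u + ((cp - 0x10000u) & 0x3FFu));
}

// U+20F2D does not fit into 16 bits, so it becomes two code units.
static_assert(high_surrogate(0x20F2Du) == 0xD843u, "high surrogate");
static_assert(low_surrogate(0x20F2Du) == 0xDF2Du, "low surrogate");
```

Under that assumption, the loop above would be expected to print two values for a single character on a platform with a 16-bit wchar_t.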

Kojoley · 2018-10-28T14:50:43Z

I did some research, and there is indeed a problem in to_utf8.

The problem

From [lex.ccon]/6:

A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t.¹⁸ The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ] The value of a wide-character literal containing multiple c-chars is implementation-defined.

#include <iostream>
#include <string>

int main()
{
    using namespace std::literals;
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';
    std::cout << "string literal size: " << L"𠼭"s.size() << '\n';
}

Linux (GCC 8.2/Clang 7):

sizeof(wchar_t): 4
string literal size: 1

Windows (MSVC 14.1/GCC 8.2/Clang 7):

sizeof(wchar_t): 2
string literal size: 2

On Windows the content of a wchar_t string literal seems to be UTF-16, and this is where the problem comes from. Spirit does not do any charset conversion; it simply feeds the data to utf8_output_iterator, which expects UCS-4. This is definitely a bug, but I cannot say how it will be addressed, because in my opinion dealing with wchar_t is not worth the effort. Ignoring the problem is one option, because this function is cosmetic and is used only in parser debugging. Full character conversion support is too much in my view: it would bring in more dependencies for little benefit. A compromise would be to assume UCS-4 on Unix and UTF-16 on Windows, and this is the option I would prefer.
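For illustration only, the UTF-16 half of that compromise could look roughly like the sketch below. It is not the code from Spirit or from any pull request; append_utf8 and utf16_to_utf8 are hypothetical names, the input is taken as std::u16string so the example behaves the same on every platform, and unpaired surrogates are simply encoded as they are instead of being reported as errors.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode UTF-16 code units into code points first,
// then encode each code point as UTF-8.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // A high surrogate followed by a low surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The point of the sketch is the extra decoding step: feeding each 16-bit unit to the UTF-8 encoder directly would treat the two surrogates as two separate characters, whereas combining them first yields the single four-byte sequence.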

What you can do

If you want a UTF-8 encoded string from a string literal, you are better off using the C++11 u8 prefix ([lex.ccon]/3): it gives you a compile-time UTF-8 string while your source code keeps the actual characters. If you need a Unicode string literal, use the C++11 U (uppercase!) prefix ([lex.ccon]/4). It gives you a UCS-4 string, and Spirit works flawlessly with it on all platforms.

#include <boost/core/lightweight_test.hpp>
#include <boost/spirit/home/support/utf8.hpp>
#include <iostream>

int main()
{
    auto s = U"𠼭";
    BOOST_TEST_EQ("\xf0\xa0\xbc\xad", boost::spirit::to_utf8(s));
    BOOST_TEST_CSTR_EQ("\xf0\xa0\xbc\xad", u8"𠼭");
    return boost::report_errors();
}

Kojoley · 2018-10-28T15:03:28Z

@isnullxbh Can you please check if #413 solves the problem for you?

isnullxbh · 2018-10-30T07:08:59Z

Hi, @Kojoley! Thanks a lot for publishing your research results! I can't check it on a Windows machine at the moment, but I saw your merge and I trust the tests you have written.

isnullxbh changed the title ~~Conversion wchar_t to char~~ wchar_t to char conversion Aug 28, 2018

Kojoley added the bug label Aug 28, 2018

Kojoley changed the title ~~wchar_t to char conversion~~ to_utf8 truncates characters instead of performing the conversion Aug 28, 2018

Kojoley mentioned this issue Oct 28, 2018

to_utf8: Fixed wchar_t handling on Windows #413

Merged

Kojoley closed this as completed in #413 Oct 30, 2018
