to_utf8 truncates characters instead of performing the conversion #395

Closed
isnullxbh opened this issue Aug 28, 2018 · 7 comments

Comments

@isnullxbh

commented Aug 28, 2018

*utf8_iter++ = (UChar)ch;

Is it permissible to convert wchar_t to char here (in the case where the input parameter is a std::wstring)?

@isnullxbh isnullxbh changed the title Conversion wchar_t to char wchar_t to char conversion Aug 28, 2018

@Kojoley

This comment was marked as outdated.

Collaborator

commented Aug 28, 2018

Ummm... yes, it is permissible, but it truncates the characters.
However, this code, written by @hkaiser more than 10 years ago, is used only inside the Parser::what functions (parser debug functions that return the parser name and the values it was parametrized with).

@Kojoley Kojoley added the bug label Aug 28, 2018

@Kojoley Kojoley changed the title wchar_t to char conversion to_utf8 truncates characters instead of performing the conversion Aug 28, 2018

@Kojoley

Collaborator

commented Oct 25, 2018

I have looked more into to_utf8 and cannot find any problem there. It casts wchar_t to unsigned short and then assigns the value to an iterator. Can you provide an example?

This one works as expected:

#include <boost/spirit/home/support/utf8.hpp>
#include <string>

int main()
{
    std::wstring s = L"привет";
    return "\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82" != boost::spirit::to_utf8(s);
}
@isnullxbh

Author

commented Oct 26, 2018

Can you try running the following code on a Windows machine?

#include <boost/spirit/home/support/utf8.hpp>
#include <string>

int main()
{
    std::wstring s = L"𠼭";
    return "\xf0\xa0\xbc\xad" ==  boost::spirit::to_utf8(s);
}
@Kojoley

Collaborator

commented Oct 26, 2018

I think the problem is in your string literal. On Windows wchar_t is 16 bits wide, while the character in your literal, U+20F2D, does not fit into 16 bits (134957 > 65535).

Try running:

#include <boost/spirit/home/support/utf8.hpp>
#include <string>
#include <iostream>

int main()
{
    std::wstring s = L"𠼭";
    for (auto c : s) std::wcout << +c << '\n';
    std::wcout << L"'" << s << L"'\n";
    return "\xf0\xa0\xbc\xad" == boost::spirit::to_utf8(s);
}

https://wandbox.org/permlink/AGJqDmp1kngplN0X
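
The arithmetic behind the 16-bit limit can be sketched as follows. This is a minimal illustration, not code from Spirit; the names surrogate_pair and to_surrogates are hypothetical. It shows how UTF-16 splits a code point above U+FFFF into two 16-bit units, which is why a 16-bit wchar_t needs two elements for U+20F2D.

```cpp
#include <cstdint>

// Hypothetical helper type: the two 16-bit units of a UTF-16 surrogate pair.
struct surrogate_pair
{
    std::uint16_t high;
    std::uint16_t low;
};

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
// Precondition (not checked here): 0x10000 <= cp <= 0x10FFFF.
constexpr surrogate_pair to_surrogates(std::uint32_t cp)
{
    return surrogate_pair{
        // high surrogate: top 10 of the remaining 20 bits
        static_cast<std::uint16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        // low surrogate: bottom 10 bits
        static_cast<std::uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))
    };
}

// U+20F2D is stored as the two units 0xD843 0xDF2D.
static_assert(to_surrogates(0x20F2D).high == 0xD843, "high surrogate");
static_assert(to_surrogates(0x20F2D).low == 0xDF2D, "low surrogate");
```

With a 16-bit wchar_t, the loop in the program above would therefore be expected to print two values rather than one.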

@Kojoley

Collaborator

commented Oct 28, 2018

I did some research, and there is a problem in to_utf8.

The problem

From the [lex.ccon]/6:

A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined. [ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ] The value of a wide-character literal containing multiple c-chars is implementation-defined.

#include <iostream>
#include <string>

int main()
{
    using namespace std::literals;
    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';
    std::cout << "string literal size: " << L"𠼭"s.size() << '\n';
}

Linux (GCC 8.2/Clang 7):

sizeof(wchar_t): 4
string literal size: 1

Windows (MSVC 14.1/GCC 8.2/Clang 7):

sizeof(wchar_t): 2
string literal size: 2

On Windows the content of a wchar_t string literal appears to be UTF-16, and this is where the problem comes from. Spirit does not do charset conversions; it simply feeds the data to utf8_output_iterator, which expects UCS-4. This is definitely a bug, but I cannot say how it will be addressed, because in my opinion dealing with wchar_t is not worth the effort. Ignoring the problem is one option, because this function is cosmetic and is used only in parser debugging. Full character conversion support is too much in my view: it would bring more dependencies for little benefit. A compromise would be to assume UCS-4 on Unix and UTF-16 on Windows, and that is the option I would rather go with.
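
To make the difference concrete, here is a minimal standalone sketch. It is not the code from Spirit or from any fix for this issue, and all function names are hypothetical. It contrasts encoding each 16-bit unit as if it were a complete code point with first combining surrogate pairs. It takes std::u16string input so that it behaves the same on every platform; on Windows the same logic would apply to wchar_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point (assumed to be <= 0x10FFFF).
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Problematic approach: every 16-bit unit is treated as a full code point,
// so each half of a surrogate pair is encoded on its own.
inline std::string per_unit_to_utf8(const std::u16string& in)
{
    std::string out;
    for (char16_t unit : in)
        append_utf8(out, unit);
    return out;
}

// Surrogate-aware approach: combine a high/low pair into one code point first.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        // Unpaired surrogates are passed through unvalidated in this sketch.
        append_utf8(out, cp);
    }
    return out;
}
```

For the units 0xD843 0xDF2D (the UTF-16 form of U+20F2D), utf16_to_utf8 yields the four bytes F0 A0 BC AD, while per_unit_to_utf8 yields the six bytes ED A1 83 ED BC AD, which is not valid UTF-8.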

What you can do

If you want a UTF-8 encoded string from the string literal, you had better use the C++11 u8 prefix [lex.ccon]/3: it gives you a compile-time UTF-8 string while your source code still contains the actual characters. If you need a Unicode string literal, use the C++11 U (uppercase!) prefix [lex.ccon]/4. It gives you a UCS-4 string, and Spirit works flawlessly with it on all platforms.

#include <boost/core/lightweight_test.hpp>
#include <boost/spirit/home/support/utf8.hpp>
#include <iostream>

int main()
{
    auto s = U"𠼭";
    BOOST_TEST_EQ("\xf0\xa0\xbc\xad", boost::spirit::to_utf8(s));
    BOOST_TEST_CSTR_EQ("\xf0\xa0\xbc\xad", u8"𠼭");
    return boost::report_errors();
}
@Kojoley

Collaborator

commented Oct 28, 2018

@isnullxbh Can you please check if #413 solves the problem for you?

@isnullxbh

Author

commented Oct 30, 2018

Hi, @Kojoley! Thanks a lot for publishing the results of your research! I can't check it on a Windows machine at the moment, but I saw your merge and I trust the tests you have written.

@Kojoley Kojoley closed this in #413 Oct 30, 2018

2 participants