
Properly handle supplementary characters when saving XML files #197

Closed
wants to merge 2 commits into from

Conversation

Contributor

@Yarn366 Yarn366 commented Jun 8, 2017

This fixes the problem experienced by a number of users, including this one, where supplementary characters (which includes many emoji) are written as pairs of escaped surrogate code points.

Note that the problem only affects platforms like Windows where wchar_t is 2 bytes.

…on platforms where wchar_t (and, by extension, wxUChar) is 2 bytes. Also, ignore invalid surrogates and the noncharacters U+FFFE and U+FFFF.
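The behavior the commit message describes can be sketched in standalone C++ (names and structure here are mine, not the PR's actual code, and every character is escaped for illustration): combine each UTF-16 high/low surrogate pair into one code point before emitting an XML numeric character reference, and skip unpaired surrogates and the noncharacters U+FFFE and U+FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch of the fix: given UTF-16 input, write one
// numeric character reference per *code point*, not per code unit,
// so U+1F600 becomes "&#x1F600;" rather than "&#xD83D;&#xDE00;".
std::string EscapeUtf16(const std::u16string& s) {
    std::ostringstream out;
    for (size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {            // high surrogate
            if (i + 1 < s.size() &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                // Combine the pair into a supplementary code point.
                c = 0x10000 + ((c - 0xD800) << 10) + (s[i + 1] - 0xDC00);
                ++i;
            } else {
                continue;                            // unpaired: drop it
            }
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            continue;                                // stray low surrogate
        }
        if (c == 0xFFFE || c == 0xFFFF)              // noncharacters
            continue;
        out << "&#x" << std::hex << std::uppercase
            << static_cast<uint32_t>(c) << ";";
    }
    return out.str();
}
```

For example, `EscapeUtf16(u"\U0001F600")` yields the single reference `&#x1F600;` instead of the surrogate pair that caused the original bug.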
Member

@Paul-Licameli Paul-Licameli commented Jun 8, 2017

Thanks for this proposal.
I am not familiar with the details of character sets, but I understand that, on Windows, the loss of
information happens only at the line
wxUChar c = s.GetChar(i);

Would it not be simpler to change the type of c to something guaranteed to be 32 bits wide on all platforms? Could it be the char32_t type, which has been standard since C++11?

http://en.cppreference.com/w/cpp/language/types
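The guarantee Paul is relying on can be checked at compile time (a minimal sketch, not project code): char32_t is at least 32 bits on every conforming C++11 platform, unlike wchar_t, which is 16 bits on Windows.

```cpp
#include <climits>

// char32_t (C++11) is wide enough for any Unicode code point
// (U+0000..U+10FFFF) on every platform; wchar_t is not guaranteed
// to be, since it is only 16 bits on Windows.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32,
              "char32_t can hold any Unicode code point");
```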

Contributor Author

@Yarn366 Yarn366 commented Jun 9, 2017

The problem appears to be much deeper than I originally thought.

If you look up the return type of s.GetChar(i), you'll notice that it returns not wxUChar but wxUniChar. From what I gather from the documentation and personal testing, wxUniChar is always 4 bytes, not the size of wchar_t. However, Audacity doesn't seem to use this type anywhere and instead always uses wxChar and its variants, which are typedefs of wchar_t and thus can store characters differently depending on the platform. This results in wxString only storing character values that wchar_t can store, at least some of the time.

It looks like the real solution here is to change the entire source code over from wxChar and its variants to wxUniChar, which appears to be no small task, considering how often the former occurs.

EDIT: Or maybe I'm completely wrong there. What's definitely true, though, is that if I switch just that one function to use a 32-bit character type, it would still be necessary to deal with surrogate pairs, and additionally it would be necessary to decode and encode them in that function, which isn't necessary with the code that I submitted.
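The point in the EDIT above can be demonstrated outside wxWidgets (this is my own illustrative helper, not code from the PR): when the string is stored as UTF-16, a supplementary character occupies two code units, so reading one element at a time into a 32-bit variable still delivers surrogate halves that a decoder must combine.

```cpp
#include <string>

// A supplementary character such as U+1F600 is stored as two UTF-16
// code units. Reading element-by-element into a 32-bit type therefore
// still yields surrogate halves, not the code point itself.
inline bool StoredAsSurrogatePair(const std::u16string& s) {
    return s.size() == 2 &&
           s[0] >= 0xD800 && s[0] <= 0xDBFF &&   // high surrogate
           s[1] >= 0xDC00 && s[1] <= 0xDFFF;     // low surrogate
}
```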

The code also uses wxT() everywhere, whose use has, according to the documentation, been discouraged since wxWidgets 2.9.0. I don't know how important it is to change this, but I do know that the cases of the switch statement in XMLEsc use wxT() and cause compiler errors if the type of c is simply changed to wxUniChar. Unfortunately, I'm not aware of an alternative. (And I actually have very little experience with wxWidgets, for that matter; I learned most of this only today.) EDIT: According to this page, wxT() is not only harmless to remove but can simply be removed.

Contributor Author

@Yarn366 Yarn366 commented Jul 19, 2017

In case any developers see this, I posted this issue on audacity-devel (here: https://sourceforge.net/p/audacity/mailman/message/35931978/) almost two weeks ago; it has yet to get a response.

Member

@SteveDaulton SteveDaulton commented Nov 6, 2017

Thanks Yarn. This fix has now been committed.
