Actually use utf8cpp after building it #3354

skef · 2021-11-14T18:12:39Z

I've been trying to raise this utf8cpp problem with the antlr4 team in #3200 and #3280, so far with little luck. So I'm filing this PR as a potential resolution in the hope of moving forward.

Right now, as far as I can tell, the antlr4 build process always either ensures utf8cpp is present on the system or downloads and builds utf8cpp. It then doesn't use utf8cpp by default. This makes little sense. It should either

use utf8cpp by default, or
only find/download utf8cpp when it is configured.

This PR is an attempt to implement option 1.

Note that StringUtils.h is included by parser implementations. This means that if it includes utf8.h then every parser implementation must have utf8cpp's include files in its include path. Given how the runtime is built and used this seems like a huge mess. Accordingly, I de-templatized utf32_to_utf8() (the generality wasn't needed) and moved both it and utf8_to_utf32() down into StringUtils.cpp.

Obviously I don't know why the code got into the state it did or the full implications of moving to default use of utf8cpp. However, if something like this PR doesn't make sense then there should be an alternative PR to turn off the utf8cpp find/download unless it's going to be used.

@mike-lischke Please take a look at this. Once this issue is resolved I'll have a better idea of what to do about #3200 and I may submit a PR for #3280.

Make utf32_to_utf8 and utf8_to_utf32 non-inline to avoid utf8.h include

mike-lischke · 2021-11-28T11:35:42Z

We might have gone the utf8cpp route too hasty, me thinks now. The library codecvt has been deprecated in C++17, yes, but there are already new functions for what we need here (https://en.cppreference.com/w/cpp/locale/codecvt). We are only interested in conversions between std::string <-> std::wstring (for debugging only!) and UTF-8 <-> UTF-32, which still work in C++17 and C++20.

So I wonder if we shouldn't rather remove the 3rd party lib utf8cpp, whose inclusion has created quite some trouble in the past (and which still requires patches like this one, years after including it). What's your take on this @skef? Should we remove utf8cpp?

Regarding your patch: I dislike that utf8cpp will be made the default regardless of the used environment, if that USE_CODECVT_INSTEAD_OF_UTF8 macro is not defined. Keep in mind there are many existing project files (e.g. for Xcode and Visual Studio) that do not have this defined and would hence switch the used UTF transformation APIs on next ANTLR4 upgrade.

skef · 2021-11-28T12:27:56Z

What's your take on this @skef? Should we remove utf8cpp?

I guess my initial take is disappointment, given that my interest is in some fix for #3280 and it seems like both the old and new codecvt-based conversion functions just throw std::range_error without much information. utf8cpp includes methods to identify the location of non-utf8 data; if you settle on the new codecvt interfaces with wstring_convert I guess those would have to be built. And it''s not that clear from the codecvt documentation, at least at first glance, how to go about building them.

Regarding your patch: I dislike that utf8cpp will be made the default regardless of the used environment, if that USE_CODECVT_INSTEAD_OF_UTF8 macro is not defined. Keep in mind there are many existing project files (e.g. for Xcode and Visual Studio) that do not have this defined and would hence switch the used UTF transformation APIs on next ANTLR4 upgrade.

I took the fact that utf8cpp was always either found or built, together with the nodes on deprecation in v17, as implying that it was the desired implementation. If that's not the case that's OK, but it does mean that if there's going to be a fix for #3280 and utf8cpp is optional then the problem will have to be fixed twice, once for each set of underlying conversion methods.

So I guess what I'd recommend is to settle on one of the two while keeping the problem of finding the location of unencode-able data in mind.

mike-lischke · 2021-11-30T08:20:58Z

I see your point, skef. But keeping a 3rd party lib (with all attached problems) only for the lack of good error reporting is pretty awkward. Converting between the different Unicode transformation formats is not rocket science. We could instead write our own conversions, which is just a couple lines of code and would have the best error reporting possible.

So, concluding the discussion, we should either go for an own solution or stay with what we have currently. However, it's very desirable to go with an own solution in the future to get rid of the 3rd party lib.

skef · 2021-11-30T08:40:15Z

Alright.

skef added 2 commits November 14, 2021 08:46

Make utfcpp the default

af9f209

De-templatize utf32_to_utf8

2a6b90e

Make utf32_to_utf8 and utf8_to_utf32 non-inline to avoid utf8.h include

iterumllc mentioned this pull request Nov 23, 2021

Please allow use of system antlr4 to compile adobe-type-tools/afdko#1407

Open

skef closed this Nov 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Actually use utf8cpp after building it #3354

Actually use utf8cpp after building it #3354

skef commented Nov 14, 2021

mike-lischke commented Nov 28, 2021 •

edited

Loading

skef commented Nov 28, 2021

mike-lischke commented Nov 30, 2021

skef commented Nov 30, 2021

Actually use utf8cpp after building it #3354

Actually use utf8cpp after building it #3354

Conversation

skef commented Nov 14, 2021

mike-lischke commented Nov 28, 2021 • edited Loading

skef commented Nov 28, 2021

mike-lischke commented Nov 30, 2021

skef commented Nov 30, 2021

mike-lischke commented Nov 28, 2021 •

edited

Loading