Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actually use utf8cpp after building it #3354

Closed
wants to merge 2 commits into from
Closed

Conversation

skef
Copy link
Contributor

@skef skef commented Nov 14, 2021

I've been trying to raise this utf8cpp problem with the antlr4 team in #3200 and #3280, so far with little luck. So I'm filing this PR as a potential resolution in the hope of moving forward.

Right now, as far as I can tell, the antlr4 build process always either ensures utf8cpp is present on the system or downloads and builds utf8cpp. It then doesn't use utf8cpp by default. This makes little sense. It should either

  1. use utf8cpp by default, or
  2. only find/download utf8cpp when it is configured.

This PR is an attempt to implement option 1.

Note that StringUtils.h is included by parser implementations. This means that if it includes utf8.h then every parser implementation must have utf8cpp's include files in its include path. Given how the runtime is built and used this seems like a huge mess. Accordingly, I de-templatized utf32_to_utf8() (the generality wasn't needed) and moved both it and utf8_to_utf32() down into StringUtils.cpp.

Obviously I don't know why the code got into the state it did or the full implications of moving to default use of utf8cpp. However, if something like this PR doesn't make sense then there should be an alternative PR to turn off the utf8cpp find/download unless it's going to be used.

@mike-lischke Please take a look at this. Once this issue is resolved I'll have a better idea of what to do about #3200 and I may submit a PR for #3280.

Make utf32_to_utf8 and utf8_to_utf32 non-inline to avoid utf8.h include
@mike-lischke
Copy link
Member

mike-lischke commented Nov 28, 2021

We might have gone the utf8cpp route too hasty, me thinks now. The library codecvt has been deprecated in C++17, yes, but there are already new functions for what we need here (https://en.cppreference.com/w/cpp/locale/codecvt). We are only interested in conversions between std::string <-> std::wstring (for debugging only!) and UTF-8 <-> UTF-32, which still work in C++17 and C++20.

So I wonder if we shouldn't rather remove the 3rd party lib utf8cpp, whose inclusion has created quite some trouble in the past (and which still requires patches like this one, years after including it). What's your take on this @skef? Should we remove utf8cpp?

Regarding your patch: I dislike that utf8cpp will be made the default regardless of the used environment, if that USE_CODECVT_INSTEAD_OF_UTF8 macro is not defined. Keep in mind there are many existing project files (e.g. for Xcode and Visual Studio) that do not have this defined and would hence switch the used UTF transformation APIs on next ANTLR4 upgrade.

@skef
Copy link
Contributor Author

skef commented Nov 28, 2021

What's your take on this @skef? Should we remove utf8cpp?

I guess my initial take is disappointment, given that my interest is in some fix for #3280 and it seems like both the old and new codecvt-based conversion functions just throw std::range_error without much information. utf8cpp includes methods to identify the location of non-utf8 data; if you settle on the new codecvt interfaces with wstring_convert I guess those would have to be built. And it''s not that clear from the codecvt documentation, at least at first glance, how to go about building them.

Regarding your patch: I dislike that utf8cpp will be made the default regardless of the used environment, if that USE_CODECVT_INSTEAD_OF_UTF8 macro is not defined. Keep in mind there are many existing project files (e.g. for Xcode and Visual Studio) that do not have this defined and would hence switch the used UTF transformation APIs on next ANTLR4 upgrade.

I took the fact that utf8cpp was always either found or built, together with the nodes on deprecation in v17, as implying that it was the desired implementation. If that's not the case that's OK, but it does mean that if there's going to be a fix for #3280 and utf8cpp is optional then the problem will have to be fixed twice, once for each set of underlying conversion methods.

So I guess what I'd recommend is to settle on one of the two while keeping the problem of finding the location of unencode-able data in mind.

@mike-lischke
Copy link
Member

I see your point, skef. But keeping a 3rd party lib (with all attached problems) only for the lack of good error reporting is pretty awkward. Converting between the different Unicode transformation formats is not rocket science. We could instead write our own conversions, which is just a couple lines of code and would have the best error reporting possible.

So, concluding the discussion, we should either go for an own solution or stay with what we have currently. However, it's very desirable to go with an own solution in the future to get rid of the 3rd party lib.

@skef
Copy link
Contributor Author

skef commented Nov 30, 2021

Alright.

@skef skef closed this Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants