[C++] Add utf8proc library to toolchain #25086

Closed
asfimport opened this issue May 27, 2020 · 12 comments

@asfimport
Collaborator

asfimport commented May 27, 2020

utf8proc is a minimal MIT-licensed library for UTF-8 data processing, originally developed for use in Julia:

https://github.com/JuliaStrings/utf8proc

Reporter: Wes McKinney / @wesm
Assignee: Uwe Korn / @xhochy

Note: This issue was originally created as ARROW-8961. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Uwe Korn / @xhochy:
For conda-forge and other distributions that can handle binary dependencies, we want to use the system one. So we definitely need an ARROW_USE_SYSTEM_UTF8PROC option if we vendor.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
I'll take a look sometime if you don't beat me to it.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
@xhochy I would say it would be worth going ahead and adding utf8proc to conda-forge if it is not there already.

@asfimport
Collaborator Author

Uwe Korn / @xhochy:
It's already there, named libutf8proc.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Ah great. I see that utf8proc includes a 1.5 MB data file, so we shouldn't be too cavalier about vendoring it. If utf8proc is only required when -DARROW_COMPUTE=ON, then perhaps we can just add it as a normal thirdparty toolchain library.

@asfimport
Collaborator Author

Maarten Breddels / @maartenbreddels:
FWIW, in Vaex I've relied on https://github.com/ufal/unilib, which is a very minimal, bare-bones library. I have no strong opinions about this though (unless benchmarks tell me otherwise).

@asfimport
Collaborator Author

Uwe Korn / @xhochy:
We should definitely run benchmarks, as the utf8proc issue tracker mentions that ICU seems to be significantly faster than utf8proc. Still, ICU is much heavier than utf8proc, and we probably need exactly the functionality that is part of utf8proc, not more.
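
A comparison like that could start from a small timing harness along these lines. This is only a sketch: `BenchResult` and `BenchmarkCaseMap` are hypothetical names, not part of Arrow, utf8proc, or ICU, and the harness takes any callable that maps one codepoint to another, so the library-specific call would be wrapped by the caller.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct BenchResult {
  double ns_per_codepoint;  // average wall-clock time per mapped codepoint
  std::uint64_t checksum;   // sum of outputs, so the loop cannot be optimized away
};

// Times `to_lower` over `input`, repeated `repeats` times.
// `to_lower` is any callable taking and returning a codepoint.
template <typename Fn>
BenchResult BenchmarkCaseMap(Fn&& to_lower,
                             const std::vector<std::uint32_t>& input,
                             int repeats) {
  std::uint64_t checksum = 0;
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repeats; ++r) {
    for (std::uint32_t cp : input) {
      checksum += static_cast<std::uint64_t>(to_lower(cp));
    }
  }
  const auto stop = std::chrono::steady_clock::now();

  const double total_ns =
      std::chrono::duration<double, std::nano>(stop - start).count();
  const double mapped = static_cast<double>(input.size()) * repeats;
  return BenchResult{mapped > 0 ? total_ns / mapped : 0.0, checksum};
}
```

Comparing the checksums from two libraries on the same input also gives a cheap sanity check that they agree on the mapping before their timings are compared.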

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
What algorithms would we use in utf8proc? If it's just tolower() and friends, the implementation seems simple and fast to me (and I doubt other libraries would be significantly faster).

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Also, unilib uses a similar lookup scheme, so it's unlikely to be significantly faster (it's actually a bit more complicated, because it seems to compress the data tables more, at the expense of a slightly more complicated lookup).

A concern about unilib, though, would be that it has had a single contributor over its 6 years of existence.
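
To illustrate the kind of table lookup discussed above, here is a minimal sketch of a two-stage codepoint table. It is not the actual layout used by utf8proc or unilib; the `toy` namespace, the table shapes, and the data are assumptions made for illustration, and the toy data only knows the ASCII and Latin-1 uppercase letters.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

namespace toy {  // hypothetical names, for illustration only

constexpr std::uint32_t kBlockBits = 8;
constexpr std::uint32_t kBlockSize = 1u << kBlockBits;  // 256 codepoints per block
constexpr std::uint32_t kMaxCodepoint = 0x10FFFF;
constexpr std::size_t kNumBlocks = (kMaxCodepoint >> kBlockBits) + 1;

struct Tables {
  // Stage 1: block number (codepoint >> 8) -> index of a block in stage 2.
  std::array<std::uint8_t, kNumBlocks> stage1{};
  // Stage 2: per-codepoint delta to the lowercase form.
  // Block 0 is an all-zero block shared by every block without case data.
  std::array<std::array<std::int16_t, kBlockSize>, 2> stage2{};
};

inline Tables BuildTables() {
  Tables t;          // everything starts zeroed: all blocks point at stage2[0]
  t.stage1[0] = 1;   // codepoints 0x00..0xFF get their own stage-2 block
  for (std::uint32_t cp = 'A'; cp <= 'Z'; ++cp) t.stage2[1][cp] = 32;
  for (std::uint32_t cp = 0xC0; cp <= 0xDE; ++cp) {
    if (cp != 0xD7) t.stage2[1][cp] = 32;  // 0xD7 is the multiplication sign
  }
  return t;
}

// Two array reads and an add; no branching on the codepoint's script.
inline std::uint32_t ToLower(std::uint32_t cp) {
  static const Tables tables = BuildTables();
  if (cp > kMaxCodepoint) return cp;
  const std::uint8_t block = tables.stage1[cp >> kBlockBits];
  const std::int16_t delta = tables.stage2[block][cp & (kBlockSize - 1)];
  return static_cast<std::uint32_t>(static_cast<std::int32_t>(cp) + delta);
}

}  // namespace toy
```

Sharing one zero block across all blocks without case data is what keeps the table small; compressing further (for example with smaller blocks or a third stage) trades table size for extra work per lookup.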

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
I've compiled both libraries:

  • utf8proc weighs around 300 kB (mostly static data)
  • the weight of unilib depends on which functionality is being used, as it's header-only; for example, a test executable that uses property lookup and conversion, but not codepoint combining, weighs around 120 kB

@asfimport
Collaborator Author

Wes McKinney / @wesm:
unilib's license (MPL 2.0) isn't ideal, see https://www.apache.org/legal/resolved.html#weak-copyleft-licenses. I'd prefer to only depend on MPL 2.0 libraries as a last resort.

@asfimport
Collaborator Author

Kouhei Sutou / @kou:
Issue resolved by pull request 7452
#7452
