Fast case insensitive sort order and regex #149

Closed
niklas88 opened this issue Nov 13, 2018 · 3 comments

@niklas88
Member

niklas88 commented Nov 13, 2018

We need a way to optimize case-insensitive regex filters such as FILTER regex(?var, "astr", "i"), which should match Astronaut.

This could be solved by always doing a case-insensitive sort, but I think that might be against the spec (e.g. this SO answer suggests it should be case sensitive). Then again, looking at the spec and the linked XPath function, it seems to me that we are free to use any default collation strategy, which would allow us to do a case-insensitive sort.

@niklas88
Member Author

This is now supported through building a case-insensitive index with #209. It remains an open issue to be able to use both case-sensitive and case-insensitive prefix search with the same index.

@joka921
Member

joka921 commented Dec 10, 2019

I have been giving this some thought:
What you are missing is easy to implement, because the matches of the case-sensitive prefix filter are a subset of the matches of the case-insensitive prefix filter (provided that values which are equal when ignoring case are sorted according to their casing; every useful collation strategy should ensure this).
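
The subset relation can be illustrated with a small Python sketch. It is not QLever's implementation: the function names are hypothetical, `str.casefold()` stands in for a real collation, and the upper bound assumes that the code point U+10FFFF never occurs in the data.

```python
from bisect import bisect_left


def sort_key(word):
    # Primary criterion: the case-folded form. Tie-breaker: the original
    # casing, so words that differ only in case end up next to each other.
    return (word.casefold(), word)


def case_insensitive_prefix_range(sorted_words, prefix):
    """Return (lo, hi) such that sorted_words[lo:hi] are exactly the words
    whose case-folded form starts with the case-folded prefix.

    sorted_words must already be sorted with sort_key.
    """
    # A real index would store these keys instead of recomputing them.
    folded = [w.casefold() for w in sorted_words]
    p = prefix.casefold()
    lo = bisect_left(folded, p)
    # Assumption: U+10FFFF does not appear in the data, so p + U+10FFFF
    # is larger than every string that starts with p.
    hi = bisect_left(folded, p + "\U0010FFFF")
    return lo, hi


def case_sensitive_matches(sorted_words, prefix):
    # Every case-sensitive match lies inside the case-insensitive range,
    # so it is enough to filter that range by the exact prefix.
    lo, hi = case_insensitive_prefix_range(sorted_words, prefix)
    return [w for w in sorted_words[lo:hi] if w.startswith(prefix)]
```

With a vocabulary sorted by `sort_key`, a case-insensitive prefix filter is a single contiguous range, and the case-sensitive filter only has to inspect that range.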

However, I would also like to support diacritic-agnostic filtering, and in general I don't like my current solution (it is really hacky, with uppercasing and lowercasing).

What I would really like to do is introduce boost_locale as a dependency, which correctly supports all the features we would need to do all this collation stuff properly.
IMHO this dependency is justified since
a) Properly handling international text (including German umlauts etc.) should be a priority.
b) Some of the things we need (e.g. finding the range of all strings that have the same prefix) cannot be done by std::locale in a way that is correct and portable.

(Some background: Unicode supports exactly what we want: a collation that sorts first by the character 'value' (e), then by added accents (ée), and then by the case (EeÉé). std::locale does not support performing only the first one or two of those comparison steps, since it provides a generic interface that also has to support non-Unicode locales.)
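
The three comparison levels described above can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions: `collation_key`, `strip_accents`, and the `strength` parameter are hypothetical names, code point order replaces real Unicode collation weights, and accents are detected via NFD decomposition. It is not how boost_locale or ICU implement collation.

```python
import unicodedata


def strip_accents(text):
    # Decompose characters (é -> e + combining acute) and drop the
    # combining marks, leaving only the base characters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def collation_key(text, strength=3):
    """Build a sort key that uses only the first `strength` levels.

    Level 1: base characters, ignoring accents and case.
    Level 2: additionally distinguishes accents.
    Level 3: additionally distinguishes case.
    """
    decomposed = unicodedata.normalize("NFD", text)
    primary = strip_accents(text).casefold()
    secondary = decomposed.casefold()
    tertiary = decomposed
    return (primary, secondary, tertiary)[:strength]
```

Sorting with `strength=3` gives a total order in which case only breaks ties between otherwise equal strings, while comparing keys with `strength=1` or `strength=2` gives the diacritic-agnostic and case-agnostic equality that std::locale cannot express through its generic interface.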

@joka921
Member

joka921 commented Apr 19, 2021

This has been solved for some time now using the ICU library.

@joka921 joka921 closed this as completed Apr 19, 2021
QLever automation moved this from To do to Done Apr 19, 2021