JENA-1313: compare using a Collator when both literals are tagged with same language #237
Mimics the behaviour of Dydra described here.
When two strings are tagged with the same language, then instead of simply comparing the text it uses java.text.Collator with that language's locale to compare the strings.
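The idea can be sketched with plain JDK classes (the class and method names here are illustrative, not Jena's):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    // Compare two strings using the collation rules for the given language tag.
    static int compareWithLocale(String a, String b, String langTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Under French collation rules, "côte" sorts before "coter": at primary
        // (accent-insensitive) strength, "cote" is a prefix of "coter".
        System.out.println(compareWithLocale("côte", "coter", "fr") < 0); // true
        // A plain codepoint comparison disagrees: 'ô' > 'o', so "côte" > "coter".
        System.out.println("côte".compareTo("coter") > 0); // true
    }
}
```

This is the kind of sign flip between Collator order and String order that the change is about.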
This does not create a collate:collate function as described in JENA-1313 as a possible solution, but could be still useful for users that expect the sort results to follow the values' language collation.
Needs further tests and discussion.
The collator is applied multiple times, so if there is a mixed set of lang-tagged literals this is inconsistent. It only applies to a same-lang/same-lang comparison, while the comparator may also be called with the same first argument but a different-lang second argument at some other point in the sort.
This runs into the problem of unstable sorts or Java sort crashing out.
We need to switch to sorting by an ordering of (lang, lex), not (lex, lang), not just reorder within same-lang.
I agree with the other commenters, the general order should be (lang, lex) to avoid potentially inconsistent ordering. Also the language tag may not match any Locale. We also need to have unit tests that verify that the code works in corner cases like this.
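For illustration, a (lang, lex) ordering over simple pairs might look like the following; the Literal record is hypothetical, not Jena's class:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LangLexOrder {
    // A hypothetical literal: lexical form plus a (possibly empty) language tag.
    record Literal(String lex, String lang) {}

    // (lang, lex): group by language tag first, then order within each group.
    // Plain string comparisons here; a per-language Collator could refine the
    // second step without breaking consistency, since it only ever compares
    // values carrying the same tag.
    static final Comparator<Literal> LANG_LEX =
            Comparator.comparing(Literal::lang).thenComparing(Literal::lex);

    public static void main(String[] args) {
        List<Literal> values = new ArrayList<>(List.of(
                new Literal("zebra", "en"),
                new Literal("Äpfel", "de"),
                new Literal("apple", "en")));
        values.sort(LANG_LEX);
        // All "de" literals come before all "en" ones, avoiding the inconsistent
        // comparisons that arise from reordering only within a single language.
        System.out.println(values);
    }
}
```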
But what about subtags like
Ack, that makes sense +1
Sure, tests and more defensive programming will come later. Right now looking more for comments on how to sort, where to sort, etc.
Besides typos/misspellings, there are also valid tags such as i-klingon (I believe this is mentioned in some specification linked from the SPARQL spec page). For cases like this I think we would simply try to match against the JVM's available locales and, if none exists, just use normal string comparison.
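A hedged sketch of that fallback (the method name is made up; Jena's actual handling may differ):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class LocaleFallback {
    // Compare using the collator for langTag if the JVM knows that locale;
    // otherwise fall back to plain string comparison, as suggested above.
    static int compare(String a, String b, String langTag) {
        Locale locale = Locale.forLanguageTag(langTag);
        boolean known = Arrays.asList(Locale.getAvailableLocales()).contains(locale);
        if (!known) {
            // Unknown or unsupported tag (e.g. a typo, or an exotic tag):
            // ordinary string order. Note that Collator.getInstance would
            // otherwise silently fall back to the default locale's rules.
            return a.compareTo(b);
        }
        return Collator.getInstance(locale).compare(a, b);
    }
}
```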
So for the examples above, I think the general order would be:
Right? (lang, lex)...
Thanks @afs ! Will try to update the pull request following the comments here so far.
I've been thinking about this, and I can't see how this could produce a usable order when there are several language tags (even subtags) involved. For example, in a multilingual SKOS thesaurus, it's quite likely that there are
Now an ORDER BY that sorts primarily by language tag, then by language-specific collation rules, would order these skos:prefLabels as:
I have a hard time seeing how this order would be useful to anyone. These are all English language words; as a user I don't care whether they are sorted by GB or US collation rules (even if they differed, as in fr-CA vs. fr-FR), but this is clearly worse than the current behavior which sorts first by lexical value, then by language tag.
My conclusion of this thought experiment is that there should be a way to specify the collation order in the ORDER BY statement independent of the language tags of the literals being sorted.
Good points. I believe we want to give users the option to specify the collation and override the language tag then. I think we could, however, still offer this as fall back, in case no collation is specified.
Sorry about the mess. I reverted the previous changes, and wanted to keep everything in the branch history in case we decided to go back that way, but messed up with a
So now this updated pull request follows a different direction. Instead of changing the default behaviour based on language tags, it contains a two-parameter "collation" function. All changes are in ARQ.
Please ignore comments/unit tests/code readability/etc. for now; this pull request is merely a suggestion of an alternative for JENA-1313, and may again be discarded if there are too many problems with this implementation.
The FN_Collation.java contains the code for the new function. The first argument is a locale, used for finding the collator. The second argument to the function is the NodeValue (Expr). What the function does is quite simple (and possibly naïve?): it extracts the string literal from the Expr part, then creates a new NodeValue that contains both the string and the locale.
Further down, NodeValueString was modified as well to keep track of the string's locale. Alternatively, we could create a new NodeValue subtype instead of adding an optional locale (a backward binary-compatible change, as we add but do not change existing methods).
Then, when the SortCondition in the Query is evaluated and the NodeValueString#compare method is called, it checks whether it was given a desired locale. If so, it sorts using that locale.
Notice that this function will always be applied in the String value space in ARQ: even when we have a language tag, it is discarded and only the string is used. Basically, any node with a literal string becomes a NodeValueString when this function is applied to it.
With this, users are able to choose a Collation, overriding any language tags. This way, if your data contains @en and @en-GB, you can decide to use any Collation you desire on your query.
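A minimal sketch of the mechanism described above, assuming a simplified stand-in for NodeValueString (this is not the actual Jena class):

```java
import java.text.Collator;
import java.util.Locale;

// A string value carrying an optional locale; compare uses a Collator only when
// both sides were tagged with the same requested locale.
public class LocalizedString implements Comparable<LocalizedString> {
    final String value;
    final Locale locale; // null means: no collation requested

    LocalizedString(String value, Locale locale) {
        this.value = value;
        this.locale = locale;
    }

    @Override
    public int compareTo(LocalizedString other) {
        if (locale != null && locale.equals(other.locale)) {
            // Same requested collation on both sides: locale-aware order.
            return Collator.getInstance(locale).compare(value, other.value);
        }
        return value.compareTo(other.value); // fallback: plain string order
    }
}
```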
I have a sandbox project here https://github.com/kinow/jena-arq-filter, not really unit tests, but some renamed main methods that I use for experimenting. You can try checking out this pull request, opening in the same workspace both projects, then trying something like this:
The result will be:
If you change the locale for "en", then it will be:
@kinow I think this looks promising! I don't have many comments about the implementation code, but just being able to use
Did you by any chance test this with the performance test case that I wrote up earlier? I'd like to know how it compares to a plain ORDER BY in terms of performance. I can test that myself too when I have a suitable slot of time, but that might take a while since many deadlines are coming up in the next few days...
What is the outcome of ordering mixed language tags and also plain xsd:string?
I still think there is merit in
A possibility is a
Yes - I was agreeing with the approach of a collation function and it being app-decided not fixed by the nature of the data.
If it can be done without being hardwired into
Well remembered. Updated my sandbox to include a JMH test.
Initial version was using the average time. Here are the results.
Then updated it to actually benchmark the throughput.
Throughput shows no difference. Average time was about the same for the minimum, but the average and maximum showed a slight increase when using collation. I think the overhead won't really be noticeable for end users.
Agreed. I thought there should be another way. Having a new
Will update this pull request in the next hours.
Pull request updated. Now we have a new
When the query contains text annotated with multiple languages and you use the arq:collation function, every value is rewritten as a sort key < collation, string >. So say you have values such as "Casa"@es, "Casa"@pt, "Haus"@de, and "House" in your data.
The function will take the string part (i.e. Casa, Casa, Haus, House), discard anything else, and sort everything according to the collation the user provided.
All unit tests pass. It might be interesting to test in Fuseki now, with more elaborate queries. I believe the desired behaviour will work with this change, but it would be nice to have others play a little with the function and check that there are no undesired changes.
The first param, you mean the collation? If we get two NodeValueSortKey with different collation language tags, it will sort the text as normal strings (i.e. ignoring locale-specific collation orders, using the JVM's default String.compareTo behaviour, I think).
Not sure if we could take another action here...
I think so, and add to the function documentation (probably somewhere here ?) about the implementation details, risks, considerations, and so on.
Doesn't this run into the unstable sort issue that @afs cautioned against?
I think it could be avoided by the following logic: If two NodeValueSortKeys have different collation languages, sort them by the collation languages instead of even looking at the text.
This is the (lang, lex) approach discussed earlier, just applied in a slightly different context.
Not sure. I don't think so because of this approach itself, but I tried to find cases where the sort could be unstable, and I think I found one.
Sounds like a plan. Let's wait and see what others think.
Now, on stability...
I tried finding ways the sort could be unstable, but for two values A and B with the same collation, the result would be stable. For two values C and D with different collations, or missing collations, the result would be a sort by the string literal. The node produced would be a
Now here is the interesting part.
I believe this could cause problems: the merge sort would be stable (I think), but using the elements (sorted or not) in a map/set could result in weird behaviours...
Some code to illustrate the above:
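Since the original snippet isn't reproduced here, a minimal hypothetical illustration of the hazard: a compareTo that ignores part of the state used by equals makes a TreeSet silently drop elements that a HashSet keeps.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class InconsistentCompare {
    // compareTo looks only at the text, but the record's generated equals/hashCode
    // also consider the language tag: the inconsistency described above.
    record Key(String text, String lang) implements Comparable<Key> {
        @Override
        public int compareTo(Key o) {
            return text.compareTo(o.text);
        }
    }

    public static void main(String[] args) {
        Key a = new Key("Casa", "es");
        Key b = new Key("Casa", "pt");
        Set<Key> hash = new HashSet<>(List.of(a, b));
        Set<Key> tree = new TreeSet<>(List.of(a, b));
        System.out.println(hash.size()); // 2: equals() distinguishes the tags
        System.out.println(tree.size()); // 1: TreeSet trusts compareTo and drops one
    }
}
```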
I wonder if we should create a new
By using the current implementation, plus @osma's suggestion of comparing the collation language tags, and finally by making sure equals/hashCode agree with what our Comparable says, I believe we would have a stable sort. Thoughts?
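A rough sketch of what that consistency could look like (a hypothetical SortKey, not the class in this pull request): equals/hashCode and compareTo all work over the same (collation, lexical form) pair.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Objects;

public final class SortKey implements Comparable<SortKey> {
    final String collation; // language tag chosen by the query writer
    final String lex;       // lexical form of the literal

    SortKey(String collation, String lex) {
        this.collation = collation;
        this.lex = lex;
    }

    @Override
    public int compareTo(SortKey o) {
        // Different collations: order by the collation tag first, i.e. the
        // (lang, lex) ordering discussed earlier.
        int c = collation.compareTo(o.collation);
        if (c != 0) {
            return c;
        }
        return Collator.getInstance(Locale.forLanguageTag(collation)).compare(lex, o.lex);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SortKey s && collation.equals(s.collation) && lex.equals(s.lex);
    }

    @Override
    public int hashCode() {
        return Objects.hash(collation, lex);
    }
}
```

One caveat: a Collator at reduced strength can report two distinct lexical forms as equal, which would again make compareTo disagree with equals; the default (tertiary) strength makes that unlikely, but it is worth a unit test.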
I'm not so worried about unstable sorts when using this collation-function approach. It is possible to write bad functions anyway (example: ORDER BY 1).
I would worry if it was built into ORDER BY. By having the query writer ask for a certain collation, the responsibility is passed to the query writer to use it when valid. Basically, don't misuse it on mixed data. Use something like ORDER BY lang(?x) arq:collate(lang(?x), ?x) if necessary.