Fix reduce view row collation with unicode equivalent keys #3783

nickva · 2021-10-13T06:16:43Z

Previously, view reduce collation with keys relied on the keys in the rows returned from the view shards exactly matching (=:=) the keys specified in the args. However, when multiple rows compare equal under the unicode collator, the key on a returned row may not be byte-identical to the requested key.

In that case, when the rows are fetched from the row dict by key, they should be matched using the same collation algorithm as the one used on the view shards.
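As a rough illustration of the difference (a Python sketch, not CouchDB's Erlang code), the lookup below compares an exact-match fetch with a collation-aware one. The names `collation_key`, `find_rows` and the list-of-pairs `row_dict` are hypothetical, and NFC normalisation is only an assumed stand-in for the ICU collator that CouchDB actually uses.

```python
import unicodedata

def collation_key(key):
    """Stand-in for the ICU collator: NFC-normalise every string in a key.

    Assumption: CouchDB compares keys with ICU collation, not plain NFC;
    normalisation is used here only because it ships with Python.
    """
    if isinstance(key, str):
        return unicodedata.normalize("NFC", key)
    if isinstance(key, list):
        return [collation_key(part) for part in key]
    return key

def find_rows(row_dict, wanted_key):
    """Return the rows whose key is collation-equal to wanted_key.

    row_dict is a hypothetical list of (key, rows) pairs, because JSON
    array keys are not hashable in Python.
    """
    target = collation_key(wanted_key)
    matches = []
    for key, rows in row_dict:
        if collation_key(key) == target:
            matches.extend(rows)
    return matches

# The shard returned its row under the decomposed form of the key.
row_dict = [(["file", "chai\u0302ne"], [1])]
# The request asked for the precomposed form.
requested = ["file", "cha\u00eene"]

exact = [rows for key, rows in row_dict if key == requested]  # exact match
collated = find_rows(row_dict, requested)                     # collation match
print(exact, collated)
```

With exact matching the requested key finds nothing, while the collation-aware lookup finds the row stored under the equivalent key.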

nono · 2021-10-22T08:56:51Z

@jcoglan what do you think of this PR?

jcoglan · 2021-10-22T13:31:24Z

What behaviour does this produce for the examples in #3773? In that issue we have two stored view rows whose keys have different bytes, representing different code points, but which can be considered equal under Unicode normalisation.

- one key (call this key A) has bytes c3 ae, representing the code point U+00EE
- the other (key B) has bytes 69 cc 82, representing the code points U+0069 U+0302
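The byte sequences and their equivalence can be checked with a few lines of Python. This is only a sanity check using the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

key_a = "\u00ee"    # precomposed: LATIN SMALL LETTER I WITH CIRCUMFLEX
key_b = "i\u0302"   # decomposed: "i" followed by COMBINING CIRCUMFLEX ACCENT

# UTF-8 encodings, shown as space-separated hex bytes.
bytes_a = key_a.encode("utf-8").hex(" ")
bytes_b = key_b.encode("utf-8").hex(" ")

# The raw strings differ, but they agree once both are NFC-normalised.
same_raw = key_a == key_b
same_nfc = (unicodedata.normalize("NFC", key_a)
            == unicodedata.normalize("NFC", key_b))

print(bytes_a, "|", bytes_b, "|", same_raw, "|", same_nfc)
```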

When these keys are retrieved from the same shard they're considered equal when reducing, but not when fetched from different shards. This meant that:

- with q=1, querying the view with no filter gives 1 row. Querying for key A gives 1 row and key B gives 0 rows.
- with q=2, querying the view with no filter gives 2 rows. Querying for key A gives 1 row and key B also gives 1 row.

What behaviour does this patch give for these scenarios? Would it be possible to write tests for them?

jcoglan · 2021-10-22T13:35:04Z

For clarity I'm referring to the queries performed by these requests:

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true"

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'

jcoglan · 2021-10-22T13:55:52Z

Seems to me that we either want:

- the keys are considered equal, in which case all three queries return a single row with value 2
- the keys are considered different, in which case the first query returns two rows with value 1, and the latter two queries return one row with value 1

And we want these results to be independent of q.
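A small model of the first option (keys considered equal) shows what "independent of q" means in practice. This is a sketch under stated assumptions: `norm`, `reduce_rows` and `query` are hypothetical helpers, NFC normalisation stands in for the ICU collator, keys are tuples rather than JSON arrays, and the reduce is assumed to be a count whose re-reduce is a sum.

```python
import unicodedata

def norm(key):
    # Assumption: NFC normalisation approximates "equal under the collator".
    return tuple(unicodedata.normalize("NFC", part) for part in key)

def reduce_rows(rows):
    """Group (key, value) rows by collation-equal key and sum the values.

    The first key seen in each group is the one reported, un-normalised.
    """
    groups = {}
    for key, value in rows:
        shown_key, total = groups.get(norm(key), (key, 0))
        groups[norm(key)] = (shown_key, total + value)
    return list(groups.values())

def query(shards):
    """Reduce on each shard, then merge the shard results the same way."""
    partials = [row for shard in shards for row in reduce_rows(shard)]
    return reduce_rows(partials)

key_a = ("file", "cha\u00eene")    # precomposed form
key_b = ("file", "chai\u0302ne")   # decomposed form

q1 = query([[(key_a, 1), (key_b, 1)]])      # q=1: both rows on one shard
q2 = query([[(key_a, 1)], [(key_b, 1)]])    # q=2: one row per shard

print([value for _, value in q1], [value for _, value in q2])
```

Because the shard-level grouping and the merge use the same notion of key equality, both layouts produce a single row with value 2.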

jcoglan · 2021-10-22T14:02:20Z

It looks like #3773 (comment) is consistent with the first case I mention above. Do we mind that the results contain unnormalised keys? In @nickva's example:

--- c h a i n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

--- c h a i ^ n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

The first result has c3 ae and the second has 69 cc 82 in the "key" field. I'm wondering if applications could get confused by this, if CouchDB has considered those strings to be equal but doesn't normalise them in its results. It's quite possible we should leave those bytes alone though and not transform strings passed to us by the application.

nickva · 2021-10-22T20:15:06Z

> if CouchDB has considered those strings to be equal but doesn't normalise them in its results.

CouchDB currently doesn't normalize JSON keys in views: not when updating the view, and not for the start/end keys or key dicts when querying. Perhaps we should, but I think that's a larger decision, as it would affect compatibility with existing views.

CouchDB relies on unicode comparisons only (less(A,B) -> -1 | 0 | 1). That function is used on each shard as the base BTree ordering function, and in the coordinator (fabric) when merging results. The primary issue is that there was previously a subtle difference between how the function worked on each shard and how it worked in the coordinator.
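To make the three-way comparison concrete, here is a hedged Python sketch of a `less`-style function. It is not the real implementation: comparing NFC-normalised strings by code point is an assumption standing in for ICU collation, which in general orders strings differently.

```python
import unicodedata
from functools import cmp_to_key

def less(a, b):
    """Three-way comparison returning -1, 0 or 1.

    Assumption: code point order on NFC-normalised strings replaces the
    ICU collation rules used by CouchDB.
    """
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)

precomposed = "cha\u00eene"
decomposed = "chai\u0302ne"

cmp_result = less(precomposed, decomposed)   # comparator verdict
exact_equal = precomposed == decomposed      # byte-for-byte verdict

# The same comparator can drive ordering as well as equality checks.
ordered = sorted(["chat", decomposed, "chaise"], key=cmp_to_key(less))
print(cmp_result, exact_equal, ordered)
```

The comparator reports 0 for the two forms even though an exact comparison says they differ, which is the kind of mismatch that arises when one layer uses the comparator and another uses exact matching.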

jcoglan · 2021-10-25T08:25:27Z

@nickva I think you're right, normalising keys in output could be a significant behaviour change so something that requires more thought and planning, not something to roll into this fix 👍

nickva mentioned this pull request Oct 13, 2021

Weird encoding issue in view keys #3773

Closed

nickva force-pushed the fix-reduce-collation-bug branch 2 times, most recently from a1b7385 to 4da057a October 21, 2021 15:14

janl added this to the 3.2.1 milestone Oct 30, 2021

nickva force-pushed the fix-reduce-collation-bug branch from f5dedf7 to 8b08449 November 1, 2021 15:04

janl merged commit 77b4402 into 3.x Nov 1, 2021

janl deleted the fix-reduce-collation-bug branch November 1, 2021 15:47

kocolosk mentioned this pull request Nov 4, 2021

Grouped reductions break ICU collation #2008

Closed
