Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reduce view row collation with unicode equivalent keys #3783

Merged
merged 1 commit into from Nov 1, 2021

Conversation

nickva
Copy link
Contributor

@nickva nickva commented Oct 13, 2021

Previously, view reduce collation with keys relied on the keys in the rows returned from the view shards to exactly match (=:=) the keys specified in the args. However, in the case when there are multiple rows which compare equal with the unicode collator, that may not always be the case.

In that case when the rows are fetched from the row dict by key, they should be matched using the same collation algorithm as the one used on the view shards.

@nickva nickva force-pushed the fix-reduce-collation-bug branch 2 times, most recently from a1b7385 to 4da057a Compare October 21, 2021 15:14
@nono
Copy link

nono commented Oct 22, 2021

@jcoglan what do you think of this PR?

@jcoglan
Copy link
Contributor

jcoglan commented Oct 22, 2021

What behaviour does this produce for the examples in #3773? In that issue we have two stored view rows with keys that have different bytes, representing different codepoints, but which can be viewed equal under Unicode normalisation.

  • one key (call this key A) has bytes c3 ae representing codepoints U+00EE
  • the other (key B) has bytes 69 cc 82 representing codepoints U+0069 U+0302

When these keys are retrieved from the same shard they're considered equal when reducing, but not when fetched from different shards. This meant that:

  • with q=1, querying the view with no filter gives 1 row. Querying for key A gives 1 row and key B gives 0 rows.
  • with q=2, querying the view with no filter gives 2 rows. Querying for key A gives 1 row and key B also gives 1 row.

What behaviour does this patch give for these scenarios? Would it be possible to write tests for them?

@jcoglan
Copy link
Contributor

jcoglan commented Oct 22, 2021

For clarity I'm referring to the queries performed by these requests:

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true"

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'

curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'

@jcoglan
Copy link
Contributor

jcoglan commented Oct 22, 2021

Seems to me that we either want:

  • the keys are considered equal, in which case all three queries return a single row with value 2
  • the keys are considered different, in which case the first query returns two rows with value 1, and the latter two queries return one row with value 1

And we want these results to be independent of q.

@jcoglan
Copy link
Contributor

jcoglan commented Oct 22, 2021

It looks like #3773 (comment) is consistent with the first case I mention above. Do we mind that the results contain unnormalised keys? In @nickva's example:

--- c h a i n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

--- c h a i ^ n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

The first result has c3 ae and the second has 69 cc 82 in the "key" field. I'm wondering if applications could get confused by this, if CouchDB has considered those strings to be equal but doesn't normalise them in its results. It's quite possible we should leave those bytes alone though and not transform strings passed to us by the application.

@nickva
Copy link
Contributor Author

nickva commented Oct 22, 2021

if CouchDB has considered those strings to be equal but doesn't normalise them in its results.

CouchDB currently doesn't normalize json keys in the views, neither when updating the view or the start/end keys or key dicts when querying. Perhaps we should do it, but I think that's a larger decision to be made, as it would involve compatibility with existent views.

CouchDB relies on unicode comparisons only (less(A,B) -> -1 | 0 | 1). That function is used on each shard as base BTree ordering function, and used in the coordinator (fabric) when merging results. The primary issue that previously there was a subtle difference in how the functions worked on each shard vs how it works in the coordinator.

@jcoglan
Copy link
Contributor

jcoglan commented Oct 25, 2021

@nickva I think you're right, normalising keys in output could be a significant behaviour change so something that requires more thought and planning, not something to roll into this fix 👍

@janl janl added this to the 3.2.1 milestone Oct 30, 2021
Previously, view reduce collation with keys relied on the keys in the
rows returned from the view shards to exactly match (=:=) the keys
specified in the args. However, in the case when there are multiple
rows which compare equal with the unicode collator, that may not
always be the case.

In that case when the rows are fetched from the row dict by key, they
should be matched using the same collation algorithm as the one used
on the view shards.
@janl janl merged commit 77b4402 into 3.x Nov 1, 2021
@janl janl deleted the fix-reduce-collation-bug branch November 1, 2021 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants