Weird encoding issue in view keys #3773
Comments
That looks like a collation bug. We use the ICU library for comparisons when view rows are ordered with respect to each other. It could be that the collation library is too old or, more likely, that at some point we compare or merge rows using binary comparisons instead of the ICU order.
Hmm, I could not reproduce it with latest 3.x branch. Erlang 20, MacOS, libicu 59:
I slightly tweaked your script https://gist.github.com/nickva/e351e678fc10d3b5424de44be992703c but ensured it still preserves the encoded values:
Can you determine the versions of Erlang and libicu in use? And you're definitely not using the "raw" collation option? Another idea is to try with the latest 3.1.1 version, perhaps on a different OS...
We had the problem on Debian 9 stretch with CouchDB 2.3.0 from the Debian package found at https://apache.jfrog.io/ui/native/couchdb-deb/dists/stretch. Erlang comes with that package and is version 8.3.5.
Well, CouchDB from the upstream package seems to use the system's library, so it's libICU 57.1:
Could it be related to clustering? We don't see this issue when trying the script on a single node server with debian stretch and the official debian package. |
Some more tests from our side: on Debian 9 stretch with the official CouchDB deb package, the bug does not show with the default config (standalone mode, no clustering, n=1) BUT, if I add Steps to reproduce:
So, we can reproduce with:
#!/bin/sh
COUCH_URL="http://localhost:5984"
curl -s -X DELETE "$COUCH_URL/debug"
sleep 1
curl -s -X PUT "$COUCH_URL/debug?n=1&q=1"
sleep 1
curl -s -X PUT "$COUCH_URL/debug/_design/by-type-name" -d '{ "views": { "by-type-name": { "map": "function (doc) { emit([doc.type, doc.name]) }", "reduce": "_count" } } }'
curl -s -X PUT "$COUCH_URL/debug/doc1" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'
curl -s -X PUT "$COUCH_URL/debug/doc2" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'
curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'
I have 0 rows in the response of the last request, but I would definitely expect a row.
I can also reproduce the problem on Debian 10 buster with the CouchDB 3.1.2 Debian package.
This is James from Neighbourhoodie; CozyCloud contacted us for assistance with this issue. So far I've confirmed I can repro this issue, on macOS 11.6 with CouchDB 3.1.1 and Erlang 24. With the DB defaults ( |
It's also interesting that querying the view without
... but only a single row with
So it does seem as though if these two rows are stored in the same shard, their keys are considered equal and they get merged, but not if they're (potentially) stored in different shards. I also tried running this example with the
Something curious is going on here: both documents are stored distinctly and produce two distinct view rows, the |
So, the problem is not happening when view rows are stored -- two rows are always present. The problem is that if the rows are in the same shard, then reductions over them might consider their keys equal. Wondering if this has anything to do with how intermediate reduction results are stored in view B-trees (docs)? |
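To make the same-shard versus different-shard observation concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not CouchDB's actual Erlang code: `collation_key` uses NFC normalization as a stand-in for ICU collation equality, `shard_reduce` stands in for a per-shard `_count` reduction, and `coordinator_merge` is a hypothetical coordinator that matches keys by exact string equality.

```python
import unicodedata

# The same visible name in two encodings (assumed to match the two documents).
NFC_NAME = "cha\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}ne"  # precomposed î
NFD_NAME = "chai\N{COMBINING CIRCUMFLEX ACCENT}ne"          # i + combining accent


def collation_key(s):
    # Assumption: NFC normalization approximates "equal under ICU collation".
    return unicodedata.normalize("NFC", s)


def shard_reduce(names):
    """Toy _count inside one shard: keys that collate equal are merged,
    keeping the spelling of whichever key was seen first."""
    rows = {}
    for name in names:
        k = collation_key(name)
        if k in rows:
            first_seen, count = rows[k]
            rows[k] = (first_seen, count + 1)
        else:
            rows[k] = (name, 1)
    return list(rows.values())


def coordinator_merge(shard_results):
    """Toy coordinator: merges shard rows by exact string equality only."""
    merged = {}
    for rows in shard_results:
        for key, count in rows:
            merged[key] = merged.get(key, 0) + count
    return sorted(merged.items())


# Both rows land in one shard: they are merged into one row.
same_shard = coordinator_merge([shard_reduce([NFC_NAME, NFD_NAME])])
# Rows land in different shards: the coordinator sees two distinct keys.
split_shards = coordinator_merge([shard_reduce([NFC_NAME]), shard_reduce([NFD_NAME])])

print(len(same_shard), len(split_shards))  # prints: 1 2
```

In this model the number of rows depends only on where the documents happen to be stored, which is the inconsistency described above.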
Great analysis, @jcoglan. I think you may be right that it has to do with how intermediate results are stored and how the unicode collator compares them. We had a recent fix that may be related, 4f33f14; there we noticed that the collation rules on the shards are different from the collation rules used when aggregating rows in the coordinator (fabric). I wonder which representation is the correct one: should these two rows be considered equal (does unicode collation consider them equivalent)? Or is it correct that they would be emitted as separate rows? If the first is correct, then it could be that the coordinator reduce step (in fabric) has a bug where it matches keys exactly instead of using unicode collation.
Collation would depend on your locale, in general. But normalisation would at least let you decide that As @nono says there's no obvious correct choice here, the problem is the inconsistent behaviour depending on
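As a small illustration of the normalisation point, the following Python sketch uses the standard `unicodedata` module to show that the two encodings of the name differ byte for byte but become equal once both are normalized. The helper names `utf8_hex` and `canonically_equal` are made up for this example.

```python
import unicodedata

name = "cha" + "\u00ee" + "ne"  # "chaîne" with a precomposed î

nfc = unicodedata.normalize("NFC", name)  # î as one code point, U+00EE
nfd = unicodedata.normalize("NFD", name)  # i followed by U+0302


def utf8_hex(s):
    # Hex dump of the UTF-8 bytes, to compare the two encodings.
    return s.encode("utf-8").hex()


def canonically_equal(a, b):
    # Compare after normalizing both sides to the same form.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(utf8_hex(nfc))                # 636861c3ae6e65   (î is c3 ae)
print(utf8_hex(nfd))                # 63686169cc826e65 (î is 69 cc 82)
print(nfc == nfd)                   # False: different code points
print(canonically_equal(nfc, nfd))  # True: same string after normalization
```

Normalization only answers whether two strings are canonically equivalent; how they sort relative to other strings is still a collation question.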
I did wonder if this would have anything to do with those strings being round-tripped through JS, but that ought to preserve the codepoints that are present even if JS uses a different internal byte encoding (the bytes in examples here are UTF-8). As an aside, I made some notes about string encoding/comparison when I first picked up CouchDB as I was curious about whether JS would affect how strings get sorted. |
I think the issue is in the logic where we match reduce rows by keys. When there are keys which are effectively equivalent under unicode collation rules, the worker might return the 69cc82 row with value 2 already reduced, but if the requested key is c3ae then it won't be found and we'd get 0 rows in the result. Instead, we would like it to compare the rows in the row dict not by exact matching, but using the same collation algorithm as the one used when building the view. I made an attempt here: #3783. With that PR and my altered reproducer script I get:
For both q=1 and q=2 cases |
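The difference between exact matching and collation-aware matching of requested keys can be sketched in Python as follows. This is not the code from the PR. `find_exact` and `find_collated` are hypothetical helpers, and NFC normalization again stands in for the ICU collation comparison that the view itself uses.

```python
import unicodedata

NFC_NAME = "cha\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}ne"  # UTF-8 contains c3 ae
NFD_NAME = "chai\N{COMBINING CIRCUMFLEX ACCENT}ne"          # UTF-8 contains 69 cc 82


def collation_key(key):
    # Assumption: view keys are lists of strings; NFC approximates ICU equality.
    return [unicodedata.normalize("NFC", part) for part in key]


def find_exact(rows, requested_key):
    # Exact match on the key: misses rows whose key is merely equivalent.
    return [(k, v) for k, v in rows if k == requested_key]


def find_collated(rows, requested_key):
    # Match with the same notion of equality used when the rows were reduced.
    wanted = collation_key(requested_key)
    return [(k, v) for k, v in rows if collation_key(k) == wanted]


# The worker has already reduced both documents into one row under the NFD key.
worker_rows = [(["file", NFD_NAME], 2)]
requested = ["file", NFC_NAME]

print(len(find_exact(worker_rows, requested)))     # prints: 0
print(len(find_collated(worker_rows, requested)))  # prints: 1
```

With exact matching the request for the NFC key returns no rows, as in the report; with collation-aware matching it finds the already-reduced row with value 2.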
CouchDB has changed the way several documents with the same field in different encoding are returned in view requests. But, it wasn't what I was expecting, and with CouchDB 3.2.1+, it wasn't possible to change a file or directory name to a new name when just the encoding has changed (eg NFC -> NFD). Cf apache/couchdb#3773
Description
I have two documents with the same value for the field name, but not encoded in the same way. On our production cluster, when I request the view with this name and one encoding, I got no response.
Steps to Reproduce
Expected Behaviour
I don't really know if I expect the two results to be merged or not. I would accept that both requests return 1 row (with the same key byte per byte). I would also accept that both requests return 2 rows (same string with unicode normalization).
But at least, I know that returning 0 rows in one response when we have a document with the exact byte per byte string looks wrong to me.
Your Environment
{"couchdb":"Welcome","version":"2.3.0","git_sha":"07ea0c7","uuid":"c479fe2120631815755a0e4106dfcea0","features":["pluggable-storage-engines","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
Additional Context
I can't reproduce this issue on my computer when using the same 2.3.0 version of CouchDB via the official Docker image.