Weird encoding issue in view keys #3773

nono · 2021-10-01T11:58:40Z

Description

I have two documents with the same value for the field name, but not encoded in the same way. On our production cluster, when I query the view with this name in one of the two encodings, I get no rows in the response.

Steps to Reproduce

#!/bin/sh
COUCH_URL="http://localhost:5984"

curl -s -X DELETE "$COUCH_URL/debug"
sleep 1
curl -s -X PUT "$COUCH_URL/debug"
sleep 1
curl -s -X PUT "$COUCH_URL/debug/_design/by-type-name" -d '{ "views": { "by-type-name": { "map": "function (doc) { emit([doc.type, doc.name]) }", "reduce": "_count" } } }'
curl -s -X PUT "$COUCH_URL/debug/doc1" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'
curl -s -X PUT "$COUCH_URL/debug/doc2" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'

echo 'We can see that "chaîne" is encoded one time as 69cc82, and one time as c3ae'
curl -s "$COUCH_URL/debug/_all_docs?include_docs=true" | xxd

echo 'See what is in the view'
curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" | xxd

echo 'Request the view, one time for each encoding'
curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}' | xxd
curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}' | xxd
echo 'Expected: 1 row in each response, but I got 2 rows in first response and 0 on the second'
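
Because the two spellings of "chaîne" look identical on screen and can be silently normalized by an editor or by copy/paste, the two values can also be built from explicit byte escapes. The following is a minimal sketch and not part of the original script; the names `NFC`, `NFD`, and `hex` are made up for illustration.

```shell
#!/bin/sh
# Build both spellings of "chaîne" from octal escapes so the exact bytes
# survive copy/paste and editor normalization.
NFC=$(printf 'cha\303\256ne')     # U+00EE         -> bytes c3 ae
NFD=$(printf 'chai\314\202ne')    # U+0069 U+0302  -> bytes 69 cc 82

# Illustrative helper: print a string as a run of lowercase hex bytes.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d ' \n'; }

echo "NFC: $(hex "$NFC")"
echo "NFD: $(hex "$NFD")"
```

Either variable can then be interpolated into the `-d` bodies of the script, for example `-d "{ \"type\": \"file\", \"name\": \"$NFD\" }"`.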

Expected Behaviour

I don't really know if I expect the two results to be merged or not. I would accept that both requests return 1 row (with the same key byte per byte). I would also accept that both requests return 2 rows (same string with unicode normalization).

But at least, I know that returning 0 rows in one response when we have a document with the exact byte per byte string looks wrong to me.

Your Environment

CouchDB version used: {"couchdb":"Welcome","version":"2.3.0","git_sha":"07ea0c7","uuid":"c479fe2120631815755a0e4106dfcea0","features":["pluggable-storage-engines","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
Operating system and version: Debian stable

Additional Context

I can't reproduce this issue on my computer with the same 2.3.0 version of CouchDB from the official Docker image.

nickva · 2021-10-01T21:25:40Z

That looks like a collation bug. We use the ICU library for comparisons when view rows are ordered with respect to each other. It could be that the collation library is too old or, more likely, that at some point we compare or merge rows using binary comparison instead of ICU order.

nickva · 2021-10-01T22:05:32Z

Hmm, I could not reproduce it with the latest 3.x branch. Erlang 20, macOS, libicu 59:

otool -L couch_ejson_compare.so
couch_ejson_compare.so:
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)
	/usr/local/opt/icu4c/lib/libicuuc.59.dylib (compatibility version 59.0.0, current version 59.1.0)
	/usr/local/opt/icu4c/lib/libicudata.59.1.dylib (compatibility version 59.0.0, current version 59.1.0)
	/usr/local/opt/icu4c/lib/libicui18n.59.dylib (compatibility version 59.0.0, current version 59.1.0)

I slightly tweaked your script (https://gist.github.com/nickva/e351e678fc10d3b5424de44be992703c) but made sure it still preserves the encoded values:

 % ./collation_bug_view.sh
{"ok":true}
{"ok":true}
{"ok":true,"id":"_design/by-type-name","rev":"1-c5fc5a56efeddb94b1c3a3de0d25f7dd"}
{"ok":true,"id":"doc1","rev":"1-b269c8b395d44f4a054ad149f232886c"}
{"ok":true,"id":"doc2","rev":"1-af18b03890d4f08e8e41b720f27c11fa"}
We can see that "chaîne" is encoded one time as 69cc82, and one time as c3ae

00000000: 2020 2020 2020 2020 226e 616d 6522 3a20          "name":
00000010: 2263 6861 c3ae 6e65 220a 2020 2020 2020  "cha..ne".
00000020: 2020 226e 616d 6522 3a20 2263 6861 69cc    "name": "chai.
00000030: 826e 6522 0a                             .ne".

Request the view, one time for each encoding

--- c h a i n e ---
{"rows":[
{"key":["file","chaîne"],"value":1}
]}

--- c h a i ^ n e ---
{"rows":[
{"key":["file","chaîne"],"value":1}
]}

Expected: 1 row in each response, but I got 2 rows in first response and 0 on the second

See if you can determine the version of Erlang and libicu used?

And you're definitely not using the "raw" collation option?

Another idea is to try with the latest 3.1.1 version, perhaps on a different OS...

sblaisot · 2021-10-04T07:57:15Z

We had the problem on Debian 9 (stretch) with CouchDB 2.3.0 from the Debian package found at https://apache.jfrog.io/ui/native/couchdb-deb/dists/stretch.

Erlang comes with that package and is version 8.3.5
System's libICU is version 57.1 (I'm not sure couchdb from debian package uses system's library but I can't find any libICU installed with that package)

sblaisot · 2021-10-04T08:01:43Z

Well, CouchDB from the upstream package seems to use the system library, so it's libICU 57.1:

# ldd /opt/couchdb/lib/couch-2.3.0-RC1/priv/couch_ejson_compare.so | grep -i icu
	libicuuc.so.57 => /usr/lib/x86_64-linux-gnu/libicuuc.so.57 (0x00007f7d4a74c000)
	libicudata.so.57 => /usr/lib/x86_64-linux-gnu/libicudata.so.57 (0x00007f7d48ccf000)
	libicui18n.so.57 => /usr/lib/x86_64-linux-gnu/libicui18n.so.57 (0x00007f7d48854000)

nono · 2021-10-04T09:30:38Z

Could it be related to clustering? We don't see this issue when trying the script on a single node server with debian stretch and the official debian package.

sblaisot · 2021-10-04T13:27:58Z

Some more tests from our side, on Debian 9 (stretch) with the official CouchDB deb package.

The bug does not show up with the default config (standalone mode, no clustering, n=1).

BUT, if I add q = 1 to the [cluster] section of the config, I can reproduce the problem 100% of the time.

Steps to reproduce:

1. Install CouchDB 2.3.0 from the Debian package on Debian stretch, selecting standalone mode.
2. Add q = 1 in /opt/couchdb/etc/default.d/5-single-node.ini.
3. Restart CouchDB.
4. Wait 10 seconds to let it start.
5. Run the script above.

nono · 2021-10-04T13:35:37Z

So, we can reproduce with:

#!/bin/sh
COUCH_URL="http://localhost:5984"
curl -s -X DELETE "$COUCH_URL/debug"
sleep 1
curl -s -X PUT "$COUCH_URL/debug?n=1&q=1"
sleep 1
curl -s -X PUT "$COUCH_URL/debug/_design/by-type-name" -d '{ "views": { "by-type-name": { "map": "function (doc) { emit([doc.type, doc.name]) }", "reduce": "_count" } } }'
curl -s -X PUT "$COUCH_URL/debug/doc1" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'
curl -s -X PUT "$COUCH_URL/debug/doc2" -H "Content-Type: application/json" -d '{ "type": "file", "name": "chaîne" }'
curl -s "$COUCH_URL/debug/_design/by-type-name/_view/by-type-name?group=true" -H "Content-Type: application/json" -d '{"keys": [["file", "chaîne"]]}'

I get 0 rows in the response to the last request, but I would definitely expect one row.

sblaisot · 2021-10-04T13:56:44Z

I can also reproduce the problem on Debian 10 (buster) with the CouchDB 3.1.2 Debian package.

jcoglan · 2021-10-07T14:18:14Z

This is James from Neighbourhoodie; CozyCloud contacted us for assistance with this issue. So far I've confirmed I can repro this issue, on macOS 11.6 with CouchDB 3.1.1 and Erlang 24.

With the DB defaults (n=1, q=2) I get 1 row for each query. With q=1 I get zero rows for the second query -- the one with key bytes 69 cc 82, representing codepoints U+0069 U+0302.

jcoglan · 2021-10-07T14:26:52Z

It's also interesting that querying the view without keys lists two rows with q=2:

See what is in the view
00000000  7b 22 72 6f 77 73 22 3a  5b 0d 0a 7b 22 6b 65 79  |{"rows":[..{"key|
00000010  22 3a 5b 22 66 69 6c 65  22 2c 22 63 68 61 c3 ae  |":["file","cha..|
00000020  6e 65 22 5d 2c 22 76 61  6c 75 65 22 3a 31 7d 2c  |ne"],"value":1},|
00000030  0d 0a 7b 22 6b 65 79 22  3a 5b 22 66 69 6c 65 22  |..{"key":["file"|
00000040  2c 22 63 68 61 69 cc 82  6e 65 22 5d 2c 22 76 61  |,"chai..ne"],"va|
00000050  6c 75 65 22 3a 31 7d 0d  0a 5d 7d 0a              |lue":1}..]}.|
0000005c

... but only a single row with q=1. The row returned has key bytes c3 ae, representing codepoint U+00EE. Note that its value is 2, indicating that there are actually two rows stored in the view but their keys have been considered equal by the _count function.

See what is in the view
00000000  7b 22 72 6f 77 73 22 3a  5b 0d 0a 7b 22 6b 65 79  |{"rows":[..{"key|
00000010  22 3a 5b 22 66 69 6c 65  22 2c 22 63 68 61 c3 ae  |":["file","cha..|
00000020  6e 65 22 5d 2c 22 76 61  6c 75 65 22 3a 32 7d 0d  |ne"],"value":2}.|
00000030  0a 5d 7d 0a                                       |.]}.|
00000034

So it does seem as though if these two rows are stored in the same shard, their keys are considered equal and they get merged, but not if they're (potentially) stored in different shards. I also tried running this example with the reduce and group operations removed just to see what rows we'd get. With q=1 or q=2 I get the same results; all queries return two rows:

See what is in the view
00000000  7b 22 74 6f 74 61 6c 5f  72 6f 77 73 22 3a 32 2c  |{"total_rows":2,|
00000010  22 6f 66 66 73 65 74 22  3a 30 2c 22 72 6f 77 73  |"offset":0,"rows|
00000020  22 3a 5b 0d 0a 7b 22 69  64 22 3a 22 64 6f 63 31  |":[..{"id":"doc1|
00000030  22 2c 22 6b 65 79 22 3a  5b 22 66 69 6c 65 22 2c  |","key":["file",|
00000040  22 63 68 61 c3 ae 6e 65  22 5d 2c 22 76 61 6c 75  |"cha..ne"],"valu|
00000050  65 22 3a 6e 75 6c 6c 7d  2c 0d 0a 7b 22 69 64 22  |e":null},..{"id"|
00000060  3a 22 64 6f 63 32 22 2c  22 6b 65 79 22 3a 5b 22  |:"doc2","key":["|
00000070  66 69 6c 65 22 2c 22 63  68 61 69 cc 82 6e 65 22  |file","chai..ne"|
00000080  5d 2c 22 76 61 6c 75 65  22 3a 6e 75 6c 6c 7d 0d  |],"value":null}.|
00000090  0a 5d 7d 0a                                       |.]}.|
00000094
Request the view, one time for each encoding
00000000  7b 22 74 6f 74 61 6c 5f  72 6f 77 73 22 3a 32 2c  |{"total_rows":2,|
00000010  22 6f 66 66 73 65 74 22  3a 30 2c 22 72 6f 77 73  |"offset":0,"rows|
00000020  22 3a 5b 0d 0a 7b 22 69  64 22 3a 22 64 6f 63 31  |":[..{"id":"doc1|
00000030  22 2c 22 6b 65 79 22 3a  5b 22 66 69 6c 65 22 2c  |","key":["file",|
00000040  22 63 68 61 c3 ae 6e 65  22 5d 2c 22 76 61 6c 75  |"cha..ne"],"valu|
00000050  65 22 3a 6e 75 6c 6c 7d  2c 0d 0a 7b 22 69 64 22  |e":null},..{"id"|
00000060  3a 22 64 6f 63 32 22 2c  22 6b 65 79 22 3a 5b 22  |:"doc2","key":["|
00000070  66 69 6c 65 22 2c 22 63  68 61 69 cc 82 6e 65 22  |file","chai..ne"|
00000080  5d 2c 22 76 61 6c 75 65  22 3a 6e 75 6c 6c 7d 0d  |],"value":null}.|
00000090  0a 5d 7d 0a                                       |.]}.|
00000094
00000000  7b 22 74 6f 74 61 6c 5f  72 6f 77 73 22 3a 32 2c  |{"total_rows":2,|
00000010  22 6f 66 66 73 65 74 22  3a 30 2c 22 72 6f 77 73  |"offset":0,"rows|
00000020  22 3a 5b 0d 0a 7b 22 69  64 22 3a 22 64 6f 63 31  |":[..{"id":"doc1|
00000030  22 2c 22 6b 65 79 22 3a  5b 22 66 69 6c 65 22 2c  |","key":["file",|
00000040  22 63 68 61 c3 ae 6e 65  22 5d 2c 22 76 61 6c 75  |"cha..ne"],"valu|
00000050  65 22 3a 6e 75 6c 6c 7d  2c 0d 0a 7b 22 69 64 22  |e":null},..{"id"|
00000060  3a 22 64 6f 63 32 22 2c  22 6b 65 79 22 3a 5b 22  |:"doc2","key":["|
00000070  66 69 6c 65 22 2c 22 63  68 61 69 cc 82 6e 65 22  |file","chai..ne"|
00000080  5d 2c 22 76 61 6c 75 65  22 3a 6e 75 6c 6c 7d 0d  |],"value":null}.|
00000090  0a 5d 7d 0a                                       |.]}.|
00000094

Something curious is going on here: both documents are stored distinctly and produce two distinct view rows, the keys param matches both rows no matter which encoding is used, and when q=1 the reduce/group operation with _count considers the rows' keys equal and merges them.
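
The difference between the two grouped results can be imitated outside CouchDB. The sketch below is only a toy model of group=true with _count, not how CouchDB implements it: `rows` and `fold` are hypothetical helpers, and `fold` stands in for "equal under collation" by rewriting the single sequence U+0069 U+0302 to U+00EE.

```shell
#!/bin/sh
# Toy model of group=true with _count; NOT CouchDB's implementation.

# The two keys emitted by the map function, one per line.
rows() {
    printf 'cha\303\256ne\n'     # doc1: U+00EE        (c3 ae)
    printf 'chai\314\202ne\n'    # doc2: U+0069 U+0302 (69 cc 82)
}

# Hypothetical stand-in for collation equivalence: fold the decomposed
# sequence into the precomposed one. LC_ALL=C makes sed operate on bytes.
fold() {
    LC_ALL=C sed "s/$(printf 'i\314\202')/$(printf '\303\256')/g"
}

echo "grouped byte for byte:"
rows | LC_ALL=C sort | LC_ALL=C uniq -c

echo "grouped after folding equivalent keys:"
rows | fold | LC_ALL=C sort | LC_ALL=C uniq -c
```

Byte-wise grouping yields two groups of one row each, like the q=2 listing; grouping after folding yields one group counting both rows, like the q=1 listing.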

jcoglan · 2021-10-07T14:42:13Z

So, the problem is not happening when view rows are stored -- two rows are always present. The problem is that if the rows are in the same shard, then reductions over them might consider their keys equal. Wondering if this has anything to do with how intermediate reduction results are stored in view B-trees (docs)?

nickva · 2021-10-07T16:28:42Z

Great analysis, @jcoglan

I think you may be right that it has to do with how intermediate results are stored and how the unicode collator compares them. We had a recent fix that may be related (4f33f14); there we noticed that the collation rules on the shards are different from the collation rules used when aggregating rows in the coordinator (fabric).

Wonder which representation is the correct one: should these two rows be considered equal (does unicode collation consider them equivalent)? Or is it correct that they would be emitted as separate rows? If the first is correct, then it could be that the coordinator reduce step (in fabric) has a bug where it matches keys exactly instead of using unicode collation.

jcoglan · 2021-10-07T16:33:37Z

Collation would depend on your locale, in general. But normalisation would at least let you decide that U+00EE and U+0069 U+0302 are the same thing and convert one to the other, producing identical byte sequences. If you're being consistent in comparison of either codepoints or bytes you should get consistent behaviour.

As @nono says, there's no obvious correct choice here; the problem is the inconsistent behaviour depending on q, indicating that different bits of CouchDB disagree about how to compare strings.

jcoglan · 2021-10-07T16:36:20Z

I did wonder if this would have anything to do with those strings being round-tripped through JS, but that ought to preserve the codepoints that are present even if JS uses a different internal byte encoding (the bytes in examples here are UTF-8). As an aside, I made some notes about string encoding/comparison when I first picked up CouchDB as I was curious about whether JS would affect how strings get sorted.

nickva · 2021-10-13T06:26:37Z

I think the issue is in the logic where we match reduce rows by keys. When there are keys that are effectively equivalent under unicode collation rules, the worker might return the 69cc82 row with value 2 already reduced, but if the requested key is c3ae then it won't be found and we'd get 0 rows in the result. Instead, we would like to compare the rows in the row dict not by exact matching, but using the same collation algorithm as the one used when building the view.
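
The mismatch described above can be sketched as a lookup against a single already-reduced row. This is a toy model under stated assumptions, not the fabric code: `ROW_KEY`, `ROW_VALUE`, `canon`, `lookup_exact`, and `lookup_equiv` are hypothetical names, and `canon` only handles the one character involved here rather than doing real ICU collation.

```shell
#!/bin/sh
# Toy model of matching a requested key against a reduced row.
NFC=$(printf 'cha\303\256ne')     # requested key, bytes c3 ae
NFD=$(printf 'chai\314\202ne')    # other spelling, bytes 69 cc 82

# Assumption: the worker merged both rows and reported them under the
# 69cc82 spelling with value 2.
ROW_KEY=$NFD
ROW_VALUE=2

# Hypothetical stand-in for collation equivalence (one character only).
canon() {
    printf '%s' "$1" | LC_ALL=C sed "s/$(printf 'i\314\202')/$(printf '\303\256')/g"
}

# Exact, byte-for-byte matching: the other spelling finds nothing.
lookup_exact() {
    if [ "$1" = "$ROW_KEY" ]; then echo "$ROW_VALUE"; else echo "no row"; fi
}

# Equivalence-aware matching: both spellings find the merged row.
lookup_equiv() {
    if [ "$(canon "$1")" = "$(canon "$ROW_KEY")" ]; then
        echo "$ROW_VALUE"
    else
        echo "no row"
    fi
}

echo "exact, c3ae key:      $(lookup_exact "$NFC")"
echo "exact, 69cc82 key:    $(lookup_exact "$NFD")"
echo "equivalent, c3ae key: $(lookup_equiv "$NFC")"
```

Which spelling the merged row carries does not matter for the illustration; with exact matching, whichever spelling is not stored comes back empty.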

I made an attempt here #3783

With that PR and my altered reproducer script I get:

--- c h a i n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

--- c h a i ^ n e ---
{"rows":[
{"key":["file","chaîne"],"value":2}
]}

For both q=1 and q=2 cases

CouchDB has changed the way several documents with the same field in different encoding are returned in view requests. But, it wasn't what I was expecting, and with CouchDB 3.2.1+, it wasn't possible to change a file or directory name to a new name when just the encoding has changed (eg NFC -> NFD). Cf apache/couchdb#3773

nono added bug needs-triage labels Oct 1, 2021

nickva mentioned this issue Oct 13, 2021

Fix reduce view row collation with unicode equivalent keys #3783

Merged

janl added this to the 3.2.1 milestone Oct 25, 2021

nickva removed the needs-triage label Oct 26, 2021

janl closed this as completed Nov 1, 2021

kocolosk mentioned this issue Nov 4, 2021

Grouped reductions break ICU collation #2008

Closed

nono mentioned this issue Jul 7, 2022

Fix encoding issues of dir/file name in CouchDB cozy/cozy-stack#3459

Merged
