cli/sql: \u0000 in JSON strings doesn't render properly (compared to psql) #33165

Closed
rolandcrosby opened this issue Dec 14, 2018 · 8 comments

@rolandcrosby

Tried the following in cockroach demo on my Mac, with CockroachDB 2.1.1:

root@127.0.0.1:56202/defaultdb> create table foo (body jsonb);
CREATE TABLE

Time: 3.24ms

root@127.0.0.1:56202/defaultdb> insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0000", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 1

Time: 4.081ms

root@127.0.0.1:56202/defaultdb> select * from foo;
                                               body
+------------------------------------------------------------------------------------------------+
  {"favorite_charcters": ["\\u0000", "A", "\342\232\243", "\360\237\244\267"], "name": "roland"}
(1 row)

Time: 3.763ms

While most Unicode escapes in JSON strings get encoded into UTF-8, it looks like "\u0000" isn't recognized that way. I would expect to see \u0000 show up as \x00 or something, as it does in other CockroachDB strings:

root@127.0.0.1:56202/defaultdb> create table bar (foo string);
CREATE TABLE

Time: 4.08ms

root@127.0.0.1:56202/defaultdb> insert into bar (foo) values (e'abcd\000\040efg');
INSERT 1

Time: 2.797ms

root@127.0.0.1:56202/defaultdb> select * from bar;
      foo
+--------------+
  abcd\x00 efg
(1 row)

Time: 2.266ms
@rolandcrosby
Author

cc @knz

@knz
Contributor

knz commented Dec 14, 2018

Yep, it does indeed look like \u0000 is not parsed properly in a JSON context.

@knz knz changed the title from "\u0000 in JSON strings doesn't appear to be parsed as a null character" to "sql: \u0000 in JSON strings doesn't appear to be parsed as a null character" Dec 14, 2018
@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-json JSON handling in SQL. labels Dec 14, 2018
@maddyblue
Contributor

This is surprisingly not a bug. Printing as \u0001 is what postgres does. (Testing with \u0000 reveals that postgres considers it an invalid string.) Opened a PR to codify the current behavior in a test.

@knz
Contributor

knz commented Dec 18, 2018

@mjibson if pg reports an error on \u0000 but not crdb, and the crdb behavior doesn't appear to be consistent (we handle NUL bytes properly in other places), I think we should change crdb to produce an error on \u0000 too.

@maddyblue
Contributor

Where do we handle that properly? Go's utf8.ValidString function says these are valid UTF-8 strings, so I'm not sure we're doing the wrong thing here.

> select e'\u0000';
  ?column?  
+----------+
  \x00      
(1 row)
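
For reference, a minimal Go sketch of the standard-library check mentioned above (this uses only unicode/utf8 from the Go standard library, not any CockroachDB code): utf8.ValidString accepts strings containing U+0000 just like any other code point.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// U+0000 (NUL) is an ordinary one-byte UTF-8 code point, so Go's
	// validity check accepts it just like U+0001 or 'A'.
	fmt.Println(utf8.ValidString("\x00"))       // true
	fmt.Println(utf8.ValidString("\x01"))       // true
	fmt.Println(utf8.ValidString("abc\x00def")) // true
}
```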

craig bot pushed a commit that referenced this issue Dec 18, 2018
33221: sql: always use typed expressions for shift on int r=mjibson a=mjibson

Previously the expressions 1 << 63 and 1::int << 63 would return positive
and negative results, respectively. This was because constant folding
in the first case is not subject to the two's-complement arithmetic that
causes negative ints in the second case.

Change shift on int to not fold. This prevents the surprising behavior
described above in addition to other bugs where shifting by a large
amount could allocate huge amounts of memory and crash the process.

Fixes #32682

Release note (sql change): Shift operators on integers now always operate
on explicit types instead of possibly an untyped constant. They also
now return an error if the right-side operand is out of range.

33222: pgwire: test \u0001 in JSON pgwire encoding r=mjibson a=mjibson

Verify that we handle "\u0001" JSON strings identically to postgres. The
printing of \u0001 instead of some other character is correct, and the
same as what postgres does. (We do not test \u0000 here because postgres
rejects it as invalid, even though we accept it.)

Closes #33165

Release note: None

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
@craig craig bot closed this as completed in #33222 Dec 18, 2018
@knz knz reopened this Dec 19, 2018
(DEPRECATED) SQL Front-end, Lang & Semantics automation moved this from Monday pile to Triage Dec 19, 2018
@knz
Contributor

knz commented Dec 19, 2018

@mjibson unfortunately I think you have misunderstood the nature of the problem here.

There are two layers of complexity. The first is the conversion from the string literal the user enters to a SQL string; this unescapes \uxxxx properly, as you have found.

The second is the conversion of a SQL string to a JSON string. This re-escapes NUL characters to the JSON escape \u0000, which seems to be incorrect. Meanwhile, other Unicode characters (including \u0001) are properly converted to JSON strings containing that character.

I do think we need further investigation of the handling of the SQL -> JSON string conversion.
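
As a rough analogy for those two steps (this uses Go's encoding/json, not CockroachDB's actual SQL/JSONB code, so treat it as a sketch only): a JSON decoder turns the \u0000 escape into a real NUL rune in memory, and the encoder re-escapes it on output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Decode: the \u0000 escape becomes a real NUL character in memory.
	var s string
	if err := json.Unmarshal([]byte(`"\u0000"`), &s); err != nil {
		panic(err)
	}
	fmt.Println(len(s), []byte(s)) // 1 [0]

	// Encode: the NUL is re-escaped as \u0000 in the JSON text.
	out, _ := json.Marshal(s)
	fmt.Println(string(out)) // "\u0000"
}
```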

@knz
Contributor

knz commented Jan 2, 2019

I have investigated this further. The problem is actually in cockroach sql.

Using the psql command with PostgreSQL:

kena=> create table foo (body jsonb);
CREATE TABLE
kena=> insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0001", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 0 1
kena=> select * from foo;
                                body
---------------------------------------------------------------------
 {"name": "roland", "favorite_charcters": ["\u0001", "A", "⚣", "🤷"]}

kena=> select body->'favorite_charcters'->0 from foo;
 ?column?
----------
 "\u0001"

Then psql again, but this time against CockroachDB:

root=# create table foo (body jsonb);
CREATE TABLE
root=# insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0000", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 0 1
root=# select * from foo;
                                body
---------------------------------------------------------------------
 {"favorite_charcters": ["\u0000", "A", "⚣", "🤷"], "name": "roland"}
root=# select body->'favorite_charcters'->0 from foo;
 ?column?
----------
 "\u0000"

So the behavior is consistent between PostgreSQL and CockroachDB.

However, the cockroach sql command is unhappy.

This is likely related to #31872.

@knz knz added A-cli and removed A-sql-json JSON handling in SQL. labels Jan 2, 2019
@knz knz changed the title from "sql: \u0000 in JSON strings doesn't appear to be parsed as a null character" to "cli/sql: \u0000 in JSON strings doesn't render properly (compared to psql)" Jan 2, 2019
@knz
Contributor

knz commented Aug 3, 2019

I have finished investigating this: the feature works as intended, and the weird display is a current (mis)feature of the SQL shell - one covered by #28756 and perhaps #31872.

Long story short: both PostgreSQL's and CockroachDB's JSONB parsers will process and retain (in memory) special characters with a code < 32 (including \u0001). We go the extra mile and also process and retain \u0000, which pg rejects outright.

On input, special characters are accepted even if not escaped; for example, select to_json(e'\x01'::text)::jsonb works.

The special chars are stored in memory, as intended, as single characters.

However, every time the JSONB is printed out - either via a direct conversion to string, or a conversion to a byte array sent over pgwire - any character with a code lower than 32 gets escaped using the \uNNNN escape format.
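
A minimal sketch of that output rule, assuming the description above (illustrative only, not the actual PostgreSQL or CockroachDB escaping code):

```go
package main

import (
	"fmt"
	"strings"
)

// escapeJSONString renders s as a JSON string literal, escaping quotes,
// backslashes, and any character with a code below 32 as \uNNNN.
func escapeJSONString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"' || r == '\\':
			b.WriteByte('\\')
			b.WriteRune(r)
		case r < 0x20:
			fmt.Fprintf(&b, `\u%04x`, r)
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	fmt.Println(escapeJSONString("\x00\x01A")) // "\u0000\u0001A"
}
```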

I verified this to be true in both pg and CockroachDB. So far, so good.

The final step is printing out the string to the screen/terminal.

  • psql receives the bytes (result of json->string conversion) and prints them as-is to the screen/terminal

  • cockroach sql will first convert the bytes to a Go string using the standard Go escaping algorithm, and only then print the result to the screen/terminal. This additional conversion step will double any \ character, escape some non-printable characters, etc. (see the sketch below)
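
A small illustration of the effect (strconv.Quote is used here only as a stand-in for whatever escaping the shell actually applies; it is not necessarily the exact function involved):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Bytes as received from the server: the NUL is already escaped as \u0000.
	recv := `"\u0000"`

	// psql-style: print the bytes as-is.
	fmt.Println(recv) // "\u0000"

	// An extra Go-style quoting pass on top (stand-in for the shell's added
	// escaping) doubles the backslash, which is how \\u0000 ends up on screen.
	fmt.Println(strconv.Quote(recv)) // "\"\\u0000\""
}
```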

I still have to figure out if this is an instance of a problem in lib/pq (that would be #31872), which we don't have control over, or the cockroach sql code itself trying to be too clever (that would be #28756).

In any case, both possible causes of the original symptoms are tracked by these separate issues, so I am going to close this one which has become redundant.

@knz knz closed this as completed Aug 3, 2019