cli/sql: \u0000 in JSON strings doesn't render properly (compared to psql) #33165

Closed
rolandcrosby opened this issue Dec 14, 2018 · 8 comments

@rolandcrosby

Tried the following in cockroach demo on my Mac, with CockroachDB 2.1.1:

root@127.0.0.1:56202/defaultdb> create table foo (body jsonb);
CREATE TABLE

Time: 3.24ms

root@127.0.0.1:56202/defaultdb> insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0000", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 1

Time: 4.081ms

root@127.0.0.1:56202/defaultdb> select * from foo;
                                               body
+------------------------------------------------------------------------------------------------+
  {"favorite_charcters": ["\\u0000", "A", "\342\232\243", "\360\237\244\267"], "name": "roland"}
(1 row)

Time: 3.763ms

While most Unicode escapes in JSON strings get encoded into UTF-8, it looks like "\u0000" isn't recognized that way. I would expect to see \u0000 show up as \x00 or something, as it does in other CockroachDB strings:

root@127.0.0.1:56202/defaultdb> create table bar (foo string);
CREATE TABLE

Time: 4.08ms

root@127.0.0.1:56202/defaultdb> insert into bar (foo) values (e'abcd\000\040efg');
INSERT 1

Time: 2.797ms

root@127.0.0.1:56202/defaultdb> select * from bar;
      foo
+--------------+
  abcd\x00 efg
(1 row)

Time: 2.266ms
@rolandcrosby
Author

cc @knz

@knz
Contributor

knz commented Dec 14, 2018

Yep, it does indeed look like \u0000 is not parsed properly in a JSON context.

@knz knz changed the title from "\u0000 in JSON strings doesn't appear to be parsed as a null character" to "sql: \u0000 in JSON strings doesn't appear to be parsed as a null character" Dec 14, 2018
@knz knz added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-json JSON handling in SQL. labels Dec 14, 2018
@maddyblue
Contributor

This is surprisingly not a bug. Printing as \u0001 is what postgres does. (Testing with \u0000 reveals that postgres considers it an invalid string.) Opened a PR to codify the current behavior in a test.

@knz
Contributor

knz commented Dec 18, 2018

@mjibson if pg reports an error on \u0000 but not crdb, and the crdb behavior doesn't appear to be consistent (we handle NUL bytes properly in other places), I think we should change crdb to produce an error on \u0000 too.

@maddyblue
Contributor

Where do we handle that properly? Go's utf8.ValidString function says these are valid UTF-8 strings, so I'm not sure we're doing the wrong thing here.

> select e'\u0000';
  ?column?  
+----------+
  \x00      
(1 row)
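
For reference, a minimal Go sketch of the standard-library check mentioned above (this uses only unicode/utf8 from the Go standard library, not any CockroachDB code): utf8.ValidString accepts strings containing U+0000 just like any other code point.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// U+0000 (NUL) is an ordinary one-byte UTF-8 code point, so Go's
	// validity check accepts it just like U+0001 or 'A'.
	fmt.Println(utf8.ValidString("\x00"))       // true
	fmt.Println(utf8.ValidString("\x01"))       // true
	fmt.Println(utf8.ValidString("abc\x00def")) // true
}
```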

craig bot pushed a commit that referenced this issue Dec 18, 2018
33221: sql: always use typed expressions for shift on int r=mjibson a=mjibson

Previously the expressions 1 << 63 and 1::int << 63 would return positive
and negative results, respectively. This was because constant folding
in the first case is not subject to the two's-complement arithmetic that
causes negative ints in the second case.

Change shift on int to not fold. This prevents the surprising behavior
described above in addition to other bugs where shifting by a large
amount could allocate huge amounts of memory and crash the process.

Fixes #32682

Release note (sql change): Shift operators on integers now always operate
on explicit types instead of possibly an untyped constant. They also
now return an error if the right-side operand is out of range.

33222: pgwire: test \u0001 in JSON pgwire encoding r=mjibson a=mjibson

Verify that we handle "\u0001" JSON strings identically to postgres. The
printing of \u0001 instead of some other character is correct, and the
same as what postgres does. (We do not test \u0000 here because postgres
rejects it as invalid, even though we accept it.)

Closes #33165

Release note: None

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
@craig craig bot closed this as completed in #33222 Dec 18, 2018
@knz knz reopened this Dec 19, 2018
(DEPRECATED) SQL Front-end, Lang & Semantics automation moved this from Monday pile to Triage Dec 19, 2018
@knz
Contributor

knz commented Dec 19, 2018

@mjibson unfortunately I think you have misunderstood the nature of the problem here.

There are two layers of complexity. The first is the conversion from the string literal the user enters to a SQL string; this unescapes \uxxxx properly, as you have found.

The second is the conversion of a SQL string to a JSON string. This re-escapes NUL characters to the JSON escape \u0000, which seems to be incorrect. Meanwhile, other Unicode characters (including \u0001) are properly converted to JSON strings containing that character.

I do think we need further investigation of the handling of the SQL -> JSON string conversion.
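
As a rough analogy for those two steps (this uses Go's encoding/json, not CockroachDB's actual SQL/JSONB code, so treat it as a sketch only): a JSON decoder turns the \u0000 escape into a real NUL rune in memory, and the encoder re-escapes it on output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Decode: the \u0000 escape becomes a real NUL character in memory.
	var s string
	if err := json.Unmarshal([]byte(`"\u0000"`), &s); err != nil {
		panic(err)
	}
	fmt.Println(len(s), []byte(s)) // 1 [0]

	// Encode: the NUL is re-escaped as \u0000 in the JSON text.
	out, _ := json.Marshal(s)
	fmt.Println(string(out)) // "\u0000"
}
```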

@knz
Contributor

knz commented Jan 2, 2019

I have investigated this further. The problem is actually in cockroach sql.

Using the psql command with PostgreSQL:

kena=> create table foo (body jsonb);
CREATE TABLE
kena=> insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0001", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 0 1
kena=> select * from foo;
                                body
---------------------------------------------------------------------
 {"name": "roland", "favorite_charcters": ["\u0001", "A", "⚣", "🤷"]}

kena=> select body->'favorite_charcters'->0 from foo;
 ?column?
----------
 "\u0001"

Then psql again, but this time against CockroachDB:

root=# create table foo (body jsonb);
CREATE TABLE
root=# insert into foo (body) values ('{"name": "roland", "favorite_charcters": ["\u0000", "\u0041", "\u26a3", "\ud83e\udd37"]}');
INSERT 0 1
root=# select * from foo;
                                body
---------------------------------------------------------------------
 {"favorite_charcters": ["\u0000", "A", "⚣", "🤷"], "name": "roland"}
root=# select body->'favorite_charcters'->0 from foo;
 ?column?
----------
 "\u0000"

So the behavior is consistent between PostgreSQL and CockroachDB.

However, the cockroach sql command is unhappy.

This is likely related to #31872.

@knz knz added A-cli and removed A-sql-json JSON handling in SQL. labels Jan 2, 2019
@knz knz changed the title from "sql: \u0000 in JSON strings doesn't appear to be parsed as a null character" to "cli/sql: \u0000 in JSON strings doesn't render properly (compared to psql)" Jan 2, 2019
@knz
Contributor

knz commented Aug 3, 2019

I have finished investigating this: the feature works as intended, and the weird display is a current (mis)feature of the SQL shell - one covered by #28756 and perhaps #31872.

Long story short: both PostgreSQL's and CockroachDB's JSONB parsers will process and retain (in memory) special characters with a code < 32 (including \u0001). We go the extra mile and also process and retain \u0000, which pg rejects outright.

On input, special characters are accepted even if not escaped; for example, select to_json(e'\x01'::text)::jsonb works.

The special chars are stored in memory, as intended, as single characters.

However, every time the JSONB is printed out - either via a direct conversion to string, or a conversion to a byte array sent over pgwire - any character with a code lower than 32 gets escaped using the \uNNNN escape format.
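
A minimal sketch of that output rule, assuming the description above (illustrative only, not the actual PostgreSQL or CockroachDB escaping code):

```go
package main

import (
	"fmt"
	"strings"
)

// escapeJSONString renders s as a JSON string literal, escaping quotes,
// backslashes, and any character with a code below 32 as \uNNNN.
func escapeJSONString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"' || r == '\\':
			b.WriteByte('\\')
			b.WriteRune(r)
		case r < 0x20:
			fmt.Fprintf(&b, `\u%04x`, r)
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	fmt.Println(escapeJSONString("\x00\x01A")) // "\u0000\u0001A"
}
```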

I verified this to be true in both pg and CockroachDB. So far, so good.

The final step is printing out the string to the screen/terminal.

  • psql receives the bytes (result of json->string conversion) and prints them as-is to the screen/terminal

  • cockroach sql will first convert the bytes to a Go string using the standard Go escaping algorithm, and only then print the result to the screen/terminal. This additional conversion step will double any \ character, escape some non-printable characters, etc. (see the sketch below)
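
A small illustration of the effect (strconv.Quote is used here only as a stand-in for whatever escaping the shell actually applies; it is not necessarily the exact function involved):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Bytes as received from the server: the NUL is already escaped as \u0000.
	recv := `"\u0000"`

	// psql-style: print the bytes as-is.
	fmt.Println(recv) // "\u0000"

	// An extra Go-style quoting pass on top (stand-in for the shell's added
	// escaping) doubles the backslash, which is how \\u0000 ends up on screen.
	fmt.Println(strconv.Quote(recv)) // "\"\\u0000\""
}
```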

I still have to figure out if this is an instance of a problem in lib/pq (that would be #31872), which we don't have control over, or the cockroach sql code itself trying to be too clever (that would be #28756).

In any case, both possible causes of the original symptoms are tracked by these separate issues, so I am going to close this one which has become redundant.

@knz knz closed this as completed Aug 3, 2019