Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JOHNZON-209 Fix JsonObject#toString() to escape key names. #40

Merged
merged 2 commits into from
Apr 29, 2019

Conversation

leadpony
Copy link
Contributor

@leadpony leadpony commented Apr 20, 2019

This PR will fix JOHNZON-209 with additional test cases.

Copy link
Contributor

@rmannibucau rmannibucau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR!

I wonder if the escaping can be less costly, wdyt?

Btw a love the unit test you added, we can add unicode related one too but this is very good to show the issue and prevent any regression, thanks for doing that in this PR!

@@ -147,7 +147,7 @@ public String toString() {
while (hasNext) {
final Map.Entry<String, JsonValue> entry = it.next();

builder.append('"').append(entry.getKey()).append("\":");
builder.append('"').append(Strings.escape(entry.getKey())).append("\":");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on a global buffer cache, we likely want to use the builder - object builder or parser factory - buffer factory and not the global one which is there for compatibility IIRC.
It also slows down default case - not escaped - which is a concern because toString is a serialization solution for jsonobject for jsonb and some other cases.

From memory, caching was not efficient until you are lock free and hash free.

Alternatively we can make Strings with a method taking a builder and without the buffer logic we could call if the key contains characters to convert in unicode or escapable, in this last case the check should be fast and conversion can surely be done without buffer since it is very unlikely it needs to be done. Needs a small benchmark but it sounds the simplest and fastest to me assuming escaping is rare (from my experience it is for keys). Also iterating over the char[] instead of String with chartAt will be faster.

Wdyt? Can you try to make this escape call a bit less impacting?

@leadpony
Copy link
Contributor Author

leadpony commented Apr 22, 2019

Hello @rmannibucau
You are right. My first commit was too naive.
I wrote a small benchmarking program and measured the throughput of JsonObject.toString() using Java Microbenchmark Harness after I made the fix less impacting to the performance.

The result are:

Johnzon 1.1.11 (current stable)

Benchmark                                (name)   Mode  Cnt       Score      Error  Units
JsonpBenchmark.jsonValueToString  glossary.json  thrpt   25  416666.246 ± 1022.390  ops/s

Johnzon 1.1.12-SNAPSHOT (including the commit 542765d)

Benchmark                                (name)   Mode  Cnt       Score      Error  Units
JsonpBenchmark.jsonValueToString  glossary.json  thrpt   25  394121.959 ± 1196.864  ops/s

It is slower but not so bad now.

Please note that I deleted the code escaping some range of code points: '\u0080'-'\u00a0' and '\u2000'-'\u2100'. Is this handling really required?
Thank you.

@rmannibucau
Copy link
Contributor

Hi @leadpony, it is required AFAIK. Did yiu try to move from String+charAt to toCharArray+[] fir the iteration? Also preallocating the stringbuilder with the incoming string length can help. Then we would need to be able to bypass the escaping when relevant but this requires a cache and interning string or caching based on a hashcode will be likely costly as well and will require eviction.
To give an idea, if perfs are not impacted more than ~5% we are good IMHO, if they are we can need a toggle to deactivate that fix. But first let's try to be a bit better before adding conf ;).

@leadpony
Copy link
Contributor Author

Hello @rmannibucau,
I tried moving String + charAt() to toCharArray() + [].
The result are:

Benchmark                                (name)   Mode  Cnt       Score      Error  Units
JsonpBenchmark.jsonValueToString  glossary.json  thrpt   25  380042.206 ± 5879.107  ops/s

It is slower than the previous measurement. This is because toCharArray() allocates additional memory for array and the memory allocation makes the code slower.
The modified code is available at branch fix-tostring-using-chararray

@leadpony
Copy link
Contributor Author

JSON Processing API is based on RFC 7159 and the document states:

All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

Therefore, we do not need to escape the characters between U+0080-U+00a0 and U+2000-U+2100. The Reference Implementation of JSON-P also does not escape these characters.

@rmannibucau
Copy link
Contributor

Oki so maybe split the escape method to keep escaping where it is today and drop it from toString and we should be in the 5%, right?

Btw thanks for testing the toCharArray path, string have this particularity to be optimized a lot by the jvm so classical optimizations sometime fail. I appreciate you tested.

@leadpony
Copy link
Contributor Author

leadpony commented Apr 27, 2019

Hi @rmannibucau,
Optimizing code WITHOUT measurement is almost always bad.

I do not feel like optimizing the current code further because JsonObject.toString() in the Reference Implementation is twice faster than that of Johnzon 1.1.11 (current release) as shown below.

Benchmark                                (name)   Mode  Cnt       Score       Error  Units
JsonpBenchmark.jsonValueToString  glossary.json  thrpt   25  940578.189 ± 18156.788  ops/s

RI is not highly optimized but fast and correct.
We need to rework completely the algorithm, if you would like to make it twice faster.
Right now, correctness is of first priority for me.

@rmannibucau
Copy link
Contributor

Hmm, did you check where the diff comes from, cached escaped value maybe? Recall toString was not originally intended to be fast but more a debug thing but it is now a runtime API so we must enhance it IMHO.

I will try to have a look or at least help next week if you cant, Id like it fixed for next release.

@leadpony
Copy link
Contributor Author

leadpony commented Apr 27, 2019

Thank you @rmannibucau
No caching. They are using JsonWriter with StringWriter, which avoids multiple instantiation of StringBuilder even when the object to be stringified is deeply nested.

@rmannibucau
Copy link
Contributor

@leadpony applied this one - was particularly interested in tests ;). Will check toString now. Thanks a lot for the work!

@asfgit asfgit merged commit 542765d into apache:master Apr 29, 2019
@leadpony leadpony deleted the fix-tostring branch April 29, 2019 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants