Elasticsearch accepts invalid json with unpredictable behavior #19614

Closed
HonzaKral opened this issue Jul 27, 2016 · 11 comments

@HonzaKral (Member) commented Jul 27, 2016

When a key is present in a JSON object multiple times, Elasticsearch doesn't raise a parse error and only the last value is used. This should instead raise a json_parse_exception.

Elasticsearch version: verified on 2.x, 5.0.0-alpha3

Steps to reproduce:

  1. curl -X PUT localhost:9200/i -d '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
  2. curl -X GET localhost:9200/i

The GET in step 2 then shows that only the second settings object took effect (number_of_shards is 1), while the number_of_replicas value from the first one was silently dropped.

@javanna (Member) commented Jul 27, 2016

I could see this becoming a long discussion around whether this is invalid JSON or not, and whether we should return a parse exception or some other error. If the JSON library we use for parsing allows this, then we should improve this on our end rather than being lenient.

This reminds me of #19547 too and is a very common problem with the way we pull-parse JSON. It can easily be solved case by case, but every single parser in our codebase is subject to this, so it would be nice to have a generic solution for it. I'm not sure whether there are alternatives to adding lots of ifs to all our pull parsers; we should evaluate that.

@tlrx (Member) commented Jul 27, 2016

> If the JSON library we use for parsing allows this, then we should improve this on our end rather than being lenient.

For the record, the option is JsonParser.Feature.STRICT_DUPLICATE_DETECTION and its Javadoc carries the following warning:

Note that enabling this feature will incur performance overhead 
due to having to store and check additional information: 
this typically adds 20-30% to execution time for basic parsing.
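
For illustration, here is a minimal standalone sketch (not Elasticsearch code) of enabling that Jackson 2.x feature on a JsonFactory, reusing the duplicate-settings body from the reproduction above:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;

public class DuplicateKeyDemo {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        // Off by default in Jackson; this is the feature discussed above.
        factory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);

        String json = "{\"settings\": {\"number_of_replicas\": 2},"
                + " \"settings\": {\"number_of_shards\": 1}}";

        try (JsonParser parser = factory.createParser(json)) {
            while (parser.nextToken() != null) {
                // pull through the whole token stream
            }
            System.out.println("parsed without complaint");
        } catch (JsonParseException e) {
            // With the feature enabled, Jackson rejects the second
            // "settings" field instead of silently keeping the last value.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the feature left at its default (disabled), the same loop pulls both settings fields through without complaint, which is exactly the leniency reported in this issue.
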
@clintongormley (Member) commented Jul 27, 2016

According to the JSON spec, this isn't invalid JSON. The spec doesn't mention how duplicate keys should be treated. Many languages will simply overwrite older values with newer values, without generating any warning. This is essentially what Elasticsearch does today, and I'm not sure it is worth a 20-30% penalty to prevent this behaviour.

@HonzaKral (Member, Author) commented Jul 27, 2016

Yes, strictly speaking this is valid (the RFC only says the keys SHOULD be unique). I also agree that the performance penalty isn't worth it. It would, however, be nice to document this behavior and perhaps (if it's easy) have an option to turn on strict checking, ideally per request - it would be useful as a debugging tool and perhaps when running tests.

@martijnvg (Member) commented Jul 28, 2016

Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500

Maybe we should enable strict parsing for certain APIs (admin-like APIs, for instance)?

@jpountz (Contributor) commented Jul 29, 2016

Discussed in FixitFriday: let's play with the Jackson feature to reject duplicate keys and make sure that it works and has a reasonable performance hit. If it is not satisfactory, then let's look into whether there are things we can do at a higher level, such as in ObjectParser.

@danielmitterdorfer (Member) commented Dec 1, 2016

Macrobenchmark Results

We have run our whole macrobenchmark suite with JsonParser.Feature.STRICT_DUPLICATE_DETECTION disabled (baseline) and enabled (contender) and published the results: https://elasticsearch-benchmarks.elastic.co/19614/

The largest reduction in median indexing throughput we see across the suite is 3% (on the PMC track).

Microbenchmark Results

I also double-checked a few scenarios with a microbenchmark and saw similar results (see https://gist.github.com/danielmitterdorfer/9236796a46f3956447171313a6a0b365):

Below are the results of both configurations showing the average time for one iteration (smaller is better).

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: false:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  19.414 ± 0.044  us/op
JsonParserBenchmark.smallJson  avgt   60   0.479 ± 0.001  us/op

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: true:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  20.642 ± 0.064  us/op
JsonParserBenchmark.smallJson  avgt   60   0.487 ± 0.001  us/op

For small JSON objects (49 bytes) the overhead of the duplicate check is 8 ns, or 1.6%. For a large JSON object (6440 bytes) the overhead of the duplicate check is between 1.12 µs [1] and 1.3 µs [2], i.e. between 5.8% and 6.7%.

[1] best case with duplicate checks enabled: 20.578 µs, worst case with duplicate checks disabled: 19.458 µs
[2] worst case with duplicate checks enabled: 20.706 µs, best case with duplicate checks disabled: 19.370 µs

Please refer to the gist for more details.
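
For context, a JMH harness along these lines could produce such numbers; this is only a sketch, with an illustrative payload and class shape rather than the actual gist code:

```java
import java.util.concurrent.TimeUnit;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class JsonParserBenchmark {

    // Illustrative payload; the gist benchmarks a 49-byte and a
    // 6440-byte document.
    private static final String SMALL_JSON = "{\"name\":\"test\",\"value\":42}";

    @Param({"false", "true"})
    public boolean strictDuplicateDetection;

    private JsonFactory factory;

    @Setup
    public void setup() {
        factory = new JsonFactory();
        factory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION,
            strictDuplicateDetection);
    }

    @Benchmark
    public int smallJson() throws Exception {
        int tokens = 0;
        try (JsonParser parser = factory.createParser(SMALL_JSON)) {
            while (parser.nextToken() != null) {
                tokens++; // the duplicate check (if enabled) runs as field names are consumed
            }
        }
        return tokens; // returned so JMH cannot dead-code-eliminate the loop
    }
}
```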

@jpountz (Contributor) commented Dec 1, 2016

Thanks @danielmitterdorfer. To me that means we should do it. We can have an undocumented escape hatch if we do not feel confident the overhead will be low in all cases.

@danielmitterdorfer (Member) commented Dec 1, 2016

> We can have an undocumented escape hatch

@jpountz The relevant code is in a static block so we can't use our settings infrastructure. I guess that means we'd use a system property?

@jpountz (Contributor) commented Dec 1, 2016

That would work for me. Or we could handle it like INDICES_MAX_CLAUSE_COUNT_SETTING I suppose, which is a node setting that sets the static limit on the number of boolean clauses.
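
To make that concrete, here is a sketch of what such a system-property escape hatch could look like; the holder class is hypothetical, and the property name is the one the follow-up commits below ended up using:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

// Hypothetical holder class; in Elasticsearch the factory lives in the
// XContent layer and is initialized in a static block, as noted above.
public final class StrictJsonFactoryHolder {

    static final JsonFactory JSON_FACTORY;

    static {
        JSON_FACTORY = new JsonFactory();
        // Strict checking stays on unless the JVM is started with
        // -Des.json.strict_duplicate_detection=false
        boolean strict = Boolean.parseBoolean(
            System.getProperty("es.json.strict_duplicate_detection", "true"));
        JSON_FACTORY.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, strict);
    }

    private StrictJsonFactoryHolder() {}
}
```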

@tlrx (Member) commented Dec 5, 2016

Thanks @danielmitterdorfer.

I agree with @jpountz, and a first step would be to see if our tests pass (I'm pretty sure we will have to adapt some of them). Also, the same JSON factory is used for both parsing and generating JSON: if we enable this feature then we'll also see if we generate duplicate keys somewhere, which is cool.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 9, 2016

Enable strict duplicate checks for JSON content
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes elastic#19614

danielmitterdorfer added a commit that referenced this issue Dec 14, 2016

Enable strict duplicate checks for JSON content
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes #19614

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 16, 2016

Enable strict duplicate checks for all XContent types
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates elastic#19614
Relates elastic#22073

danielmitterdorfer added a commit that referenced this issue Dec 19, 2016

Enable strict duplicate checks for all XContent types (#22225)
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates #19614
Relates #22073

@danielmitterdorfer removed their assignment Jan 10, 2017
