[SPARK-29758][SQL][2.4] Fix truncation of requested string fields in json_tuple #26563
MaxGekk wants to merge 4 commits into apache:branch-2.4
cc @gatorsmile, @cloud-fan
Test build #113965 has finished for PR 26563 at commit
    .createWithDefault(false)

  val OPTIMIZE_STRING_COPY_IN_JSON_TUPLE =
    buildConf("spark.sql.optimizeStringCopyInJsonTuple.enabled")
The name is misleading. It's for fixing a correctness bug, not a perf improvement, right?
The flag turns on/off an optimization which may produce wrong results for large fields. For small fields the optimization works, as the tests in JsonFunctionsSuite show. Whether to enable the optimization is up to users.
How would you name this flag?
      // if the user requests a string field it needs to be returned without enclosing
      // quotes which is accomplished via JsonGenerator.writeRaw instead of JsonGenerator.write
-     case JsonToken.VALUE_STRING if parser.hasTextCharacters =>
+     case JsonToken.VALUE_STRING if optimizeStringCopy && parser.hasTextCharacters =>
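The code comment in this hunk describes returning a requested string field without its enclosing quotes. A minimal sketch of that idea without the `parser.hasTextCharacters` fast path is shown below. It assumes jackson-core is on the classpath; `CopyStringSketch`, `copyCurrentValue` and `extractField` are hypothetical names for illustration, not Spark's actual code, and null handling is left out.

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.{JsonFactory, JsonGenerator, JsonParser, JsonToken}

object CopyStringSketch {
  private val factory = new JsonFactory()

  // Copies the parser's current value into the generator. String values are
  // written without enclosing quotes via writeRaw(String), going through
  // parser.getText instead of the parser's internal character buffer.
  def copyCurrentValue(generator: JsonGenerator, parser: JsonParser): Unit =
    parser.getCurrentToken match {
      case JsonToken.VALUE_STRING =>
        generator.writeRaw(parser.getText)
      case _ =>
        // objects, arrays, numbers and booleans are copied as regular JSON
        generator.copyCurrentStructure(parser)
    }

  // Hypothetical helper: returns the value of a top-level field, or None.
  def extractField(json: String, field: String): Option[String] = {
    val parser = factory.createParser(json)
    try {
      if (parser.nextToken() != JsonToken.START_OBJECT) return None
      while (parser.nextToken() == JsonToken.FIELD_NAME) {
        val name = parser.getCurrentName
        parser.nextToken() // advance to the field's value
        if (name == field) {
          val out = new StringWriter()
          val generator = factory.createGenerator(out)
          copyCurrentValue(generator, parser)
          generator.close() // flushes the output into the StringWriter
          return Some(out.toString)
        }
        parser.skipChildren() // no-op for scalars, skips nested objects/arrays
      }
      None
    } finally parser.close()
  }
}
```

For example, `CopyStringSketch.extractField("""{"a":1,"b":"x"}""", "b")` yields `Some("x")`. The trade-off is one extra `String` allocation per requested field in exchange for not depending on the parser's buffer state.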
I'd simply remove this optimization. Correctness is critical.
Initially, I thought of removing the optimization but hesitated because that could impact users who have small fields. OK, let's remove it.
Test build #114004 has finished for PR 26563 at commit
@HyukjinKwon @gatorsmile any thoughts?
I think it's fine.
@MaxGekk let's resolve conflicts and get this in. Thanks!
…ncation-by-json_tuple-2.4 # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
Test build #114112 has finished for PR 26563 at commit
…`json_tuple`

### What changes were proposed in this pull request?
In the PR, I propose to remove an optimization in `json_tuple` which causes truncation of results for large requested string fields.

### Why are the changes needed?
Spark 2.4 uses Jackson Core 2.6.7, which has a bug in copying strings. This bug may lead to truncation of results in some cases. The bug has already been fixed by the commit FasterXML/jackson-core@554f8db, which has been part of Jackson Core since version 2.7.7. Upgrading Jackson Core to 2.7.7 or a later version is risky. That's why this PR proposes to avoid using the buggy methods of Jackson Core 2.6.7.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By a new test added to `JsonFunctionsSuite`.

Closes #26563 from MaxGekk/fix-truncation-by-json_tuple-2.4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks, merging to 2.4!
What changes were proposed in this pull request?
In the PR, I propose to remove an optimization in `json_tuple` which causes truncation of results for large requested string fields.
Why are the changes needed?
Spark 2.4 uses Jackson Core 2.6.7, which has a bug in copying strings. This bug may lead to truncation of results in some cases. The bug has already been fixed by the commit FasterXML/jackson-core@554f8db, which has been part of Jackson Core since version 2.7.7. Upgrading Jackson Core to 2.7.7 or a later version is risky. That's why this PR proposes to avoid using the buggy methods of Jackson Core 2.6.7.
Does this PR introduce any user-facing change?
No
How was this patch tested?
By a new test added to `JsonFunctionsSuite`.
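A check along these lines could look like the sketch below. This is not the test added in the PR; it is a standalone illustration that assumes Spark SQL is on the classpath, and the object name `JsonTupleLengthCheck` and the chosen lengths are arbitrary. It builds a JSON document with one long string field and asserts that `json_tuple` returns the field at full length.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

object JsonTupleLengthCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("json_tuple-length-check")
      .getOrCreate()
    import spark.implicits._

    try {
      // Arbitrary illustrative lengths, from small to large fields.
      Seq(2000, 8000, 65535).foreach { len =>
        val value = "a" * len
        // Extract the "test" field and measure the returned string.
        val resultLength = Seq(s"""{"test":"$value"}""").toDF("json")
          .select(json_tuple($"json", "test"))
          .as[String]
          .head()
          .length
        assert(resultLength == len, s"expected $len characters, got $resultLength")
      }
    } finally {
      spark.stop()
    }
  }
}
```

On a build affected by the truncation, the assertion would be expected to fail for some of the larger lengths; with the optimization removed it should pass for all of them.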