[SPARK-29758][SQL][2.4] Fix truncation of requested string fields in `json_tuple` by MaxGekk · Pull Request #26563 · apache/spark

MaxGekk · 2019-11-17T20:02:46Z

What changes were proposed in this pull request?

In this PR, I propose to remove an optimization in `json_tuple` that causes truncation of results for large requested string fields.

Why are the changes needed?

Spark 2.4 uses Jackson Core 2.6.7, which has a bug in copying strings. This bug may lead to truncation of results in some cases. The bug has already been fixed by the commit FasterXML/jackson-core@554f8db, which has been part of Jackson Core since version 2.7.7. Upgrading Jackson Core to 2.7.7 or later is risky, which is why this PR proposes to avoid the buggy methods of Jackson Core 2.6.7.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By a new test added to `JsonFunctionsSuite`.
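For illustration, a check along these lines could look like the sketch below. It is not the test from this PR: the object name `JsonTupleLargeFieldCheck`, the helper `resultLength`, the field name `test`, and the length 2800 are all assumptions chosen for the example. It builds a JSON document with one large string field, extracts it with `json_tuple`, and compares the length of the result with the length of the input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, json_tuple}

// Hypothetical stand-alone check, not the test added to JsonFunctionsSuite.
object JsonTupleLargeFieldCheck {

  /** Length of the string that json_tuple returns for a field holding `len` characters. */
  def resultLength(spark: SparkSession, len: Int): Int = {
    import spark.implicits._
    val value = "a" * len
    Seq(s"""{"test":"$value"}""").toDF("json")
      .select(json_tuple(col("json"), "test")) // yields a single string column
      .as[String]
      .head()
      .length
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("json_tuple-large-field-check")
      .getOrCreate()
    try {
      val len = 2800 // arbitrary "large" size, assumed for this sketch
      val actual = resultLength(spark, len)
      // A truncated result would be shorter than the input field.
      assert(actual == len, s"expected $len characters but got $actual")
    } finally {
      spark.stop()
    }
  }
}
```

Whether a given length triggers the truncation depends on Jackson's internal buffering, so the value may need adjusting to reproduce the problem on an unpatched build.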

dongjoon-hyun · 2019-11-17T20:45:10Z

cc @gatorsmile, @cloud-fan

SparkQA · 2019-11-17T23:16:56Z

Test build #113965 has finished for PR 26563 at commit a28f90c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-18T07:03:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(false)
+
+  val OPTIMIZE_STRING_COPY_IN_JSON_TUPLE =
+    buildConf("spark.sql.optimizeStringCopyInJsonTuple.enabled")


The name is misleading. It's for fixing a correctness bug, not a perf improvement, right?

The flag turns on/off an optimization which may produce wrong results for large fields. For small fields the optimization works, as the tests in JsonFunctionsSuite show. Whether to enable the optimization is up to users.

How would you name this flag?

cloud-fan · 2019-11-18T07:27:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

      // if the user requests a string field it needs to be returned without enclosing
      // quotes which is accomplished via JsonGenerator.writeRaw instead of JsonGenerator.write
-      case JsonToken.VALUE_STRING if parser.hasTextCharacters =>
+      case JsonToken.VALUE_STRING if optimizeStringCopy && parser.hasTextCharacters =>


I'd simply remove this optimization. Correctness is critical.

Initially, I thought of removing the optimization, but I hesitated because removing it could affect users who have small fields. OK, let's remove it.
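As a rough sketch of what the remaining path amounts to, the example below uses Jackson's streaming API directly and copies a string field to the output without enclosing quotes through `parser.getText` and `JsonGenerator.writeRaw`. This is not the code from `jsonExpressions.scala`; the object `StringFieldCopySketch` and the method `extractStringField` are hypothetical names, and it assumes Jackson Core is on the classpath.

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.{JsonFactory, JsonParser, JsonToken}

// Hypothetical illustration, not Spark's implementation.
object StringFieldCopySketch {
  private val factory = new JsonFactory()

  /** Value of the top-level string field `field`, without enclosing quotes, if present. */
  def extractStringField(json: String, field: String): Option[String] = {
    val parser = factory.createParser(json)
    try {
      var result: Option[String] = None
      if (parser.nextToken() == JsonToken.START_OBJECT) {
        while (result.isEmpty && parser.nextToken() == JsonToken.FIELD_NAME) {
          val name = parser.getCurrentName
          val valueToken = parser.nextToken()
          if (name == field && valueToken == JsonToken.VALUE_STRING) {
            result = Some(copyString(parser))
          } else {
            parser.skipChildren() // no-op for scalars, skips nested objects and arrays
          }
        }
      }
      result
    } finally {
      parser.close()
    }
  }

  // Single copy path: materialize the text as a String and write it raw.
  // The removed variant was an extra case guarded by parser.hasTextCharacters.
  private def copyString(parser: JsonParser): String = {
    val out = new StringWriter()
    val generator = factory.createGenerator(out)
    generator.writeRaw(parser.getText)
    generator.close() // flushes the buffered characters into `out`
    out.toString
  }
}
```

The trade-off is one extra `String` allocation per requested field in exchange for not relying on the parser's internal character buffer.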

SparkQA · 2019-11-18T11:42:02Z

Test build #114004 has finished for PR 26563 at commit 2ec9120.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-11-18T15:07:08Z

@HyukjinKwon @gatorsmile any thoughts?

HyukjinKwon · 2019-11-19T04:58:45Z

I think it's fine.

cloud-fan · 2019-11-19T11:30:49Z

@MaxGekk let's resolve conflicts and get this in. Thanks!

SparkQA · 2019-11-19T19:44:24Z

Test build #114112 has finished for PR 26563 at commit f4fd00f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class TimestampParser(fastDateFormat: FastDateFormat)

…`json_tuple`

### What changes were proposed in this pull request?
In the PR, I propose to remove an optimization in `json_tuple` which causes truncation of results for large requested string fields.

### Why are the changes needed?
Spark 2.4 uses Jackson Core 2.6.7 which has a bug in copying string. This bug may lead to truncation of results in some cases. The bug has been already fixed by the commit FasterXML/jackson-core@554f8db which is a part of Jackson Core since the version 2.7.7. Upgrading Jackson Core up to 2.7.7 or later version is risky. That's why this PR propose to avoid using the buggy methods of Jackson Core 2.6.7.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By new test added to `JsonFunctionsSuite`

Closes #26563 from MaxGekk/fix-truncation-by-json_tuple-2.4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2019-11-20T07:32:42Z

thanks, merging to 2.4!

MaxGekk added 2 commits November 17, 2019 22:09

Add a test to JsonFunctionsSuite

cc77903

Add a config to control the optimization

a28f90c

MaxGekk changed the title ~~[SPARK-29758][SQL][2.4] Fix truncation of request string field in json_tuple~~ [SPARK-29758][SQL][2.4] Fix truncation of requested string fields in json_tuple Nov 17, 2019

dongjoon-hyun added the SQL label Nov 17, 2019

cloud-fan reviewed Nov 18, 2019

Remove the optimization and new config

2ec9120

Merge remote-tracking branch 'remotes/origin/branch-2.4' into fix-tru…

f4fd00f

…ncation-by-json_tuple-2.4 # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala

cloud-fan closed this Nov 20, 2019

MaxGekk deleted the fix-truncation-by-json_tuple-2.4 branch June 5, 2020 19:41

MaxGekk wants to merge 4 commits into apache:branch-2.4 from MaxGekk:fix-truncation-by-json_tuple-2.4
