Skip to content

[SPARK-29758][SQL][2.4] Fix truncation of requested string fields in json_tuple#26563

Closed
MaxGekk wants to merge 4 commits intoapache:branch-2.4from
MaxGekk:fix-truncation-by-json_tuple-2.4
Closed

[SPARK-29758][SQL][2.4] Fix truncation of requested string fields in json_tuple#26563
MaxGekk wants to merge 4 commits intoapache:branch-2.4from
MaxGekk:fix-truncation-by-json_tuple-2.4

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Nov 17, 2019

What changes were proposed in this pull request?

In the PR, I propose to remove an optimization in json_tuple which causes truncation of results for large requested string fields.

Why are the changes needed?

Spark 2.4 uses Jackson Core 2.6.7 which has a bug in copying string. This bug may lead to truncation of results in some cases. The bug has been already fixed by the commit FasterXML/jackson-core@554f8db which is a part of Jackson Core since the version 2.7.7. Upgrading Jackson Core up to 2.7.7 or later version is risky. That's why this PR propose to avoid using the buggy methods of Jackson Core 2.6.7.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By new test added to JsonFunctionsSuite

@MaxGekk MaxGekk changed the title [SPARK-29758][SQL][2.4] Fix truncation of request string field in json_tuple [SPARK-29758][SQL][2.4] Fix truncation of requested string fields in json_tuple Nov 17, 2019
@dongjoon-hyun
Copy link
Member

cc @gatorsmile , @cloud-fan

@SparkQA
Copy link

SparkQA commented Nov 17, 2019

Test build #113965 has finished for PR 26563 at commit a28f90c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.createWithDefault(false)

val OPTIMIZE_STRING_COPY_IN_JSON_TUPLE =
buildConf("spark.sql.optimizeStringCopyInJsonTuple.enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is misleading. It's for fixing a correctness bug not a perf improvement right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag turns on/off an optimization which may produce wrong results for large fields. For small fields, the optimization can work as tests in JsonFunctionsSuite show. Enabling the optimization is up to users.

How would you name this flag?

// if the user requests a string field it needs to be returned without enclosing
// quotes which is accomplished via JsonGenerator.writeRaw instead of JsonGenerator.write
case JsonToken.VALUE_STRING if parser.hasTextCharacters =>
case JsonToken.VALUE_STRING if optimizeStringCopy && parser.hasTextCharacters =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simply remove this optimization. correctness is critical.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I thought of removing the optimization but just hesitated that it could impact on users who have small fields. ok, let's remove it.

@SparkQA
Copy link

SparkQA commented Nov 18, 2019

Test build #114004 has finished for PR 26563 at commit 2ec9120.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

@HyukjinKwon @gatorsmile any thoughts?

@HyukjinKwon
Copy link
Member

I think it's fine.

@cloud-fan
Copy link
Contributor

@MaxGekk let's resolve conflicts and get this in. Thanks!

…ncation-by-json_tuple-2.4

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
@SparkQA
Copy link

SparkQA commented Nov 19, 2019

Test build #114112 has finished for PR 26563 at commit f4fd00f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TimestampParser(fastDateFormat: FastDateFormat)

cloud-fan pushed a commit that referenced this pull request Nov 20, 2019
…`json_tuple`

### What changes were proposed in this pull request?
In the PR, I propose to remove an optimization in `json_tuple` which causes truncation of results for large requested string fields.

### Why are the changes needed?
Spark 2.4 uses Jackson Core 2.6.7 which has a bug in copying string. This bug may lead to truncation of results in some cases. The bug has been already fixed by the commit FasterXML/jackson-core@554f8db which is a part of Jackson Core since the version 2.7.7. Upgrading Jackson Core up to 2.7.7 or later version is risky. That's why this PR propose to avoid using the buggy methods of Jackson Core 2.6.7.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By new test added to `JsonFunctionsSuite`

Closes #26563 from MaxGekk/fix-truncation-by-json_tuple-2.4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to 2.4!

@cloud-fan cloud-fan closed this Nov 20, 2019
@MaxGekk MaxGekk deleted the fix-truncation-by-json_tuple-2.4 branch June 5, 2020 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants