[SPARK-48148][CORE] JSON objects should not be modified when read as STRING #46408

Closed
wants to merge 9 commits into from

Contributor

@eric-maynard eric-maynard commented May 6, 2024

What changes were proposed in this pull request?

Currently, when reading a JSON like this:

{"a": {"b": -999.99999999999999999999999999999999995}}

With the schema:

a STRING

Spark will yield a result like this:

{"b": -1000.0}

Other modifications, such as changes to the input string's whitespace, may also occur. In some cases, Spark rewrites an input floating-point number in scientific notation when reading it as STRING.

This applies to reading JSON files (as with spark.read.json) as well as the SQL expression from_json.
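
As a rough illustration of why the value can change, the sketch below parses a numeric literal as a `Double` and prints it back. This is only an assumption about the kind of round trip that produces results like `-1000.0` or scientific notation; it does not reproduce Spark's actual code path, and `NumberRoundTrip` and `viaDouble` are hypothetical names.

```scala
// Illustration only: shows how a decimal literal changes when it is parsed
// as a Double and printed again. `NumberRoundTrip` and `viaDouble` are
// hypothetical names, not Spark APIs.
object NumberRoundTrip {
  // Parse the literal as a double-precision float, then render it as text.
  def viaDouble(literal: String): String = literal.toDouble.toString

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "-999.99999999999999999999999999999999995", // more digits than a Double can hold
      "0.00001"                                   // small magnitude, printed in scientific notation
    )
    samples.foreach(s => println(s"$s -> ${viaDouble(s)}"))
  }
}
```

The first literal comes back as `-1000.0` and the second as `1.0E-5`, so a later parse of the STRING column no longer sees the original text.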

Why are the changes needed?

Correctness issues may occur if a field is read as a STRING and then later parsed (e.g. with from_json) after the contents have been modified.

Does this PR introduce any user-facing change?

Yes. When a non-string field of a JSON object is read with the STRING type, Spark now extracts the field exactly as it appears in the input.

How was this patch tested?

Added a test in JsonSuite.scala

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 6, 2024

val df = spark.read.schema("data STRING").json(path.getAbsolutePath)

val expected = s"""{"v": ${granularFloat}}"""
Contributor


Can you add more test cases for the following?

  • {"data": {"v": "abc"}}, expected: "{"v": "abc"}"
  • {"data":{"v": "0.999"}}, expected: "{"v": "0.999"}"
  • {"data": [1, 2, 3]}, expected: "[1, 2, 3]"
  • {"data": }, expected the object as string.

Contributor Author


Added more tests. Can you clarify the last example and what we expect it to do? It looks like invalid JSON.

Utils.tryWithResource(factory.createGenerator(writer, JsonEncoding.UTF8)) {
generator => generator.copyCurrentStructure(parser)
val startLocation = parser.getTokenLocation
startLocation.contentReference().getRawContent match {
Contributor


Is there an existing API to get the remaining content as string? Also, would it work with multi-line JSON?

Contributor Author


I was not able to find such an existing API. There is Jackson's JsonParser.getText, but that appears to simply return the current value if it is a string value.

Contributor Author


Regarding multiline JSON, I have added a test to cover this. It seems that the raw content of the content reference is not a byte array when using multiline mode.
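
The location-based approach discussed in this thread can be sketched outside Spark roughly as follows. This is a minimal sketch, not the code from this PR: it assumes Jackson 2.x on the classpath, works on an in-memory `String` (so it uses character offsets rather than the byte offsets a byte-array source would need), keeps its own reference to the input instead of going through `contentReference()`, and only handles object or array values. `RawFieldSketch` and `rawField` are hypothetical names.

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Hypothetical helper, not part of Spark or Jackson.
object RawFieldSketch {
  private val factory = new JsonFactory()

  // Returns the exact source text of a top-level field whose value is an
  // object or array, or None if the field is missing or holds a scalar.
  def rawField(json: String, field: String): Option[String] = {
    val parser = factory.createParser(json)
    try {
      if (parser.nextToken() != JsonToken.START_OBJECT) return None
      while (parser.nextToken() == JsonToken.FIELD_NAME) {
        val name = parser.getCurrentName
        val valueToken = parser.nextToken()
        val isContainer =
          valueToken == JsonToken.START_OBJECT || valueToken == JsonToken.START_ARRAY
        if (name == field && isContainer) {
          // Offset of the opening '{' or '[' of the value.
          val start = parser.getTokenLocation.getCharOffset.toInt
          // Advance to the matching closing token without materializing values.
          parser.skipChildren()
          // Offset just past the closing '}' or ']'.
          val end = parser.getCurrentLocation.getCharOffset.toInt
          return Some(json.substring(start, end))
        }
        // Not the requested field: skip its value (no-op for scalars).
        parser.skipChildren()
      }
      None
    } finally {
      parser.close()
    }
  }
}
```

For example, `RawFieldSketch.rawField(input, "a")` on the input from the PR description is meant to return the nested object's text unchanged, digits and whitespace included. Slicing the original input avoids re-serializing numbers, which is where the precision loss comes from.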

Contributor

sadikovi commented May 8, 2024

cc @dongjoon-hyun @HyukjinKwon

@HyukjinKwon
Member

SPARK-48148: values are unchanged when read as string *** FAILED *** (134 milliseconds)

It seems this test fails.

expectedExactData = Seq(s"""{"v": ${granularFloat}}""")
)
// In multiLine, we fall back to the inexact method:
extractData(
s"""{"data": {"white":\n"space"}}""",
Contributor Author


This makes \n no longer function as a newline here

@eric-maynard
Contributor Author

Hey @HyukjinKwon, can you take another look and possibly re-trigger the tests? I believe multiline should be working now.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

By the way, you can trigger the tests on your own: https://github.com/eric-maynard/spark/runs/24789350525 I can't trigger them :-).

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…STRING

Closes apache#46408 from eric-maynard/SPARK-48148.

Lead-authored-by: Eric Maynard <eric.maynard@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>