Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47922][SQL] Implement the try_parse_json expression #46141

Closed
wants to merge 89 commits into from

Conversation

harshmotw-db
Copy link
Contributor

What changes were proposed in this pull request?

This pull request implements the try_parse_json that runs parse_json on string expressions to extract variants. However, if parse_json throws an exception on a row, the value null is returned.

Why are the changes needed?

Sometimes, columns containing JSON strings may contain some invalid inputs that should be ignored instead of having the whole execution failed because of it.

Does this PR introduce any user-facing change?

Yes, it allows users to run the try_parse_json expression.

How was this patch tested?

Unit tests to check if try_parse_json works just like parse_json on valid inputs, returns null on invalid inputs, and fails on incorrect input data types.

Was this patch authored or co-authored using generative AI tooling?

No

harshmotw-db and others added 30 commits April 11, 2024 10:37
…essions/jsonExpressions.scala

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
// scalastyle:on line.size.limit
case class TryParseJson(expr: Expression, replacement: Expression)
extends RuntimeReplaceable with InheritAnalysisRules {
def this(child: Expression) = this(child, TryEval(ParseJson(child)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of the TryEval approach, as it's more like a temporary workaround. See this TODO: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TryEval.scala#L83

The issue is, TryEval swallows all the exceptions, including the ones from inputs. e.g. try_parse_json(a function that produces JSON string but fails) should still fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. I have reimplemented it and made it similar to the implementation of the try_variant_get expression.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 033ca3e Apr 26, 2024
dongjoon-hyun pushed a commit that referenced this pull request May 7, 2024
### What changes were proposed in this pull request?

The `try_parse_json` expression added in #46141 declares improper output nullability: the `try_` version's output must be marked as nullable. This PR corrects the nullability and adds a test.

### Why are the changes needed?

Incorrectly declaring an expression's output as non-nullable when it is actually nullable may lead to crashes.

### Does this PR introduce _any_ user-facing change?

Yes, it affects output nullability and thus may affect query result schemas.

### How was this patch tested?

New unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46409 from JoshRosen/fix-try-parse-json-nullability.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
### What changes were proposed in this pull request?

This pull request implements the `try_parse_json` that runs `parse_json` on string expressions to extract variants. However, if `parse_json` throws an exception on a row, the value `null` is returned.

### Why are the changes needed?

Sometimes, columns containing JSON strings may contain some invalid inputs that should be ignored instead of having the whole execution failed because of it.

### Does this PR introduce _any_ user-facing change?

Yes, it allows users to run the `try_parse_json` expression.

### How was this patch tested?

Unit tests to check if `try_parse_json` works just like `parse_json` on valid inputs, returns `null` on invalid inputs, and fails on incorrect input data types.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46141 from harshmotw-db/try_parse_json.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
### What changes were proposed in this pull request?

The `try_parse_json` expression added in apache#46141 declares improper output nullability: the `try_` version's output must be marked as nullable. This PR corrects the nullability and adds a test.

### Why are the changes needed?

Incorrectly declaring an expression's output as non-nullable when it is actually nullable may lead to crashes.

### Does this PR introduce _any_ user-facing change?

Yes, it affects output nullability and thus may affect query result schemas.

### How was this patch tested?

New unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46409 from JoshRosen/fix-try-parse-json-nullability.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants