Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31106][SQL] Support valid_json function #28167

Closed
wants to merge 3 commits into from

Conversation

iRakson
Copy link
Contributor

@iRakson iRakson commented Apr 9, 2020

What changes were proposed in this pull request?

In this PR, I am proposing a JSON function valid_json.
This function will check whether a given string is valid JSON string or not. It it is valid it will return true, if it is invalid it will return false. In case of NULL input NULL will be returned.

scala> sql("select valid_json('{[]}')").show
+----------------+
|valid_json({[]})|
+----------------+
|           false|
+----------------+

scala> sql("select valid_json('[1, 2, 3]')").show
+---------------------+
|valid_json([1, 2, 3])|
+---------------------+
|                 true|
+---------------------+

scala> sql("select valid_json(null)").show
+--------------------------------+
|valid_json(CAST(NULL AS STRING))|
+--------------------------------+
|                            null|
+--------------------------------+

Why are the changes needed?

This is one the most basic and required functions when we are dealing with JSON data.
Say, user is reading a table then he can segregate the table into rows with valid JSON and rows with invalid JSON. Now user only needs to process on the rows containing valid/invalid JSON.

This function is supported by some popular databases as well. Some of them are :

Does this PR introduce any user-facing change?

Yes, Now users can use valid_json() function to verify whether the given JSON string is valid or not.

How was this patch tested?

Pass the Jenkins with newly added test cases.

@iRakson
Copy link
Contributor Author

iRakson commented Apr 9, 2020

@HyukjinKwon @maropu @dongjoon-hyun @MaxGekk
Kindly review this PR.

@iRakson iRakson changed the title [SPARK-31006] Support is_json function [SPARK-31106][SQL] Support is_json function Apr 9, 2020
@HyukjinKwon
Copy link
Member

@iRakson, how many functions do you target to add? I would like to avoid adding a bunch of functions just for the sake of matching.

@iRakson
Copy link
Contributor Author

iRakson commented Apr 9, 2020

@iRakson, how many functions do you target to add? I would like to avoid adding a bunch of functions just for the sake of matching.

This function and two aggregate functions, if you are ok with these.
I have listed them here -> SPARK-31103

  • json_array_agg and json_object_agg -> These functions can be very useful with Group By queries.

@HyukjinKwon
Copy link
Member

Can you clarify the usage and benefit? in particular, considering we can easily work around via from/to_json

@iRakson
Copy link
Contributor Author

iRakson commented Apr 9, 2020

Can you clarify the usage and benefit? in particular, considering we can easily work around via from/to_json

Although these aggregate functions can be a little faster than working around with to_json, I agree implementing them might not worth it. So I will not implement them. Thank you.

@iRakson
Copy link
Contributor Author

iRakson commented Apr 9, 2020

is_json will perform better than from_json though, as it is only reading the tokens and nothing else.

@HyukjinKwon
Copy link
Member

Is is_json last one you want to add for now?

@iRakson
Copy link
Contributor Author

iRakson commented Apr 9, 2020

Is is_json last one you want to add for now?

Yes.
If we can improve performance significantly using those aggregate functions then may be those aggregate functions can be added.
I already updated the jiras for those methods. You can mark them resolved.

@HyukjinKwon
Copy link
Member

ok to test

parser => {
// parse the JSON string
while (parser.nextToken() != null) {
parser.skipChildren()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you check that only top level fields are valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, nested fields are checked by JsonParser as well.

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #121032 has finished for PR 28167 at commit d427915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IsJson(child: Expression) extends UnaryExpression with CodegenFallback

@SparkQA
Copy link

SparkQA commented Apr 9, 2020

Test build #121042 has finished for PR 28167 at commit 6f7d569.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@iRakson Could you provide the links?

  • MySQL
  • SQL Server
  • IBM Db2
  • Sqlite
  • MariaDB
  • Amazon Redshift

@iRakson
Copy link
Contributor Author

iRakson commented Apr 10, 2020

@iRakson Could you provide the links?

  • MySQL
  • SQL Server
  • IBM Db2
  • Sqlite
  • MariaDB
  • Amazon Redshift

MySQL
SQL Server
IBM Db2
Sqlite
MariaDB
Amazon Redshift

Names may differ across DBMSs.

@maropu
Copy link
Member

maropu commented Apr 10, 2020

Names may differ across DBMSs.

How did you decide the is_json name? It seems the name is different from the others in the systems above.

@iRakson
Copy link
Contributor Author

iRakson commented Apr 10, 2020

Names may differ across DBMSs.

How did you decide the is_json name? It seems the name is different from the others in the systems above.

Actually ISJSON seemed more intuitive to me. And in order to follow other JSON function names get_json_object, from_json , i chose is_json.
But looking at the other DBMSs function names we can name it valid_json.
WDYT?

@maropu
Copy link
Member

maropu commented Apr 10, 2020

I don't have strong preference though, valid_json looks better to me.

@SparkQA
Copy link

SparkQA commented Apr 10, 2020

Test build #121075 has finished for PR 28167 at commit c06abe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ValidJson(child: Expression) extends UnaryExpression with CodegenFallback

@dongjoon-hyun
Copy link
Member

Hi, @iRakson . Please update the PR description with the following information which were requested by @gatorsmile .

@iRakson
Copy link
Contributor Author

iRakson commented Apr 13, 2020

Hi, @iRakson . Please update the PR description with the following information which were requested by @gatorsmile .

Updated.

@maropu maropu changed the title [SPARK-31106][SQL] Support is_json function [SPARK-31106][SQL] Support valid_json function Apr 13, 2020
@maropu
Copy link
Member

maropu commented Apr 13, 2020

is_json -> valid_json in the PR description.

@iRakson
Copy link
Contributor Author

iRakson commented Apr 13, 2020

is_json -> valid_json in the PR description.

@maropu Updated.

@dongjoon-hyun
Copy link
Member

Retest this please.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Apr 14, 2020

I saw the discussion on the naming in the above comments. The naming is always difficult.

The current one looks like too-Apache-Spark-specific.

  • VALID_JSON: This PR.
  • JSON_VALID: MySQL, MariaDB, Sqlite
  • ISJSON: SQL Server
  • IS JSON predicate: Oracle, IBM Db2
  • is_valid_json: Amazon Redshift

@maropu , @gatorsmile , and @HyukjinKwon . Although VALID_JSON have a correct meaning, shall we avoid adding a new dialect in the SQL world by our hands?

@SparkQA
Copy link

SparkQA commented Apr 14, 2020

Test build #121288 has finished for PR 28167 at commit c06abe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ValidJson(child: Expression) extends UnaryExpression with CodegenFallback

@maropu
Copy link
Member

maropu commented Apr 15, 2020

Yea, thanks for your suggestion, @dongjoon-hyun. Since all the existing names are totally different, I suggested the clear meaning one above. Actually, I'm still not sure we need this function because we already the workaround for this purpose as @HyukjinKwon suggested above.

@gatorsmile
Copy link
Member

FYI, IBM Db2 for I is neither DB2 for LUW nor DB2 for zOS. DB2 for I is not widely used. Please get rid of "IBM Db2: IS JSON"

@gatorsmile
Copy link
Member

@iRakson @maropu @dongjoon-hyun @HyukjinKwon @MaxGekk I checked the above reference links. They are very helpful. There are many different JSON-related functions? Could we first stop merging new functions first?

Let us do more investigation, discuss which APIs we need to add and why they are needed?

@iRakson
Copy link
Contributor Author

iRakson commented Apr 15, 2020

@iRakson @maropu @dongjoon-hyun @HyukjinKwon @MaxGekk I checked the above reference links. They are very helpful. There are many different JSON-related functions? Could we first stop merging new functions first?

Let us do more investigation, discuss which APIs we need to add and why they are needed?

Yeah, I think this is much better approach.

@HyukjinKwon
Copy link
Member

I am good to stop it for now, yes. I have the same concern (#28167 (comment)).

@dongjoon-hyun
Copy link
Member

Thanks, @maropu , @gatorsmile , @iRakson , @HyukjinKwon . +1 for more invesigations on JSON effort.

@yaooqinn
Copy link
Member

Since 3.0, we are able to inject functions. Maybe we can create a separate sub-project or third-party project to maintain dialect functions from other platforms instead of adding them directly to Spark. For example, I recently created https://github.com/yaooqinn/spark-func-extras to achieve such a goal(Maybe we can move this project to a proper place to maintain). Functions in this project can overwrite spark builtin ones to keep the meaning of the platforms where they come from if conflicts occur. Also, some function candidates may eventually go into the spark’s master branch if they turn out that they are quite useful and do not conflict.

cc @cloud-fan @maropu, @gatorsmile, @HyukjinKwon, @dongjoon-hyun, thanks.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 28, 2020
@github-actions github-actions bot closed this Jul 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants