[SPARK-31106][SQL] Support valid_json function #28167

iRakson · 2020-04-09T10:35:03Z

What changes were proposed in this pull request?

In this PR, I am proposing a JSON function valid_json.
This function will check whether a given string is valid JSON string or not. It it is valid it will return true, if it is invalid it will return false. In case of NULL input NULL will be returned.

scala> sql("select valid_json('{[]}')").show
+----------------+
|valid_json({[]})|
+----------------+
|           false|
+----------------+

scala> sql("select valid_json('[1, 2, 3]')").show
+---------------------+
|valid_json([1, 2, 3])|
+---------------------+
|                 true|
+---------------------+

scala> sql("select valid_json(null)").show
+--------------------------------+
|valid_json(CAST(NULL AS STRING))|
+--------------------------------+
|                            null|
+--------------------------------+

Why are the changes needed?

This is one the most basic and required functions when we are dealing with JSON data.
Say, user is reading a table then he can segregate the table into rows with valid JSON and rows with invalid JSON. Now user only needs to process on the rows containing valid/invalid JSON.

This function is supported by some popular databases as well. Some of them are :

MySQL: JSON_VALID
SQL Server: ISJSON
Oracle: IS JSON
IBM Db2: IS JSON
Sqlite: json_valid
MariaDB: JSON_VALID
Amazon Redshift: is_valid_json

Does this PR introduce any user-facing change?

Yes, Now users can use valid_json() function to verify whether the given JSON string is valid or not.

How was this patch tested?

Pass the Jenkins with newly added test cases.

iRakson · 2020-04-09T10:35:46Z

@HyukjinKwon @maropu @dongjoon-hyun @MaxGekk
Kindly review this PR.

HyukjinKwon · 2020-04-09T11:07:33Z

@iRakson, how many functions do you target to add? I would like to avoid adding a bunch of functions just for the sake of matching.

iRakson · 2020-04-09T11:19:23Z

@iRakson, how many functions do you target to add? I would like to avoid adding a bunch of functions just for the sake of matching.

This function and two aggregate functions, if you are ok with these.
I have listed them here -> SPARK-31103

json_array_agg and json_object_agg -> These functions can be very useful with Group By queries.

HyukjinKwon · 2020-04-09T11:47:12Z

Can you clarify the usage and benefit? in particular, considering we can easily work around via from/to_json

iRakson · 2020-04-09T12:13:05Z

Can you clarify the usage and benefit? in particular, considering we can easily work around via from/to_json

Although these aggregate functions can be a little faster than working around with to_json, I agree implementing them might not worth it. So I will not implement them. Thank you.

iRakson · 2020-04-09T12:18:04Z

is_json will perform better than from_json though, as it is only reading the tokens and nothing else.

HyukjinKwon · 2020-04-09T12:21:37Z

Is is_json last one you want to add for now?

iRakson · 2020-04-09T12:23:41Z

Is is_json last one you want to add for now?

Yes.
If we can improve performance significantly using those aggregate functions then may be those aggregate functions can be added.
I already updated the jiras for those methods. You can mark them resolved.

HyukjinKwon · 2020-04-09T12:43:16Z

ok to test

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala

MaxGekk · 2020-04-09T15:58:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

+        parser => {
+          // parse the JSON string
+          while (parser.nextToken() != null) {
+            parser.skipChildren()


Do you check that only top level fields are valid?

no, nested fields are checked by JsonParser as well.

SparkQA · 2020-04-09T18:11:13Z

Test build #121032 has finished for PR 28167 at commit d427915.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class IsJson(child: Expression) extends UnaryExpression with CodegenFallback

SparkQA · 2020-04-09T21:32:21Z

Test build #121042 has finished for PR 28167 at commit 6f7d569.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2020-04-10T04:21:43Z

@iRakson Could you provide the links?

MySQL
SQL Server
IBM Db2
Sqlite
MariaDB
Amazon Redshift

iRakson · 2020-04-10T05:06:56Z

@iRakson Could you provide the links?

MySQL

SQL Server

IBM Db2

Sqlite

MariaDB

Amazon Redshift

MySQL
SQL Server
IBM Db2
Sqlite
MariaDB
Amazon Redshift

Names may differ across DBMSs.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

maropu · 2020-04-10T08:09:56Z

Names may differ across DBMSs.

How did you decide the is_json name? It seems the name is different from the others in the systems above.

iRakson · 2020-04-10T08:18:42Z

Names may differ across DBMSs.

How did you decide the is_json name? It seems the name is different from the others in the systems above.

Actually ISJSON seemed more intuitive to me. And in order to follow other JSON function names get_json_object, from_json , i chose is_json.
But looking at the other DBMSs function names we can name it valid_json.
WDYT?

maropu · 2020-04-10T08:25:54Z

I don't have strong preference though, valid_json looks better to me.

SparkQA · 2020-04-10T14:18:29Z

Test build #121075 has finished for PR 28167 at commit c06abe0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ValidJson(child: Expression) extends UnaryExpression with CodegenFallback

dongjoon-hyun · 2020-04-12T20:27:45Z

Hi, @iRakson . Please update the PR description with the following information which were requested by @gatorsmile .

[SPARK-31106][SQL] Support valid_json function #28167 (comment)

iRakson · 2020-04-13T03:49:28Z

Hi, @iRakson . Please update the PR description with the following information which were requested by @gatorsmile .

#28167 (comment)

Updated.

maropu · 2020-04-13T07:08:56Z

is_json -> valid_json in the PR description.

iRakson · 2020-04-13T09:15:21Z

is_json -> valid_json in the PR description.

@maropu Updated.

dongjoon-hyun · 2020-04-14T18:40:50Z

Retest this please.

dongjoon-hyun · 2020-04-14T19:09:54Z

I saw the discussion on the naming in the above comments. The naming is always difficult.

The current one looks like too-Apache-Spark-specific.

VALID_JSON: This PR.
JSON_VALID: MySQL, MariaDB, Sqlite
ISJSON: SQL Server
IS JSON predicate: Oracle, IBM Db2
is_valid_json: Amazon Redshift

@maropu , @gatorsmile , and @HyukjinKwon . Although VALID_JSON have a correct meaning, shall we avoid adding a new dialect in the SQL world by our hands?

SparkQA · 2020-04-14T23:38:46Z

Test build #121288 has finished for PR 28167 at commit c06abe0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ValidJson(child: Expression) extends UnaryExpression with CodegenFallback

maropu · 2020-04-15T09:27:26Z

Yea, thanks for your suggestion, @dongjoon-hyun. Since all the existing names are totally different, I suggested the clear meaning one above. Actually, I'm still not sure we need this function because we already the workaround for this purpose as @HyukjinKwon suggested above.

gatorsmile · 2020-04-15T14:58:52Z

FYI, IBM Db2 for I is neither DB2 for LUW nor DB2 for zOS. DB2 for I is not widely used. Please get rid of "IBM Db2: IS JSON"

gatorsmile · 2020-04-15T15:09:52Z

@iRakson @maropu @dongjoon-hyun @HyukjinKwon @MaxGekk I checked the above reference links. They are very helpful. There are many different JSON-related functions? Could we first stop merging new functions first?

Let us do more investigation, discuss which APIs we need to add and why they are needed?

iRakson · 2020-04-15T15:52:54Z

@iRakson @maropu @dongjoon-hyun @HyukjinKwon @MaxGekk I checked the above reference links. They are very helpful. There are many different JSON-related functions? Could we first stop merging new functions first?

Let us do more investigation, discuss which APIs we need to add and why they are needed?

Yeah, I think this is much better approach.

HyukjinKwon · 2020-04-16T00:39:35Z

I am good to stop it for now, yes. I have the same concern (#28167 (comment)).

dongjoon-hyun · 2020-04-16T00:50:09Z

Thanks, @maropu , @gatorsmile , @iRakson , @HyukjinKwon . +1 for more invesigations on JSON effort.

yaooqinn · 2020-04-18T06:06:27Z

Since 3.0, we are able to inject functions. Maybe we can create a separate sub-project or third-party project to maintain dialect functions from other platforms instead of adding them directly to Spark. For example, I recently created https://github.com/yaooqinn/spark-func-extras to achieve such a goal(Maybe we can move this project to a proper place to maintain). Functions in this project can overwrite spark builtin ones to keep the meaning of the platforms where they come from if conflicts occur. Also, some function candidates may eventually go into the spark’s master branch if they turn out that they are quite useful and do not conflict.

cc @cloud-fan @maropu, @gatorsmile, @HyukjinKwon, @dongjoon-hyun, thanks.

github-actions · 2020-07-28T00:32:58Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-31006] Support is_json function

d427915

iRakson changed the title ~~[SPARK-31006] Support is_json function~~ [SPARK-31106][SQL] Support is_json function Apr 9, 2020

MaxGekk reviewed Apr 9, 2020

View reviewed changes

check nested invalid objects

6f7d569

dongjoon-hyun added the SQL label Apr 10, 2020

maropu reviewed Apr 10, 2020

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala Show resolved Hide resolved

rename function

c06abe0

maropu changed the title ~~[SPARK-31106][SQL] Support is_json function~~ [SPARK-31106][SQL] Support valid_json function Apr 13, 2020

github-actions bot added the Stale label Jul 28, 2020

github-actions bot closed this Jul 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-31106][SQL] Support valid_json function #28167

[SPARK-31106][SQL] Support valid_json function #28167

iRakson commented Apr 9, 2020 •

edited by dongjoon-hyun

iRakson commented Apr 9, 2020

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020 •

edited

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020

iRakson commented Apr 9, 2020

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020 •

edited

HyukjinKwon commented Apr 9, 2020

MaxGekk Apr 9, 2020

iRakson Apr 9, 2020

SparkQA commented Apr 9, 2020

SparkQA commented Apr 9, 2020

gatorsmile commented Apr 10, 2020

iRakson commented Apr 10, 2020

maropu commented Apr 10, 2020

iRakson commented Apr 10, 2020

maropu commented Apr 10, 2020

SparkQA commented Apr 10, 2020

dongjoon-hyun commented Apr 12, 2020

iRakson commented Apr 13, 2020

maropu commented Apr 13, 2020

iRakson commented Apr 13, 2020 •

edited

dongjoon-hyun commented Apr 14, 2020

dongjoon-hyun commented Apr 14, 2020 •

edited

SparkQA commented Apr 14, 2020

maropu commented Apr 15, 2020 •

edited

gatorsmile commented Apr 15, 2020

gatorsmile commented Apr 15, 2020

iRakson commented Apr 15, 2020

HyukjinKwon commented Apr 16, 2020

dongjoon-hyun commented Apr 16, 2020

yaooqinn commented Apr 18, 2020

github-actions bot commented Jul 28, 2020

[SPARK-31106][SQL] Support valid_json function #28167

[SPARK-31106][SQL] Support valid_json function #28167

Conversation

iRakson commented Apr 9, 2020 • edited by dongjoon-hyun

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

iRakson commented Apr 9, 2020

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020 • edited

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020

iRakson commented Apr 9, 2020

HyukjinKwon commented Apr 9, 2020

iRakson commented Apr 9, 2020 • edited

HyukjinKwon commented Apr 9, 2020

MaxGekk Apr 9, 2020

Choose a reason for hiding this comment

iRakson Apr 9, 2020

Choose a reason for hiding this comment

SparkQA commented Apr 9, 2020

SparkQA commented Apr 9, 2020

gatorsmile commented Apr 10, 2020

iRakson commented Apr 10, 2020

maropu commented Apr 10, 2020

iRakson commented Apr 10, 2020

maropu commented Apr 10, 2020

SparkQA commented Apr 10, 2020

dongjoon-hyun commented Apr 12, 2020

iRakson commented Apr 13, 2020

maropu commented Apr 13, 2020

iRakson commented Apr 13, 2020 • edited

dongjoon-hyun commented Apr 14, 2020

dongjoon-hyun commented Apr 14, 2020 • edited

SparkQA commented Apr 14, 2020

maropu commented Apr 15, 2020 • edited

gatorsmile commented Apr 15, 2020

gatorsmile commented Apr 15, 2020

iRakson commented Apr 15, 2020

HyukjinKwon commented Apr 16, 2020

dongjoon-hyun commented Apr 16, 2020

yaooqinn commented Apr 18, 2020

github-actions bot commented Jul 28, 2020

iRakson commented Apr 9, 2020 •

edited by dongjoon-hyun

iRakson commented Apr 9, 2020 •

edited

iRakson commented Apr 9, 2020 •

edited

iRakson commented Apr 13, 2020 •

edited

dongjoon-hyun commented Apr 14, 2020 •

edited

maropu commented Apr 15, 2020 •

edited