[SPARK-31008][SQL] Support json_array_length function #27759
iRakson wants to merge 14 commits into apache:master from iRakson:jsonArrayLength
Conversation
There are cases where we need only JSON arrays whose size is greater than a certain threshold. At the moment we have to fully parse every JSON array, and only then can we decide whether that array is needed or not.
You can avoid deep parsing by specifying string as the element type. For example:

```
scala> val df = Seq("""[{"a":1}, {"a": 2}]""").toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]

scala> df.select(size(from_json($"json", ArrayType(StringType)))).show
+---------------------+
|size(from_json(json))|
+---------------------+
|                    2|
+---------------------+
```

It does actually the same as your expression; maybe it is less optimal.
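The trick above can be mirrored off-Spark to see what it computes: treat each top-level element as an uninterpreted string and count the pieces. Below is a minimal, dependency-free sketch; `shallowElements` is a hypothetical helper, not Spark or Jackson code, and not a full JSON parser (brackets or commas inside quoted strings would confuse it):

```scala
// Toy analogue of size(from_json(col, ArrayType(StringType))):
// split the outermost JSON array into raw element strings without
// interpreting them, then take the size.
// NOTE: hedged sketch only; a real implementation tokenizes the JSON.
def shallowElements(json: String): Option[Seq[String]] = {
  val s = json.trim
  if (!s.startsWith("[") || !s.endsWith("]")) return None
  val inner = s.substring(1, s.length - 1).trim
  if (inner.isEmpty) return Some(Seq.empty)
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val cur = new StringBuilder
  var depth = 0 // nesting level of [] and {}
  for (c <- inner) {
    if (c == ',' && depth == 0) { out += cur.toString.trim; cur.clear() }
    else {
      if (c == '[' || c == '{') depth += 1
      if (c == ']' || c == '}') depth -= 1
      cur += c
    }
  }
  out += cur.toString.trim
  Some(out.toSeq)
}
```

For instance, `shallowElements("""[{"a":1}, {"a": 2}]""").map(_.size)` gives `Some(2)`, matching the `size(from_json(...))` output above.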
I believe a public API might serve better, as users are more familiar with it.
Could you check more databases, e.g., Oracle, SQL Server, Snowflake, ...?
After checking the databases, I found out that some of these DBMSs support this function, while others do not.
ok to test
```scala
      > SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
       5
  """,
  since = "3.0.0"
```

```scala
 * A function that returns number of elements in outer Json Array.
 */
@ExpressionDescription(
  usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
```
Can you also add an arguments description?

Arguments description is added.

nit: I'd like a consistent word: "length of" or "number of"?
Test build #119263 has finished for PR 27759 at commit
```scala
  """,
  since = "3.1.0"
)
case class LengthOfJsonArray(child: Expression)
```
The semantics of LengthOfJsonArray are similar to Size + JsonToStructs. What about extending RuntimeReplaceable and mapping LengthOfJsonArray to a combination of existing expressions?

I was thinking about this, but it seems JsonToStructs requires a schema, which is unknown in this expression.
```scala
override def prettyName: String = "json_array_length"
```
```scala
override def eval(input: InternalRow): Any = {
@transient
```

Is this annotation really needed?
```scala
private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  // Counter for length of array
  var array_length: Int = 0;
```
Variable names should be in camelCase, see https://github.com/databricks/scala-style-guide#variable-naming-convention

I forgot to change the name. I will rename it, and the comment will be removed as well.
```scala
}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  // Counter for length of array
```
I think the comment can be removed. I would rename `array_length` to `counter` or `length`.
```scala
    }
  }
} catch {
  case _: JsonProcessingException => null
```
nextToken can throw IOException, see:

```java
public abstract JsonToken nextToken() throws IOException;
```

Just in case, the exceptions are handled by JacksonParser.

I will handle IOException. I missed that. Thanks for pointing it out.
```scala
  Examples:
    > SELECT _FUNC_('[1,2,3,4]');
     4
    > SELECT _FUNC_('[1,2,3,{"f1":1,"f2":[5,6]},4]');
```
Should the expression support arrays of different element types? For instance, from_json() can parse arrays of only one particular type, so you could get the length but not parse the array with from_json.

Actually, I wanted it to work like the other json_array_length functions, which take any input and return the length. I can change its implementation if required.
```scala
  throw new AnalysisException(s"$prettyName can only be called on Json Array.")
}
// Keep traversing until the end of Json Array
while(parser.nextToken() != JsonToken.END_ARRAY) {
```
Can nextToken return null? Looks like it can.

It returns null when the end of input is reached. If it returns null before returning END_ARRAY, then our JSON is invalid, and invalid input was already handled. Anyway, I will now add one more check for null.
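The loop's semantics can be illustrated without Spark or Jackson. Below is a hedged, dependency-free sketch of what the expression computes; `jsonArrayLengthSketch` is a hypothetical helper (the real implementation walks Jackson tokens and handles quoted strings correctly, which this toy version does not):

```scala
// Counts elements of the outermost JSON array by tracking nesting
// depth, mimicking the "walk tokens until END_ARRAY" loop above.
// NOTE: toy sketch only; brackets inside quoted strings are not handled.
def jsonArrayLengthSketch(json: String): Option[Int] = {
  val s = json.trim
  // Invalid or non-array input yields None (SQL NULL in the real function).
  if (!s.startsWith("[") || !s.endsWith("]")) return None
  val inner = s.substring(1, s.length - 1).trim
  if (inner.isEmpty) return Some(0) // "[]" has length 0
  var depth = 0 // how deep we are inside nested [] / {}
  var count = 1 // elements = top-level commas + 1
  for (c <- inner) c match {
    case '[' | '{' => depth += 1
    case ']' | '}' => depth -= 1
    case ',' if depth == 0 => count += 1
    case _ => // other characters do not affect the count
  }
  Some(count)
}
```

On `"""[1,2,3,{"f1":1,"f2":[5,6]},4]"""` this yields `Some(5)`, matching the `_FUNC_` example in the docs, while truncated input such as `"[1,2,3,4,5"` yields `None`, mirroring the NULL result discussed here.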
Test build #119279 has finished for PR 27759 at commit

Test build #119303 has finished for PR 27759 at commit
```scala
}

/**
 * A function that returns number of elements in outer Json Array.
```
```scala
usage = "_FUNC_(jsonArray) - Returns length of the jsonArray",
arguments = """
  jsonArray - A JSON array is required as argument. `Analysis Exception` is thrown if any other
    valid JSON expression is passed. `NULL` is returned in case of invalid JSON.
```
Can you follow the other format? e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L292

OK, I will make the changes.
Test build #119365 has finished for PR 27759 at commit

Test build #119367 has finished for PR 27759 at commit

retest this please

@HyukjinKwon @maropu @MaxGekk I have handled the review comments from my side. Please review once.
```scala
arguments = """
  Arguments:
    * jsonArray - A JSON array. An exception is thrown if any other valid JSON strings are passed.
      `NULL` is returned in case of an invalid JSON.
```
Shall we mention NULL input explicitly?

```diff
- `NULL` is returned in case of an invalid JSON.
+ `NULL` is returned in case of `NULL` or an invalid JSON.
```
```scala
}

/**
 * A function that returns the number of elements in outer JSON array.
```
Since JSON can be a nested structure, there might be multiple inner and outer arrays. Can we use "outermost" instead of "outer"?
```scala
}

private def parseCounter(parser: JsonParser, input: InternalRow): Int = {
  var length = 0;
```
Ur, shall we remove the `;` since this is Scala? Please check the other code together.
```scala
    df.selectExpr("json_array_length(json)")
  }.getMessage
  assert(errMsg.contains("due to data type mismatch"))
}
```
Shall we remove this? It is already covered more extensively in json-functions.sql.
Please update the PR description, too. It's important because it will become the commit log.

Updated. :)

Test build #120817 has finished for PR 27759 at commit
```scala
    options.asJava)),
  Seq(Row("string")))
}
```

Thank you for the deletion, but to be complete, you need to restore the master branch version. You may need something like the following:

```shell
git checkout -f master sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
```
```scala
("""[1,2,3,[33,44],{"key":[2,3,4]}]""", 5),
("""[1,2,3,4,5""", null),
("""Random String""", null)
).foreach{
```

nit: `).foreach{` -> `).foreach {`
```scala
("""""", null),
("""[1,2,3]""", 3),
("""[]""", 0),
("""[[1],[2,3],[]]""", 3),
```

This is different from the example I gave you. Do you have any reason to prefer `"""` here? Usually, Apache Spark prefers the simple form `"` over `"""`:

```scala
Seq(
  ("", null),
  ("[]", 0),
  ("[1,2,3]", 3),
  ("[[1],[2,3],[]]", 3),
```
```scala
while(parser.nextToken() != JsonToken.END_ARRAY) {
  // Null indicates end of input.
  if (parser.currentToken == null) {
    throw new IllegalArgumentException("Please provide a valid JSON array.")
```

I'm wondering if we can have test coverage for this code path. Otherwise, it can be considered dead code.

Yeah, it is unreachable code: if we encounter null before END_ARRAY, then our JSON is invalid, and invalid input is already handled. I will remove this check.
Hi, @iRakson. Thank you for updating. I left only a few comments. The other things look okay to me.

@dongjoon-hyun Done.

Test build #120860 has finished for PR 27759 at commit

retest this please

Test build #120862 has finished for PR 27759 at commit

@cloud-fan I made the changes which you asked for in #27836.

Test build #120872 has finished for PR 27759 at commit
dongjoon-hyun left a comment:

+1, LGTM. Thank you, @iRakson and all!
Merged to master for Apache Spark 3.1.0.
Thank you all for patiently reviewing the PR.
### What changes were proposed in this pull request?
At the moment we do not have any function to compute the length of a JSON array directly.
I propose a `json_array_length` function which returns the length of the outermost JSON array.
- If the JSON array is valid, this function returns the length of the outermost JSON array.
```
scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
| 5|
+--------------------------------------------------+
scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
| 2|
+------------------------------+
```
- In case of any other valid JSON string, an invalid JSON string, a null array, or `NULL` input, `NULL` will be returned.
```
scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
| null|
+-------------------+
```
### Why are the changes needed?
- As mentioned in JIRA, this function is supported by Presto, PostgreSQL, Redshift, SQLite, MySQL, MariaDB, and IBM DB2.
- For better user experience and ease of use.
```
Performance Result for Json array - [1, 2, 3, 4]
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length 7728 7762 53 1.3 772.8 1.0X
size+from_json 12739 12895 199 0.8 1273.9 0.6X
```
### Does this PR introduce any user-facing change?
Yes, now users can get the length of a JSON array by using `json_array_length`.
### How was this patch tested?
Added UT.
Closes apache#27759 from iRakson/jsonArrayLength.
Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>