Add SQL functions for working with Maps and Arrays #1611

blueedgenick · 2018-07-22T00:44:37Z

Description

A new set of SQL functions to make it possible to manipulate columns of type map and array:

array_distinct(array)
array_except(array1, array2)
array_intersect(array1, array2)
array_slice(array, start, length)
array_union(array1, array2)
cardinality(array | map)
element(array, index) | element(map, key) <-- function-based equivalent of array[n] etc.
arrays_to_map(key-array, value-array)
map_keys(map)
map_values(map)
map_union(map1, map2)

All of the above added to syntax guide.

Testing done

Unit tests and a quick manual test of some queries.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")


big-andy-coates

Thanks @blueedgenick, I've left a few comments.

big-andy-coates · 2018-07-24T11:39:08Z

docs/syntax-reference.rst

+| ARRAY_DISTINCT         |  ``ARRAY_DISTINCT(array_col)``                             | Returns an array of all the distinct values from  |
+|                        |                                                            | the input array, or NULL if the input is NULL.    |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAY_EXCEPT           |  ``ARRAY_EXCEPT(array1, array2)``                          | Returns an array of all the distinct values from  |


Should this one return distinct values, or just array1 with the values in array2 filtered out? The latter would make the function more single-purpose. Users can always compose it with ARRAY_DISTINCT if they want a distinct output.

big-andy-coates · 2018-07-24T11:40:05Z

docs/syntax-reference.rst

@@ -983,13 +983,46 @@ Scalar functions
 +========================+============================================================+===================================================+
 | ABS                    | ``ABS(col1)``                                              | The absolute value of a value.                    |
 +------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAY_DISTINCT         |  ``ARRAY_DISTINCT(array_col)``                             | Returns an array of all the distinct values from  |


Is order maintained? It might be worth documenting (and testing!) the ordering for this and the other functions.

big-andy-coates · 2018-07-24T11:43:35Z

docs/syntax-reference.rst

+|                        |                                                            | array1 except for those also present in array2.   |
+|                        |                                                            | NULL is returned if either input array is NULL.   |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAY_INTERSECT        |  ``ARRAY_INTERSECT(array1, array2)``                       | Returns an array of all the distinct elements     |


Likewise, I'm not sure this should be distinct. This is the intersection of arrays, not of sets. ARRAY_INTERSECT([1,2,2,2,3],[2,2,3]) should, IMHO, return [2,2,3].

Presto and Teradata have a similar array_intersect() function, and they specify that the returned array has no duplicates (same for array_except). I think it makes sense to follow the same convention as both databases.

indeed, that's why it was this way to begin with ;)
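
For illustration, the two readings of ARRAY_INTERSECT discussed above can be sketched in plain Java. This is not the code from this PR: the class and method names are hypothetical, and preserving the order of first appearance in the left-hand array is an assumption made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the PR's ArrayIntersectKudf.
final class IntersectSketch {

  // Set semantics: each common element appears once,
  // in order of first appearance in lhs (ordering is an assumption).
  static <T> List<T> intersectDistinct(List<T> lhs, List<T> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Set<T> result = new LinkedHashSet<>(lhs);
    result.retainAll(new HashSet<>(rhs));
    return new ArrayList<>(result);
  }

  // Multiset semantics: an element appears min(count in lhs, count in rhs) times.
  static <T> List<T> intersectMultiset(List<T> lhs, List<T> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Map<T, Integer> remaining = new HashMap<>();
    for (T e : rhs) {
      remaining.merge(e, 1, Integer::sum);
    }
    List<T> result = new ArrayList<>();
    for (T e : lhs) {
      Integer n = remaining.get(e);
      if (n != null && n > 0) {
        result.add(e);
        remaining.put(e, n - 1);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Integer> a = Arrays.asList(1, 2, 2, 2, 3);
    List<Integer> b = Arrays.asList(2, 2, 3);
    System.out.println(intersectDistinct(a, b)); // [2, 3]
    System.out.println(intersectMultiset(a, b)); // [2, 2, 3]
  }
}
```

The multiset variant yields the [2,2,3] result proposed in the review comment, while the distinct variant matches the no-duplicates behaviour described for Presto and Teradata.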

big-andy-coates · 2018-07-24T11:45:06Z

docs/syntax-reference.rst

+|                        |                                                            | the requested length or offset are negative the   |
+|                        |                                                            | entire array is returned.                         |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAY_UNION            |  ``ARRAY_UNION(array1, array2)``                           | Returns an array of all the distinct values from  |


Likewise, don't think this should be distinct.

Similar to above, this is modelled after the comparable functions in other systems (Teradata and Presto). This case is also more "normal" from the RDBMS world: consider the difference between UNION and UNION ALL in standard SQL (the first returns a distinct result list, the second requires an additional keyword to specify "please keep all the duplicates").
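
As a rough analogy to the UNION / UNION ALL distinction, the two behaviours could look like this in Java. The names are hypothetical and this is not the PR's implementation. Keeping first-seen order in the distinct variant is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch, not the PR's ArrayUnionKudf.
final class UnionSketch {

  // Analogous to SQL UNION ALL: plain concatenation, duplicates kept.
  static <T> List<T> unionAll(List<T> a, List<T> b) {
    if (a == null || b == null) {
      return null;
    }
    List<T> out = new ArrayList<>(a);
    out.addAll(b);
    return out;
  }

  // Analogous to SQL UNION: duplicates removed (first-seen order is an assumption).
  static <T> List<T> unionDistinct(List<T> a, List<T> b) {
    List<T> all = unionAll(a, b);
    return all == null ? null : new ArrayList<>(new LinkedHashSet<>(all));
  }

  public static void main(String[] args) {
    List<Integer> a = Arrays.asList(1, 2, 2);
    List<Integer> b = Arrays.asList(2, 3);
    System.out.println(unionAll(a, b));      // [1, 2, 2, 2, 3]
    System.out.println(unionDistinct(a, b)); // [1, 2, 3]
  }
}
```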

big-andy-coates · 2018-07-24T11:45:40Z

docs/syntax-reference.rst

+| ARRAY_UNION            |  ``ARRAY_UNION(array1, array2)``                           | Returns an array of all the distinct values from  |
+|                        |                                                            | all input arrays, or NULL if either input is NULL.|
+------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAYS_TO_MAP          |  ``ARRAYS_TO_MAP(key_array, map_array)``                   | Creates a map from a passed array of keys and an  |


Would benefit from documentation of NULL behaviour, as you have for others.

big-andy-coates · 2018-07-24T12:56:41Z

ksql-engine/src/main/java/io/confluent/ksql/function/udf/map/ArraysToMapKudf.java

+      String thisKey = inputKeys.get(i);
+      Object thisValue = inputValues.get(i);
+      if (thisValue != null) {
+        outputMap.put(inputKeys.get(i), thisValue);


Not sure what your intention is here... the if/else branches both do the same thing.

Does your map type allow null keys? Good one to test & document!
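
On the null-key question: `java.util.HashMap` accepts a single null key and null values, so the behaviour has to be chosen deliberately. The sketch below is a hypothetical version of the function, not the code in this PR. Skipping null keys, keeping null values, and pairing only up to the shorter array are all assumptions made here for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the PR's ArraysToMapKudf.
final class ArraysToMapSketch {

  static Map<String, Object> arraysToMap(List<String> keys, List<?> values) {
    if (keys == null || values == null) {
      return null;
    }
    Map<String, Object> out = new HashMap<>();
    // Assumption: pair entries only up to the shorter of the two inputs.
    int n = Math.min(keys.size(), values.size());
    for (int i = 0; i < n; i++) {
      String key = keys.get(i);
      if (key == null) {
        continue; // Assumption: entries with a null key are dropped.
      }
      out.put(key, values.get(i)); // Null values are kept.
    }
    return out;
  }

  public static void main(String[] args) {
    // HashMap itself does permit a null key.
    Map<String, Object> probe = new HashMap<>();
    probe.put(null, "x");
    System.out.println(probe.containsKey(null)); // true

    Map<String, Object> m =
        arraysToMap(Arrays.asList("a", null, "b"), Arrays.asList(1, 2, 3));
    System.out.println(m.size()); // 2
  }
}
```

Whichever policy is chosen, a test with a null key and one with a null value would pin it down.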

big-andy-coates · 2018-07-24T12:58:34Z

ksql-engine/src/main/java/io/confluent/ksql/function/udf/map/ElementKudf.java

+  }
+
+  @SuppressWarnings("rawtypes")
+  @Udf(description = "Returns the element at the specified index (counting from 0) in an array.")


Isn't SQL convention normally to be base-1, rather than base-0? Or has this changed with newer products?

yes indeed - this was done just to align with the rest of ksql back at the time. seeing as we've subsequently updated other parts of the code to be 1-based, the same change should be applied here too
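
A minimal sketch of what a 1-based lookup could look like. The class name is hypothetical, and returning null for an out-of-range index is an assumption of this sketch, not documented behaviour.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the PR's ElementKudf.
final class ElementSketch {

  // index counts from 1, following the usual SQL convention.
  static <T> T element(List<T> array, Integer index) {
    if (array == null || index == null) {
      return null;
    }
    if (index < 1 || index > array.size()) {
      return null; // Assumption: out-of-range yields null, not an error.
    }
    return array.get(index - 1);
  }

  public static void main(String[] args) {
    List<String> letters = Arrays.asList("a", "b", "c");
    System.out.println(element(letters, 1)); // a
    System.out.println(element(letters, 0)); // null
  }
}
```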

big-andy-coates · 2018-07-24T12:59:41Z

ksql-engine/src/main/java/io/confluent/ksql/function/udf/map/MapUnionKudf.java

+import io.confluent.ksql.function.udf.Udf;
+import io.confluent.ksql.function.udf.UdfDescription;
+
+@UdfDescription(name = "map_concat", author = "Confluent",


It's MAP_UNION in the docs...

big-andy-coates · 2018-07-24T13:00:18Z

ksql-engine/src/test/java/io/confluent/ksql/function/udf/array/ArrayDistinctKudfTest.java

+  @Test
+  public void shouldDistinctArray() {
+    final List<Object> result = udf.distinct(Arrays.asList("foo", " ", "foo", "bar"));
+    assertThat(result, containsInAnyOrder(" ", "bar", "foo"));


I think this should maintain order.

big-andy-coates · 2018-07-24T13:02:06Z

ksql-engine/src/test/java/io/confluent/ksql/function/udf/array/ArrayExceptKudfTest.java

+  @SuppressWarnings("rawtypes")
+  @Test
+  public void shouldReturnNullForNullInput() {
+    List result = udf.except(null, null);


Could you split into two tests, each with one null param, please? (It catches places where one null check is hiding a second NPE).
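
To show why the split matters, here is a deliberately incomplete, hypothetical `exceptBuggy` that only null-checks its first argument. The combined `(null, null)` call still returns null, so a single test would pass, while a call with only the second argument null throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, deliberately buggy sketch to illustrate the testing point.
final class NullCheckSketch {

  static <T> List<T> exceptBuggy(List<T> lhs, List<T> rhs) {
    if (lhs == null) {
      return null; // rhs is never checked
    }
    List<T> out = new ArrayList<>();
    for (T e : lhs) {
      if (!rhs.contains(e)) { // NullPointerException when rhs is null
        out.add(e);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // The early return on lhs hides the missing rhs check.
    System.out.println(exceptBuggy(null, null)); // null

    // Only a test with a single null parameter exposes the problem.
    try {
      exceptBuggy(Arrays.asList(1, 2), null);
      System.out.println("no exception");
    } catch (NullPointerException e) {
      System.out.println("NPE for null rhs");
    }
  }
}
```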

blueedgenick · 2019-01-14T22:09:13Z

Note: this PR was put on hold some months ago due to issues with getting complex types (map/array/struct) into and out of the UDF invocation framework. I think a more recent PR has at least partially alleviated that situation, so I'd be happy to take another pass at this. I do also recall addressing several of @big-andy-coates's code comments from above, not sure what happened to those fixes; likely I have them in a private branch somewhere. Will go dig those up.

spena · 2019-01-14T22:32:20Z

docs/syntax-reference.rst

+|                        |                                                            | those elements which are present in both inputs.  |
+|                        |                                                            | NULL is returned if either input array is NULL.   |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
+| ARRAY_SLICE            |  ``ARRAY_SLICE(array_col, start, length)``                 | Returns a subsection of an array, of requested    |


Presto and Teradata have a similar function called 'slice(x, start, length)' which does the same thing.
Snowflake has a similar function too, called 'array_slice(x, from, to)', which does the same thing (but using the end of the array instead of a length).

This function for KSQL looks pretty similar to Presto and Teradata. Should we use the same name 'slice' instead of 'array_slice'?
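
To make the two signatures concrete, here is a hypothetical sketch of a (start, length) slice and a (from, to) slice built on top of it. The 0-based start, the exclusive `to`, and the clamping to the array bounds are assumptions of this sketch. The "negative arguments return the entire array" rule is taken from the doc text quoted earlier in this review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the PR's slice implementation.
final class SliceSketch {

  // start is 0-based here (an assumption); length is the number of elements wanted.
  static <T> List<T> sliceByLength(List<T> array, int start, int length) {
    if (array == null) {
      return null;
    }
    if (start < 0 || length < 0) {
      return new ArrayList<>(array); // per the quoted doc text
    }
    int from = Math.min(start, array.size());
    int to = (int) Math.min((long) from + length, array.size());
    return new ArrayList<>(array.subList(from, to));
  }

  // Range form: 'to' is treated as exclusive (an assumption of this sketch).
  static <T> List<T> sliceByRange(List<T> array, int from, int to) {
    return sliceByLength(array, from, Math.max(0, to - from));
  }

  public static void main(String[] args) {
    List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);
    System.out.println(sliceByLength(xs, 1, 3)); // [2, 3, 4]
    System.out.println(sliceByRange(xs, 1, 4));  // [2, 3, 4]
  }
}
```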

spena · 2019-01-14T22:33:19Z

docs/syntax-reference.rst

 | ARRAYCONTAINS          |  ``ARRAYCONTAINS('[1, 2, 3]', 3)``                         | Given JSON or AVRO array checks if a search       |
 |                        |                                                            | value contains in it.                             |
 +------------------------+------------------------------------------------------------+---------------------------------------------------+
+| CARDINALITY            |  ``CARDINALITY(array | map)``                              | Returns the number of entries in the specified    |


Presto and Teradata have a similar function that returns the size of the array (no map). Snowflake has only 'array_size(x)' for the same result.

For KSQL, should we split this function into 'array_size(x)' and 'map_size(x)' for a better meaning? Or just 'size(x)', for both, size(array) or size(map)?
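
One way to picture the single-name option is a pair of Java overloads, sketched below with hypothetical names. How KSQL's UDF framework would actually bind such overloads is not shown here. One side effect in plain Java is that a bare `cardinality(null)` call is ambiguous and needs a cast.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the PR's cardinality UDF.
final class CardinalitySketch {

  static Integer cardinality(List<?> array) {
    return array == null ? null : array.size();
  }

  static Integer cardinality(Map<?, ?> map) {
    return map == null ? null : map.size();
  }

  public static void main(String[] args) {
    System.out.println(cardinality(Arrays.asList(1, 2, 3)));             // 3
    System.out.println(cardinality(Collections.singletonMap("k", "v"))); // 1
    // cardinality(null) would not compile: both overloads match.
    System.out.println(cardinality((List<?>) null));                     // null
  }
}
```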

spena · 2019-01-14T22:35:16Z

ksql-engine/src/main/java/io/confluent/ksql/function/udf/array/ArrayDistinctKudf.java

+    if (input == null) {
+      return null;
+    }
+    return Lists.newArrayList(Sets.newHashSet(input));


The Set will not preserve the order of the original array. What should the result be? Keeping the order (just removing duplicates), or ignoring the order of the array?

It could use a stream, like input.stream().distinct().collect(Collectors.toList()), which I think will preserve the order. But I read that it will not perform very well. How do other databases work?
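
For reference, two order-preserving alternatives to the HashSet round trip, as a sketch with hypothetical names. `LinkedHashSet` keeps insertion order, and `Stream.distinct()` keeps the first occurrence of each element for ordered streams such as those from a `List`. No claim is made here about their relative performance; that would need measuring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not the PR's ArrayDistinctKudf.
final class DistinctSketch {

  // LinkedHashSet drops duplicates while remembering insertion order.
  static <T> List<T> distinctViaSet(List<T> input) {
    if (input == null) {
      return null;
    }
    return new ArrayList<>(new LinkedHashSet<>(input));
  }

  // Stream.distinct() keeps the first occurrence of each element in encounter order.
  static <T> List<T> distinctViaStream(List<T> input) {
    if (input == null) {
      return null;
    }
    return input.stream().distinct().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> in = Arrays.asList("foo", " ", "foo", "bar");
    System.out.println(distinctViaSet(in).equals(Arrays.asList("foo", " ", "bar")));    // true
    System.out.println(distinctViaStream(in).equals(Arrays.asList("foo", " ", "bar"))); // true
  }
}
```

With either variant, the unit test could assert the exact order instead of using containsInAnyOrder.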

spena · 2019-01-14T22:55:37Z

ksql-engine/src/main/java/io/confluent/ksql/function/udf/array/ArrayExceptKudf.java

+        .filter(e -> !rhs.contains(e))
+        .collect(Collectors.toSet());
+    final List result = Lists.newArrayList(distinct);
+    return result;


Can this be reduced to one line using the Collectors.toList() instead?
Like:
return lhs.stream().filter(e -> !rhs.contains(e)).collect(Collectors.toList());

Also, the above rhs.contains(e) is linear, so if you have a large array on rhs, then the performance won't be great. What about converting the rhs to a set, and then use the .contains() in the filter method?
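
A sketch of that suggestion, with hypothetical names: the right-hand array is copied into a HashSet once, so each membership check is a hash lookup instead of a linear scan. The first variant keeps duplicates and order from the left-hand array. The second adds `distinct()` for the no-duplicates behaviour described in the docs. Which one the function should expose is the open question from the earlier comments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not the PR's ArrayExceptKudf.
final class ExceptSketch {

  // Keeps duplicates and order from lhs; only filters out values present in rhs.
  static <T> List<T> except(List<T> lhs, List<T> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Set<T> exclude = new HashSet<>(rhs); // built once, then hash lookups per element
    return lhs.stream()
        .filter(e -> !exclude.contains(e))
        .collect(Collectors.toList());
  }

  // Same filter, with duplicates removed (first occurrence kept).
  static <T> List<T> exceptDistinct(List<T> lhs, List<T> rhs) {
    if (lhs == null || rhs == null) {
      return null;
    }
    Set<T> exclude = new HashSet<>(rhs);
    return lhs.stream()
        .filter(e -> !exclude.contains(e))
        .distinct()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Integer> a = Arrays.asList(1, 2, 2, 3, 4);
    List<Integer> b = Arrays.asList(3, 5);
    System.out.println(except(a, b));         // [1, 2, 2, 4]
    System.out.println(exceptDistinct(a, b)); // [1, 2, 4]
  }
}
```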

blueedgenick · 2019-01-16T02:59:26Z

investigation reveals this PR to still be blocked on #1791 , unable to bind UDFs which declare non-concrete parameter and return types.

JimGalasyn

@blueedgenick, very cool stuff! This is a very old version of the Syntax Reference topic. Can you please move this content to the new topic at https://github.com/confluentinc/ksql/blob/master/docs/developer-guide/ksqldb-reference/scalar-functions.md? Note that all of the new ksqlDB content is in markdown. Thanks!

blueedgenick · 2020-10-19T18:38:03Z

hey @JimGalasyn - apologies, i have no idea how you got requested to review this again, it's totally superseded by the 2 PRs mentioned just above (#5536 and #5548) which already added these functions and their docs (to the right docs file!). I'll close out this dangling PR

JimGalasyn · 2020-10-19T18:42:25Z

@blueedgenick Heh, no worries!

Unknown author and others added 8 commits July 17, 2018 01:47

initial version

836f4f8

initial version

590a6cc

cleaning up docs

445c434

merge

5956ce7

add new Element(array|map, index) and array_intersect(array1, array2)…

f0f7cf0

… functions

clean up docs and tests

16c3433

fix some odd tab/spaces problem in ArrayIntersectKudf

875de39

not sure waht my IDE did to these 2 files...

0852191

rodesai requested a review from a team July 24, 2018 06:15

big-andy-coates reviewed Jul 24, 2018

blueedgenick mentioned this pull request Jul 27, 2018

KSQL Functions and Arrays should use 1-based indexing #1659

Closed

big-andy-coates requested a review from a team August 25, 2018 04:27

spena reviewed Jan 14, 2019

This was referenced Jun 3, 2020

feat: new UDFs for working with Maps #5536

Merged

feat: new UDFs for set-like operations on Arrays #5548

Merged

blueedgenick requested a review from JimGalasyn as a code owner October 17, 2020 20:01

JimGalasyn requested changes Oct 19, 2020

blueedgenick closed this Oct 19, 2020
