Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQL functions for working with Maps and Arrays #1611

Closed

Conversation

blueedgenick
Copy link
Contributor

Description

A new set of SQL functions to make it possible to manipulate columns of type map and array:

  • array_distinct(array)
  • array_except(array1, array2)
  • array_intersect(array1, array2)
  • array_slice(array, start, length)
  • array_union(array1, array2)
  • cardinality(array | map)
  • element(array, index) | element(map, key) <-- function-based equivalent of array[n] etc.
  • arrays_to_map(key-array, value-array)
  • map_keys(map)
  • map_values(map)
  • map_union(map1, map2)

All of the above added to syntax guide.

Testing done

Unit tests and a quick manual test of some queries.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@rodesai rodesai requested a review from a team July 24, 2018 06:15
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blueedgenick, I've left a few comments.

| ARRAY_DISTINCT | ``ARRAY_DISTINCT(array_col)`` | Returns an array of all the distinct values from |
| | | the input array, or NULL if the input is NULL. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAY_EXCEPT | ``ARRAY_EXCEPT(array1, array2)`` | Returns an array of all the distinct values from |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this one be returning distinct values? Or just the array1 filtering out values in array2? That would make the function more single purpose. They can always compose it with ARRAY_DISTINCT if they want a distinct output.

@@ -983,13 +983,46 @@ Scalar functions
+========================+============================================================+===================================================+
| ABS | ``ABS(col1)`` | The absolute value of a value. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAY_DISTINCT | ``ARRAY_DISTINCT(array_col)`` | Returns an array of all the distinct values from |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is order maintained? Might be worth documenting, (and testing!), order for this and other methods.

| | | array1 except for those also presnet in array2. |
| | | NULL is returned if either input array is NULL. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAY_INTERSECT | ``ARRAY_INTERSECT(array1, array2)`` | Returns an array of all the distinct elements |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, I'm not sure this should be distinct. This is the intersect of the array, not set. ARRAY_INTERSECT([1,2,2,2,3],[2,2,3]) should, IMHO, return [2,2,3].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presto and Teradata have this similar array_intersect() function, and they specify that the returned array does not have duplicates (same for array_except). I think it makes sense to follow the same standard from both databases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, that's why it was this way to begin with ;)

| | | the rqeuested length or offset are negative the |
| | | entire array is returend. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAY_UNION | ``ARRAY_UNION(array1, array2)`` | Returns an array of all the distinct values from |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, don't think this should be distinct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, this is modelled after the comparable functions in other systems (teradata nand presto). This case is also more "normal" from the rdbms world - consider the difference between UNION and UNION ALL in standard sql (the first returns a distinct result list, the second requires an additional keyword to specify "please keep all the duplicates")

| ARRAY_UNION | ``ARRAY_UNION(array1, array2)`` | Returns an array of all the distinct values from |
| | | all input arrays, or NULL if either input is NULL.|
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAYS_TO_MAP | ``ARRAYS_TO_MAP(key_array, map_array)`` | Creates a map from a passed array of keys and an |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would benefit from documentation of NULL behaviour, as you have for others.

String thisKey = inputKeys.get(i);
Object thisValue = inputValues.get(i);
if (thisValue != null) {
outputMap.put(inputKeys.get(i), thisValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what your intention is here... the if/else branches both do this same thing.

Does your map type allow null keys? Good one to test & document!

}

@SuppressWarnings("rawtypes")
@Udf(description = "Returns the element at the specified index (counting from 0) in an array.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't SQL convention normally to be base-1, rather than base-0? Or has this changed with newer products?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes indeed - this was done just to align with the rest of ksql back at the time. seeing as we've subsequently updated other parts of the code to be 1-based, the same change should be applied here too

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

@UdfDescription(name = "map_concat", author = "Confluent",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's MAP_UNION in the docs...

@Test
public void shouldDistinctArray() {
final List<Object> result = udf.distinct(Arrays.asList("foo", " ", "foo", "bar"));
assertThat(result, containsInAnyOrder(" ", "bar", "foo"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should maintain order.

@SuppressWarnings("rawtypes")
@Test
public void shouldReturnNullForNullInput() {
List result = udf.except(null, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you split into two tests, each with one null param, please? (It catches places where one null check is hiding a second NPE).

@blueedgenick
Copy link
Contributor Author

Note: this PR was put on hold some months ago due to issues with getting complex types (map/array/struct) into and out of the UDF invocation framework. I think a more recent PR has at least partially alleviated that situation, so i'd be happy to take another pass at this. I do also recall addressing several of @big-andy-coates 's code comments from above, not sure what happened to those fixes - likely i have them in a private branch somewhere. Will go dig those up.

| | | those elements which are present in both inputs. |
| | | NULL is returned if either input array is NULL. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| ARRAY_SLICE | ``ARRAY_SLICE(array_col, start, length)`` | Returns a subsection of an array, of requested |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presto and Terdata has a similar function called 'slice(x, start, length)' which does the same thing.
Snowflake has a similar function too called 'array_slice(x, from, to)' which does the same thing (but using end of array instead of length).

This function for KSQL looks pretty similar to Presto and Teradata. Should we use the same name 'slice' instead of 'array_slice'?

| ARRAYCONTAINS | ``ARRAYCONTAINS('[1, 2, 3]', 3)`` | Given JSON or AVRO array checks if a search |
| | | value contains in it. |
+------------------------+------------------------------------------------------------+---------------------------------------------------+
| CARDINALITY | ``CARDINALITY(array | map)`` | Returns the number of entries in the specified |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presto and Teradata has a similar function that returns the size of the array (no map). Snowflake has only 'array_size(x)' for the same result.

For KSQL, should we split this function into 'array_size(x)' and 'map_size(x)' for a better meaning? Or just 'size(x)', for both, size(array) or size(map)?

if (input == null) {
return null;
}
return Lists.newArrayList(Sets.newHashSet(input));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Set will not preserve the order of the original array. What should the result be? Keeping the order (just removing duplicates), or ignoring the order of the array?

If could use a stream, like input.stream().distinct().collect(Collectors.toList()), which I think will preserve the order. But I read that will not perform very good. How do others databases work?

.filter(e -> !rhs.contains(e))
.collect(Collectors.toSet());
final List result = Lists.newArrayList(distinct);
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be reduced to one line using the Collectors.toList() instead?
Like:
return lhs.stream().filter(e -> !rhs.contains(e)).collect(Collectors.toList());

Also, the above rhs.contains(e) is linear, so if you have a large array on rhs, then the performance won't be great. What about converting the rhs to a set, and then use the .contains() in the filter method?

@blueedgenick
Copy link
Contributor Author

investigation reveals this PR to still be blocked on #1791 , unable to bind UDFs which declare non-concrete parameter and return types.

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blueedgenic, Very cool stuff! This is a very old version of the Syntax Reference topic. Can you please move this content to the new topic at https://github.com/confluentinc/ksql/blob/master/docs/developer-guide/ksqldb-reference/scalar-functions.md? Note that all of the new ksqlDB content is in markdown. Thanks!

@blueedgenick
Copy link
Contributor Author

hey @JimGalasyn - apologies, i have no idea how you got requested to review this again, it's totally superseded by the 2 PRs mentioned just above (#5536 and #5548) which already added these functions nad their docs (to the right docs file!). I'll close out this dangling PR

@JimGalasyn
Copy link
Member

@blueedgenick Heh, no worries!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants