New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support EXPLODE function (UDTF) for handling array columns #527

Open
hpgrahsl opened this Issue Dec 13, 2017 · 10 comments

Comments

Projects
None yet
5 participants
@hpgrahsl

hpgrahsl commented Dec 13, 2017

as very briefly discussed with @hjafarpour in the slack community:

I've seen the array column support as well as the extractjsonfield UDF explained in the docs which are for different use cases than the feature/enhancement I'm suggestion here: it would be great to have support for an explode UDTF so that it's e.g. possible to GROUP BY the values contained in an array-like structure of a column. illustrating example below:

col1     col2 
1       ["a","a","b"]
2       ["b","c","c"]
3       ["a","d"]

when applying SELECT col1, explode(col2) FROM ... the expected result would be

col1    col2
1       "a"
1       "a"
1       "b"
2       "b"
2       "c"
2       "c"
3       "a"
3       "d"

based on this you could e.g. GROUP BY col2

@k1th

This comment has been minimized.

k1th commented Feb 15, 2018

We would benefit from this as well.
Our case is a JSON structure that contains an array of n events, each having another array of m sub-events.
We would need explode this into n*m messages…

@miguno

This comment has been minimized.

Member

miguno commented Feb 15, 2018

Thanks for the +1 @k1th.

@hpgrahsl

This comment has been minimized.

hpgrahsl commented Feb 15, 2018

@k1th thx for the comment and upvote. while your suggestion sounds interesting I guess a first version shoud probably stick to 1 level of unnesting per UDTF call. recursively 'exploding' across multiple levels might result in a very large number of rows pretty quickly based on an originally single record, which is something that needs to be carefully done - if at all.

@k1th

This comment has been minimized.

k1th commented Feb 16, 2018

If this were JSON-specific, a syntax that actually defines the flatten (or flatMap) would be handy. Then the user would have control over how many levels this would go and also be aware or the potentially massive expansion of data.
For this one would need a JSON processing library like jq, though…

@miguno miguno changed the title from feature: explode UDTF for handling array columns to Support EXPLODE function (UDTF) for handling array columns Sep 5, 2018

@miguno

This comment has been minimized.

Member

miguno commented Sep 5, 2018

@k1th : With the new STRUCT support in KSQL 5.0, would this solve your problem mentioned above at #527 (comment)?

@k1th

This comment has been minimized.

k1th commented Sep 5, 2018

With the new STRUCT support in KSQL 5.0, would this solve your problem mentioned above at #527 (comment)?

We just tested array/struct combinations a week ago. this actually allows us to fully map the json to ksql in 5.0.
The EXPLODE functionality discussed above is not covered by that, though. The STRUCT is the first step to correctly represent our JSON. After that we'd still need to extract the data to multiple records...

@hpgrahsl

This comment has been minimized.

hpgrahsl commented Sep 5, 2018

@miguno I'm afraid EXPLODE needs to be 'compiled down' to something like flatMapValues. IMHO should be doable when UDTFs are available :)

@nalinaly

This comment has been minimized.

nalinaly commented Oct 29, 2018

waiting for EXLODE too,
need extract the data to multiple records

@lamerexter

This comment has been minimized.

lamerexter commented Nov 26, 2018

Can someone explain the status of the explode() or unnest() (POSTGRES equivalent) functionality for dealing with ARRAY types?

Has anyone found a workaround to navigate through ARRAY types (for...each) without having to use indices?

@miguno

This comment has been minimized.

Member

miguno commented Nov 28, 2018

+1 from a KSQL user in an offline conversation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment