docs: add klip-24: key column semantics in queries. #5115

big-andy-coates
Contributor

This klip looks to address some of the shortcomings found during the recent work to remove the restriction that all key columns must be named ROWKEY.

Please have a read and let me know your thoughts.

Importantly, the 'any key name' work is not yet enabled (see #5093). So we have a choice: do we fix these semantics before or after we enable this feature?

Before means less disruption for users, but may mean the feature slips the milestone.

After means we'll hit the milestone, but users will be asked to change their existing queries one way, only to be asked to change them back on the next release ... annoying! For example:

```sql
-- existing valid GROUP BY persistent query:
CREATE TABLE OUTPUT AS
   SELECT V0, COUNT() AS COUNT
     FROM INPUT GROUP BY V0;

-- with 'any key name' merged the above fails with a duplicate column error on V0.
-- the user needs to change the query to:
CREATE TABLE OUTPUT AS
   SELECT COUNT() AS COUNT
     FROM INPUT GROUP BY V0;

-- with this klip the first statement is valid, but the second is not, hence we'd be asking users to change their query back again.
```

@big-andy-coates big-andy-coates requested a review from a team as a code owner April 20, 2020 13:35
Contributor

@agavra agavra left a comment

Thanks @big-andy-coates! I like all of the "main" changes proposed in this KLIP, my comments are mostly on the "secondary" and TBD parts.

> Before means less disruption for users, but may mean the feature slips the milestone.

This is my vote. I think the current behavior for any keys would turn off some users, and the confusion for users jumping between models would be really frustrating.

@agavra agavra requested a review from a team April 20, 2020 16:48
This was referenced Apr 20, 2020
update with Almog's requested changes and suggestions and doc another edge case
@big-andy-coates
Contributor Author

@mjsax & @agavra: I've updated this in line with the discussions and also discovered another edge case: grouping by multiple expressions. Can you please take a look at this section and provide any feedback? Thanks!

Contributor

@rmoff rmoff left a comment

Left a few comments, but in general this all looks good 👍

Contributor

@agavra agavra left a comment

LGTM!

@derekjn
Contributor

derekjn commented Apr 21, 2020

Great proposal @big-andy-coates, I think this makes things much cleaner and more explicit. The only question I have has been asked by @rmoff: #5115 (comment)

Otherwise LGTM 👍

@big-andy-coates
Contributor Author

> Great proposal @big-andy-coates, I think this makes things much cleaner and more explicit. The only question I have has been asked by @rmoff: #5115 (comment)
>
> Otherwise LGTM 👍

@derekjn, I've replied to Robin's comment: it's actually an editing mistake in the KLIP.

@big-andy-coates
Contributor Author

I've updated the KLIP with a new edge case around outer joins and joins on non-column-refs. The proposed short- to medium-term workaround is not pretty, but it is practical.

Would appreciate people's views. cc @rmoff, @blueedgenick, @agavra, @derekjn, @apurvam

Contributor

@agavra agavra left a comment

Still LGTM

alias for the system generated `KSQL_COL_0` key column name. Any solution to allow providing an
alias would likely be incompatible with the planned multiple key column support.

Hence, we propose leaving this edge case unsolved, i.e. users will _not_ be able to provide an alias
Member

As the proposal is to use the ROWKEY column for outer joins, it seems the same pattern could be applied in this case to allow people to rename the PK?

Contributor Author

Hummmm.... you've got a point.

However, I think there's a subtle difference we may want to consider.

With a grouping statement on multiple things, the current implementation combines all the result columns into a single STRING value. So the Kafka message's key contains all the grouping data, just in a nasty format. No additional column is being synthesised.

Conversely, for outer joins the Kafka message's key contains the result of COALESCE(leftJoinExp, rightJoinExp), i.e. unlike the grouping statement, it does NOT contain all the joining data: it loses the data about any side being null.

Because of this subtle difference we know that the upcoming structured keys work will allow users to alias the multiple grouping expressions in the projection. However, the issue with joins can not be fixed with structured key support alone.
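
To make the distinction concrete, here is a toy Python model of the two key derivations described above. It is only a sketch, not ksqlDB code: the function names and the `|+|` delimiter are assumptions chosen for illustration.

```python
from typing import Optional

SEP = "|+|"  # assumed delimiter, for illustration only


def grouping_key(*values: object) -> str:
    """Munge every grouping value into one STRING key; no value is dropped."""
    return SEP.join(str(v) for v in values)


def outer_join_key(left: Optional[int], right: Optional[int]) -> Optional[int]:
    """Model COALESCE(left, right): the first non-null join expression."""
    return left if left is not None else right


# Grouping: both grouping values are still present in the key.
print(grouping_key(1, 2))  # prints 1|+|2

# Outer join: three different inputs collapse to the same key, so the key
# alone cannot say which side (if any) was null.
print(outer_join_key(7, 7), outer_join_key(7, None), outer_join_key(None, 7))  # prints 7 7 7
```

In this model the grouping key can be split back into its parts (as long as no value contains the delimiter), whereas nothing in the join key records which side was missing.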

If we go the same route for groupings as we've proposed for joins, then we end up with:

```sql
CREATE TABLE OUTPUT AS
   SELECT ROWKEY AS K, COUNT() FROM INPUT GROUP BY V0, V1;

-- vs --

CREATE TABLE OUTPUT AS
   SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
```

There are certainly arguments for going either way. Personally, I'm happy with what's been proposed for groupings in the KLIP.

cc @derekjn for a product view.

Member

I guess we agree that the end-state should be the last query you showed:

```sql
CREATE TABLE OUTPUT AS
   SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
```

(With a proper structured key <V0,V1> stored in the message key).

I guess the question is what intermediate state we want to be in. The query above is valid now; however, it does not really expose the PK that is stored in ROWKEY.

From the KLIP:

> we propose that the projection should still accept the individual columns, and recognise them as key columns

Not sure if I can follow. If both columns V0 and V1 are stored in the value, how can this be done?

As this KLIP seems to try to expose the actual message key in the schema, it seems consistent to add ROWKEY for this case as an intermediate step. I understand that it might look like a step backward from a language POV, as the above query, which is valid now, would not be valid until we reach the end-state and it becomes valid again...

Contributor Author

@big-andy-coates big-andy-coates May 6, 2020

To be clear, the KLIP is proposing that the following will be valid even before structured keys:

```sql
CREATE TABLE OUTPUT AS
   SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
```

And both V0 and V1 will be stored in the key in a munged together column called KSQL_COL_0, not in the value.

Yes, we could support:

```sql
CREATE TABLE OUTPUT AS
   SELECT KSQL_COL_0, COUNT() FROM INPUT GROUP BY V0, V1;
```

But users will be left wondering where KSQL_COL_0 came from.

Given that we're adding support for multiple key columns next, product are happy with not supporting aliasing of the resulting key column name.

big-andy-coates added a commit to big-andy-coates/ksql that referenced this pull request Apr 27, 2020
Part of [klip-24](confluentinc#5115).

A Udf that indicates that a key column in a projection should be copied into a value column, for example:

```sql
-- Given:
CREATE STREAM INPUT (ID INT KEY, V0 INT, V1 INT) WITH (kafka_topic='input', value_format='JSON');

-- When:
CREATE STREAM OUTPUT AS SELECT ID, AS_VALUE(ID) AS ID_COPY, V1 FROM INPUT;

-- Then:
-- resulting schema: ID INT KEY, ID_COPY INT, V1 INT
```

Note: the UDF doesn't actually _do_ anything as yet. It requires the rest of klip-24 to enable its true purpose.
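
The intended effect on the result schema can be sketched with a toy Python model. This is not how ksqlDB implements it: the `project` helper and the tuple layout are hypothetical, and only plain column references and `AS_VALUE(column)` expressions are handled.

```python
# Input schema from the example above: column name -> (SQL type, is key column).
INPUT = {"ID": ("INT", True), "V0": ("INT", False), "V1": ("INT", False)}


def project(schema, projection):
    """Resolve (expression, alias) pairs into (name, type, is_key) columns."""
    result = []
    for expression, alias in projection:
        wrapped = expression.startswith("AS_VALUE(") and expression.endswith(")")
        column = expression[len("AS_VALUE("):-1] if wrapped else expression
        sql_type, is_key = schema[column]
        # A key column stays a key column unless it is wrapped in AS_VALUE.
        result.append((alias or column, sql_type, is_key and not wrapped))
    return result


output = project(INPUT, [("ID", None), ("AS_VALUE(ID)", "ID_COPY"), ("V1", None)])
print(", ".join(f"{n} {t}{' KEY' if k else ''}" for n, t, k in output))
# prints: ID INT KEY, ID_COPY INT, V1 INT
```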
@big-andy-coates big-andy-coates mentioned this pull request Apr 27, 2020
2 tasks
big-andy-coates added a commit that referenced this pull request Apr 28, 2020
* chore: add AS_VALUE Udf

Part of [klip-24](#5115).

Co-authored-by: Andy Coates <big-andy-coates@users.noreply.github.com>
big-andy-coates added a commit to big-andy-coates/ksql that referenced this pull request May 4, 2020
This change implements the change in key semantics in queries outlined in [KLIP-24](confluentinc#5115).
big-andy-coates added a commit that referenced this pull request May 6, 2020
* chore: implement new key semantics in queries

This change implements the change in key semantics in queries outlined in [KLIP-24](#5115).


Co-authored-by: Andy Coates <big-andy-coates@users.noreply.github.com>
@big-andy-coates big-andy-coates merged commit bd9302a into confluentinc:master May 7, 2020
@big-andy-coates big-andy-coates deleted the klip-24-key-column-query-semantics branch May 7, 2020 16:01
Member

@mjsax mjsax left a comment

Thanks for the KLIP!


@S-makes S-makes left a comment

design-proposals/klip-24-key-column-semantics-in-queries.md
