Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Make outputs distinct #871

Closed
ryzhyk opened this issue Oct 11, 2023 · 26 comments
Closed

[RFC] Make outputs distinct #871

ryzhyk opened this issue Oct 11, 2023 · 26 comments
Assignees
Labels
adapters Issues related to the adapters crate SQL compiler Related to the SQL compiler

Comments

@ryzhyk
Copy link
Contributor

ryzhyk commented Oct 11, 2023

We have discussed this before but I think we should revisit the issue. SQL queries generally produce bags, which may include the same record with an arbitrary (positive) multiplicity. This means that the output stream produced by Feldera consists of Z-sets with arbitrary (positive or negative) multiplicities. Trouble is, most downstream consumers don't understand Z-sets and expect upserts. For example, there is no standard way in SQL to delete one of multiple identical records. I cannot think of a fool-proof solutions that doesn't involve enforcing distinct outputs. Question is, what's the best way to achieve this.

One option is to add static analysis to the SQL compiler to identify distinct tables and only allow output connector to be attached to such tables. An even more advanced version of this would allow the connector to decide whether it can handle non-distinct outputs.

Another no-frills option is to just distinct() all output tables, which will change the semantics in a way that no real users likely care about.

@mihaibudiu

@ryzhyk ryzhyk added this to the October 24, 2023 milestone Oct 11, 2023
@ryzhyk ryzhyk added SQL compiler Related to the SQL compiler adapters Issues related to the adapters crate labels Oct 11, 2023
@ryzhyk ryzhyk removed this from the October 24, 2023 milestone Oct 11, 2023
@mihaibudiu
Copy link
Collaborator

Not sure that "no users care about" means. Select t.col from t will produce wrong results

@mihaibudiu
Copy link
Collaborator

The only solution I can to of to remove a certain number of tuples is to use except all. You need to keep one table for insertions and one for deletions and do union all and except all at each step. No idea whether this can be made efficiently.

@mihaibudiu
Copy link
Collaborator

Another solution is to delete all copies of a record and then reinsert as many as needed. This is more incremental

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

Not sure that "no users care about" means. Select t.col from t will produce wrong results

Yes, we are definitely changing the semantics of queries, effectively attaching SELECT DISTINCT to all of them. I doubt this is an issue in practice, but this is why the ideal solution would be to get users to write distinct queries and simply verify that they did via static analysis, but that seems like much more work.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

Another solution is to delete all copies of a record and then reinsert as many as needed. This is more incremental

I've seen other DB-specific hacks out there, but these are all terrible. I think we should stay away from them.

My conjecture is that for streaming things the only two options are Z-sets (which almost noone supports today) and upserts, which require set or map semantics (all multiplicities are 1). We should find a way to enforce that rather than try to deal with arbitrary multiplicities.

@mihaibudiu
Copy link
Collaborator

Even if inputs are sets the address too many useful cases where outputs aren't. We have to find a solution for that case

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

Even if inputs are sets the address too many useful cases where outputs aren't. We have to find a solution for that case

Do you have an example?

@mihaibudiu
Copy link
Collaborator

I have many tests like that. Do you want me to check standard benchmarks or code from our sources?

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

I am curious about a real-world use case where multiplicities >1 are important. With DDlog I've seen many where users find them annoying, I've never seen anyone complain about the second copy of a record not showing up :) So I wonder if there is a real-world scenario where attaching SELECT DISTINCT to each view would cause any issues.

@mihaibudiu
Copy link
Collaborator

Query q8 in tpch need duplicates. It's in our repo

@mihaibudiu
Copy link
Collaborator

I take that back

@mohsinbeg
Copy link
Contributor

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

@blp
Copy link
Member

blp commented Oct 11, 2023

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

That's not an example of duplicates. The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

That's not an example of duplicates. The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

yes, these are not duplicates.

@mohsinbeg
Copy link
Contributor

The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

Yes they are. Multiple tuples with same values. Users don't think about internal or system keys when they write queries.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

Yes they are. Multiple tuples with same values. Users don't think about internal or system keys when they write queries.

Then I don't understand the example.

@lalithsuresh
Copy link
Collaborator

I assume you are worried about a situation like this:

  • a view (a bag), outputs: ("foo", "bar", insert), ("foo", "bar", insert)
  • some subsequent transaction outputs: ("foo", "bar", delete),

now the output needs to somehow delete only one of the two inserts?

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

yep

@blp
Copy link
Member

blp commented Oct 11, 2023

We could provide both the weight delta and the total weight (the integral) along with each record in our output.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 11, 2023

We could provide both the weight delta and the total weight (the integral) along with each record in our output.

It's a neat idea. We're not really setup to do that now, but it's possible in principle. There's still a bunch of complexity involved in processing such entries on the receiver side, e.g., I'm not sure I know how to adapt the Snowflake ingest to support this, but it's probably doable. The main problem with this approach is that it assumes that we control the output format, which will not always be the case.

My preference would be to keep this solution in mind, but not implement it until we've seen at least one real-world use case that requires non-unit weights.

@mihaibudiu
Copy link
Collaborator

I plan to introduce two compiler flags to handle SQL constructs differently from the standard:

  • ignore ORDER BY at the last stage (but not ORDER BY with LIMIT)
  • insert a DISTINCT on the output views if necessary. Less clear what happens to internal views which are also output views.
    I am tempted to make this a flag and not the default behavior. Or we could have these behaviors automatic when compiling incremental queries (-i flag) and they can be inhibited by the corresponding flag.

@lalithsuresh
Copy link
Collaborator

Do we see users wanting to prefer one mode or the other (if it is a flag)? How would users express preference? It looks like the flag would have to propagate through the API and UI (e.g., you have to say "compile this program with this behavior").

Is there a way to specify this in the program itself?

@mihaibudiu
Copy link
Collaborator

If we had any users we could ask them.
The API should not care, it is just about interpreting a SQL program in a different way than the standard.
SQL is already a mess, I think this belongs to a compiler flag. We could perhaps add some annotation/comments to SQL. But the point is to not modify the SQL.

@ryzhyk
Copy link
Contributor Author

ryzhyk commented Oct 12, 2023

  • Making it a flag sounds good to me. We will always set this flag when invoking the compiler from the compiler service, but not, e.g., when running SQL tests. I don't think this should be user configurable, since not setting this flag will likely cause correctness issue for various downstream consumers. At least this would be my initial suggestion until we have some additional signal from users.

  • Internal views that are not output views: problem is we do not yet have any way to distinguish those, but we need to. At the moment all views are output views in the sense that the user can attach an output connector to the view and observe changes to it. We probably want to explicitly annotate output views.

@gz
Copy link
Collaborator

gz commented Oct 12, 2023

Making it a flag sounds good to me. We will always set this flag when invoking the compiler from the compiler service, but not, e.g., when running SQL tests. I don't think this should be user configurable, since not setting this flag will likely cause correctness issue for various downstream consumers. At least this would be my initial suggestion until we have some additional signal from users.

👍 this doesn't seem like something that is good to have user configurable

ryzhyk pushed a commit that referenced this issue Oct 17, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 17, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 17, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 17, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 18, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 18, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk pushed a commit that referenced this issue Oct 18, 2023
Use the `--outputsAreSets` compiler flag to enforce that pipelines
produce distinct outputs.  See #871
for a detailed discussion.

Also use the `--ignoreOrder` compiler flag to ignore the `order by`
clause in SQL.

Added an integration test for the distinct output behavior.  And while I
was at it, documented steps for running integration tests, which I
always forget.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
@mihaibudiu
Copy link
Collaborator

Can we close this issue?

@ryzhyk ryzhyk added this to the October 24, 2023 milestone Oct 20, 2023
@ryzhyk ryzhyk closed this as completed Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adapters Issues related to the adapters crate SQL compiler Related to the SQL compiler
Projects
None yet
Development

No branches or pull requests

6 participants