[RFC] Make outputs distinct #871

ryzhyk · 2023-10-11T04:38:55Z

We have discussed this before but I think we should revisit the issue. SQL queries generally produce bags, which may include the same record with an arbitrary (positive) multiplicity. This means that the output stream produced by Feldera consists of Z-sets with arbitrary (positive or negative) multiplicities. Trouble is, most downstream consumers don't understand Z-sets and expect upserts. For example, there is no standard way in SQL to delete one of multiple identical records. I cannot think of a fool-proof solutions that doesn't involve enforcing distinct outputs. Question is, what's the best way to achieve this.

One option is to add static analysis to the SQL compiler to identify distinct tables and only allow output connector to be attached to such tables. An even more advanced version of this would allow the connector to decide whether it can handle non-distinct outputs.

Another no-frills option is to just distinct() all output tables, which will change the semantics in a way that no real users likely care about.

@mihaibudiu

The text was updated successfully, but these errors were encountered:

mihaibudiu · 2023-10-11T07:39:31Z

Not sure that "no users care about" means. Select t.col from t will produce wrong results

mihaibudiu · 2023-10-11T08:41:18Z

The only solution I can to of to remove a certain number of tuples is to use except all. You need to keep one table for insertions and one for deletions and do union all and except all at each step. No idea whether this can be made efficiently.

mihaibudiu · 2023-10-11T08:47:00Z

Another solution is to delete all copies of a record and then reinsert as many as needed. This is more incremental

ryzhyk · 2023-10-11T16:58:45Z

Not sure that "no users care about" means. Select t.col from t will produce wrong results

Yes, we are definitely changing the semantics of queries, effectively attaching SELECT DISTINCT to all of them. I doubt this is an issue in practice, but this is why the ideal solution would be to get users to write distinct queries and simply verify that they did via static analysis, but that seems like much more work.

ryzhyk · 2023-10-11T17:03:15Z

Another solution is to delete all copies of a record and then reinsert as many as needed. This is more incremental

I've seen other DB-specific hacks out there, but these are all terrible. I think we should stay away from them.

My conjecture is that for streaming things the only two options are Z-sets (which almost noone supports today) and upserts, which require set or map semantics (all multiplicities are 1). We should find a way to enforce that rather than try to deal with arbitrary multiplicities.

mihaibudiu · 2023-10-11T17:04:51Z

Even if inputs are sets the address too many useful cases where outputs aren't. We have to find a solution for that case

ryzhyk · 2023-10-11T17:07:11Z

Even if inputs are sets the address too many useful cases where outputs aren't. We have to find a solution for that case

Do you have an example?

mihaibudiu · 2023-10-11T17:10:00Z

I have many tests like that. Do you want me to check standard benchmarks or code from our sources?

ryzhyk · 2023-10-11T17:14:16Z

I am curious about a real-world use case where multiplicities >1 are important. With DDlog I've seen many where users find them annoying, I've never seen anyone complain about the second copy of a record not showing up :) So I wonder if there is a real-world scenario where attaching SELECT DISTINCT to each view would cause any issues.

mihaibudiu · 2023-10-11T17:38:53Z

Query q8 in tpch need duplicates. It's in our repo

mihaibudiu · 2023-10-11T17:39:21Z

I take that back

mohsinbeg · 2023-10-11T17:43:43Z

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

blp · 2023-10-11T17:55:15Z

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

That's not an example of duplicates. The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

ryzhyk · 2023-10-11T17:56:35Z

I am curious about a real-world use case where multiplicities >1 are important

We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it.

That's not an example of duplicates. The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

yes, these are not duplicates.

mohsinbeg · 2023-10-11T17:58:54Z

The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

Yes they are. Multiple tuples with same values. Users don't think about internal or system keys when they write queries.

ryzhyk · 2023-10-11T18:56:18Z

The result tuples are (ssn, user) or possibly just user depending on how you interpret the question.

Yes they are. Multiple tuples with same values. Users don't think about internal or system keys when they write queries.

Then I don't understand the example.

lalithsuresh · 2023-10-11T21:16:23Z

I assume you are worried about a situation like this:

a view (a bag), outputs: ("foo", "bar", insert), ("foo", "bar", insert)
some subsequent transaction outputs: ("foo", "bar", delete),

now the output needs to somehow delete only one of the two inserts?

ryzhyk · 2023-10-11T21:23:21Z

yep

blp · 2023-10-11T21:28:51Z

We could provide both the weight delta and the total weight (the integral) along with each record in our output.

ryzhyk · 2023-10-11T22:25:21Z

We could provide both the weight delta and the total weight (the integral) along with each record in our output.

It's a neat idea. We're not really setup to do that now, but it's possible in principle. There's still a bunch of complexity involved in processing such entries on the receiver side, e.g., I'm not sure I know how to adapt the Snowflake ingest to support this, but it's probably doable. The main problem with this approach is that it assumes that we control the output format, which will not always be the case.

My preference would be to keep this solution in mind, but not implement it until we've seen at least one real-world use case that requires non-unit weights.

mihaibudiu · 2023-10-12T22:00:23Z

I plan to introduce two compiler flags to handle SQL constructs differently from the standard:

ignore ORDER BY at the last stage (but not ORDER BY with LIMIT)
insert a DISTINCT on the output views if necessary. Less clear what happens to internal views which are also output views.
I am tempted to make this a flag and not the default behavior. Or we could have these behaviors automatic when compiling incremental queries (-i flag) and they can be inhibited by the corresponding flag.

lalithsuresh · 2023-10-12T22:31:55Z

Do we see users wanting to prefer one mode or the other (if it is a flag)? How would users express preference? It looks like the flag would have to propagate through the API and UI (e.g., you have to say "compile this program with this behavior").

Is there a way to specify this in the program itself?

mihaibudiu · 2023-10-12T22:33:19Z

If we had any users we could ask them.
The API should not care, it is just about interpreting a SQL program in a different way than the standard.
SQL is already a mess, I think this belongs to a compiler flag. We could perhaps add some annotation/comments to SQL. But the point is to not modify the SQL.

ryzhyk · 2023-10-12T22:58:42Z

Making it a flag sounds good to me. We will always set this flag when invoking the compiler from the compiler service, but not, e.g., when running SQL tests. I don't think this should be user configurable, since not setting this flag will likely cause correctness issue for various downstream consumers. At least this would be my initial suggestion until we have some additional signal from users.
Internal views that are not output views: problem is we do not yet have any way to distinguish those, but we need to. At the moment all views are output views in the sense that the user can attach an output connector to the view and observe changes to it. We probably want to explicitly annotate output views.

gz · 2023-10-12T23:22:25Z

Making it a flag sounds good to me. We will always set this flag when invoking the compiler from the compiler service, but not, e.g., when running SQL tests. I don't think this should be user configurable, since not setting this flag will likely cause correctness issue for various downstream consumers. At least this would be my initial suggestion until we have some additional signal from users.

👍 this doesn't seem like something that is good to have user configurable

Use the `--outputsAreSets` compiler flag to enforce that pipelines produce distinct outputs. See #871 for a detailed discussion. Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

Use the `--outputsAreSets` compiler flag to enforce that pipelines produce distinct outputs. See #871 for a detailed discussion. Also use the `--ignoreOrder` compiler flag to ignore the `order by` clause in SQL. Added an integration test for the distinct output behavior. And while I was at it, documented steps for running integration tests, which I always forget. Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

mihaibudiu · 2023-10-20T16:57:48Z

Can we close this issue?

ryzhyk added this to the October 24, 2023 milestone Oct 11, 2023

ryzhyk assigned ryzhyk and mihaibudiu Oct 11, 2023

ryzhyk added SQL compiler Related to the SQL compiler adapters Issues related to the adapters crate labels Oct 11, 2023

ryzhyk removed this from the October 24, 2023 milestone Oct 11, 2023

mihaibudiu mentioned this issue Oct 14, 2023

[SQL] Add two compiler flags: --outputsAreSets and --ignoreOrder #883

Merged

ryzhyk pushed a commit that referenced this issue Oct 17, 2023

Use outputsAreSets flag.

fdbda54

Use the `--outputsAreSets` compiler flag to enforce that pipelines produce distinct outputs. See #871 for a detailed discussion. Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

ryzhyk mentioned this issue Oct 17, 2023

Use outputsAreSets flag. #894

Merged

ryzhyk added this to the October 24, 2023 milestone Oct 20, 2023

ryzhyk closed this as completed Oct 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Make outputs distinct #871

[RFC] Make outputs distinct #871

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mohsinbeg commented Oct 11, 2023

blp commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mohsinbeg commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

lalithsuresh commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

blp commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 12, 2023

lalithsuresh commented Oct 12, 2023

mihaibudiu commented Oct 12, 2023

ryzhyk commented Oct 12, 2023

gz commented Oct 12, 2023

mihaibudiu commented Oct 20, 2023

[RFC] Make outputs distinct #871

[RFC] Make outputs distinct #871

Comments

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mihaibudiu commented Oct 11, 2023

mohsinbeg commented Oct 11, 2023

blp commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mohsinbeg commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

lalithsuresh commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

blp commented Oct 11, 2023

ryzhyk commented Oct 11, 2023

mihaibudiu commented Oct 12, 2023

lalithsuresh commented Oct 12, 2023

mihaibudiu commented Oct 12, 2023

ryzhyk commented Oct 12, 2023

gz commented Oct 12, 2023

mihaibudiu commented Oct 20, 2023