[RFC] Make outputs distinct #871
Comments
Not sure what "no users care about" means. |
The only solution I can think of to remove a certain number of tuples is to use except all. You need to keep one table for insertions and one for deletions and do union all and except all at each step. No idea whether this can be done efficiently. |
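To make the union all / except all idea concrete, here is a minimal Python sketch that models bags with `collections.Counter`. The function name `apply_step` and the sample records are hypothetical; `Counter` addition plays the role of UNION ALL, and `Counter` subtraction (which drops non-positive counts) plays the role of EXCEPT ALL.

```python
from collections import Counter


def apply_step(table, insertions, deletions):
    """Return the new bag: (table UNION ALL insertions) EXCEPT ALL deletions.

    All three arguments are Counters mapping a record to its multiplicity.
    """
    # Counter '+' adds multiplicities (UNION ALL); Counter '-' subtracts
    # them and discards records whose count drops to zero or below,
    # which matches EXCEPT ALL on bags.
    return (table + insertions) - deletions


table = Counter({"x": 2})
result = apply_step(table, Counter({"y": 1}), Counter({"x": 1}))
```

This only illustrates the bag semantics; it says nothing about whether a database can execute the equivalent statements efficiently.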
Another solution is to delete all copies of a record and then reinsert as many as needed. This is more incremental |
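A rough sketch of the delete-then-reinsert approach, assuming the sender tracks the current multiplicity of every record. The operation tuples (`"delete_all"`, `"insert"`) and the function name are made up for illustration and do not correspond to any existing connector API.

```python
from collections import Counter


def apply_delta_by_reinsert(state, delta):
    """Translate a signed weight delta into delete-all / reinsert operations.

    state: Counter of record -> current multiplicity (updated in place).
    delta: dict of record -> signed change in multiplicity.
    Returns a list of hypothetical operations for the downstream system.
    """
    ops = []
    for record, change in delta.items():
        if change == 0:
            continue
        old_count = state[record]
        new_count = old_count + change
        if new_count < 0:
            raise ValueError(f"negative multiplicity for {record!r}")
        if old_count > 0:
            # Remove every existing copy of the record.
            ops.append(("delete_all", record))
        if new_count > 0:
            # Reinsert exactly as many copies as are needed now.
            ops.append(("insert", record, new_count))
            state[record] = new_count
        else:
            state.pop(record, None)
    return ops


state = Counter({("alice", 1): 2})
ops = apply_delta_by_reinsert(state, {("alice", 1): -1})
```

Only records touched by the delta are rewritten, which is what makes this variant more incremental than recomputing the whole table.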
Yes, we are definitely changing the semantics of queries, effectively attaching |
I've seen other DB-specific hacks out there, but these are all terrible. I think we should stay away from them. My conjecture is that for streaming things the only two options are Z-sets (which almost no one supports today) and upserts, which require set or map semantics (all multiplicities are 1). We should find a way to enforce that rather than try to deal with arbitrary multiplicities. |
Even if inputs are sets, there are too many useful cases where outputs aren't. We have to find a solution for that case. |
Do you have an example? |
I have many tests like that. Do you want me to check standard benchmarks or code from our sources? |
I am curious about a real-world use case where multiplicities >1 are important. With DDlog I've seen many where users find them annoying, I've never seen anyone complain about the second copy of a record not showing up :) So I wonder if there is a real-world scenario where attaching |
Query q8 in TPC-H needs duplicates. It's in our repo. |
I take that back |
We must support duplicates. For example "Get all users with same social security number from my database". User must explicitly manage data quality. System should not implicitly do it. |
That's not an example of duplicates. The result tuples are |
yes, these are not duplicates. |
Yes they are. Multiple tuples with same values. Users don't think about internal or system keys when they write queries. |
Then I don't understand the example. |
I assume you are worried about a situation like this:
now the output needs to somehow delete only one of the two inserts? |
yep |
We could provide both the weight delta and the total weight (the integral) along with each record in our output. |
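As a sketch of what such an output could look like, the snippet below attaches both the weight delta and the running total (the integral) to each changed record. The output field names `record`, `delta`, and `total` are assumptions for illustration, not an existing output format.

```python
from collections import Counter


def annotate_with_totals(integral, delta):
    """Attach the running total weight to every changed record.

    integral: Counter of record -> total weight so far (updated in place).
    delta: dict of record -> signed weight change in this step.
    """
    out = []
    for record, change in delta.items():
        if change == 0:
            continue
        integral[record] += change
        out.append({"record": record, "delta": change, "total": integral[record]})
    return out


integral = Counter()
first = annotate_with_totals(integral, {"r": 2})
second = annotate_with_totals(integral, {"r": -1})
```

With the total available, a receiver that cannot delete a single copy could still bring its own state to the stated multiplicity.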
It's a neat idea. We're not really set up to do that now, but it's possible in principle. There's still a bunch of complexity involved in processing such entries on the receiver side, e.g., I'm not sure I know how to adapt the Snowflake ingest to support this, but it's probably doable. The main problem with this approach is that it assumes that we control the output format, which will not always be the case. My preference would be to keep this solution in mind, but not implement it until we've seen at least one real-world use case that requires non-unit weights. |
I plan to introduce two compiler flags to handle SQL constructs differently from the standard:
|
Do we see users wanting to prefer one mode or the other (if it is a flag)? How would users express preference? It looks like the flag would have to propagate through the API and UI (e.g., you have to say "compile this program with this behavior"). Is there a way to specify this in the program itself? |
If we had any users we could ask them. |
|
👍 this doesn't seem like something that is good to have user configurable |
Use the `--outputsAreSets` compiler flag to enforce that pipelines produce distinct outputs. See #871 for a detailed discussion. Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Use the `--outputsAreSets` compiler flag to enforce that pipelines produce distinct outputs. See #871 for a detailed discussion. Also use the `--ignoreOrder` compiler flag to ignore the `order by` clause in SQL. Added an integration test for the distinct output behavior. And while I was at it, documented steps for running integration tests, which I always forget. Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Can we close this issue? |
We have discussed this before but I think we should revisit the issue. SQL queries generally produce bags, which may include the same record with an arbitrary (positive) multiplicity. This means that the output stream produced by Feldera consists of Z-sets with arbitrary (positive or negative) multiplicities. The trouble is that most downstream consumers don't understand Z-sets and expect upserts. For example, there is no standard way in SQL to delete one of multiple identical records. I cannot think of a foolproof solution that doesn't involve enforcing distinct outputs. The question is what the best way to achieve this is.
One option is to add static analysis to the SQL compiler to identify distinct tables and only allow output connectors to be attached to such tables. An even more advanced version of this would allow the connector to decide whether it can handle non-distinct outputs.
Another no-frills option is to just `distinct()` all output tables, which will change the semantics in a way that no real users likely care about. @mihaibudiu
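For illustration, here is a minimal Python sketch of what applying distinct to an output stream of Z-set deltas means: a record is emitted with weight +1 when its total weight becomes positive and with weight -1 when it stops being positive, so the consumer only ever sees unit-weight inserts and deletes. The function name `incremental_distinct` is hypothetical and this is not the actual Feldera operator.

```python
from collections import Counter


def incremental_distinct(integral, delta):
    """Convert a Z-set delta into a delta over the distinct (set) view.

    integral: Counter of record -> total weight so far (updated in place).
    delta: dict of record -> signed weight change in this step.
    Returns a dict of record -> +1 or -1 for records entering or leaving the set.
    """
    out = {}
    for record, change in delta.items():
        before = integral[record]
        after = before + change
        if after != 0:
            integral[record] = after
        else:
            integral.pop(record, None)
        if before <= 0 < after:
            out[record] = 1   # record appears in the distinct output
        elif after <= 0 < before:
            out[record] = -1  # record disappears from the distinct output
    return out


integral = Counter()
d1 = incremental_distinct(integral, {"r": 2})   # two copies inserted
d2 = incremental_distinct(integral, {"r": -1})  # one copy deleted
d3 = incremental_distinct(integral, {"r": -1})  # last copy deleted
```

Deleting one of two copies produces no output change at all, which is exactly the case that upsert-style consumers cannot express.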