result of StreamJoin or OuterJoin is not equal with database #319

Closed
Lvnszn opened this issue Apr 27, 2023 · 4 comments

Comments


Lvnszn commented Apr 27, 2023

I looked at the source code carefully.
In the join code, records are read from the left table and the right table asynchronously and then sent to a channel for consumption. This causes a problem: if the left table receives a record before the right table has received the record that matches it, the left record produces an output row that is not associated with the right table, even though it could have been matched a few rows later. I think this asynchronous design makes the final result contain fewer matches than the data actually allows.
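To make the concern concrete, here is a minimal sketch of a streaming left join that emits unmatched rows immediately and never corrects them. It only illustrates the scenario described above and is not OctoSQL's actual code; the `(side, key, value)` event format and the function name are assumptions made for the example.

```python
# Illustration only: a naive streaming left join with no retractions.
# The event format and all names here are hypothetical.

def naive_left_join(events):
    """events: iterable of (side, key, value) tuples in arrival order."""
    left_seen = {}   # key -> left values seen so far
    right_seen = {}  # key -> right values seen so far
    out = []
    for side, key, value in events:
        if side == "left":
            left_seen.setdefault(key, []).append(value)
            matches = right_seen.get(key, [])
            if matches:
                out.extend((value, r) for r in matches)
            else:
                # No match has arrived yet: emit an unmatched row
                # that is never corrected later.
                out.append((value, None))
        else:
            right_seen.setdefault(key, []).append(value)
            out.extend((l, value) for l in left_seen.get(key, []))
    return out


left_first = [("left", 1, "a"), ("right", 1, "x")]
right_first = [("right", 1, "x"), ("left", 1, "a")]

print(naive_left_join(left_first))   # [('a', None), ('a', 'x')]
print(naive_left_join(right_first))  # [('a', 'x')]
```

The same input produces different output depending only on arrival order, which is the kind of discrepancy with a database result that this issue is about.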

@cube2222
Owner

Hey! Do you have a reproduction?

Outer join will use retractions to retract the early "not matched" record if the left table receives a record before the right one.
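A rough sketch of the retraction idea, under the assumption that each output is a pair `(row, is_retraction)`: when the first matching right record arrives, the earlier "not matched" row is retracted before the matched row is sent. The event format and names are hypothetical and this is not OctoSQL's implementation.

```python
# Illustration only: left outer join that retracts early unmatched rows.
# Output entries are (row, is_retraction); all names are hypothetical.

def left_outer_join_with_retractions(events):
    """events: iterable of (side, key, value) tuples in arrival order."""
    left_seen = {}
    right_seen = {}
    out = []
    for side, key, value in events:
        if side == "left":
            left_seen.setdefault(key, []).append(value)
            matches = right_seen.get(key, [])
            if matches:
                out.extend(((value, r), False) for r in matches)
            else:
                # Send the provisional "not matched" row.
                out.append(((value, None), False))
        else:
            first_match = key not in right_seen
            right_seen.setdefault(key, []).append(value)
            for l in left_seen.get(key, []):
                if first_match:
                    # Undo the provisional row sent earlier for this left record.
                    out.append(((l, None), True))
                out.append(((l, value), False))
    return out


stream = [("left", 1, "a"), ("right", 1, "x")]
for row, is_retraction in left_outer_join_with_retractions(stream):
    print("retract" if is_retraction else "send", row)
```

A consumer that applies retractions ends up with only the matched row, regardless of which side arrived first.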

StreamJoin only sends matches so should work regardless of retractions.
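The order-independence of a match-only join can be checked with a small symmetric hash join sketch: each matching pair is emitted exactly once, when the later of its two records arrives. This is an assumption-laden illustration with hypothetical names, not the StreamJoin code itself.

```python
# Illustration only: an inner stream join that emits matches and nothing else.
from collections import Counter
from itertools import permutations


def stream_inner_join(events):
    """events: iterable of (side, key, value) tuples in arrival order."""
    left_seen, right_seen, out = {}, {}, []
    for side, key, value in events:
        if side == "left":
            left_seen.setdefault(key, []).append(value)
            out.extend((value, r) for r in right_seen.get(key, []))
        else:
            right_seen.setdefault(key, []).append(value)
            out.extend((l, value) for l in left_seen.get(key, []))
    return out


events = [("left", 1, "a"), ("left", 2, "b"), ("right", 1, "x"), ("right", 1, "y")]

# Collect the multiset of output rows for every possible arrival order.
distinct_results = {
    frozenset(Counter(stream_inner_join(order)).items())
    for order in permutations(events)
}
print(len(distinct_results))  # 1
```

Only the order of the output rows changes between arrival orders; the set of matches does not.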

Author

Lvnszn commented May 6, 2023

Thanks for your reply.
How are these retractions triggered? Looking at how the output is printed, my reading is that whether the produce function is used is decided according to a quarter of the time.

Owner

cube2222 commented May 7, 2023

Try using batch_table or stream_native output formats. With JSON it will indeed print both the send and the retraction as normal records (which is not good). Could be improved by doing a batch JSON printer as the output if retractions are possible, or by adding an undo field that is true on JSON outputs that are retractions.
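As a sketch of the second suggestion, the snippet below serializes `(record, is_retraction)` pairs to JSON lines with an extra `undo` field. The field name comes from the comment above; the function and the record layout are hypothetical, and this does not describe an existing output format.

```python
# Illustration only: JSON lines that mark retractions with an "undo" field.
import json


def to_json_lines(events, undo_field="undo"):
    """events: iterable of (record_dict, is_retraction) pairs."""
    return [
        json.dumps({**record, undo_field: is_retraction})
        for record, is_retraction in events
    ]


events = [
    ({"left": "a", "right": None}, False),  # early "not matched" row
    ({"left": "a", "right": None}, True),   # its retraction
    ({"left": "a", "right": "x"}, False),   # the matched row
]
for line in to_json_lines(events):
    print(line)
```

With such a field a downstream consumer could tell the retraction apart from a normal record instead of seeing two identical rows.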

You can actually work around this by using an ORDER BY, which forces buffering, so all retractions are processed before anything is output. Basically:

SELECT .... ORDER BY true
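The effect of buffering can be sketched as follows: collect every `(row, is_retraction)` event first, net them out, and only then output the surviving rows. This is a hypothetical illustration of why buffering hides retractions, not a description of how ORDER BY is implemented.

```python
# Illustration only: buffer all events, then output the net result.
from collections import Counter


def collapse_retractions(events):
    """events: iterable of (row, is_retraction) pairs; rows must be hashable."""
    net = Counter()
    for row, is_retraction in events:
        net[row] += -1 if is_retraction else 1
    result = []
    for row, count in net.items():
        # Rows whose sends and retractions cancel out contribute nothing.
        result.extend([row] * count)
    return result


buffered = [
    (("a", None), False),  # early "not matched" row
    (("a", None), True),   # retraction of that row
    (("a", "x"), False),   # the matched row
]
print(collapse_retractions(buffered))  # [('a', 'x')]
```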

Author

Lvnszn commented May 10, 2023

Thanks for your reply.

Lvnszn closed this as completed May 10, 2023