New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Left joins with grace period do not emit non-joined rows #9084
Comments
I did not look into the details of the link, but does the example process a finite data set? If yes, it could explain what you observe. If input data stop, stream-time which is based on the event-time of the input data does not advance, and thus the join window cannot be closed. Thus the left join result cannot be emitted. |
Hi @mjsax, thank you for the response. That makes sense, and in the example above (which is using debezium to populate the topics), inserting more rows does indeed cause the non-joined rows to be emitted. It seems like some sort of heartbeat to the topic could ensure that they are emitted. However, it seems that if the key is unique, or at least not replicable (e.g., it includes a timestamp, or a transaction id, etc.), the non-joined rows could never be emitted, as a new event could never be emitted for that key. That, at least, is my takeaway from reading the first sentence of this section. Am I misinterpreting that? Or is there a way to force a window to update without a matching key event? |
Yes, but it's a larger change to the Kafka brokers and Kafka Streams to add such a feature -- personally, I would love to add something like this, but the scope is quite big...
Time progress is not tracked per-key, but per partition. Thus, this is not an issue. (There is a ticket to switch to key-based time-tracking though: https://issues.apache.org/jira/browse/KAFKA-8769 |
Oh, okay, that makes a lot of sense then! In the example above I am using 1 partition, and non-joined rows would show up after another update/insert. In the actual case there are many more partitions, and thus it is much less likely that a non-joined row will show up after another update/insert. It sounds like as of now, the only way for those non-joined rows to be emitted is when another event is processed by the same partition. Is that the case? |
That's correct. Given that we do data stream processing, there should always be a next record to be processed :) |
how it's supposed to manage the case where I need the LEFT join stream records to appear immediately but I also would like to get updates for right join stream WITHIN 50 MILLISECONDS and due to some delay on the right stream I'd have up to 5 minutes GRACE PERIOD. I'd expect to see two messages on the output topic - one for the left stream and one for the right stream under condition the message has arrived up to 5 minutes and has timestamp within 50 milliseconds window
|
Describe the bug
Left joins with grace period do not emit non-joined rows
To Reproduce
Follow the readme in https://github.com/lucasprograms/ksql-left-join-debug
Expected behavior
Left joins with grace period of 0 seconds eagerly emit non-joined rows
Actual behaviour
non-joined rows are never emitted
The text was updated successfully, but these errors were encountered: