Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Left joins with grace period do not emit non-joined rows #9084

Closed
lucasprograms opened this issue Apr 29, 2022 · 6 comments
Closed

Left joins with grace period do not emit non-joined rows #9084

lucasprograms opened this issue Apr 29, 2022 · 6 comments
Assignees
Labels

Comments

@lucasprograms
Copy link

Describe the bug
Left joins with grace period do not emit non-joined rows

To Reproduce
Follow the readme in https://github.com/lucasprograms/ksql-left-join-debug

Expected behavior
Left joins with grace period of 0 seconds eagerly emit non-joined rows

Actual behaviour
non-joined rows are never emitted

@mjsax
Copy link
Member

mjsax commented May 3, 2022

I did not look into the details of the link, but does the example process a finite data set? If yes, it could explain what you observe. If input data stop, stream-time which is based on the event-time of the input data does not advance, and thus the join window cannot be closed. Thus the left join result cannot be emitted.

@lucasprograms
Copy link
Author

Hi @mjsax, thank you for the response. That makes sense, and in the example above (which is using debezium to populate the topics), inserting more rows does indeed cause the non-joined rows to be emitted. It seems like some sort of heartbeat to the topic could ensure that they are emitted.

However, it seems that if the key is unique, or at least not replicable (e.g., it includes a timestamp, or a transaction id, etc.), the non-joined rows could never be emitted, as a new event could never be emitted for that key. That, at least, is my takeaway from reading the first sentence of this section. Am I misinterpreting that? Or is there a way to force a window to update without a matching key event?

@mjsax
Copy link
Member

mjsax commented May 5, 2022

It seems like some sort of heartbeat to the topic could ensure that they are emitted.

Yes, but it's a larger change to the Kafka brokers and Kafka Streams to add such a feature -- personally, I would love to add something like this, but the scope is quite big...

However, it seems that if the key is unique, or at least not replicable (e.g., it includes a timestamp, or a transaction id, etc.), the non-joined rows could never be emitted, as a new event could never be emitted for that key.

Time progress is not tracked per-key, but per partition. Thus, this is not an issue. (There is a ticket to switch to key-based time-tracking though: https://issues.apache.org/jira/browse/KAFKA-8769

@lucasprograms
Copy link
Author

Oh, okay, that makes a lot of sense then! In the example above I am using 1 partition, and non-joined rows would show up after another update/insert.

In the actual case there are many more partitions, and thus it is much less likely that a non-joined row will show up after another update/insert. It sounds like as of now, the only way for those non-joined rows to be emitted is when another event is processed by the same partition. Is that the case?

@mjsax
Copy link
Member

mjsax commented May 5, 2022

It sounds like as of now, the only way for those non-joined rows to be emitted is when another event is processed by the same partition. Is that the case?

That's correct. Given that we do data stream processing, there should always be a next record to be processed :)

@mjsax mjsax self-assigned this May 11, 2022
@ppilev
Copy link

ppilev commented Apr 12, 2023

how it's supposed to manage the case where I need the LEFT join stream records to appear immediately but I also would like to get updates for right join stream WITHIN 50 MILLISECONDS and due to some delay on the right stream I'd have up to 5 minutes GRACE PERIOD.

I'd expect to see two messages on the output topic - one for the left stream and one for the right stream under condition the message has arrived up to 5 minutes and has timestamp within 50 milliseconds window

LEFT STREAM | RIGHT STEAM
-------------------------
    ls-data |        NULL
-------------------------
    ls-data |     rs-data
-------------------------

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants