[FEATURE] Support stream/table joins #177

dberardo-com · 2022-11-28T07:09:47Z

Is it possible to use Bytewax to join the contents of different Kafka topics (similar to what ksqlDB does)?

Doing this would be an example of integrating "persistent queries" (permanent background processes that never stop). Is this a good use case for Bytewax?

Also, compared to ksqlDB, what happens if the Bytewax workers are killed? Will those persistent queries restart automatically when the workers come back up, and will they use the latest/earliest committed offset on the Kafka topics? Or is the restart manual?

cheers

awmatheson · 2022-11-28T21:55:07Z

👋 @dberardo-com, I will try to answer your questions below.

> Is it possible to use Bytewax to join the contents of different Kafka topics (similar to what ksqlDB does)?

Yes, you can join streams together in Bytewax. The caveat is that the native Kafka connector (KafkaInputConfig) does not currently provide this functionality, so you would have to use ManualInputConfig.

Today, you can write a dataflow (like this example) with ManualInputConfig that consumes from multiple topics and joins the streams together. Consuming from Kafka with ManualInputConfig requires you to manage offsets and state recovery yourself in the manual input.

> Doing this would be an example of integrating "persistent queries" (permanent background processes that never stop). Is this a good use case for Bytewax?

Yes, persistent queries, if I understand what you mean, are a good use case for Bytewax, and stateful operators (stateful_map, reduce_window, etc.) allow for this type of behavior. You can see a very rudimentary example of a persistent query across multiple streams in the linked example above.
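To make the stateful approach concrete, here is a minimal sketch in plain Python of the kind of per-key logic a stateful step could run when both topics are merged into one stream. It is not Bytewax API: the `(state, event) -> (state, output)` mapper shape, the topic names `users` and `orders`, and the payload fields are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical event shape: (topic, payload). Both topics share one stream
# and the topic name tells the two sides of the join apart.
Event = Tuple[str, Dict[str, Any]]
State = Dict[str, Dict[str, Any]]


def build_state() -> State:
    """Start with nothing seen on either side for this key."""
    return {}


def join_mapper(state: State, event: Event) -> Tuple[State, Optional[Dict[str, Any]]]:
    """Remember the latest payload per topic; emit once both sides are present."""
    topic, payload = event
    state = {**state, topic: payload}
    if "users" in state and "orders" in state:
        return state, {**state["users"], **state["orders"]}
    return state, None


# Drive the mapper by hand for a single key, the way a stateful step would.
state = build_state()
outputs = []
for event in [
    ("users", {"user_id": 1, "name": "Ada"}),
    ("orders", {"user_id": 1, "order_id": 10}),
    ("orders", {"user_id": 1, "order_id": 11}),
]:
    state, out = join_mapper(state, event)
    outputs.append(out)
```

Nothing is emitted until both sides have arrived for the key; after that, every new order is enriched with the last known user record, which is roughly the stream/table behavior asked about.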

> Also, compared to ksqlDB, what happens if the Bytewax workers are killed? Will those persistent queries restart automatically when the workers come back up, and will they use the latest/earliest committed offset on the Kafka topics? Or is the restart manual?

If a worker dies and you have recovery enabled, you can restart the dataflow; it will recover its state and resume from the appropriate offset automatically. If you are running Bytewax on k8s or as a service via [waxctl](https://www.bytewax.io/docs/deployment/waxctl), the restart itself can happen automatically as well.
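The idea behind "recover the state and start at the appropriate offset" can be sketched in plain Python as follows. This is only an illustration of the principle, not how Bytewax implements recovery: the in-memory `LOG`, the `Snapshot` shape, and the `crash_at` parameter are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for one Kafka topic partition.
LOG: List[str] = ["a", "b", "a", "c", "a", "b"]

# A snapshot pairs the next offset to read with the operator state, so the
# two cannot disagree after a restart.
Snapshot = Tuple[int, Dict[str, int]]


def run(log: List[str], snapshot: Snapshot, crash_at: Optional[int] = None) -> Snapshot:
    """Count keys starting from the snapshot; optionally stop early to mimic a crash."""
    offset, counts = snapshot[0], dict(snapshot[1])
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            break  # worker killed; only the snapshot returned below survives
        counts[log[offset]] = counts.get(log[offset], 0) + 1
        offset += 1
    return offset, counts


start: Snapshot = (0, {})
uninterrupted = run(LOG, start)

saved = run(LOG, start, crash_at=4)  # state and offset as of the crash
resumed = run(LOG, saved)            # the restart picks up at offset 4
```

Because the offset and the counts are saved together, the resumed run neither re-counts records 0 to 3 nor skips any, and ends in the same state as the uninterrupted run.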

colebaileygit · 2023-11-30T23:21:48Z

Would be great if this could be done more cleanly, e.g. having two different inputs in a flow which can be transformed independently, and then later keyed and joined. Otherwise the whole paradigm is untyped in Python and would require messy if blocks 🤔

davidselassie · 2024-01-08T20:02:34Z

This is now cleanly possible in the latest version of Bytewax, v0.18.0: https://github.com/bytewax/bytewax/releases/tag/v0.18.0. It supports having multiple independent input sources and an explicit join operator. See our documentation on joins for how this works.
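As a rough picture of that shape (two independent inputs, each transformed and keyed on its own, then joined), here is a plain-Python sketch. It deliberately does not use the Bytewax API, so please check the joins documentation for the real operator; the `join` helper, the sample records, and the field names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Keyed = Tuple[str, Any]
_MISSING = object()  # marks a side that has not produced a value for a key yet


def join(*sides: Iterable[Keyed]) -> List[Tuple[str, Tuple[Any, ...]]]:
    """Emit (key, (side0_value, side1_value, ...)) once every side has a value."""
    slots: Dict[str, List[Any]] = {}
    out: List[Tuple[str, Tuple[Any, ...]]] = []
    for idx, side in enumerate(sides):
        for key, value in side:
            row = slots.setdefault(key, [_MISSING] * len(sides))
            row[idx] = value
            if all(v is not _MISSING for v in row):
                out.append((key, tuple(row)))
    return out


# Two independent inputs with different record shapes.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
emails = [
    {"user_id": 2, "email": "Bob@Example.com"},
    {"user_id": 1, "email": "Ada@Example.com"},
]

# Each input is transformed and keyed separately; no shared if blocks needed.
keyed_names = ((str(u["id"]), u["name"]) for u in users)
keyed_emails = ((str(e["user_id"]), e["email"].lower()) for e in emails)

joined = join(keyed_names, keyed_emails)
```

Each side keeps its own record type until the join, which is what removes the need to branch on the record's origin inside a single merged stream.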

github-actions bot added the needs triage label Nov 28, 2022

awmatheson added the question label and removed the needs triage label Nov 28, 2022

davidselassie closed this as completed Jan 8, 2024
