
feat(router) Add a router strategy to route messages to multiple strategies. #368

Open · wants to merge 14 commits into base: main
Conversation

fpacifici (Contributor)

This PR introduces a Router processing strategy that delivers messages to multiple independent strategies, keeps track of commits that can happen out of order, and commits offsets to Kafka in the right order.

Why a router strategy?
There are multiple use cases. Specifically, this one is about routing messages to a low priority second topic, to put old messages aside and prioritize new ones. It is meant to automate the process we currently employ to route messages to ingest-events-2.
There are other scenarios as well: sending messages to different topics in the indexer, dividing processing into multiple classes on different multiprocess pools, etc.

How does it work?

  • The strategy is provided a selector and a list of strategies to route messages to.
  • The selector inspects a message and routes it. Ideally this is based on a message header, to avoid parsing the message payload.
  • The strategy wraps the commit function passed to the destinations in order to intercept all commits.
  • All commits are registered in an object that keeps track of the watermark to commit.
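A minimal sketch of that flow, assuming hypothetical names (`Router`, `selector`, and the per-route callables are illustrative, not the PR's actual API):

```python
# Illustrative sketch of the routing flow described above; the class and
# method names are hypothetical and do not mirror the PR's actual API.
from typing import Callable, Mapping


class Router:
    def __init__(
        self,
        selector: Callable[[dict], str],               # picks a route, ideally from headers
        routes: Mapping[str, Callable[[dict], None]],  # route name -> strategy submit
    ) -> None:
        self.__selector = selector
        self.__routes = routes

    def submit(self, message: dict) -> None:
        # Inspect the message (headers only, to avoid parsing the payload)
        # and hand it to the chosen destination strategy.
        route = self.__selector(message)
        self.__routes[route](message)


# Usage: route on a header so the payload never has to be parsed.
received = []
router = Router(
    selector=lambda m: "low_priority" if m["headers"].get("old") else "default",
    routes={
        "default": lambda m: received.append(("default", m["offset"])),
        "low_priority": lambda m: received.append(("low_priority", m["offset"])),
    },
)
router.submit({"headers": {}, "offset": 1})
router.submit({"headers": {"old": True}, "offset": 2})
```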

What's the commit policy?
Each strategy can have its own commit policy: each can commit whenever it wants. Each strategy is expected to commit offsets in order per partition, which is no different from standard Kafka behavior. Different parallel destination strategies can commit out of order with respect to each other.
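As a toy illustration of the bookkeeping this implies, here is a hypothetical tracker (not the PR's `PartitionWatermark`) where each route commits in order internally while the routes interleave freely:

```python
# Toy watermark tracker; hypothetical names, not the PR's PartitionWatermark.
from collections import deque
from typing import Deque, Dict, List, Optional


class Watermark:
    def __init__(self, routes: List[str]) -> None:
        self.__uncommitted: Dict[str, Deque[int]] = {r: deque() for r in routes}
        self.__seen: List[int] = []  # every consumed offset, in order

    def add_message(self, route: str, offset: int) -> None:
        self.__uncommitted[route].append(offset)
        self.__seen.append(offset)

    def commit(self, route: str, offset: int) -> None:
        # Each route commits in order per partition (standard Kafka behavior),
        # so everything up to `offset` on this route is done.
        q = self.__uncommitted[route]
        while q and q[0] <= offset:
            q.popleft()

    def committable(self) -> Optional[int]:
        # Highest consumed offset strictly below every route's first
        # still-uncommitted offset; None if nothing is safe yet.
        blocked = [q[0] for q in self.__uncommitted.values() if q]
        limit = min(blocked) if blocked else float("inf")
        eligible = [o for o in self.__seen if o < limit]
        return max(eligible) if eligible else None


wm = Watermark(["fast", "slow"])
wm.add_message("fast", 1)
wm.add_message("slow", 2)
wm.add_message("fast", 3)
wm.commit("fast", 3)     # "fast" committed 1 and 3, but 2 is still in flight
low = wm.committable()   # only 1 is safe to commit on Kafka
wm.commit("slow", 2)
high = wm.committable()  # now everything up to 3 is safe
```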

How will this be used ?
The first use case will be to automate the ingest-events-2 process of putting older messages aside during ingestion when we are trying to burn down a backlog.
Today that process requires people to manually change the Relay configuration to route new messages to a separate topic and start a consumer there.
This strategy will be used in the ingest consumer so that really old messages are sent to a strategy that produces to a backup topic, letting the consumer reach newer messages sooner.
If this works well we will expand the usage to all consumers where in-order delivery is not strictly needed. We could consider adding it inside the StreamProcessor to ensure it is always there.

@fpacifici fpacifici requested review from a team as code owners May 19, 2024 03:04
@untitaker (Member)

> Different parallel destination strategies can commit out of order with respect to each other.

this basically means that parallel strategies cannot work on the same topic, right? they have to commit to entirely different topics or consumer groups.

@untitaker (Member) commented May 22, 2024

@fpacifici i've read the code, it looks like it would work. but what do you think of changing Message so that it carries a set of ranges per partition to commit? then there would be less complexity in figuring out what is safe to commit, and invocations of the commit callback would not have to carry information about which route was taken. it would therefore make the router a simple strategy instead of a strategy-factory that also has to wrap the commit callback.

the downside is that it is a more fundamental change and requires changes to all existing strategies that merge/unmerge messages

@fpacifici (Contributor, Author)

> this basically means that parallel strategies cannot work on the same topic, right? they have to commit to entirely different topics or consumer groups.

If I understand your question correctly, you are asking whether the routes (the strategies we route messages to) would have to commit on different topics.
That would not be the case. The router is meant exactly to track commits from parallel strategies on the same topic and reorder them.

@fpacifici (Contributor, Author)

> @fpacifici i've read the code, it looks like it would work. but what do you think of changing Message so that it carries a set of ranges per partition to commit? then there would be less complexity in figuring out what is safe to commit, and invocations of the commit callback would not have to carry information about which route was taken. it would therefore make the router a simple strategy instead of a strategy-factory that also has to wrap the commit callback.

I am not sure I grasp how this would work. I have a few questions:

Do I understand correctly that the messages that reach the strategy would already know up to where they can commit, so the strategy could just go ahead and commit when it wants?

If that is the case, how would we deal with a scenario where two parallel strategies commit at different intervals: the first commits each message while the second commits in large batches? It is only when the commit is issued by the strategy that we can know which range we can commit.

```
Route 2: ------------------------- 25 --- 30 --- 35
```

Review comment (Member) on `Route1 offsets are processed by a different strategy independently from`:

I think we should add docstrings in several functions to show what the expected/desired outcome is.

```python
logger = logging.getLogger(__name__)


class PartitionWatermark:
```

Review comment (Member): There is no thread-safety in here. Is this intended?

"""
self.__uncommitted[route].append(offset)

def rewind(self, route: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewind requires iteration of all committed messages, and there is no limit to that array, which makes it a O(N) operation. does it make sense to store highest committed value in a separate data structure to access it in O(1)? ofc, this optimization is worth pursuing if you expect it to be called often

```python
        An uncommitted offset is added when the message is consumed: it is
        being processed but its offset has not been committed yet.
        """
        self.__uncommitted[route].append(offset)
```

Review comment (Member): A small idea (nit): instead of growing/shrinking the list, pre-allocate a buffer and use it as storage with two pointers, growing it as needed (like an arena allocator).


```python
        self.__highest_committed: Optional[int] = None

    def add_message(self, route: str, offset: int) -> None:
```

Review comment (Member): Nit: we could use string interning for the route keys instead of plain str.


```python
        return high_watermark

    @property
```

Review comment (Member): Nit: consider using `cached_property` for `uncommitted_offsets`.

```python
        raise

    def close(self) -> None:
        self.__closed = True
```

Review comment (Member): Nit: checking `self.__closed` on close and terminate might be nice.

@untitaker (Member)

> If that is the case, how would we deal with a scenario where two parallel strategies commit at different intervals: the first commits each message while the second commits in large batches? It is only when the commit is issued by the strategy that we can know which range we can commit.

the message struct would contain a set of ranges per partition, and the commit callback would commit the highest watermark that is "covered" by a set of contiguous ranges.

essentially this means the first message emits commit requests for 1..1, 2..2, 3..3, while the second strategy emits requests for 4..10, 10..20. it's then the job of the commit callback to reassemble those ranges, figure out that all of those can be constructed into a larger range 1..20, and since that range covers all offsets since the last commit, 20 can be committed.
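A sketch of the reassembly described above (hypothetical `committable` helper, not code from this PR; note it assumes a contiguous offset space):

```python
def committable(ranges, last_committed):
    # Merge (start, end) commit requests and return the highest offset
    # covered by a contiguous run starting right after last_committed.
    high = last_committed
    for start, end in sorted(ranges):
        if start > high + 1:
            break  # gap below `start`: those offsets are still in flight
        high = max(high, end)
    return high


# The example from the comment: one strategy emits 1..1, 2..2, 3..3,
# the other 4..10 and 10..20; together they cover 1..20.
requests = [(1, 1), (2, 2), (3, 3), (4, 10), (10, 20)]
result = committable(requests, 0)  # 20
```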

@fpacifici (Contributor, Author)

> essentially this means the first message emits commit requests for 1..1, 2..2, 3..3, while the second strategy emits requests for 4..10, 10..20. it's then the job of the commit callback to reassemble those ranges, figure out that all of those can be constructed into a larger range 1..20, and since that range covers all offsets since the last commit, 20 can be committed.

I don't think that is enough. There is no guarantee that the offset space is contiguous.

  • Kafka only guarantees that offsets are strictly increasing: if message b is produced after message a on the same partition, then offset b is greater than offset a. It does not imply that offset b = offset a + 1; in fact Kafka can skip offsets.
  • We may have a batching step before the router that coalesces messages, so the sequence of messages the router receives has holes.

In order for the range idea to work, the commit callback still needs to track each offset.

The other part I am not sure I understand is: how would the router generate the range to attach to the routed message?
Let's say the router receives offsets 1, 3, 5, 7 in sequence and routes them all to the same route, and the strategy only commits after the last offset.
Would we attach these ranges to the messages: 1-1, 1-3, 1-5, 1-7? In that case, when we receive the commit we would only see commit 1-7, which would be good. Though what if, after sending range 1-7, we receive a commit for 1-5 and then a commit for 1-7? Now the callback needs to deal with overlapping ranges.

@untitaker (Member) commented May 30, 2024

> I don't think that is enough. There is no guarantee that the offset space is contiguous.

the commit callback would buffer up all offsets until there is a contiguous range that can be committed. the memory usage would be somewhat hard to control but I think it can be done.

> Would we attach these ranges to the messages? 1-1, 1-3, 1-5, 1-7

that is the idea, we change the message struct to contain ranges. but the initial messages extracted from kafka don't contain 1-1, 1-2, 1-3, they contain 1-1, 2-2, 3-3. each message contains the individual offsets it represents, just compressed into a set of ranges. later, in reduce for example, messages 1-1 and 2-2 are folded into a message with 1-2
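The folding step could be sketched like this (hypothetical `fold` helper, assuming the range-set representation discussed here):

```python
def fold(ranges_a, ranges_b):
    # Merge two sets of (start, end) offset ranges, coalescing adjacent or
    # overlapping ones, as a reduce step would when folding messages together.
    merged = []
    for start, end in sorted(ranges_a + ranges_b):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


combined = fold([(1, 1)], [(2, 2)])  # [(1, 2)]
```

Non-adjacent ranges stay separate, which is what preserves the holes fpacifici mentions: `fold([(1, 1)], [(3, 3)])` keeps two ranges.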
