Undesirable resource usage related to producer concurrency #132
I am OK with sharing the client, but I think your proposed solution would be non-trivial. Perhaps the best is to have a …
Yeah, that sounds simpler indeed. The only problem I can see is that it's an all-or-nothing solution. Maybe receive an integer to create multiple clients and select a random one on producer initialization? Which part do you consider non-trivial: multiple (but not all) clients, or starting them inside producer init? I'm going to start working on a PR that receives a boolean flag at first, and we can decide about the integer later. Thanks!
I'm not sure if I follow the use-case: why would you have a producer concurrency higher than the number of brod clients required for maximum parallelism? I.e., assume you have 4 partitions: why would you start more than 4 producers? And if it has some benefit, wouldn't those benefits be lost by sharing the same brod client/group coordinator?
Yes, AFAICT we do not need more producer concurrency than 4, but that is not the problem here. The problem is that each Broadway producer initializes a new brod client. With the current behaviour, if you set up this pipeline you will end up with 4 brod clients and 4 TCP connections (assuming just 1 broker). The point is that you could reuse the brod client instead of starting a new one per producer, because the parallelism needed for the producers scales differently from the parallelism needed for brod clients.
The only scenario I can see where starting more brod clients would help is if the bottleneck of your pipeline is the TCP connection between the application and the broker, which is almost never the case if you are batching properly and the connection uses non-blocking IO, as the Kafka protocol guide notes: https://kafka.apache.org/protocol.html#protocol_network
Considering that, I think most cases would not be negatively impacted by sharing clients, but since it may be a problem for some specific scenarios we should keep the current behaviour as the default. Does that make sense to you, @v0idpwn?
Absolutely, thanks for the thoughtful explanation!
@josevalim after my first pass at the code I have some considerations. Since we only have access to the client config when the producer is initialized, the only way I could start the client before producer initialization would be via application config, something like:

```elixir
client_1 = %{
  id: :my_shared_client_1,
  hosts: ["host1", "host2"],
  group_config_options: foo,
  client_config_options: bar,
  fetch_config_options: baz
}

config :broadway_kafka, :shared_clients, [client_1, client_2, client_3]
```

And then accept the `shared_client_id` on the producer options:

```elixir
Broadway.start_link(MyBroadway,
  name: MyBroadway,
  producer: [
    module: {BroadwayKafka.Producer, [
      shared_client_id: :my_shared_client_1,
      topics: ["test"]
    ]},
    concurrency: 1
  ],
  processors: [
    default: [
      concurrency: 10
    ]
  ]
)
```

Was that what you had in mind? It seems like a bigger change than what you had proposed, am I missing something?
The `prepare_for_start` callback should allow you to specify more children that are added to the supervision tree: https://github.com/dashbitco/broadway/blob/ebee2a94ffa6f16bc14ffa6dbc20d3c2f7b5bb73/lib/broadway/producer.ex#L114
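For context, a minimal sketch of what starting a shared client from that callback could look like. The client name, broker list, and exact placement are illustrative assumptions, not BroadwayKafka's actual code:

```elixir
# Hypothetical sketch: prepare_for_start/2 returns extra child specs that
# Broadway adds to its own supervision tree, so a shared :brod client
# started here is supervised alongside the pipeline.
@impl true
def prepare_for_start(_module, broadway_opts) do
  client_id = :my_shared_client            # illustrative client name
  hosts = [{"localhost", 9092}]            # illustrative broker list

  child_spec = %{
    id: client_id,
    start: {:brod, :start_link_client, [hosts, client_id, []]}
  }

  {[child_spec], broadway_opts}
end
```

Because the child spec is returned to Broadway rather than started eagerly, the client's lifecycle is tied to the pipeline's: it starts before the producers and is shut down with them.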
I've tested the changes on the sandbox environment of a real-world system we have here and the results are great so far: memory usage decreased by 1.6 GB and port usage decreased by 500 ports. Given that, I think we can close this issue! Thanks for the help and feedback. 😄
Context
I've noticed that the current implementation of `BroadwayKafka.BrodClient.setup/4` always starts a new `:brod` client. The problem is that this function is called for every new `BroadwayKafka.Producer`, which may be initialized multiple times if producer concurrency is set to a number greater than one.

To my current understanding, in order to achieve maximum parallelism the number of Broadway producers we need is the lower of the number of schedulers online and the sum of all topics' partitions. But with the current implementation this leads to a new TCP connection with each broker of the cluster for each one of the producers, which is undesirable.
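To illustrate the current behaviour with a hypothetical pipeline (the hosts, group id, and topic values are made up), this setup starts 4 producers and therefore 4 brod clients, each holding its own TCP connection per broker:

```elixir
Broadway.start_link(MyBroadway,
  name: MyBroadway,
  producer: [
    module: {BroadwayKafka.Producer, [
      hosts: [{"localhost", 9092}],
      group_id: "my_group",
      topics: ["test"]
    ]},
    # 4 producers -> 4 brod clients -> 4 TCP connections per broker today
    concurrency: 4
  ],
  processors: [default: [concurrency: 10]]
)
```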
Proposal
Since a single brod client is enough to handle most workloads, we could offer a new client option called `max_concurrency` (defaulting to `:infinity`) that would control how many brod clients are started. From my first look at the code, I think the best approach would be to start all brod clients before any producer, and have each producer select a random client on initialization.
The general approach consists of the following changes:

- On `BroadwayKafka.Producer.init/1`, call a function `maybe_start_clients(opts)` that returns a list of `{client_id, group_coordinator}` tuples, and select a random tuple to use as the producer's internal state.
- `maybe_start_clients(opts)`: if the client `max_concurrency` is `:infinity`, it starts a single client and returns a single tuple. If it is a positive integer, it starts N clients (if they are not yet started) and returns them. It saves the information about started clients in a shared resource such as an ETS table or persistent term.

This proposal is very broad and I'll probably need to refine it during development, considering possible side effects.
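As a rough sketch of the two steps above (the function and key names are assumptions to be refined, and `start_group_coordinator/2` is a placeholder for whatever coordinator startup ends up looking like):

```elixir
# Hypothetical sketch of maybe_start_clients/1: start up to
# max_concurrency brod clients once, memoizing them in :persistent_term.
defp maybe_start_clients(opts) do
  case :persistent_term.get({__MODULE__, :clients}, nil) do
    nil ->
      n =
        case opts[:max_concurrency] do
          :infinity -> 1
          n when is_integer(n) and n > 0 -> n
        end

      clients =
        for i <- 1..n do
          client_id = :"broadway_kafka_client_#{i}"
          {:ok, _} = :brod.start_link_client(opts[:hosts], client_id, [])
          {:ok, coordinator} = start_group_coordinator(client_id, opts)
          {client_id, coordinator}
        end

      :persistent_term.put({__MODULE__, :clients}, clients)
      clients

    clients ->
      clients
  end
end

# Each producer then picks one at random on init:
# {client_id, coordinator} = Enum.random(maybe_start_clients(opts))
```

A real implementation would also need to guard against two producers racing through the `nil` branch at the same time, e.g. by serializing startup through a single process rather than starting clients from producer init directly.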
Closing thoughts
Let me know if all this makes sense to you, if I misunderstood something about the problem, or if there is another way to solve this with the features we currently have.
If it all makes sense, I'll start working on the PR for this. Just let me know! Thanks! 😃