How to use batch.size? #290
@ewencp sorry, I didn't get your reply.
I'm also struggling to increase the sink batch size above 500.
@SAEEDSH instead of batching it on the sink side, I guess it can be done on the source connector side, or you can try to increase the consumer's max.poll.records.
Thanks @dipdeb. But my problem is when I already have data in Kafka and need to sink it. For example, when I have a million records in Kafka and run the JDBC sink connector, it sends them to the DB in batches of 500 each, which takes quite some time. I don't know how to increase the number of records that go to the DB per batch.
The Connect worker consumes the messages from the topics, and the consumer's max.poll.records setting limits how many records are handed to the sink task at a time, so it effectively caps the batch size.
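As an illustration of the worker-level approach, here is a sketch of the relevant lines in a Connect worker properties file (values are illustrative, not recommendations): `consumer.`-prefixed keys in the worker config are passed through to the consumers used by sink connectors.

```properties
# In connect-distributed.properties (or connect-standalone.properties).
# Applies to the consumers of ALL sink connectors running on this worker.
consumer.max.poll.records=3000
consumer.max.partition.fetch.bytes=10485760
```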
I am hitting this issue as well, even with a larger batch size configured. I have 8 partitions across 3 brokers. I turned on DEBUG to see if the logs turned up anything.
Any help would be appreciated. These small batch sizes are causing the destination DB (Oracle) to commit too frequently, which is causing waits and I/O contention.
I was able to get greater throughput by adjusting consumer properties in the worker configuration.
Did the above work for anyone else? I'm still stuck at 500 records per batch.
No. It seems 500 is fixed; I couldn't increase it either.
Interesting discussion. I would like to add a question: if the batch size is set to 500 and the topic has 499 records, and it takes an hour for the 500th record to be inserted into the topic, will the sink connector halt completely for that one hour? Is there a maximum wait limit property for JDBC sinks?
I have the same problem. I changed batch.size, but it had no effect.
I am having the same issue. Looking at the code, the buffer is flushed once per SinkTask.put method call, so the limiting factor is the number of messages consumed from Kafka at a time.
@sashati have you got any solution?
@jukops Hello. I have the same problem, but my records are saved one at a time, not in batches (https://stackoverflow.com/questions/59049762/kafka-jdbc-sink-connector-insert-values-in-batches). I tried several options, but none of them made any difference for me (records are still saved one by one). How did you achieve saving in batches?
@rhauch, do you know if this behavior is still correct? If so, would it be possible to update the docs to mention that 'batch.size' is essentially tightly coupled to 'max.poll.records'?
@MashaFomina Did you ever find a workaround here? As it stands, we're stuck using a custom consumer to perform this sink job. I still find it hard to believe this would only batch by creating thousands of individual insert statements, as this completely neuters what databases are amazingly efficient at. I don't have much Postgres experience, but for MySQL this would wreak havoc on the binlog and replication. I suppose Kafka would not be aware of the exact offset in the batch that fails the insert, but there will always be unaccounted-for issues with garbage in. Or maybe some of the database flavors this connector supports have compatibility issues here, or fail multi-row inserts differently? The only thing I know for certain is that for MS SQL, MySQL, and Oracle this is a huge performance deal breaker.
@unfrgivn I did not find a proper solution. I finally understood that records really are polled from the topic in batches, but they are inserted into the database in a transaction containing a batch of single-record inserts. I investigated the speed of writing to the database. You can try to write your own consumer to insert records into the database.
@MashaFomina yeah, I rewrote this connector with a similar config as a plain consumer. Just asking because I'd love to use the Connect API the same way I do for my other sources/sinks, but the performance is almost 30x worse with 1 insert per row, plus the massive increase in disk I/O for MySQL binlog transactions (assuming the most common setup using InnoDB with autocommit on).
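For anyone considering the custom-consumer route mentioned above, here is a minimal sketch of the batching idea. It uses sqlite3 as a stand-in for the target database and a plain list as a stand-in for records polled from Kafka; the table name, columns, and batch size are illustrative assumptions, not details from this thread.

```python
import sqlite3

BATCH_SIZE = 500  # illustrative; tune to your workload

def flush(conn, buffer):
    """Insert all buffered records in one multi-row transaction."""
    if not buffer:
        return
    with conn:  # one commit per batch instead of one per record
        conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", buffer)
    buffer.clear()

# Stand-in for records polled from Kafka.
records = [(i, f"user-{i}") for i in range(1234)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

buffer = []
for record in records:
    buffer.append(record)
    if len(buffer) >= BATCH_SIZE:
        flush(conn, buffer)
flush(conn, buffer)  # flush the final partial batch

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

With a real Kafka consumer you would commit offsets only after a successful flush, so a failed batch is re-consumed rather than lost.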
Are you setting these properties from the worker side or from the connector (JSON) side? One thing that I don't think is documented well is that on the connector (JSON) side you need to use the consumer.override. prefix. Regardless, I would be quite surprised if your records were actually being inserted one at a time. If you can't get over 500, make sure to follow the above. You can turn the root log level to DEBUG and search for "Flushing", as it will tell you how many records it is flushing at a time.
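To illustrate the connector-side approach, here is a sketch of a sink connector JSON with a consumer override. It assumes the worker permits client config overrides; the connector name, topic, and connection URL are placeholders.

```json
{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "my_topic",
    "connection.url": "jdbc:postgresql://localhost:5432/postgres",
    "batch.size": "3000",
    "consumer.override.max.poll.records": "3000"
  }
}
```

Keeping batch.size no larger than the poll size matters because the sink can only batch what a single poll hands it.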
@unfrgivn I'd love to see the tweaks you made to the sink connector to support batch inserts. My team is running into this exact issue, and we seem to have no choice but to fork the repo and write our own custom implementation of the sink task.
@dconger we are using a Node.js implementation of a Kafka consumer to replicate the sink functionality, so it's neither a fork nor even a Kafka connector.
@ewencp does this mean we should create a custom sink connector with an adjusted poll method that runs on bigger batches? Just to clarify whether there is no solution for this yet, or whether it's on the roadmap for the near future.
Is there any solution?
So, do we have any authoritative Kafka Connect source and sink test reports, such as MySQL to Kafka?
I've no experience with JDBC connectors, but assuming that this is generic configuration for all kinds of connectors, and based on this Stack Overflow answer, did you try adding consumer.max.poll.records to the worker configuration? I was struggling with this while working with S3 sink connectors, where max.poll.records was always configured to 500 by default. The property above fixed the issue for me.
Sorry, I'm also very interested in this topic, but I have a question: is this consumer.max.poll.records a Kafka configuration, or would it be defined at each connector level? I'm asking because I noticed that my connectors are waiting for days (I'm in a development setup and I don't have massive load on my system) for the data to be pushed from Kafka to the sink systems (both JDBC and S3).
I was able to increase the batch size behavior. As indicated above, Kafka Connect needs to allow the consumer override for this to take effect. I was expecting a performance increase, but it stayed around 4k messages per second. When I increased the number of connector tasks though, I was able to get over 20k messages per second.
@MrMarshall what was your setup like: # of topic partitions? # of tasks for the connector?
@ykcai I ran 3 brokers on the same machine; the topic had 6 partitions, I believe. I ran multiple tests with a varying number of connector tasks and batch sizes. I checked the number of records in Postgres every 5 seconds and plotted the added number of records in the graph below. Each line shows numberoftasks_batchsize; e.g. the brown line has 10 tasks and 10k batch size. I didn't even include running only 1 task because it took about 3x longer and that messed up the graph. My conclusion was that batch size does not significantly affect the performance, but the number of tasks does. When using 6 partitions it doesn't help much to use 5 tasks, because one of the tasks will need to take care of 2 partitions, so in that case it's better to use 2, 3 or 6 tasks.
@MrMarshall just an FYI: having a total # of tasks > # of topic partitions is unnecessary. Did you turn on trace logs and verify you were actually getting batches of 2000, 3000, 10000, etc.? Very often the real batch size is much smaller than people expect if everything isn't configured just right; there are several parameters that need to be set, such as fetch limits and byte limits. It is also to be expected that the number of tasks matters more than batch size, up to the number of partitions in the topic.
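To make the fetch and byte limits mentioned above concrete, here is a sketch of the worker-side consumer settings that commonly cap the effective batch (the values are illustrative assumptions, not tested recommendations):

```properties
consumer.max.poll.records=3000                # upper bound on records handed to the sink per poll
consumer.fetch.min.bytes=1048576              # let the broker accumulate ~1 MB before responding...
consumer.fetch.max.wait.ms=500                # ...but wait at most 500 ms for it
consumer.max.partition.fetch.bytes=10485760   # per-partition fetch cap
```

If any of the byte limits is too small, a poll can return fewer records than max.poll.records, which shows up as smaller-than-expected "Flushing" counts in the logs.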
Yes, I saw the count go up only by the batch size (or a multiple of the batch size). I didn't check the DB logs though.
In my experience, I have come to many of the same conclusions as you. If you want to do further testing, I would do all runs with 3 tasks or all with 6 tasks; as you mentioned, an even assignment of partitions to tasks is typically ideal. I would skip any testing with more tasks than you have partitions, as those will not be assigned any work. Off the top of my head, my best performance was with batch sizes in the few thousands, with diminishing returns on either side of that. If you want to be 100% sure of what is going on, I would turn on trace logs for a small workload and search them to make sure it isn't slowly building up to your batch limit; within reason, it is advantageous to have a similar amount of data being read in each poll, from my experience.
So, the solution is: raise the consumer's max.poll.records (at the worker or via the connector-level override discussed above) so each poll hands the sink enough records, keep batch.size no larger than that, and scale the number of tasks up to the number of topic partitions.
Hi,
I want to read only 50 records in a batch through the JDBC sink, for which I've used the batch.size property in the JDBC sink config file:

```
name=jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
batch.size=50
topics=postgres_users
connection.url=jdbc:postgresql://localhost:34771/postgres?user=foo&password=bar
file=test.sink.txt
auto.create=true
```

But batch.size has no effect: records are inserted into the database as soon as new records appear in the source database. How can I achieve inserting in batches of 50?