Kafka: Play nice with transactional topics #5404

Closed

gianm opened this issue Feb 20, 2018 · 15 comments

Comments

@gianm
Contributor

gianm commented Feb 20, 2018

See https://groups.google.com/d/topic/druid-user/QNgVy7IDsgA/discussion for issues reported with transactional topics. It sounds like the "there will be no message gaps" assumption is not valid in the transactional world. So we should at least look at that, and also make sure that in general we are calling the consumer properly.

The message gap check was a nice sanity check on the Kafka consumer and I will be sad to see it go. I guess we'll have to be more trusting of Kafka now :)

@b-slim
Contributor

b-slim commented Feb 23, 2018

I have looked into this and found that the issue is not really the transactional world (transaction state is kept in separate topics) but, as @gianm pointed out, the optional log compaction that can happen to the Kafka log: per the docs, Kafka brokers can compact records that share the same key, so the Kafka log can have gaps. This option has been available since Kafka 0.9.1 via log.cleanup.policy=compact. To close this issue, if needed, we can update the docs to make sure Druid users know about this behavior and the flag. Also, IMO we can remove the warning log message; I don't think it really helps, since a user who asked to allow gaps doesn't need a warning about them. @pjain1 and @gianm, what do you think?
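
If it is compaction, users can check their topic configuration themselves. Here is a minimal sketch using the Kafka AdminClient (available in kafka-clients 0.11+); the broker address and topic name below are placeholders, not anything Druid-specific:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckCleanupPolicy
{
  public static void main(String[] args) throws Exception
  {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

    try (AdminClient admin = AdminClient.create(props)) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder topic
      Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
      // "compact" (or "compact,delete") means the broker may remove records with
      // duplicate keys, which leaves gaps in the offset sequence.
      System.out.println("cleanup.policy = " + config.get("cleanup.policy").value());
    }
  }
}
```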

@pjain1
Member

pjain1 commented Feb 27, 2018

Docs can be updated if you are sure the problem under discussion is because of log compaction. I think some form of logging should be there indicating that offsets are being skipped; maybe just count the number of skipped offsets and log it every X seconds, and switch the current log to debug level if it logs too much.
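
Something along these lines, as a rough sketch; the class, method names, and 30-second interval are illustrative only, not existing Druid code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of throttled gap logging; names are made up.
class OffsetGapReporter
{
  private static final Logger log = LoggerFactory.getLogger(OffsetGapReporter.class);
  private static final long LOG_INTERVAL_MS = 30_000; // the "X seconds" above

  private long skippedOffsets = 0;
  private long lastLogTimeMs = 0;

  void recordGap(long expectedOffset, long actualOffset)
  {
    skippedOffsets += actualOffset - expectedOffset;
    long now = System.currentTimeMillis();
    if (now - lastLogTimeMs >= LOG_INTERVAL_MS) {
      // One aggregate warning per interval instead of one line per gap.
      log.warn("Skipped {} offsets since last report", skippedOffsets);
      skippedOffsets = 0;
      lastLogTimeMs = now;
    }
  }
}
```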

@gianm
Contributor Author

gianm commented Feb 27, 2018

@b-slim are you sure the issue reported in the thread is because of compaction? I don't see a reason why the user's topic would have been compacted. He also said that when he turned off transactionality, the issues stopped. It seems more likely to me that Kafka is playing some games and not returning messages for every single offset (e.g. when those messages are part of failed transactions). That would make sense to me if the offset is assigned optimistically before the transaction commits.

@b-slim
Contributor

b-slim commented Mar 2, 2018

@gianm I thought Druid was using the old consumer client, which reads both committed and uncommitted records, so I'm not sure why the consumer would not see a sequential order of offsets. Also, to my understanding, commits are stored at the brokers, and I don't think they get written to the log itself. The original claim says they "turned on exactly once"; I'm not sure what that means in practice, since I don't think there is such a flag. Anyway, this seems like a black swan...

@gianm
Contributor Author

gianm commented Mar 2, 2018

Hmm, I don't know enough about how Kafka transactions work to know what to do. Is anyone following this thread with enough Kafka expertise to know for sure whether we should expect offset gaps in transactional topics?

Fwiw, we are using the new consumer API, kafka-clients version 0.10.2.0.

@b-slim
Contributor

b-slim commented Mar 3, 2018

Transactions were introduced in Kafka 0.11: https://issues.apache.org/jira/browse/KAFKA-4815.

@ebolwidt

ebolwidt commented Mar 8, 2018

I added more detail to the Google groups topic. In particular,

Kafka KIP-98 for transactions:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

Look at section 5.2 of that KIP, "WriteTxnMarkerRequest".

It's also mentioned in https://www.confluent.io/blog/transactions-apache-kafka/ (Search for "transaction commit markers").
I believe that using the Kafka 0.11 client (or newer) will make the problem go away.

@gianm
Contributor Author

gianm commented Mar 8, 2018

Hi @ebolwidt, thanks for the research!

Are you interested in contributing a test case (one that fails today) along with a patch to upgrade us to Kafka 0.11? If it turns out that the upgrade is all we need, that's great and it sounds like we should just do it.

@ebolwidt

ebolwidt commented Mar 8, 2018

Interested but time-constrained. I'll look into it at home, as my work infrastructure doesn't allow me to run the Docker-based integration test suite.

@xvrl
Member

xvrl commented Mar 9, 2018

@ebolwidt @gianm @b-slim with transactional producers you can no longer expect data message offsets to be sequential. Control messages (used for transaction markers) take up their own offsets. Applications, however, never see those offsets, because they either get filtered out on the broker (when down-converting for older clients) or on the client (during transaction processing).

Switching to a 0.11+ consumer and enabling read_committed will do the right thing and properly consume data from transactional producers (i.e. ignoring aborted transactions), while preserving the existing behavior for data from non-transactional producers.

It should be safe to use read_committed by default, as the overhead of enabling it for non-transactional logs should be minimal. We should still verify the performance impact in practice though.
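
For reference, a minimal sketch of a consumer configured this way (kafka-clients 2.0+ signatures; the broker address and topic are placeholders, not actual Druid code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ReadCommittedExample
{
  public static void main(String[] args)
  {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    // Only return records from committed transactions; control markers and
    // aborted transactions are filtered out, which is what leaves offset gaps.
    props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0))); // placeholder topic
      ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<byte[], byte[]> record : records) {
        // Offsets may jump over transaction markers; consecutive offsets are
        // no longer guaranteed even though no committed data is missed.
        System.out.println("offset = " + record.offset());
      }
    }
  }
}
```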

@dclim dclim self-assigned this Mar 21, 2018
@jihoonson
Contributor

> with transactional producers you can no longer expect data message offsets to be sequential. Control messages (used for transaction markers) take up their own offsets. Applications, however, never see those offsets, because they either get filtered out on the broker (when down-converting for older clients) or on the client (during transaction processing).

@xvrl thanks, good to know.

IMO it sounds good to trust Kafka more and remove our sanity check, if there is no way to figure out which offsets are taken up by control messages.

@xvrl
Member

xvrl commented Mar 21, 2018

@jihoonson One thing we can check for, though, is that offsets haven't moved from where we expect them to be if, for whatever reason, we have to re-create the consumer in the same consumer group (e.g. maybe on task restart). This would avoid losing data in situations where offsets were reset for some reason.
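
A rough sketch of such a check, assuming we compare the persisted offset against the partition's current offset range at startup; the class name, method name, and error handling are illustrative only:

```java
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Illustrative sketch; not existing Druid code.
class OffsetSanityCheck
{
  static void verifyResumeOffset(
      KafkaConsumer<byte[], byte[]> consumer,
      TopicPartition partition,
      long expectedOffset
  )
  {
    Map<TopicPartition, Long> earliest = consumer.beginningOffsets(Collections.singleton(partition));
    Map<TopicPartition, Long> latest = consumer.endOffsets(Collections.singleton(partition));
    // If the offset we persisted is no longer within the partition's range,
    // the broker has truncated or reset the log; fail fast rather than
    // silently seeking somewhere else and losing data.
    if (expectedOffset < earliest.get(partition) || expectedOffset > latest.get(partition)) {
      throw new IllegalStateException(
          "Persisted offset " + expectedOffset + " is outside the available range ["
          + earliest.get(partition) + ", " + latest.get(partition) + "] for " + partition
      );
    }
  }
}
```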

@peferron
Contributor

peferron commented Jun 19, 2018

Looking forward to #5674; we are having the same issue as in the original Google Groups topic, where indexing tasks get stuck when seeking to a transaction marker offset.

gianm pushed a commit that referenced this issue Feb 18, 2019
* Support kafka transactional topics

* update kafka to version 2.0.0
* Remove the skipOffsetGaps option since it's not used anymore
* Adjust kafka consumer to use transactional semantics
* Update tests

* Remove unused import from test

* Fix compilation

* Invoke transaction api to fix a unit test

* temporary modification of travis.yml for debugging

* another attempt to get travis tasklogs

* update kafka to 2.0.1 at all places

* Remove druid-kafka-eight dependency from integration-tests, remove the kafka firehose test and deprecate kafka-eight classes

* Add deprecated in docs for kafka-eight and kafka-simple extensions

* Remove skipOffsetGaps and code changes for transaction support

* Fix indentation

* remove skipOffsetGaps from kinesis

* Add transaction api to KafkaRecordSupplierTest

* Fix indent

* Fix test

* update kafka version to 2.1.0
gianm pushed a commit to implydata/druid-public that referenced this issue Mar 9, 2019
gianm pushed a commit to implydata/druid-public that referenced this issue Apr 9, 2019
@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@stale

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019