Missing data caused by failure to find latest offsets from files during recovery #118
There has not been a comment on this issue yet, but this is critical for anyone using a non-default partitioner, based on the behavior I have experienced. I was able to create a temporary fix by reverting the changes made on 20151125, which delete the WAL and tmp files instead of flushing and persisting them. Now a graceful shutdown or rebalance commits in-flight data and resumes consuming from Kafka based on the previous consumer offset. Obviously this is not ideal if Kafka Connect is killed, someone deletes the persisted files in the commit directory, or the consumer offsets are disrupted.
@zako I'd like to revisit this issue with you if you are still interested. Would you be able to share some more details here? The expected behavior of the state machine is that the consumer offsets on the Kafka side are not committed until there has been a confirmation of the write to the WAL file. In this way we are meant to guarantee that no data is marked as "read" from Kafka until it has been confirmed as "written" to HDFS in the form of the WAL file. This should be independent of your partitioner implementation. The changes from 20151125 should make it so that any WAL file that has not yet been confirmed as written and closed by HDFS is discarded upon rebalance, so that you end up with a clean transition. If you have a reproducible test case, I would be interested in trying it out.
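To make the intended ordering concrete, here is a minimal sketch of that commit protocol, not the connector's actual code: the Kafka-side offset only advances after the WAL append is confirmed, so an unconfirmed write leaves the offset in place and the record is re-delivered on restart.

```java
import java.util.ArrayList;
import java.util.List;

public class WalCommitSketch {
    final List<String> wal = new ArrayList<>(); // stands in for the HDFS WAL file
    long committedOffset = -1;                  // stands in for the committed Kafka consumer offset

    // Returns true only when the WAL append has been confirmed.
    // In the real connector this would be an HDFS write plus a flush.
    boolean appendToWal(String record) {
        wal.add(record);
        return true;
    }

    void process(long offset, String record) {
        if (appendToWal(record)) {
            committedOffset = offset; // offset advances strictly after the WAL ack
        }
        // If the append is never confirmed, the offset stays put and the
        // record is re-read from Kafka on restart: at-least-once, no loss.
    }

    public static void main(String[] args) {
        WalCommitSketch s = new WalCommitSketch();
        s.process(0, "a");
        s.process(1, "b");
        System.out.println(s.committedOffset); // 1
    }
}
```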
@zako please let us know if you have a reproducible test case here, as we would like to get this fixed, but without a test case we can't see the circumstances that lead to the problem. I'll reopen this if you are able to provide such a test case. Thanks!
We noticed that there is a gap in the Kafka offsets in the filenames of our Avro files when a new connector is added, triggering a rebalance.
Our setup is simple: a single Kafka Connect node running in distributed mode. We created a connector for one topic via the REST API, then created another connector for a different topic to reproduce this behavior. Both connectors use the HDFS connector.
I verified that the data persisted in the Avro files does reflect the offsets in the filenames, and that the data between offsets 200,001 and 215,748 is missing from HDFS.
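Such a gap can be spotted mechanically from the committed filenames, which follow the connector's `<topic>+<partition>+<startOffset>+<endOffset>.<ext>` naming scheme. A small sketch (the topic name and surrounding offsets here are illustrative, not taken from our cluster):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetGapCheck {
    // Matches <topic>+<partition>+<start>+<end>.avro and captures the offset range.
    static final Pattern NAME = Pattern.compile(".*\\+\\d+\\+(\\d+)\\+(\\d+)\\.avro");

    // Returns "start-end" gaps between consecutive committed files of one partition.
    static List<String> findGaps(List<String> files) {
        List<long[]> ranges = new ArrayList<>();
        for (String f : files) {
            Matcher m = NAME.matcher(f);
            if (m.matches()) {
                ranges.add(new long[]{Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
            }
        }
        ranges.sort(Comparator.comparingLong(r -> r[0]));
        List<String> gaps = new ArrayList<>();
        for (int i = 1; i < ranges.size(); i++) {
            long prevEnd = ranges.get(i - 1)[1];
            long nextStart = ranges.get(i)[0];
            if (nextStart > prevEnd + 1) {
                gaps.add((prevEnd + 1) + "-" + (nextStart - 1));
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "mytopic+0+0000190000+0000200000.avro",
                "mytopic+0+0000215749+0000220000.avro");
        System.out.println(findGaps(files)); // [200001-215748]
    }
}
```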
After remote-debugging the Kafka Connect process, I saw that Kafka Connect fails to find any files from which to recover the latest offset after rebalancing. It assumes all data is stored in the standard directory layout of `/<topics directory>/<topic name>/<filename>`. We are using a custom partitioner, similar to the time-based partitioner, which sets the commit file destination directory to a value other than the standard directory. I assume this will happen to anyone using a partitioner other than the default partitioner.