
Missing data caused by failure to find latest offsets from files during recovery #118

Closed
zako opened this issue Sep 20, 2016 · 3 comments


zako commented Sep 20, 2016

We noticed that there is a gap in the Kafka offsets in the filenames of our Avro files when a new connector is added triggering a rebalance:

-rw-r--r--   3 user group       1654 2016-09-20 08:35 /data/kafka/date_id=20160920/key=value/topic1+0+0000000000+0000000000.avro
-rw-r--r--   3 user group   19546950 2016-09-20 08:35 /data/kafka/date_id=20160920/key=value/topic1+0+0000000001+0000100000.avro
-rw-r--r--   3 user group   19544711 2016-09-20 08:43 /data/kafka/date_id=20160920/key=value/topic1+0+0000100001+0000200000.avro
-rw-r--r--   3 user group       1659 2016-09-20 08:46 /data/kafka/date_id=20160920/key=value/topic1+0+0000215749+0000215749.avro
-rw-r--r--   3 user group   19547591 2016-09-20 09:00 /data/kafka/date_id=20160920/key=value/topic1+0+0000215750+0000315749.avro

Our setup is simple: one Kafka Connect node running in distributed mode. We created a connector for one topic via the REST API, then created another connector for a different topic to reproduce this behavior. Both connectors use the HDFS sink connector.
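
For reference, this is roughly how we create each connector through the Connect REST API (the host, topic, and our partitioner class are placeholders, and the exact config keys depend on the connector version):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CreateHdfsSinkConnector {
    public static void main(String[] args) throws Exception {
        // Placeholder values: host, topic, and partitioner class are not our real config.
        String body = "{"
                + "\"name\": \"hdfs-sink-topic2\","
                + "\"config\": {"
                + "\"connector.class\": \"io.confluent.connect.hdfs.HdfsSinkConnector\","
                + "\"topics\": \"topic2\","
                + "\"hdfs.url\": \"hdfs://namenode:8020\","
                + "\"flush.size\": \"100000\","
                + "\"topics.dir\": \"/data/kafka\","
                + "\"partitioner.class\": \"com.example.DateKeyPartitioner\""
                + "}}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://connect-host:8083/connectors").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Adding the second connector this way is what triggered the rebalance in our case.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```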

I verified that the data persisted in the Avro files does reflect the offsets in the filenames, and that the data between offsets 200,001 and 215,748 is missing from HDFS.

After remotely debugging the Kafka Connect process, I saw that after rebalancing Kafka Connect fails to find any files from which to recover the latest offset. It assumes all data is stored in the standard directory layout of /<topics directory>/<topic name>/<filename>. We are using a custom partitioner, similar to the time-based partitioner, which sets the commit file destination directory to a value other than the standard directory, so the recovery scan comes up empty (see the sketch below). I assume this will affect anyone using a partitioner other than the default partitioner.
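
To make the mismatch concrete, here is a minimal sketch of the two directory layouts (the class and method names are made up for illustration; this is not connector code):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionPathMismatch {

    // Layout the recovery logic appeared to assume in our debugging session:
    //   /<topics directory>/<topic name>/<filename>
    static String defaultCommitDir(String topicsDir, String topic) {
        return topicsDir + "/" + topic;
    }

    // Layout produced by a time/field based partitioner like ours:
    //   /<topics directory>/date_id=<yyyyMMdd>/key=<value>/<filename>
    static String customCommitDir(String topicsDir, LocalDate date, String key) {
        String dateId = date.format(DateTimeFormatter.ofPattern("yyyyMMdd"));
        return topicsDir + "/date_id=" + dateId + "/key=" + key;
    }

    public static void main(String[] args) {
        String topicsDir = "/data/kafka";
        System.out.println("Recovery scans: " + defaultCommitDir(topicsDir, "topic1"));
        System.out.println("Files live in:  "
                + customCommitDir(topicsDir, LocalDate.of(2016, 9, 20), "value"));
        // The two paths differ, so the recovery scan finds no files and cannot
        // derive the latest committed offset for topic1 partition 0.
    }
}
```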


zako commented Sep 26, 2016

There have been no comments on this issue yet, but based on the behavior I have experienced this is critical for anyone using a non-default partitioner.

I was able to create a temporary fix by reverting the changes made on 20151125 that delete the WAL and tmp files instead of flushing and persisting them. With that revert, a graceful shutdown or rebalance commits in-flight data and resumes consuming from Kafka based on the previous consumer offset. Obviously this is not ideal if Kafka Connect is killed, the persisted files in the commit directory are deleted, or the consumer offsets are disrupted.


cotedm commented Jan 9, 2017

@zako I'd like to revisit this issue with you if you are still interested. Would you be able to share some more details here? The expected behavior of the state machine is such that the consumer offsets on the Kafka side will not be committed until there has been a confirmation of the write to the WAL file. In this way we are meant to guarantee that no data is marked as "read" from Kafka until it has been confirmed as "written" to HDFS in the form of the WAL file. This should be independent of your implementation of Partitioner as this logic is all done in TopicPartitionWriter.
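
To make the intended guarantee concrete, here is a simplified sketch of that ordering (illustrative names only; the real logic lives in TopicPartitionWriter and is not reproduced here):

```java
public class CommitOrderingSketch {

    interface Hdfs {
        void writeTempFile(String tempPath, byte[] records);   // buffered Avro data
        void appendWal(String walPath, String tempPath, String committedPath);
        void rename(String tempPath, String committedPath);    // commit by rename
    }

    interface KafkaOffsets {
        void commit(String topic, int partition, long nextOffset);
    }

    // Kafka-side offsets are only advanced after the WAL append and the rename
    // to the final offset-encoded filename have both succeeded, so nothing is
    // marked "read" from Kafka before it is durable in HDFS.
    static void flush(Hdfs hdfs, KafkaOffsets offsets, String topic, int partition,
                      long startOffset, long endOffset, byte[] records) {
        String partitionDir = "/data/kafka/date_id=20160920/key=value"; // from the partitioner
        String tempPath = partitionDir + "/" + topic + "+" + partition + ".tmp";
        String committedPath = String.format("%s/%s+%d+%010d+%010d.avro",
                partitionDir, topic, partition, startOffset, endOffset);
        String walPath = "/logs/" + topic + "/" + partition + "/log";

        hdfs.writeTempFile(tempPath, records);              // 1. write data to a temp file
        hdfs.appendWal(walPath, tempPath, committedPath);   // 2. record the intent in the WAL
        hdfs.rename(tempPath, committedPath);               // 3. commit the data file
        offsets.commit(topic, partition, endOffset + 1);    // 4. only now commit Kafka offsets
    }
}
```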

The changes from 20151125 should make it so that any WAL file that has not yet been confirmed as written and closed by HDFS is discarded upon rebalance, so that you end up with a clean transition. If you have a reproducible test case, I would be interested in trying it out.


cotedm commented Jan 19, 2017

@zako please let us know if you have a reproducible test case here as we would like to get this fixed, but without a test case we can't see the circumstances that lead to the problem. I'll reopen this if you are able to provide such a test case. Thanks!

cotedm closed this as completed Jan 19, 2017