
KAFKA-1911: Async delete topic #1664

Closed · wants to merge 13 commits into trunk from sutambe/async-delete-topic

Conversation

Contributor

@sutambe sutambe commented Jul 26, 2016

The last patch, submitted by @MayureshGharat back in December 2015, has been rebased onto the latest trunk. I took care of a couple of test failures (MetricsTest) along the way. @jjkoshy , @granders , @avianey , you may be interested in this PR.

Contributor Author

sutambe commented Jul 26, 2016

The Jenkins build failed due to a kafka.server.OffsetCommitTest.testUpdateOffsets failure. That test is unrelated to this PR and runs fine on my dev box.

removePartitionMetrics()
info("Deleted log for [%s,%d] in %d ms".format(topic, partitionId, (time.milliseconds - start)))
Contributor

  • We should make it clear in this message that the file isn't actually deleted at this point. It is basically inaccessible/deleted from the user's POV and scheduled for removal.
  • Here and elsewhere: rather than expand out the [%s,%d] format for the topic-partition we should just rely on the toString rendering of TopicAndPartition.
  • Here and elsewhere: we can also make this a little more concise with Scala's string interpolation (a sketch of both suggestions follows below).
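
A minimal, hypothetical sketch of the suggested style (not the actual Kafka code; the simplified TopicAndPartition stand-in and the example message are assumptions for illustration): let the case class's toString render the topic-partition and use string interpolation instead of an explicit format string.

```scala
// Simplified stand-in for kafka.common.TopicAndPartition, for illustration only.
case class TopicAndPartition(topic: String, partition: Int) {
  override def toString: String = s"[$topic,$partition]"
}

object LoggingStyleSketch extends App {
  val topicAndPartition = TopicAndPartition("my-topic", 0)
  // Before: info("Deleted log for [%s,%d] in %d ms".format(topic, partitionId, elapsedMs))
  // Suggested: rely on toString plus string interpolation, and make clear the
  // directory is only scheduled for removal at this point.
  println(s"Log for $topicAndPartition is inaccessible to users and scheduled for removal")
}
```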

@@ -777,7 +778,12 @@ class Log(val dir: File,
/**
* The active segment that is currently taking appends
*/
def activeSegment = segments.lastEntry.getValue
def activeSegment: LogSegment = {
Contributor Author

I don't think the check in activeSegment is necessary because segments is never empty: loadSegments is called right after val segments in the constructor. Perhaps loadSegments should return segments to make that invariant clearer (see the sketch below).
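
A hypothetical sketch (not the actual Log.scala code; class and member names are simplified) of the idea that loadSegments could return the populated map, making the never-empty invariant visible where segments is defined:

```scala
import java.util.concurrent.{ConcurrentNavigableMap, ConcurrentSkipListMap}

// Simplified stand-in for kafka.log.LogSegment.
class LogSegment(val baseOffset: Long)

class LogSketch {
  // loadSegments always inserts at least one segment, so the map it returns is
  // never empty and activeSegment needs no emptiness check.
  private def loadSegments(): ConcurrentNavigableMap[java.lang.Long, LogSegment] = {
    val loaded = new ConcurrentSkipListMap[java.lang.Long, LogSegment]()
    loaded.put(0L, new LogSegment(0L))
    loaded
  }

  private val segments = loadSegments()

  def activeSegment: LogSegment = segments.lastEntry.getValue
}
```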

@sutambe sutambe force-pushed the async-delete-topic branch 2 times, most recently from 603b014 to 6022ffb on November 7, 2016 at 22:42
Contributor Author

sutambe commented Nov 7, 2016

@jjkoshy Please take a look at the updated PR for async delete. This time the code is a better Scala citizen and has improved logging messages.

removePartitionMetrics()
info(s"Log for $topicPartition renamed to $renamedDir and scheduled for deletion. Spent ${time.milliseconds - start} ms.")
Contributor

The log line is slightly weird. I'm not sure how useful the "spent" part is, given that it can be derived from the logging timestamps. It is also probably redundant with the log line that is already inside asyncDelete. That said, if you want to keep it, I would go with ... renamed to $renamedDir in $time milliseconds and scheduled for deletion.

Contributor Author

I zapped "time spent" reporting and calculation.

// reset the index size of the currently active log segment to allow more entries
activeSegment.index.resize(config.maxIndexSize)
activeSegment.timeIndex.resize(config.maxIndexSize)
if (!dir.getAbsolutePath.endsWith(Log.DeleteDirSuffix)) {
Contributor

We could synchronously delete these during load, right? A similar comment applies in LogManager.

Contributor Author

I'm leaning towards deleting them in a scheduled task to avoid wasting start-up time for data we don't need. What do you think?

Contributor

Yes that is fine.

throw new IllegalArgumentException(
"Duplicate log directories found: %s, %s!".format(
current.dir.getAbsolutePath, previous.dir.getAbsolutePath))
if (logDir.getName.endsWith(Log.DeleteDirSuffix)) {
Contributor

We could actually delete these logs synchronously during load, right, instead of waiting for the scheduled task to kick in? On the other hand, maybe it is better to have a single, consistent place where we delete logs.

Contributor

If a particular broker holds a large amount of data for a particular topic, or if we have a high-throughput topic that was marked for deletion, then deleting those logs synchronously when the broker starts up might increase startup time significantly, right?

Contributor Author

I'm leaning towards deleting them in a scheduled task to avoid wasting start-up time for data we don't need.

Contributor

Yes either way is fine.

@@ -203,6 +208,11 @@ class LogManager(val logDirs: Array[File],
delay = InitialTaskDelayMs,
period = flushCheckpointMs,
TimeUnit.MILLISECONDS)
scheduler.schedule("kafka-delete-logs",
Contributor

I think it is worth observing that an alternative to this recurring task that pulls off a queue is to just invoke async delete on demand when we delete a log: i.e., instead of having the asyncDelete method and the log-directory renames, call scheduler.schedule("delete", log.delete), which uses period = -1 to get a single invocation. The reason this isn't very robust is that we could get several partitions to delete in a short period of time, which could lead to rejected executions if the thread pool is fully utilized (a sketch of the trade-off follows below).
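
An illustrative sketch of that trade-off using the plain JDK executor rather than Kafka's KafkaScheduler (all names and messages below are assumptions for illustration): a single-shot task per deleted log versus one recurring task that drains a queue.

```scala
import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit}

object SchedulingTradeOffSketch extends App {
  val executor = new ScheduledThreadPoolExecutor(2)

  // Single-shot alternative: one task per log to delete. Simple, but many
  // deletions arriving in a short window all compete for the same pool.
  executor.schedule(new Runnable {
    def run(): Unit = println("delete one renamed log directory")
  }, 0, TimeUnit.MILLISECONDS)

  // Recurring task (the approach in this PR): a single periodic task drains a
  // queue of logs to delete, so at most one pool thread is busy with deletions.
  executor.scheduleAtFixedRate(new Runnable {
    def run(): Unit = println("drain the delete queue")
  }, 30, 30, TimeUnit.SECONDS)

  Thread.sleep(100)
  executor.shutdown()
}
```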

Contributor Author

The deleteLogs function swallows all exceptions and reports them. On an IO exception for a particular log, it will retry later. The current implementation already has a queue of logs, which is semantically equivalent to scheduler.schedule, AFAIK. I did not feel compelled to change it.

Contributor

It is similar, but as noted above I think scheduler.schedule could lead to a rejected execution if the executor is saturated; and then you would need a rejected execution handler.

Contributor Author

@sutambe sutambe Nov 12, 2016

The scheduler uses a ScheduledThreadPoolExecutor, which has a fixed number of threads (say N) and an unbounded queue. Using this scheduler for IO risks blocking all the threads in the thread pool if/when at least N disk IO operations take too long or never return. Since this thread-pool executor is also used for other periodic tasks such as log retention, log flushing, and recovery-point checkpointing, we definitely don't want to lock up all the threads doing nothing. Direct use of scheduler.schedule opens up this unlikely but dangerous possibility. This is perhaps a strong enough reason to stay away from direct scheduling and use an explicit queue of logs, such that only one thread from the pool is used at a time.

Contributor

That can be avoided by using a separate scheduler.

Contributor

I might be missing something here, but I would like to understand the advantage of scheduling deletes on the fly over having a periodic task.

Contributor

It's simpler - that is pretty much it.

Contributor Author

A separate scheduler might need its own defaultNumThreads config?

removedLog = logs.remove(topicAndPartition)
private def deleteLogs(): Unit = {
while (!logsToBeDeleted.isEmpty) {
val removedLog = logsToBeDeleted.take()
Contributor

If there are any storage exceptions during delete, we probably want to retry; i.e., we should peek and remove from the queue only after the delete succeeds (a sketch of that pattern follows below).
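
A minimal sketch of the suggested peek-then-remove pattern (not the actual LogManager code; LogStub and the messages are placeholders): the log is dequeued only after its delete succeeds, so a storage failure leaves it in the queue for the next run.

```scala
import java.util.concurrent.LinkedBlockingQueue

object DeleteQueueSketch {
  final case class LogStub(name: String) {
    def delete(): Unit = println(s"deleting $name")  // placeholder for the real directory delete
  }

  val logsToBeDeleted = new LinkedBlockingQueue[LogStub]()

  def deleteLogs(): Unit = {
    var log = logsToBeDeleted.peek()
    while (log != null) {
      try {
        log.delete()                 // may throw on storage errors
        logsToBeDeleted.remove(log)  // only dequeue once the delete succeeded
      } catch {
        case e: Exception =>
          println(s"Failed to delete ${log.name}; will retry on the next run: $e")
          return                     // leave it queued and bail out of this run
      }
      log = logsToBeDeleted.peek()
    }
  }
}
```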

Contributor Author

sutambe commented Nov 10, 2016

@jjkoshy Updated PR to be robust against IO exceptions. Not doing startup time deletion yet. Ping @MayureshGharat

Contributor Author

sutambe commented Nov 10, 2016

@jjkoshy The failed counter works fine when logsToBeDeleted grows while we're in the deleteLogs loop; it simply postpones retries to the next "round". The real question is whether it makes sense to retry after an IO exception. It may not. If we anticipate transient IO failures on a particular disk in a JBOD arrangement, this logic is perhaps the simplest way to avoid head-of-line blocking.

Contributor

@jjkoshy jjkoshy left a comment

I think it would be worth spending some time looking into just using the scheduler in a single-shot manner, i.e., rename the directory and then delete it asynchronously in a single-shot call to the executor. The only thing I'm not sure about is the likelihood of saturating the executor service.

Also, there are unit test failures that are likely related. For example:

kafka.log.LogCleanerIntegrationTest > testCleansCombinedCompactAndDeleteTopic[0] FAILED
    java.lang.NullPointerException
        at kafka.log.Log.logSegments(Log.scala:917)
        at kafka.log.Log.recoverLog(Log.scala:286)
        at kafka.log.Log.loadSegments(Log.scala:265)
        at kafka.log.Log.<init>(Log.scala:107)
        at kafka.log.LogCleanerIntegrationTest$$anonfun$makeCleaner$1.apply(LogCleanerIntegrationTest.scala:329)
        at kafka.log.LogCleanerIntegrationTest$$anonfun$makeCleaner$1.apply(LogCleanerIntegrationTest.scala:325)
        at scala.collection.immutable.Range.foreach(Range.scala:141)
        at kafka.log.LogCleanerIntegrationTest.makeCleaner(LogCleanerIntegrationTest.scala:325)
        at kafka.log.LogCleanerIntegrationTest.runCleanerAndCheckCompacted$1(LogCleanerIntegrationTest.scala:104)
        at kafka.log.LogCleanerIntegrationTest.testCleansCombinedCompactAndDeleteTopic(LogCleanerIntegrationTest.scala:130)


}
} catch {
case e: Throwable =>
error(s"Exception in kafka-delete-logs thread. Ignoring.", e)
Contributor

Do we need this?

Contributor Author

@sutambe sutambe Nov 12, 2016

Yes, it's necessary for a couple of reasons. First, the ScheduledExecutorService.scheduleAtFixedRate documentation says that "...If any execution of the task encounters an exception, subsequent executions are suppressed...". Second, the LinkedBlockingQueue.put call in the catch block may throw an InterruptedException. In both cases, we want the task to continue being scheduled in the future (see the sketch below).
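
An illustrative sketch using the plain JDK ScheduledExecutorService (names and the simulated failure are assumptions, not Kafka code) of why the catch-all matters: without it, the first exception would suppress every subsequent run of the periodic task.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object PeriodicTaskSketch extends App {
  val executor = Executors.newSingleThreadScheduledExecutor()

  executor.scheduleAtFixedRate(new Runnable {
    def run(): Unit = {
      try {
        // ... drain the delete queue here ...
        throw new RuntimeException("simulated IO failure")
      } catch {
        case e: Throwable =>
          // Without this catch, the exception would cancel every future run.
          println(s"Exception in delete-logs task, will retry next period: $e")
      }
    }
  }, 0, 1, TimeUnit.SECONDS)

  Thread.sleep(3500)   // observe several runs despite the simulated failure
  executor.shutdown()
}
```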

Contributor

Yes, I was wondering if it makes more sense to do something more radical like shutdown. We don't have to - the existing behavior on storage exceptions on stop-replica is that the controller would retry the stop-replica request.

Contributor

@jjkoshy jjkoshy Nov 18, 2016

Also, the blocking queue is effectively unbounded (its default capacity is Integer.MAX_VALUE), so the put is probably never going to block. In fact, it should never be allowed to block, as that would hold up the stop-replica request.

Contributor

@jjkoshy jjkoshy left a comment

This LGTM, but I'm not convinced that just making single-shot calls to the executor wouldn't be better. I think that is much simpler, and it is backed by an unbounded task queue.

Contributor Author

sutambe commented Nov 23, 2016

@jjkoshy Summarizing our discussion and the decision to use the existing background thread pool for deleting logs. The dilemma was between two choices with the existing background scheduler: 1) single-shot scheduling, via .schedule(), of a task that deletes a single directory (topic-partition), and 2) a periodically scheduled task that deletes queued logs. Single-shot scheduling, although simpler, is dangerous because a boatload of deleted topic-partitions may hog the entire background thread pool for deletions alone, starving other periodic tasks including log retention, log flushing, and recovery-point checkpointing.

An alternative pool of background threads could be used, but it would require a separate user-facing config and was therefore eliminated.

A periodically scheduled deletion task (option 2 above) has two sub-options: a) scan the log directories looking for any directories marked for deletion ("*.delete"), or b) use an internal queue of the logs whose directories are to be deleted. Option 2.a is not attractive because most of the time the directory scan will find nothing to delete; besides, it's IO-heavy.

Therefore, the only reasonable option is 2.b, wherein an explicit LinkedBlockingQueue of logs is used in producer/consumer style.

The current patch keeps track of IO failures during one run of the deletion task and retries the failed directories at the next scheduled run. LinkedBlockingQueue has O(1) take and put, and this scheme is perhaps simpler than explicitly synchronized access to a regular list. Alternatively, removeFirstOccurrence could be used in O(1) time (in the absence of failures) because it is always the head of the queue that is removed, terminating the search quickly. A sketch of the overall scheme follows below.
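
A minimal sketch of the scheme summarized above, assuming a rename-with-suffix convention and flat log directories (this is an illustration, not the actual LogManager code; DeleteDirSuffix and the rename format are assumptions): asyncDelete renames the partition directory and enqueues it, and a periodic task drains the queue and does the actual removal.

```scala
import java.io.File
import java.util.UUID
import java.util.concurrent.LinkedBlockingQueue

object AsyncDeleteSketch {
  val DeleteDirSuffix = "-delete"                       // assumed suffix for illustration
  val logsToBeDeleted = new LinkedBlockingQueue[File]()

  // Producer side: invoked when a partition is deleted; the rename is cheap,
  // and the actual disk reclamation is deferred to the periodic task.
  def asyncDelete(logDir: File): Unit = {
    val renamed = new File(logDir.getParent,
      s"${logDir.getName}.${UUID.randomUUID}$DeleteDirSuffix")
    if (logDir.renameTo(renamed))
      logsToBeDeleted.add(renamed)
    else
      throw new RuntimeException(s"Failed to rename ${logDir.getAbsolutePath} for deletion")
  }

  // Consumer side: run periodically from the background scheduler; drains the
  // queue so at most one scheduler thread is busy with deletions at a time.
  def deleteLogs(): Unit = {
    while (!logsToBeDeleted.isEmpty) {
      val dir = logsToBeDeleted.take()
      Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach(_.delete())
      dir.delete()
    }
  }
}
```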

@@ -106,7 +106,7 @@ class Log(val dir: File,
val t = time.milliseconds
/* the actual segments of the log */
private val segments: ConcurrentNavigableMap[java.lang.Long, LogSegment] = new ConcurrentSkipListMap[java.lang.Long, LogSegment]
loadSegments()
loadSegments()
Contributor

whitespace

@@ -297,7 +307,8 @@ class LogManager(val logDirs: Array[File],

/**
* Delete all data in a partition and start the log at the new offset
* @param newOffset The new offset to start the log with
*
Contributor

whitespace

}
} catch {
case e: Throwable =>
error(s"Exception in kafka-delete-logs thread. Ignoring to ensure continued scheduling.", e)
Contributor

Drop the Ignoring.. bit.

removedLog = logs.remove(topicAndPartition)
private def deleteLogs(): Unit = {
try {
var failed = 0
Contributor

I thought we were going with the simpler approach of just peeking and letting it go (and having the next run re-attempt the delete). Exceptions here are completely abnormal, and it is likely that similar exceptions would affect regular fetch/produce requests. This is also fine, but overkill IMO.

}
}

def asyncDelete(topicAndPartition: TopicAndPartition) : String = {
Contributor

Can you add scaladoc to this? Specifically, document what it returns.

Contributor Author

The method need not return anything. The returned string (the name of the renamed dir) was used only in a log statement, and it was redundant anyway, so I zapped the return value and added documentation.


logsToBeDeleted.add(removedLog)
removedLog.removeLogMetrics()
info(s"Log for partition ${removedLog.topicAndPartition.topic} is renamed to ${removedLog.topicAndPartition.partition} and is scheduled for deletion")
Contributor

The log message doesn't seem right - it seems it would log "topic" is renamed to "partition". Can you verify?

Contributor Author

oops! Fixed now.

Mayuresh Gharat and others added 8 commits November 30, 2016 09:56
Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
…rectory

Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
…his is to speed up the startup process. Also added a check that log directories ending with .delete are added to a separate set of logs that are to be deleted asynchronously.

Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
…o loading segments on a crash

Signed-off-by: Sumant Tambe <sutambe@linkedin.com>
Cleanup, async log deletion rebase, and tested successfully
Contributor

jjkoshy commented Nov 30, 2016

+1

@asfgit asfgit closed this in 497e669 Nov 30, 2016
soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
…atmayuresh15@gmail.com> and Sumant Tambe <sutambe@yahoo.com>

The last patch submitted by MayureshGharat (back in Dec 15) has been rebased to the latest trunk. I took care of a couple of test failures (MetricsTest) along the way. jjkoshy , granders , avianey , you may be interested in this PR.

Author: Sumant Tambe <sutambe@yahoo.com>
Author: Mayuresh Gharat <mgharat@mgharat-ld1.linkedin.biz>
Author: MayureshGharat <gharatmayuresh15@gmail.com>

Reviewers: Joel Koshy <jjkoshy.w@gmail.com>

Closes apache#1664 from sutambe/async-delete-topic
efeg added a commit to efeg/kafka that referenced this pull request May 29, 2024