overlord helpers framework and tasklog auto cleanup #3677

himanshug · 2016-11-10T17:01:36Z

We push task logs to HDFS, where they fill up the scarce namespace, and their deletion is currently managed outside of Druid. This is a major inconvenience for most teams using Druid, and most of them delete the logs manually.

This patch introduces a feature to configure Overlord to automatically delete older task logs. See updates to docs/content/configuration/indexing-service.md for the user documentation.

Currently, an implementation is provided only for the local and hdfs task log types.

The framework introduced here can also be used to manage retention of other task-related persisted state, but that is left for the future.

fjy · 2016-11-10T23:44:35Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/OverlordHelper.java

+
+/**
+ */
+public interface OverlordHelper


While you are in this code, you should also consider merging overlord and coordinator :P

hmmm, actually we can. I believe we have had the desire for quite some time and it was discussed in the dev sync meeting too and agreed upon.

I would imagine introducing a CliMaster (a "master" node) that binds/starts everything that the coordinator and overlord do. also, have only one "leader election" for the "leader master" instead of the two.

@himanshug Yeah that would be pretty great, we don't have to do it as part of this PR though, but I think it would be really useful to have

yeah, will do that in separate PR. let us keep this one as is.

fjy · 2016-11-10T23:46:04Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/OverlordHelperManager.java

+            if (exec == null) {
+              exec = execFactory.create(1, "Overlord-Helper-Manager-Exec--%d");
+            }
+            helper.schedule(exec);


this logic is basically just repeating what the coordinator is doing

high level comment, can we have this as a coordinator helper?

yeah, it is similar. but, i did it on overlord because tasks and tasklogs are owned by overlord and their retention should also be handled at the overlord.

If the executor is having issues and rejecting execution, what is the intended behavior here?

that would be a fatal situation: it will fail the lifecycle start and consequently prevent the druid process from starting, which is the intended behavior in this case.

fjy · 2016-11-10T23:46:55Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/OverlordHelper.java

+public interface OverlordHelper
+{
+  boolean isEnabled();
+  void schedule(ScheduledExecutorService exec);


i think the method here should be run()

the way i structured things is that helpers schedule themselves on the executor supplied by the manager. this is done so that all helpers can have their own independent schedules and can potentially run in parallel.
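The design described in this reply can be sketched roughly as follows. The `OverlordHelper` interface repeats the two methods from the diff; `FixedDelayHelper`, `HelperSchedulingSketch`, `scheduleAll`, and `demo` are hypothetical names made up for illustration and are not taken from the patch. The point is only that the manager hands over the executor and each helper picks its own initial delay and period.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Same two methods as the interface in the diff above.
interface OverlordHelper
{
  boolean isEnabled();

  void schedule(ScheduledExecutorService exec);
}

// Hypothetical helper that owns its schedule; not a class from the patch.
class FixedDelayHelper implements OverlordHelper
{
  private final Runnable task;
  private final long initialDelayMillis;
  private final long delayMillis;

  FixedDelayHelper(Runnable task, long initialDelayMillis, long delayMillis)
  {
    this.task = task;
    this.initialDelayMillis = initialDelayMillis;
    this.delayMillis = delayMillis;
  }

  @Override
  public boolean isEnabled()
  {
    return true;
  }

  @Override
  public void schedule(ScheduledExecutorService exec)
  {
    // The helper, not the manager, decides when and how often it runs.
    exec.scheduleWithFixedDelay(task, initialDelayMillis, delayMillis, TimeUnit.MILLISECONDS);
  }
}

class HelperSchedulingSketch
{
  // Manager side (assumed shape): pass the shared executor to every enabled helper.
  static int scheduleAll(List<OverlordHelper> helpers, ScheduledExecutorService exec)
  {
    int scheduled = 0;
    for (OverlordHelper helper : helpers) {
      if (helper.isEnabled()) {
        helper.schedule(exec);
        scheduled++;
      }
    }
    return scheduled;
  }

  // Returns true if two helpers with different schedules both ran at least once.
  static boolean demo()
  {
    final CountDownLatch firstRan = new CountDownLatch(1);
    final CountDownLatch secondRan = new CountDownLatch(1);
    ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
    try {
      List<OverlordHelper> helpers = Arrays.<OverlordHelper>asList(
          new FixedDelayHelper(firstRan::countDown, 0, 50),
          new FixedDelayHelper(secondRan::countDown, 10, 200)
      );
      return scheduleAll(helpers, exec) == 2
             && firstRan.await(5, TimeUnit.SECONDS)
             && secondRan.await(5, TimeUnit.SECONDS);
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    finally {
      exec.shutdownNow();
    }
  }
}
```

With a pool size of 1, as in the `execFactory.create(1, ...)` call quoted earlier, the helpers keep independent schedules but share one thread; running them truly in parallel would need a larger pool.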

pjain1 · 2016-11-11T17:25:49Z

docs/content/configuration/indexing-service.md

+|Property|Description|Default|
+|--------|-----------|-------|
+|`druid.indexer.logs.kill.enabled`|Boolean value for whether to enable deletion of old task logs. |false|
+|`druid.indexer.logs.kill.durationToRetain`| Required if kill is enabled. In seconds, task logs to be retained created in last x seconds. |None|


default value - Long.MAX_VALUE ?

default does not matter because it is forced to be provided by the user when enabled. Long.MAX_VALUE is there just because some value needs to be stored there when it is not enabled.

pjain1 · 2016-11-11T17:26:35Z

docs/content/configuration/indexing-service.md

+|--------|-----------|-------|
+|`druid.indexer.logs.kill.enabled`|Boolean value for whether to enable deletion of old task logs. |false|
+|`druid.indexer.logs.kill.durationToRetain`| Required if kill is enabled. In seconds, task logs to be retained created in last x seconds. |None|
+|`druid.indexer.logs.kill.initialDelay`| Optional. Number of seconds after overlord start when first auto kill is run. |random value less than 300 (5 mins)|


random value between 1 and 5 mins ?

didn't want to make the document too confusing by declaring the exact randomness behavior; i think knowing that it's some random value less than 5 mins is enough.
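For reference, the default expression that appears in the `TaskLogAutoCleanerConfig` diff later in this review (`60000 + new Random().nextInt(4*60000)`, in milliseconds) yields a value of at least one minute and strictly less than five. A minimal sketch of that arithmetic, with `InitialDelaySketch` and `defaultInitialDelayMillis` as hypothetical names:

```java
import java.util.Random;

class InitialDelaySketch
{
  // nextInt(n) returns a value in [0, n), so the result lies in [60000, 300000) ms.
  static long defaultInitialDelayMillis(Random random)
  {
    return 60000L + random.nextInt(4 * 60000);
  }
}
```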

drcrallen · 2016-11-11T19:37:44Z

Quasi-related #401

drcrallen

If there is clock skew between this service and the HDFS server's timestamps, how is that expected to be handled or detected?

Do you expect it to be a problem?

Can such a scenario be documented in the property docs?

drcrallen · 2016-11-11T18:57:52Z

api/src/main/java/io/druid/tasklogs/NoopTaskLogs.java

@@ -48,4 +48,10 @@ public void killAll() throws IOException
  {
    log.info("Noop: No task logs are deleted.");
  }
+
+  @Override
+  public void kill(long beforeTimestamp) throws IOException


can this be renamed killBefore ?

changed to void killOlderThan(long timestamp)

drcrallen · 2016-11-11T18:59:58Z

docs/content/configuration/indexing-service.md

+|Property|Description|Default|
+|--------|-----------|-------|
+|`druid.indexer.logs.kill.enabled`|Boolean value for whether to enable deletion of old task logs. |false|
+|`druid.indexer.logs.kill.durationToRetain`| Required if kill is enabled. In seconds, task logs to be retained created in last x seconds. |None|


Most places use either Milliseconds or a duration string, can you pick one here and use those?

changed to milliseconds

drcrallen · 2016-11-11T19:00:11Z

docs/content/configuration/indexing-service.md

+|--------|-----------|-------|
+|`druid.indexer.logs.kill.enabled`|Boolean value for whether to enable deletion of old task logs. |false|
+|`druid.indexer.logs.kill.durationToRetain`| Required if kill is enabled. In seconds, task logs to be retained created in last x seconds. |None|
+|`druid.indexer.logs.kill.initialDelay`| Optional. Number of seconds after overlord start when first auto kill is run. |random value less than 300 (5 mins)|


Same thing about either milliseconds or a Duration

milliseconds

drcrallen · 2016-11-11T19:00:28Z

docs/content/configuration/indexing-service.md

+|`druid.indexer.logs.kill.enabled`|Boolean value for whether to enable deletion of old task logs. |false|
+|`druid.indexer.logs.kill.durationToRetain`| Required if kill is enabled. In seconds, task logs to be retained created in last x seconds. |None|
+|`druid.indexer.logs.kill.initialDelay`| Optional. Number of seconds after overlord start when first auto kill is run. |random value less than 300 (5 mins)|
+|`druid.indexer.logs.kill.delay`|Optional. Number of seconds of delay between successive executions of auto kill run. |21600 (6 hours)|


Same thing about milliseconds or a string Duration

milliseconds

drcrallen · 2016-11-11T19:01:53Z

extensions-core/hdfs-storage/src/main/java/io/druid/storage/hdfs/tasklog/HdfsTaskLogs.java

+    FileSystem fs = taskLogDir.getFileSystem(hadoopConfig);
+    if (fs.exists(taskLogDir)) {
+      RemoteIterator<LocatedFileStatus> iter = fs.listLocatedStatus(taskLogDir);
+      while (iter.hasNext()) {


There's no failure recovery in this loop, and I'm forgetting how HDFS handles various issues, is this intended to just fail this kill run if there are any sort of issues?

if hdfs has an issue during iteration, it will throw some exception which will be caught by the caller. the current caller, i.e. the anonymous Runnable in TaskLogAutoCleaner.schedule(..), catches the exception and logs it as an error. So the behavior would be that not all older logs are killed in that particular run, and the kill will be tried again later.
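The retry behavior described here depends on the Runnable swallowing the exception: with `ScheduledExecutorService.scheduleWithFixedDelay`, a task that throws has its subsequent executions suppressed. The following is a minimal sketch of that pattern, not the code from `TaskLogAutoCleaner`; the `Killer` interface, `guarded`, and `demo` are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GuardedCleanerSketch
{
  // Hypothetical stand-in for the task log store being cleaned.
  interface Killer
  {
    void killOlderThan(long timestamp) throws IOException;
  }

  // Wraps one cleanup run so that a failure is logged instead of propagating.
  static Runnable guarded(final Killer killer, final long durationToRetainMillis, final AtomicInteger failures)
  {
    return () -> {
      try {
        killer.killOlderThan(System.currentTimeMillis() - durationToRetainMillis);
      }
      catch (Exception e) {
        // Swallow and log; an escaping exception would cancel all future runs.
        failures.incrementAndGet();
        System.err.println("Failed to clean up the task logs: " + e);
      }
    };
  }

  // Returns true if the task kept running after its first (simulated) failure.
  static boolean demo()
  {
    final AtomicInteger calls = new AtomicInteger();
    final AtomicInteger failures = new AtomicInteger();
    final CountDownLatch successes = new CountDownLatch(2);
    Killer flaky = timestamp -> {
      if (calls.incrementAndGet() == 1) {
        throw new IOException("simulated HDFS failure");
      }
      successes.countDown();
    };
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    try {
      exec.scheduleWithFixedDelay(guarded(flaky, 1000L, failures), 0, 10, TimeUnit.MILLISECONDS);
      return successes.await(5, TimeUnit.SECONDS) && failures.get() == 1;
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    finally {
      exec.shutdownNow();
    }
  }
}
```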

drcrallen · 2016-11-11T19:02:42Z

extensions-core/hdfs-storage/src/main/java/io/druid/storage/hdfs/tasklog/HdfsTaskLogs.java

+        }
+
+        if (Thread.interrupted()) {
+          throw new IOException("Thread interrupted. Couldn't delete all tasklogs.");


How about wrapping an InterruptedException in an IOException?
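One possible reading of this suggestion is to attach an `InterruptedException` as the cause of the `IOException`, so callers can tell an interrupt apart from a storage failure. This is a sketch under that assumption; `InterruptCheckSketch` and `checkInterrupted` are hypothetical names, and the message is the one from the diff.

```java
import java.io.IOException;

class InterruptCheckSketch
{
  // Throws an IOException whose cause is an InterruptedException if the
  // current thread has been interrupted. Thread.interrupted() clears the flag.
  static void checkInterrupted() throws IOException
  {
    if (Thread.interrupted()) {
      throw new IOException(
          "Thread interrupted. Couldn't delete all tasklogs.",
          new InterruptedException()
      );
    }
  }
}
```

A caller that needs to distinguish the two cases can then test `e.getCause() instanceof InterruptedException`.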

drcrallen · 2016-11-11T19:05:08Z

...ions-core/hdfs-storage/src/test/java/io/druid/indexing/common/tasklogs/HdfsTaskLogsTest.java

+    taskLogs.pushTaskLog("log2", logFile);
+    Assert.assertEquals("log2content", readLog(taskLogs, "log2", 0));
+
+    taskLogs.kill(time);


Before here, can you check and make sure the modified time for the old file is before time and the new file is after time?
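The precondition being requested could look something like the sketch below, which uses local `java.io.File` objects as a stand-in for the HDFS paths in the real test. `ModifiedTimeSketch` and `straddles` are hypothetical names; in `HdfsTaskLogsTest` the timestamps would presumably come from the Hadoop `FileStatus.getModificationTime()` instead.

```java
import java.io.File;

class ModifiedTimeSketch
{
  // True only if both files exist, oldLog was last modified strictly before
  // the cutoff, and newLog strictly after it.
  static boolean straddles(File oldLog, File newLog, long cutoffMillis)
  {
    return oldLog.exists()
           && newLog.exists()
           && oldLog.lastModified() < cutoffMillis
           && newLog.lastModified() > cutoffMillis;
  }
}
```

Asserting this right before the kill call makes the test fail loudly when the two pushes land on the same side of `time`, rather than passing or failing for an unrelated reason.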

drcrallen · 2016-11-11T19:47:24Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/OverlordHelperManager.java

+            if (exec == null) {
+              exec = execFactory.create(1, "Overlord-Helper-Manager-Exec--%d");
+            }
+            helper.schedule(exec);


If the executor is having issues and rejecting execution, what is the intended behavior here?

drcrallen · 2016-11-11T19:51:07Z

Also, as a general comment, before the overlord and coordinator are merged, I'd like to see the Helper stuff go into an extensible form.

himanshug · 2016-11-14T16:31:08Z

@drcrallen
clock skew between druid nodes and hdfs nodes might end up deleting unintended logs; I will note that in the documentation.

regarding the overlord and coordinator merge, the helper stuff introduced here is already in an extensible form.

himanshug · 2016-11-14T17:05:27Z

@fjy @drcrallen @pjain1 resolved all comments.

fjy · 2016-11-16T00:28:21Z

👍

himanshug · 2016-12-07T20:45:56Z

@drcrallen all comments were addressed, pls take a relook

drcrallen

Overall this is looking really good. I'm excited to see how the Overlord/Coordinator merge stuff comes along.

Minor changes suggested that should help future maintainers.

drcrallen · 2016-12-08T22:55:41Z

indexing-service/src/main/java/io/druid/indexing/common/tasklogs/FileTaskLogs.java

+  public void killOlderThan(final long timestamp) throws IOException
+  {
+    File taskLogDir = config.getDirectory();
+    if (taskLogDir.exists()) {


suggest && taskLogDir.isDirectory()

Actually, throwing an IOException if it is not a directory would be a reasonable alternative.
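A rough sketch of the second alternative, assuming a local log directory: a missing directory is treated as nothing to do, while a path that exists but is not a directory raises an `IOException`. `LocalLogCleanerSketch` is a hypothetical stand-in, not the `FileTaskLogs` code; the return value and the per-file error handling are assumptions added for illustration.

```java
import java.io.File;
import java.io.IOException;

class LocalLogCleanerSketch
{
  // Deletes regular files in taskLogDir last modified before `timestamp`
  // and returns how many were deleted.
  static int killOlderThan(File taskLogDir, long timestamp) throws IOException
  {
    if (!taskLogDir.exists()) {
      return 0;
    }
    if (!taskLogDir.isDirectory()) {
      throw new IOException(String.format("taskLogDir [%s] must be a directory.", taskLogDir));
    }
    File[] files = taskLogDir.listFiles();
    if (files == null) {
      // listFiles() returns null on an I/O error.
      throw new IOException(String.format("Could not list taskLogDir [%s].", taskLogDir));
    }
    int deleted = 0;
    for (File file : files) {
      if (file.isFile() && file.lastModified() < timestamp) {
        if (!file.delete()) {
          throw new IOException(String.format("Could not delete [%s].", file));
        }
        deleted++;
      }
    }
    return deleted;
  }
}
```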

drcrallen · 2016-12-08T22:57:56Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/OverlordHelperManager.java

+  private final Set<OverlordHelper> helpers;
+
+  private volatile ScheduledExecutorService exec;
+  private final Object lock = new Object();


Suggest renaming to startStopLock or similar to make it clearer what it's used for.

drcrallen · 2016-12-08T23:00:53Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/TaskLogAutoCleanerConfig.java

+
+    this.enabled = enabled;
+    this.initialDelay = initialDelay == null ? 60000 + new Random().nextInt(4*60000) : initialDelay.longValue();
+    this.delay = delay == null ? 6*60*60*1000 : delay.longValue();


Suggest sanity check for positive number.

drcrallen · 2016-12-08T23:01:00Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/TaskLogAutoCleanerConfig.java

+    this.enabled = enabled;
+    this.initialDelay = initialDelay == null ? 60000 + new Random().nextInt(4*60000) : initialDelay.longValue();
+    this.delay = delay == null ? 6*60*60*1000 : delay.longValue();
+    this.durationToRetain = durationToRetain == null ? Long.MAX_VALUE : durationToRetain.longValue();


Suggest sanity check for positive number.

drcrallen · 2016-12-08T23:01:09Z

indexing-service/src/main/java/io/druid/indexing/overlord/helpers/TaskLogAutoCleanerConfig.java

+    }
+
+    this.enabled = enabled;
+    this.initialDelay = initialDelay == null ? 60000 + new Random().nextInt(4*60000) : initialDelay.longValue();


Suggest sanity check for positive number.
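The three sanity-check suggestions above could be handled with one small validation step in the constructor. This is a sketch only: `CleanerConfigSketch` is a hypothetical stand-in for `TaskLogAutoCleanerConfig`, the default expressions are copied from the diff, and the exception type, the messages, and the "required when enabled" check are assumptions.

```java
import java.util.Random;

// Hypothetical stand-in for TaskLogAutoCleanerConfig; defaults mirror the diff above.
class CleanerConfigSketch
{
  final boolean enabled;
  final long initialDelay;
  final long delay;
  final long durationToRetain;

  CleanerConfigSketch(boolean enabled, Long initialDelay, Long delay, Long durationToRetain)
  {
    if (enabled && durationToRetain == null) {
      throw new IllegalArgumentException("durationToRetain must be provided when kill is enabled.");
    }

    this.enabled = enabled;
    this.initialDelay = initialDelay == null
                        ? 60000L + new Random().nextInt(4 * 60000)
                        : initialDelay.longValue();
    this.delay = delay == null ? 6L * 60 * 60 * 1000 : delay.longValue();
    this.durationToRetain = durationToRetain == null ? Long.MAX_VALUE : durationToRetain.longValue();

    // Reject zero or negative values, whether user-supplied or defaulted.
    checkPositive("initialDelay", this.initialDelay);
    checkPositive("delay", this.delay);
    checkPositive("durationToRetain", this.durationToRetain);
  }

  private static void checkPositive(String name, long value)
  {
    if (value <= 0) {
      throw new IllegalArgumentException(name + " must be > 0, got " + value);
    }
  }
}
```

A defaults test along the lines suggested for `TaskLogAutoCleanerConfigTest` would then construct the config with all optional values left null and check `delay`, the `initialDelay` range, and `durationToRetain`.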

drcrallen · 2016-12-08T23:03:01Z

...g-service/src/test/java/io/druid/indexing/overlord/helpers/TaskLogAutoCleanerConfigTest.java

+
+/**
+ */
+public class TaskLogAutoCleanerConfigTest


Suggest adding test for defaults.

fjy · 2016-12-19T18:52:24Z

@drcrallen any more comments?

fjy · 2016-12-21T22:32:34Z

👍

drcrallen · 2016-12-21T23:18:05Z

indexing-service/src/main/java/io/druid/indexing/common/tasklogs/FileTaskLogs.java

+    if (taskLogDir.exists()) {
+
+      if (!taskLogDir.isDirectory()) {
+        throw new IOException(String.format("taskLogDir [%s] must be a directory.", taskLogDir));


For future reference IOE would work best here, not worth blocking on though.

* overlord helpers framework and tasklog auto cleanup * review comment changes * further review comments addressed

himanshug added the Feature label Nov 10, 2016

himanshug added this to the 0.9.3 milestone Nov 10, 2016

fjy reviewed Nov 10, 2016

pjain1 reviewed Nov 11, 2016

drcrallen requested changes Nov 11, 2016

himanshug force-pushed the tasklog_cleaner branch from 8bec961 to ff703d3 Compare November 14, 2016 17:04

himanshug force-pushed the tasklog_cleaner branch from ff703d3 to 4c2588e Compare November 14, 2016 18:22

gianm assigned fjy, pjain1 and drcrallen and unassigned fjy Nov 29, 2016

drcrallen requested changes Dec 8, 2016

himanshug added 2 commits December 12, 2016 12:57

overlord helpers framework and tasklog auto cleanup

c8cc2c6

review comment changes

e9dae87

himanshug force-pushed the tasklog_cleaner branch from 4c2588e to 1c06c7e Compare December 12, 2016 19:20

further review comments addressed

8293b42

himanshug force-pushed the tasklog_cleaner branch from 1c06c7e to 8293b42 Compare December 13, 2016 18:16

drcrallen reviewed Dec 21, 2016

drcrallen approved these changes Dec 21, 2016

drcrallen merged commit 4ca3b7f into apache:master Dec 21, 2016

himanshug deleted the tasklog_cleaner branch January 3, 2017 16:24

dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017

overlord helpers framework and tasklog auto cleanup (apache#3677)

c0149c0

* overlord helpers framework and tasklog auto cleanup * review comment changes * further review comments addressed

clambertus unassigned drcrallen and pjain1 Jul 6, 2018

maytasm mentioned this pull request Apr 8, 2021

Add feature to automatically remove audit logs based on retention period #11084

Merged

9 tasks

maytasm mentioned this pull request Apr 27, 2021

Add feature to automatically remove rules based on retention period #11164

Merged

9 tasks

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Feb 25, 2022

apache#3677 Revert apache#3284 and fix it properly

4f57a63
