Add feature to automatically remove audit logs based on retention period #11084

maytasm · 2021-04-08T23:50:56Z

Add feature to automatically remove audit logs based on retention period

Description

We already have a tasklog auto cleanup based on duration (time to retain), introduced in #3677. This PR adds a similar duration-based auto cleanup for the audit table, implemented as a coordinator duty instead of an OverlordHelper.

This is useful when a Druid user has a high churn of tasks or datasources in a short amount of time, which causes the metadata store to grow uncontrollably.

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster.

jihoonson · 2021-04-16T18:54:06Z

docs/configuration/index.md

+|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|
+|`druid.coordinator.kill.audit.on`| Boolean value for whether to enable automatic deletion of audit logs. If set to true, Coordinator will periodically remove audit logs from the audit table entries in metadata storage.| No | False| 
+|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
+|`druid.coordinator.kill.audit.durationToRetain`| In milliseconds, audit logs to be retained created in last x milliseconds. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|


Hmm, it seems like this config supports the ISO 8601 notation? Please update the doc to mention which notation is supported.
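For readers unfamiliar with the notation, here is a minimal sketch of how ISO 8601 duration strings map to milliseconds. It uses `java.time.Duration` from the JDK purely for illustration; the class name `IsoDurationExample` and the helper `toMillis` are hypothetical and are not part of Druid, which has its own config binding.

```java
import java.time.Duration;

// Illustration only: shows how ISO 8601 duration strings translate to milliseconds.
// This is not how Druid binds its configuration; the names here are hypothetical.
class IsoDurationExample
{
  // Parses an ISO 8601 duration such as "PT1H" or "P1D" and returns its length in milliseconds.
  static long toMillis(String isoDuration)
  {
    return Duration.parse(isoDuration).toMillis();
  }

  public static void main(String[] args)
  {
    // PT3600S and PT1H describe the same length of time.
    System.out.println(toMillis("PT3600S"));
    System.out.println(toMillis("PT1H"));
    // P1D is one day, treated as exactly 24 hours.
    System.out.println(toMillis("P1D"));
  }
}
```

Documenting the value as an ISO 8601 duration instead of "in milliseconds" avoids ambiguity about units.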

jihoonson · 2021-04-16T18:56:50Z

docs/operations/metrics.md

@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
 |`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.|
 |`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.|
 |`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.|
+|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.|


How is this metric supposed to be used?

BTW, do we also need a metric for the current size (maybe in rows) of the audit logs? This should be done in a separate PR if needed.

The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain. I think we should also have a metric for the current size (in rows) of the audit logs. Both of these metrics can be used together. You will be able to compare the current size (in rows) vs the kill size per run (in rows) and see if you want to increase or decrease durationToRetain.

I will add a metric for the current size (in rows) of the audit logs in a separate PR

The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.

Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on is set to true.

jihoonson · 2021-04-16T19:01:14Z

server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinator.java

@@ -743,13 +756,26 @@ private void stopBeingLeader()
    // CompactSegmentsDuty should be the last duty as it can take a long time to complete
    duties.addAll(makeCompactSegmentsDuty());

-    log.debug(
+    log.info(


Is this change intentional? If so, can you describe why it is better as info? I think this was changed to debug to reduce the amount of coordinator logs.

Changed back to debug.

jihoonson · 2021-04-16T19:03:50Z

server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinator.java

+                                                .addAll(metadataStoreManagementDuties)
+                                                .build();
+
+    log.info(


Similarly, does this log need to be info? Coordinator prints a lot of logs and it's better to reduce them if possible.

Changed back to debug.

jihoonson · 2021-04-16T19:07:45Z

server/src/main/java/org/apache/druid/server/coordinator/duty/KillAuditLog.java

+    );
+    this.retainDuration = config.getCoordinatorAuditKillDurationToRetain().getMillis();
+    Preconditions.checkArgument(this.retainDuration >= 0, "coordinator audit kill retainDuration must be >= 0");
+    log.info(


Similarly, maybe debug?

Changed to debug.

jihoonson · 2021-04-16T19:12:22Z

server/src/main/java/org/apache/druid/server/coordinator/duty/KillAuditLog.java

+  public DruidCoordinatorRuntimeParams run(DruidCoordinatorRuntimeParams params)
+  {
+    if ((lastKillTime + period) < System.currentTimeMillis()) {
+      log.info("Running KillAuditLog duty");


This logging doesn't say much except that this duty is running. Maybe better to print how many logs are deleted (auditRemoved) in this run?

Good idea. Changed
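The control flow discussed in this thread can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the merged implementation: `AuditStoreSketch`, `KillAuditLogSketch`, the injected clock, and the `-1` "skipped" return value are all hypothetical, and `System.out` stands in for the real logger.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for the audit table in metadata storage; not a Druid interface.
interface AuditStoreSketch
{
  // Removes entries created before the given epoch-millis timestamp and returns how many were removed.
  int removeOlderThan(long timestampMillis);
}

// Simplified sketch of a periodic cleanup duty; names and return values are assumptions.
class KillAuditLogSketch
{
  private final long periodMillis;
  private final long retainDurationMillis;
  private final AuditStoreSketch store;
  private final LongSupplier clock;
  private long lastKillTime = 0;

  KillAuditLogSketch(long periodMillis, long retainDurationMillis, AuditStoreSketch store, LongSupplier clock)
  {
    // Mirrors the precondition quoted in the diff above.
    if (retainDurationMillis < 0) {
      throw new IllegalArgumentException("coordinator audit kill retainDuration must be >= 0");
    }
    this.periodMillis = periodMillis;
    this.retainDurationMillis = retainDurationMillis;
    this.store = store;
    this.clock = clock;
  }

  // Returns the number of removed entries, or -1 if the period has not elapsed yet.
  int run()
  {
    long now = clock.getAsLong();
    if ((lastKillTime + periodMillis) < now) {
      lastKillTime = now;
      // Everything created before (now - retainDuration) falls outside the retention window.
      int removed = store.removeOlderThan(now - retainDurationMillis);
      // Report the count instead of only stating that the duty ran.
      System.out.printf("Finished running KillAuditLog duty. Removed %d audit logs%n", removed);
      return removed;
    }
    return -1;
  }
}
```

Logging the removed count gives the message a use beyond confirming that the duty ran, which is the point raised in the review comment.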

jihoonson · 2021-04-16T19:14:17Z

docs/configuration/index.md

+
+|Property|Description|Required?|Default|
+|--------|-----------|---------|-------|
+|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|


nit: PT1H seems clearer.

jihoonson · 2021-04-19T17:51:58Z

docs/configuration/index.md

@@ -746,10 +746,10 @@ These Coordinator static configurations can be defined in the `coordinator/runti

 |Property|Description|Required?|Default|
 |--------|-----------|---------|-------|
-|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|
+|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. |No | PT1H|


The doc CI fails because of PT1H. One way to avoid the spell check for such variables is wrapping them with backticks (`).

jihoonson · 2021-04-19T17:57:20Z

docs/operations/metrics.md

@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
 |`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.|
 |`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.|
 |`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.|
+|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.|


The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.

Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on is set to true.

jihoonson

LGTM 👍

suneet-s · 2021-04-20T19:08:08Z

docs/configuration/index.md

+|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
+|`druid.coordinator.kill.audit.durationToRetain`| Duration of audit logs to be retained from created time in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|


It looks like both of these fields are required. Can you describe how a user should think about setting both of these fields so they work together. Is there a reason we don't allow users to just set one or the other?

Will this affect upgrades if someone has druid.coordinator.kill.audit.period set but druid.coordinator.kill.audit.durationToRetain unset?

These two fields are very different. druid.coordinator.kill.audit.period controls the frequency (how often) the duty runs, while druid.coordinator.kill.audit.durationToRetain controls how many audit logs the user wants to keep.
druid.coordinator.kill.audit.period is actually not required. Let me fix the docs.

suneet-s · 2021-04-20T19:15:37Z

server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinatorConfig.java

+  public abstract Duration getCoordinatorAuditKillPeriod();
+
+  @Config("druid.coordinator.kill.audit.durationToRetain")
+  @Default("PT-1s")


These defaults don't seem to match the documented defaults above

Fixed the docs.
druid.coordinator.kill.audit.period is not required, and its default value is P1D.
druid.coordinator.kill.audit.durationToRetain is required. Its default value is -1, which causes the precondition check to fail, so the user should set it to a positive value. Instead of documenting that the default value is -1 and that the Coordinator will fail if the value is negative, for simplicity I just documented that the default value is None and that the user must set this value.
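To make the "negative default forces the user to set a value" idea concrete, here is a minimal sketch. It assumes `java.time.Duration` for parsing; the class `AuditKillConfigSketch` and the method `validatedRetainMillis` are hypothetical names, not Druid's config classes.

```java
import java.time.Duration;

// Sketch of a required setting enforced through a deliberately invalid default.
// Class and method names are hypothetical; only the default mirrors the diff above.
class AuditKillConfigSketch
{
  // Negative default, as in @Default("PT-1s"): an unset value fails validation.
  static final String DEFAULT_DURATION_TO_RETAIN = "PT-1S";

  // Returns the retention window in milliseconds, or throws if the value is unset or negative.
  static long validatedRetainMillis(String configuredValue)
  {
    String value = configuredValue == null ? DEFAULT_DURATION_TO_RETAIN : configuredValue;
    long millis = Duration.parse(value).toMillis();
    if (millis < 0) {
      throw new IllegalArgumentException("coordinator audit kill retainDuration must be >= 0");
    }
    return millis;
  }

  public static void main(String[] args)
  {
    // An explicitly configured value passes the check.
    System.out.println(validatedRetainMillis("P30D"));
    // Leaving the value unset falls back to the negative default and fails.
    try {
      validatedRetainMillis(null);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

From the user's point of view this behaves like a setting with no default, which is why documenting the default as None is a reasonable simplification.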

suneet-s

The settings and documentation make sense to me.

The defaults seem reasonable, and this doesn't appear to introduce a breaking change.

add docs

c819d76

maytasm mentioned this pull request Apr 9, 2021

Introduce a new configuration that skip storing audit payload if payload size exceed limit and skip storing null fields for audit payload #11078

Merged

9 tasks

clintropolis added the Area - Operations label Apr 9, 2021

maytasm added 2 commits April 12, 2021 23:33

add impl

34128f7

fix checkstyle

ca2d632

maytasm changed the title ~~Add feature to automatically remove audit logs and task logs based on number of entry and time~~ Add feature to automatically remove audit logs based on retention period Apr 13, 2021

maytasm added 6 commits April 12, 2021 23:55

fix test

8176c64

add test

b298b18

fix checkstyle

5578630

fix checkstyle

971b660

Merge branch 'master' into IMPLY-6571

355ef52

fix test

362097e

jihoonson added the Design Review label Apr 16, 2021

jihoonson reviewed Apr 16, 2021

Address comments

55271c9

jihoonson reviewed Apr 19, 2021

maytasm added 2 commits April 19, 2021 18:00

Address comments

90d8095

fix spelling

bbc6746

jihoonson approved these changes Apr 20, 2021

suneet-s reviewed Apr 20, 2021

fix docs

4182ff9

suneet-s approved these changes Apr 20, 2021

maytasm merged commit 6d2b5cd into apache:master Apr 21, 2021

maytasm deleted the IMPLY-6571 branch April 21, 2021 00:10

maytasm mentioned this pull request Apr 27, 2021

Add feature to automatically remove rules based on retention period #11164

Merged

9 tasks

maytasm mentioned this pull request May 5, 2021

Add feature to automatically remove supervisor based on retention period #11200

Merged

9 tasks

maytasm added the Release Notes label May 7, 2021

This was referenced May 10, 2021

Add feature to automatically remove datasource metadata based on retention period #11227

Merged

Add feature to automatically remove compaction configurations for inactive datasources #11232

Merged

techdocsmith mentioned this pull request May 12, 2021

add docs for high-churn datasource cleanup #11245

Merged

1 task

maytasm mentioned this pull request May 14, 2021

Auto cleanup of metadata tables to enable high frequency creation/deletion of data sources #11173

Closed

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed

suneet-s mentioned this pull request Jan 21, 2022

Enable automatic metadata cleanup by default #12188

Merged

4 tasks
