Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add feature to automatically remove audit logs based on retention period #11084

Merged
merged 13 commits into from
Apr 21, 2021

Conversation

maytasm
Copy link
Contributor

@maytasm maytasm commented Apr 8, 2021

Add feature to automatically remove audit logs based on retention period

Description

We currently already have a tasklog auto cleanup based on duration (time to retained) introduced in #3677. This PR adds a similar auto cleanup based on duration (time to retained) but for the audit table (using the coordinator duty instead of OverlordHelper).

This is useful when Druid user has a high churn of task / datasource in a short amount of time causing the metadata store size to grow uncontrollably.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@maytasm maytasm changed the title Add feature to automatically remove audit logs and task logs based on number of entry and time Add feature to automatically remove audit logs based on retention period Apr 13, 2021
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|
|`druid.coordinator.kill.audit.on`| Boolean value for whether to enable automatic deletion of audit logs. If set to true, Coordinator will periodically remove audit logs from the audit table entries in metadata storage.| No | False|
|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
|`druid.coordinator.kill.audit.durationToRetain`| In milliseconds, audit logs to be retained created in last x milliseconds. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm seems like this config supports the ISO 8601 notation? Please update the doc to mention what notation is supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
|`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.|
|`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.|
|`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.|
|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this metric supposed to be used?

BTW, do we also need a metric for the current size (maybe in rows) of the audit logs? This should be done in a separate PR if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain. I think we should also have a metric for the current size (in rows) of the audit logs. Both of these metrics can be used together. You will be able to compare the current size (in rows) vs the kill size per run (in rows) and see if you want to increase or decrease durationToRetain.

I will add a metric for the current size (in rows) of the audit logs in a separate PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.

Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on is set to true.

@@ -743,13 +756,26 @@ private void stopBeingLeader()
// CompactSegmentsDuty should be the last duty as it can take a long time to complete
duties.addAll(makeCompactSegmentsDuty());

log.debug(
log.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is change intentional? If so, can you describe why it is better to be info? I think this was changed to debug to reduce the amount of coordinator logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed back to debug.

.addAll(metadataStoreManagementDuties)
.build();

log.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, does this log need to be info? Coordinator prints a lot of logs and it's better to reduce them if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed back to debug.

);
this.retainDuration = config.getCoordinatorAuditKillDurationToRetain().getMillis();
Preconditions.checkArgument(this.retainDuration >= 0, "coordinator audit kill retainDuration must be >= 0");
log.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, maybe debug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to debug.

public DruidCoordinatorRuntimeParams run(DruidCoordinatorRuntimeParams params)
{
if ((lastKillTime + period) < System.currentTimeMillis()) {
log.info("Running KillAuditLog duty");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging doesn't say much except that this duty is running. Maybe better to print how many logs are deleted (auditRemoved) in this run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Changed


|Property|Description|Required?|Default|
|--------|-----------|---------|-------|
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: PT1H seems clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -746,10 +746,10 @@ These Coordinator static configurations can be defined in the `coordinator/runti

|Property|Description|Required?|Default|
|--------|-----------|---------|-------|
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)|
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. |No | PT1H|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc CI fails because of PT1H. One way to avoid the spell check for such variables is wrapping them with backticks (`).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina
|`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.|
|`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.|
|`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.|
|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.

Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on is set to true.

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Comment on lines 751 to 752
|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
|`druid.coordinator.kill.audit.durationToRetain`| Duration of audit logs to be retained from created time in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like both of these fields are required. Can you describe how a user should think about setting both of these fields so they work together. Is there a reason we don't allow users to just set one or the other?

Will this affect upgrades if someone has druid.coordinator.kill.audit.period set but druid.coordinator.kill.audit.durationToRetain unset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these fields are very different. druid.coordinator.kill.audit.period controls the frequency (how often) the duty runs, while druid.coordinator.kill.audit.durationToRetain controls how many audit logs the user wants to keep.
druid.coordinator.kill.audit.period is actually not required. Let me fix the docs

public abstract Duration getCoordinatorAuditKillPeriod();

@Config("druid.coordinator.kill.audit.durationToRetain")
@Default("PT-1s")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These defaults don't seem to match the documented defaults above

Copy link
Contributor Author

@maytasm maytasm Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the docs.
druid.coordinator.kill.audit.period is not required. And the default value is P1D
druid.coordinator.kill.audit.durationToRetain is required. The default value which is -1 which causes the precondition check to fail. User should set it to a positive value. Instead of putting in the docs that the default value is -1 and that Coordinator will fail if value is negative, for simplicity, I just documented that default value is None and user must set this value.

Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The settings and documentation make sense to me.

The defaults seem reasonable, and this doesn't appear to introduce a breaking change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants