-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add feature to automatically remove audit logs based on retention period #11084
Conversation
docs/configuration/index.md
Outdated
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)| | ||
|`druid.coordinator.kill.audit.on`| Boolean value for whether to enable automatic deletion of audit logs. If set to true, Coordinator will periodically remove audit logs from the audit table entries in metadata storage.| No | False| | ||
|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None| | ||
|`druid.coordinator.kill.audit.durationToRetain`| In milliseconds, audit logs to be retained created in last x milliseconds. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm seems like this config supports the ISO 8601 notation? Please update the doc to mention what notation is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
docs/operations/metrics.md
Outdated
@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina | |||
|`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.| | |||
|`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.| | |||
|`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.| | |||
|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this metric supposed to be used?
BTW, do we also need a metric for the current size (maybe in rows) of the audit logs? This should be done in a separate PR if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain. I think we should also have a metric for the current size (in rows) of the audit logs. Both of these metrics can be used together. You will be able to compare the current size (in rows) vs the kill size per run (in rows) and see if you want to increase or decrease durationToRetain.
I will add a metric for the current size (in rows) of the audit logs in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.
Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on
is set to true.
@@ -743,13 +756,26 @@ private void stopBeingLeader() | |||
// CompactSegmentsDuty should be the last duty as it can take a long time to complete | |||
duties.addAll(makeCompactSegmentsDuty()); | |||
|
|||
log.debug( | |||
log.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is change intentional? If so, can you describe why it is better to be info? I think this was changed to debug to reduce the amount of coordinator logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed back to debug.
.addAll(metadataStoreManagementDuties) | ||
.build(); | ||
|
||
log.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, does this log need to be info? Coordinator prints a lot of logs and it's better to reduce them if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed back to debug.
); | ||
this.retainDuration = config.getCoordinatorAuditKillDurationToRetain().getMillis(); | ||
Preconditions.checkArgument(this.retainDuration >= 0, "coordinator audit kill retainDuration must be >= 0"); | ||
log.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, maybe debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to debug.
public DruidCoordinatorRuntimeParams run(DruidCoordinatorRuntimeParams params) | ||
{ | ||
if ((lastKillTime + period) < System.currentTimeMillis()) { | ||
log.info("Running KillAuditLog duty"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logging doesn't say much except that this duty is running. Maybe better to print how many logs are deleted (auditRemoved
) in this run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Changed
docs/configuration/index.md
Outdated
|
||
|Property|Description|Required?|Default| | ||
|--------|-----------|---------|-------| | ||
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: PT1H
seems clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
docs/configuration/index.md
Outdated
@@ -746,10 +746,10 @@ These Coordinator static configurations can be defined in the `coordinator/runti | |||
|
|||
|Property|Description|Required?|Default| | |||
|--------|-----------|---------|-------| | |||
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks. |No |PT3600S (1 hour)| | |||
|`druid.coordinator.period.metadataStoreManagementPeriod`|How often to run metadata management tasks in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. |No | PT1H| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc CI fails because of PT1H
. One way to avoid the spell check for such variables is wrapping them with backticks (`
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
docs/operations/metrics.md
Outdated
@@ -256,6 +256,8 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina | |||
|`interval/skipCompact/count`|Total number of intervals of this datasource that are skipped (not eligible for auto compaction) by the auto compaction.|datasource.|Varies.| | |||
|`coordinator/time`|Approximate Coordinator duty runtime in milliseconds. The duty dimension is the string alias of the Duty that is being run.|duty.|Varies.| | |||
|`coordinator/global/time`|Approximate runtime of a full coordination cycle in milliseconds. The `dutyGroup` dimension indicates what type of coordination this run was. i.e. Historical Management vs Indexing|`dutyGroup`|Varies.| | |||
|`metadata/kill/audit/count`|Total number of audit logs deleted from metadata store audit table.| |Varies.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric can give the user an idea of how many audit logs are being automatically deleted. Basically how many audit logs do they have within the duration configured. This can help them get a sense if they want to increase or decrease the durationToRetain.
Thanks for the explanation. Could you please explain this in the doc as well? Also please mention that this metric is emitted only when druid.coordinator.kill.audit.on
is set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
docs/configuration/index.md
Outdated
|`druid.coordinator.kill.audit.period`| How often to do automatic deletion of audit logs in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Value must be greater than `druid.coordinator.period.metadataStoreManagementPeriod`. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None| | ||
|`druid.coordinator.kill.audit.durationToRetain`| Duration of audit logs to be retained from created time in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Only applies if `druid.coordinator.kill.audit.on` is set to True.| Yes if `druid.coordinator.kill.audit.on` is set to True| None| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like both of these fields are required. Can you describe how a user should think about setting both of these fields so they work together. Is there a reason we don't allow users to just set one or the other?
Will this affect upgrades if someone has druid.coordinator.kill.audit.period
set but druid.coordinator.kill.audit.durationToRetain
unset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of these fields are very different. druid.coordinator.kill.audit.period
controls the frequency (how often) the duty runs, while druid.coordinator.kill.audit.durationToRetain
controls how many audit logs the user wants to keep.
druid.coordinator.kill.audit.period
is actually not required. Let me fix the docs
public abstract Duration getCoordinatorAuditKillPeriod(); | ||
|
||
@Config("druid.coordinator.kill.audit.durationToRetain") | ||
@Default("PT-1s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These defaults don't seem to match the documented defaults above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the docs.
druid.coordinator.kill.audit.period
is not required. And the default value is P1D
druid.coordinator.kill.audit.durationToRetain
is required. The default value which is -1 which causes the precondition check to fail. User should set it to a positive value. Instead of putting in the docs that the default value is -1 and that Coordinator will fail if value is negative, for simplicity, I just documented that default value is None and user must set this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The settings and documentation make sense to me.
The defaults seem reasonable, and this doesn't appear to introduce a breaking change.
Add feature to automatically remove audit logs based on retention period
Description
We currently already have a tasklog auto cleanup based on duration (time to retained) introduced in #3677. This PR adds a similar auto cleanup based on duration (time to retained) but for the audit table (using the coordinator duty instead of OverlordHelper).
This is useful when Druid user has a high churn of task / datasource in a short amount of time causing the metadata store size to grow uncontrollably.
This PR has: