
Add feature to automatically remove datasource metadata based on retention period #11227

Merged (13 commits) on May 11, 2021

Conversation

@maytasm (Contributor) commented on May 10, 2021:

Description

We already have auto cleanup for task logs (#3677) and audit logs (#11084). This PR adds a similar auto cleanup, based on a retention duration (time to retain), for the datasource metadata table: it removes entries for datasources that are no longer active, meaning the datasource has no active supervisor running. (Note: datasource metadata only exists for datasources created from supervisors.)

This is useful when a Druid user has high churn of tasks/datasources in a short amount of time, which causes the metadata store to grow uncontrollably.
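For context, the coordinator settings for this cleanup follow the same shape as the existing task-log and audit-log cleanup configs. A sketch of what the configuration looks like; the property names below mirror the pattern of the existing `druid.coordinator.kill.audit.*` settings and should be treated as illustrative until checked against the Druid docs for the exact keys introduced by this PR:

```properties
# Enable automated cleanup of datasource metadata on the Coordinator (illustrative keys)
druid.coordinator.kill.datasource.on=true
# How often the cleanup duty runs (ISO-8601 period)
druid.coordinator.kill.datasource.period=P1D
# Entries older than this retention period are eligible for deletion
druid.coordinator.kill.datasource.durationToRetain=P30D
```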

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

```java
    )
);
for (String datasourceMetadataInDb : datasources) {
  if (!excludeDatasources.contains(datasourceMetadataInDb)) {
```
Contributor:
Possible NPE: `excludeDatasources` is marked as nullable in the function definition.

@maytasm (Author) commented on May 10, 2021:

`excludeDatasources` should be `@NotNull`. Fixed.

```java
        .mapTo(String.class)
        .list()
);
return connector.getDBI().withHandle(
```
Contributor:
What happens if an exception is thrown while trying to delete the datasources? `withHandle` will throw a `CallbackFailedException`. Is this handled somewhere else in the code?

@maytasm (Author):

Added a try/catch block.
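A minimal sketch of the pattern discussed above: JDBI's `withHandle` wraps any failure in the callback in a `CallbackFailedException` (a `RuntimeException`), so the cleanup duty can catch it, log, and report zero rows removed rather than crash. The class, method name, and `Supplier` stand-in for the JDBI handle callback below are all hypothetical, stdlib-only illustrations, not the PR's actual code:

```java
import java.util.function.Supplier;

public class SafeDelete {
    // Hypothetical wrapper: "deleteCallback" stands in for the JDBI
    // connector.getDBI().withHandle(...) call that performs the DELETE.
    static int tryRemoveDatasourceMetadata(Supplier<Integer> deleteCallback) {
        try {
            return deleteCallback.get();
        } catch (RuntimeException e) { // stands in for CallbackFailedException
            // Log and swallow so one failed cleanup pass does not kill the duty
            System.err.println("Failed to remove datasource metadata: " + e.getMessage());
            return 0;
        }
    }

    public static void main(String[] args) {
        int removed = tryRemoveDatasourceMetadata(() -> {
            throw new RuntimeException("db down");
        });
        System.out.println("removed=" + removed); // removed=0
    }
}
```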

```java
handle -> {
  final PreparedBatch batch = handle.prepareBatch(
      StringUtils.format(
          "DELETE FROM %1$s WHERE dataSource = :dataSource AND created_date < '%2$s'",
```
Contributor:

Why did you choose to build the delete statements one at a time instead of doing a batch delete?

I think we could encapsulate the `excludeDatasources` logic in a WHERE clause of this delete statement instead.

Something like `DELETE FROM datasources WHERE created_date < "date" AND datasource NOT IN ("excludeDataSources")`

@maytasm (Author):

This is to prevent the `IN` clause from being too large.

@maytasm (Author):

Also moved the filtering of the datasources outside the handle block.
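The approach settled on above can be sketched as follows: filter the excluded datasources in Java first, then bind each remaining datasource as one parameterized entry in a prepared batch of per-row DELETEs, so no unbounded `NOT IN (...)` list ever reaches SQL. The class and method names here are illustrative, stdlib-only stand-ins for the PR's code:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

public class BatchDeletePlan {
    // Hypothetical helper: compute which datasources to delete, outside the
    // database handle, before issuing one ":dataSource" batch binding per entry.
    static List<String> datasourcesToDelete(Collection<String> inDb, Set<String> exclude) {
        List<String> out = new ArrayList<>();
        for (String ds : inDb) {
            if (!exclude.contains(ds)) {
                out.add(ds); // each entry becomes one parameterized DELETE in the batch
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> plan = datasourcesToDelete(
                List.of("wiki", "logs", "metrics"), Set.of("logs"));
        System.out.println(plan); // [wiki, metrics]
    }
}
```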

Comment on lines 81 to 83
```java
if ((lastKillTime + period) < System.currentTimeMillis()) {
  lastKillTime = System.currentTimeMillis();
  long timestamp = System.currentTimeMillis() - retainDuration;
```
Contributor:

nit: use a consistent timestamp for all calculations

Suggested change:

```java
long currentTimeMillis = System.currentTimeMillis();
if ((lastKillTime + period) < currentTimeMillis) {
  lastKillTime = currentTimeMillis;
  long timestamp = currentTimeMillis - retainDuration;
```

Contributor:

Additional question: I notice this pattern in a few other coordinator duties.

Are there any additional safeguards we need for a very large `retainDuration`? What happens if `timestamp` is calculated to be negative?

@maytasm (Author):

Added a safeguard so the calculated timestamp can never be negative.
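One way to implement such a safeguard is to clamp the cutoff at zero, so a `retainDuration` larger than the current epoch time cannot produce a negative timestamp. This is a minimal sketch under that assumption (the PR may instead validate the configuration and fail fast); the class and method names are hypothetical:

```java
public class KillTimestamp {
    // Clamp at zero: a retainDuration larger than currentTimeMillis would
    // otherwise yield a negative (and meaningless) cutoff timestamp.
    static long cutoffTimestamp(long currentTimeMillis, long retainDuration) {
        return Math.max(0L, currentTimeMillis - retainDuration);
    }

    public static void main(String[] args) {
        System.out.println(cutoffTimestamp(10_000L, 4_000L));  // 6000
        System.out.println(cutoffTimestamp(10_000L, 99_999L)); // 0
    }
}
```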

@maytasm (Author):

Note that I intentionally didn't specify in the docs that `retainDuration` has to be less than the current timestamp (although we do check this condition to protect ourselves from unexpected behavior), since this is a very unlikely scenario and I don't want to make the docs unnecessarily verbose.

```java
Set<String> allDatasourceWithActiveSupervisor = allSupervisor.values()
    .stream()
    // A terminated supervisor will have its latest supervisorSpec as NoopSupervisorSpec
    // (NoopSupervisorSpec is used as a tombstone marker)
```
Contributor:

This logic is very similar to `SQLMetadataSupervisorManager#removeTerminatedSupervisorsOlderThan`.

Should that logic be moved out of the metadata store layer and pulled into the KillSupervisors class instead?

Should this logic be shared so that other callers can easily find the "active" supervisors?

@maytasm (Author):

I added some convenience methods, `getLatestTerminatedOnly` and `getLatestActiveOnly`, in `SQLMetadataSupervisorManager`.
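The tombstone-filtering idea behind a `getLatestActiveOnly`-style helper can be sketched with toy types: in Druid, a terminated supervisor's latest spec is a `NoopSupervisorSpec` tombstone, so "active" datasources are those whose latest spec is anything else. The local types and method below are illustrative stand-ins only, not Druid's actual classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ActiveSupervisors {
    // Toy stand-ins for Druid's supervisor spec types (illustrative only)
    interface SupervisorSpec { }
    static class NoopSupervisorSpec implements SupervisorSpec { } // tombstone marker
    static class RealSpec implements SupervisorSpec { }

    // Keep only datasources whose latest spec is NOT the tombstone
    static Set<String> activeDatasources(Map<String, SupervisorSpec> latestSpecs) {
        return latestSpecs.entrySet().stream()
                .filter(e -> !(e.getValue() instanceof NoopSupervisorSpec))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, SupervisorSpec> specs = new HashMap<>();
        specs.put("wiki", new RealSpec());
        specs.put("old", new NoopSupervisorSpec());
        System.out.println(activeDatasources(specs)); // [wiki]
    }
}
```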

@suneet-s (Contributor) left a review:

LGTM after CI

```
[ERROR] Errors:
[ERROR]   KillDatasourceMetadataTest.unnecessary Mockito stubbings » UnnecessaryStubbing
```

4 participants