Skip to content

Clean up stale entries from upgradeSegments table#15637

Merged
AmatyaAvadhanula merged 8 commits intoapache:masterfrom
AmatyaAvadhanula:clean_upgrade_segments_table
Jan 17, 2024
Merged

Clean up stale entries from upgradeSegments table#15637
AmatyaAvadhanula merged 8 commits intoapache:masterfrom
AmatyaAvadhanula:clean_upgrade_segments_table

Conversation

@AmatyaAvadhanula
Copy link
Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Jan 8, 2024

The upgradeSegments table contains entries of pending segments allocated by tasks with APPEND locks when there was a REPLACE lock held over the interval.

This PR aims to clean the entries from the metadata store after the task with the REPLACE lock completes by invoking a clean up method before unlocking its locks.
Since this is performed on the TaskLockbox directly, rolling upgrades wouldn't be affected and the clean up would happen automatically once the Overlord is upgraded. Also, the task lockbox is aware of the locks held by the task and this can help us avoid unnecessary calls to the metadata store to clean stale entries.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz self-requested a review January 8, 2024 12:14
Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a task action to perform metadata cleanup makes perfect sense. I have left some feedback that should be addressed before this is merged.

There a couple more concerns:

  • Rolling Upgrade: A task running on an upgraded version would fire the new action but an old Overlord would not be able to recognize it.
  • Should the new action be fired only for supervisor type tasks? Atleast in the impl of the task action, we should do some filtering and proceed to DELETE entries only for relevant supervisor tasks with REPLACE context. Otherwise, we would just be firing redundant DELETE statements on the metadata.

* @return Map from append Segment ID to REPLACE lock version
*/
private Map<String, String> getAppendSegmentsCommittedDuringTask(
@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do this. Try to find some other cleaner way to test this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import com.fasterxml.jackson.core.type.TypeReference;
import org.apache.druid.indexing.common.task.Task;

public class CleanMetadataAction implements TaskAction<Void>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a javadoc here.
I would prefer using some kind of response object rather than void. This response object could potentially contain the information of whether the cleanup was successful or not, how many rows of which table were deleted, etc. (probably not needed in this PR, but let's return something meaningful anyway.)

Also, rename:

Suggested change
public class CleanMetadataAction implements TaskAction<Void>
public class CleanupMetadataAction implements TaskAction<Void>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// call it 3 times, once to update location in setup, then one for status and location in cleanup
Mockito.verify(taskActionClient, times(3)).submit(any());
// call it 3 times, once to update location in setup, then one for status and location in cleanup,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// call it 3 times, once to update location in setup, then one for status and location in cleanup,
// call it 4 times, once to update location in setup, then one for status and location in cleanup,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AmatyaAvadhanula
Copy link
Contributor Author

@kfaraz, thank you for the feedback.

Rolling Upgrade: A task running on an upgraded version would fire the new action but an old Overlord would not be able to recognize it.

Apologies, I had missed this.
Do you have a suggestion to make the TaskAction work during a rolling upgrade?

An alternative could be to invoke the cleanup from TaskLockbox directly when the task is removed from the set of active tasks instead of using a new task action.

@kfaraz
Copy link
Contributor

kfaraz commented Jan 9, 2024

An alternative could be to invoke the cleanup from TaskLockbox directly when the task is removed from the set of active tasks instead of using a new task action.

Is there any cleanup which is currently performed by the TaskLockbox?

@AmatyaAvadhanula
Copy link
Contributor Author

AmatyaAvadhanula commented Jan 9, 2024

Is there any cleanup which is currently performed by the TaskLockbox?

Yes, the TaskLockbox is responsible for clearing the lock entries from the metadata store

The TaskQueue could also call a task action while removing a task from its set of active tasks. This happens on the Overlord itself so there wouldn't be an issue with rolling upgrades.

@kfaraz
Copy link
Contributor

kfaraz commented Jan 11, 2024

@AmatyaAvadhanula , let's keep it simple. Since the new task action is responsible only for cleanup of upgrade segments (something which doesn't happen in the current version anyway), let's just put the submission of the task action in a try catch. If it succeeds, cool. If the OL cannot identify the task action and throws an exception, we just log the exception saying that the OL is probably on an old version. This would ensure that the task itself doesn't fail if the OL is on an old version.

Also, please make sure you test out this scenario.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft January 17, 2024 05:33
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review January 17, 2024 10:01
try {
try {
log.info("Removing task[%s] from activeTasks", task.getId());
if (findLocksForTask(task).stream().anyMatch(lock -> lock.getType() == TaskLockType.REPLACE)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use findReplaceLocksForTask method instead? But it would return only non-revoked locks.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, otherwise LGTM

DataSegment retrieveSegmentForId(String id, boolean includeUnused);

/**
* Clean entries in upgrade segments table after the corresponding replace task has ended
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Clean entries in upgrade segments table after the corresponding replace task has ended
* Delete entries from upgrade segments table after the corresponding replace task has ended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@kfaraz kfaraz added this to the Druid 29.0.0 milestone Jan 17, 2024
@AmatyaAvadhanula
Copy link
Contributor Author

@kfaraz , Thank you for the review!
Merging since failures are unrelated

@AmatyaAvadhanula AmatyaAvadhanula merged commit a26defd into apache:master Jan 17, 2024
@kfaraz kfaraz deleted the clean_upgrade_segments_table branch January 18, 2024 05:00
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Jan 29, 2024
* Clean up stale entries from upgradeSegments table
LakshSingla pushed a commit that referenced this pull request Jan 30, 2024
* Clean up stale entries from upgradeSegments table
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants