Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix task bootstrapping & simplify segment load/drop flows #16475

Merged
merged 48 commits into from
Jun 4, 2024

Conversation

abhishekrb19
Copy link
Contributor

@abhishekrb19 abhishekrb19 commented May 20, 2024

Problems

  1. The segment bootstrapping logic relies on the presence of configured storage locations. That is, only servers that have storage location configured will act like a "segment data server". Tasks on the other hand don't have this static configuration because tasks can dynamically run on any slot. The CliPeon injects a list of dynamic StorageLocations. This List<StorageLocation> is not currently present in SegmentLoadDropHandler, which does the bootstrapping of segments as part of its start() lifecycle. Instead, this class directly uses config.getLocations(), which isn't set by the TaskToolboxFactory when it instantiates SegmentCacheManager.

  2. One of the overloaded constructors marked @VisibleForTesting in SegmentLocalCacheManager is also being used by tasks via SegmentCacheManagerFactory. This "test constructor" also does not initialize the thread pool executors, so segments won't be loaded into the page cache even if loader config has the appropriate settings.

Core changes

The core part of the fix in SegmentLoadDropHandler is to rely on the source of truth for the caching layer: SegmentCacheManager. Currently, the various interactions of the classes are complex and have abstraction leaks in its implementation:

  • SegmentManager -> SegmentCacheLoader -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentManager -> SegmentCacheLoader -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentCacheManager

The SegmentLoadDropHandler directly handles the segment info file creation and deletion. These low-level file operations should really be owned and handled by the SegmentCacheManager and the classes that interact with should just operate at the granularity of Segment objects.

With this patch, the updated interactions are as follows:

  • SegmentManager -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentManager -> SegmentCacheManager

The addition and deletion of segment info files are handled by the SegmentLocalCacheManager class itself. The SegmentLoadDropHandler gets the set of cached segments from the SegmentManager. To that effect, the dependencies in SegmentLoadDropHandler have been cleaned up, and this class doesn't fiddle with the segment files directly.

In order to simplify this, we did the following:

  1. Removed the SegmentLoader interface and its concrete implementation SegmentLocalCacheLoader. The implementation was acting as a simple factory of Segment objects from the underlying segment files, which has now been moved to SegmentLocalCacheManager, still providing convenience for any test overrides for the factory.

  2. Added the following methods to SegmentCacheManager for callers to interact with:

    • boolean canHandleSegments(): A high-level method that informs callers whether the caching layer can handle segments or not.
    • List getCachedSegments()
    • void storeInfoFile(DataSegment segment)
    • void removeInfoFile(DataSegment segment)
    • ReferenceCountingSegment getSegment(DataSegment segment)
    • ReferenceCountingSegment getBootstrapSegment(DataSegment segment, SegmentLazyLoadFailCallback loadFailed)
    • void loadSegmentIntoPageCache(DataSegment segment)
    • void loadSegmentIntoPageCacheOnBootstrap(DataSegment segment)
  • SegmentManager is now responsible for calling storeInfoFile(segment) once the segment bookkeeping stuff is successful as part of the load segment flow.
  • The bootstrap executor was initialized by SegmentLoadDropHandler, which is now moved to the SegmentLocalCacheManager., along with the initialization of the segment download executor.
  • Methods in SegmentManager and SegmentLocalCacheManager have bene broken down into loadSegment() and loadBootstrapSegment() variants, because segment lazy loading and the fail callback are only applicable for the bootstrap segment flow. This cleanup ensures that callers don't pass unnecessary null arguments anymore and makes the intent clearer.
  • Remove some VisibleForTesting hooks, including the problematic "test constructor" in SegmentLocalCacheManager noted in the problems section. Where possible, the test code uses the same initialization logic as the core code.

Test changes

  • Added new tests in SegmentLocalCacheManagerTest and SegmentManagerTest.
  • Moved the existing tests from SegmentLoaderTest to SegmentLocalCacheManagerTest.
  • Added TestSegmentUtils for commonly used test utilities that has test implementation support for loading DataSegment, Segment factory, etc.

Future work

  1. We can further simplify and cleanup these interactions by moving the functionality of SegmentLoadDropHandler to SegmentManager such that the announcements moves to the manager layer itself. That'll entail a slightly larger refactor.
  2. Cleanup the test code in SegmentLocalCacheManagerTest

Both of these require moderate to large refactoring changes, which can be tackled separately.

Release note

Fixed an issue in task bootstrapping that prevented tasks from accepting any segment assignments, including broadcast segments.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels May 20, 2024
- The load drop handler code talks to the local cache manager via
SegmentManager.
@kfaraz
Copy link
Contributor

kfaraz commented May 21, 2024

@abhishekrb19 , thanks for the detailed description.

IIUC, this PR is basically doing the following:

  • Inject the correct value of storage locations into SegmentLoadDropHandler
  • Refactor and cleanup classes

Is there any other bug fix or feature included in this PR too?
Just wanted to confirm before I start the actual review.

@abhishekrb19
Copy link
Contributor Author

abhishekrb19 commented May 21, 2024

IIUC, this PR is basically doing the following:

  • Inject the correct value of storage locations into SegmentLoadDropHandler
  • Refactor and cleanup classes

That's about right, except that we're not injecting storage locations into SegmentLoadDropHandler to fix the bug. Instead, SegmentLoadDropHandler relies on SegmentManager and SegmentCacheManager to provide the needed information. With this patch, SegmentCacheManager essentially serves as the source of truth for low-level segment info file operations, and this class already has the list of storage locations injected correctly.

Is there any other bug fix or feature included in this PR too? Just wanted to confirm before I start the actual review.

There's no other bug fix or feature in this PR. I do have another related change that touches some of this code, but given the size of this PR, I will add it separately to make reviewing easier. 🙂

@kfaraz
Copy link
Contributor

kfaraz commented May 21, 2024

Makes sense. Thanks for the clarification, @abhishekrb19 !

The intellij-inspect tool doesn't seem to correctly inspect
lambda usages. See ScheduledExecutors.
@abhishekrb19 abhishekrb19 requested a review from jon-wei May 30, 2024 21:37
@abhishekrb19 abhishekrb19 changed the title Fix task bootstrapping. Fix task bootstrapping & simplify segment load/drop flows Jun 3, 2024
@abhishekrb19 abhishekrb19 merged commit b9ba286 into apache:master Jun 4, 2024
87 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Release Notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants