Make SegmentLoader extensible and customizable #11398

Merged
merged 17 commits into from
Jul 22, 2021

Contributor

@abhishekagarwal87 abhishekagarwal87 commented Jun 30, 2021

Description

This PR refactors the code related to segment loading, specifically SegmentLoader and SegmentLoaderLocalCacheManager. SegmentLoader is marked UnstableAPI, which means it can be extended outside core Druid in custom extensions. Here is a summary of the changes:

  • SegmentLoader returns an instance of ReferenceCountingSegment instead of Segment. Previously, SegmentManager wrapped Segment objects inside ReferenceCountingSegment; that wrapping now happens in SegmentLoader. With this, a custom implementation can track the references of segments and can also create custom ReferenceCountingSegment implementations. For this reason, the constructor visibility in ReferenceCountingSegment is changed from private to protected.
  • SegmentCacheManager has two additional methods: reserve(DataSegment) and release(DataSegment). These methods let the caller reserve or release space without calling SegmentLoader#getSegment. Similar methods already existed in StorageLocation; they are now also available in SegmentCacheManager, which wraps multiple locations.
  • Refactoring to simplify the code in SegmentCacheManager wherever possible. There is no change in the functionality.
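
To make the first bullet concrete, here is a minimal sketch of why a protected constructor matters. All types below (SketchSegment, SketchRefCountingSegment, PeakTrackingSegment, SketchSegmentLoader, PeakTrackingLoader) are hypothetical, simplified stand-ins and not Druid's real classes or signatures; the real ReferenceCountingSegment has a richer API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a loaded segment; not Druid's Segment interface.
interface SketchSegment {
  String id();
}

// Simplified reference-counting wrapper. The protected constructor is what
// lets code outside this class define its own subclass.
class SketchRefCountingSegment {
  private final SketchSegment delegate;
  private final AtomicInteger references = new AtomicInteger();

  protected SketchRefCountingSegment(SketchSegment delegate) {
    this.delegate = delegate;
  }

  public SketchSegment acquire() {
    references.incrementAndGet();
    return delegate;
  }

  public void release() {
    references.decrementAndGet();
  }

  public int referenceCount() {
    return references.get();
  }
}

// A custom wrapper that an extension might add (assumption): it also
// remembers the highest reference count it has observed.
class PeakTrackingSegment extends SketchRefCountingSegment {
  private int peak = 0;

  PeakTrackingSegment(SketchSegment delegate) {
    super(delegate);
  }

  @Override
  public synchronized SketchSegment acquire() {
    SketchSegment segment = super.acquire();
    peak = Math.max(peak, referenceCount());
    return segment;
  }

  synchronized int peak() {
    return peak;
  }
}

// The loader hands back the wrapper, so the loader (not its caller) decides
// which wrapper implementation is used.
interface SketchSegmentLoader {
  SketchRefCountingSegment getSegment(String segmentId);
}

class PeakTrackingLoader implements SketchSegmentLoader {
  @Override
  public SketchRefCountingSegment getSegment(String segmentId) {
    return new PeakTrackingSegment(() -> segmentId);
  }
}
```

A caller only sees SketchRefCountingSegment, while the loader is free to return the tracking subclass. With a private constructor, PeakTrackingSegment could not be written at all.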

Key changed/added classes in this PR
  • SegmentLoader
  • SegmentLoaderLocalCacheManager
  • StorageLocation

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@lgtm-com

lgtm-com bot commented Jun 30, 2021

This pull request fixes 1 alert when merging 0af368e into 8037a54 - view on LGTM.com

fixed alerts:

  • 1 for Uncontrolled data used in path expression

@abhishekagarwal87 abhishekagarwal87 marked this pull request as draft June 30, 2021 12:12
@abhishekagarwal87 abhishekagarwal87 marked this pull request as ready for review July 1, 2021 06:00
@abhishekagarwal87 abhishekagarwal87 changed the title [Draft] Segment loader changes [Test] Segment loader changes Jul 5, 2021
@lgtm-com

lgtm-com bot commented Jul 6, 2021

This pull request fixes 1 alert when merging 92cbce4 into 497f2a1 - view on LGTM.com

fixed alerts:

  • 1 for Uncontrolled data used in path expression

@abhishekagarwal87 abhishekagarwal87 changed the title [Test] Segment loader changes Make SegmentLoader extensible and customizable Jul 6, 2021
@lgtm-com

lgtm-com bot commented Jul 6, 2021

This pull request introduces 2 alerts and fixes 2 when merging 0e8d0bf into 497f2a1 - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

@lgtm-com

lgtm-com bot commented Jul 6, 2021

This pull request introduces 2 alerts and fixes 2 when merging 0408a52 into 17efa6f - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

@lgtm-com

lgtm-com bot commented Jul 7, 2021

This pull request introduces 2 alerts and fixes 2 when merging 2e6fd24 into d5e8d4d - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

{
final ReferenceCountingLock lock = createOrGetLock(segment);
Contributor Author

The lock acquisition has been removed from here because getSegmentFiles already does it.

* @return Segment object wrapped inside {@link ReferenceCountingSegment}.
* @throws SegmentLoadingException
*/
ReferenceCountingSegment getSegment(DataSegment segment, boolean lazy, SegmentLazyLoadFailCallback loadFailed) throws SegmentLoadingException;
Contributor

Is the SegmentLoader guaranteed to return the same ReferenceCountingSegment instance across multiple calls of getSegment? Should it?

Contributor Author

It can do either and the caller is not supposed to depend on that behavior. From the caller's perspective, it is going to get a segment object wrapped inside ReferenceCountingSegment. Implementations can have optimizations to save on repeated expensive work.

Contributor

I think this should be documented in the javadoc.
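
For the javadoc, the point about instance identity could be illustrated roughly as follows. This is a minimal sketch with hypothetical names (WrappedSegmentSketch, IdentityAgnosticLoader, FreshInstanceLoader, MemoizingLoader), not the actual SegmentLoader interface: both implementations satisfy the same contract, and a caller that compares instances with == would behave differently depending on which one it gets.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a reference-counted segment wrapper.
class WrappedSegmentSketch {
  final String segmentId;

  WrappedSegmentSketch(String segmentId) {
    this.segmentId = segmentId;
  }
}

// Hypothetical loader contract: callers get a wrapper for the segment and
// must not assume anything about instance identity across calls.
interface IdentityAgnosticLoader {
  WrappedSegmentSketch getSegment(String segmentId);
}

// Builds a new wrapper on every call.
class FreshInstanceLoader implements IdentityAgnosticLoader {
  @Override
  public WrappedSegmentSketch getSegment(String segmentId) {
    return new WrappedSegmentSketch(segmentId);
  }
}

// Reuses the wrapper to avoid repeating expensive work.
class MemoizingLoader implements IdentityAgnosticLoader {
  private final Map<String, WrappedSegmentSketch> cache = new ConcurrentHashMap<>();

  @Override
  public WrappedSegmentSketch getSegment(String segmentId) {
    return cache.computeIfAbsent(segmentId, WrappedSegmentSketch::new);
  }
}
```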

* @param segment - Segment to release the location for.
* @return - True if any location was reserved and released, false otherwise.
*/
boolean release(DataSegment segment);
Contributor

What calls release()?

Contributor Author

The same code that calls reserve can call release. The idea is that if reserve is called explicitly, then release should be called explicitly too. In case of failures, SegmentLoader itself should not release the location; it should leave that to the caller.
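
A minimal sketch of that reserve/release pairing over several locations is below. LocationSketch and MultiLocationCacheSketch are hypothetical stand-ins; segment ids are plain strings and sizes are passed explicitly, which is an assumption made to keep the example self-contained and is not how DataSegment is handled in the PR.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a single storage location with a byte budget.
class LocationSketch {
  private final long maxBytes;
  private long usedBytes = 0;
  private final Map<String, Long> reserved = new HashMap<>();

  LocationSketch(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  boolean isReserved(String segmentId) {
    return reserved.containsKey(segmentId);
  }

  boolean reserve(String segmentId, long sizeBytes) {
    if (usedBytes + sizeBytes > maxBytes) {
      return false;
    }
    reserved.put(segmentId, sizeBytes);
    usedBytes += sizeBytes;
    return true;
  }

  boolean release(String segmentId) {
    Long size = reserved.remove(segmentId);
    if (size == null) {
      return false;
    }
    usedBytes -= size;
    return true;
  }
}

// Hypothetical manager that wraps several locations, so callers never touch
// a location directly. Synchronized methods keep the bookkeeping consistent.
class MultiLocationCacheSketch {
  private final List<LocationSketch> locations;

  MultiLocationCacheSketch(LocationSketch... locations) {
    this.locations = Arrays.asList(locations);
  }

  synchronized boolean reserve(String segmentId, long sizeBytes) {
    for (LocationSketch loc : locations) {
      if (loc.isReserved(segmentId)) {
        return true; // assumption: reserving twice is a no-op
      }
    }
    for (LocationSketch loc : locations) {
      if (loc.reserve(segmentId, sizeBytes)) {
        return true;
      }
    }
    return false;
  }

  // True if some location held a reservation and gave it back.
  synchronized boolean release(String segmentId) {
    for (LocationSketch loc : locations) {
      if (loc.release(segmentId)) {
        return true;
      }
    }
    return false;
  }
}
```

In this sketch, a caller that reserved explicitly would call release itself in its own failure path, rather than relying on the loader to do so.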

Contributor

@jihoonson jihoonson left a comment

Thanks @abhishekagarwal87, I left some comments mostly about the interface change.

abhishekagarwal87 added a commit that referenced this pull request Jul 20, 2021
This PR splits current SegmentLoader into SegmentLoader and SegmentCacheManager.

SegmentLoader - this class is responsible for building the segment object but does not expose any methods for downloading, cache space management, etc. The default implementation delegates the download operations to SegmentCacheManager and only contains the logic for building segments once downloaded. This class will be used in SegmentManager to construct Segment objects.

SegmentCacheManager - this class manages the segment cache on the local disk. It fetches the segment files to the local disk, can clean up the cache, and in the future will support reserve and release on cache space. [See #11398: Make SegmentLoader extensible and customizable]. This class will be used in ingestion tasks, such as compaction and re-indexing, where segment files need to be downloaded locally.
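
The split described above can be pictured with a small delegation sketch. FileCacheRole, BuildOnlyLoader, and RecordingFileCache are hypothetical names, and the string "segment" and the path are placeholders; the real classes work with DataSegment, File, and Segment objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical role: fetches segment files to local disk and manages the cache.
interface FileCacheRole {
  String fetchFiles(String segmentId);
}

// Hypothetical role: builds an in-memory segment object from local files.
class BuildOnlyLoader {
  private final FileCacheRole cache;

  BuildOnlyLoader(FileCacheRole cache) {
    this.cache = cache;
  }

  // No download or space-management logic lives here; it is delegated.
  String loadSegment(String segmentId) {
    String localPath = cache.fetchFiles(segmentId);
    return "segment[" + segmentId + "] built from " + localPath;
  }
}

// A trivial cache that records which segments it was asked to fetch.
class RecordingFileCache implements FileCacheRole {
  final List<String> fetched = new ArrayList<>();

  @Override
  public String fetchFiles(String segmentId) {
    fetched.add(segmentId);
    return "/tmp/segment-cache/" + segmentId; // placeholder path
  }
}
```

An ingestion task that only needs files on disk would depend on the cache role alone and never touch the loader.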
@lgtm-com

lgtm-com bot commented Jul 21, 2021

This pull request introduces 2 alerts and fixes 2 when merging dd80d1cd3ebd5a81220f5e2ddac722e7c9226223 into 0453e46 - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

@lgtm-com

lgtm-com bot commented Jul 21, 2021

This pull request introduces 2 alerts and fixes 2 when merging b1c874b into 6ce3b6c - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

* @param segment - Segment to release the location for.
* @return - True if any location was reserved and released, false otherwise.
*/
boolean release(DataSegment segment);
Contributor

Can you clarify the contract between this method and getSegmentFiles? For example, what should happen when release is called if reserve was not called but getSegmentFiles was called?

Contributor Author

Sure. If reserve is not called, getSegmentFiles will reserve the space. I will document this.

* {@link StorageLocation} since we don't want callers to operate on {@code StorageLocation} directly outside {@code SegmentLoader}.
* {@link SegmentLoader} operates on the {@code StorageLocation} objects in a thread-safe manner.
*/
boolean reserve(DataSegment segment);
Contributor

Should isSegmentCached still return false after reserve is called? It would be worth documenting.

Contributor Author

Yes. I will document it.
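
The contract discussed in these threads (reserve only holds space, getSegmentFiles reserves implicitly when needed, isSegmentCached stays false until files are present) might be documented with something like the sketch below. SketchSegmentCache is hypothetical: it counts segments instead of bytes, returns a placeholder path string, and treats a repeated reserve as a no-op, all of which are assumptions rather than the PR's behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of the reserve / getSegmentFiles / release contract.
class SketchSegmentCache {
  private final int capacity; // assumption: capacity counted in segments, not bytes
  private final Set<String> reserved = new HashSet<>();
  private final Set<String> downloaded = new HashSet<>();

  SketchSegmentCache(int capacity) {
    this.capacity = capacity;
  }

  // Holds space for the segment without fetching anything.
  synchronized boolean reserve(String segmentId) {
    if (reserved.contains(segmentId)) {
      return true; // assumption: reserving twice is a no-op
    }
    if (reserved.size() >= capacity) {
      return false;
    }
    reserved.add(segmentId);
    return true;
  }

  // Reserves space first if the caller has not done so, then "downloads".
  synchronized String getSegmentFiles(String segmentId) {
    if (!reserve(segmentId)) {
      throw new IllegalStateException("no space for " + segmentId);
    }
    downloaded.add(segmentId);
    return "/cache/" + segmentId; // placeholder path
  }

  // A reservation alone does not make a segment cached.
  synchronized boolean isSegmentCached(String segmentId) {
    return downloaded.contains(segmentId);
  }

  // True if a reservation existed and was given back.
  synchronized boolean release(String segmentId) {
    downloaded.remove(segmentId);
    return reserved.remove(segmentId);
  }
}
```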

cleanupCacheFiles(loc.getPath(), storageDir);
}
boolean success = loadInLocationWithStartMarkerQuietly(loc, segment, storageDir, true);
if (success) {
Contributor

Shouldn't loc.release be called when success is false? Please add a test to verify this behavior.

Contributor Author

loadInLocationWithStartMarkerQuietly(loc, segment, storageDir, true);
This method will release the location, since true is passed as the value of the releaseLocation flag.

Contributor

Ah got it. Thanks.
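
For readers following this thread, the "quiet load with a release flag" pattern looks roughly like the sketch below. QuietLoadSketch and its Downloader interface are hypothetical; the real loadInLocationWithStartMarkerQuietly takes a StorageLocation, a DataSegment, and a directory, and its internals are not reproduced here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a load helper that never throws and can optionally
// give the reserved space back when the load fails.
class QuietLoadSketch {
  interface Downloader {
    void download(String segmentId) throws Exception;
  }

  private final Set<String> reserved = new HashSet<>();

  boolean reserve(String segmentId) {
    return reserved.add(segmentId);
  }

  boolean isReserved(String segmentId) {
    return reserved.contains(segmentId);
  }

  // Returns true on success. On failure the exception is swallowed and, if
  // releaseLocation is set, the reservation is dropped before returning false.
  boolean loadQuietly(String segmentId, Downloader downloader, boolean releaseLocation) {
    try {
      downloader.download(segmentId);
      return true;
    } catch (Exception e) {
      if (releaseLocation) {
        reserved.remove(segmentId);
      }
      return false;
    }
  }
}
```

With the flag set to true, the caller only needs to check the boolean result; with false, the reservation survives a failed load and the caller stays responsible for releasing it.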

Comment on lines 121 to 131
public synchronized boolean isReserved(String segmentDir)
{
  final File segmentFile = new File(path, segmentDir);
  return files.contains(segmentFile);
}

public File segmentDirectoryAsFile(String segmentDir)
{
  return new File(path, segmentDir);
}

Contributor

The LGTM error in https://lgtm.com/projects/g/apache/druid/rev/pr-1ff2ba29372f1d2b44941bb55f75b5830f808401 seems like a false alarm. Perhaps we should suppress it for this change.

Contributor

@jihoonson jihoonson left a comment

LGTM. Thanks @abhishekagarwal87.

@lgtm-com

lgtm-com bot commented Jul 22, 2021

This pull request introduces 2 alerts and fixes 2 when merging a549256 into 167c452 - view on LGTM.com

new alerts:

  • 2 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

@lgtm-com

lgtm-com bot commented Jul 22, 2021

This pull request introduces 1 alert and fixes 2 when merging 89cd7e5 into 167c452 - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

fixed alerts:

  • 2 for Uncontrolled data used in path expression

@abhishekagarwal87
Contributor Author

The LGTM error has been suppressed, but the suppression will take effect only after the PR is merged.

Contributor

@maytasm maytasm left a comment

LGTM

@abhishekagarwal87 abhishekagarwal87 merged commit ce1faa5 into apache:master Jul 22, 2021
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
5 participants