Optionally load segment index files into page cache on bootstrap and new segment download #11309

Closed
wjhypo wants to merge 7 commits into apache:master from wjhypo:feature-loadSegmentsIntoPageCache

Conversation

wjhypo (Contributor) commented May 26, 2021

Fixes #7374

Background

The background is covered in the above issue, but here's some more detail. When a query comes in, the historical process retrieves data from the segment index file through mmap. From the historical process's perspective, it operates on a byte array backed by the segment index file, addressed by an offset and length. The operating system is responsible for making sure the data at the referenced offset is in RAM. If it is not, the OS triggers a major page fault to load the page, or a series of pages containing the data, into RAM, which is a costly disk operation. If the referenced data is already in RAM, it may reside in the disk cache (a synonym for page cache) portion of RAM, in which case only a minor page fault is triggered to link the page in the page cache to a page of Druid's virtual memory, a much lighter operation that involves no disk read.

Because of the way the OS manages the page cache, if a segment file is traversed via the read system call by any process (even a non-Druid process), the OS will load it into the page cache portion of RAM as long as space is left, so later read system calls against the segment file trigger only minor page faults. For latency-sensitive use cases we want minimal disk reads at query time, so we can leverage this mechanism: read the segment files when they are first downloaded to the local disk of historical hosts, populating the page cache beforehand so that even the first query against a segment is fast.
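
As a rough illustration (not the PR's exact code), the warm-up boils down to reading the whole index file and discarding the bytes. A minimal Java sketch, assuming Java 11 for OutputStream.nullOutputStream and InputStream.transferTo:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class PageCacheWarmer
{
  // Read every byte of a segment index file and discard it. The read()
  // syscalls make the OS populate the page cache, so later mmap access by
  // the query path takes only minor page faults instead of disk reads.
  static void warmPageCache(Path indexFile) throws IOException
  {
    try (InputStream in = Files.newInputStream(indexFile);
         OutputStream devNull = OutputStream.nullOutputStream()) {
      in.transferTo(devNull);
    }
  }
}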

Description

  • Choice of algorithms

This PR avoids the more complicated design from #7374 and targets the cluster setup where the total segment size on local disk is smaller than the page cache usable in RAM. The simpler design adopted here is to blindly read the index files of all segments for all data sources on a host into a null output stream, forcing the OS to load the segment files into the page cache, both during historical process bootstrap and on each new segment download after bootstrap. Setups with a smaller RAM-to-disk ratio are left as undefined behavior (they won't crash, they just won't see the full performance benefit this PR intends).

When the total segment size on local disk is larger than the page cache usable in RAM, many complications arise, most of them outside the application's control, so this PR avoids that case:

  1. Page cache management is at the discretion of the operating system, so the OS may evict segments already in the page cache according to whatever policy it implements (LRU, LFU, etc.). From the application's perspective, we can only assume that any segment may be evicted once the page cache is full.
  2. Suppose we made it configurable to load only certain data sources into the page cache: say the host serves data sources X and Y, and only X is configured for loading. Unless Y is never queried at all, we cannot prevent Y from polluting the page cache used by X, because when Y is queried, the bytes needed from its segment index file are mmapped and must be loaded into the page cache by the OS, and if the page cache is full, some of X's segments have to be evicted first. A better way to address different data sources' needs, in my opinion, is to configure separate server pools: for a latency-sensitive data source X, use a cluster with a 1:1 ratio of available page cache RAM to segment size and enable this feature; for a non-latency-sensitive data source Y, use another cluster of historical nodes with a smaller page-cache-RAM-to-disk ratio and disable this feature, so on-demand page-in and page-out during mmap is performed. The user can still enable this feature for Y; it just won't be as effective.

This PR trades off guaranteed warm-up for availability: loading segments into the page cache is done asynchronously in a thread pool, so it neither delays historical process startup nor impacts how soon data is ready for query after download.
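
A minimal sketch of this announce-first, warm-later ordering (the executor setup and names such as warmupExec and segmentIndexFiles are illustrative, not the PR's actual code):

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BootstrapWarmup
{
  // Segments are announced immediately; warm-up runs in the background so
  // availability never waits on disk throughput. Warm-up is best effort.
  static void warmAsync(List<Path> segmentIndexFiles, int numThreads)
  {
    ExecutorService warmupExec = Executors.newFixedThreadPool(numThreads);
    for (Path indexFile : segmentIndexFiles) {
      warmupExec.submit(() -> {
        try {
          PageCacheWarmer.warmPageCache(indexFile); // sketch above
        }
        catch (IOException e) {
          // Best effort: a failed warm-up only costs page faults at query time.
        }
      });
    }
    warmupExec.shutdown();
  }
}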

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

segment.getId()
);
- loadSegment(segment, callback, config.isLazyLoadOnStart());
+ loadSegment(segment, callback, config.isLazyLoadOnStart(), loadSegmentsIntoPageCacheOnBootstrapExec);
Member

Since segment loading is synchronous whereas loading into the page cache is async, it might happen that the historical has announced itself while loadSegmentsIntoPageCacheOnBootstrapExec is still copying segments to the null stream, which can take time depending on the IO throughput of the disk. I thought the point of the PR was to warm up the page cache before the historical announces itself ready, and hence avoid warm-up delays and performance issues after reboot. Am I missing something here?

Contributor Author

Hi Parag, thanks for the comment! To clarify, the point of the PR is to provide best-effort precaching of segments, not a 100% guaranteed precache. The reason is that we don't want to sacrifice availability.

Imagine we have a data source configured with one replica and the host serving it dies; we then want the missing segments to be available ASAP on the replacement host. This is the case of loading segments onto a completely new host whose page cache holds none of the segments to load. Copying segments to the null stream takes time, and with a lot of segments it can easily take more than 10 minutes to complete the full read into the page cache, which is too long a period of unavailability. In production we tried synchronous loading before announcing the segment, and it was indeed too slow. Another case is restarting the Druid historical process on the same host: reading segments into the null stream is then quite fast, since the segments are already in the OS page cache, but it would still take some time compared with announcing segments directly after download.

That said, the strategy of announcing a segment immediately after download and asynchronously reading it into the null stream afterwards still has value. Without the change, since the OS only mmaps the portions of segments a query needs, even after days of a historical serving a segment, a query hitting a different portion of the segment will still trigger disk reads. With the change, we can ensure that roughly 10 minutes after segments are downloaded to a historical (depending on the number of segments), no subsequent queries will hit disk.

Member

I understand your point. I am just worried that while a historical is warming up, query performance will be worse than after a usual boot-up, since the process is already doing disk reads for the warm-up while doing more disk reads to serve queries.

Can we introduce a flag to enable synchronous warm-up, where the historical only announces itself after everything has been read once? When warm-up is enabled, the default behaviour would remain async as in this PR but could be switched to sync warm-up when boot-up time is not a concern. How does this sound to you? Thanks
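
For illustration only, a hypothetical shape such a flag could take in a Jackson-bound config class (the name loadSegmentsIntoPageCacheSynchronously is invented here, not part of this PR):

import com.fasterxml.jackson.annotation.JsonProperty;

public class PageCacheWarmupConfig
{
  // Hypothetical flag sketched from the discussion above. The default stays
  // false (async), matching the PR's current behaviour.
  @JsonProperty
  private boolean loadSegmentsIntoPageCacheSynchronously = false;

  public boolean isLoadSegmentsIntoPageCacheSynchronously()
  {
    return loadSegmentsIntoPageCacheSynchronously;
  }
}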

Contributor Author

Sounds good, thanks!

Member

@wjhypo any updates on when you can make the changes?

Member

Thanks

Member

gentle reminder

Contributor Author

@pjain1 Hi Parag, sorry about the delay. I still haven't had time to make the change and verify the sync approach in a test cluster, and I'm not sure any user actually needs the sync approach; we only verified the async approach in production. At this point it would be better if someone who needs the sync approach could make the change and dedicate some time to verifying it. It doesn't currently align with our priorities at work, so I won't be able to make the change this year.

Member

@wjhypo no issues, I think we can get it in in its current condition and work on the sync behaviour later.

A few questions: Do you run this in your prod env? How is the performance after the historical process starts and before warm-up is complete, given the process is doing double reads, warming up as well as serving queries? Do you see more timeouts during this time? Is it even usable then? Thanks

Member

@wjhypo any thoughts on this?

pjain1 (Member) commented Oct 29, 2021

@wjhypo can you fix the conflicts and review the comments, thanks

wjhypo (Contributor Author) commented Oct 30, 2021

@wjhypo can you fix the conflicts and review the comments, thanks

Addressed! Thanks!

lgtm-com bot commented Oct 30, 2021

This pull request introduces 1 alert when merging 3e53b21 into 33d9d9b - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

this.strategy = strategy;
log.info("Using storage location strategy: [%s]", this.strategy.getClass().getSimpleName());

if (this.config.getNumThreadsToLoadSegmentsIntoPageCacheOnDownload() != 0) {
pjain1 (Member) commented Jan 13, 2022

nit: Should add validation for negative values

Member

Without validation, historical startup will fail with an IllegalArgumentException and users will have to figure out the cause, which is also fine.
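
A minimal sketch of the suggested validation (placement is hypothetical; config here stands for the surrounding segment loader config):

int numThreads = config.getNumThreadsToLoadSegmentsIntoPageCacheOnDownload();
if (numThreads < 0) {
  throw new IllegalArgumentException(
      "numThreadsToLoadSegmentsIntoPageCacheOnDownload must be >= 0, got " + numThreads
  );
}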

// Start a temporary thread pool to load segments into page cache during bootstrap
ExecutorService loadingExecutor = null;
ExecutorService loadSegmentsIntoPageCacheOnBootstrapExec =
config.getNumThreadsToLoadSegmentsIntoPageCacheOnBootstrap() != 0 ?
pjain1 (Member) commented Jan 13, 2022

nit: Should add validation for negative values

nishantmonu51 (Member) left a comment

LGTM, I think this PR is in a mergeable state. Since the default behavior is to not load segments into the page cache, it won't affect any existing users.

The other minor nit-picks can be resolved in an addendum PR.

lgtm-com bot commented Mar 16, 2022

This pull request introduces 1 alert when merging c8ebd7e into 69f928f - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

lgtm-com bot commented Mar 22, 2022

This pull request introduces 1 alert when merging cfbce44 into 0867ca7 - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression


execToUse.submit(
() -> {
final ReferenceCountingLock lock = createOrGetLock(segment);
Member

@wjhypo can you add a test for this runnable? Travis is failing because of a code coverage issue. Once Travis passes, I will merge this PR.

Member

I have fixed some of the docs-related spell-check failures.

lgtm-com bot commented Mar 22, 2022

This pull request introduces 1 alert when merging 36ddd8b into d7308e9 - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

pjain1 added a commit that referenced this pull request Apr 11, 2022
* Optionally load segment index files into page cache on bootstrap and new segment download

* Fix unit test failure

* Fix test case

* fix spelling

* fix spelling

* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
pjain1 (Member) commented Apr 11, 2022

done in #12402

pjain1 closed this Apr 11, 2022
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
* Optionally load segment index files into page cache on bootstrap and new segment download

* Fix unit test failure

* Fix test case

* fix spelling

* fix spelling

* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
Successfully merging this pull request may close these issues.

Segment Loading on historical node startup
