Optionally load segment index files into page cache on bootstrap and new segment download #11309

Closed
wjhypo wants to merge 7 commits into apache:master from wjhypo:feature-loadSegmentsIntoPageCache

Conversation

wjhypo (Contributor) commented May 26, 2021

Fixes #7374

Background

The background is covered in the above issue, but here's some more detail. When a query comes in, the historical process retrieves data from the segment index file through mmap. From the historical process's perspective, it operates on a byte array backed by the segment index file, addressed by an offset and length. The operating system is responsible for making sure the data at the referenced offset is in RAM. If it is not, the OS triggers a major page fault to load the page, or a series of pages containing the data, into RAM, which is a costly disk operation. If the referenced data is already in RAM, it may reside in the disk cache (a synonym for page cache) portion of RAM, in which case only a minor page fault is triggered to link the page in the page cache to a page of Druid's virtual memory, a much lighter operation that involves no disk read.

Because of the way the OS manages the page cache, if a segment file is traversed via the read system call by any process (even a non-Druid process), the OS will load it into the page cache portion of RAM as long as space is left, so later read system calls against the segment file trigger only minor page faults. For latency-sensitive use cases we want minimal disk reads at query time, so we can leverage this mechanism: read the segment files when they are first downloaded to the local disk of historical hosts, populating the page cache beforehand so that even the first query against a segment is fast.
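
As a rough illustration (not the PR's exact code), the warm-up boils down to reading the whole index file and discarding the bytes. A minimal Java sketch, assuming Java 11 for OutputStream.nullOutputStream and InputStream.transferTo:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class PageCacheWarmer
{
  // Read every byte of a segment index file and discard it. The read()
  // syscalls make the OS populate the page cache, so later mmap access by
  // the query path takes only minor page faults instead of disk reads.
  static void warmPageCache(Path indexFile) throws IOException
  {
    try (InputStream in = Files.newInputStream(indexFile);
         OutputStream devNull = OutputStream.nullOutputStream()) {
      in.transferTo(devNull);
    }
  }
}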

Description

  • Choice of algorithms

This PR avoids the more complicated design from #7374 and targets the cluster setup where the total segment size on local disk is smaller than the page cache usable in RAM. The simpler design adopted here is to blindly read the index files of all segments for all data sources on a host into a null output stream, forcing the OS to load the segment files into the page cache, both during historical process bootstrap and on each new segment download after bootstrap. Setups with a smaller RAM-to-disk ratio are left as undefined behavior (they won't crash, they just won't see the full performance benefit this PR intends).

When the total segment size on local disk is larger than the page cache usable in RAM, many complications arise, most of them outside the application's control, so this PR avoids that case:

  1. Page cache management is at the discretion of the operating system, so the OS may evict segments already in the page cache according to whatever policy it implements (LRU, LFU, etc.). From the application's perspective, we can only assume that any segment may be evicted once the page cache is full.
  2. Suppose we made it configurable to load only certain data sources into the page cache: say the host serves data sources X and Y, and only X is configured for loading. Unless Y is never queried at all, we cannot prevent Y from polluting the page cache used by X, because when Y is queried, the bytes needed from its segment index file are mmapped and must be loaded into the page cache by the OS, and if the page cache is full, some of X's segments have to be evicted first. A better way to address different data sources' needs, in my opinion, is to configure separate server pools: for a latency-sensitive data source X, use a cluster with a 1:1 ratio of available page cache RAM to segment size and enable this feature; for a non-latency-sensitive data source Y, use another cluster of historical nodes with a smaller page-cache-RAM-to-disk ratio and disable this feature, so on-demand page-in and page-out during mmap is performed. The user can still enable this feature for Y; it just won't be as effective.

This PR trades off guaranteed warm-up for availability: loading segments into the page cache is done asynchronously in a thread pool, so it neither delays historical process startup nor impacts how soon data is ready for query after download.
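
A minimal sketch of this announce-first, warm-later ordering (the executor setup and names such as warmupExec and segmentIndexFiles are illustrative, not the PR's actual code):

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BootstrapWarmup
{
  // Segments are announced immediately; warm-up runs in the background so
  // availability never waits on disk throughput. Warm-up is best effort.
  static void warmAsync(List<Path> segmentIndexFiles, int numThreads)
  {
    ExecutorService warmupExec = Executors.newFixedThreadPool(numThreads);
    for (Path indexFile : segmentIndexFiles) {
      warmupExec.submit(() -> {
        try {
          PageCacheWarmer.warmPageCache(indexFile); // sketch above
        }
        catch (IOException e) {
          // Best effort: a failed warm-up only costs page faults at query time.
        }
      });
    }
    warmupExec.shutdown();
  }
}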

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

segment.getId()
);
- loadSegment(segment, callback, config.isLazyLoadOnStart());
+ loadSegment(segment, callback, config.isLazyLoadOnStart(), loadSegmentsIntoPageCacheOnBootstrapExec);
Member

Since segment loading is synchronous whereas loading into the page cache is async, it might happen that the historical has announced itself while loadSegmentsIntoPageCacheOnBootstrapExec is still copying segments to the null stream, which can take time depending on the IO throughput of the disk. I thought the point of the PR was to warm up the page cache before the historical announces itself ready, and hence avoid warm-up delays and performance issues after reboot. Am I missing something here?

Contributor Author

Hi Parag, thanks for the comment! To clarify, the point of the PR is to provide best-effort precaching of segments, not a 100% guaranteed precache. The reason is that we don't want to sacrifice availability.

Imagine we have a data source configured with one replica and the host serving it dies; we then want the missing segments to be available ASAP on the replacement host. This is the case of loading segments onto a completely new host whose page cache holds none of the segments to load. Copying segments to the null stream takes time, and with a lot of segments it can easily take more than 10 minutes to complete the full read into the page cache, which is too long a period of unavailability. In production we tried synchronous loading before announcing the segment, and it was indeed too slow. Another case is restarting the Druid historical process on the same host: reading segments into the null stream is then quite fast, since the segments are already in the OS page cache, but it would still take some time compared with announcing segments directly after download.

That said, the strategy of announcing a segment immediately after download and asynchronously reading it into the null stream afterwards still has value. Without the change, since the OS only mmaps the portions of segments a query needs, even after days of a historical serving a segment, a query hitting a different portion of the segment will still trigger disk reads. With the change, we can ensure that roughly 10 minutes after segments are downloaded to a historical (depending on the number of segments), no subsequent queries will hit disk.

Member

I understand your point. I am just worried that while a historical is warming up, query performance will be worse than after a usual boot-up, since the process is already doing disk reads for the warm-up while doing more disk reads to serve queries.

Can we introduce a flag to enable synchronous warm-up, where the historical only announces itself after everything has been read once? When warm-up is enabled, the default behaviour would remain async as in this PR but could be switched to sync warm-up when boot-up time is not a concern. How does this sound to you? Thanks
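
For illustration only, a hypothetical shape such a flag could take in a Jackson-bound config class (the name loadSegmentsIntoPageCacheSynchronously is invented here, not part of this PR):

import com.fasterxml.jackson.annotation.JsonProperty;

public class PageCacheWarmupConfig
{
  // Hypothetical flag sketched from the discussion above. The default stays
  // false (async), matching the PR's current behaviour.
  @JsonProperty
  private boolean loadSegmentsIntoPageCacheSynchronously = false;

  public boolean isLoadSegmentsIntoPageCacheSynchronously()
  {
    return loadSegmentsIntoPageCacheSynchronously;
  }
}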

Contributor Author

Sounds good, thanks!

Member

@wjhypo any updates on when you can make the changes?

Member

Thanks

Member

gentle reminder

Contributor Author

@pjain1 Hi Parag, sorry about the delay. I still haven't had time to make the change and verify the sync approach in a test cluster, and I'm not sure any user actually needs the sync approach; we only verified the async approach in production. At this point it would be better if someone who needs the sync approach could make the change and dedicate some time to verifying it. It doesn't currently align with our priorities at work, so I won't be able to make the change this year.

Member

@wjhypo no issues, I think we can get it in in its current condition and work on the sync behaviour later.

A few questions: Do you run this in your prod env? How is the performance after the historical process starts and before warm-up is complete, given the process is doing double reads, warming up as well as serving queries? Do you see more timeouts during this time? Is it even usable then? Thanks

Member

@wjhypo any thoughts on this?

pjain1 (Member) commented Oct 29, 2021

@wjhypo can you fix the conflicts and review the comments, thanks

wjhypo (Contributor Author) commented Oct 30, 2021

@wjhypo can you fix the conflicts and review the comments, thanks

Addressed! Thanks!

lgtm-com bot commented Oct 30, 2021

This pull request introduces 1 alert when merging 3e53b21 into 33d9d9b - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

this.strategy = strategy;
log.info("Using storage location strategy: [%s]", this.strategy.getClass().getSimpleName());

if (this.config.getNumThreadsToLoadSegmentsIntoPageCacheOnDownload() != 0) {
pjain1 (Member) commented Jan 13, 2022

nit: Should add validation for negative values

Member

Without validation, historical startup will fail with an IllegalArgumentException and users will have to figure out the cause, which is also fine.
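
A minimal sketch of the suggested validation (placement is hypothetical; config here stands for the surrounding segment loader config):

int numThreads = config.getNumThreadsToLoadSegmentsIntoPageCacheOnDownload();
if (numThreads < 0) {
  throw new IllegalArgumentException(
      "numThreadsToLoadSegmentsIntoPageCacheOnDownload must be >= 0, got " + numThreads
  );
}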

// Start a temporary thread pool to load segments into page cache during bootstrap
ExecutorService loadingExecutor = null;
ExecutorService loadSegmentsIntoPageCacheOnBootstrapExec =
config.getNumThreadsToLoadSegmentsIntoPageCacheOnBootstrap() != 0 ?
pjain1 (Member) commented Jan 13, 2022

nit: Should add validation for negative values

nishantmonu51 (Member) left a comment

LGTM, I think this PR is in a mergeable state. Since the default behavior is to not load segments into the page cache, it won't affect any existing users.

The other minor nit-picks can be resolved in an addendum PR.

lgtm-com bot commented Mar 16, 2022

This pull request introduces 1 alert when merging c8ebd7e into 69f928f - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

lgtm-com bot commented Mar 22, 2022

This pull request introduces 1 alert when merging cfbce44 into 0867ca7 - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression


execToUse.submit(
() -> {
final ReferenceCountingLock lock = createOrGetLock(segment);
Member

@wjhypo can you add a test for this runnable? Travis is failing because of a code coverage issue. Once Travis passes, I will merge this PR.

Member

I have fixed some of the docs-related spell-check failures.

lgtm-com bot commented Mar 22, 2022

This pull request introduces 1 alert when merging 36ddd8b into d7308e9 - view on LGTM.com

new alerts:

  • 1 for Uncontrolled data used in path expression

pjain1 added a commit that referenced this pull request Apr 11, 2022
* Optionally load segment index files into page cache on bootstrap and new segment download

* Fix unit test failure

* Fix test case

* fix spelling

* fix spelling

* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
pjain1 (Member) commented Apr 11, 2022

done in #12402

pjain1 closed this Apr 11, 2022
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
* Optionally load segment index files into page cache on bootstrap and new segment download

* Fix unit test failure

* Fix test case

* fix spelling

* fix spelling

* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
Successfully merging this pull request may close these issues.

Segment Loading on historical node startup
