Optionally load segment index files into page cache on bootstrap and new segment download#11309
wjhypo wants to merge 7 commits into apache:master
Conversation
      segment.getId()
  );
- loadSegment(segment, callback, config.isLazyLoadOnStart());
+ loadSegment(segment, callback, config.isLazyLoadOnStart(), loadSegmentsIntoPageCacheOnBootstrapExec);
Since loading a segment is synchronous whereas loading it into the page cache is async, it might happen that the historical has announced itself while loadSegmentsIntoPageCacheOnBootstrapExec is still copying segments to a null stream, which can take time depending on the disk's IO throughput. I thought the point of the PR was to warm up the page cache before the historical announces itself ready, and hence avoid warm-up delays and performance issues after a reboot. Am I missing something here?
Hi Parag, thanks for the comment! To clarify, the point of the PR is to provide best-effort precaching of segments, not a 100% guaranteed precache. The reason is that we don't want to sacrifice availability.
Imagine we have a data source configured with one replica and the host serving it dies; we then want the missing segments to be available ASAP on a replacement host. This is the case of loading segments onto a completely new host whose page cache contains none of the segments to load. Copying segments to a null stream can take some time, and if you have a lot of segments it can easily take more than 10 minutes to complete the full read into the page cache, which is too long a period of unavailability. In production we tried loading synchronously before announcing the segment, and it was indeed too slow. Another case is when we just restart the Druid historical process on the same host: reading the segments into a null stream is then pretty fast because the OS already has them in the page cache, but it still takes some time compared with announcing segments directly after download.
That said, the strategy of announcing a segment immediately after download and asynchronously reading it into a null stream afterwards still has value: without the change, since the OS only maps in the portions of a segment a query touches, even after days of serving a segment on a historical, a query hitting a different portion of the segment will still trigger disk reads. With the change, we can ensure that after the first 10 minutes or so following segment download (depending on the number of segments), subsequent queries won't hit disk.
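The announce-then-warm pattern discussed above can be sketched roughly as follows (class and method names are hypothetical, not the PR's actual code). The key point is that the read loop simply discards the bytes: the read itself is what pulls the file's pages into the OS page cache.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the async warm-up idea: announce first, then stream
// the segment's files to a null sink on a background executor so the OS
// populates the page cache without delaying availability.
public class PageCacheWarmupSketch
{
  public static long readIntoNullStream(Path file) throws IOException
  {
    byte[] buf = new byte[1 << 16];
    long total = 0;
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        total += n; // bytes are discarded; the read populates the page cache
      }
    }
    return total;
  }

  public static void main(String[] args) throws Exception
  {
    Path tmp = Files.createTempFile("segment", ".bin");
    Files.write(tmp, new byte[4096]);
    // In the PR this submit would happen after the segment is announced.
    ExecutorService warmupExec = Executors.newFixedThreadPool(1);
    long read = warmupExec.submit(() -> readIntoNullStream(tmp)).get();
    warmupExec.shutdown();
    System.out.println("warmed bytes: " + read); // prints "warmed bytes: 4096"
    Files.delete(tmp);
  }
}
```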
I understand your point. I am just worried that while a historical is warming up, query performance will be worse than after a usual boot-up, since it is already doing disk reads for the warm-up and doing more disk reads to serve queries.
Can we introduce a flag to enable synchronous warm-up, where the historical only announces itself after everything has been read once? When warm-up is enabled, the default behaviour would be async as in this PR, but it could be switched to sync warm-up when boot-up time is not a concern. How does this sound to you? Thanks
@wjhypo any updates on when you can make the changes?
@pjain1 Hi Parag, sorry about the delay. I still haven't had time to make the change and verify the sync approach in a test cluster, and I'm not sure any user needs the sync approach; we only verified the async approach in production. At this point it would be better if someone who needs the sync approach could make the change and dedicate some time to verifying it. It currently doesn't align with our priorities at work, so I won't be able to make the change this year.
@wjhypo no issues, I think we can get it in in its current condition and work on the sync behaviour later.
A few questions: do you run this in your prod env? How is performance after the historical process starts and before warm-up is complete, since the process will be doing double reads, warming up as well as serving queries? Do you see more timeouts during this time? Is it even usable at that time? Thanks
@wjhypo can you fix the conflicts and review the comments, thanks
Addressed! Thanks!
This pull request introduces 1 alert when merging 3e53b21 into 33d9d9b (view on LGTM.com).
  this.strategy = strategy;
  log.info("Using storage location strategy: [%s]", this.strategy.getClass().getSimpleName());

  if (this.config.getNumThreadsToLoadSegmentsIntoPageCacheOnDownload() != 0) {
nit: Should add validation for negative values
Without validation, historical start-up will fail with an IllegalArgumentException and users will have to figure out the cause, which is also fine.
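The validation being suggested could look roughly like this (a hypothetical sketch, not the PR's actual code; only the config property name is taken from the getter in the diff). It turns the opaque executor failure into a named, self-explanatory error:

```java
// Hypothetical validation sketch; everything except the property name is
// illustrative.
public class ConfigValidationSketch
{
  public static int validateNumThreads(int numThreads)
  {
    if (numThreads < 0) {
      throw new IllegalArgumentException(
          "numThreadsToLoadSegmentsIntoPageCacheOnDownload must be >= 0, got " + numThreads
      );
    }
    return numThreads;
  }

  public static void main(String[] args)
  {
    System.out.println(validateNumThreads(4)); // prints 4
    try {
      validateNumThreads(-1);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```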
  // Start a temporary thread pool to load segments into page cache during bootstrap
  ExecutorService loadingExecutor = null;
  ExecutorService loadSegmentsIntoPageCacheOnBootstrapExec =
      config.getNumThreadsToLoadSegmentsIntoPageCacheOnBootstrap() != 0 ?
nit: Should add validation for negative values
nishantmonu51 left a comment
LGTM, I think this PR is in a mergeable state; since the default behavior is to not load segments into the page cache, it won't affect any existing users.
The other minor nit-picks can be resolved in an addendum PR.
  execToUse.submit(
      () -> {
        final ReferenceCountingLock lock = createOrGetLock(segment);
@wjhypo can you add a test for this runnable, Travis is failing because of a code coverage issue. Once Travis passes, I will merge this PR.
I have fixed some of the docs-related spell check failures
done in #12402
* Optionally load segment index files into page cache on bootstrap and new segment download
* Fix unit test failure
* Fix test case
* fix spelling
* fix spelling
* fix test and test coverage issues

Co-authored-by: Jian Wang <wjhypo@gmail.com>
Fixes #7374
Background
The background is covered in the above issue, but here is some more: when a query comes in, the historical process retrieves data from the segment index file through mmap. From the historical process's perspective, it operates on a byte array of the segment index file by providing an offset and length. The operating system is responsible for making sure the data at the referenced offset is in RAM. If it is not, the OS triggers a major page fault to load the page, or a series of pages containing the data, into RAM, which is a costly disk operation. On the other hand, if the data at the referenced offset is already in RAM, it may reside in the disk cache (or page cache, a synonym) part of RAM; in that case a minor page fault is triggered to link the page in the page cache to a page in Druid's virtual memory, which is a much lighter operation and doesn't involve a disk read.

Because of the way the OS manages the page cache, if a segment file is traversed via the read system call by any process (even a non-Druid process), the OS loads it into the page cache part of RAM if there is space left, so that later read system calls on the segment file only trigger minor page faults. For latency-sensitive use cases we want minimal disk reads at query time, so we can leverage this mechanism: read segment files when they are first downloaded onto the local disk of historical hosts, populating the page cache beforehand, in the hope that even the first query to a segment is fast when it actually arrives.
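The mmap access pattern described above can be illustrated with a tiny standalone program (hypothetical names, unrelated to Druid's code). Each offset-based access through the mapped buffer is served by the OS: a cheap minor fault if the page is already in the page cache, a costly major fault (disk read) if it is not.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal, self-contained illustration of reading a file through mmap, the
// way a historical reads segment index files.
public class MmapReadSketch
{
  public static int sumMapped(Path file) throws IOException
  {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      int sum = 0;
      for (int i = 0; i < buf.limit(); i++) {
        sum += buf.get(i); // offset-based access; the OS resolves the page fault
      }
      return sum;
    }
  }

  public static void main(String[] args) throws IOException
  {
    Path file = Files.createTempFile("index", ".bin");
    Files.write(file, new byte[]{1, 2, 3, 4});
    System.out.println("sum=" + sumMapped(file)); // prints "sum=10"
    Files.delete(file);
  }
}
```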
Description
This PR avoids the more complicated design from #7374. It addresses the use case of a cluster setup where the total segment size on local disk is smaller than the page cache usable in RAM, by adopting a simpler design: blindly read the index files of all segments for all data sources on a host into a null output stream, forcing the OS to load the segment files into the page cache, both during historical process bootstrap and on new segment download after bootstrap. Other setups with a smaller RAM-to-disk ratio are left to undefined behavior (they won't crash, just won't get the performance this PR intends).
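Based on the config getters visible in the diff, the feature would presumably be enabled with runtime properties along these lines (property names inferred from the accessors, so treat this as an illustration rather than authoritative documentation):

```properties
# historical runtime.properties (illustrative; 0, the default, disables the feature)
druid.segmentCache.numThreadsToLoadSegmentsIntoPageCacheOnDownload=2
druid.segmentCache.numThreadsToLoadSegmentsIntoPageCacheOnBootstrap=4
```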
When the total segment size on local disk is larger than the page cache usable in RAM, many complications arise, and many of them are out of the application's control, so this PR avoids that use case:
This PR trades warm-up completeness for availability: the whole job of loading segments into the page cache is done asynchronously in a thread pool, so as not to delay historical process startup and not to push back the ETA at which newly downloaded data is ready for query.
This PR has: