POC for DefaultListFilesCache #18855
- Implements a POC version of a default ListFilesCache
- Refactors the existing ListFilesCache to mirror the MetadataCache by defining a new trait instead of a fixed type wrapping a trait
- Bounds the size of the cache based on the number of entries
- Expires entries in the cache after a default timeout duration
| // TODO: config | ||
| 512 * 1024, | ||
| Duration::new(600, 0), |
This POC doesn't implement any of the user configuration. This seems like a good opportunity to divide the work on this effort! We could get the base DefaultListFilesCache approved for merge without user configuration, and leave it disabled, and user configuration could be added by anyone who wants to contribute.
| pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects | ||
| pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min |
The actual default values here probably need to be discussed. These seemed relatively sane to me, but any input here on what values these should have to best accommodate a variety of workflows would be useful feedback.
| pub struct DefaultListFilesCacheState { | ||
| lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>, | ||
| capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"? |
I didn't feel like limiting this cache by "bytes" really made sense because the data stored in the cache is generally very uniform in size, perhaps aside from the path. I felt that it was probably small enough that simply limiting it by the number of entries should suffice, and "entries" seems like it would be easier for users to configure.
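To make the "entries plus TTL" idea concrete, here is a minimal sketch of a cache bounded by entry count with time-based expiry. It is not the PR's implementation: `EntryBoundedCache` is a hypothetical name, keys are plain strings rather than `Path`, values are file names rather than `ObjectMeta`, and a `VecDeque` stands in for the `LruQueue`. The current time is passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the cache state discussed above.
pub struct EntryBoundedCache {
    /// Cached listing plus the time it was inserted.
    entries: HashMap<String, (Vec<String>, Instant)>,
    /// Keys ordered from least to most recently used.
    order: VecDeque<String>,
    /// Maximum number of entries (not bytes).
    capacity: usize,
    ttl: Duration,
}

impl EntryBoundedCache {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { entries: HashMap::new(), order: VecDeque::new(), capacity, ttl }
    }

    /// Move `key` to the most-recently-used end of the queue.
    fn touch(&mut self, key: &str) {
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_string());
    }

    /// Insert a listing, evicting least recently used entries while over capacity.
    pub fn put(&mut self, key: &str, files: Vec<String>, now: Instant) {
        self.entries.insert(key.to_string(), (files, now));
        self.touch(key);
        while self.entries.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.entries.remove(&oldest);
                }
                None => break,
            }
        }
    }

    /// Return the listing if it is present and younger than the TTL.
    /// Expired entries are dropped on access.
    pub fn get(&mut self, key: &str, now: Instant) -> Option<Vec<String>> {
        let expired = match self.entries.get(key) {
            Some((_, inserted)) => now.duration_since(*inserted) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            self.order.retain(|k| k.as_str() != key);
            return None;
        }
        self.touch(key);
        self.entries.get(key).map(|(files, _)| files.clone())
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a capacity of 2, inserting a third key evicts whichever key was used least recently, and any lookup after the TTL has elapsed behaves like a miss.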
| // TODO: driveby-cleanup | ||
| /// The cache accessor, users usually working on this interface while manipulating caches. |
I noticed this doc comment could be edited for additional clarity, so I figured while we were in this area of code we could improve this!
| /// See [`crate::runtime_env::RuntimeEnv`] for more details | ||
| pub type ListFilesCache = | ||
| Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>; | ||
| pub trait ListFilesCache: |
One thing I would like to do in the Cache Manager caches is to separate the eviction policy from the cache itself. Personally I think the user should be given an option to choose which eviction behaviour they want. wdyt @alamb @BlakeOrth ? I can work on getting a draft out this weekend.
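As a rough illustration of what a separated eviction policy could look like, here is a sketch in which the cache delegates the "which key goes next" decision to a trait. Everything here is hypothetical (`EvictionPolicy`, `Fifo`, `Lru`, `PolicyCache`); none of these names exist in the codebase, and keys and values are simplified to strings.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical trait: decides which key to drop when the cache is full.
pub trait EvictionPolicy {
    fn on_insert(&mut self, key: &str);
    fn on_access(&mut self, key: &str);
    fn evict(&mut self) -> Option<String>;
}

/// First in, first out: access order is ignored.
#[derive(Default)]
pub struct Fifo {
    queue: VecDeque<String>,
}

impl EvictionPolicy for Fifo {
    fn on_insert(&mut self, key: &str) {
        if !self.queue.iter().any(|k| k.as_str() == key) {
            self.queue.push_back(key.to_string());
        }
    }
    fn on_access(&mut self, _key: &str) {}
    fn evict(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}

/// Least recently used: every insert or access moves the key to the back.
#[derive(Default)]
pub struct Lru {
    queue: VecDeque<String>,
}

impl EvictionPolicy for Lru {
    fn on_insert(&mut self, key: &str) {
        self.on_access(key);
    }
    fn on_access(&mut self, key: &str) {
        self.queue.retain(|k| k.as_str() != key);
        self.queue.push_back(key.to_string());
    }
    fn evict(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}

/// A cache that is generic over its eviction policy.
pub struct PolicyCache<P: EvictionPolicy> {
    entries: HashMap<String, Vec<String>>,
    policy: P,
    capacity: usize,
}

impl<P: EvictionPolicy> PolicyCache<P> {
    pub fn new(policy: P, capacity: usize) -> Self {
        Self { entries: HashMap::new(), policy, capacity }
    }

    pub fn put(&mut self, key: &str, files: Vec<String>) {
        self.entries.insert(key.to_string(), files);
        self.policy.on_insert(key);
        // Ask the policy for victims until we are back under capacity.
        while self.entries.len() > self.capacity {
            match self.policy.evict() {
                Some(victim) => {
                    self.entries.remove(&victim);
                }
                None => break,
            }
        }
    }

    pub fn get(&mut self, key: &str) -> Option<&Vec<String>> {
        if self.entries.contains_key(key) {
            self.policy.on_access(key);
        }
        self.entries.get(key)
    }
}
```

The same cache body then behaves differently depending on the policy passed to `PolicyCache::new`, which is the kind of user choice described above.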
| table_files_statistics_cache: Default::default(), | ||
| list_files_cache: Default::default(), | ||
| list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT, | ||
| list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL, |
Some use cases don't need a TTL; we should provide a way to disable it as well.
Well, thinking about it more, I understand why you have kept it. I feel it diverges from the metadata cache, though, and could confuse end users somewhat.
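One possible way to make the TTL optional is to model it as `Option<Duration>`, where `None` means entries never expire. This is only a sketch of the idea; `is_expired` is a hypothetical helper, not something in the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: `ttl == None` disables expiry entirely.
pub fn is_expired(inserted: Instant, now: Instant, ttl: Option<Duration>) -> bool {
    match ttl {
        // An entry expires once its age reaches the configured TTL.
        Some(ttl) => now.duration_since(inserted) >= ttl,
        // No TTL configured: the entry only leaves the cache via eviction.
        None => false,
    }
}
```

A cache configured with `None` would then rely purely on the entry limit, which is closer to how a cache without time-based expiry behaves.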
Which issue does this PR close?
POC for: `ListFilesCache` implementation for the `ListingTable` #18827

Rationale for this change
This is a POC for initial feedback and is not intended for merge at this time.

What changes are included in this PR?

Are these changes tested?
This code is functional and some tests clearly show a reduction in Object Store requests! However, existing tests are broken around `INSERT` commands, which is a key point of discussion that needs to be covered.

Are there any user-facing changes?
Yes, this work will likely break the existing `ListFilesCache` public API.

Additional Context
This PR is a basic functional implementation which heavily mirrors the existing `MetadataCache` and its semantics. One very key omission here that needs to be discussed is how `INSERT` statements are handled. On the surface it seems like there are two options:
- `INSERT` statement
- `INSERT` statement and add them to the cache

The first option here seems much easier, but the 2nd option seems more ideal since a user is likely to issue a query against newly inserted data. Any input here, or other strategies I haven't thought of to handle inserts, would be great!

I will also leave some inline comments around some `TODO` items that I think should be discussed.

cc @alamb @alchemist51
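To help frame the `INSERT` discussion, here is a small sketch contrasting two strategies. It assumes the options amount to (1) dropping the cached listing for a table when an `INSERT` runs and (2) adding the files written by the `INSERT` to the cached listing; that reading, along with the names `ListingCache`, `invalidate`, and `append_written`, is an assumption for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified listing cache keyed by table path.
#[derive(Default)]
pub struct ListingCache {
    entries: HashMap<String, Vec<String>>,
}

impl ListingCache {
    pub fn put(&mut self, table: &str, files: Vec<String>) {
        self.entries.insert(table.to_string(), files);
    }

    pub fn get(&self, table: &str) -> Option<&Vec<String>> {
        self.entries.get(table)
    }

    /// Assumed strategy 1: forget the listing, so the next query
    /// has to list the object store again.
    pub fn invalidate(&mut self, table: &str) {
        self.entries.remove(table);
    }

    /// Assumed strategy 2: keep the listing and append the newly written
    /// files. Does nothing if the table was never cached.
    pub fn append_written(&mut self, table: &str, new_files: &[String]) {
        if let Some(files) = self.entries.get_mut(table) {
            files.extend_from_slice(new_files);
        }
    }
}
```

Under these assumptions, the first strategy costs one extra listing request after each insert, while the second keeps the cache warm but requires the write path to report exactly which files it produced.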