-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read from blob cache first when MultiGetBlob() #10225
Conversation
size_t num_blobs = user_keys.size(); | ||
assert(num_blobs > 0); | ||
assert(num_blobs <= MultiGetContext::MAX_BATCH_SIZE); | ||
assert(num_blobs == offsets.size()); | ||
assert(num_blobs == value_sizes.size()); | ||
assert(num_blobs == statuses.size()); | ||
assert(num_blobs == blobs.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider passing a single autovector
of struct
s instead of a separate one for each of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// A read Blob request structure for use in MultiGetBlob
struct ReadRequest {
// User key to lookup the paired blob
const Slice* user_key;
// File offset in bytes
uint64_t offset;
// Length to read in bytes
size_t len;
// Output parameter set by MultiGetBlob() to point to the data buffer, and
// the number of valid bytes
PinnableSlice* result;
// Status of read
Status* status;
};
std::array<ReadRequest, num_blobs> request_buf;
autovector<ReadRequest*> requests;
Sounds like we can't do this in VersionSet
's MultiGetBlob() since num_blobs
is not statically available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, std::array
requires a size that's available at compile time. You could use an std::vector
or, even better, an autovector
though.
P.S. Not sure about the need for request_buf
. Any reason autovector<ReadRequest> requests
would not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autovector<ReadRequest> requests
looks good. We need to copy one more time when calling file_reader's MultiGetBlob
(I believe we also need to change its MultiGetBlob
). But, the cost of the copy is accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will also need to modify Version
, we can do it in the same new PR. Added: #10156 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the name BlobReadRequest
already be used by @riversand963. Let me know if you have a better name for it.
using BlobReadRequest =
std::pair<BlobIndex, std::reference_wrapper<const KeyContext>>;
using BlobReadRequests = std::vector<BlobReadRequest>;
void MultiGetBlob(const ReadOptions& read_options, MultiGetRange& range,
std::unordered_map<uint64_t, BlobReadRequests>& blob_rqs);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to rename the types if the proposal is better
@@ -152,6 +153,140 @@ Status BlobSource::GetBlob(const ReadOptions& read_options, | |||
return s; | |||
} | |||
|
|||
void BlobSource::MultiGetBlob( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it would make sense to push more of the logic into BlobSource
, like the logic that sorts blobs by offset for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. Since I also need to modify 'VersionSet`'s MultiGetBlob and add new end-to-end test cases, we can put them in a new PR.
Added to my to-do list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added: #10156 (comment)
void BlobSource::MultiGetBlob( | ||
const ReadOptions& read_options, | ||
const autovector<std::reference_wrapper<const Slice>>& user_keys, | ||
uint64_t file_number, uint64_t file_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we will support MultiGetBlob()
from multiple files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think MultiGetBlob() is to get multiple blobs from same file. For multiple files, we have an upper level function for that in VersionSet:
Line 1931 in a16e2ff
const uint64_t blob_file_number = elem.first; |
Version::MultiGetBlob will call BlobSource::MultiGetBlob multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a very good question, shall we put Version::MultiGetBlob's logic into a new BlobSource::MultiGetBlob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since on the Version
layer, we already support reading multiple blobs from multiple files in MultiGetBlob()
, I think we should continue to support this.
How do you think of the following
Version::MultiGetBlob(...) // multiple files multiple blobs
-> Version::MultiGetBlobFromOneFile()
-> BlobSource::MultiGetBlob() // one file, multiple blobs
or
Version::MultiGetBlob(...) // multiple files multiple blobs
-> BlobSource::MultiGetBlob() // multiple files multiple blobs
-> BlobSource::MultiGetBlobFromOneFile() // one file, multiple blobs
Seems the latter is better since, by definition, BlobSource
also has information about multiple blob files, thus we can push the logic into this layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this. I think this is another piece that could be pushed down to BlobSource
. (Well, most of it I guess: it would make sense to keep e.g. the blob file number validity check in Version
.)
EDIT: I think @riversand963 's second suggestion would be the way to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right. Let's make a PR for cleanup version and add new api MultiGetBlobFromOneFile
. I don't want to do another huge PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added: #10156 (comment)
@gangliao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks! Let's make those changes we discussed in the next PR
s = GetBlobFromCache(key, &blob_entry); | ||
if (s.ok() && blob_entry.GetValue()) { | ||
assert(statuses[i]); | ||
assert(blob_entry.GetValue()->size() == value_sizes[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the record, we'll probably have to adjust/remove this assertion (as discussed, it doesn't hold for compressed blobs)
Summary: Update HISTORY.md for blob cache. Implementation can be found from Github issue #10156 (or Github PRs #10155, #10178, #10225, #10198, and #10272). Pull Request resolved: #10328 Reviewed By: riversand963 Differential Revision: D37732514 Pulled By: gangliao fbshipit-source-id: 4c942a41c07914bfc8db56a0d3cf4d3e53d5963f
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
Test Plan:
Add test cases for MultiGetBlob
Tasks:
In this task, we added the new API MultiGetBlob() for BlobSource.
This PR is a part of #10156