ARROW-6141: [C++] Enable memory-mapping a file region #5101
- Remove the const qualifier from all off_t parameters.
cpp/src/arrow/io/file.h
Outdated
@@ -183,10 +183,12 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface {

/// Create new file with indicated size, return in read/write mode
static Status Create(const std::string& path, int64_t size,
                     std::shared_ptr<MemoryMappedFile>* out);
                     std::shared_ptr<MemoryMappedFile>* out,
Store the offset in a property for debugging/reference purposes.
cpp/src/arrow/io/file.cc
Outdated
@@ -412,7 +412,7 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {

bool closed() const { return !file_->is_open(); }

Status Open(const std::string& path, FileMode::type mode) {
Status Open(const std::string& path, FileMode::type mode, const off_t offset) {
I'd add a check that the offset is a multiple of PAGE_SIZE.
Will add this logic. Thanks!
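For reference, the alignment check suggested above could look roughly like the sketch below. It assumes a POSIX system where sysconf(_SC_PAGESIZE) reports the page size. The helper names GetPageSize and IsPageAligned are hypothetical and are not part of Arrow or of this PR.

```cpp
#include <cassert>
#include <cstdint>

#include <unistd.h>  // sysconf, _SC_PAGESIZE (POSIX only)

// Hypothetical helper: query the system page size at runtime.
inline int64_t GetPageSize() {
  const long size = sysconf(_SC_PAGESIZE);
  assert(size > 0);  // sysconf returns -1 on error
  return static_cast<int64_t>(size);
}

// Hypothetical helper: true when `offset` is non-negative and a multiple
// of `page_size`.
inline bool IsPageAligned(int64_t offset, int64_t page_size) {
  return page_size > 0 && offset >= 0 && offset % page_size == 0;
}
```

A region-based Open() could call IsPageAligned(offset, GetPageSize()) up front and return an error Status when it is false, rather than relying on mmap() to reject the offset.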
cpp/src/arrow/io/test-common.cc
Outdated
std::shared_ptr<MemoryMappedFile>* mmap) {
RETURN_NOT_OK(MemoryMappedFile::Create(path, size, mmap));
std::shared_ptr<MemoryMappedFile>* mmap,
const off_t offset) {
I think you need to run clang-format on this line:
const off_t offset) {
cpp/src/arrow/io/test-common.h
Outdated
@@ -47,7 +47,8 @@ class ARROW_EXPORT MemoryMapFixture {
void CreateFile(const std::string& path, int64_t size);

Status InitMemoryMap(int64_t size, const std::string& path,
                     std::shared_ptr<MemoryMappedFile>* mmap);
                     std::shared_ptr<MemoryMappedFile>* mmap,
                     const off_t offset = 0);
Code style quibble: this adds an optional parameter after the output parameter, which should be last. Could you add this as an overload instead, for example Status InitMemoryMap(int64_t offset, int64_t size, const std::string& path, std::shared_ptr<MemoryMappedFile>* mmap)?
Agreed, adding an overload seems cleaner. Will add this in the next push. Thanks!
cpp/src/arrow/io/file.h
Outdated

static Status Open(const std::string& path, FileMode::type mode,
                   std::shared_ptr<MemoryMappedFile>* out);
                   std::shared_ptr<MemoryMappedFile>* out,
                   const off_t offset = 0);
I think it's better to use int64_t, since off_t is a platform-specific type.
I also agree with Ben about adding an overload Open(path, mode, offset, &mmap).
Hi @wesm, we may need to add another param, length, here. With this, the user could memory-map a portion of the file, just like the semantics of mmap().
- MemoryMappedFile::Open(path, mode, &mmap)
- MemoryMappedFile::Open(path, mode, length, offset, &mmap)
In the second case, since only part of the file is mmapped, I think we may need to make MemoryMappedFile::GetSize() return length instead of the size_ of the corresponding file.
Does this look like the right approach to you?
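To make the mmap() semantics referred to here concrete, the following is a minimal POSIX-only sketch that maps just one region of a file and copies it out. The function name ReadRegionViaMmap is hypothetical and this is not the code in this PR. It also assumes the caller has already checked that the region lies within the file, because touching mapped pages past the end of the file can raise SIGBUS.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <unistd.h>    // close

// Hypothetical helper: map `length` bytes of `path` starting at `offset`
// and copy them into `out`. Returns false on any failure.
bool ReadRegionViaMmap(const std::string& path, int64_t offset, int64_t length,
                       std::string* out) {
  if (offset < 0 || length <= 0) return false;
  const int fd = open(path.c_str(), O_RDONLY);
  if (fd < 0) return false;
  void* addr = mmap(nullptr, static_cast<size_t>(length), PROT_READ, MAP_SHARED,
                    fd, static_cast<off_t>(offset));
  close(fd);  // the mapping remains valid after the descriptor is closed
  if (addr == MAP_FAILED) return false;
  out->assign(static_cast<const char*>(addr), static_cast<size_t>(length));
  munmap(addr, static_cast<size_t>(length));
  return true;
}
```

Only `length` bytes are visible through the mapping, which is why a GetSize() that reports the whole file size would be misleading for a region-based map.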
cpp/src/arrow/io/file.h
Outdated
@@ -183,10 +183,12 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface {

/// Create new file with indicated size, return in read/write mode
static Status Create(const std::string& path, int64_t size,
                     std::shared_ptr<MemoryMappedFile>* out);
                     std::shared_ptr<MemoryMappedFile>* out,
                     const off_t offset = 0);
Adding offset to this function is a little bit odd. Is there an immediate use case for this? (I would think that mapping a portion of a pre-existing file would be the main thing.)
@zhouyuan Is there a reason this PR allows passing an offset and not a length? Intuitively, if you want to map a file region, you should be able to pass both.
@pitrou thanks for the look, yes I was trying to add the missing the |
Thanks for the update. Still a couple of comments.
cpp/src/arrow/io/file.cc
Outdated
map_mode_, file_->fd(), 0);

size_t mmap_length = static_cast<size_t>(initial_size);
if (length > 0 && length < initial_size) {
If length has an invalid value (e.g. greater than the file size), we should return an error instead of silently creating a smaller map, IMHO.
done
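The kind of validation requested above might look like the following sketch. RegionStatus is a stand-in for arrow::Status so that the example compiles on its own, and ValidateRegion is a hypothetical name. The actual check in the PR may differ.

```cpp
#include <cstdint>
#include <string>

// Stand-in for arrow::Status, used only in this sketch.
struct RegionStatus {
  bool ok;
  std::string message;
};

// Hypothetical check: reject a region that is empty, negative, or not fully
// contained in a file of `file_size` bytes, instead of silently shrinking it.
RegionStatus ValidateRegion(int64_t offset, int64_t length, int64_t file_size) {
  if (offset < 0 || length <= 0) {
    return {false, "offset must be >= 0 and length must be > 0"};
  }
  // Written as a subtraction so that offset + length cannot overflow.
  if (offset > file_size || length > file_size - offset) {
    return {false, "requested region exceeds the file size"};
  }
  return {true, ""};
}
```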
cpp/src/arrow/io/file.cc
Outdated
@@ -484,6 +491,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {

int64_t size() const { return size_; }
Is it still useful to expose this? Perhaps size() should simply return map_len_?
@pitrou thanks for the careful review. I'm uncertain about this: is it possible that some workloads memory-map only a file region but still want to check the whole file size?
I don't know. But GetSize should return the amount of data that's readable through the file region, not the entire size.
I see, will make GetSize() return the memory map data length and change the corresponding tests.
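The size semantics agreed on here can be illustrated with a small stand-alone model. RegionView is a hypothetical class, not Arrow code: a vector plays the role of the mapped bytes, GetSize() reports the mapped length, and the whole-file size is kept only as a separate accessor. Clamping short reads at the end of the region is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical model of a region-based memory map.
class RegionView {
 public:
  RegionView(std::vector<uint8_t> mapped, int64_t file_size)
      : mapped_(std::move(mapped)), file_size_(file_size) {}

  // Amount of data readable through the region, not the entire file.
  int64_t GetSize() const { return static_cast<int64_t>(mapped_.size()); }

  // Size of the underlying file, kept for reference only.
  int64_t file_size() const { return file_size_; }

  // Copies up to `nbytes` starting at `position` (relative to the region).
  // Returns the number of bytes copied, or -1 if the arguments are invalid.
  int64_t ReadAt(int64_t position, int64_t nbytes, uint8_t* out) const {
    if (position < 0 || nbytes < 0 || position > GetSize()) return -1;
    const int64_t n = std::min(nbytes, GetSize() - position);
    if (n > 0) std::memcpy(out, mapped_.data() + position, static_cast<size_t>(n));
    return n;
  }

 private:
  std::vector<uint8_t> mapped_;
  int64_t file_size_;
};
```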
cpp/src/arrow/io/file.h
Outdated
static Status Open(const std::string& path, FileMode::type mode,
                   std::shared_ptr<MemoryMappedFile>* out);

// mmap() with a region of file, the offset must be a multiple of the page size
static Status Open(const std::string& path, FileMode::type mode, const int64_t length,
                   const int64_t offset, std::shared_ptr<MemoryMappedFile>* out);
I would favor (offset, length) rather than (length, offset) in the signature.
done
ASSERT_RAISES(IOError, result->Resize(4096));

// Write beyond memory mapped length
ASSERT_RAISES(Invalid, result->WriteAt(4096, buffer.data(), buffer_size));
Should you check GetSize() and Seek() as well?
Tests on GetSize(), Seek(), Tell() added.
Thanks for the changes @zhouyuan . The |
This patch adds an Open() method for MemoryMappedFile with length and offset params, so that users can memory-map a file region just like mmap(). The new API is:
* MemoryMappedFile::Open(path, mode, length, offset, &mmap)
The original API is still available. Calling the original API memory-maps the whole file:
* MemoryMappedFile::Open(path, mode, &mmap)
A new field map_len_ is added in MemoryMappedFile::MemoryMap to track the real memory map length.
Also, MemoryMappedFile::Read()/ReadAt()/Write()/WriteAt() are changed to check the memory map length if it's a region-based memory map.
Note that MemoryMappedFile::Resize() is not supported if it's a region-based memory map.
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Codecov Report
@@            Coverage Diff             @@
##           master    #5101      +/-   ##
==========================================
+ Coverage   87.64%   89.21%   +1.56%
==========================================
  Files        1030      747     -283
  Lines      148327   107494   -40833
  Branches     1437        0    -1437
==========================================
- Hits       129995    95896   -34099
+ Misses      17970    11598    -6372
+ Partials      362        0     -362
+1, thank you @zhouyuan!