Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6141: [C++] Enable memory-mapping a file region #5101

Closed
wants to merge 1 commit into from

Conversation

zhouyuan
Copy link
Contributor

@zhouyuan zhouyuan commented Aug 16, 2019

This patch adds an Open() method for MemoryMappedFile() with
length and offset params. In this way user can memory map a file
region just like mmap(). The new API is:

  • MemoryMappedFile::Open(path, mode, length, offset, &mmap)

The original API is still available. Calling the original API
will memory map the whole file:

  • MemoryMappedFile::Open(path, mode, &mmap)

A new field map_len_ is added in MemoryMappedFile::MemoryMap to
track the real memory map length.

Also MemoryMappedFile::Read()/ReadAt()/Write()/WriteAt() are changed
to check the memory map length if it's a region based memory map.

Note the MemoryMappedFile::Resize() is not supported if it's a
region based memory map.

Signed-off-by: Yuan Zhou yuan.zhou@intel.com

@zhouyuan zhouyuan force-pushed the wip_6141 branch 2 times, most recently from 52ac371 to 8069940 Compare August 16, 2019 08:16
@fsaintjacques fsaintjacques changed the title RROW-6141: [C++] Enable memory-mapping a file region ARROW-6141: [C++] Enable memory-mapping a file region Aug 16, 2019
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Remove the const for all off_t type.

@@ -183,10 +183,12 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface {

/// Create new file with indicated size, return in read/write mode
static Status Create(const std::string& path, int64_t size,
std::shared_ptr<MemoryMappedFile>* out);
std::shared_ptr<MemoryMappedFile>* out,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store the offset in a property for debugging/reference purpose.

@@ -412,7 +412,7 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {

bool closed() const { return !file_->is_open(); }

Status Open(const std::string& path, FileMode::type mode) {
Status Open(const std::string& path, FileMode::type mode, const off_t offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a check that the offset is a multiple of PAGE_SIZE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add this logic. Thanks!

std::shared_ptr<MemoryMappedFile>* mmap) {
RETURN_NOT_OK(MemoryMappedFile::Create(path, size, mmap));
std::shared_ptr<MemoryMappedFile>* mmap,
const off_t offset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to run clang-format

Suggested change
const off_t offset) {
const off_t offset) {

@@ -47,7 +47,8 @@ class ARROW_EXPORT MemoryMapFixture {
void CreateFile(const std::string& path, int64_t size);

Status InitMemoryMap(int64_t size, const std::string& path,
std::shared_ptr<MemoryMappedFile>* mmap);
std::shared_ptr<MemoryMappedFile>* mmap,
const off_t offset = 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style quibble: adding an optional parameter after the output parameter, which should be last. Could you add this as an overload (for example Status InitMemoryMap(int64_t offset, int64_t size, const std::string& path, std::shared_ptr<MemoryMappedFile>* mmap))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, adding a overload seems more clean. Will add this in the next push. Thanks!


static Status Open(const std::string& path, FileMode::type mode,
std::shared_ptr<MemoryMappedFile>* out);
std::shared_ptr<MemoryMappedFile>* out,
const off_t offset = 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use int64_t for the off_t is a platform-specific type.

I also agree with Ben about adding an overload Open(path, mode, offset, &mmap)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wesm we may need to add another param length here, with this user could memory map a portion of the file just like the semantics of mmap().

  • MemoryMappedFile::Open(path, mode, &mmap)
  • MemoryMappedFile::Open(path, mode, length, offset, &mmap)

In the 2nd case, since only part of the file is mmaped, I think we may need to make MemoryMappedFile::GetSize() return length instead of the size_ of the corresponding file.
Does this looks like the right approach to you?

@@ -183,10 +183,12 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface {

/// Create new file with indicated size, return in read/write mode
static Status Create(const std::string& path, int64_t size,
std::shared_ptr<MemoryMappedFile>* out);
std::shared_ptr<MemoryMappedFile>* out,
const off_t offset = 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding offset to this function is a little bit odd. Is there an immediate use case for this (I would think that mapping a portion of a pre-existing file would be the main thing)

@zhouyuan zhouyuan force-pushed the wip_6141 branch 2 times, most recently from 282cb95 to f676167 Compare August 20, 2019 13:52
@zhouyuan zhouyuan force-pushed the wip_6141 branch 4 times, most recently from f57d6c4 to 2e15766 Compare August 26, 2019 06:23
@pitrou
Copy link
Member

pitrou commented Aug 27, 2019

@zhouyuan Is there a reason this PR allows passing an offset and not a length? Intuitively, if you want to map a file region, you should be able to pass both.

@zhouyuan
Copy link
Contributor Author

zhouyuan commented Aug 28, 2019

@zhouyuan Is there a reason this PR allows passing an offset and not a length? Intuitively, if you want to map a file region, you should be able to pass both.

@pitrou thanks for the look, yes I was trying to add the missing the length param. #5101 (comment)
This would allow us to have the same semantics of mmap(). There're also some changes needed on some APIs like Resize() - but I'm not sure if this is a correct approach. I'll push the code out for review first

@zhouyuan zhouyuan force-pushed the wip_6141 branch 2 times, most recently from 81f9246 to 4dc5d13 Compare August 28, 2019 07:45
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Still a couple comments.

cpp/src/arrow/io/file.cc Outdated Show resolved Hide resolved
map_mode_, file_->fd(), 0);

size_t mmap_length = static_cast<size_t>(initial_size);
if (length > 0 && length < initial_size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If length has an invalid value (e.g. greater than the file size), we should return an error instead of silently creating a smaller map, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -484,6 +491,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {

int64_t size() const { return size_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still useful to expose this? Perhaps size() should simply return map_len_?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou thanks for the careful review, I'm uncertain on this - is it possible some workload only memory map a file region, but still want to check the whole file size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. But GetSize should return the amount of data that's readable through the file region, not the entire size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, will make GetSize() return the memory map data length and change the corresponding tests.

static Status Open(const std::string& path, FileMode::type mode,
std::shared_ptr<MemoryMappedFile>* out);

// mmap() with a region of file, the offset must be a multiple of the page size
static Status Open(const std::string& path, FileMode::type mode, const int64_t length,
const int64_t offset, std::shared_ptr<MemoryMappedFile>* out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would favor (offset, length) rather than (length, offset) in the signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ASSERT_RAISES(IOError, result->Resize(4096));

// Write beyond memory mapped length
ASSERT_RAISES(Invalid, result->WriteAt(4096, buffer.data(), buffer_size));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you check GetSize() and Seek() as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests on GetSize(), Seek(), Tell() added.

@pitrou
Copy link
Member

pitrou commented Aug 29, 2019

Thanks for the changes @zhouyuan . The GetSize semantics still need to be changed, then I think it will be good.

This patch adds an Open() method for MemoryMappedFile() with
length and offset params. In this way user can memory map a file
region just like mmap(). The new API is:

* MemoryMappedFile::Open(path, mode, length, offset, &mmap)

The original API is still available. Calling the original API
will memory map the whole file:

* MemoryMappedFile::Open(path, mode, &mmap)

A new field map_len_ is added in MemoryMappedFile::MemoryMap to
track the real memory map length.

Also MemoryMappedFile::Read()/ReadAt()/Write()/WriteAt() are changed
to check the memory map length if it's a region based memory map.

Note the MemoryMappedFile::Resize() is not supported if it's a
region based memory map.

Signey-off-by: Yuan Zhou <yuan.zhou@intel.com>
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
@codecov-io
Copy link

Codecov Report

Merging #5101 into master will increase coverage by 1.56%.
The diff coverage is 98.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5101      +/-   ##
==========================================
+ Coverage   87.64%   89.21%   +1.56%     
==========================================
  Files        1030      747     -283     
  Lines      148327   107494   -40833     
  Branches     1437        0    -1437     
==========================================
- Hits       129995    95896   -34099     
+ Misses      17970    11598    -6372     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/io/file_test.cc 95.45% <100%> (+0.2%) ⬆️
cpp/src/arrow/io/file.cc 97.68% <96.15%> (+0.12%) ⬆️
cpp/src/arrow/filesystem/s3_internal.h 95.12% <0%> (-4.88%) ⬇️
python/pyarrow/tests/test_parquet.py 96.5% <0%> (-0.13%) ⬇️
js/src/util/fn.ts
go/arrow/memory/memory_avx2_amd64.go
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 278 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bcf5897...030994b. Read the comment docs.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, than you @zhouyuan !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants