@pknowles pknowles commented Feb 7, 2025

The main point of this change is to avoid msync() when resizing the
file. It turns out remapping is not even needed.

Disclaimer:

If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added
or removed portions of the file is unspecified.

On Linux, such an access raises SIGBUS, but by contract of the API the user
shouldn't be writing past the resized area anyway.
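
A minimal sketch of the behavior described above, assuming Linux and a filesystem where `/tmp` is writable. The helper name `growWithoutRemap` is hypothetical and is not taken from the PR; it maps more bytes than an empty file holds, grows the file with `ftruncate()`, then writes through the original mapping and reads the byte back with `pread()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper (not part of the library). Returns true if a byte
// written through the over-sized mapping is visible in the file after the
// file has been grown with ftruncate() and no remapping has happened.
inline bool growWithoutRemap(std::size_t capacity, std::size_t newSize) {
    char path[] = "/tmp/overmapXXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return false;
    unlink(path); // the file stays alive until fd is closed

    // Map `capacity` bytes although the file is currently 0 bytes long.
    void* addr = mmap(nullptr, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return false;
    }

    bool ok = false;
    if (newSize > 0 && newSize <= capacity &&
        ftruncate(fd, static_cast<off_t>(newSize)) == 0) {
        // Only touch bytes inside the new file size; beyond it, Linux
        // may deliver SIGBUS.
        char* bytes = static_cast<char*>(addr);
        bytes[newSize - 1] = 'x';
        char readBack = 0;
        ok = pread(fd, &readBack, 1, static_cast<off_t>(newSize - 1)) == 1 &&
             readBack == 'x';
    }
    munmap(addr, capacity);
    close(fd);
    return ok;
}
```

Calling `growWithoutRemap(1 << 16, 4096)` exercises the grow-then-write path. The sketch deliberately never writes past `newSize`, matching the API contract mentioned above.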

An alternative could be to reserve address space with an anonymous, private
mapping and mmap() just the grown region, like ResizableMappedMemory does.
However, mmap() is quite slow, even incrementally.
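
For comparison, the alternative could look roughly like the sketch below. This is an illustration under assumptions, not the code of ResizableMappedMemory: the function name `growByMappingIncrementally` is hypothetical, the reservation uses `PROT_NONE`, and each growth step maps exactly one page with `MAP_FIXED` over the reserved range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical illustration: reserve address space once, then map the file
// into it one page at a time as the file grows. Returns true if every page
// written through the mapping can be read back from the file.
inline bool growByMappingIncrementally(std::size_t pages) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t capacity = pages * page;

    char path[] = "/tmp/reserveXXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return false;
    unlink(path);

    // Reserve the whole range without backing it by the file.
    void* reserved = mmap(nullptr, capacity, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (reserved == MAP_FAILED) {
        close(fd);
        return false;
    }
    char* base = static_cast<char*>(reserved);

    bool ok = true;
    for (std::size_t i = 0; ok && i < pages; ++i) {
        const std::size_t offset = i * page;
        // Grow the file first, then map only the newly added page.
        ok = ftruncate(fd, static_cast<off_t>(offset + page)) == 0 &&
             mmap(base + offset, page, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_FIXED, fd,
                  static_cast<off_t>(offset)) == static_cast<void*>(base + offset);
        if (ok) {
            base[offset] = static_cast<char>('a' + i % 26);
            char readBack = 0;
            ok = pread(fd, &readBack, 1, static_cast<off_t>(offset)) == 1 &&
                 readBack == base[offset];
        }
    }
    munmap(base, capacity);
    close(fd);
    return ok;
}
```

Each growth step here costs one extra `mmap()` call, which is the overhead the PR description is trying to avoid.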

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability and consistency of file resizing and memory mapping operations on Linux.
  • Tests
    • Simplified and reorganized test cases for mapped files, reducing fixture usage.
    • Added new tests for memory mapping edge cases and resizable file behavior, including tests for empty and cleared files.
    • Added Linux-specific tests for page residency after truncation and decommit operations.
  • Documentation
    • Updated README with details on Linux implementation of resizable files using memory mapping and file truncation without remapping.

@coderabbitai
coderabbitai bot commented Feb 7, 2025

Walkthrough

The changes update the Linux implementation of the ResizableMappedFile class to simplify memory mapping logic by mapping the entire file-backed range up front and managing file size separately. Corresponding test cases are refactored to reduce fixture usage, add Linux-specific edge case tests for memory residency and overmapping, and introduce new tests for resizable file behaviors including empty and cleared files. The README was updated to document the Linux-specific truncation approach.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| ResizableMappedFile Linux Implementation: include/decodeless/detail/mappedfile_linux.hpp | Refactored ResizableMappedFile to map the full file-backed range at construction, remove the reserved anonymous mapping, store file size in a new member, and simplify the resize() logic. Removed the map() helper and optional mapping, replaced with a direct MemoryMapRW member. Updated accessors and move assignment operator. Updated comments to match new mapping strategy relying on truncation and SIGBUS for bounds detection. |
| Mapped File Tests: test/src/mappedfile.cpp | Refactored tests to reduce fixture usage, converting several to standalone TEST macros. Added Linux-specific tests for overmapping and resizing behaviors, including writing to truncated mappings and verifying file contents. Added tests for empty and cleared resizable files. Introduced page residency checks using mincore in new tests for truncation and decommit scenarios. Removed disabled TODO comments and reformatted some existing tests. |
| Documentation Update: README.md | Added a note under "Notes" describing the Linux implementation detail of resizable_file that maps more memory than the current file size and truncates without remapping, noting this approach is simple, fast, and works in practice despite lack of official man page support. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ResizableMappedFile
    participant OS/FileSystem

    User->>ResizableMappedFile: construct(path, maxSize)
    ResizableMappedFile->>OS/FileSystem: open file, map full maxSize (MAP_SHARED)
    ResizableMappedFile->>ResizableMappedFile: store current file size in m_size

    User->>ResizableMappedFile: resize(newSize)
    ResizableMappedFile->>OS/FileSystem: truncate file to newSize
    ResizableMappedFile->>ResizableMappedFile: update m_size

    User->>ResizableMappedFile: data(), size(), capacity()
    ResizableMappedFile->>User: return pointer, m_size, mapped size
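
The flow in the diagram can be sketched as a small class. This is a hedged illustration only: `SketchResizableFile` is a hypothetical name, it uses raw POSIX calls instead of the library's `MemoryMapRW` member, and its error handling is an assumption rather than the PR's actual behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical stand-in for the Linux ResizableMappedFile described above.
class SketchResizableFile {
public:
    SketchResizableFile(const char* path, std::size_t maxSize)
        : m_fd(open(path, O_RDWR | O_CREAT, 0644)), m_capacity(maxSize) {
        if (m_fd == -1)
            throw std::runtime_error("open failed");
        struct stat st {};
        if (fstat(m_fd, &st) == -1 ||
            static_cast<std::size_t>(st.st_size) > maxSize) {
            close(m_fd);
            throw std::runtime_error("fstat failed or file exceeds maxSize");
        }
        m_size = static_cast<std::size_t>(st.st_size);
        // Map the full range once; the file may be smaller than the mapping.
        m_addr = mmap(nullptr, m_capacity, PROT_READ | PROT_WRITE, MAP_SHARED, m_fd, 0);
        if (m_addr == MAP_FAILED) {
            close(m_fd);
            throw std::runtime_error("mmap failed");
        }
    }
    ~SketchResizableFile() {
        munmap(m_addr, m_capacity);
        close(m_fd);
    }
    SketchResizableFile(const SketchResizableFile&) = delete;
    SketchResizableFile& operator=(const SketchResizableFile&) = delete;

    // Resizing only truncates the file; the mapping is left untouched.
    void resize(std::size_t newSize) {
        if (newSize > m_capacity)
            throw std::length_error("resize beyond mapped capacity");
        if (ftruncate(m_fd, static_cast<off_t>(newSize)) == -1)
            throw std::runtime_error("ftruncate failed");
        m_size = newSize;
    }
    void*       data() const { return m_size ? m_addr : nullptr; }
    std::size_t size() const { return m_size; }
    std::size_t capacity() const { return m_capacity; }

private:
    int         m_fd;
    std::size_t m_capacity;
    std::size_t m_size = 0;
    void*       m_addr = MAP_FAILED;
};
```

Callers are expected to stay within `size()` bytes of `data()`; the range between `size()` and `capacity()` is mapped but not backed by the file.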

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–20 minutes

Possibly related PRs

  • add .sync() to flush writes to disk #14: Refactors ResizableMappedFile by simplifying mapping and adding sync() methods to flush writes to disk; changes overlap with this PR in the class and method implementations related to memory mapping and synchronization.

Poem

A bunny mapped a file one day,
And swept the old reserved away.
Now one big map, so neat and bright,
With tests that check each edge-case byte.
SIGBUS may hop, but bugs won't stay—
This code is crisp as new-mown hay!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec11f56 and 5bbb941.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • include/decodeless/detail/mappedfile_linux.hpp (2 hunks)
  • test/src/mappedfile.cpp (6 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md

[grammar] ~96-~96: Use correct spacing
Context: ...s more than the file size and truncates without remapping. Simple and very fast, although not expl...

(QB_NEW_EN_OTHER_ERROR_IDS_5)


[grammar] ~97-~97: Use correct spacing
Context: ...lthough not explicitly supported in the man pages. Tests indicate the right thing still ha...

(QB_NEW_EN_OTHER_ERROR_IDS_5)


[grammar] ~98-~98: Use correct spacing
Context: ...s. Tests indicate the right thing still happens. ## Contributing Issues and pull requests a...

(QB_NEW_EN_OTHER_ERROR_IDS_5)

🔇 Additional comments (9)
README.md (1)

96-98: Documentation addition looks good!

The note clearly explains the Linux-specific optimization approach and appropriately mentions that while not explicitly documented in man pages, tests validate the behavior works correctly.

test/src/mappedfile.cpp (7)

69-69: Good refactoring to reduce fixture usage

Converting these tests from TEST_F to TEST is appropriate since they create their own temporary files and don't need the fixture's pre-created file.


157-176: Excellent test for the Linux optimization

This test effectively validates the core optimization by:

  1. Mapping a region larger than the current file size
  2. Truncating the file to the mapped size
  3. Writing to the expanded region
  4. Verifying the write persists

This directly tests the behavior documented in the README about mapping more than the file size.


274-280: Helper function is well-implemented

The getResidency function provides a clean abstraction for checking page residency status using mincore. The error handling is appropriate.
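
For readers without the diff at hand, a residency helper built on `mincore()` might look like the sketch below. The name `pageResidency` and the exception-based error handling are assumptions; the PR's actual `getResidency` may differ in signature and behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: returns one flag per page of [addr, addr + length),
// true if the kernel reports the page as resident. addr must be page-aligned.
inline std::vector<bool> pageResidency(void* addr, std::size_t length) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    std::vector<unsigned char> flags((length + page - 1) / page);
    if (mincore(addr, length, flags.data()) == -1)
        throw std::runtime_error("mincore failed");
    std::vector<bool> result(flags.size());
    for (std::size_t i = 0; i < flags.size(); ++i)
        result[i] = (flags[i] & 1) != 0; // only the lowest bit is defined
    return result;
}
```

A page that has just been written through the mapping should be reported as resident; whether untouched pages are reported as non-resident depends on the mapping type and the kernel.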


282-308: Comprehensive page residency test

This test thoroughly validates the Linux kernel's handling of page residency during file truncation operations:

  • Confirms pages are non-resident after initial truncation
  • Verifies pages become resident after writing
  • Validates pages become non-resident again after truncating to 0

This provides strong evidence that the optimization approach is sound.


310-353: Well-crafted memory decommit test

The test effectively validates the mprotect + madvise approach for decommitting memory pages. The commented alternative approach and reference to the Go issue shows thorough research into the best method.
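
A rough sketch of the decommit idea, assuming a private anonymous mapping on Linux. The helper names `decommitPages` and `recommitPages` are hypothetical, and the order of the two calls is a choice made for this sketch, not necessarily the order used in the PR's test.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: release the physical pages behind the range and make
// it inaccessible, while keeping the address range reserved.
inline bool decommitPages(void* addr, std::size_t length) {
    return madvise(addr, length, MADV_DONTNEED) == 0 &&
           mprotect(addr, length, PROT_NONE) == 0;
}

// Hypothetical helper: make the range usable again. For private anonymous
// memory on Linux, pages dropped by MADV_DONTNEED read back as zero.
inline bool recommitPages(void* addr, std::size_t length) {
    return mprotect(addr, length, PROT_READ | PROT_WRITE) == 0;
}
```

After `decommitPages()` followed by `recommitPages()`, earlier contents of an anonymous private range are gone and the first access faults in fresh zero pages.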


557-570: Good edge case coverage

Testing empty file creation ensures the resizable file implementation correctly handles the zero-size case.


572-581: Important test for file clearing behavior

This test ensures that resizing an existing file to zero properly truncates it, which is crucial for the new implementation that doesn't remap on resize.

include/decodeless/detail/mappedfile_linux.hpp (1)

131-138: Good refactoring of sync method

Using the existing address(offset) helper method eliminates duplicate pointer arithmetic and improves code maintainability.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 977ad2a and 40b4a3c.

📒 Files selected for processing (2)
  • include/decodeless/detail/mappedfile_linux.hpp (1 hunks)
  • test/src/mappedfile.cpp (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (windows-latest, Release, cl)
🔇 Additional comments (6)
test/src/mappedfile.cpp (3)

157-176: Well-designed test for overmapping behavior

This test effectively validates the optimization where files can be mapped larger than their current size and then grown with ftruncate() without remapping. The test correctly verifies that writes to the grown region persist.


454-467: Good edge case test for empty file creation

This test properly verifies that resizable files can be created with zero initial size, which is an important edge case for the new implementation.


469-479: Effective test for clearing existing file content

The test validates that resizing an existing file to zero properly truncates it, which is important behavior to verify.

include/decodeless/detail/mappedfile_linux.hpp (3)

212-212: Good null-safety for empty files

The data() method correctly returns nullptr for empty files, preventing potential issues with zero-sized mappings.


215-218: Excellent optimization eliminating remap overhead

The simplified resize() implementation that only calls ftruncate() without remapping is the core optimization mentioned in the PR objectives. This approach leverages the fact that the entire range is already mapped, significantly improving performance.


235-237: Correct member ordering for RAII

The member variable ordering ensures proper destruction sequence (m_mapped before m_file), which is critical for RAII correctness. The addition of m_size to track the actual file size separately from the mapped region size is a clean design choice.
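
The destruction-order rule this comment relies on can be shown in isolation. The sketch below uses hypothetical `Tracer` and `Holder` types rather than the library's classes; it only demonstrates that C++ destroys non-static members in reverse order of declaration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records its name in a shared log when destroyed.
struct Tracer {
    std::vector<std::string>* log;
    std::string               name;
    ~Tracer() { log->push_back(name); }
};

// Members are declared in the order file, then mapping,
// so they are destroyed as mapping, then file.
struct Holder {
    Tracer file;
    Tracer mapping;
};

inline std::vector<std::string> destructionOrder() {
    std::vector<std::string> log;
    {
        Holder h{{&log, "m_file"}, {&log, "m_mapped"}};
    } // h goes out of scope here
    return log; // {"m_mapped", "m_file"}
}
```

Declaring the file handle before the mapping therefore means the mapping is released first and the descriptor is closed afterwards.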

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
include/decodeless/detail/mappedfile_linux.hpp (1)

223-230: Document SIGBUS behavior in class documentation

The optimization of mapping the entire range upfront is good, but users need to be aware of the SIGBUS behavior when accessing beyond the actual file size.

test/src/mappedfile.cpp (1)

172-172: Add spaces around the minus operator

-        ifile.seekg(overmapSize-sizeof(lastByte));
+        ifile.seekg(overmapSize - sizeof(lastByte));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40b4a3c and ec11f56.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • include/decodeless/detail/mappedfile_linux.hpp (2 hunks)
  • test/src/mappedfile.cpp (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
test/src/mappedfile.cpp (1)
include/decodeless/detail/mappedfile_linux.hpp (10)
  • size (76-79)
  • size (76-76)
  • size (151-160)
  • size (151-151)
  • size (235-238)
  • size (235-235)
  • size (284-315)
  • size (284-284)
  • result (69-74)
  • LastError (30-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (windows-latest, Release, cl)
🔇 Additional comments (10)
include/decodeless/detail/mappedfile_linux.hpp (3)

131-140: LGTM!

The addition of the size() getter method and the existing sync() implementation are correct.


263-265: Member ordering is critical for correct destruction

The ordering of m_file before m_mapped ensures the file descriptor outlives the mapping during destruction. This is essential for correctness.


235-238: Efficient resize implementation

The simplified resize logic that only truncates the file without remapping is a significant performance optimization that aligns with the PR objectives.

test/src/mappedfile.cpp (6)

69-94: Good refactoring to reduce fixture overhead

Converting these tests from TEST_F to TEST eliminates unnecessary fixture setup/teardown for tests that don't need the pre-created file.

Also applies to: 96-139, 178-198, 210-272


282-301: Comprehensive test for truncation behavior

This test effectively validates that truncating a mapped file releases the physical pages, which is crucial for the new optimization approach.


302-345: Well-structured memory decommit test

The test properly validates memory decommit behavior using mprotect and madvise(MADV_DONTNEED). The comment referencing the Go issue provides valuable context for the chosen approach.


549-562: Good edge case coverage

Testing empty file creation ensures the implementation handles the zero-size case correctly.


564-573: Validates file truncation to zero

This test ensures that resizing an existing file to zero works correctly, which is important for the new truncation-based approach.


575-595: Documentation example validation

Testing the README example ensures the documentation stays accurate and the API works as advertised.

README.md (1)

96-98: Important documentation of Linux-specific behavior

The added note clearly explains the Linux implementation strategy of overmapping and truncating without remapping. This transparency helps users understand the performance characteristics and potential edge cases.

pknowles commented Aug 4, 2025

https://stackoverflow.com/questions/6875771/mmap-what-happens-if-underlying-file-changes-shrinks

The answer here sounds like none of this should be working. Maybe it just works on Linux and not necessarily elsewhere.

https://stackoverflow.com/questions/7587625/truncating-memory-mapped-file

> This is not necessary. You can mmap more than the actual size of the file, and writing more than a page past the end of the file will result in SIGBUS. If you increase the size with ftruncate before writing, you should have no problem, though.

Seems to be true in practice. I haven't found anything in the man pages about it though.
