Skip to content

iox-#2440 Fix race condition that can cause failed chunk management a…#2443

Merged
elBoberido merged 1 commit intoeclipse-iceoryx:mainfrom
gpalmer-latai:iox-2440-chunk-management-race
Apr 1, 2025
Merged

iox-#2440 Fix race condition that can cause failed chunk management a…#2443
elBoberido merged 1 commit intoeclipse-iceoryx:mainfrom
gpalmer-latai:iox-2440-chunk-management-race

Conversation

@gpalmer-latai
Copy link

@gpalmer-latai gpalmer-latai commented Mar 17, 2025

…llocation

Notes for Reviewer

As described in #2440, the race condition exists between the MemoryManager acquiring chunks for payloads before acquiring chunks for management, and the SharedChunk releasing chunks for payloads before releasing chunks for management.

The MemoryManager relies upon the assumption that if a payload allocation succeeded, the management allocation will succeed. This assumption proves to be false however for a brief moment when SharedChunk::releaseChunk releases the payload chunk but has not yet released the management chunk.

The changes in this pull request resolve the issue by switching the order of release such that the management chunk is released before the payload chunk. This properly uploads the invariant that the number of management chunks is always greater than or equal to the number of payload chunks - since for any given payload/management pair the lifetime of the management chunk is contained within the lifetime of the payload chunk.

With regards to testing - there isn't any formal way to prove these changes resolve the issue nor prevent a new issue, other than running the stress tests.

Pre-Review Checklist for the PR Author

  1. Code follows the coding style of CONTRIBUTING.md
  2. Tests follow the best practice for testing
  3. Changelog updated in the unreleased section including API breaking changes
  4. Branch follows the naming format (iox-123-this-is-a-branch)
  5. Commits messages are according to this guideline
  6. Update the PR title
    • Follow the same conventions as for commit messages
    • Link to the relevant issue
  7. Relevant issues are linked
  8. Add sensible notes for the reviewer
  9. All checks have passed (except task-list-completed)
  10. Assign PR to reviewer

Checklist for the PR Reviewer

  • Consider a second reviewer for complex new features or larger refactorings
  • Commits are properly organized and messages are according to the guideline
  • Code according to our coding style and naming conventions
  • Unit tests have been written for new behavior
  • Public API changes are documented via doxygen
  • Copyright owner are updated in the changed files
  • All touched (C/C++) source code files from iceoryx_hoofs have been added to ./clang-tidy-diff-scans.txt
  • PR title describes the changes

Post-review Checklist for the PR Author

  1. All open points are addressed and tracked via issues

References

@gpalmer-latai gpalmer-latai force-pushed the iox-2440-chunk-management-race branch 3 times, most recently from 931c73f to 0d95da3 Compare March 17, 2025 17:14
@codecov
Copy link

codecov bot commented Mar 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.26%. Comparing base (2fc7a5f) to head (9baed66).
Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2443      +/-   ##
==========================================
- Coverage   78.26%   78.26%   -0.01%     
==========================================
  Files         445      445              
  Lines       17091    17093       +2     
  Branches     2373     2373              
==========================================
+ Hits        13376    13377       +1     
  Misses       2835     2835              
- Partials      880      881       +1     
Flag Coverage Δ
unittests 78.06% <100.00%> (-0.01%) ⬇️
unittests_timing 15.31% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ude/iceoryx_posh/internal/mepoo/memory_manager.hpp 100.00% <ø> (ø)
...clude/iceoryx_posh/internal/mepoo/shared_chunk.hpp 100.00% <ø> (ø)
iceoryx_posh/source/mepoo/memory_manager.cpp 96.77% <100.00%> (+0.22%) ⬆️
iceoryx_posh/source/mepoo/shared_chunk.cpp 96.42% <100.00%> (-0.24%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gpalmer-latai
Copy link
Author

gpalmer-latai commented Mar 27, 2025

@elBoberido I noticed there is no protection around the m_chunkManagement pointer in SharedChunk::freeChunk. Would you like me to do anything like add an IOX_ASSERT to check that it is not null? I've documented the precondition to freeChunk that it has not already been freed.

I can also change freeChunk to be a noop if it is called more than once, but perhaps that scenario itself is indicative of a logic error and it would be better to blow up.

I can ALSO qualify that method as && on the SharedChunk therefore only allowing it to be called via std::move(chunk).freeChunk()

@gpalmer-latai gpalmer-latai force-pushed the iox-2440-chunk-management-race branch from 33ee77e to b679df1 Compare March 28, 2025 12:33
@gpalmer-latai
Copy link
Author

@elBoberido ready for re-review

@gpalmer-latai gpalmer-latai force-pushed the iox-2440-chunk-management-race branch from b679df1 to 6e9ef92 Compare March 28, 2025 12:36
@elBoberido
Copy link
Member

elBoberido commented Mar 28, 2025

@gpalmer-latai freeChunk is a private function and only called from decrementReferenceCounter which does the sanity check.

Copy link
Member

@elBoberido elBoberido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nitpicks :)

@gpalmer-latai gpalmer-latai force-pushed the iox-2440-chunk-management-race branch from 6e9ef92 to 2604c4f Compare March 31, 2025 13:32
Copy link
Member

@elBoberido elBoberido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one last question :)


// Here the chunk management must be freed before the chunk itself to maintain
// the invariant that there are always at least as many chunk management chunks available as payload chunks
chunkManagement.m_chunkManagementPool->freeChunk(&chunkManagement);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, I was also thinking about proposing this but then I thought it might be a bad idea since the reference becomes dangling after calling freeChunk and there is now way to indicate it. With the pointer, one could just set it to a nullptr.

Unfortunately, the language does not give us good options.

Not suggesting to change it but would be good to know your thought process on this.

What do you think of adding a comment right below this line, like // NOTE: chunkManagement is dangling from here on?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm... Maybe we should just rewrite Iceoryx in Rust? 😆

Sure, I'll add a comment just to be extra cautious.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elBoberido updated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but how shall we call it? riceoryx?

cc @elfenpiff a lost opportunity but it's maybe not too late

…nk management allocation

Signed-off-by: Graham Palmer <gpalmer+github@lat.ai>
@gpalmer-latai gpalmer-latai force-pushed the iox-2440-chunk-management-race branch from 2604c4f to 9baed66 Compare April 1, 2025 13:02
@elBoberido elBoberido merged commit f3ccf20 into eclipse-iceoryx:main Apr 1, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in Memory Manager leads to a nullptr contract violation

2 participants

Comments