Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed to use posix robust mutex #67

Closed
wants to merge 2 commits into from

Conversation

PomLover
Copy link

@PomLover PomLover commented Nov 9, 2018

Changed to use posix robust mutex to allow subsequent mutex lock can grab the lock from previous dead owner to void deadlock. And it also fixed the issue #65.

Hi Reviewers,
We ran into some issues while we are using boost::interprocess::named_recursive_mutex with posix implementation when the previous owner of the mutex has been dead without releasing the mutex, so all others which are waiting for the same mutex will be blocked forever. And that's why I added this fix in posix mutex implementation to avoid this issue.

…grab the lock from previous dead owner to void deadlock. And it also fixed the issue boostorg#65.
@363568233
Copy link

Which boost version will update it?

@PomLover
Copy link
Author

I don't know, it depends on when the maintainer can merge it.

@363568233
Copy link

It only changed recursive_mutex.hpp, there is another mutex.hpp need to change?

@363568233
Copy link

pthread_mutex_consistent(&m_mut); after it
Whether it should be use pthread_mutex_unlock(&m_mut) to unlock it?

@PomLover
Copy link
Author

pthread_mutex_consistent(&m_mut); after it
Whether it should be use pthread_mutex_unlock(&m_mut) to unlock it?

The case is going to solve is when the new request to lock the mutex if the previous owner has been dead, the new owner is expecting to get the mutex lock anyway, if unlock is called, the new owner would have to call lock again to grab the lock which is NOT expected behavior.

@PomLover
Copy link
Author

It only changed recursive_mutex.hpp, there is another mutex.hpp need to change?

Yeah, potentially for semaphore to, but we didn't use them and test them, so I didn't include any of those changes, but only recursive mutex since we are using it and testing it.

@363568233
Copy link

The way to deal owner dead is different from windows mutex, windows mutex will release mutex and throw owner_dead_error exception, Do not posix mutex keep consistent with windows to tell user?

@363568233
Copy link

I see the boost::interprocess::interprocess_mutex on linux is using the mutex.cpp default, the boost::interprocess::interprocess_mutex is used in rbtree_best_fit

@PomLover
Copy link
Author

The way to deal owner dead is different from windows mutex, windows mutex will release mutex and throw owner_dead_error exception, Do not posix mutex keep consistent with windows to tell user?

Yeah, it could be aligned with windows mutex implementation in boost, but I would prefer to granting mutex to new owner instead of leave it unlocked which behavior is different from Windows API WaitforSingleObject.

@PomLover
Copy link
Author

I see the boost::interprocess::interprocess_mutex on linux is using the mutex.cpp default, the boost::interprocess::interprocess_mutex is used in rbtree_best_fit

I didn't use rbtree_best_fit, but definitely the mutex.cpp can be changed to use robust mutex too if needed, but I didn't use it, so I would prefer to having someone else who is using and testing it making the changes.

@PomLover
Copy link
Author

Actually I think the new owner release the mutex from previous dead owner is a bad implementation which break some expectations in mutex use cases.

@PomLover
Copy link
Author

Hi Maintainers and Contributors,
It's been a while since I submitted this PR, I'm wondering if there are any concerns or suggestions with this PR, and is it possible to merge this in at that some point, could some of major contributors here provide some feedback or suggestions for next step? Thank you!

@msmolensky
Copy link

Recently ran into the same problem. Implemented this solution for both non-recursive and recursive mutexes. I think this PR fixes an important problem and should be accepted

@igaztanaga
Copy link
Member

Hi,

I've found some time review old bugs and patches. I think the proposed patch (and, in general, robust mutex implementation in C++) has some problems.

Issue 1, Throwing while locking should left the mutex unlocked: The final implementation (with commit 34fc89a), throws an exception even when the mutex is locked, something that breaks C++ Lock guarantees (std::unique_lock, scoped_lock, etc... asume that an exception when locking guarantees the mutex was no locked). And nearly every C++ programmer will use a Lock to call the mutex operations.

Issue 2, Unconditionally marking the mutex as consistent is wrong as only the application can decide on this: The first implementation (commit aba35f0), marks the mutex as consistent, which is coherent with not throwing an exception, but a mutex should be marked as consistent if the application can guarantee that no data structures have been corrupted while the original owner died (the original owner probably died while modifying shared data protected by the mutex, and that data might be half-modified). According to the POSIX specification:

"The pthread_mutex_consistent() function is only responsible for notifying the implementation that the state protected by the mutex has been recovered and that normal operations with the mutex can be resumed. It is the responsibility of the application to recover the state so it can be reused."

So it's not possible to take a decision inside "posix_recursive_mutex::lock()" or any other internal function because the application has no chance to inspect the data.

Issue 3, Impossible to fix Issue 1 and Issue 2 at the same time with POSIX: And finally, we can't unlock the mutex and thow an exception informing the previous owner died when EOWNERDEAD is returned because unlocking the mutex marks it as unrecoverable forever. So it's not possible to inform the application that the previous owner died without breaking Lock guarantees.

I'm inclined to think that the only sensible implementation when receiving EOWNERDEAD is to mark the mutex as unrecoverable (just unlocking it) and throwing an exception because certainly data might be corrupted. No new callers will be able to lock it again and will receive an exception, but they won't be blocked and can handle the exception and try to recreate the data structure in another shared memory or mapped file.

@PomLover
Copy link
Author

@igaztanaga Thank you very much for your valuable inputs, I think I agree with all your comments above, and I have a few thoughts based on your comments, as you mentioned, application is responsible to decide if the mutex can be set to consistent and reusable, do you think it will be appropriate to provide another interface for application to set mutex to consistent state and continue using it, as I can see there are some needs in this case from different comments above. As in this case, it is going to be harder for application to notify all existing waiters to wake up and start using a new mutex instead, and that could cause a mess in terms of keeping the same order of the waiters after starting using the new mutex, then reusing existing mutex by setting it consistent will be a better fit. As library provider, it will be great to provide more flexibilities for applications to achieve different cases easily. My thought is that in this case, library provider can't assume the applications will not be able to recover from corrupted shared data by using the existing mutex either and mark the mutex as not recoverable. I think I still agree with you that application is responsible to make this decision, but currently there is no interface for application to make this decision either reusing mutex or creating a new one. It seems like creating a new one is the only choice which might not be ideal for some applications and/or even cause other problems.

@igaztanaga
Copy link
Member

I'm not sure which interface would be reasonable. If you directly manage the mutex, then we could have some lock() overload that succeeds but also informs that the owner was dead, this could work for POSIX, I'm not sure about Win32 (Win32 has no notion of "mutex consistency").

The problem is harder to solve when we try to use more advanced clases (such as managed shared memory in Boost.Interprocess), where the mutex is deeply hidden. If locking returns that the owner is dead, user code is many levels away the mutex failure logic, ¿how can the internal logic of Boost.Interprocess call the user code without unlocking the mutex? (because unlocking would permantly declare the mutex as non-recoverable in POSIX). A callback that the user must pass layer by layer until the internal logic of Boost.Interprocess?

The more I think about it, the more I see the issue is really tricky.

@PomLover
Copy link
Author

@igaztanaga Thanks for your new inputs. Here are my additional thoughts. Even though there is no concept of consistency in Win32, but Win32 returns WAIT_ABANDONED indicator for waiter to identify dead owner case, and allow applications to decide what to do either continue using this mutex with the normal behavior or do something else, I think that is what a lot of Windows Applications have been living with. The interface in my mind would be something like "recover", for POSIX, it will need to set mutex consistence, and for WIN32, there is nothing to do in this case, as WIN32 allows to use the same mutex without setting anything as long as the new owner releases it before exiting. I can understand it feels tricky while dealing with different cases in multi-platforms, but that is the benefit of boost library to deal with the hard portion so that applications don't need to worry about the tricky cases in multi-platforms, then a cross-platform applications will behave the same in multi-platforms.

@msmolensky
Copy link

While I agree that in general the library cannot make a decision, in some cases it is possible. In the systems with only one writer, for example, it is safe for the library to declare the mutex is consistent. Can the mutex be policy or configuration based? By default, it can be unrecoverable, but it may be specialized/configured to be consistent.

I still feel that unconditionally preventing an application from accessing the data is too restrictive. While better than deadlocking, it still won't allow the application to function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants