Deadlock in shared mem mutex when apps are killed #325

Open
marthtz opened this issue Oct 30, 2020 · 5 comments

@marthtz
Contributor

marthtz commented Oct 30, 2020

Required information

Operating system:
Ubuntu 18.04 LTS

Compiler version:
GCC 7.4.0

Observed result or behaviour:
Deadlock in shared mem mutex when apps are killed

Expected result or behaviour:
No deadlock.

Conditions where it occurred / Performed steps:
Kill an application while the rest of the system is still running.

Related issues

@marthtz marthtz added the bug Something isn't working label Oct 30, 2020
@marthtz marthtz self-assigned this Oct 30, 2020
marthtz pushed a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020
…case application fails to unlock mutex and terminate

Signed-off-by: Sumedha Maharudrayya Mathad (RBEI/ESU4) <SumedhaMaharudrayya.Mathad@in.bosch.com>
marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020
Signed-off-by: Hintz Martin (CC-AD/ESW1) <martin.hintz@de.bosch.com>
marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020
Signed-off-by: Hintz Martin (CC-AD/ESW1) <martin.hintz@de.bosch.com>
marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020
Signed-off-by: Hintz Martin (CC-AD/ESW1) <martin.hintz@de.bosch.com>
@marthtz marthtz linked a pull request Nov 2, 2020 that will close this issue
@mossmaurice mossmaurice added this to the v1.x milestone Nov 4, 2020
@elBoberido
Member

I've had an idea how this could be solved in an elegant way: not with a mutex, but with an atomic and a semaphore instead.
Let's assume we have the following class:

class DualAccessTransactionTray
{
    private:
    class AccessGuard;

    public:
    // RouDi claims the token; if the application currently holds it, wait on the semaphore
    AccessGuard acquireRouDiAccess()
    {
        switch (m_accessToken.exchange(AccessToken::ROUDI, std::memory_order_acquire))
        {
        case AccessToken::NONE:
            break;
        case AccessToken::ROUDI:
            errorHandler(kBROKEN_INVARIANT);
            break;
        case AccessToken::APP:
            m_waitingLine.wait();
            break;
        }
        return AccessGuard{*this, AccessToken::ROUDI};
    }
    // the application claims the token; if RouDi currently holds it, wait on the semaphore
    AccessGuard acquireAppAccess()
    {
        switch (m_accessToken.exchange(AccessToken::APP, std::memory_order_acquire))
        {
        case AccessToken::NONE:
            break;
        case AccessToken::ROUDI:
            m_waitingLine.wait();
            break;
        case AccessToken::APP:
            errorHandler(kBROKEN_INVARIANT);
            break;
        }
        return AccessGuard{*this, AccessToken::APP};
    }
    // called by RouDi to recover the token when the application terminated while holding it
    void cleanupAfterAppWalkedThePlank()
    {
        releaseAccess(AccessToken::APP);
        // just to be sure the memory is synchronized
        AccessToken expected = AccessToken::APP;
        m_accessToken.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_acq_rel);
    }

    private:
    enum class AccessToken
    {
        NONE,
        ROUDI,
        APP
    };

    // releases the token; if the other party already claimed it and is waiting, wake it up
    void releaseAccess(AccessToken expected)
    {
        auto casSuccessful =
            m_accessToken.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_release);
        if (!casSuccessful)
        {
            if (expected == AccessToken::NONE)
            {
                errorHandler(kBROKEN_INVARIANT);
            }
            m_waitingLine.post();
        }
    }

    class AccessGuard
    {
        public:
        AccessGuard(DualAccessTransactionTray& transactionTray, AccessToken accessToken)
            : m_transactionTray(transactionTray)
            , m_accessToken(accessToken)
        {
        }
        ~AccessGuard()
        {
            m_transactionTray.releaseAccess(m_accessToken);
        }

        private:
        DualAccessTransactionTray& m_transactionTray;
        AccessToken m_accessToken{AccessToken::NONE};
    };

    std::atomic<AccessToken> m_accessToken{AccessToken::NONE};
    posix::Semaphore m_waitingLine;
};

So basically a futex, but just for two parties. If the application terminates abnormally, the semaphore can be cleaned up, in contrast to the mutex, which cannot be unlocked from a thread other than the one that acquired the lock.
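
To make the intended usage a bit more concrete, here is a rough sketch of how the two sides and the cleanup path could look. The free functions and the appTerminatedAbnormally flag are placeholders for illustration, not existing iceoryx code:

// hypothetical usage sketch only; function names and the termination flag are illustrative
void roudiDiscoveryLoop(DualAccessTransactionTray& tray)
{
    auto guard = tray.acquireRouDiAccess(); // may block on the semaphore until the app releases
    // ... add/remove queues ...
}   // guard destructor calls releaseAccess and wakes a waiting application if needed

void appPublishThread(DualAccessTransactionTray& tray)
{
    auto guard = tray.acquireAppAccess(); // may block on the semaphore until RouDi releases
    // ... push the chunk to all queues ...
}

void roudiProcessMonitor(DualAccessTransactionTray& tray, bool appTerminatedAbnormally)
{
    if (appTerminatedAbnormally)
    {
        // recover the access token the dead application may still hold,
        // so the discovery loop cannot wait on it forever
        tray.cleanupAfterAppWalkedThePlank();
    }
}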

This would be an example of concurrent access from RouDi and the application:

                              DualAccessTransactionTray
RouDi (discovery loop)               AccessToken           App (publisher thread)
                                        NONE
acquireRouDiAccess                      NONE
-> previously NONE                      ROUDI
-> continue                             ROUDI
                                        ROUDI              acquireAppAccess
                                        APP                -> previously ROUDI but expected NONE
                                        APP                -> m_waitingLine.wait
-> add/remove queues                    APP
releaseRouDiAccess                      APP
-> CAS fails                            APP
-> m_waitingLine.post                   APP
                                        APP                -> continue (maybe check if to be destroyed flag is set)
                                        APP                -> push to all queues
                                        APP                releaseAppAccess
                                        NONE               -> CAS succeeds
                                        NONE               -> nothing to be done
                                        NONE
acquireRouDiAccess                      NONE
-> previously NONE                      ROUDI
-> continue                             ROUDI
                                        ROUDI              acquireAppAccess
                                        APP                -> previously ROUDI but expected NONE
                                        APP                -> m_waitingLine.wait
-> add/remove queues                    APP
releaseRouDiAccess                      APP
-> CAS fails                            APP
-> m_waitingLine.post                   APP
                                        APP                -> continue (maybe check if to be destroyed flag is set)
acquireRouDiAccess                      APP
-> previously APP but expected NONE     ROUDI
-> m_waitingLine.wait                   ROUDI
                                        ROUDI
                                        ROUDI              -> push to all queues
                                        ROUDI              releaseAppAccess
                                        ROUDI              -> CAS fails
                                        ROUDI              -> m_waitingLine.post
-> continue                             ROUDI
-> add/remove queues                    ROUDI
releaseRouDiAccess                      ROUDI
-> CAS succeeds                         NONE
-> nothing to be done                   NONE

@budrus @MatthiasKillat @elfenpiff what do you think?

@budrus
Contributor

budrus commented Apr 19, 2021

@elBoberido so like a mutex that can be unlocked by someone else, if that someone is sure the lock holder walked the plank. But we could still have corrupted data structures which we may not be allowed to access. So we could detect this and continue, with the consequence that chunks are lost. Or we have to go for the extended UsedChunkList to get a more robust data structure.
This is the todo I added some time ago.
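
A minimal sketch of that detect-and-continue idea, assuming a placeholder ChunkHistory type and a flag telling RouDi that the application died inside its critical section (both hypothetical, not iceoryx API):

#include <iostream>

// placeholder for the publisher's chunk history container
struct ChunkHistory
{
    void clear(); // would release every chunk held in the history
};

// if the application died while holding the lock, the possibly corrupted
// history is not traversed; its chunks are intentionally leaked with a warning
void cleanupHistory(ChunkHistory& history, bool lockWasHeldByDeadApplication)
{
    if (lockWasHeldByDeadApplication)
    {
        std::cerr << "history was locked by a terminated application, leaking its chunks\n";
        return;
    }
    history.clear();
}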

@elBoberido
Member

elBoberido commented Apr 19, 2021

@budrus yes, the history vector might be corrupted. The simple approach would indeed be to leak the chunks and print a warning if RouDi detects that the lock was held by the application. A more sophisticated solution would be a HistoryRingBuffer with the same mechanism as the UsedChunkList. In my view, extending the UsedChunkList makes no sense since it has a totally different API. I'd rather go for a small HistoryRingBuffer which does one thing, but does it well.
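
A rough sketch of what such a HistoryRingBuffer could look like, assuming the same robustness goal as the UsedChunkList: fixed-size storage, an atomic write position, and slots RouDi can release without following pointers into application-owned state. The ChunkHeader placeholder and all member names are made up for illustration:

#include <atomic>
#include <cstdint>

struct ChunkHeader; // placeholder for the shared memory chunk type

template <uint64_t Capacity>
class HistoryRingBuffer
{
  public:
    // store a new history entry and hand back the overwritten one (if any)
    // so the caller can release it
    ChunkHeader* push(ChunkHeader* chunk)
    {
        auto index = m_writeIndex.fetch_add(1U, std::memory_order_relaxed) % Capacity;
        return m_slots[index].exchange(chunk, std::memory_order_acq_rel);
    }

    // called by RouDi during cleanup; every occupied slot is released,
    // even if the application died in the middle of a push
    template <typename ReleaseFunc>
    void cleanup(ReleaseFunc&& release)
    {
        for (auto& slot : m_slots)
        {
            if (auto* chunk = slot.exchange(nullptr, std::memory_order_acq_rel))
            {
                release(chunk);
            }
        }
    }

  private:
    std::atomic<uint64_t> m_writeIndex{0U};
    std::atomic<ChunkHeader*> m_slots[Capacity]{};
};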

elBoberido added a commit to ApexAI/iceoryx that referenced this issue Apr 20, 2021
Signed-off-by: Mathias Kraus <mathias.kraus@apex.ai>
elBoberido added a commit to ApexAI/iceoryx that referenced this issue May 5, 2021
Signed-off-by: Mathias Kraus <mathias.kraus@apex.ai>
elBoberido added a commit to ApexAI/iceoryx that referenced this issue May 5, 2021
Signed-off-by: Mathias Kraus <mathias.kraus@apex.ai>
@mossmaurice
Contributor

@marthtz @sculpordwarf @bishibashiB @shankar-in
What is the status here? Will you re-open the PR to solve this bug?

cc @CarolinaGGB

@shankar-in
Contributor

@mossmaurice @marthtz is on vacation. He will be back in 2 weeks.
