
bugfix: fix deadlock and lock optimization #1605

Merged
merged 8 commits into develop on Sep 20, 2019
Conversation

@BeiKeJieDeLiuLangMao (Contributor) commented Sep 6, 2019

Ⅰ. Describe what this PR did

Fix deadlock of MemoryLocker & sort locks by PK before acquire.

Ⅱ. Does this pull request fix one issue?

Fixed #1603.

Ⅲ. Why don't you add test cases (unit test/integration test)?

I added one: io.seata.server.lock.LockManagerTest#deadlockTest.

Ⅳ. Describe how to verify it

Run io.seata.server.lock.LockManagerTest#deadlockTest.

Ⅴ. Special notes for reviews

The full description is in #1603.
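
For reviewers, here is a minimal sketch of the lock-ordering idea in this PR (illustrative only; RowLock here is a simplified stand-in, not the exact Seata class). Acquiring row locks in one canonical (tableName, pk) order removes the circular-wait condition that lets two branches deadlock:

import java.util.Comparator;
import java.util.List;

final class RowLock {
    final String tableName;
    final String pk;

    RowLock(String tableName, String pk) {
        this.tableName = tableName;
        this.pk = pk;
    }
}

final class LockOrdering {
    // Sort locks by (tableName, pk) so every transaction requests
    // conflicting locks in the same global order; with a total order
    // there can be no cycle of "holds one, waits for the other".
    static void sortByPk(List<RowLock> rowLocks) {
        rowLocks.sort(Comparator.comparing((RowLock l) -> l.tableName)
                                .thenComparing(l -> l.pk));
    }
}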

@codecov-io commented Sep 6, 2019

Codecov Report

Merging #1605 into develop will increase coverage by <.01%.
The diff coverage is 90%.


@@              Coverage Diff              @@
##             develop    #1605      +/-   ##
=============================================
+ Coverage      47.14%   47.15%   +<.01%     
  Complexity      1785     1785              
=============================================
  Files            364      364              
  Lines          13260    13251       -9     
  Branches        1644     1642       -2     
=============================================
- Hits            6252     6248       -4     
+ Misses          6326     6324       -2     
+ Partials         682      679       -3
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| .../java/io/seata/server/lock/DefaultLockManager.java | 47.91% <ø> (ø) | 12 <0> (ø) ⬇️ |
| ...in/java/io/seata/server/session/BranchSession.java | 78.19% <100%> (ø) | 41 <0> (ø) ⬇️ |
| ...java/io/seata/server/lock/memory/MemoryLocker.java | 87.95% <89.65%> (+2.08%) | 20 <4> (ø) ⬇️ |
| ...o/seata/server/coordinator/DefaultCoordinator.java | 50.9% <0%> (+0.9%) | 27% <0%> (+1%) ⬆️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62c9246...2a0930e.

@l81893521 (Contributor) left a comment:

LGTM

@Justice-love (Contributor) commented Sep 9, 2019

Good!
The range of the synchronized block, together with acquiring another lock object inside it, caused this bug.
Moving the branchSession.unlock() call out of the synchronized block resolves it.
But I think limiting the range of the synchronized block may be more appropriate.
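
For readers following along, a generic sketch of the deadlock pattern being described (illustrative only, not the exact Seata code):

// Thread 1: synchronized (bucketA) { ...; branchSession.unlock(); } // unlock needs bucketB's monitor
// Thread 2: synchronized (bucketB) { ...; branchSession.unlock(); } // unlock needs bucketA's monitor
// Each thread holds one bucket monitor while waiting for the other's:
// a circular wait, i.e. a classic deadlock. Moving unlock() out of the
// synchronized block (or narrowing the block) breaks the cycle.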

@Justice-love (Contributor):

            Long lockingTransactionId = null;
            synchronized (bucketLockMap) {
                lockingTransactionId = bucketLockMap.get(pk);
                if (lockingTransactionId == null) {
                    //No existing lock
                    bucketLockMap.put(pk, transactionId);
                }
            }

@BeiKeJieDeLiuLangMao (Contributor, Author):

@Justice-love
In fact, bucketLockMap is a ConcurrentHashMap too; I think this is enough:

Long previousTransactionId = bucketLockMap.putIfAbsent(pk, transactionId);
if (previousTransactionId == null) {
    //No existing lock
    Set<String> keysInHolder = bucketHolder.get(bucketLockMap);
    if (keysInHolder == null) {
        bucketHolder.putIfAbsent(bucketLockMap, new ConcurrentSet<String>());
        keysInHolder = bucketHolder.get(bucketLockMap);
    }
    keysInHolder.add(pk);

} else if (previousTransactionId == transactionId) {
    // Locked by me
    continue;
} else {
    LOGGER.info("Global lock on [" + tableName + ":" + pk + "] is holding by " + previousTransactionId);
    try {
        // Release all acquired locks.
        branchSession.unlock();
    } catch (TransactionException e) {
        throw new FrameworkException(e);
    }
    return false;
}

@Justice-love (Contributor) commented Sep 9, 2019

> In fact, bucketLockMap is a ConcurrentHashMap too; I think this is enough: (putIfAbsent snippet quoted above)

I know there is no problem with code like this, but I think keeping an appropriate range for a synchronized block is a basic rule, and it would also resolve this problem.

@BeiKeJieDeLiuLangMao (Contributor, Author):

@Justice-love
I understand your point; we should indeed be careful about the range of synchronized blocks. What I wanted to express before is that instead of using a synchronized block, we are better off using ConcurrentHashMap's thread-safe API, which performs better than a synchronized block.

@ggndnn (Contributor) commented Sep 9, 2019

@BeiKeJieDeLiuLangMao
Based on the latest code, can we delete the synchronized (bucket) in releaseLock?

@ggndnn (Contributor) commented Sep 9, 2019

@BeiKeJieDeLiuLangMao If a branch session could unlock itself in parallel, there would be a problem without the synchronized in releaseLock. Do you think that's possible?

@BeiKeJieDeLiuLangMao (Contributor, Author):

@ggndnn
I only deleted synchronized (bucket) in the acquireLock method; I think we could also delete it in the releaseLock method. After doing that, releaseLock would look like this:

@Override
public boolean releaseLock(List<RowLock> rowLock) {
    ConcurrentHashMap<ConcurrentHashMap<String, Long>, Set<String>> lockHolder = branchSession.getLockHolder();
    if (lockHolder == null || lockHolder.size() == 0) {
        return true;
    }
    Iterator<Map.Entry<ConcurrentHashMap<String, Long>, Set<String>>> it = lockHolder.entrySet().iterator();
    while (it.hasNext()) {
        Map.Entry<ConcurrentHashMap<String, Long>, Set<String>> entry = it.next();
        ConcurrentHashMap<String, Long> bucket = entry.getKey();
        Set<String> keys = entry.getValue();
        for (String key : keys) {
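            // Conditional remove: delete the entry only if it is still mapped
            // to this transaction's id (Map.remove(key, value) semantics).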
            bucket.remove(key, branchSession.getTransactionId());
        }
    }
    lockHolder.clear();
    return true;
}

And I am thinking about the problem you mentioned. If acquireLock and releaseLock are invoked concurrently and releaseLock finishes first, acquireLock may still add new locks to LOCK_MAP. The problem is that nobody will ever remove those new locks.
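
To make the worry concrete, here is a hypothetical interleaving (a sketch of the suspected race, not observed behavior):

// T1: releaseLock(): iterates lockHolder and removes every key it finds.
// T2: acquireLock(): putIfAbsent(pk, txId) succeeds and records pk in lockHolder.
// If T2 records pk after T1 has already walked past that bucket (or after
// T1's final lockHolder.clear()), pk stays in LOCK_MAP with no holder entry
// left to release it, i.e. a leaked row lock.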

@ggndnn (Contributor) commented Sep 9, 2019

@BeiKeJieDeLiuLangMao
Just for discussion: what about using CAS instead of synchronized? It's a little more complicated: compare the transactionId first, then remove, instead of removing directly. What's your opinion?

private static final long IMPOSSIBLE_TRANSACTION_ID = Long.MIN_VALUE;

private static final ConcurrentHashMap<String/* resourceId */,
    ConcurrentHashMap<String/* tableName */,
        ConcurrentHashMap<Integer/* bucketId */,
            Map<String/* pk */, AtomicLong/* transactionId */>>>>
    LOCK_MAP
    = new ConcurrentHashMap<>();


...

@Override
    public boolean acquireLock(List<RowLock> rowLocks) {
        for (RowLock lock : rowLocks) {
            ...
            AtomicLong lockingTransactionId = bucketLockMap.get(pk);
            if (lockingTransactionId == null) {
                lockingTransactionId = bucketLockMap.putIfAbsent(pk, new AtomicLong(transactionId));
            }
            while (lockingTransactionId != null && lockingTransactionId.get() == IMPOSSIBLE_TRANSACTION_ID) {
                lockingTransactionId = bucketLockMap.putIfAbsent(pk, new AtomicLong(transactionId));
            }
            if (lockingTransactionId == null) {
                //No existing lock
                Set<String> keysInHolder = bucketHolder.get(bucketLockMap);
                if (keysInHolder == null) {
                    bucketHolder.putIfAbsent(bucketLockMap, new ConcurrentSet<>());
                    keysInHolder = bucketHolder.get(bucketLockMap);
                }
                keysInHolder.add(pk);
            } else if (lockingTransactionId.longValue() != transactionId) {
                LOGGER.info("Global lock on [" + tableName + ":" + pk + "] is holding by " + lockingTransactionId);
                try {
                    // Release all acquired locks.
                    branchSession.unlock();
                } catch (TransactionException e) {
                    throw new FrameworkException(e);
                }
                return false;
            }
        }
        return true;
    }

    @Override
    public boolean releaseLock(List<RowLock> rowLock) {
        Iterator<Map.Entry<Map<String, AtomicLong>, Set<String>>> it = lockHolder.entrySet().iterator();
        while (it.hasNext()) {
            ...
            for (String key : keys) {
                AtomicLong v = bucket.get(key);
                if (v == null) {
                    continue;
                }
                if(v.compareAndSet(branchSession.getTransactionId(), IMPOSSIBLE_TRANSACTION_ID)) {
                    bucket.remove(key);
                }
            }
        }
        ...
    }

@BeiKeJieDeLiuLangMao (Contributor, Author):

@ggndnn
In fact, ConcurrentHashMap's API already provides CAS semantics; take a look at the documentation of java.util.Map#putIfAbsent and java.util.Map#remove(java.lang.Object, java.lang.Object):

> If the specified key is not already associated with a value (or is mapped to {@code null}), associates it with the given value and returns {@code null}, else returns the current value.

> Removes the entry for the specified key only if it is currently mapped to the specified value.

So using java.util.Map#putIfAbsent and java.util.Map#remove(java.lang.Object, java.lang.Object) is enough.
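
A minimal, self-contained illustration of those two calls (the key and values are made up for the example):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CasMapDemo {
    public static void main(String[] args) {
        ConcurrentMap<String, Long> bucketLockMap = new ConcurrentHashMap<>();

        // Atomic "insert if absent": returns null when we won the lock.
        Long prev = bucketLockMap.putIfAbsent("pk-1", 100L);
        System.out.println(prev); // null -> acquired

        // Conditional remove: a no-op unless the current value matches.
        System.out.println(bucketLockMap.remove("pk-1", 200L)); // false, still held
        System.out.println(bucketLockMap.remove("pk-1", 100L)); // true, released
    }
}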

@ggndnn (Contributor) commented Sep 9, 2019

@BeiKeJieDeLiuLangMao
Yes, same logic, but ConcurrentHashMap's implementation uses synchronized internally... Never mind :) that's not the point here. I didn't notice your code above; it is more concise, and my paste was just a presentation of the logic. Coming back to this PR, I think it's enough: releaseLock succeeding means the key's lock is released, and acquireLock succeeding means the lock is acquired. What's your opinion?

@BeiKeJieDeLiuLangMao (Contributor, Author):

@ggndnn
What I worry about is whether it is possible for a branchSession's lock and unlock methods to be invoked at the same time.

I reviewed the code around lock and unlock.
There are two places that use branchSession::lock:

  1. SessionHolder::reload: this process finishes before the RPC server is opened, so I think it's safe and we don't need to worry about it.
  2. DefaultCore::branchRegister: here branchSession::lock is invoked while holding the GlobalSession's lock, so I think it's safe too.

Now let's pay attention to the unlock paths:

  1. DefaultCore#branchRegister: if adding the branch fails, it removes the branchSession's locks; this happens while holding the globalSession's lock, so it's safe.
  2. MemoryLocker#acquireLock: on a lock conflict it releases the locks just acquired; because every branchSession lock operation is invoked under the globalSession's lock, it's safe.
  3. DefaultCore#doGlobalCommit & doGlobalRollback: in the invocation stack, only the step that changes the global session status takes the GlobalSession's lock, but that is enough to keep the whole globalSession safe.

Now consider the following situation:

  1. TM registers a globalSession in TC.
  2. TM invokes an RPC to RM1, but it times out, so TM wants to do a global rollback.
  3. However, TM's RPC request was actually received by RM1, and RM1 tries to register a branch transaction in TC.
  4. Unfortunately, RM1's branch-register request and TM's global-rollback request arrive at TC at the same time.
  5. The branch-register process begins first; at this moment the global session's status is GlobalStatus.Begin, but before the branchSession's locks are added, the global-rollback process starts to work.
  6. Because the global-rollback process takes the global session's lock when changing the session status, it cannot acquire that lock while any branch-register process is unfinished. Once the branch-register process finishes, rollback gets the lock and changes the global session's status. From that moment on, no other branch-register process can begin: when it holds the global session's lock to check the session's status, it will find the session has ended and will fail fast.

Finally, I think the whole process is safe.

I will commit the code I showed you before. And as you said, ConcurrentHashMap's implementation uses synchronized blocks, but that only happens on hash conflicts, and a hash conflict in the bucket map is very unlikely, I think.

@BeiKeJieDeLiuLangMao changed the title from "bugfix: fix deadlock of MemoryLocker; future: sort locks by PK before acquire" to "bugfix: fix deadlock of MemoryLocker; future: sort locks by PK before acquire; feature: lock optimization" on Sep 9, 2019
@ggndnn (Contributor) commented Sep 9, 2019

@BeiKeJieDeLiuLangMao
I understand your worries; in my understanding of this project, that is not going to happen. Your analysis is good and detailed. It has been a pleasure discussing this with you.

By the way, when I mentioned synchronized I was just joking; as I said, never mind. ConcurrentMap is good enough, and even where synchronized is used, the JVM may spin first. Also, a hash conflict is not the condition for whether a synchronized block is used; I think what you meant was something like "only when there is a hash conflict will there be lock contention".

Your code looks good to me now. Maybe someone else needs to review further.

@BeiKeJieDeLiuLangMao (Contributor, Author) left a comment:

Based on the single responsibility principle, first || second is enough.
I think LockManagerTest#acquireLock_failed already does what you want to do.

@ggndnn (Contributor) left a comment:

LGTM now. Thanks @BeiKeJieDeLiuLangMao.

@BeiKeJieDeLiuLangMao mentioned this pull request Sep 10, 2019
@xingfudeshi changed the title from "bugfix: fix deadlock of MemoryLocker; future: sort locks by PK before acquire; feature: lock optimization" to "bugfix: fix deadlock and lock optimization" on Sep 11, 2019
@xingfudeshi (Member) commented:
The unit tests passed.

@xingfudeshi (Member) commented:

@BeiKeJieDeLiuLangMao You need to associate your GitHub account:
git config --global user.email xxx@xxxx.com
The email should be the same as the one on your GitHub account.

@BeiKeJieDeLiuLangMao (Contributor, Author):

@xingfudeshi
I've finished that; what should I do next?

@xingfudeshi (Member):

> I've finished that; what should I do next?

Please re-push the commit.

@xingfudeshi (Member) left a comment:

LGTM.

@slievrly added and then removed the "Do Not Merge" (do not merge into develop) label on Sep 20, 2019
@slievrly (Member) left a comment:

ConcurrentHashMap -> ConcurrentMap

@slievrly (Member) left a comment:
LGTM

Successfully merging this pull request may close these issues: MemoryLocker deadlock (#1603).

9 participants