Change the order in which the locks are acquired #7542
Conversation
Codecov Report
Additional details and impacted files:

@@ Coverage Diff @@
## main #7542 +/- ##
==========================================
- Coverage 89.68% 89.61% -0.08%
==========================================
Files 282 282
Lines 60460 60462 +2
Branches 7530 7530
==========================================
- Hits 54226 54183 -43
- Misses 4080 4126 +46
+ Partials 2154 2153 -1
Given that LockReferencedReferenceShardResources() acquires some local locks too via LockShardResource(), we probably wanted to acquire the global locks on all the replicated tables first, hence the ordering on the main branch.
To me, this feels more like a problem that's caused by not sorting the shardlist beforehand. There are four codepaths using SerializeNonCommutativeWrites() and two of them first sort the shardlist as in:
// first example
/* lock shards in a consistent order to prevent deadlock */
shardIntervalList = SortList(shardIntervalList, CompareShardIntervalsById);
SerializeNonCommutativeWrites(shardIntervalList, lockMode);
..
..
// second example:
/*
* Acquire the locks in a sorted way to avoid deadlocks due to lock
* ordering across concurrent sessions.
*/
anchorShardIntervalList =
SortList(anchorShardIntervalList, CompareShardIntervalsById);
The other two don't do that. So considering those other two code-paths, instead of the changes proposed in this PR, I'd suggest:
- Sorting the shardlist in multi_copy.c:2047 (SerializeNonCommutativeWrites), or better, sorting it before L2041 (LockShardListMetadata). (Should we also change the order of LockShardListMetadata() and SerializeNonCommutativeWrites() there? The current order also seems a bit problematic for COPY commands. Anyway, this is a separate issue.)
- Moving the call to SerializeNonCommutativeWrites() in AcquireExecutorShardLocksForExecution() to somewhere before the first call to AcquireExecutorShardLocksForRelationRowLockList() in the same function. The reason is that AcquireExecutorShardLocksForRelationRowLockList() also calls SerializeNonCommutativeWrites(), but it does so just for a single task. So moving that call earlier and sorting anchorShardIntervalList beforehand should ensure a consistent ordering. In other words, we need to populate anchorShardIntervalList in a separate loop before the loop that starts at distributed_execution_locks.c:201, sort it, and then call SerializeNonCommutativeWrites() for anchorShardIntervalList.
Also, thank you for showing what was happening on main in a separate commit -- that makes it so much easier to follow.
@onurctirtir For this bug, the code path does not include multi_copy.c (that can be considered a separate issue, possibly with test cases?). I do not think the changes from your second point would solve this problem. First, the shard lists passed to AcquireExecutorShardLocksForRelationRowLockList and SerializeNonCommutativeWrites are different. Second, as I wrote in the description, what causes the deadlock is acquiring locks for the target shards and reference shards in a different order (each list is itself sorted). This PR simply unifies this order.
@@ -728,7 +725,10 @@ SerializeNonCommutativeWrites(List *shardIntervalList, LOCKMODE lockMode)
		LockReferencedReferenceShardResources(firstShardInterval->shardId, lockMode);
	}
}

if (anyTableReplicated && ClusterHasKnownMetadataWorkers() && !IsFirstWorkerNode())
Note that with this change we are diverging from the behavior of Postgres by first locking the referenced tables/shards when there is a foreign key. Postgres acquires the locks on the main table first, then cascades into the referenced tables.
So could this change introduce new classes of concurrency issues where there are concurrent modifications to the referenced tables (e.g., distributed table) while there are modifications to the main table (e.g., reference table)?
Also, this code doesn't look like what the PR description says; it seems to do the opposite.
It feels like the code should be much more explicit about what we are doing. For example, the following code block is how I think this logic should look -- though probably the code could be nicer:
bool acquiredLocally = false

// simplest case, no query from any node
// first acquire the locks on the modified table
// then cascade into the referenced tables
if (!ClusterHasKnownMetadataWorkers())
    LockShardListResources(shardIntervalList, lockMode);
    for replicatedShard in replicatedShards
        LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
    acquiredLocally = true
else
    // moderate case, we are the first worker node
    // acquire the locks locally, same as the simplest case
    if IsFirstWorkerNode() && anyTableReplicated
        LockShardListResources(shardIntervalList, lockMode);
        for replicatedShard in replicatedShards
            LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
        acquiredLocally = true
    // ok, now we are on a worker that is not the first worker node, or we are on the coordinator
    else
        LockShardListResourcesOnFirstWorker(lockMode, replicatedShardList);
        for replicatedShard in replicatedShards
            LockReferencedReferenceShardResourcesOnTheFirstNode(..)

// ok, above we serialized all the modifications on the relevant tables across nodes
// by acquiring the locks on the first worker node
// so no two concurrent modifications across multiple nodes could interleave any modifications
// so, now let's always acquire the locks locally, as there are several other concurrency scenarios
// where we expect shard resource locks to be acquired locally
if !acquiredLocally
    LockShardListResources(shardIntervalList, lockMode);
    for replicatedShard in replicatedShards
        LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
Note that I think it is even OK to drop all code relevant to !ClusterHasKnownMetadataWorkers(), given that with Citus 11.0 we expect all clusters to have metadata synced. This is just a safe-guard in case old clusters upgraded to Citus 11.0 could not sync the metadata.
Thanks for the lead regarding the lock order.
I still think the PR description is right; please let me know if I missed anything. Prior to the changes, "LockShardListResources(shardIntervalList, lockMode);" was the last statement, which acquires the locks for the modified table locally. It is also the last lock acquired when the request is initiated from the first worker node. In particular, the PR changes the order in which the locks are acquired on the first worker node with respect to the node that received the request.
The first version acquired locks on the reference tables and then the modified table.
As you suggested, I changed the order. With the last commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.
I noticed your suggestion to acquire remote locks before local ones. In the current version, locks are acquired for the modified table on the first worker, followed by local locks. Similarly, locks for reference tables are obtained on the first worker and then locally. I haven’t incorporated that suggestion yet.
I'm quite sure the code in this PR is correct.
I do agree with Onder that the flow in this function is somewhat hard to grasp. But I don't think that's the fault of this PR. I feel like the main reason is that LockReferencedReferenceShardResources internally does both first-worker and local locking. So that function call is hiding some of the symmetry between the two types of locks. In Onder's pseudocode that symmetry is more pronounced, and thus the logic feels easier to follow.
Given we likely want to backport this change though, I don't think we should do such a refactor in this PR.
/*
 * Acquire locks on the modified table.
 * If the table is replicated, the locks are first acquired on the first worker node then locally.
Suggested change:
* If the table is replicated, the locks are first acquired on the first worker node then locally.
* But if we're already on the first worker, acquiring on the first worker node and acquiring locally are the same operation. So we only acquire locally in that case.
Integrated with the last commit.
Can you wrap the comment line a bit, so that it's not ~200 characters long? (I'm surprised the citus_indent/uncrustify didn't complain about that)
It actually complained about that line. I fixed it with the last commit.
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
This PR changes the order in which the locks are acquired (for the target and reference tables) when a modify request is initiated from a worker node that is not the "FirstWorkerNode". To prevent concurrent writes, locks are acquired on the first worker node for the replicated tables. When the update statement originates from the first worker node, it acquires the lock on the reference table(s) first, followed by the target table(s). However, if the update statement is initiated on another worker node, the lock requests are sent to the first worker in a different order. This PR unifies the modification order on the first worker node: with the third commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.

The first commit shows a sample output for the test prior to the fix.

Fixes #7477

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>