Change the order in which the locks are acquired #7542
Conversation
Codecov Report
Additional details and impacted files:

@@ Coverage Diff @@
## main #7542 +/- ##
==========================================
- Coverage 89.68% 89.61% -0.08%
==========================================
Files 282 282
Lines 60460 60462 +2
Branches 7530 7530
==========================================
- Hits 54226 54183 -43
- Misses 4080 4126 +46
+ Partials 2154 2153 -1
Given that LockReferencedReferenceShardResources() acquires some local locks too via LockShardResource(), we probably wanted to acquire the global locks on all the replicated tables first, hence the ordering on the main branch.
To me, this feels more like a problem that's caused by not sorting the shardlist beforehand. There are four codepaths using SerializeNonCommutativeWrites() and two of them first sort the shardlist as in:
// first example
/* lock shards in a consistent order to prevent deadlock */
shardIntervalList = SortList(shardIntervalList, CompareShardIntervalsById);
SerializeNonCommutativeWrites(shardIntervalList, lockMode);
..
..
// second example:
/*
* Acquire the locks in a sorted way to avoid deadlocks due to lock
* ordering across concurrent sessions.
*/
anchorShardIntervalList =
SortList(anchorShardIntervalList, CompareShardIntervalsById);
The other two don't do that. So considering those other two code-paths, instead of the changes proposed in this PR, I'd suggest:
- Sorting the shardlist in multi_copy.c:2047 (SerializeNonCommutativeWrites), or better, sorting it before L2041 (LockShardListMetadata). (Should we also change the order of LockShardListMetadata() and SerializeNonCommutativeWrites() there? The current order also seems a bit problematic for COPY commands. Anyway, this is a separate issue.)
- Moving the call to SerializeNonCommutativeWrites() in AcquireExecutorShardLocksForExecution() to somewhere before the first call to AcquireExecutorShardLocksForRelationRowLockList() in the same function. The reason is that AcquireExecutorShardLocksForRelationRowLockList() also calls SerializeNonCommutativeWrites(), but it does so just for a single task. So moving that call earlier and sorting anchorShardIntervalList beforehand should ensure a consistent ordering. In other words, we need to populate anchorShardIntervalList in a separate loop before the loop that starts at distributed_execution_locks.c:201, sort it, and then call SerializeNonCommutativeWrites() for anchorShardIntervalList.
Also, thank you for showing what was happening on main in a separate commit -- that makes it so much easier to follow.
@onurctirtir For this bug, the code path does not include multi_copy.c (that can be considered a separate issue, possibly with test cases?). I do not think the changes from your second point would solve this problem. First, the shard lists passed to AcquireExecutorShardLocksForRelationRowLockList and SerializeNonCommutativeWrites are different. Second, as I wrote in the description, what causes the deadlock is acquiring locks for the target shards and reference shards in a different order (each list is itself sorted). This PR simply unifies this order.
@@ -728,7 +725,10 @@ SerializeNonCommutativeWrites(List *shardIntervalList, LOCKMODE lockMode)
		LockReferencedReferenceShardResources(firstShardInterval->shardId, lockMode);
	}
}

if (anyTableReplicated && ClusterHasKnownMetadataWorkers() && !IsFirstWorkerNode())
Note that with this change we are diverging from the behavior of Postgres by first locking the referenced tables/shards when there is a foreign key. Postgres acquires the locks on the main table first, then cascades into the referenced tables.
So could this change introduce new classes of concurrency issues where there are concurrent modifications to the referenced tables (e.g., distributed table) while there are modifications to the main table (e.g., reference table)?
Also, this code doesn't look like what the PR description says; it seems to do the opposite.
It feels like the code should be much more explicit about what we are doing. For example, the following code block is how I think this logic should look -- though probably the code could be nicer:
bool acquiredLocally = false

// simplest case, no query from any node
// first acquire the locks on the modified table
// then cascade into the referenced tables
if (!ClusterHasKnownMetadataWorkers())
    LockShardListResources(shardIntervalList, lockMode);
    for replicatedShard in replicatedShards
        LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
    acquiredLocally = true
else
    // moderate case, we are the first worker node
    // acquire the locks locally, same as the simplest case
    if IsFirstWorkerNode() && anyTableReplicated
        LockShardListResources(shardIntervalList, lockMode);
        for replicatedShard in replicatedShards
            LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
        acquiredLocally = true
    // ok, now we are on a worker that is not the first worker node, or we are on the coordinator
    else
        LockShardListResourcesOnFirstWorker(lockMode, replicatedShardList);
        for replicatedShard in replicatedShards
            LockReferencedReferenceShardResourcesOnTheFirstNode(..)

// ok, above we serialized all the modifications on the relevant tables across nodes
// by acquiring the locks on the first worker node
// so no two concurrent modifications across multiple nodes could interleave any modifications
// so, now let's always acquire the locks locally, as there are several other concurrency scenarios
// where we expect shard resource locks to be acquired locally
if !acquiredLocally
    LockShardListResources(shardIntervalList, lockMode);
    for replicatedShard in replicatedShards
        LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
Note that I think it is even OK to drop all code relevant to !ClusterHasKnownMetadataWorkers(), given that with Citus 11.0 we expect all clusters to have metadata synced. This is just a safe-guard in case old clusters upgraded to Citus 11.0 could not sync the metadata.
Thanks for the lead regarding the lock order.
I still think the PR description is right; please let me know if I missed anything. Prior to the changes, "LockShardListResources(shardIntervalList, lockMode);" was the last statement, which acquires the locks for the modified table locally. It is also the last lock acquired when the request is initiated from the first worker node. In particular, the PR changes the order in which the locks are acquired on the first worker node with respect to the node that received the request.
The first version acquired locks on the reference tables and then the modified table.
As you suggested, I changed the order. With the last commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.
I noticed your suggestion to acquire remote locks before local ones. In the current version, locks are acquired for the modified table on the first worker, followed by local locks. Similarly, locks for reference tables are obtained on the first worker and then locally. I haven’t incorporated that suggestion yet.
I'm quite sure the code in this PR is correct.
I do agree with Onder that the flow in this function is somewhat hard to grasp. But I don't think that's the fault of this PR. I feel like the main reason is that LockReferencedReferenceShardResources internally does both first-worker and local locking. So that function call is hiding some of the symmetry between the two types of locks. In Onder's pseudocode that symmetry is more pronounced, and thus the logic feels easier to follow.
Given we likely want to backport this change though, I don't think we should do such a refactor in this PR.
/*
 * Acquire locks on the modified table.
 * If the table is replicated, the locks are first acquired on the first worker node then locally.
Suggested change:
* If the table is replicated, the locks are first acquired on the first worker node then locally.
* But if we're already on the first worker, acquiring on the first worker node and acquiring locally are the same operation. So we only acquire locally in that case.
Integrated with the last commit.
Can you wrap the comment line a bit, so that it's not ~200 characters long? (I'm surprised the citus_indent/uncrustify didn't complain about that)
It actually complained about that line. I fixed it with the last commit.
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
This PR changes the order in which the locks are acquired (for the target and reference tables) when a modify request is initiated from a worker node that is not the "FirstWorkerNode". To prevent concurrent writes, locks are acquired on the first worker node for the replicated tables. When the update statement originates from the first worker node, it acquires the lock on the reference table(s) first, followed by the target table(s). However, if the update statement is initiated on another worker node, the lock requests are sent to the first worker in a different order. This PR unifies the modification order on the first worker node: with the third commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.

The first commit shows a sample output for the test prior to the fix.

Fixes #7477

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>