
Change the order in which the locks are acquired #7542

Merged
merged 7 commits into from
Mar 10, 2024
Conversation

eaydingol
Contributor

@eaydingol eaydingol commented Feb 27, 2024

DESCRIPTION: PR changes the order in which the locks are acquired for the target and reference tables, when a modify request is initiated from a worker node that is not the "FirstWorkerNode".

To prevent concurrent writes, locks are acquired on the first worker node for the replicated tables. When the update statement originates from the first worker node, it acquires the lock on the reference table(s) first, followed by the target table(s). However, if the update statement is initiated in another worker node, the lock requests are sent to the first worker in a different order. This PR unifies the modification order on the first worker node. With the third commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.

The first commit shows a sample output for the test prior to the fix.

Fixes #7477


codecov bot commented Feb 27, 2024

Codecov Report

Merging #7542 (043a981) into main (12f5643) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7542      +/-   ##
==========================================
- Coverage   89.68%   89.61%   -0.08%     
==========================================
  Files         282      282              
  Lines       60460    60462       +2     
  Branches     7530     7530              
==========================================
- Hits        54226    54183      -43     
- Misses       4080     4126      +46     
+ Partials     2154     2153       -1     

Member

@onurctirtir onurctirtir left a comment


Given that LockReferencedReferenceShardResources() acquires some local locks too via LockShardResource(), we probably wanted to acquire the global locks on all the replicated tables first, hence the ordering on the main branch.

To me, this feels more like a problem that's caused by not sorting the shardlist beforehand. There are four codepaths using SerializeNonCommutativeWrites() and two of them first sort the shardlist as in:

// first example
	/* lock shards in a consistent order to prevent deadlock */
	shardIntervalList = SortList(shardIntervalList, CompareShardIntervalsById);
	SerializeNonCommutativeWrites(shardIntervalList, lockMode);
..
..
// second example:
	/*
	 * Acquire the locks in a sorted way to avoid deadlocks due to lock
	 * ordering across concurrent sessions.
	 */
	anchorShardIntervalList =
		SortList(anchorShardIntervalList, CompareShardIntervalsById);

The other two don't do that. So, considering those other two code paths, instead of the changes proposed in this PR, I'd suggest:

  1. Sorting the shardlist in multi_copy.c:2047 (SerializeNonCommutativeWrites), or better, sorting it before L2041 (LockShardListMetadata).

    (Should we also swap the order of LockShardListMetadata() and SerializeNonCommutativeWrites() there? That also seems a bit problematic for COPY commands, but anyway, this is a separate issue.)

  2. Moving the call to SerializeNonCommutativeWrites() in AcquireExecutorShardLocksForExecution() to somewhere before the first call to AcquireExecutorShardLocksForRelationRowLockList() in the same function. The reason is that AcquireExecutorShardLocksForRelationRowLockList() also calls SerializeNonCommutativeWrites(), but only for a single task. Moving the call earlier and sorting anchorShardIntervalList beforehand should ensure a consistent ordering. In other words, we need to populate anchorShardIntervalList in a separate loop before the loop that starts at distributed_execution_locks.c:201, sort it, and then call SerializeNonCommutativeWrites() for anchorShardIntervalList.

@onurctirtir
Member

Also, thank you for showing what was happening on main in a separate commit --that makes it so much easier to follow.

@eaydingol
Contributor Author

@onurctirtir
It is a good idea to track the calls to SerializeNonCommutativeWrites, as the calls to the first worker are performed in it (which can cause the distributed deadlock).

For this bug, the code path does not include multi_copy.c (that can be considered a separate issue, possibly with its own test cases?).

I do not think the changes from your second point would solve this problem. First, the shard lists passed to AcquireExecutorShardLocksForRelationRowLockList and SerializeNonCommutativeWrites are different. Second, as I wrote in the description, what causes the deadlock is acquiring locks for the target shards and reference shards in different orders (each list is itself sorted). This PR simply unifies this order.

@@ -728,7 +725,10 @@ SerializeNonCommutativeWrites(List *shardIntervalList, LOCKMODE lockMode)
LockReferencedReferenceShardResources(firstShardInterval->shardId, lockMode);
}
}

if (anyTableReplicated && ClusterHasKnownMetadataWorkers() && !IsFirstWorkerNode())
Member


Note that with this change we are diverging from the behavior of Postgres by first locking the referenced tables/shards when there is a foreign key. Postgres acquires the locks on the main table first, then cascades into the referenced tables.

So could this change introduce new classes of concurrency issues where there are concurrent modifications to the ReferencedTables (e.g., distributed table) while there are modifications to the main table (e.g., reference table)?

Also, this code doesn't match what the PR description says; it seems to do the opposite.

It feels like the code should be much more explicit about what we are doing, for example, the following code block is how I think this logic should look like -- though probably the code could be nicer:

bool acquiredLocally = false

// simplest case: no query from any node
// first acquire the locks on the modified table,
// then cascade into the referenced tables
if (!ClusterHasKnownMetadataWorkers())
	LockShardListResources(shardIntervalList, lockMode);
	for replicatedShard in replicatedShards
		LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);

	acquiredLocally = true
else
	// moderate case: we are the first worker node
	// acquire the locks locally, same as the simplest case
	if IsFirstWorkerNode() && anyTableReplicated
		LockShardListResources(shardIntervalList, lockMode);
		for replicatedShard in replicatedShards
			LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);
		acquiredLocally = true

	// ok, now we are on a worker that is not the first worker node, or on the coordinator
	else
		LockShardListResourcesOnFirstWorker(lockMode, replicatedShardList);
		for replicatedShard in replicatedShards
			LockReferencedReferenceShardResourcesOnTheFirstNode(..)

// ok, above we serialized all the modifications on the relevant tables across nodes
// by acquiring the locks on the first worker node,
// so no two concurrent modifications across multiple nodes could interleave

// so, now let's always acquire the locks locally, as there are several other
// concurrency scenarios where we expect shard resource locks to be acquired locally
if !acquiredLocally
	LockShardListResources(shardIntervalList, lockMode);
	for replicatedShard in replicatedShards
		LockReferencedReferenceShardResourcesLocally(replicatedTable->firstShardInterval->shardId, lockMode);

Note that I think it is even OK to drop all code relevant to !ClusterHasKnownMetadataWorkers(), given that with Citus 11.0 we expect all clusters to have metadata synced. This is just a safeguard in case old clusters upgraded to Citus 11.0 could not sync the metadata.

Contributor Author

@eaydingol eaydingol Mar 4, 2024


Thanks for the lead regarding the lock order.

I still think the PR description is right; please let me know if I missed anything. Prior to the changes, LockShardListResources(shardIntervalList, lockMode) was the last statement, which acquires the locks for the modified table locally. It is also the last lock acquired when the request is initiated from the first worker node. In particular, the PR changes the order in which the locks are acquired on the first worker node with respect to the node that received the request. The first version acquired locks on the reference tables and then the modified table.

As you suggested, I changed the order. With the last commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.

I noticed your suggestion to acquire remote locks before local ones. In the current version, locks are acquired for the modified table on the first worker, followed by local locks. Similarly, locks for reference tables are obtained on the first worker and then locally. I haven’t incorporated that suggestion yet.

Contributor

@JelteF JelteF Mar 8, 2024


I'm quite sure the code in this PR is correct.

I do agree with Onder that the flow in this function is somewhat hard to grasp, but I don't think that's the fault of this PR. I feel like the main reason is that LockReferencedReferenceShardResources internally does both first-worker and local locking, so that function call hides some of the symmetry between the two types of locks. In Onder's pseudocode that symmetry is more pronounced, and thus the logic feels easier to follow.

Given we likely want to backport this change though, I don't think we should do such a refactor in this PR.


/*
* Acquire locks on the modified table.
* If the table is replicated, the locks are first acquired on the first worker node then locally.
Contributor

@JelteF JelteF Mar 8, 2024


Suggested change
* If the table is replicated, the locks are first acquired on the first worker node then locally.
* If the table is replicated, the locks are first acquired on the first worker node then locally.
* But if we're already on the first worker node, acquiring on the first worker node and acquiring locally are the same operation. So we only acquire locally in that case.

Contributor Author


Integrated with the last commit.

Contributor


Can you wrap the comment line a bit, so that it's not ~200 characters long? (I'm surprised the citus_indent/uncrustify didn't complain about that)

Contributor Author


It actually complained about that line. I fixed it with the last commit.

eaydingol and others added 3 commits March 8, 2024 18:10
Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
@eaydingol eaydingol enabled auto-merge (squash) March 8, 2024 15:26
@JelteF JelteF disabled auto-merge March 8, 2024 15:27
@eaydingol eaydingol merged commit 8afa2d0 into main Mar 10, 2024
157 checks passed
@eaydingol eaydingol deleted the issue7477 branch March 10, 2024 07:20
emelsimsek pushed a commit that referenced this pull request Mar 11, 2024
This PR changes the order in which the locks are acquired (for the
target and reference tables), when a modify request is initiated from a
worker node that is not the "FirstWorkerNode".


To prevent concurrent writes, locks are acquired on the first worker
node for the replicated tables. When the update statement originates
from the first worker node, it acquires the lock on the reference
table(s) first, followed by the target table(s). However, if the update
statement is initiated in another worker node, the lock requests are
sent to the first worker in a different order. This PR unifies the
modification order on the first worker node. With the third commit,
independent of the node that received the request, the locks are
acquired for the modified table and then the reference tables on the
first node.

The first commit shows a sample output for the test prior to the fix. 

Fixes #7477

---------

Co-authored-by: Jelte Fennema-Nio <jelte.fennema@microsoft.com>
JelteF pushed a commit that referenced this pull request Apr 16, 2024 (cherry picked from commit 8afa2d0)
JelteF pushed a commit that referenced this pull request Apr 16, 2024 (cherry picked from commit 8afa2d0)
JelteF pushed a commit that referenced this pull request Apr 16, 2024 (cherry picked from commit 8afa2d0)
JelteF pushed a commit that referenced this pull request Apr 17, 2024 (cherry picked from commit 8afa2d0)

Successfully merging this pull request may close these issues.

Distributed deadlock after issuing a simple UPDATE statement
4 participants