Change the order in which the locks are acquired #7542
Merged
Commits (author: eaydingol):
803557d  A test with non-deterministic results illustrating the bug.
880553e  Change the order in which the locks are acquired.
c3bad5a  First acquire locks on the modified tables, then the reference ones
b28b5a9  Update comment
99d73fd  update comment
303261d  Merge branch 'main' into issue7477
043a981  style
@@ -0,0 +1,62 @@
--- Test for updating a table that has a foreign key reference to another reference table.
--- Issue #7477: Distributed deadlock after issuing a simple UPDATE statement
--- https://github.com/citusdata/citus/issues/7477
CREATE TABLE table1 (id INT PRIMARY KEY);
SELECT create_reference_table('table1');
 create_reference_table
---------------------------------------------------------------------

(1 row)

INSERT INTO table1 VALUES (1);
CREATE TABLE table2 (
    id INT,
    info TEXT,
    CONSTRAINT table1_id_fk FOREIGN KEY (id) REFERENCES table1 (id)
);
SELECT create_reference_table('table2');
 create_reference_table
---------------------------------------------------------------------

(1 row)

INSERT INTO table2 VALUES (1, 'test');
--- Runs the update command in parallel on workers.
--- Due to bug #7477, before the fix, the result is non-deterministic
--- and have several rows of the form:
--- localhost | 57638 | f | ERROR: deadlock detected
--- localhost | 57637 | f | ERROR: deadlock detected
--- localhost | 57637 | f | ERROR: canceling the transaction since it was involved in a distributed deadlock
SELECT * FROM master_run_on_worker(
    ARRAY['localhost', 'localhost','localhost', 'localhost','localhost',
          'localhost','localhost', 'localhost','localhost', 'localhost']::text[],
    ARRAY[57638, 57637, 57637, 57638, 57637, 57638, 57637, 57638, 57638, 57637]::int[],
    ARRAY['UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1'
    ]::text[],
    true);
 node_name | node_port | success | result
---------------------------------------------------------------------
 localhost | 57638 | t | UPDATE 1
 localhost | 57637 | t | UPDATE 1
 localhost | 57637 | t | UPDATE 1
 localhost | 57638 | t | UPDATE 1
 localhost | 57637 | t | UPDATE 1
 localhost | 57638 | t | UPDATE 1
 localhost | 57637 | t | UPDATE 1
 localhost | 57638 | t | UPDATE 1
 localhost | 57638 | t | UPDATE 1
 localhost | 57637 | t | UPDATE 1
(10 rows)

--- cleanup
DROP TABLE table2;
DROP TABLE table1;
@@ -0,0 +1,44 @@

--- Test for updating a table that has a foreign key reference to another reference table.
--- Issue #7477: Distributed deadlock after issuing a simple UPDATE statement
--- https://github.com/citusdata/citus/issues/7477

CREATE TABLE table1 (id INT PRIMARY KEY);
SELECT create_reference_table('table1');
INSERT INTO table1 VALUES (1);

CREATE TABLE table2 (
    id INT,
    info TEXT,
    CONSTRAINT table1_id_fk FOREIGN KEY (id) REFERENCES table1 (id)
);
SELECT create_reference_table('table2');
INSERT INTO table2 VALUES (1, 'test');

--- Runs the update command in parallel on workers.
--- Due to bug #7477, before the fix, the result is non-deterministic
--- and have several rows of the form:
--- localhost | 57638 | f | ERROR: deadlock detected
--- localhost | 57637 | f | ERROR: deadlock detected
--- localhost | 57637 | f | ERROR: canceling the transaction since it was involved in a distributed deadlock

SELECT * FROM master_run_on_worker(
    ARRAY['localhost', 'localhost','localhost', 'localhost','localhost',
          'localhost','localhost', 'localhost','localhost', 'localhost']::text[],
    ARRAY[57638, 57637, 57637, 57638, 57637, 57638, 57637, 57638, 57638, 57637]::int[],
    ARRAY['UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1',
          'UPDATE table2 SET info = ''test_update'' WHERE id = 1'
    ]::text[],
    true);

--- cleanup
DROP TABLE table2;
DROP TABLE table1;
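The deadlock errors mentioned in the test comments are the kind of failure that typically appears when concurrent sessions take the same locks in different orders. The following is a minimal single-process sketch of that general mechanism, not Citus code: the session names, the round-robin scheduler, and the helpers `simulate` and `find_cycle` are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Follow 'session -> session it waits for' edges and return a cycle, if any."""
    for start in waits_for:
        path: List[str] = []
        node = start
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return None


def simulate(orders: Dict[str, List[str]]) -> Optional[List[str]]:
    """Toy round-robin scheduler (an assumption, not how Postgres schedules).

    Each session takes its locks in the given order and releases them all
    once it holds every lock. Returns the deadlock cycle, or None when all
    sessions finish.
    """
    holder: Dict[str, str] = {}            # lock name -> session holding it
    pos = {session: 0 for session in orders}
    while True:
        waits_for: Dict[str, str] = {}
        progressed = False
        for session, order in orders.items():
            if pos[session] == len(order):
                continue                    # session already finished
            lock = order[pos[session]]
            owner = holder.get(lock)
            if owner is None:
                holder[lock] = session
                pos[session] += 1
                progressed = True
                if pos[session] == len(order):
                    # Finished: release every lock this session holds.
                    for held in [l for l, o in holder.items() if o == session]:
                        del holder[held]
            else:
                waits_for[session] = owner
        if not progressed:
            return find_cycle(waits_for)


if __name__ == "__main__":
    # Opposite orders: each session ends up waiting on the other.
    print(simulate({"A": ["table2", "table1"], "B": ["table1", "table2"]}))
    # Same order: the second session simply waits its turn.
    print(simulate({"A": ["table2", "table1"], "B": ["table2", "table1"]}))
```

In this model the opposite-order run stops with both sessions waiting on each other, while the same-order run completes, which is the motivation for making the acquisition order independent of where a request starts.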
Note that with this change we are diverging from Postgres's behavior by first locking the referenced tables/shards when there is a foreign key. Postgres acquires the locks on the main table first, then cascades into the referenced tables.
So could this change introduce new classes of concurrency issues where there are concurrent modifications to the `ReferencedTables` (e.g., distributed table) while there are modifications to the main table (e.g., reference table)?
Also, this code doesn't look like what the PR description says; it seems to be doing the opposite.
It feels like the code should be much more explicit about what we are doing. For example, the following code block is how I think this logic should look, though the code could probably be nicer:
Note that I think it is even OK to drop all code relevant to `!ClusterHasKnownMetadataWorkers()`, given that with Citus 11.0 we expect all clusters to have metadata synced. This is like a safeguard in case old clusters upgraded to Citus 11.0 could not sync the metadata.

Thanks for the lead regarding the lock order.
I still think the PR description is right; please let me know if I missed anything. Prior to the changes, `LockShardListResources(shardIntervalList, lockMode);` was the last statement, which acquires the locks for the modified table locally. It is also the last lock acquired when the request is initiated from the first worker node. In particular, the PR changes the order in which the locks are acquired on the first worker node with respect to the node that received the request.
The first version acquired locks on the reference tables and then the modified table. As you suggested, I changed the order. With the last commit, independent of the node that received the request, the locks are acquired for the modified table and then the reference tables on the first node.
I noticed your suggestion to acquire remote locks before local ones. In the current version, locks are acquired for the modified table on the first worker, followed by local locks. Similarly, locks for reference tables are obtained on the first worker and then locally. I haven't incorporated that suggestion yet.
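
The ordering described in this reply can be modeled as a small pure function. This is only a sketch of the description above, not the actual Citus implementation: the node names, the `FIRST_WORKER` constant, and the helpers `lock_sequence` and `on_node` are hypothetical.

```python
from typing import List, Tuple

FIRST_WORKER = "worker-1"  # hypothetical name for the first worker node


def lock_sequence(receiving_node: str,
                  modified: List[str],
                  referenced: List[str]) -> List[Tuple[str, str]]:
    """Return (node, table) pairs in the order the locks would be taken.

    Modified tables come first, then referenced tables. Within each group
    the lock is taken on the first worker and then on the node that
    received the request (the "local" lock).
    """
    steps: List[Tuple[str, str]] = []
    for group in (modified, referenced):
        for node in (FIRST_WORKER, receiving_node):
            for table in sorted(group):
                step = (node, table)
                if step not in steps:  # first worker and local node may coincide
                    steps.append(step)
    return steps


def on_node(steps: List[Tuple[str, str]], node: str) -> List[str]:
    """Tables locked on one node, in acquisition order."""
    return [table for n, table in steps if n == node]


if __name__ == "__main__":
    for receiver in ("worker-1", "worker-2"):
        steps = lock_sequence(receiver, ["table2"], ["table1"])
        print(receiver, on_node(steps, FIRST_WORKER))
```

Whichever node receives the request, the sequence observed on the first worker is the modified table followed by the referenced table, which is the property the reply says the last commit establishes.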
I'm quite sure the code in this PR is correct.
I do agree with Onder that the flow in this function is somewhat hard to grasp, but I don't think that's the fault of this PR. I feel the main reason is that `LockReferencedReferenceShardResources` internally does both first-worker and local locking, so that function call hides some of the symmetry between the two types of locks. In Onder's pseudocode that symmetry is more pronounced, and thus the logic feels easier to follow.
Given that we likely want to backport this change, though, I don't think we should do such a refactor in this PR.