Fix assertion error when rolling back to savepoint #3868

Merged: JelteF merged 3 commits into master from fix-rollback-to-savepoint on Jun 30, 2020

Conversation

@JelteF (Contributor) commented Jun 4, 2020

DESCRIPTION: Fixes crash when using rollback to savepoint after cancellation of DML

Fixes #3622

It was possible to get an assertion error if a DML command that had opened a
connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue
the transaction. The reason for this was that cancelling the command could
leave the `claimedExclusively` flag on for (some of) its connections.

This caused an assertion failure because `CanUseExistingConnection` would
return false and a new connection would be opened, and then there would be two
connections doing DML for the same placement, which is disallowed. That this
situation caused an assertion failure instead of an error means that, without
asserts, it could result in visibility bugs similar to the ones described in
#3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is
done.
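
For context, the failing scenario looks roughly like the following sketch,
modelled on the regression test added in this PR (the table setup, savepoint
name, and exact statements are illustrative rather than a copy of the test;
what matters is that the INSERT gets cancelled while it is still writing to
the shard placements):

```sql
CREATE TABLE t (a int);
SELECT create_distributed_table('t', 'a');

BEGIN;
SAVEPOINT s1;
-- Cancel the INSERT while it is copying rows to the shard placements. This
-- used to leave some of the placement connections claimed exclusively.
SET statement_timeout = '2s';
INSERT INTO t SELECT i FROM generate_series(1, 100000000) i; -- cancelled by the timeout
ROLLBACK TO SAVEPOINT s1;
-- Before this fix, the next DML on the same placements could hit the
-- assertion, because a second connection was opened for a placement that
-- still had an exclusively claimed connection.
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
ROLLBACK;
```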

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be
   improved to catch these types of errors.
2. I think we should make these DML and DDL asserts normal errors. This way
   we will not return incorrect data in case of bugs, but instead error out:
   https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419
   This code is complex and it's quite possible we missed some other
   edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the
   rolled-back statements took. We do not undo our hadDML and hadDDL
   flags when rolling back to a savepoint. This could result in some
   queries not being allowed on Citus that would actually be fine to
   execute. Changing this would require us to keep track, for each
   savepoint, of which placement connections had executed DDL/DML at that
   point.

codecov bot commented Jun 4, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%           
=======================================
  Files         185      185           
  Lines       36553    36554    +1     
=======================================
+ Hits        33473    33474    +1     
  Misses       3080     3080           

@JelteF force-pushed the fix-rollback-to-savepoint branch from 6da52a8 to d9480f3 on June 4, 2020 15:54

@JelteF force-pushed the fix-rollback-to-savepoint branch from d9480f3 to 7493eb7 on June 4, 2020 16:18
@@ -484,8 +484,7 @@ ResetShardPlacementTransactionState(void)

 /*
- * Subtransaction callback - currently only used to remember whether a
- * savepoint has been rolled back, as we don't support that.
+ * Subtransaction callback used to implement distributed ROLLBACK TO SAVEPOINT.
@JelteF (Contributor, Author):

(this comment seemed outdated so I changed it)

@JelteF force-pushed the fix-rollback-to-savepoint branch from 7493eb7 to f35c47a on June 5, 2020 08:26
codecov bot commented Jun 5, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   91.57%   91.58%   +0.01%     
==========================================
  Files         185      185              
  Lines       36553    36554       +1     
==========================================
+ Hits        33473    33478       +5     
+ Misses       3080     3076       -4     

@onderkalaci (Member) left a comment:

Some minor notes

@@ -90,6 +90,7 @@ test: multi_basic_queries multi_complex_expressions multi_subquery multi_subquer
 test: multi_subquery_complex_reference_clause multi_subquery_window_functions multi_view multi_sql_function multi_prepare_sql
 test: sql_procedure multi_function_in_join row_types materialized_view
 test: multi_subquery_in_where_reference_clause full_join adaptive_executor propagate_set_commands
+test: rollback_to_savepoint
@onderkalaci (Member): can we run it in parallel with other tests where parallelism is low?

@JelteF (Contributor, Author): For some weird reason, when running the test in parallel I get distributed deadlocks instead of timeouts: https://app.circleci.com/pipelines/github/citusdata/citus/9575/workflows/c434721e-d812-45ef-b830-1b86ee3f5d78/jobs/136125/steps

So I'll keep the test as is.

-- This timeout is chosen such that the INSERT with
-- generate_series(1, 100000000) is cancelled at the right time to trigger the
-- bug
SET statement_timeout = '2s';
@onderkalaci (Member): Hmm, what if in 2 seconds Citus cannot start the COPY command yet (e.g., it is still collecting the results of the SELECT)? Would we still trigger the bug?

Can we maybe convert this to a failure test where we fail once COPY starts?

@JelteF (Contributor, Author): I wasn't able to reproduce the original issue with our failure testing suite, so I'm keeping it as is.
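
(For reference, the failure-suite variant suggested above would roughly look like the sketch below, using the `citus.mitmproxy` helper from the failure-testing schedule to break the worker connection once the COPY phase of the INSERT ... SELECT begins. The trigger pattern and table setup are assumptions, and, as noted, this approach did not end up reproducing the bug.)

```sql
-- Hypothetical failure-test version of the scenario: instead of relying on a
-- 2s statement_timeout, make the worker connection fail as soon as the COPY
-- phase of the INSERT ... SELECT starts.
SELECT citus.mitmproxy('conn.onQuery(query="^COPY").kill()');

BEGIN;
SAVEPOINT s1;
INSERT INTO t SELECT i FROM generate_series(1, 100000000) i; -- fails via mitmproxy
ROLLBACK TO SAVEPOINT s1;
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
ROLLBACK;

-- Let traffic through again.
SELECT citus.mitmproxy('conn.allow()');
```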

@metdos (Contributor) commented Jun 12, 2020

This PR seems to be approved. @JelteF, any reason not to merge it?

@JelteF (Contributor, Author) commented Jun 12, 2020

@metdos I still want to address Onder's feedback on tests and comments.

@JelteF force-pushed the fix-rollback-to-savepoint branch from 6e9fa72 to 9db4f68 on June 26, 2020 08:00
codecov bot commented Jun 26, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%           
=======================================
  Files         185      185           
  Lines       36553    36554    +1     
=======================================
+ Hits        33473    33475    +2     
+ Misses       3080     3079    -1     

@JelteF force-pushed the fix-rollback-to-savepoint branch from 9db4f68 to f5ce254 on June 26, 2020 14:44
codecov bot commented Jun 26, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.58%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   90.99%   91.57%   +0.58%     
==========================================
  Files         187      185       -2     
  Lines       37356    36554     -802     
==========================================
- Hits        33992    33475     -517     
+ Misses       3364     3079     -285     

@JelteF force-pushed the fix-rollback-to-savepoint branch 2 times, most recently from 52727f6 to 7302d7d on June 29, 2020 10:10
@JelteF force-pushed the fix-rollback-to-savepoint branch 2 times, most recently from c0314e3 to a7cd0a2 on June 30, 2020 08:51
@JelteF force-pushed the fix-rollback-to-savepoint branch from a7cd0a2 to b746ed2 on June 30, 2020 08:51
@JelteF merged commit 02fa942 into master on Jun 30, 2020
@JelteF deleted the fix-rollback-to-savepoint branch on June 30, 2020 09:31
JelteF added a commit that referenced this pull request Aug 18, 2022
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: a connection
breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.

Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests that it is run in
   parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally.

So I think it's time to give up on fixing this test and simply remove it.
The regression that this test protects against seems very unlikely to
reappear, since in #3868 I also added a big comment about the need for
the newly added `UnclaimConnection` call. So the need for the test is
quite small, and removing it will make our CI less flaky.

In case the cause of the bug ever gets found, I tracked it in #6189.

Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):
```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING:  connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT:  while executing command on localhost:57638
+ERROR:  connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975.
JelteF added a commit that referenced this pull request Sep 7, 2022
yxu2162 pushed a commit that referenced this pull request Sep 15, 2022
Successfully merging this pull request may close these issues: crash after rollback to savepoint

5 participants