Fix assertion error when rolling back to savepoint #3868
Conversation
Force-pushed from 76bf623 to 6da52a8.
Codecov Report
```
@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%
=======================================
  Files         185      185
  Lines       36553    36554    +1
=======================================
+ Hits        33473    33474    +1
  Misses       3080     3080
```
Force-pushed from 6da52a8 to d9480f3.
Codecov Report
```
@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%
=======================================
  Files         185      185
  Lines       36553    36554    +1
=======================================
+ Hits        33473    33474    +1
  Misses       3080     3080
```
Force-pushed from d9480f3 to 7493eb7.
```diff
@@ -484,8 +484,7 @@ ResetShardPlacementTransactionState(void)

 /*
- * Subtransaction callback - currently only used to remember whether a
- * savepoint has been rolled back, as we don't support that.
+ * Subtransaction callback used to implement distributed ROLLBACK TO SAVEPOINT.
```
(this comment seemed outdated so I changed it)
Force-pushed from 7493eb7 to f35c47a.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   91.57%   91.58%   +0.01%
==========================================
  Files         185      185
  Lines       36553    36554       +1
==========================================
+ Hits        33473    33478       +5
+ Misses       3080     3076       -4
```
Some minor notes
src/test/regress/multi_schedule (Outdated)
```diff
@@ -90,6 +90,7 @@ test: multi_basic_queries multi_complex_expressions multi_subquery multi_subquer
 test: multi_subquery_complex_reference_clause multi_subquery_window_functions multi_view multi_sql_function multi_prepare_sql
 test: sql_procedure multi_function_in_join row_types materialized_view
 test: multi_subquery_in_where_reference_clause full_join adaptive_executor propagate_set_commands
+test: rollback_to_savepoint
```
can we run it in parallel with other tests where parallelism is low?
For some weird reason when running the test in parallel I get distributed deadlocks, instead of timeouts: https://app.circleci.com/pipelines/github/citusdata/citus/9575/workflows/c434721e-d812-45ef-b830-1b86ee3f5d78/jobs/136125/steps
So I'll keep the test as is.
```sql
-- This timeout is chosen such that the INSERT with
-- generate_series(1, 100000000) is cancelled at the right time to trigger the
-- bug
SET statement_timeout = '2s';
```
Hmm, what if in 2 seconds Citus cannot start the COPY command yet (e.g., it is still collecting the results of the SELECT)? Would we still trigger the bug?
Can we maybe convert this to a failure test where we fail once COPY starts?
I wasn't able to reproduce the original issue with our failure testing suite. So I'm keeping it as is.
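For context, a minimal sketch of the scenario the new rollback_to_savepoint test exercises, pieced together from the snippets quoted in this thread. The table definition, distribution column, and savepoint name are illustrative assumptions, not the exact contents of the test file:

```sql
-- Sketch only: table and savepoint names are assumptions for illustration.
CREATE TABLE t (a int);
SELECT create_distributed_table('t', 'a');

BEGIN;
SAVEPOINT s1;
-- This timeout is chosen such that the INSERT below is cancelled while it
-- still has worker connections claimed for the DML.
SET statement_timeout = '2s';
INSERT INTO t SELECT i FROM generate_series(1, 100000000) i;  -- cancelled by the timeout
ROLLBACK TO SAVEPOINT s1;
-- Before the fix, continuing with DML here could trip the assertion, because
-- the cancelled statement left (some of) its connections claimed exclusively.
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
ROLLBACK;
```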
This PR seems to be approved, @JelteF any reason not to merge it?
@metdos I still want to address Onder's feedback on tests and comments.
Force-pushed from 6e9fa72 to 9db4f68.
Codecov Report
```
@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%
=======================================
  Files         185      185
  Lines       36553    36554    +1
=======================================
+ Hits        33473    33475    +2
+ Misses       3080     3079    -1
```
Force-pushed from 9db4f68 to f5ce254.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   90.99%   91.57%   +0.58%
==========================================
  Files         187      185       -2
  Lines       37356    36554     -802
==========================================
- Hits        33992    33475     -517
+ Misses       3364     3079     -285
```
Force-pushed from 52727f6 to 7302d7d.
Force-pushed from c0314e3 to a7cd0a2.
Fixes #3622

It was possible to get an assertion error if a DML command that had opened a connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue the transaction. The reason for this was that cancelling the transaction might leave the `claimedExclusively` flag on for (some of) its connections. This caused an assertion failure because `CanUseExistingConnection` would return false and a new connection would be opened, and then there would be two connections doing DML for the same placement, which is disallowed. That this situation caused an assertion failure instead of an error means that, without asserts, this could possibly result in some visibility bugs, similar to the ones described in #3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is done.

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be improved to catch this type of error.
2. I think we should make these DML and DDL asserts normal errors. This way we will not return incorrect data in case of bugs, but instead error out: https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419 This code is complex and it's quite possible we missed some other edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the rolled back statements took. We do not undo our hadDML and hadDDL flags when rolling back to a savepoint. This could result in some queries not being allowed on Citus that would actually be fine to execute. Changing this would require us to keep track, for each savepoint, of which placement connections had DDL/DML executed at that point.
Force-pushed from a7cd0a2 to b746ed2.
This removes a flaky test that I introduced in #3868 after I fixed the issue described in #3622. This test sometimes fails randomly in CI. The way it fails indicates that there might be some bug: a connection breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't understand what causes the failure. Things that I tried were:

1. Running the test with:
```sql
SET citus.force_max_query_parallelization = true;
```
2. Running the test with:
```sql
SET citus.max_adaptive_executor_pool_size = 1;
```
3. Running the test in parallel with the same tests that it is run in parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally. So I think it's time to give up on fixing this test and simply remove it. The regression that this test protects against seems very unlikely to reappear, since in #3868 I also added a big comment about the need for the newly added `UnclaimConnection` call. So I think the need for the test is quite small, and removing it will make our CI less flaky.

In case the cause of the bug ever gets found, I tracked the bug in #6189.

Example of a failing CI run: https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):
```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975.
DESCRIPTION: Fixes crash when using rollback to savepoint after cancellation of DML

Fixes #3622

It was possible to get an assertion error if a DML command that had opened a connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue the transaction. The reason for this was that cancelling the transaction might leave the `claimedExclusively` flag on for (some of) its connections. This caused an assertion failure because `CanUseExistingConnection` would return false and a new connection would be opened, and then there would be two connections doing DML for the same placement, which is disallowed. That this situation caused an assertion failure instead of an error means that, without asserts, this could possibly result in some visibility bugs, similar to the ones described in #3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is done.

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be improved to catch this type of error.
2. I think we should make these DML and DDL asserts normal errors. This way we will not return incorrect data in case of bugs, but instead error out: https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419 This code is complex and it's quite possible we missed some other edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the rolled back statements took. We do not undo our hadDML and hadDDL flags when rolling back to a savepoint. This could result in some queries not being allowed on Citus that would actually be fine to execute. Changing this would require us to keep track, for each savepoint, of which placement connections had DDL/DML executed at that point. A hedged illustration of this pattern follows below.
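To make point 3 above concrete, here is a minimal sketch of the kind of transaction it is about. The table name and the specific follow-up command are illustrative assumptions; the only claim taken from the description is that the internal hadDML flag stays set after the rollback even though the DML's effects are gone:

```sql
-- Sketch only: table name and follow-up behaviour are assumptions.
CREATE TABLE items (id int);
SELECT create_distributed_table('items', 'id');

BEGIN;
SAVEPOINT before_dml;
INSERT INTO items VALUES (1);        -- marks the transaction as having done DML (hadDML)
ROLLBACK TO SAVEPOINT before_dml;    -- undoes the insert, but hadDML is not reset
-- A later command in this transaction may now be restricted as if the DML had
-- really happened, even though it was rolled back (the hypothetical class of
-- queries point 3 says would actually be fine to execute).
COMMIT;
```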