crash after rollback to savepoint #3622

On master, I did this:

backtrace was:

Comments
This seems to be fixed by #3403.
I tried that branch, and the bug is still there. Note that the case I mentioned doesn't use the adaptive executor.
JelteF added a commit that referenced this issue on Jun 4, 2020:

Fixes #3622

It was possible to get an assertion error if a DML command that had opened a connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue the transaction. The reason for this was that cancelling the transaction might leave the `claimedExclusively` flag on for (some of) its connections. This caused an assertion failure, because `CanUseExistingConnection` would return false and a new connection would be opened, so there would be two connections doing DML for the same placement, which is disallowed. That this situation caused an assertion failure instead of an error means that, without asserts, it could possibly result in visibility bugs similar to the ones described in #3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is done.

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be improved to catch these types of errors.
2. I think we should make these DML and DDL asserts normal errors. This way we will not return incorrect data in case of bugs, but instead error out: https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419 This code is complex, and it's quite possible we missed some other edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the rolled-back statements took. We do not undo our hadDML and hadDDL flags when rolling back to a savepoint. This could result in some queries not being allowed on Citus that would actually be fine to execute. Changing this would require us to keep track, for each savepoint, of which placement connections had DDL/DML executed at that point.
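For illustration, here is a minimal SQL sketch of the failure scenario described in the commit message above. It is not the reporter's exact reproduction: the distributed table name `t` and the use of `statement_timeout` as the cancellation mechanism are assumptions.

```sql
-- Hedged sketch, not the original repro: assumes a distributed table "t"
-- and forces cancellation of the in-progress DML with statement_timeout.
BEGIN;
SAVEPOINT s1;
SET LOCAL statement_timeout = '10ms';
-- This DML opens placement connections and is then cancelled by the timeout,
-- which could leave those connections marked claimedExclusively.
INSERT INTO t SELECT i FROM generate_series(1, 100000) i;
-- Continue the transaction despite the cancellation (this also undoes the
-- SET LOCAL above, restoring the previous statement_timeout).
ROLLBACK TO SAVEPOINT s1;
-- Before the fix, a follow-up DML on the same placement could trip the
-- assertion: CanUseExistingConnection refused the still-claimed connection,
-- so a second connection was opened for the same placement.
INSERT INTO t VALUES (1);
COMMIT;
```

With the fix, the "ROLLBACK TO SAVEPOINT" unclaims all connections, so the follow-up DML reuses the existing connection instead of opening a second one.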
JelteF added a commit that referenced this issue on Aug 18, 2022:

This removes a flaky test that I introduced in #3868 after I fixed the issue described in #3622. This test sometimes fails randomly in CI. The way it fails indicates that there might be some bug: a connection breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't understand what causes the failure. Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests that it is run in parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally. So I think it's time to give up on fixing this test and simply remove it. The regression that this test protects against seems very unlikely to reappear, since in #3868 I also added a big comment about the need for the newly added `UnclaimConnection` call. So I think the need for the test is quite small, and removing it will make our CI less flaky. In case the cause of the bug ever gets found, I tracked it in #6189.

Example of a failing CI run: https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):

```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975.