Fix assertion error when rolling back to savepoint #3868

Merged: JelteF merged 3 commits into master from fix-rollback-to-savepoint on Jun 30, 2020

Conversation

@JelteF (Contributor) commented Jun 4, 2020

DESCRIPTION: Fixes crash when using rollback to savepoint after cancellation of DML

Fixes #3622

It was possible to get an assertion error if a DML command that had opened a
connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue
the transaction. The reason for this was that cancelling the command could
leave the `claimedExclusively` flag on for (some of) its connections.

This caused an assertion failure because `CanUseExistingConnection` would
return false and a new connection would be opened, and then there would be two
connections doing DML for the same placement, which is disallowed. That this
situation caused an assertion failure instead of an error means that, without
asserts, it could result in visibility bugs similar to the ones described in
#3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is
done.
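
For context, the failing scenario looks roughly like the following sketch,
modelled on the regression test added in this PR (the table setup, savepoint
name, and exact statements are illustrative rather than a copy of the test;
what matters is that the INSERT gets cancelled while it is still writing to
the shard placements):

```sql
CREATE TABLE t (a int);
SELECT create_distributed_table('t', 'a');

BEGIN;
SAVEPOINT s1;
-- Cancel the INSERT while it is copying rows to the shard placements. This
-- used to leave some of the placement connections claimed exclusively.
SET statement_timeout = '2s';
INSERT INTO t SELECT i FROM generate_series(1, 100000000) i; -- cancelled by the timeout
ROLLBACK TO SAVEPOINT s1;
-- Before this fix, the next DML on the same placements could hit the
-- assertion, because a second connection was opened for a placement that
-- still had an exclusively claimed connection.
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
ROLLBACK;
```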

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be
   improved to catch these types of errors.
2. I think we should make these DML and DDL asserts normal errors. This way
   we will not return incorrect data in case of bugs, but instead error out:
   https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419
   This code is complex and it's quite possible we missed some other
   edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the
   rolled-back statements took. We do not undo our hadDML and hadDDL
   flags when rolling back to a savepoint. This could result in some
   queries not being allowed on Citus that would actually be fine to
   execute. Changing this would require us to keep track, for each
   savepoint, of which placement connections had executed DDL/DML at that
   point.

codecov bot commented Jun 4, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%           
=======================================
  Files         185      185           
  Lines       36553    36554    +1     
=======================================
+ Hits        33473    33474    +1     
  Misses       3080     3080           

@JelteF force-pushed the fix-rollback-to-savepoint branch from 6da52a8 to d9480f3 on June 4, 2020 15:54

@JelteF force-pushed the fix-rollback-to-savepoint branch from d9480f3 to 7493eb7 on June 4, 2020 16:18
@@ -484,8 +484,7 @@ ResetShardPlacementTransactionState(void)

 /*
- * Subtransaction callback - currently only used to remember whether a
- * savepoint has been rolled back, as we don't support that.
+ * Subtransaction callback used to implement distributed ROLLBACK TO SAVEPOINT.
@JelteF (Contributor, Author):

(this comment seemed outdated so I changed it)

@JelteF force-pushed the fix-rollback-to-savepoint branch from 7493eb7 to f35c47a on June 5, 2020 08:26
codecov bot commented Jun 5, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   91.57%   91.58%   +0.01%     
==========================================
  Files         185      185              
  Lines       36553    36554       +1     
==========================================
+ Hits        33473    33478       +5     
+ Misses       3080     3076       -4     

@onderkalaci (Member) left a comment:

Some minor notes

@@ -90,6 +90,7 @@ test: multi_basic_queries multi_complex_expressions multi_subquery multi_subquer
 test: multi_subquery_complex_reference_clause multi_subquery_window_functions multi_view multi_sql_function multi_prepare_sql
 test: sql_procedure multi_function_in_join row_types materialized_view
 test: multi_subquery_in_where_reference_clause full_join adaptive_executor propagate_set_commands
+test: rollback_to_savepoint
@onderkalaci (Member): can we run it in parallel with other tests where parallelism is low?

@JelteF (Contributor, Author): For some weird reason, when running the test in parallel I get distributed deadlocks instead of timeouts: https://app.circleci.com/pipelines/github/citusdata/citus/9575/workflows/c434721e-d812-45ef-b830-1b86ee3f5d78/jobs/136125/steps

So I'll keep the test as is.

-- This timeout is chosen such that the INSERT with
-- generate_series(1, 100000000) is cancelled at the right time to trigger the
-- bug
SET statement_timeout = '2s';
@onderkalaci (Member): Hmm, what if in 2 seconds Citus cannot start the COPY command yet (e.g., it is still collecting the results of the SELECT)? Would we still trigger the bug?

Can we maybe convert this to a failure test where we fail once COPY starts?

@JelteF (Contributor, Author): I wasn't able to reproduce the original issue with our failure testing suite, so I'm keeping it as is.
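
(For reference, the failure-suite variant suggested above would roughly look like the sketch below, using the `citus.mitmproxy` helper from the failure-testing schedule to break the worker connection once the COPY phase of the INSERT ... SELECT begins. The trigger pattern and table setup are assumptions, and, as noted, this approach did not end up reproducing the bug.)

```sql
-- Hypothetical failure-test version of the scenario: instead of relying on a
-- 2s statement_timeout, make the worker connection fail as soon as the COPY
-- phase of the INSERT ... SELECT starts.
SELECT citus.mitmproxy('conn.onQuery(query="^COPY").kill()');

BEGIN;
SAVEPOINT s1;
INSERT INTO t SELECT i FROM generate_series(1, 100000000) i; -- fails via mitmproxy
ROLLBACK TO SAVEPOINT s1;
INSERT INTO t SELECT i FROM generate_series(1, 100) i;
ROLLBACK;

-- Let traffic through again.
SELECT citus.mitmproxy('conn.allow()');
```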

@metdos (Contributor) commented Jun 12, 2020

This PR seems to be approved. @JelteF, any reason not to merge it?

@JelteF (Contributor, Author) commented Jun 12, 2020

@metdos I still want to address Onder's feedback on tests and comments.

@JelteF force-pushed the fix-rollback-to-savepoint branch from 6e9fa72 to 9db4f68 on June 26, 2020 08:00
codecov bot commented Jun 26, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3868   +/-   ##
=======================================
  Coverage   91.57%   91.57%           
=======================================
  Files         185      185           
  Lines       36553    36554    +1     
=======================================
+ Hits        33473    33475    +2     
+ Misses       3080     3079    -1     

@JelteF force-pushed the fix-rollback-to-savepoint branch from 9db4f68 to f5ce254 on June 26, 2020 14:44
codecov bot commented Jun 26, 2020

Codecov Report

Merging #3868 into master will increase coverage by 0.58%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3868      +/-   ##
==========================================
+ Coverage   90.99%   91.57%   +0.58%     
==========================================
  Files         187      185       -2     
  Lines       37356    36554     -802     
==========================================
- Hits        33992    33475     -517     
+ Misses       3364     3079     -285     

@JelteF force-pushed the fix-rollback-to-savepoint branch 2 times, most recently from 52727f6 to 7302d7d on June 29, 2020 10:10
@JelteF force-pushed the fix-rollback-to-savepoint branch 2 times, most recently from c0314e3 to a7cd0a2 on June 30, 2020 08:51
@JelteF force-pushed the fix-rollback-to-savepoint branch from a7cd0a2 to b746ed2 on June 30, 2020 08:51
@JelteF merged commit 02fa942 into master on Jun 30, 2020
@JelteF deleted the fix-rollback-to-savepoint branch on June 30, 2020 09:31
JelteF added a commit that referenced this pull request Aug 18, 2022
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: a connection
breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.

Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests that it is run in
   parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally.

So I think it's time to give up on fixing this test and simply remove it.
The regression that this test protects against seems very unlikely to
reappear, since in #3868 I also added a big comment about the need for
the newly added `UnclaimConnection` call. So the need for the test is
quite small, and removing it will make our CI less flaky.

In case the cause of the bug ever gets found, I tracked it in #6189.

Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):
```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING:  connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT:  while executing command on localhost:57638
+ERROR:  connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975.
JelteF added a commit that referenced this pull request Sep 7, 2022
yxu2162 pushed a commit that referenced this pull request Sep 15, 2022
Successfully merging this pull request may close these issues: crash after rollback to savepoint

5 participants