crash after rollback to savepoint #3622

On master, I did this:

backtrace was:

Comments
This seems to be fixed by #3403.
I tried that branch, and the bug is still there. Note that the case I mentioned doesn't use the adaptive executor.
JelteF added a commit that referenced this issue on Jun 4, 2020:

Fixes #3622

It was possible to get an assertion error if a DML command that had opened a connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to continue the transaction. The reason for this was that cancelling the transaction might leave the `claimedExclusively` flag on for (some of) its connections. This caused an assertion failure, because `CanUseExistingConnection` would return false and a new connection would be opened, so there would be two connections doing DML for the same placement, which is disallowed. That this situation caused an assertion failure instead of an error means that, without asserts, it could possibly result in visibility bugs similar to the ones described in #3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT" is done.

This specific issue also highlights some other issues:

1. Just like #3867, it shows that `CanUseExistingConnection` should be improved to catch these types of errors.
2. I think we should make these DML and DDL asserts normal errors. This way we will not return incorrect data in case of bugs, but instead error out: https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419 This code is complex, and it's quite possible we missed some other edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undoes the locks that the rolled-back statements took. We do not undo our hadDML and hadDDL flags when rolling back to a savepoint. This could result in some queries not being allowed on Citus that would actually be fine to execute. Changing this would require us to keep track, for each savepoint, of which placement connections had DDL/DML executed at that point.
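For illustration, here is a minimal SQL sketch of the failure scenario described in the commit message above. It is not the reporter's exact reproduction: the distributed table name `t` and the use of `statement_timeout` as the cancellation mechanism are assumptions.

```sql
-- Hedged sketch, not the original repro: assumes a distributed table "t"
-- and forces cancellation of the in-progress DML with statement_timeout.
BEGIN;
SAVEPOINT s1;
SET LOCAL statement_timeout = '10ms';
-- This DML opens placement connections and is then cancelled by the timeout,
-- which could leave those connections marked claimedExclusively.
INSERT INTO t SELECT i FROM generate_series(1, 100000) i;
-- Continue the transaction despite the cancellation (this also undoes the
-- SET LOCAL above, restoring the previous statement_timeout).
ROLLBACK TO SAVEPOINT s1;
-- Before the fix, a follow-up DML on the same placement could trip the
-- assertion: CanUseExistingConnection refused the still-claimed connection,
-- so a second connection was opened for the same placement.
INSERT INTO t VALUES (1);
COMMIT;
```

With the fix, the "ROLLBACK TO SAVEPOINT" unclaims all connections, so the follow-up DML reuses the existing connection instead of opening a second one.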
JelteF added a commit that referenced this issue on Aug 18, 2022:

This removes a flaky test that I introduced in #3868 after I fixed the issue described in #3622. This test sometimes fails randomly in CI. The way it fails indicates that there might be some bug: a connection breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't understand what causes the failure. Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests that it is run in parallel with in multi_schedule.

None of these allowed me to reproduce the issue locally. So I think it's time to give up on fixing this test and simply remove it. The regression that this test protects against seems very unlikely to reappear, since in #3868 I also added a big comment about the need for the newly added `UnclaimConnection` call. So I think the need for the test is quite small, and removing it will make our CI less flaky. In case the cause of the bug ever gets found, I tracked it in #6189.

Example of a failing CI run: https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (so both warnings and an error):

```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING: connection to the remote node localhost:57638 failed with the following error:
+WARNING:
+CONTEXT: while executing command on localhost:57638
+ERROR: connection to the remote node localhost:57638 failed with the following error:
 ROLLBACK;
```

This test is also mentioned as the most failing regression test in #5975.