crash after rollback to savepoint #3622

Closed
pykello opened this issue Mar 17, 2020 · 2 comments · Fixed by #3868
pykello commented Mar 17, 2020

On master, I did this:

```sql
create table t(a int);
select create_distributed_table('t', 'a');

begin;
insert into t values (4);
savepoint s1;
insert into t select i from generate_series(1, 10000000) i;
-- (wait for 2 seconds and then cancel)
rollback to savepoint s1;
insert into t select i from generate_series(1, 10000000) i;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 1199.724 ms (00:01.200)
```

backtrace was:

```
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f30599c7801 in __GI_abort () at abort.c:79
#2  0x00005641de0258b3 in ExceptionalCondition (
    conditionName=conditionName@entry=0x7f30569aed30 "!(!placementConnection->hadDML)", 
    errorType=errorType@entry=0x7f30569a7c50 "FailedAssertion", 
    fileName=fileName@entry=0x7f30569aebc8 "connection/placement_connection.c", lineNumber=lineNumber@entry=405) at assert.c:54
#3  0x00007f3056932439 in AssignPlacementListToConnection (placementAccessList=placementAccessList@entry=0x5641e0bdf028, 
    connection=connection@entry=0x5641e0d6bc28) at connection/placement_connection.c:405
#4  0x00007f30569324f6 in StartPlacementListConnection (flags=12, placementAccessList=0x5641e0bdf028, 
    userName=userName@entry=0x5641e04e3c68 "hadi") at connection/placement_connection.c:322
#5  0x00007f30569325e4 in StartPlacementConnection (flags=<optimized out>, placement=placement@entry=0x5641e0b5d9c8, 
    userName=userName@entry=0x5641e04e3c68 "hadi") at connection/placement_connection.c:257
#6  0x00007f3056932606 in GetPlacementConnection (flags=<optimized out>, placement=placement@entry=0x5641e0b5d9c8, 
    userName=userName@entry=0x5641e04e3c68 "hadi") at connection/placement_connection.c:210
#7  0x00007f30569270a0 in CopyGetPlacementConnection (stopOnFailure=false, placement=0x5641e0b5d9c8)
    at commands/multi_copy.c:3228
#8  InitializeCopyShardState (stopOnFailure=false, shardId=<optimized out>, connectionStateHash=0x5641e07c8618, 
    shardState=0x5641e0794fd8) at commands/multi_copy.c:3138
#9  GetShardState (found=0x7ffdf14c9bf2, stopOnFailure=false, connectionStateHash=0x5641e07c8618, 
    shardStateHash=<optimized out>, shardId=<optimized out>) at commands/multi_copy.c:3096
#10 CitusSendTupleToPlacements (copyDest=0x5641e0bdecf8, slot=0x5641e052b3e0) at commands/multi_copy.c:2228
#11 CitusCopyDestReceiverReceive (slot=0x5641e052b3e0, dest=0x5641e0bdecf8) at commands/multi_copy.c:2175
#12 0x00005641ddd88248 in ExecutePlan (execute_once=<optimized out>, dest=0x5641e0bdecf8, direction=-572907836, numberTuples=0, 
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x5641e052a070, 
    estate=0x5641e0529e18) at execMain.c:1677
#13 standard_ExecutorRun (queryDesc=queryDesc@entry=0x5641e0388878, direction=direction@entry=ForwardScanDirection, 
    count=count@entry=0, execute_once=execute_once@entry=true) at execMain.c:364
#14 0x00007f305694e7eb in CitusExecutorRun (queryDesc=0x5641e0388878, direction=ForwardScanDirection, count=0, 
    execute_once=<optimized out>) at executor/multi_executor.c:220
#15 0x00005641ddefdd05 in PortalRunSelect (portal=portal@entry=0x5641e0440890, forward=forward@entry=true, count=0, 
    count@entry=9223372036854775807, dest=dest@entry=0x5641e0bdecf8) at pquery.c:929
#16 0x00005641ddeff496 in PortalRun (portal=portal@entry=0x5641e0440890, count=count@entry=9223372036854775807, 
    isTopLevel=isTopLevel@entry=false, run_once=run_once@entry=true, dest=dest@entry=0x5641e0bdecf8, 
    altdest=altdest@entry=0x5641e0bdecf8, completionTag=0x0) at pquery.c:770
#17 0x00007f305694f2e4 in ExecutePlanIntoDestReceiver (queryPlan=0x5641e0bdeb10, params=0x0, dest=0x5641e0bdecf8)
    at executor/multi_executor.c:687
#18 0x00007f305694b806 in ExecutePlanIntoRelation (executorState=0x5641e0513ef8, selectPlan=0x5641e0bdeb10, 
    insertTargetList=0x5641e04e21c0, targetRelationId=16814) at executor/insert_select_executor.c:633
#19 CoordinatorInsertSelectExecScanInternal (node=0x5641e0514180) at executor/insert_select_executor.c:361
#20 CoordinatorInsertSelectExecScan (node=0x5641e0514180) at executor/insert_select_executor.c:99
#21 0x00005641ddd8821a in ExecProcNode (node=0x5641e0514180) at ../../../src/include/executor/executor.h:239
#22 ExecutePlan (execute_once=<optimized out>, dest=0x5641e052e9d0, direction=NoMovementScanDirection, numberTuples=0, 
    sendTuples=<optimized out>, operation=CMD_INSERT, use_parallel_mode=<optimized out>, planstate=0x5641e0514180, 
    estate=0x5641e0513ef8) at execMain.c:1646
```
pykello self-assigned this Mar 17, 2020

serprex commented Mar 18, 2020

This seems to be fixed by #3403


pykello commented Mar 18, 2020

> This seems to be fixed by #3403

I tried that branch, and the bug is still there. Note that the case I mentioned doesn't use the adaptive executor.

serprex added the bug label Mar 26, 2020
JelteF added a commit that referenced this issue Jun 4, 2020
Fixes #3622

It was possible to get an assertion error if a DML command that opened a
connection was cancelled and "ROLLBACK TO SAVEPOINT" was then used to
continue the transaction. The reason for this was that cancelling the
command might leave the `claimedExclusively` flag on for (some of) its
connections.

This caused an assertion failure because `CanUseExistingConnection`
would return false and a new connection would be opened, and then there
would be two connections doing DML for the same placement, which is
disallowed. Since this situation caused an assertion failure instead of
a regular error, builds without asserts could possibly end up with
visibility bugs similar to the ones described in #3867.

The fix simply "unclaims" all connections after "ROLLBACK TO SAVEPOINT"
is done.

This specific issue also highlights some other issues:
1. Just like #3867, it shows that `CanUseExistingConnection` should
   be improved to catch these types of errors.
2. I think we should make these DML and DDL asserts normal errors.
   This way we will not return incorrect data in case of bugs, but instead
   error out:
   https://github.com/citusdata/citus/blob/master/src/backend/distributed/connection/placement_connection.c#L418-L419
   This code is complex and it's quite possible we missed some other
   edge cases.
3. "ROLLBACK TO SAVEPOINT" in plain Postgres undos the locks that the
   rolled back statements took. We do not undo our hadDML and hadDDL
   flags, when rolling back to a savepoint. This could result in some
   queries not being allowed on Citus that would actually be fine to
   execute. Changing this would require us to keep track for each
   savepoint which placement connections had DDL/DML executed at that
   point.
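
For illustration, here is a minimal sketch of the idea behind that fix (not the actual patch from #3868): after "ROLLBACK TO SAVEPOINT" completes, walk the connections used in the transaction and drop their exclusive claim, so a later statement can reuse them instead of opening a second DML connection to the same placement. The function name and the list parameter are hypothetical; `MultiConnection`, `claimedExclusively`, `UnclaimConnection`, and `CanUseExistingConnection` are the names referenced in this issue.

```c
/* Illustrative sketch only; not the actual patch from #3868. */
#include "postgres.h"

#include "nodes/pg_list.h"
#include "distributed/connection_management.h" /* MultiConnection, UnclaimConnection */

/* hypothetical helper: run after ROLLBACK TO SAVEPOINT has been processed */
static void
UnclaimConnectionsAfterRollbackToSavepoint(List *transactionConnectionList)
{
	ListCell *connectionCell = NULL;

	foreach(connectionCell, transactionConnectionList)
	{
		MultiConnection *connection = (MultiConnection *) lfirst(connectionCell);

		/*
		 * A cancelled DML statement can leave claimedExclusively set. While it
		 * stays set, CanUseExistingConnection() returns false, so the next
		 * statement opens a second DML connection to the same placement and
		 * trips the Assert(!placementConnection->hadDML) seen in the backtrace.
		 */
		UnclaimConnection(connection);
	}
}
```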
JelteF added a commit that referenced this issue Aug 18, 2022
This removes a flaky test that I introduced in #3868 after I fixed the
issue described in #3622. This test sometimes fails randomly in CI.
The way it fails indicates that there might be some bug: a connection
breaks after rolling back to a savepoint.

I tried reproducing this issue locally, but I wasn't able to. I don't
understand what causes the failure.

Things that I tried were:

1. Running the test with:
   ```sql
   SET citus.force_max_query_parallelization = true;
   ```
2. Running the test with:
   ```sql
   SET citus.max_adaptive_executor_pool_size = 1;
   ```
3. Running the test in parallel with the same tests it runs alongside
   in multi_schedule.

None of these allowed me to reproduce the issue locally.

So I think it's time to give up on fixing this test and simply remove
it. The regression that this test protects against seems very unlikely
to reappear, since in #3868 I also added a big comment about the need
for the newly added `UnclaimConnection` call. So I think the need for
the test is quite small, and removing it will make our CI less flaky.

In case the cause of the bug is ever found, I tracked it in #6189.

Example of a failing CI run:
https://app.circleci.com/pipelines/github/citusdata/citus/26098/workflows/f84741d9-13b1-4ae7-9155-c21ed3466951/jobs/736424

For reference, the unexpected diff is this (both warnings and an error):
```diff
 INSERT INTO t SELECT i FROM generate_series(1, 100) i;
+WARNING:  connection to the remote node localhost:57638 failed with the following error: 
+WARNING:  
+CONTEXT:  while executing command on localhost:57638
+ERROR:  connection to the remote node localhost:57638 failed with the following error: 
 ROLLBACK;
```

This test is also mentioned as the most frequently failing regression test in #5975.