GPDB 5.4 Branch Merge#1

Merged
anermakov merged 4092 commits into arenadata:master from
greenplum-db:master
Jan 22, 2018

Conversation

@anermakov
Member

No description provided.

Ashwin Agrawal and others added 30 commits January 12, 2018 17:11
On QD start, the FTS process can take some time to start and initialize
`ftsProbeInfo->fts_status`. So, check against it only if it is
initialized, to avoid incorrectly reporting a segment as down and
failing queries.
This reverts commit 1a27788edfd2163cbfcb1dc30c5eb870eac77181. The test should
pass now, since the failure caused by FTS using the uninitialized variable
`fts_status` has been fixed.
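The guard described above can be sketched in Python (the real code is C in the FTS prober; the names here are illustrative stand-ins for `ftsProbeInfo->fts_status`):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FtsProbeInfo:
    # None until the FTS process finishes initializing, mirroring the
    # window on QD start before ftsProbeInfo->fts_status is set up.
    fts_status: Optional[List[str]] = None

def segment_is_up(info: FtsProbeInfo, seg: int) -> bool:
    """Check fts_status only if it is initialized; otherwise assume the
    segment is up rather than incorrectly failing queries."""
    if info.fts_status is None:
        return True
    return info.fts_status[seg] == "up"
```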
- Special lock functions for RelationExtension
- Special function ReadBuffer_Resync()
- States related to CT and RESYNC
- Xlog during PageInit() for vacuum lazy (upstream doesn't have it so removed it)
I removed the filespaces arg earlier, but failed to update the format
string accordingly.
The `commit_blocking` test often fails, and it is not clear whether the
problem happened during the test or the cluster was not in the right state
before we started. Hence, add an expected state at the start of the test,
which makes it easy to tell whether this test itself failed or is just a
side effect of a previous test failure.
Remove test code for check_persistent_tables function, which was removed
earlier. Fix a few more initFromString() lines.
After removing -s option from gpsegstart, this unit test started failing
with a curious error:

Stderr:
Usage: gpunit [--help] [options]

gpunit: error: no such option: -s

======================================================================
ERROR: Test Suite Name|programs.test.unit.test_cluster_clsrecoversegment|Test Case Name|test_output_segments_in_change_tracking_disabled_should_print_failed_segments|Test Details|
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/heikki/git-sandbox-gpdb/master/gpMgmt/bin/gppylib/programs/test/unit/test_cluster_clsrecoversegment.py", line 16, in setUp
    (options, _) = raw_options.parse_args()
  File "/usr/lib/python2.7/optparse.py", line 1402, in parse_args
    self.error(str(err))
  File "/usr/lib/python2.7/optparse.py", line 1584, in error
    self.exit(2, "%s: error: %s\n" % (self.get_prog_name(), msg))
  File "/usr/lib/python2.7/optparse.py", line 1574, in exit
    sys.exit(status)
SystemExit: 2

It turns out that the unit test was passing the command-line options from
the gpunit invocation, to the options parser of the gpsegstart command. It
was accidental that it ever worked, because gpsegstart also happened to have
a -s option, like gpunit did. Now that the option was removed, the gpsegstart
options parser started to fail.

To fix, explicitly pass "no command-line arguments" to the gpsegstart
options parser, so that it doesn't pick up the args from the gpunit
invocation.
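The fixed pattern, as a minimal runnable sketch (the parser option here is illustrative, not gpsegstart's real option set):

```python
import optparse

parser = optparse.OptionParser()
parser.add_option("-d", "--datadir", dest="datadir")

# Buggy pattern: parse_args() with no argument list falls back to
# sys.argv, so the parser picks up whatever options the *test runner*
# was invoked with (e.g. gpunit's -s) and dies on unknown options:
#
#     (options, _) = parser.parse_args()

# Fixed pattern: explicitly pass "no command-line arguments".
(options, args) = parser.parse_args([])
```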
These tests are very tightly coupled with the filerep and persistent
tables implementation.

We do need replacement tests, but they will not be written in TINC and
will cover far fewer scenarios, since running full ICW with
WAL-replication mirrors active already validates most of the
crash-recovery logic. What is mostly missing is real crash scenarios in
the middle of an active workload, such as creating and dropping objects.
Since persistent tables were removed, these were not used for anything.
It just doesn't mean "filespace" anymore, just plain "datadir".
- use the new fact that datadirs are in the gp_segment_configuration
- fix a few things with the gpperfmon behave tests (mostly for macOS)
   --> the change to mgmt_utils.py is to do the config file manipulation
   natively in python
   --> the change to the gp_bash_functions.sh is to use ASCII '
   characters so that python string comparison is happier

Author: C.J. Jameson <cjameson@pivotal.io>
This reverts commit f38f024d7d314ae8ac100d862307a228b212c70e.
…ica_check."

This reverts commit 3214d80fabf7c208e40b2ec4f7ec69c97a9e7dd6.
The walreceiver test establishes a walrep connection to the master's
walsender. With the recent change to start standby master by default
on gpdemo cluster, the mock walreceiver will be unable to create a
connection to the walsender, since max_wal_senders is hardcoded to
1. Moving this test to run only on a mirrorless cluster makes more sense,
since it assumes an available walrep connection.
There are more in other tests, but I suspect those other tests are obsolete
in whole, so I'm just patching this one that caused a failure in the
concourse pipeline.
darthunix pushed a commit that referenced this pull request Apr 10, 2020
This reverts commit b3e4249.

This PR comprises the following commits:
1. Commit 01c9755 - This is from an existing approved PR
https://github.com/greenplum-db/gporca/pull/428, which was merged but later
reverted due to a regression caused by cardinality under-estimation.
2. Commit 01c9755 - This commit fixes the regression caused by the previous
commit and adds tests for it.

So for this PR, the 2nd commit is primarily for review.

Repro:
```
create table vt1(a varchar, b varchar);
insert into vt1 select 'a' || i, 'b' || i from generate_series(1, 10000000) i;
analyze vt1;
gnw_edw_dev=# explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..6194.47 rows=10000 width=30)
   ->  Sequence  (cost=0.00..6193.35 rows=3334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5623.58 rows=3334 width=30) ==> #2
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text #1
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Check the cardinality:
```
 Hash Join  (cost=0.00..5623.58 rows=3334 width=30)
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
```
Since all the tuple values are distinct, the cardinality for the above join
should be `3333334` (per segment). But after commit 01c9755, implied
predicates were generated, and the existing code did not remove
binary-coercible casted comparisons prior to cardinality estimation, so the
estimate dropped to `3334`. This can result in a Broadcast Motion placed on
top of this node if it is a child of another join, significantly reducing
performance.
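A toy model of why a redundant equality conjunct collapses the estimate. This is an illustration only, not ORCA's actual formulas; it assumes equality selectivity of 1/NDV and square-root damping of additional conjuncts, a common optimizer technique to avoid assuming full independence:

```python
import math

rows = 10_000_000   # tuples in vt1
ndv = 10_000_000    # all join-key values are distinct

sel = 1.0 / ndv
base_est = rows * rows * sel               # one join predicate: ~10M, the truth
redundant_est = base_est * math.sqrt(sel)  # an implied duplicate predicate, damped

# The redundant predicate collapses the estimate by orders of magnitude,
# the same kind of collapse seen in the plan above (3334 vs 3333334
# per segment).
```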

After the fix, the cardinality is 3333334 for the join `share0_ref4.a::text
= share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text`:
```
explain with cte as (select * from vt1) select * from cte c1, cte c2, cte c3 where c1.a::text = c2.a::text and c2.a::text = c3.a::text;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..7760.90 rows=10000000 width=30)
   ->  Sequence  (cost=0.00..6642.90 rows=3333334 width=30)
         ->  Shared Scan (share slice:id 1:0)  (cost=0.00..569.67 rows=3333334 width=1)
               ->  Materialize  (cost=0.00..569.67 rows=3333334 width=1)
                     ->  Table Scan on vt1  (cost=0.00..504.33 rows=3333334 width=10)
         ->  Hash Join  (cost=0.00..5973.23 rows=3333334 width=30) #1
               Hash Cond: share0_ref4.a::text = share0_ref2.a::text AND share0_ref3.a::text = share0_ref2.a::text
               ->  Hash Join  (cost=0.00..2945.63 rows=3333334 width=20)
                     Hash Cond: share0_ref4.a::text = share0_ref3.a::text
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
                     ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                           ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
               ->  Hash  (cost=511.33..511.33 rows=3333334 width=10)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..511.33 rows=3333334 width=10)
 Optimizer status: PQO version 3.29.0
```

Bump ORCA version 3.42.0
deart2k pushed a commit that referenced this pull request Apr 10, 2020
The syslogger will open the gpperfmon alert log file when
gpperfmon_log_alert_level is not NONE; however, even if it fails to open
the file, it still writes to it, which causes a crash like this:

    #0  fwrite () from /lib64/libc.so.6
    #1  write_binary_to_file (fh=NULL, ...) at syslogger.c:1885
    #2  write_syslogger_file_binary (...) at syslogger.c:1917
    #3  syslogger_append_current_timestamp (...) at syslogger.c:1103
    #4  syslogger_log_chunk_list (...) at syslogger.c:1571
    #5  syslogger_handle_chunk (...) at syslogger.c:1766
    #6  SysLoggerMain (argv=0x0, argc=0) at syslogger.c:576
    #7  SysLogger_Start () at syslogger.c:850
    #8  do_reaper () at postmaster.c:4984
    #9  ServerLoop () at postmaster.c:2417
    #10 PostmasterMain (...) at postmaster.c:1528
    #11 main (argc=15, argv=0x1dc1680) at main.c:206

To fix the issue we won't mark alert_log_level_opened as true until the
file is really opened successfully.
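The fix pattern, sketched in Python (the real code is C in syslogger.c; the names here mirror `alert_log_level_opened` and the alert-file handle):

```python
alert_log_fh = None
alert_log_level_opened = False

def open_alert_log(path):
    """Mark the alert log as opened only after open() really succeeded,
    so writers never touch a missing file handle (a NULL fh in C)."""
    global alert_log_fh, alert_log_level_opened
    try:
        alert_log_fh = open(path, "a")
    except OSError:
        alert_log_fh = None
        return  # flag stays False: nothing will try to write
    alert_log_level_opened = True

def write_alert(msg):
    # The crash happened because the write ran with a NULL handle;
    # now the flag guarantees the handle is valid before writing.
    if alert_log_level_opened and alert_log_fh is not None:
        alert_log_fh.write(msg)
```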

Reviewed-by: Paul Guo <pguo@pivotal.io>
(cherry picked from commit e2f1414)
hughcapet pushed a commit that referenced this pull request Aug 4, 2020
…id potential pg_rewind hang.

During testing, I encountered an incremental gprecoverseg hang issue.
Incremental gprecoverseg is based on pg_rewind.  pg_rewind launches a single
mode postgres process and quits after crash recovery if the postgres instance
was not cleanly shut down - this is used to ensure that the postgres is in a
consistent state before doing incremental recovery. I found that the single
mode postgres hangs with the below stack.

#1  0x00000000008cf2d6 in PGSemaphoreLock (sema=0x7f238274a4b0, interruptOK=1 '\001') at pg_sema.c:422
#2  0x00000000009614ed in ProcSleep (locallock=0x2c783c0, lockMethodTable=0xddb140 <default_lockmethod>) at proc.c:1347
#3  0x000000000095a0c1 in WaitOnLock (locallock=0x2c783c0, owner=0x2cbf950) at lock.c:1853
#4  0x0000000000958e3a in LockAcquireExtended (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001', locallockp=0x0) at lock.c:1155
#5  0x0000000000957e64 in LockAcquire (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:700
#6  0x000000000095728c in LockSharedObject (classid=1262, objid=1, objsubid=0, lockmode=3) at lmgr.c:939
#7  0x0000000000b0152b in InitPostgres (in_dbname=0x2c769f0 "template1", dboid=0, username=0x2c59340 "gpadmin", out_dbname=0x0) at postinit.c:1019
#8  0x000000000097b970 in PostgresMain (argc=5, argv=0x2c51990, dbname=0x2c769f0 "template1", username=0x2c59340 "gpadmin") at postgres.c:4820
#9  0x00000000007dc432 in main (argc=5, argv=0x2c51990) at main.c:241

It tries to hold the lock for template1 on pg_database with lockmode 3 but
it conflicts with the lock with lockmode 5 which was held by a recovered dtx
transaction in startup RecoverPreparedTransactions(). Typically the dtx
transaction comes from "create database" (by default the template database is
template1).

Fix this by using the postgres database for the single mode postgres
execution. The postgres database is commonly used by many background worker
backends like dtx recovery, gdd and ftsprobe. With this change, we do not
need to worry about "create database" with template postgres, etc., since
they won't succeed, thus avoiding the lock conflict.

We may be able to fix this in InitPostgres() by bypassing the locking code in
single mode, but the current fix seems safer.  Note that InitPostgres()
locks/unlocks some other catalog tables also, but almost all of them use
lock mode 1 (except mode 3 on pg_resqueuecapability, per debugging output).  It
seems unusual in real scenarios to have a dtx transaction that locks a
catalog with mode 8, which conflicts with mode 1.  If we encounter this
later, we will need to work out a better (possibly non-trivial) solution.
For now let's fix the issue we encountered first.

Note: the code fixes in buildMirrorSegments.py and twophase.c included in
this patch are not related to the hang. They do not seem to be strict bugs,
but we'd better fix them to avoid potential issues in the future.

Reviewed-by: Ashwin Agrawal <aashwin@vmware.com>
Reviewed-by: Asim R P <pasim@vmware.com>
deart2k pushed a commit that referenced this pull request Aug 10, 2020
…id potential pg_rewind hang.

During testing, I encountered an incremental gprecoverseg hang issue.
Incremental gprecoverseg is based on pg_rewind.  pg_rewind launches a single
mode postgres process and quits after crash recovery if the postgres instance
was not cleanly shut down - this is used to ensure that the postgres is in a
consistent state before doing incremental recovery. I found that the single
mode postgres hangs with the below stack.

#1  0x00000000008cf2d6 in PGSemaphoreLock (sema=0x7f238274a4b0, interruptOK=1 '\001') at pg_sema.c:422
#2  0x00000000009614ed in ProcSleep (locallock=0x2c783c0, lockMethodTable=0xddb140 <default_lockmethod>) at proc.c:1347
#3  0x000000000095a0c1 in WaitOnLock (locallock=0x2c783c0, owner=0x2cbf950) at lock.c:1853
#4  0x0000000000958e3a in LockAcquireExtended (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001', locallockp=0x0) at lock.c:1155
#5  0x0000000000957e64 in LockAcquire (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:700
#6  0x000000000095728c in LockSharedObject (classid=1262, objid=1, objsubid=0, lockmode=3) at lmgr.c:939
#7  0x0000000000b0152b in InitPostgres (in_dbname=0x2c769f0 "template1", dboid=0, username=0x2c59340 "gpadmin", out_dbname=0x0) at postinit.c:1019
#8  0x000000000097b970 in PostgresMain (argc=5, argv=0x2c51990, dbname=0x2c769f0 "template1", username=0x2c59340 "gpadmin") at postgres.c:4820
#9  0x00000000007dc432 in main (argc=5, argv=0x2c51990) at main.c:241

It tries to hold the lock for template1 on pg_database with lockmode 3 but
it conflicts with the lock with lockmode 5 which was held by a recovered dtx
transaction in startup RecoverPreparedTransactions(). Typically the dtx
transaction comes from "create database" (by default the template database is
template1).

Fix this by using the postgres database for the single mode postgres
execution. The postgres database is commonly used by many background worker
backends like dtx recovery, gdd and ftsprobe. With this change, we do not
need to worry about "create database" with template postgres, etc., since
they won't succeed, thus avoiding the lock conflict.

We may be able to fix this in InitPostgres() by bypassing the locking code in
single mode, but the current fix seems safer.  Note that InitPostgres()
locks/unlocks some other catalog tables also, but almost all of them use
lock mode 1 (except mode 3 on pg_resqueuecapability, per debugging output).  It
seems unusual in real scenarios to have a dtx transaction that locks a
catalog with mode 8, which conflicts with mode 1.  If we encounter this
later, we will need to work out a better (possibly non-trivial) solution.
For now let's fix the issue we encountered first.

Note: the code fixes in buildMirrorSegments.py and twophase.c included in
this patch are not related to the hang. They do not seem to be strict bugs,
but we'd better fix them to avoid potential issues in the future.

Reviewed-by: Ashwin Agrawal <aashwin@vmware.com>
Reviewed-by: Asim R P <pasim@vmware.com>
(cherry picked from commit 288908f)
leskin-in pushed a commit that referenced this pull request Sep 29, 2020
Overview
--------

Manage the allocation of AO segfiles in QE nodes, instead of the master.
That simplifies a lot of things. We no longer need to send the decision
on which segfile to use from QD to QE, we no longer need to keep up-to-date
tuple counts in the master, and so forth.

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead.

VACUUM code has been refactored. I tried to make the code structure closer
to upstream, to reduce merge conflicts as we merge with upstream. The way
the VACUUM phases are dispatched, and the phases needed, are different in
the new regime. Some phases could use local transactions while others have
to use distributed transactions; for simplicity, we use distributed
transactions for all of the phases.

Now that QE nodes choose the insertion target segfile independently, we
could use the same algorithm for choosing the target segfile in utility
mode, and wouldn't need to reserve segfile #0 for special operations
anymore. But to keep things familiar, and to avoid any more churn in the
regression tests, this keeps the historical behavior and still reserves
segfile #0 for utility mode and CREATE TABLE AS.

Previously, SERIALIZABLE (and REPEATABLE READ) transactions were treated
differently from transactions in READ COMMITTED mode. If there was an
old serializable transaction running, VACUUM refrained from recycling
old AO segfiles after compaction, because the serializable transaction
might lock the table later and try to read the rows that were compacted
away. But that was not completely waterproof; a READ COMMITTED transaction
that's still holding onto an old snapshot is no different from a
serializable transaction. There's a gap between acquiring a snapshot and
locking a table, so even if a READ COMMITTED backend is not holding a lock
on a table right now, it might open it later. To fix that, remove the
distinction between serializable and other transactions. Any open
transaction will now prevent AO segfiles in awaiting-drop state from being
removed. This commit adds an isolation2 test case called
'vacuum_self_function' to test one such case; it used to fail before this
commit.

That is a user-visible change in behavior that's worth calling out
explicitly. If there are concurrent transactions running, even in
READ COMMITTED mode, and even if they don't access the table at all,
VACUUM will no longer be able to recycle compacted segfiles. Next VACUUM
will clean them up, after the old concurrent transactions. Meanwhile,
the awaiting-drop segfiles will use up disk space, and increase the risk
that an insertion fails because all segfiles are in use.

This work was originally submitted as github PR #790; you might see references
to that in the discussions.

Code change details
-------------------

This gets rid of the in-memory hash table to track AO segment status. To
lock a segfile for insertion or other operations, we now rely on tuple
locks on the AO segment table instead. When an AO segment is chosen as
insertion target, its AO segment table row is locked for the transaction
(by heap_lock_tuple, i.e. by stamping its XMAX). Later, when the insertion
finishes, the tuple is updated with the new EOF (no change there). When
choosing a new insertion target segfile, we scan the pg_aoseg table instead
of the hash table, and use the xmin/xmax to determine which ones are safe
to use. The process of choosing a target segment itself is not
concurrency-safe, so the relation extension lock is held to protect that.
See new "Locking and snapshots" section in
src/backend/access/appendonly/README.md for details.
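The selection protocol described above can be sketched as follows. This is a simplification in Python: the real code stamps XMAX via heap_lock_tuple on pg_aoseg rows and holds the relation extension lock; the names and in-memory rows here are illustrative:

```python
import threading

extension_lock = threading.Lock()  # stands in for the relation extension lock

# Mock pg_aoseg rows: segno -> {locked_by: xid or None, state}
segfiles = {1: {"locked_by": None, "state": "AVAILABLE"},
            2: {"locked_by": None, "state": "AWAITING_DROP"}}

def choose_segno_for_write(xid: int) -> int:
    """Pick an insertion target segfile: scan the (mock) pg_aoseg rows,
    skip ones locked by other transactions or in awaiting-drop state,
    and create a new segfile if none qualifies. The scan-and-pick step
    itself is serialized by the extension lock."""
    with extension_lock:
        for segno, row in sorted(segfiles.items()):
            if row["state"] == "AVAILABLE" and row["locked_by"] in (None, xid):
                row["locked_by"] = xid  # stands in for heap_lock_tuple / XMAX
                return segno
        new_segno = max(segfiles) + 1   # prefer creating a new segfile
        segfiles[new_segno] = {"locked_by": xid, "state": "AVAILABLE"}
        return new_segno
```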

tupcount and modcount are not kept up-to-date in the master. Remove all the
code needed to send the totals back from QE to QD at end of a command,
including the GPDB-specific 'o' protocol FE/BE message.

Remove the machinery to track which PROCs are in a serializable transaction.
It was not completely waterproof anyway, and there were some existing bugs
in that area. A READ COMMITTED transaction that's still holding onto an old
snapshot is no different from a serializable transaction. This commit adds
an isolation2 test case called 'vacuum_self_function' to test one such case;
it used to fail before this commit.

Move responsibility of creating AO segfile to callers of
appendonly_insert_init(). The caller must have locked, and created if
necessary, the segfile by calling ChooseSegnoForWrite() or
LockSegnoForWrite(). Previously, the pg_aoseg was not locked in rewriting
ALTER TABLE and other such commands that always used segfile #0. That was
OK, because they held an AccessExclusiveLock on the table, but let's be
consistent.

When selecting the insertion target for VACUUM compaction, prefer creating
a new segfile over using an existing non-empty segfile. Existing empty
segfiles are still choice #1, but let's avoid writing to non-empty
existing files, because otherwise we cannot vacuum those segfiles in the
same VACUUM cycle.

Warn, if compaction fails because no insertion segfile could be chosen.
This used to be an ERROR in the old code, but I think a WARNING is more
appropriate.

In targetid_get_partition(), don't create duplicate ResultRelInfo entries.
It's possible that the "result relation" is a partition of a partitioned
table, rather than the root. Don't create a new ResultRelInfo hash entry
for it in that case, use the parent ResultRelInfo. This was harmless
before, but now the logic of when we need to increment the modcount was
getting confused. Per the 'qp_dropped_cols' regression test.

Test change details
-------------------

src/test/regress/greenplum_schedule

    Run uao[cs]_catalog_tables tests separately.

    Because we are now more lazy with dropping AWAITING_DROP segments, the
    issue mentioned in the comment with serializable transactions now
    applies to all transactions.

uao_compaction/threshold
uaocs_compaction/threshold

    Adjust uao[cs]_compaction/threshold tests case to make it pass.

    In the first few changed outputs, I'm not entirely sure why one segment
    chooses to insert the new tuples to segfile #2 while others pick
    segfile #1, but there's also nothing wrong with that. The point of this
    PR is that QE nodes can make the decision independently.

    In the last test with gp_toolkit.__gp_aovisimap_compaction_info(),
    all the test data is inserted to one QE node, so segfile 2 isn't
    allocated on others.

isolation2/vacuum_self_function

    Add test case for the non-SERIALIZABLE case discussed earlier.

    See also discussion at
    https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ

    This test errored out before this commit:

    ERROR:  read beyond eof in table "ao" file "base/16384/16387.1", read position 0 (small offset 0), actual read length 0 (large read length 192) (cdbbufferedread.c:211)

    Works with this PR, but we don't drop old AO segments as aggressively
    in vacuum anymore. Partly because we're now more careful and correct.
    But also because we don't detect the case that the VACUUM is the oldest
    transaction in the system, and nothing else is running; we miss the
    opportunity to drop compacted AO segments immediately in that case. The
    next VACUUM should drop them though.

    Furthermore, this test is written so that if we tried to perform AO
    compaction phase in a local rather than distributed transaction, it
    would fail. (I know because I tried to do it that way at first.) This
    is "Problem 3" that I mentioned on that gpdb-dev thread.

isolation2/fsync_ao

  Adjust "hit counts" in fsync_ao test.

  The number of times the fsyncs happen has changed. To be honest, I'm not
  sure why, nor what the principled way to get the correct value would be,
  but as long as the counts are > 0, this looks like success to me.

isolation2/gdd/planner_insert_while_vacuum_drop

  Remove GDD test that's not relevant anymore.

  The test was testing a deadlock between QD and QE, involving AO vacuum.
  We no longer take an AccessExclusiveLock in the QD, so the test case
  doesn't work. I'm not sure what an equivalent deadlock might be with the
  new AO vacuum, so just remove the test.

isolation2/vacuum_drop_phase_ao
isolation2/uao/insert_should_not_use_awaiting_drop

    These tests don't make sense in the QE-managed regime anymore. I
    couldn't figure out what these should look like with this commit, so
    remove.

isolation2/uao/insert_policy

    Update test case to cover a scenario where two transactions try to
    create the initial segment for a relation. We had a bug in that scenario
    during development, and it wasn't covered by any existing tests.

isolation2/uao/select_while_vacuum_serializable2

    The behavior is different now: a serializable transaction doesn't
    prevent the compaction phase from happening, only the recycling of the
    segfile. Adjust expected output accordingly.

isolation2/uao/compaction_utility

    VACUUM in utility mode can compact now.

Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Co-authored-by: Asim R P <apraveen@pivotal.io>
Co-authored-by: Xin Zhang <xzhang@pivotal.io>
Reviewed-by: Georgios Kokolatos <gkokolatos@pivotal.io>
Reviewed-by: Taylor Vesely <tvesely@pivotal.io>

Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/49MW3tBoH8s/W_bdyxZMAgAJ
Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/lUwtNxaT3f0/Cg39CP8uAwAJ
maksm90 pushed a commit that referenced this pull request Oct 18, 2020
…tions.

Commit 3d956d9 added support for update row movement in postgres_fdw.
This patch fixes the following issues introduced by that commit:

* When a remote partition chosen to insert routed rows into was also an
  UPDATE subplan target rel that would be updated later, the UPDATE that
  used a direct modification plan modified those routed rows incorrectly
  because those routed rows were visible to the later UPDATE command.
  The right fix for this would be to have some way in postgres_fdw in
  which the later UPDATE command ignores those routed rows, but it seems
  hard to do so with the current infrastructure.  For now throw an error
  in that case.

* When a remote partition chosen to insert routed rows into was also an
  UPDATE subplan target rel, fmstate created for the UPDATE that used a
  non-direct modification plan was mistakenly overridden by another
  fmstate created for inserting those routed rows into the partition.
  This caused 1) server crash when the partition would be updated later,
  and 2) resource leak when the partition had been already updated.  To
  avoid that, adjust the treatment of the fmstate for the inserting.  As
  for #1, since we would also have the incorrectness issue as mentioned
  above, error out in that case as well.

Update the docs to mention that postgres_fdw currently does not handle
the case where a remote partition chosen to insert a routed row into is
also an UPDATE subplan target rel that will be updated later.

Author: Amit Langote and Etsuro Fujita
Reviewed-by: Amit Langote
Backpatch-through: 11 where row movement in postgres_fdw was added
Discussion: https://postgr.es/m/21e7eaa4-0d4d-20c2-a1f7-c7e96f4ce440@lab.ntt.co.jp
darthunix pushed a commit that referenced this pull request Aug 24, 2021
…CREATE/ALTER resource group.

In some scenarios, the AccessExclusiveLock on table pg_resgroupcapability may leave database setup/recovery pending. Below is why we need to change the AccessExclusiveLock to ExclusiveLock.

This lock on table pg_resgroupcapability serializes concurrent updates to the table when running a "Create/Alter resource group" statement. There is a CPU limit: after modifying one resource group, we have to check that the total CPU usage of all resource groups does not exceed 100%.
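The 100% check can be sketched as below. This is illustrative Python under the table lock described above; the function name and the dict shape are assumptions, not GPDB's catalog layout:

```python
def validate_cpu_limits(groups: dict, group: str, new_limit: int) -> None:
    """Reject a CREATE/ALTER that would push the total cpu_rate_limit of
    all resource groups above 100%. `groups` maps group name -> limit."""
    proposed = {**groups, group: new_limit}  # apply the proposed change
    total = sum(proposed.values())
    if total > 100:
        raise ValueError(
            f"total cpu_rate_limit {total}% of all resource groups exceeds 100%")
```

This check is why the whole table must be locked for the duration of the statement: the sum is only valid if no concurrent CREATE/ALTER changes another group's limit.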

Before this fix, AccessExclusiveLock was used. Suppose one user is running an "Alter resource group" statement. The QD dispatches this statement to all QEs, so it is a two-phase commit (2PC) transaction. When the QD dispatches the "Alter resource group" statement, the QEs acquire the AccessExclusiveLock on table pg_resgroupcapability, and they cannot release it until the 2PC distributed transaction commits.

In the second phase, the QD calls the function doNotifyingCommitPrepared to broadcast the "commit prepared" command to all QEs. The QEs have already finished preparing, so this transaction is a prepared transaction. Suppose at this point a primary segment goes down and its mirror is promoted to primary.

The mirror gets the "promoted" message from the coordinator and recovers based on the xlog from the primary. In order to recover the prepared transaction, it reads the prepared-transaction log entry and acquires AccessExclusiveLock on table pg_resgroupcapability. The callstack is:
#0 lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697
#1 ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757
#2 RecoverPreparedTransactions () at twophase.c:2214
#3 StartupXLOG () at xlog.c:8013
#4 StartupProcessMain () at startup.c:231
#5 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459
#6 StartChildProcess (type=StartupProcess) at postmaster.c:5917
#7 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581
#8 main (argc=7, argv=0x1d555b0) at main.c:240

After that, the database instance starts up and all related initialization functions are called. One of them, InitResGroups, acquires AccessShareLock on table pg_resgroupcapability and does some initialization work. The callstack is:
#6 WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999
#7 LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192
#8 LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126
#9 relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56
#10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47
#11 InitResGroups () at resgroup.c:581
#12 InitResManager () at resource_manager.c:83
#13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284
#14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812
#15 BackendRun (port=, port=) at postmaster.c:4922
#16 BackendStartup (port=0x1c835d0) at postmaster.c:4607
#17 ServerLoop () at postmaster.c:1963
#18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589
#19 in main (argc=7, argv=0x1c595b0) at main.c:240

The AccessExclusiveLock is not released, and it is not compatible with any other lock, so the startup process blocks on this lock and the mirror cannot become primary.

Even if users run "gprecoverseg" to recover the primary segment, the result is similar. The primary segment recovers from xlog, recovering the prepared transactions and re-acquiring AccessExclusiveLock on table pg_resgroupcapability, and the startup process then waits on this lock. Unless users change the resource manager type to "queue" (in which case InitResGroups is never called and nothing blocks), the primary segment cannot start up normally.

After this fix, ExclusiveLock is acquired when altering a resource group. In the case above, the startup process acquires AccessShareLock, and ExclusiveLock and AccessShareLock are compatible, so the startup process can run successfully. After startup, the QE receives the RECOVERY_COMMIT_PREPARED command from the QD, finishes the second phase of the distributed transaction, and releases the ExclusiveLock on table pg_resgroupcapability. The callstack is:
#0 lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758
#1 ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757
#2 FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704
#3 in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107
#4 performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279
#5 exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570
#6 PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482

The test case of this commit simulates a repro of this bug.
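The lock-mode reasoning can be checked against PostgreSQL's table-level lock conflict matrix. Below is a small Python sketch of that matrix (conflict sets transcribed from the PostgreSQL documentation, not from GPDB source): AccessExclusiveLock conflicts with every mode, while ExclusiveLock still admits AccessShareLock readers, which is exactly why the fix unblocks the startup process.

```python
# Table-level lock modes, weakest to strongest (per PostgreSQL docs).
MODES = ["AccessShare", "RowShare", "RowExclusive", "ShareUpdateExclusive",
         "Share", "ShareRowExclusive", "Exclusive", "AccessExclusive"]

# For each mode, the set of modes it conflicts with.
CONFLICTS = {
    "AccessShare":          {"AccessExclusive"},
    "RowShare":             {"Exclusive", "AccessExclusive"},
    "RowExclusive":         {"Share", "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "ShareUpdateExclusive": {"ShareUpdateExclusive", "Share", "ShareRowExclusive",
                             "Exclusive", "AccessExclusive"},
    "Share":                {"RowExclusive", "ShareUpdateExclusive",
                             "ShareRowExclusive", "Exclusive", "AccessExclusive"},
    "ShareRowExclusive":    {"RowExclusive", "ShareUpdateExclusive", "Share",
                             "ShareRowExclusive", "Exclusive", "AccessExclusive"},
    "Exclusive":            {m for m in MODES if m != "AccessShare"},
    "AccessExclusive":      set(MODES),
}

def compatible(held, requested):
    """True if a lock in mode `requested` can be granted while `held` is held."""
    return requested not in CONFLICTS[held]

# Before the fix: the prepared transaction holds AccessExclusiveLock,
# so InitResGroups' AccessShareLock request blocks.
print(compatible("AccessExclusive", "AccessShare"))  # False
# After the fix: the prepared transaction holds only ExclusiveLock,
# which still admits AccessShareLock readers.
print(compatible("Exclusive", "AccessShare"))        # True
```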
InnerLife0 pushed a commit that referenced this pull request Sep 15, 2021
…ce (#12447)

Recently I built GreenPlum from the master branch to run TPC-DS queries with 1GB of data. For Q47 and Q57, when I turned off the GUC `execute_pruned_plan` (on by default), some worker processes hang and the query never returns.

Take Q57 as an example. My cluster configuration is 1 QD + 2 QE. The query looks like:

```sql
with v1 as(
  select
    i_category,i_brand,
    cc_name,d_year,d_moy,
    sum(cs_sales_price) sum_sales,
    avg(sum(cs_sales_price)) over (partition by
      i_category,i_brand,cc_name,d_year)
    avg_monthly_sales,
    rank() over (partition by
      i_category,i_brand,cc_name
    order by
      d_year,d_moy
    ) rn
  from
    item,catalog_sales,date_dim,call_center
  where
    cs_item_sk = i_item_sk and
    cs_sold_date_sk = d_date_sk and
    cc_call_center_sk= cs_call_center_sk and(
      d_year = 1999 or
      ( d_year = 1999-1 and d_moy =12) or
      ( d_year = 1999+1 and d_moy =1)
    )
  group by
    i_category,i_brand,cc_name,d_year,d_moy
),
v2 as(
  select
    v1.i_category,v1.i_brand,v1.cc_name,
    v1.d_year,v1.d_moy,v1.avg_monthly_sales,
    v1.sum_sales,v1_lag.sum_sales psum,
    v1_lead.sum_sales nsum
  from
    v1,v1 v1_lag,v1 v1_lead
  where
    v1.i_category = v1_lag.i_category and
    v1.i_category = v1_lead.i_category and
    v1.i_brand = v1_lag.i_brand and
    v1.i_brand = v1_lead.i_brand and
    v1. cc_name = v1_lag. cc_name and
    v1. cc_name = v1_lead. cc_name and
    v1.rn = v1_lag.rn + 1 and
    v1.rn = v1_lead.rn - 1
)
select *
from v2
where
  d_year = 1999 and
  avg_monthly_sales > 0 and
  case when avg_monthly_sales > 0 then
    abs(sum_sales - avg_monthly_sales) / avg_monthly_sales
    else null end > 0.1
order by
  sum_sales - avg_monthly_sales,3
limit 100;
```

When `execute_pruned_plan` is on by default, the plan looks like:

```
                                                                                                                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.606..10792.702 rows=100 loops=1)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.597..10792.673 rows=100 loops=1)
         Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
         ->  Sort  (cost=0.00..2832.84 rows=1 width=72) (actual time=10791.203..10791.225 rows=50 loops=1)
               Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
               Sort Method:  quicksort  Memory: 152kB
               ->  Sequence  (cost=0.00..2832.84 rows=1 width=72) (actual time=10790.522..10790.559 rows=50 loops=1)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..1539.83 rows=1 width=1) (actual time=10140.895..10145.397 rows=16510 loops=1)
                           ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=56) (actual time=10082.465..10128.750 rows=16510 loops=1)
                                 Partition By: item.i_category, item.i_brand, call_center.cc_name
                                 Order By: date_dim.d_year, date_dim.d_moy
                                 ->  Sort  (cost=0.00..1539.83 rows=1 width=48) (actual time=10082.429..10084.923 rows=16510 loops=1)
                                       Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                       Sort Method:  quicksort  Memory: 20078kB
                                       ->  Redistribute Motion 2:2  (slice2; segments: 2)  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.269..9989.657 rows=16510 loops=1)
                                             Hash Key: item.i_category, item.i_brand, call_center.cc_name
                                             ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.717..9974.500 rows=16633 loops=1)
                                                   Partition By: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                   ->  Sort  (cost=0.00..1539.83 rows=1 width=126) (actual time=9924.662..9927.280 rows=16633 loops=1)
                                                         Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                         Sort Method:  quicksort  Memory: 20076kB
                                                         ->  Redistribute Motion 2:2  (slice3; segments: 2)  (cost=0.00..1539.83 rows=1 width=126) (actual time=9394.220..9856.375 rows=16633 loops=1)
                                                               Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                               ->  GroupAggregate  (cost=0.00..1539.83 rows=1 width=126) (actual time=9391.783..9833.988 rows=16424 loops=1)
                                                                     Group Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                     ->  Sort  (cost=0.00..1539.83 rows=1 width=124) (actual time=9397.448..9628.606 rows=174584 loops=1)
                                                                           Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                           Sort Method:  external merge  Disk: 134144kB
                                                                           ->  Redistribute Motion 2:2  (slice4; segments: 2)  (cost=0.00..1539.83 rows=1 width=124) (actual time=6107.447..8237.581 rows=174584 loops=1)
                                                                                 Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                                 ->  Hash Join  (cost=0.00..1539.83 rows=1 width=124) (actual time=6112.706..7088.349 rows=178669 loops=1)
                                                                                       Hash Cond: (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk)
                                                                                       ->  Seq Scan on date_dim  (cost=0.00..436.38 rows=204 width=12) (actual time=10.656..17.972 rows=222 loops=1)
                                                                                             Filter: ((d_year = 1999) OR ((d_year = 1998) AND (d_moy = 12)) OR ((d_year = 2000) AND (d_moy = 1)))
                                                                                             Rows Removed by Filter: 36504
                                                                                       ->  Hash  (cost=1103.41..1103.41 rows=1 width=120) (actual time=6100.040..6100.040 rows=1430799 loops=1)
                                                                                             Buckets: 16384 (originally 16384)  Batches: 32 (originally 1)  Memory Usage: 12493kB
                                                                                             ->  Broadcast Motion 2:2  (slice5; segments: 2)  (cost=0.00..1103.41 rows=1 width=120) (actual time=1.802..5410.377 rows=1434428 loops=1)
                                                                                                   ->  Nested Loop  (cost=0.00..1103.40 rows=1 width=120) (actual time=1.632..5127.625 rows=718766 loops=1)
                                                                                                         Join Filter: true
                                                                                                         ->  Redistribute Motion 2:2  (slice6; segments: 2)  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.564..362.958 rows=718766 loops=1)
                                                                                                               Hash Key: catalog_sales.cs_item_sk
                                                                                                               ->  Hash Join  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.112..996.643 rows=717589 loops=1)
                                                                                                                     Hash Cond: (catalog_sales.cs_call_center_sk = call_center.cc_call_center_sk)
                                                                                                                     ->  Seq Scan on catalog_sales  (cost=0.00..509.10 rows=720774 width=18) (actual time=0.144..602.362 rows=721193 loops=1)
                                                                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=12) (actual time=0.022..0.022 rows=6 loops=1)
                                                                                                                           Buckets: 32768  Batches: 1  Memory Usage: 257kB
                                                                                                                           ->  Broadcast Motion 2:2  (slice7; segments: 2)  (cost=0.00..431.00 rows=1 width=12) (actual time=0.009..0.012 rows=6 loops=1)
                                                                                                                                 ->  Seq Scan on call_center  (cost=0.00..431.00 rows=1 width=12) (actual time=0.032..0.035 rows=4 loops=1)
                                                                                                         ->  Index Scan using item_pkey on item  (cost=0.00..6.00 rows=1 width=102) (actual time=0.000..0.006 rows=1 loops=718766)
                                                                                                               Index Cond: (i_item_sk = catalog_sales.cs_item_sk)
                     ->  Redistribute Motion 1:2  (slice8)  (cost=0.00..1293.01 rows=1 width=72) (actual time=646.614..646.646 rows=50 loops=1)
                           ->  Limit  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.533..10787.700 rows=100 loops=1)
                                 ->  Gather Motion 2:1  (slice9; segments: 2)  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.527..10787.654 rows=100 loops=1)
                                       Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                       ->  Sort  (cost=0.00..1293.01 rows=1 width=72) (actual time=10789.933..10789.995 rows=357 loops=1)
                                             Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                             Sort Method:  quicksort  Memory: 14998kB
                                             ->  Result  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.280..10774.898 rows=12379 loops=1)
                                                   Filter: ((share0_ref4.d_year = 1999) AND (share0_ref4.avg_monthly_sales > '0'::numeric) AND (CASE WHEN (share0_ref4.avg_monthly_sales > '0'::numeric) THEN (abs((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)) / share0_ref4.avg_monthly_sales) ELSE NULL::numeric END > 0.1))
                                                   ->  Hash Join  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.253..10740.262 rows=13582 loops=1)
                                                         Hash Cond: ((share0_ref4.i_category = share0_ref3.i_category) AND (share0_ref4.i_brand = share0_ref3.i_brand) AND ((share0_ref4.cc_name)::text = (share0_ref3.cc_name)::text) AND (share0_ref4.rn = (share0_ref3.rn + 1)) AND (share0_ref4.rn = (share0_ref2.rn - 1)))
                                                         ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=142) (actual time=0.013..5.570 rows=16510 loops=1)
                                                         ->  Hash  (cost=862.00..862.00 rows=1 width=142) (actual time=10647.380..10647.380 rows=209076 loops=1)
                                                               Buckets: 65536 (originally 32768)  Batches: 2 (originally 1)  Memory Usage: 31389kB
                                                               ->  Hash Join  (cost=0.00..862.00 rows=1 width=142) (actual time=10156.494..10374.421 rows=209076 loops=1)
                                                                     Hash Cond: ((share0_ref3.i_category = share0_ref2.i_category) AND (share0_ref3.i_brand = share0_ref2.i_brand) AND ((share0_ref3.cc_name)::text = (share0_ref2.cc_name)::text))
                                                                     ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=0.009..6.887 rows=16510 loops=1)
                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=126) (actual time=10156.297..10156.298 rows=16178 loops=1)
                                                                           Buckets: 32768  Batches: 1  Memory Usage: 3144kB
                                                                           ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=10139.421..10144.473 rows=16510 loops=1)
 Planning Time: 1905.667 ms
   (slice0)    Executor memory: 330K bytes.
   (slice1)    Executor memory: 4750K bytes avg x 2 workers, 4968K bytes max (seg1).  Work_mem: 4861K bytes max.
   (slice2)    Executor memory: 4701K bytes avg x 2 workers, 4952K bytes max (seg0).  Work_mem: 4894K bytes max.
   (slice3)    Executor memory: 12428K bytes avg x 2 workers, 12428K bytes max (seg0).  Work_mem: 12375K bytes max.
 * (slice4)    Executor memory: 14021K bytes avg x 2 workers, 14021K bytes max (seg0).  Work_mem: 12493K bytes max, 221759K bytes wanted.
   (slice5)    Executor memory: 77K bytes avg x 2 workers, 77K bytes max (seg0).
   (slice6)    Executor memory: 323K bytes avg x 2 workers, 323K bytes max (seg0).  Work_mem: 257K bytes max.
   (slice7)    Executor memory: 39K bytes avg x 2 workers, 39K bytes max (seg0).
   (slice8)    Executor memory: 242K bytes (entry db).
 * (slice9)    Executor memory: 35344K bytes avg x 2 workers, 35360K bytes max (seg1).  Work_mem: 31389K bytes max, 37501K bytes wanted.
 Memory used:  128000kB
 Memory wanted:  3328681kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 10856.507 ms
(86 rows)

Time: 12779.991 ms (00:12.780)
```

There is only one share slice in this query: one producer in slice 1 and three consumers in slice 9. However, when I turned the GUC off, the query never returned, and the processes looked like:

```
postgres  22285  22255  0 03:03 pts/1    00:00:00 psql -p9221
postgres  22288  20912  3 03:03 ?        00:00:03 postgres:  9221, postgres tpcds [local] con150 cmd16 EXPLAIN
postgres  22294  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60732) con150 seg0 cmd17 slice1 MPPEXEC SELECT
postgres  22295  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36177) con150 seg1 cmd17 slice1 MPPEXEC SELECT
postgres  22306  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60742) con150 seg0 idle
postgres  22307  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36187) con150 seg1 idle
postgres  22310  20939 11 03:03 ?        00:00:10 postgres:  9210, postgres tpcds 172.17.0.50(60745) con150 seg0 idle
postgres  22311  20950 12 03:03 ?        00:00:11 postgres:  9211, postgres tpcds 172.17.0.50(36190) con150 seg1 idle
postgres  22314  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60748) con150 seg0 idle
postgres  22315  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36193) con150 seg1 idle
postgres  22318  20939  1 03:03 ?        00:00:01 postgres:  9210, postgres tpcds 172.17.0.50(60750) con150 seg0 idle
postgres  22319  20950  2 03:03 ?        00:00:01 postgres:  9211, postgres tpcds 172.17.0.50(36195) con150 seg1 idle
postgres  22322  20912  0 03:03 ?        00:00:00 postgres:  9221, postgres tpcds [local] con150 seg-1 idle
postgres  22324  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60754) con150 seg0 idle
postgres  22325  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36199) con150 seg1 idle
postgres  22348  20939  0 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45936) con150 seg0 idle
postgres  22349  20950  0 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49614) con150 seg1 idle
postgres  22352  20939  4 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45939) con150 seg0 idle
postgres  22353  20950  4 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49617) con150 seg1 idle
```

According to my debugging, the stack of slice 1 processes looks like:

```
#0  0x00007fde606f94f3 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000d2eec1 in WaitEventSetWaitBlock (set=0x87d8fe0, cur_timeout=-1, occurred_events=0x7ffce695fe00, nevents=1) at latch.c:1081
#2  0x0000000000d2ed9a in WaitEventSetWait (set=0x87d8fe0, timeout=-1, occurred_events=0x7ffce695fe00, nevents=1, wait_event_info=0) at latch.c:1033
#3  0x0000000000d5987d in ConditionVariableSleep (cv=0x7fde540890b0, wait_event_info=0) at condition_variable.c:157
#4  0x0000000000b30a61 in shareinput_writer_waitdone (ref=0x87da950, nconsumers=1) at nodeShareInputScan.c:994
#5  0x0000000000b2fe89 in ExecEndShareInputScan (node=0x88c2ec0) at nodeShareInputScan.c:522
#6  0x0000000000ad63e8 in ExecEndNode (node=0x88c2ec0) at execProcnode.c:888
#7  0x0000000000b3237b in ExecEndSequence (node=0x88c2d80) at nodeSequence.c:132
#8  0x0000000000ad623f in ExecEndNode (node=0x88c2d80) at execProcnode.c:779
#9  0x0000000000b1772e in ExecEndSort (node=0x88c2658) at nodeSort.c:365
```

That is to say, the producer is waiting for the consumers to wake it up, but they never do. Further debugging showed that a **squelch** is triggered on the *Gather Motion* node upstream of the three ShareInputScan consumer nodes. In the squelch logic of ShareInputScan, a consumer notifies the producer only when `ndone == nsharers`:

```c
local_state->ndone++;

if (local_state->ndone == local_state->nsharers)
{
    shareinput_reader_notifydone(node->ref, sisc->nconsumers);
    local_state->closed = true;
}
```

`ndone` is incremented by each consumer one at a time, while `nsharers` is initialized in ExecInitNode. However, the GUC `execute_pruned_plan` determines the root node at which the Executor starts calling `ExecInitNode`:

- `execute_pruned_plan` set to true: initialization starts at the root node of slice 9, so `nsharers` is 3
- `execute_pruned_plan` set to false: initialization starts at the root node of the whole plan tree, so `nsharers` is 4; `ndone == nsharers` then never holds, because there are only three consumers and `ndone` reaches 3 at most

In my understanding, the algorithm should work regardless of whether this GUC is true or false. So I added conditions to the initialization of `nsharers`: accumulate `nsharers` only when initializing consumer nodes of the current slice. With that, the algorithm works correctly.
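The mismatch is easy to model. Here is a small Python sketch of the counting logic (the names mirror the fields in the C excerpt above; this is an illustration, not GPDB code):

```python
class LocalState:
    """Toy model of ShareInputScan's per-node local state."""
    def __init__(self, nsharers):
        self.nsharers = nsharers   # fixed once, at ExecInitNode time
        self.ndone = 0
        self.notified_producer = False

    def squelch(self):
        # Each consumer that finishes early bumps ndone; the last one
        # is supposed to wake the waiting producer.
        self.ndone += 1
        if self.ndone == self.nsharers:
            self.notified_producer = True

def run(nsharers, nconsumers):
    st = LocalState(nsharers)
    for _ in range(nconsumers):
        st.squelch()
    return st.notified_producer

print(run(nsharers=3, nconsumers=3))  # True:  producer wakes (pruned plan)
print(run(nsharers=4, nconsumers=3))  # False: producer waits forever (unpruned plan)
```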
seryozhasmirnov pushed a commit that referenced this pull request Sep 16, 2021
Send the results of the 'select pg_catalog.gp_acquire_sample_rows' query in
binary mode.
That avoids overflow for the maximum double value.

In text mode (the default), when a table is analyzed, the master
segment calls the gp_acquire_sample_rows() helper function on each
segment. That eventually calls the float8out function on the segment to
convert each float8 number to a string with snprintf:

int	ndig = DBL_DIG + extra_float_digits;

snprintf(ascii, MAXDOUBLEWIDTH + 1, "%.*g", ndig, num);

When ndig is 15, the maximum float8 value 1.7976931348623157e+308 is
rounded to "1.79769313486232e+308", which has no float8 representation.

On the master, the acquire_sample_rows_dispatcher function
processes the gp_acquire_sample_rows result, and eventually the float8in
function is called to convert each string back to float8 with strtold:
val = strtold(num, &endptr);

This is where the overflow for "1.79769313486232e+308" happens;
"1.7976931348623157e+308" parses back fine.
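The rounding problem is reproducible outside the database. A quick Python sketch (Python's float() returns inf where PostgreSQL's float8in would raise an out-of-range error, but the string conversion rounds the same way):

```python
import sys

dbl_max = sys.float_info.max   # 1.7976931348623157e+308

# float8out with extra_float_digits = 0 formats with DBL_DIG = 15 digits:
text_15 = "%.15g" % dbl_max
# 17 significant digits are needed for an exact double round trip:
text_17 = "%.17g" % dbl_max

print(text_15)                    # 1.79769313486232e+308 (rounded up past DBL_MAX)
print(float(text_15))             # inf -- the overflow hit on the master
print(float(text_17) == dbl_max)  # True
```

Binary transfer sidesteps the issue entirely because the 8 bytes of the double are sent as-is, with no decimal round trip.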

Transferring in binary mode avoids the conversion from double to
string on the segments and back to double on the master.

Using CdbDispatchPlan instead of CdbDispatchCommand allows
the data to be received in binary mode in a MemTuple.

Because acquire_sample_rows_dispatcher executes
select pg_catalog.gp_acquire_sample_rows with CdbDispatchPlan,
cdbdisp_dispatchX prints the INFO message "(slice %d) Dispatch command to %s"
when Test_print_direct_dispatch_info is enabled.
This has to be added to the expected test output.

(gdb) list
1143						primaryGang->type == GANGTYPE_PRIMARY_READER ||
1144						primaryGang->type == GANGTYPE_SINGLETON_READER ||
1145						primaryGang->type == GANGTYPE_ENTRYDB_READER);
1146
1147			if (Test_print_direct_dispatch_info)
1148				elog(INFO, "(slice %d) Dispatch command to %s", slice->sliceIndex,
1149					 		segmentsToContentStr(slice->directDispatch.isDirectDispatch ?
1150												slice->directDispatch.contentIds : slice->segments));
1151
1152			/*
(gdb) bt 5
#0  cdbdisp_dispatchX (queryDesc=0x5601ebd98dc0, planRequiresTxn=1 '\001', cancelOnError=1 '\001') at cdbdisp_query.c:1148
#1  0x00005601e9c6368e in CdbDispatchPlan (queryDesc=0x5601ebd98dc0, planRequiresTxn=1 '\001', cancelOnError=1 '\001') at cdbdisp_query.c:234
#2  0x00005601e97a7740 in standard_ExecutorStart (queryDesc=0x5601ebd98dc0, eflags=16) at execMain.c:721
#3  0x00005601e97a6686 in ExecutorStart (queryDesc=0x5601ebd98dc0, eflags=0) at execMain.c:262
#4  0x00005601e96d14d0 in acquire_sample_rows_dispatcher (onerel=0x7fe0829d6388, inh=0 '\000', elevel=13, rows=0x7fe082975058, targrows=30000, totalrows=0x7ffdbc341458,
    totaldeadrows=0x7ffdbc341460) at analyze.c:2492
(More stack frames follow...)
(gdb)
InnerLife0 pushed a commit that referenced this pull request Sep 28, 2021
dnskvlnk pushed a commit that referenced this pull request Jul 19, 2022
In commit a0684da we made a change to use the fixed TOAST_TUPLE_TARGET
instead of the custom toast_tuple_target that's calculated based on the
blocksize of the AO table for toasting AO table tuples.

This caused an issue: some tuples that are well beyond the
toast_tuple_target are not being toasted (because they're smaller than
TOAST_TUPLE_TARGET). That is fine as long as the tuples still fit into
the AO table's blocksize, but if they don't, an error occurs. E.g.:

  postgres=# create table t (a int, b int[]) WITH (appendonly=true, blocksize=8192);
  CREATE TABLE
  postgres=# insert into t select 1, array_agg(x) from generate_series(1, 2030) x;
  ERROR:  item too long (check #1): length 8160, maxBufferLen 8168
  CONTEXT:  Append-Only table 't'

Therefore, we should revert those changes, including:

1. Still use the toast_tuple_target for toasting 'x' and 'e' columns for AO tables.
There's also a GPDB_12_MERGE_FIXME in the original comment suggesting to use
RelationGetToastTupleTarget for AO tables (in order to unify the logic that calculates
maxDataLen for heap and AO tuples, I suppose). But that's not possible currently,
because the calculated toast_tuple_target is stored in AppendOnlyInsertDesc and
we cannot get it from the Relation struct, and it seems there is no easy way to
change that. Therefore, keep this FIXME comment removed.

2. Still use the toast_tuple_target for toasting 'm' columns for AO tables. Also restore
the FIXME comment suggesting that we should use a bigger target size for 'm' columns.
This is something worth investigating further.

Commit a0684da also includes a change to use custom toast_tuple_target for heap tables,
which should be a valid change. I think that fixed an oversight when merging the upstream
commit c251336.
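The failure mode reduces to a simple invariant: a tuple is toasted only when it exceeds the toast target, so the target must not exceed what an AO block can actually store. A Python sketch with illustrative, hypothetical byte values (USABLE, FIXED_TARGET, and CUSTOM_TARGET are made-up numbers, not the real constants, and the toast decision is deliberately simplified):

```python
def stored_len(tuple_len, toast_target):
    # A tuple is toasted only when it exceeds the target; assume toasting
    # shrinks it to at most the target size.
    return tuple_len if tuple_len <= toast_target else toast_target

USABLE = 8150         # hypothetical usable bytes per item in an 8kB AO block
FIXED_TARGET = 8160   # hypothetical fixed toast target (too large for this block)
CUSTOM_TARGET = 1992  # hypothetical target derived from the AO blocksize

tuple_len = 8160      # the length from the error message above

# Fixed target: the tuple is never toasted and cannot be stored.
print(stored_len(tuple_len, FIXED_TARGET) <= USABLE)   # False -> "item too long"
# Blocksize-derived target: the tuple is toasted down and fits.
print(stored_len(tuple_len, CUSTOM_TARGET) <= USABLE)  # True
```

This is why reverting to the blocksize-derived toast_tuple_target fixes the insert failure.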
red1452 pushed a commit that referenced this pull request Nov 21, 2022
We used to rename index-backed constraints in the EXCHANGE PARTITION
command in 6X. Now we don't. We've decided to keep that behavior
in 7X after looking into the opposing arguments:

Argument #1. It might cause problems during upgrade.
- Firstly, we won't be using legacy syntax in the dump scripts so we
just need to worry about the existing tables produced by EXCHANGE
PARTITION. I.e. whether or not they can be restored correctly.
- For upgrading from 6X->7X, since those tables already have matched
constraint and index names with the table names, we should be OK.
- For upgrading 7X->onward, since we implement EXCHANGE PARTITION
simply as a combination of upstream-syntax commands (see
AtExecGPExchangePartition()), pg_upgrade should be able to handle
them. We've verified that manually and the automated test should
cover that too. In fact, this gives another point that we shouldn't
do our own hacky things as part of EXCHANGE PARTITION which might
confuse pg_upgrade.

Argument #2. It might surprise the users and their existing workloads.
- The indexed constraint names are all implicitly generated and
shouldn't be directly used in most cases.
- They are not the only names that will appear changed; e.g. the names of
normal indexes (such as B-tree ones) will be too. So the decision to
change one kind should be made together with changing all of them.

More details see: https://docs.google.com/document/d/1enJdKYxkk5WFRV1WoqIgLgRxCGxOqI2nglJVE_Wglec
RekGRpth pushed a commit that referenced this pull request Jan 23, 2023
Due to how pg_size_pretty(bigint) was implemented, it was possible that,
when given a negative number of bytes, the return value would not match
the value returned for the equivalent positive number of bytes.  This was
due to two separate issues.

1. The function used bit shifting to convert the number of bytes into
larger units.  The rounding performed by bit shifting is not the same as
dividing.  For example -3 >> 1 = -2, but -3 / 2 = -1.  These two
operations are only equivalent with positive numbers.

2. The half_rounded() macro rounded towards positive infinity.  This meant
that negative numbers rounded towards zero and positive numbers rounded
away from zero.

Here we fix #1 by dividing the values instead of bit shifting.  We fix #2
by adjusting the half_rounded macro always to round away from zero.
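Both defects can be seen with a few lines of arithmetic. A Python sketch (c_div imitates C's truncating integer division, since Python's // floors; the half_rounded variants are simplified stand-ins for the macro, not the actual source):

```python
def c_div(a, b):
    """C-style integer division: truncates toward zero."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

# Defect 1: arithmetic right shift floors, so it is not division for negatives.
print(-3 >> 1)       # -2 (floor)
print(c_div(-3, 2))  # -1 (truncation)

# Defect 2: rounding halves toward +infinity is asymmetric around zero.
def half_rounded_up(x):
    """Old behaviour (sketch): negative halves round toward zero."""
    return c_div(x + 1, 2)

def half_rounded_away(x):
    """Fixed behaviour (sketch): halves always round away from zero."""
    return c_div(x + (1 if x >= 0 else -1), 2)

print(half_rounded_up(3), half_rounded_up(-3))      # 2 -1  (asymmetric)
print(half_rounded_away(3), half_rounded_away(-3))  # 2 -2  (symmetric)
```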

Additionally, adjust the pg_size_pretty(numeric) function to be more
explicit that it's using division rather than bit shifting.  A casual
observer might have believed bit shifting was used due to a static
function being named numeric_shift_right.  However, that function was
calculating the divisor from the number of bits and performed division.
Here we make that more clear.  This change is just cosmetic and does not
affect the return value of the numeric version of the function.

Here we also add a set of regression tests for both versions of
pg_size_pretty() which test the values directly before and after the
function switches to the next unit.

This bug was introduced in 8a1fab3. Prior to that negative values were
always displayed in bytes.

Author: Dean Rasheed, David Rowley
Discussion: https://postgr.es/m/CAEZATCXnNW4HsmZnxhfezR5FuiGgp+mkY4AzcL5eRGO4fuadWg@mail.gmail.com
Backpatch-through: 9.6, where the bug was introduced.
red1452 pushed a commit that referenced this pull request Jul 26, 2023
gpdb_get_master_data_dir should set the master_data_dir variable
to a non-NULL pointer. However, there is a code path in this function
that leads to a NULL result. We need to check for this case and terminate
the gpmon process with an error if it occurs.

This case has actually been reproduced in our production environment:

(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1  0x00007f18fc1de9ce in __GI___strdup (s=0x0) at strdup.c:41
#2  0x000055ff3885813d in getconfig () at gpmmon.c:1679
#3  main (argc=<optimized out>, argv=<optimized out>) at gpmmon.c:1358
(gdb) Quit
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
## Problem
An error occurs in python lib when a plpython function is executed.
After our analysis, in the user's cluster, a plpython UDF 
was running with the unstable network, and got a timeout error:
`failed to acquire resources on one or more segments`.
Then a plpython UDF was run in the same session, and the UDF
failed with GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce
We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit the current psql session.
- stop a segment's postmaster by attaching gdb: `gdb -p "the pid of segment postmaster"`
- start a new psql session.
- call `SELECT test_func();` and get an error:
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb so the postmaster can run again.
- call `SELECT test_func();` again and get a panic:
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 
```

## Analysis
- There is an SPI call in test_func(): `plpy.execute()`.
- The coordinator then starts a subtransaction via PLy_spi_subtransaction_begin().
- If a segment cannot receive the instruction from the coordinator,
  the subtransaction-begin procedure fails.
- BUT! The Python interpreter does not know that an error happened and
  does not clean up its state.
- The next plpython UDF in the same session then fails due to the
  corrupted Python state.

## Solution
- Use a try/catch block to catch the error raised by PLy_spi_subtransaction_begin().
- Set the Python error indicator via PLy_spi_exception_set().

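The shape of the fix can be modelled in standalone C. PostgreSQL's PG_TRY/PG_CATCH is built on setjmp/longjmp, so the sketch below mimics that mechanism without the real headers; all names here are illustrative stand-ins, and `py_error` stands in for the Python error indicator that PLy_spi_exception_set() would populate:

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static jmp_buf err_ctx;           /* stands in for PG's error context stack */
static char py_error[128];        /* stands in for the Python error indicator */

/* Stand-in for PLy_spi_subtransaction_begin(): on failure it unwinds
 * non-locally, the way elog(ERROR, ...) does. */
static void subtransaction_begin(bool segments_reachable)
{
    if (!segments_reachable)
        longjmp(err_ctx, 1);
}

/* The fix in spirit: wrap the subtransaction begin in a try/catch so
 * that a failure is recorded for the Python side instead of leaving
 * the interpreter state inconsistent. Returns true on success. */
static bool spi_begin_guarded(bool segments_reachable)
{
    if (setjmp(err_ctx) == 0)     /* "PG_TRY" */
    {
        subtransaction_begin(segments_reachable);
        return true;
    }
    /* "PG_CATCH": propagate the error into Python, as
     * PLy_spi_exception_set() does in the actual patch. */
    snprintf(py_error, sizeof(py_error),
             "failed to acquire resources on one or more segments");
    return false;
}
```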
backport from #16856


Co-authored-by: Chen Mulong <chenmulong@gmail.com>
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
(cherry picked from commit 45d6ba8)

Co-authored-by: Zhang Hao <hzhang2@vmware.com>
Stolb27 pushed a commit that referenced this pull request Feb 15, 2024
…586)

My previous commit 8915cd0 caused a coredump in some pipeline jobs.
Example stack:
```
Core was generated by `postgres:  7000, ic proxy process
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
(gdb) bt
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
#1  pg_atomic_read_u32 (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics.h:247
#2  LWLockAttemptLock (mode=LW_EXCLUSIVE, lock=0x7f05a8c51100) at lwlock.c:751
#3  LWLockAcquire (lock=0x7f05a8c51100, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1188
#4  0x0000000000b32fff in ShmemInitStruct (name=name@entry=0x130e160 "", size=size@entry=4, foundPtr=foundPtr@entry=0x7ffcf94513bf) at shmem.c:412
#5  0x0000000000d6d18e in ic_proxy_server_main () at ic_proxy_main.c:545
#6  0x0000000000d6c219 in ICProxyMain (main_arg=<optimized out>) at ic_proxy_bgworker.c:36
#7  0x0000000000aa9caa in StartBackgroundWorker () at bgworker.c:955
#8  0x0000000000ab9407 in do_start_bgworker (rw=<optimized out>) at postmaster.c:6450
#9  maybe_start_bgworkers () at postmaster.c:6706
#10 0x0000000000abbc59 in ServerLoop () at postmaster.c:2095
#11 0x0000000000abd777 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x36e3650) at postmaster.c:1633
#12 0x00000000006e4764 in main (argc=5, argv=0x36e3650) at main.c:240
(gdb) p *ptr
Cannot access memory at address 0x7f05a8c51104
```

The root cause is that I forgot to initialize the SHM structure in
CreateSharedMemoryAndSemaphores(). This commit fixes that.
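The pattern being fixed can be illustrated with a toy model of ShmemInitStruct(): shared structures must be created once at postmaster startup so background workers only ever attach to an existing region. The registry below is a deliberately simplified stand-in, not the real API from src/backend/storage/ipc:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy named-region registry standing in for PostgreSQL shared memory. */
#define MAX_REGIONS 8
static struct { const char *name; void *addr; } regions[MAX_REGIONS];
static int nregions = 0;

/* Modelled after ShmemInitStruct(name, size, &found): returns the
 * existing region and sets *found, or allocates a zeroed one. */
static void *shmem_init_struct(const char *name, size_t size, bool *found)
{
    for (int i = 0; i < nregions; i++)
        if (strcmp(regions[i].name, name) == 0)
        {
            *found = true;
            return regions[i].addr;
        }
    if (nregions >= MAX_REGIONS)
        return NULL;
    *found = false;
    regions[nregions].name = name;
    regions[nregions].addr = calloc(1, size);
    return regions[nregions++].addr;
}

/* The fix: create the ic-proxy struct during startup
 * (CreateSharedMemoryAndSemaphores), before any worker touches it. */
static void create_shared_memory_and_semaphores(void)
{
    bool found;
    shmem_init_struct("ic_proxy", sizeof(int), &found);
}
```

Without the startup call, the worker's lookup would be the first allocation attempt and could hand back an address the postmaster never mapped, matching the SIGSEGV in the stack above.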
Stolb27 pushed a commit that referenced this pull request Feb 15, 2024
Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Stolb27 pushed a commit that referenced this pull request Mar 10, 2025
Commit e7cb7ee, which introduced the infrastructure for FDWs and
custom scan providers to replace joins with scans, failed to add support
for handling pseudoconstant quals assigned to replaced joins in
createplan.c, leading to an incorrect plan without a gating Result node
when postgres_fdw replaced a join with such a qual.

To fix, we could add the support by 1) modifying the ForeignPath and
CustomPath structs to store the list of RestrictInfo nodes to apply to
the join, as in JoinPaths, if they represent foreign and custom scans
replacing a join with a scan, and by 2) modifying create_scan_plan() in
createplan.c to use that list in that case, instead of the
baserestrictinfo list, to get pseudoconstant quals assigned to the join;
but #1 would cause an ABI break.  So fix by modifying the infrastructure
to just disallow replacing joins with such quals.

Back-patch to all supported branches.

Reported by Nishant Sharma.  Patch by me, reviewed by Nishant Sharma and
Richard Guo.

Discussion: https://postgr.es/m/CADrsxdbcN1vejBaf8a%2BQhrZY5PXL-04mCd4GDu6qm6FigDZd6Q%40mail.gmail.com
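The chosen fix, disallowing the replacement, can be sketched in standalone C. The struct and function below are illustrative stand-ins for the planner's RestrictInfo handling, not the real PostgreSQL types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for RestrictInfo: a clause is "pseudoconstant"
 * when it references no table columns and should be evaluated once per
 * query, normally by a gating Result node above the join. */
typedef struct RestrictInfoStub
{
    bool pseudoconstant;
} RestrictInfoStub;

/* The fix in spirit: a join whose restrict list contains any
 * pseudoconstant clause is simply not offered to FDWs or custom-scan
 * providers for replacement, so no gating Result node can be lost. */
static bool join_replaceable_by_scan(const RestrictInfoStub *quals, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (quals[i].pseudoconstant)
            return false;
    return true;
}
```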