Skip to content

Upstream Sync#7

Closed
acmnu wants to merge 175 commits intoarenadata:adb-5.xfrom
greenplum-db:5X_STABLE
Closed

Upstream Sync#7
acmnu wants to merge 175 commits intoarenadata:adb-5.xfrom
greenplum-db:5X_STABLE

Conversation

@acmnu
Copy link

@acmnu acmnu commented May 10, 2018

There are a lot of commits after 5.8.0. Let them be here.

Xin Zhang and others added 21 commits May 4, 2018 13:37
Sometimes, these are not set. Other times, they can be set incorrectly
by the user, or a malicious actor.

Co-authored-by: Xin Zhang <xzhang@pivotal.io>
Co-authored-by: David Sharp <dsharp@pivotal.io>
(cherry picked from commit f60124ccf00c48c1e5f374d3f863faf3b2c4ac20)
These table are inserted values not distributed right, via setting the
GUC gp_enable_segment_copy_checking off and `COPY FROM ON SEGMENT`.

Alter them to distributed randomly after testing to not break following
tests against the databases left by ICW.

(cherry picked from commit 864d6162cda469ec33be09f32e5ca2bae8debc53)
* Create the sequence which is belonged to a table.
* Correct the sequence next value

We will check the dependent sequences before transfering the table data.
If the table contains the sequeces. gptransfer will pg_dump all the
metadata of the sequences with data. and import it into the
destiantion cluster

Signed-off-by: Ming LI <mli@apache.org>
(cherry picked from commit 43a7a9e909c1086eafda70c7db48f878759ac8d9)
* docs - updates to resgroup "using" page for memory_auditor

* reorg, other suggestions from david

* edits from david

* 2 more edits from david

* too many Greenplum Databases in the sentence
* docs - resgroup memory_auditor cat/view updates for 5x

* clarify default memory auditor
* using page updates for resgroup global shmem for 5x

* some edits requested by david

* new faq - add global shmem case

* user can not request resgroup global shared memory

* incorporate edits requested from simon
* docs - low memory query config for RQs and RGs

* add missing topic ending tag
The other two platforms (centos6 and sles11) are currently being
exported. All three need to be consumed in the gp-integration-testing
pipeline.

Co-authored-by: Jason Vigil <jvigil@pivotal.io>
Co-authored-by: Kris Macoskey <kmacoskey@pivotal.io>
if the gptransfer with --full option, it should ensure the spcified
database is not existed in the destination cluster. it isn't guaranteed,
so we decide to remove this case

Signed-off-by: Adam Lee <ali@pivotal.io>
(cherry picked from commit ec07296e689501fb7cb6613ea2452722cbfee37b)
Signed-off-by: Adam Lee <ali@pivotal.io>
(cherry picked from commit 9ed9da25fda2a261a1ffc83217006e5ecf281480)
ExecBRInsertTriggers() uses the per tuple memory context, which might be
reset and pfree() SEGV.

(cherry picked from commit 3ce5921bb3cdcd912a30b1ddfb19bcdcefb88d82)
because the behivor of pg_dump is different between gpdb 4 and gpdb 5
so the check_sequence function only can work on gpdb 5

(cherry picked from commit 4eb6078f9587f4ddda0646e30579fc55d5ca47f9)
* docs - add pl/container resource mgmt info

* incorporate review comments

* edits requested by david
These jobs no longer belong to the Data Protection team. It is confusing
to have them labeled with DPM.

Authored-by: Karen Huddleston <khuddleston@pivotal.io>
* docs - add a pl/container upgrade procedure

* replace gppkg -r, -i with -u
* docs - clarify settings for low memory queries

* memory_spill_ration range 0-2 for low mem queries
This version of ORCA adds missing Xforms that generate Indexed Nested
Loop Join
With ensure, the destroy step will run even if a previous step fails or
if the job is terminated early.

Co-authored-by: Karen Huddleston <khuddleston@pivotal.io>
Co-authored-by: Chris Hajas <chajas@pivotal.io>
@acmnu acmnu added this to the Next milestone May 10, 2018
jimmyyih and others added 8 commits May 10, 2018 14:28
There were scenarios where adding a new partition to a partition table would
cause a negative or duplicate partition rule order (parruleord) value to show
up in the pg_partition_rule catalog table.

1. Negative parruleord values could show up during parruleord gap closing when
   the new partition is inserted above a parruleord gap.
2. Negative parruleord values could show up when the max number of partitions
   for that level has been reached (32767), and there is an attempt to add a
   new partition that would have been the highest ranked partition in that
   partition's partition range.
3. Duplicate parruleord values could show up when the max number of partitions
   for that level has been reached (32767), and there is an attempt to add a
   new partition that would have been inserted between the partition table's
   sequence of parruleord values.

Co-authored-by: David Kimura <dkimura@pivotal.io>
During pg_upgrade, the pg_type catalog is queried for oid, typname,
and typnamespace which will be used to pre-assign type oids. The issue
that has been exposed is that the typnames in the catalog (created by
ALTER TABLE ADD PARTITION) are different than the typnames generated
during CREATE TABLE (table creation SQL generated by pg_dump).

Here's an example...
From catalog where typname is generated during ALTER TABLE ADD PARTITION:
____test_partitioned_table_never_decrements_parruleord__1_prt_p

From catalog where typname is generated during CREATE TABLE (created by pg_dump):
___test_partitioned_table_never_decrements_parruleord__1_prt_pa

To prevent pg_upgrade test from failing and keeping the 5X_STABLE
branch red, we drop the partition table and add a FIXME comment to
look into why the typnames are generated differently.
This image exists in gpdb6 and is needed for building a docker image on
new releases of gpdb5, therefore it should live on the 5X_STABLE branch.

Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
Co-authored-by: Victoria Henry <vhenry@pivotal.io>
Co-authored-by: Jason Vigil <jvigil@pivotal.io>
Co-authored-by: Kris Macoskey <kmacoskey@pivotal.io>
In order to be built in a pipeline there needs to be a reference to the
scripts used in the Dockerfile that is relative to the expected
directories made available as volumes in concourse.

Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
Co-authored-by: Victoria Henry <vhenry@pivotal.io>
Co-authored-by: Jason Vigil <jvigil@pivotal.io>
Co-authored-by: Kris Macoskey <kmacoskey@pivotal.io>
Since protocols do not belong to a namespace, we do not want to dump
them in table or schema filtered backups. They will only be dumped in a
full backup.

Co-authored-by: Karen Huddleston <khuddleston@pivotal.io>
Co-authored-by: Chris Hajas <chajas@pivotal.io>
Huiliang Liu and others added 14 commits July 18, 2018 10:58
#5285)

* docs - update gpbackup API - add segment instance and update backup directory information.

Also update API version to 0.3.0.

This will be ported to 5X_STABLE

* docs - gpbackup API - review updates and fixes for scope information

Also, cleanup edits.

* docs - gpbackup API - more review updates and fixes to scope information.
…ons. (#5050)

We have lately seen a lot of failures in test cases related to
partitioning, with errors like this:

select tablename, partitionlevel, partitiontablename, partitionname, partitionrank, partitionboundary from pg_partitions where tablename = 'mpp3079a';
ERROR:  cache lookup failed for relation 148532 (ruleutils.c:7172)

The culprit is that that the view passes a relation OID to the
pg_get_partition_rule_def() function, and the function tries to perform a
syscache lookup on the relation (in flatten_reloptions()), but the lookup
fails because the relation was dropped concurrently by another transaction.
This race is possible, because the query runs with an MVCC snapshot, but
the syscache lookups use SnapshotNow.

This commit doesn't eliminate the race completely, but at least it makes it
narrower. A more reliable solution would've been to acquire a lock on the
table, but then that might block, which isn't nice either.

Another solution would've been to modify flatten_reloptions() to return
NULL instead of erroring out, if the lookup fails. That approach is taken
on the other lookups, but I'm reluctant to modify flatten_reloptions()
because it's inherited from upstream. Let's see how well this works in
practice first, before we do more drastic measures.
We've seen a lot of failures in the 'sreh' test in the pipeline, like this:

--- 263,269 ----
  FORMAT 'text' (delimiter '|')
  SEGMENT REJECT LIMIT 10000;
  SELECT * FROM sreh_ext;
! ERROR:  connection failed dummy_protocol://DUMMY_LOCATION
  INSERT INTO sreh_target SELECT * FROM sreh_ext;
  NOTICE:  Found 10 data formatting errors (10 or more input rows). Rejected related input data.
  SELECT count(*) FROM sreh_target;

I don't really know, but I'm guessing it could be because it sometimes
takes more than one second for gpfdist to fully start up, if there's a lot
of disk or other activity. Increase the sleep time from 1 to 3 seconds;
we'll see if that helps.

(cherry picked from commit bb8575a9d58939a44218e5f7afa6ce32faf8d3c6)
- Previously, DNS was queried within the `Ping` utility constructor, so a DNS failure would always raised an exception.
- Now the DNS query is in the standard `run()` method, so a failure from DNS will raise optionally, depending on the `validateAfter` parameter.
- `Command` declared as a new style class so that `super(Ping, self).run()` can be called.

Co-authored-by: Larry Hamel <lhamel@pivotal.io>
Co-authored-by: Jemish Patel <jpatel@pivotal.io>
The MyProc inDropTransaction flag was used to make sure concurrent AO vacuum
would not conflict with each other during drop phase. Two concurrent AO vacuum
on same relation was possible back in 4.3 where the different AO vacuum phases
(prepare, compaction, drop, cleanup) would interleave with each other, and
having two AO vacuum drop phases concurrently on the same AO relation was
dangerous. We now hold the ShareUpdateExclusiveLock through the entire AO
vacuum which renders the inDropTransaction flag useless and disallows the
interleaving mechanism.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
* docs - correct log file locations in best practices

* edit requested by david
1. When doing harvesting, raise the gp_max_csv_line_length to
maximum legal value in session level.
2. For query longer than gp_max_csv_line_length, this workaround
replaces line breaks in query text with space to prevent load
failure. It may lead long query statement changed when load to
history table, but it is still better than fail to load or truncate
the query text.

Co-authored-by: Teng Zhang tezhang@pivotal.io
Co-authored-by: Hao Wang haowang@pivotal.io
- add fast_match option in gpload config file. If both reuse_tables
and fast_match are true, gpload will try fast match external
table(without checking columns). If reuse_tables is false and
fast_match is true, it will print warning message.
* docs - remove duplicate gphdfs/kerberos topic in best practices

* remove unused file
If autovacuum was triggered before ShmemVariableCache->latestCompletedXid is
updated by manually consuming xids then autovacuum may not vacuum template0
with a proper transaction id to compare against. We made the test more reliable
by suspending a new fault injector (auto_vac_worker_before_do_autovacuum) right
before autovacuum worker sets recentXid and starts doing the autovacuum.  This
allows us to guarantee that autovacuum is comparing against a proper xid.

We also removed the loop in the test because vacuum_update_dat_frozen_xid fault
injector ensures the pg_database table has been updated.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
@acmnu acmnu closed this Jul 24, 2018
aspekt112 pushed a commit that referenced this pull request Oct 31, 2018
Actually we saw this panic on GPDB5. This issue was fixed on PG
in 2014 and gpdb master does not include this fix yet.

Here is the crash stack on GPDB5.
    \#2  <signal handler called>
    \#3  transformStmt (pstate=pstate@entry=0x1464200, parseTree=parseTree@entry=0x0) at analyze.c:269
    \#4  0x00000000005b5852 in parse_analyze (parseTree=parseTree@entry=0x0, sourceText=sourceText@entry=0x207c8c0 "",
        paramTypes=0x0, numParams=0) at analyze.c:166
    \#5  0x000000000082cbc9 in pg_analyze_and_rewrite (parsetree=0x0, query_string=0x207c8c0 "",
        paramTypes=<optimized out>, numParams=<optimized out>) at postgres.c:811
    \#6  0x000000000092a232 in RevalidateCachedPlanWithParams (plansource=plansource@entry=0x207c830,
        useResOwner=useResOwner@entry=0 '\000', boundParams=boundParams@entry=0x0, intoClause=intoClause@entry=0x0)
        at plancache.c:556
    \#7  0x000000000092a462 in RevalidateCachedPlan (plansource=plansource@entry=0x207c830,
        useResOwner=useResOwner@entry=0 '\000') at plancache.c:665
    \#8  0x0000000000828b1a in exec_bind_message (input_message=input_message@entry=0x7fff0c00a300) at postgres.c:2521
    \#9  0x000000000082b935 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x12e1bb0, dbname=<optimized out>,
        username=<optimized out>) at postgres.c:5309

Here is part of the PG commit message.

    commit 6777080
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Wed Nov 12 15:58:37 2014 -0500

        Explicitly support the case that a plancache's raw_parse_tree is NULL.

        This only happens if a client issues a Parse message with an empty query
        string, which is a bit odd; but since it is explicitly called out as legal
        by our FE/BE protocol spec, we'd probably better continue to allow it.

        ......
leskin-in pushed a commit that referenced this pull request Dec 14, 2018
It had two indexes:

* pg_statlastop_classid_objid_index on (classid oid_ops, objid oid_ops), and
* pg_statlastop_classid_objid_staactionname_index on (classid oid_ops, objid
  oid_ops, staactionname name_ops)

The first one is completely redundant with the second one. Remove it.

Fixes assertion failure https://github.com/greenplum-db/gpdb/issues/6362.
The assertion was added in PostgreSQL 9.1, commit d2f60a3. The failure
happened on "VACUUM FULL pg_stat_last_operation", if the VACUUM FULL
itself added a new row to the table. The insertion also inserted entries
in the indexes, which tripped the assertion that checks that you don't try
to insert entries into an index that's currently being reindexed, or
pending reindexing:

> (gdb) bt
> #0  0x00007f02f5189783 in __select_nocancel () from /lib64/libc.so.6
> #1  0x0000000000be76ef in pg_usleep (microsec=30000000) at pgsleep.c:53
> #2  0x0000000000ad75aa in elog_debug_linger (edata=0x11bf760 <errordata>) at elog.c:5293
> #3  0x0000000000acdba4 in errfinish (dummy=0) at elog.c:675
> #4  0x0000000000acc3bf in ExceptionalCondition (conditionName=0xc15798 "!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", errorType=0xc156ef "FailedAssertion",
>     fileName=0xc156d0 "indexam.c", lineNumber=215) at assert.c:46
> #5  0x00000000004fded5 in index_insert (indexRelation=0x7f02f6b6daa0, values=0x7ffdb43915e0, isnull=0x7ffdb43915c0 "", heap_t_ctid=0x240bd64, heapRelation=0x24efa78,
>     checkUnique=UNIQUE_CHECK_YES) at indexam.c:215
> #6  0x00000000005bda59 in CatalogIndexInsert (indstate=0x240e5d0, heapTuple=0x240bd60) at indexing.c:136
> #7  0x00000000005bdaaa in CatalogUpdateIndexes (heapRel=0x24efa78, heapTuple=0x240bd60) at indexing.c:162
> #8  0x00000000005b2203 in MetaTrackAddUpdInternal (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX", rel=0x24efa78,
>     old_tuple=0x0) at heap.c:744
> #9  0x00000000005b229d in MetaTrackAddObject (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX") at heap.c:773
> #10 0x00000000005b2553 in MetaTrackUpdObject (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX") at heap.c:856
> #11 0x00000000005bd271 in reindex_index (indexId=6053, skip_constraint_checks=1 '\001') at index.c:3741
> #12 0x00000000005bd418 in reindex_relation (relid=6052, flags=2) at index.c:3870
> #13 0x000000000067ba71 in finish_heap_swap (OIDOldHeap=6052, OIDNewHeap=16687, is_system_catalog=1 '\001', swap_toast_by_content=0 '\000', swap_stats=1 '\001',
>     check_constraints=0 '\000', is_internal=1 '\001', frozenXid=821, cutoffMulti=1) at cluster.c:1667
> #14 0x0000000000679ed5 in rebuild_relation (OldHeap=0x7f02f6b7a6f0, indexOid=0, verbose=0 '\000') at cluster.c:648
> #15 0x0000000000679913 in cluster_rel (tableOid=6052, indexOid=0, recheck=0 '\000', verbose=0 '\000', printError=1 '\001') at cluster.c:461
> #16 0x0000000000717580 in vacuum_rel (onerel=0x0, relid=6052, vacstmt=0x2533c38, lmode=8, for_wraparound=0 '\000') at vacuum.c:2315
> #17 0x0000000000714ce7 in vacuumStatement_Relation (vacstmt=0x2533c38, relid=6052, relations=0x24c12f8, bstrategy=0x24c1220, do_toast=1 '\001', for_wraparound=0 '\000',
>     isTopLevel=1 '\001') at vacuum.c:787
> #18 0x0000000000714303 in vacuum (vacstmt=0x2403260, relid=0, do_toast=1 '\001', bstrategy=0x24c1220, for_wraparound=0 '\000', isTopLevel=1 '\001') at vacuum.c:337
> #19 0x0000000000969cd2 in standard_ProcessUtility (parsetree=0x2403260, queryString=0x24027e0 "vacuum full;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2403648,
>     completionTag=0x7ffdb4392550 "") at utility.c:804
> #20 0x00000000009691be in ProcessUtility (parsetree=0x2403260, queryString=0x24027e0 "vacuum full;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2403648,
>     completionTag=0x7ffdb4392550 "") at utility.c:373

In this scenario, we had just reindexed one of the indexes of
pg_stat_last_operation, and the metatrack update of that tried to insert a
row into the same table. But the second index in the table was pending
reindexing, which triggered the assertion.

After removing the redundant index, pg_stat_last_operation has only one
index, and that scenario no longer happens. This is a bit fragile fix,
because the problem will reappear as soon as you add a second index on the
table. But we have no plans of doing that, and I believe no harm would be
done in production builds with assertions disabled, anyway. So this will
do for now.

Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
Reviewed-by: Shaoqi Bai <sbai@pivotal.io>
Reviewed-by: Jamie McAtamney <jmcatamney@pivotal.io>
leskin-in pushed a commit that referenced this pull request May 7, 2019
We found that if gp_segment_configuration is locked, then it will
fail by triggering FTS. We got the stack below
	#2  0x0000000000a6bb29 in ExceptionalCondition at assert.c:66
	#3  0x0000000000aac19a in enable_timeout timeout.c:143
	#4  0x0000000000aacb6c in enable_timeout_after timeout.c:473
	#5  0x00000000008e86ef in ProcSleep at proc.c:1300
	#6  0x00000000008deb70 in WaitOnLock at lock.c:1894
	#7  0x00000000008e019e in LockAcquireExtended at lock.c:1205
	#8  0x00000000008dd2d8 in LockRelationOid at lmgr.c:102
	#9  0x000000000051c928 in heap_open at heapam.c:1083
	#10 0x0000000000b7feaf in getCdbComponentInfo at cdbutil.c:173
	#11 0x0000000000b81365 in cdbcomponent_getCdbComponents at cdbutil.c:606
	#12 0x00000000007603e1 in ftsMain at fts.c:351
	#13 0x0000000000760715 in ftsprobe_start at fts.c:121
	#14 0x00000000004cc7b0 in ServerLoop ()
	#15 0x00000000008769bf in PostmasterMain at postmaster.c:1531
	#16 0x000000000079098b in main ()
So it is that FTS hasn't initialized timeout. Any process that
wants to use timeout must call initilization first. This is the
root cause gpexpand job fails on master pipeline in build 71 and 79.
We added this initialization in FTS and GDD.
deart2k pushed a commit that referenced this pull request May 17, 2019
We found that if gp_segment_configuration is locked, then it will
fail by triggering FTS. We got the stack below
	#2  0x0000000000a6bb29 in ExceptionalCondition at assert.c:66
	#3  0x0000000000aac19a in enable_timeout timeout.c:143
	#4  0x0000000000aacb6c in enable_timeout_after timeout.c:473
	#5  0x00000000008e86ef in ProcSleep at proc.c:1300
	#6  0x00000000008deb70 in WaitOnLock at lock.c:1894
	#7  0x00000000008e019e in LockAcquireExtended at lock.c:1205
	#8  0x00000000008dd2d8 in LockRelationOid at lmgr.c:102
	#9  0x000000000051c928 in heap_open at heapam.c:1083
	#10 0x0000000000b7feaf in getCdbComponentInfo at cdbutil.c:173
	#11 0x0000000000b81365 in cdbcomponent_getCdbComponents at cdbutil.c:606
	#12 0x00000000007603e1 in ftsMain at fts.c:351
	#13 0x0000000000760715 in ftsprobe_start at fts.c:121
	#14 0x00000000004cc7b0 in ServerLoop ()
	#15 0x00000000008769bf in PostmasterMain at postmaster.c:1531
	#16 0x000000000079098b in main ()
So it is that FTS hasn't initialized timeout. Any process that
wants to use timeout must call initilization first. This is the
root cause gpexpand job fails on master pipeline in build 71 and 79.
We added this initialization in FTS and GDD.
darthunix pushed a commit that referenced this pull request Apr 7, 2020
The syslogger will open the gpperfmon log alert file when
gpperfmon_log_alert_level is not NONE, however even if it fails to open
the file it still writes to it, which causes a crash like this:

    #0  fwrite () from /lib64/libc.so.6
    #1  write_binary_to_file (fh=NULL, ...) at syslogger.c:1885
    #2  write_syslogger_file_binary (...) at syslogger.c:1917
    #3  syslogger_append_current_timestamp (...) at syslogger.c:1103
    #4  syslogger_log_chunk_list (...) at syslogger.c:1571
    #5  syslogger_handle_chunk (...) at syslogger.c:1766
    #6  SysLoggerMain (argv=0x0, argc=0) at syslogger.c:576
    #7  SysLogger_Start () at syslogger.c:850
    #8  do_reaper () at postmaster.c:4984
    #9  ServerLoop () at postmaster.c:2417
    #10 PostmasterMain (...) at postmaster.c:1528
    #11 main (argc=15, argv=0x1dc1680) at main.c:206

To fix the issue we won't mark alert_log_level_opened as true until the
file is really opened successfully.

Reviewed-by: Paul Guo <pguo@pivotal.io>
deart2k pushed a commit that referenced this pull request Apr 10, 2020
The syslogger will open the gpperfmon log alert file when
gpperfmon_log_alert_level is not NONE, however even if it fails to open
the file it still writes to it, which causes a crash like this:

    #0  fwrite () from /lib64/libc.so.6
    #1  write_binary_to_file (fh=NULL, ...) at syslogger.c:1885
    #2  write_syslogger_file_binary (...) at syslogger.c:1917
    #3  syslogger_append_current_timestamp (...) at syslogger.c:1103
    #4  syslogger_log_chunk_list (...) at syslogger.c:1571
    #5  syslogger_handle_chunk (...) at syslogger.c:1766
    #6  SysLoggerMain (argv=0x0, argc=0) at syslogger.c:576
    #7  SysLogger_Start () at syslogger.c:850
    #8  do_reaper () at postmaster.c:4984
    #9  ServerLoop () at postmaster.c:2417
    #10 PostmasterMain (...) at postmaster.c:1528
    #11 main (argc=15, argv=0x1dc1680) at main.c:206

To fix the issue we won't mark alert_log_level_opened as true until the
file is really opened successfully.

Reviewed-by: Paul Guo <pguo@pivotal.io>
(cherry picked from commit e2f1414)
hughcapet pushed a commit that referenced this pull request Aug 4, 2020
…id potential pg_rewind hang.

During testing, I encountered an incremental gprecoverseg hang issue.
Incremental gprecoverseg is based on pg_rewind.  pg_rewind launches a single
mode postgres process and quits after crash recovery if the postgres instance
was not cleanly shut down - this is used to ensure that the postgres is in a
consistent state before doing incremental recovery. I found that the single
mode postgres hangs with the below stack.

\#1  0x00000000008cf2d6 in PGSemaphoreLock (sema=0x7f238274a4b0, interruptOK=1 '\001') at pg_sema.c:422
\#2  0x00000000009614ed in ProcSleep (locallock=0x2c783c0, lockMethodTable=0xddb140 <default_lockmethod>) at proc.c:1347
\#3  0x000000000095a0c1 in WaitOnLock (locallock=0x2c783c0, owner=0x2cbf950) at lock.c:1853
\#4  0x0000000000958e3a in LockAcquireExtended (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001', locallockp=0x0) at lock.c:1155
\#5  0x0000000000957e64 in LockAcquire (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:700
\#6  0x000000000095728c in LockSharedObject (classid=1262, objid=1, objsubid=0, lockmode=3) at lmgr.c:939
\#7  0x0000000000b0152b in InitPostgres (in_dbname=0x2c769f0 "template1", dboid=0, username=0x2c59340 "gpadmin", out_dbname=0x0) at postinit.c:1019
\#8  0x000000000097b970 in PostgresMain (argc=5, argv=0x2c51990, dbname=0x2c769f0 "template1", username=0x2c59340 "gpadmin") at postgres.c:4820
\#9  0x00000000007dc432 in main (argc=5, argv=0x2c51990) at main.c:241

It tries to hold the lock for template1 on pg_database with lockmode 3 but
it conflicts with the lock with lockmode 5 which was held by a recovered dtx
transaction in startup RecoverPreparedTransactions(). Typically the dtx
transaction comes from "create database" (by default the template database is
template1).

Fixing this by using the postgres database for single mode postgres execution.
The postgres database is commonly used in many background worker backends like
dtx recovery, gdd and ftsprobe. With this change, we do not need to worry
about "create database" with template postgres, etc since they won't succeed,
thus avoid the lock conflict.

We may be able to fix this in InitPostgres() by bypassing the locking code in
single mode but the current fix seems to be safer.  Note InitPostgres()
locks/unlocks some other catalog tables also but almost all of them are using
lock mode 1 (except mode 3 pg_resqueuecapability per debugging output).  It
seems that it is not usual in real scenario to have a dtx transaction that
locks catalog with mode 8 which conflicts with mode 1.  If we encounter this
later we need to think out a better (might not be trivial) solution for this.
For now let's fix the issue we encountered at first.

Note in this patch the code fixes in buildMirrorSegments.py and twophase.c are
not related to this patch. They do not seem to be strict bugs but we'd better
fix them to avoid potential issues in the future.

Reviewed-by: Ashwin Agrawal <aashwin@vmware.com>
Reviewed-by: Asim R P <pasim@vmware.com>
deart2k pushed a commit that referenced this pull request Aug 10, 2020
…id potential pg_rewind hang.

During testing, I encountered an incremental gprecoverseg hang issue.
Incremental gprecoverseg is based on pg_rewind.  pg_rewind launches a single
mode postgres process and quits after crash recovery if the postgres instance
was not cleanly shut down - this is used to ensure that the postgres is in a
consistent state before doing incremental recovery. I found that the single
mode postgres hangs with the below stack.

\#1  0x00000000008cf2d6 in PGSemaphoreLock (sema=0x7f238274a4b0, interruptOK=1 '\001') at pg_sema.c:422
\#2  0x00000000009614ed in ProcSleep (locallock=0x2c783c0, lockMethodTable=0xddb140 <default_lockmethod>) at proc.c:1347
\#3  0x000000000095a0c1 in WaitOnLock (locallock=0x2c783c0, owner=0x2cbf950) at lock.c:1853
\#4  0x0000000000958e3a in LockAcquireExtended (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000', reportMemoryError=1 '\001', locallockp=0x0) at lock.c:1155
\#5  0x0000000000957e64 in LockAcquire (locktag=0x7ffde826aa60, lockmode=3, sessionLock=0 '\000', dontWait=0 '\000') at lock.c:700
\#6  0x000000000095728c in LockSharedObject (classid=1262, objid=1, objsubid=0, lockmode=3) at lmgr.c:939
\#7  0x0000000000b0152b in InitPostgres (in_dbname=0x2c769f0 "template1", dboid=0, username=0x2c59340 "gpadmin", out_dbname=0x0) at postinit.c:1019
\#8  0x000000000097b970 in PostgresMain (argc=5, argv=0x2c51990, dbname=0x2c769f0 "template1", username=0x2c59340 "gpadmin") at postgres.c:4820
\#9  0x00000000007dc432 in main (argc=5, argv=0x2c51990) at main.c:241

It tries to hold the lock for template1 on pg_database with lockmode 3 but
it conflicts with the lock with lockmode 5 which was held by a recovered dtx
transaction in startup RecoverPreparedTransactions(). Typically the dtx
transaction comes from "create database" (by default the template database is
template1).

Fixing this by using the postgres database for single mode postgres execution.
The postgres database is commonly used in many background worker backends like
dtx recovery, gdd and ftsprobe. With this change, we do not need to worry
about "create database" with template postgres, etc since they won't succeed,
thus avoid the lock conflict.

We may be able to fix this in InitPostgres() by bypassing the locking code in
single mode but the current fix seems to be safer.  Note InitPostgres()
locks/unlocks some other catalog tables also but almost all of them are using
lock mode 1 (except mode 3 pg_resqueuecapability per debugging output).  It
seems that it is not usual in real scenario to have a dtx transaction that
locks catalog with mode 8 which conflicts with mode 1.  If we encounter this
later we need to think out a better (might not be trivial) solution for this.
For now let's fix the issue we encountered at first.

Note in this patch the code fixes in buildMirrorSegments.py and twophase.c are
not related to this patch. They do not seem to be strict bugs but we'd better
fix them to avoid potential issues in the future.

Reviewed-by: Ashwin Agrawal <aashwin@vmware.com>
Reviewed-by: Asim R P <pasim@vmware.com>
(cherry picked from commit 288908f)
darthunix pushed a commit that referenced this pull request Aug 24, 2021
…CREATE/ALTER resouce group.

In some scenarios, the AccessExclusiveLock for table pg_resgroupcapability may cause database setup/recovery pending. Below is why we need change the AccessExclusiveLock to ExclusiveLock. 

This lock on table pg_resgroupcapability is used to concurrent update this table when run "Create/Alter resource group" statement. There is a CPU limit, after modify one resource group, it has to check if the whole CPU usage of all resource groups doesn't exceed 100%.

Before this fix, AccessExclusiveLock is used. Suppose one user is running "Alter resource group" statement, QD will dispatch this statement to all QEs, so it is a two phase commit(2PC) transaction. When QD dispatched "Alter resource group" statement and QE acquire the AccessExclusiveLock for table pg_resgroupcapability. Until the 2PC distributed transaction committed, QE can release the AccessExclusiveLock for this table.

In the second phase, QD will call function doNotifyingCommitPrepared to broadcast "commit prepared" command to all QEs, QE has already finish prepared, this transation is a prepared transaction. Suppose at this point, there is a primary segment down and a mirror will be promoted to primary.

The mirror got the "promoted" message from coordinator, and will recover based on xlog from primary, in order to recover the prepared transaction, it will read the prepared transaction log entry and acquire AccessExclusiveLock for table pg_resgroupcapability. The callstack is:
#0 lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697
#1 ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757
#2 RecoverPreparedTransactions () at twophase.c:2214
#3 StartupXLOG () at xlog.c:8013
#4 StartupProcessMain () at startup.c:231
#5 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459
#6 StartChildProcess (type=StartupProcess) at postmaster.c:5917
#7 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581
#8 main (argc=7, argv=0x1d555b0) at main.c:240

After that, the database instance will start up, all related initialization functions will be called. However, there is a function named "InitResGroups", it will acquire AccessShareLock for table pg_resgroupcapability and do some initialization stuff. The callstack is:
#6 WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999
#7 LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192
#8 LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126
#9 relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56
#10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47
#11 InitResGroups () at resgroup.c:581
#12 InitResManager () at resource_manager.c:83
#13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284
#14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812
#15 BackendRun (port=, port=) at postmaster.c:4922
#16 BackendStartup (port=0x1c835d0) at postmaster.c:4607
#17 ServerLoop () at postmaster.c:1963
#18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589
#19 in main (argc=7, argv=0x1c595b0) at main.c:240

The AccessExclusiveLock is not released, and it is not compatible with any other locks, so the startup process will be pending on this lock. So the mirror can't become primary successfully.

Even users run "gprecoverseg" to recover the primary segment. the result is similar. The primary segment will recover from xlog, it will recover prepared transactions and acquire AccessExclusiveLock for table pg_resgroupcapability. Then the startup process is pending on this lock. Unless users change the resource type to "queue", the function InitResGroups will not be called, and won't be blocked, then the primary segment can startup normally.

After this fix, ExclusiveLock is acquired when alter resource group. In above case, the startup process acquires AccessShareLock, ExclusiveLock and AccessShareLock are compatible. The startup process can run successfully. After startup, QE will get RECOVERY_COMMIT_PREPARED command from QD, it will finish the second phase of this distributed transaction and release ExclusiveLock on table pg_resgroupcapability. The callstack is:
#0 lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758
#1 ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757
#2 FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704
#3 in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107
#4 performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279
#5 exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570
#6 PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482

The test case of this commit simulates a repro of this bug.
InnerLife0 pushed a commit that referenced this pull request Sep 15, 2021
…ce (#12447)

Recently I built from GreenPlum master branch to run TPC-DS query with 1GB data. For Q47 and Q57, when I turned off GUC `execute_pruned_plan` (on by default), some of worker processes will be hang and the query never returns.

Take Q57 as an example. My cluster configuration is 1 QD + 2 QE. The query looks like:

```sql
with v1 as(
  select
    i_category,i_brand,
    cc_name,d_year,d_moy,
    sum(cs_sales_price) sum_sales,
    avg(sum(cs_sales_price)) over (partition by
      i_category,i_brand,cc_name,d_year)
    avg_monthly_sales,
    rank() over (partition by
      i_category,i_brand,cc_name
    order by
      d_year,d_moy
    ) rn
  from
    item,catalog_sales,date_dim,call_center
  where
    cs_item_sk = i_item_sk and
    cs_sold_date_sk = d_date_sk and
    cc_call_center_sk= cs_call_center_sk and(
      d_year = 1999 or
      ( d_year = 1999-1 and d_moy =12) or
      ( d_year = 1999+1 and d_moy =1)
    )
  group by
    i_category,i_brand,cc_name,d_year,d_moy
),
v2 as(
  select
    v1.i_category,v1.i_brand,v1.cc_name,
    v1.d_year,v1.d_moy,v1.avg_monthly_sales,
    v1.sum_sales,v1_lag.sum_sales psum,
    v1_lead.sum_sales nsum
  from
    v1,v1 v1_lag,v1 v1_lead
  where
    v1.i_category = v1_lag.i_category and
    v1.i_category = v1_lead.i_category and
    v1.i_brand = v1_lag.i_brand and
    v1.i_brand = v1_lead.i_brand and
    v1. cc_name = v1_lag. cc_name and
    v1. cc_name = v1_lead. cc_name and
    v1.rn = v1_lag.rn + 1 and
    v1.rn = v1_lead.rn - 1
)
select *
from v2
where
  d_year = 1999 and
  avg_monthly_sales > 0 and
  case when avg_monthly_sales > 0 then
    abs(sum_sales - avg_monthly_sales) / avg_monthly_sales
    else null end > 0.1
order by
  sum_sales - avg_monthly_sales,3
limit 100;
```

When `execute_pruned_plan` is on by default, the plan looks like:

```
                                                                                                                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.606..10792.702 rows=100 loops=1)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.597..10792.673 rows=100 loops=1)
         Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
         ->  Sort  (cost=0.00..2832.84 rows=1 width=72) (actual time=10791.203..10791.225 rows=50 loops=1)
               Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
               Sort Method:  quicksort  Memory: 152kB
               ->  Sequence  (cost=0.00..2832.84 rows=1 width=72) (actual time=10790.522..10790.559 rows=50 loops=1)
                     ->  Shared Scan (share slice:id 1:0)  (cost=0.00..1539.83 rows=1 width=1) (actual time=10140.895..10145.397 rows=16510 loops=1)
                           ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=56) (actual time=10082.465..10128.750 rows=16510 loops=1)
                                 Partition By: item.i_category, item.i_brand, call_center.cc_name
                                 Order By: date_dim.d_year, date_dim.d_moy
                                 ->  Sort  (cost=0.00..1539.83 rows=1 width=48) (actual time=10082.429..10084.923 rows=16510 loops=1)
                                       Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                       Sort Method:  quicksort  Memory: 20078kB
                                       ->  Redistribute Motion 2:2  (slice2; segments: 2)  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.269..9989.657 rows=16510 loops=1)
                                             Hash Key: item.i_category, item.i_brand, call_center.cc_name
                                             ->  WindowAgg  (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.717..9974.500 rows=16633 loops=1)
                                                   Partition By: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                   ->  Sort  (cost=0.00..1539.83 rows=1 width=126) (actual time=9924.662..9927.280 rows=16633 loops=1)
                                                         Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                         Sort Method:  quicksort  Memory: 20076kB
                                                         ->  Redistribute Motion 2:2  (slice3; segments: 2)  (cost=0.00..1539.83 rows=1 width=126) (actual time=9394.220..9856.375 rows=16633 loops=1)
                                                               Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year
                                                               ->  GroupAggregate  (cost=0.00..1539.83 rows=1 width=126) (actual time=9391.783..9833.988 rows=16424 loops=1)
                                                                     Group Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                     ->  Sort  (cost=0.00..1539.83 rows=1 width=124) (actual time=9397.448..9628.606 rows=174584 loops=1)
                                                                           Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                           Sort Method:  external merge  Disk: 134144kB
                                                                           ->  Redistribute Motion 2:2  (slice4; segments: 2)  (cost=0.00..1539.83 rows=1 width=124) (actual time=6107.447..8237.581 rows=174584 loops=1)
                                                                                 Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy
                                                                                 ->  Hash Join  (cost=0.00..1539.83 rows=1 width=124) (actual time=6112.706..7088.349 rows=178669 loops=1)
                                                                                       Hash Cond: (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk)
                                                                                       ->  Seq Scan on date_dim  (cost=0.00..436.38 rows=204 width=12) (actual time=10.656..17.972 rows=222 loops=1)
                                                                                             Filter: ((d_year = 1999) OR ((d_year = 1998) AND (d_moy = 12)) OR ((d_year = 2000) AND (d_moy = 1)))
                                                                                             Rows Removed by Filter: 36504
                                                                                       ->  Hash  (cost=1103.41..1103.41 rows=1 width=120) (actual time=6100.040..6100.040 rows=1430799 loops=1)
                                                                                             Buckets: 16384 (originally 16384)  Batches: 32 (originally 1)  Memory Usage: 12493kB
                                                                                             ->  Broadcast Motion 2:2  (slice5; segments: 2)  (cost=0.00..1103.41 rows=1 width=120) (actual time=1.802..5410.377 rows=1434428 loops=1)
                                                                                                   ->  Nested Loop  (cost=0.00..1103.40 rows=1 width=120) (actual time=1.632..5127.625 rows=718766 loops=1)
                                                                                                         Join Filter: true
                                                                                                         ->  Redistribute Motion 2:2  (slice6; segments: 2)  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.564..362.958 rows=718766 loops=1)
                                                                                                               Hash Key: catalog_sales.cs_item_sk
                                                                                                               ->  Hash Join  (cost=0.00..1097.40 rows=1 width=22) (actual time=1.112..996.643 rows=717589 loops=1)
                                                                                                                     Hash Cond: (catalog_sales.cs_call_center_sk = call_center.cc_call_center_sk)
                                                                                                                     ->  Seq Scan on catalog_sales  (cost=0.00..509.10 rows=720774 width=18) (actual time=0.144..602.362 rows=721193 loops=1)
                                                                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=12) (actual time=0.022..0.022 rows=6 loops=1)
                                                                                                                           Buckets: 32768  Batches: 1  Memory Usage: 257kB
                                                                                                                           ->  Broadcast Motion 2:2  (slice7; segments: 2)  (cost=0.00..431.00 rows=1 width=12) (actual time=0.009..0.012 rows=6 loops=1)
                                                                                                                                 ->  Seq Scan on call_center  (cost=0.00..431.00 rows=1 width=12) (actual time=0.032..0.035 rows=4 loops=1)
                                                                                                         ->  Index Scan using item_pkey on item  (cost=0.00..6.00 rows=1 width=102) (actual time=0.000..0.006 rows=1 loops=718766)
                                                                                                               Index Cond: (i_item_sk = catalog_sales.cs_item_sk)
                     ->  Redistribute Motion 1:2  (slice8)  (cost=0.00..1293.01 rows=1 width=72) (actual time=646.614..646.646 rows=50 loops=1)
                           ->  Limit  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.533..10787.700 rows=100 loops=1)
                                 ->  Gather Motion 2:1  (slice9; segments: 2)  (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.527..10787.654 rows=100 loops=1)
                                       Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                       ->  Sort  (cost=0.00..1293.01 rows=1 width=72) (actual time=10789.933..10789.995 rows=357 loops=1)
                                             Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name
                                             Sort Method:  quicksort  Memory: 14998kB
                                             ->  Result  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.280..10774.898 rows=12379 loops=1)
                                                   Filter: ((share0_ref4.d_year = 1999) AND (share0_ref4.avg_monthly_sales > '0'::numeric) AND (CASE WHEN (share0_ref4.avg_monthly_sales > '0'::numeric) THEN (abs((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)) / share0_ref4.avg_monthly_sales) ELSE NULL::numeric END > 0.1))
                                                   ->  Hash Join  (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.253..10740.262 rows=13582 loops=1)
                                                         Hash Cond: ((share0_ref4.i_category = share0_ref3.i_category) AND (share0_ref4.i_brand = share0_ref3.i_brand) AND ((share0_ref4.cc_name)::text = (share0_ref3.cc_name)::text) AND (share0_ref4.rn = (share0_ref3.rn + 1)) AND (share0_ref4.rn = (share0_ref2.rn - 1)))
                                                         ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=142) (actual time=0.013..5.570 rows=16510 loops=1)
                                                         ->  Hash  (cost=862.00..862.00 rows=1 width=142) (actual time=10647.380..10647.380 rows=209076 loops=1)
                                                               Buckets: 65536 (originally 32768)  Batches: 2 (originally 1)  Memory Usage: 31389kB
                                                               ->  Hash Join  (cost=0.00..862.00 rows=1 width=142) (actual time=10156.494..10374.421 rows=209076 loops=1)
                                                                     Hash Cond: ((share0_ref3.i_category = share0_ref2.i_category) AND (share0_ref3.i_brand = share0_ref2.i_brand) AND ((share0_ref3.cc_name)::text = (share0_ref2.cc_name)::text))
                                                                     ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=0.009..6.887 rows=16510 loops=1)
                                                                     ->  Hash  (cost=431.00..431.00 rows=1 width=126) (actual time=10156.297..10156.298 rows=16178 loops=1)
                                                                           Buckets: 32768  Batches: 1  Memory Usage: 3144kB
                                                                           ->  Shared Scan (share slice:id 9:0)  (cost=0.00..431.00 rows=1 width=126) (actual time=10139.421..10144.473 rows=16510 loops=1)
 Planning Time: 1905.667 ms
   (slice0)    Executor memory: 330K bytes.
   (slice1)    Executor memory: 4750K bytes avg x 2 workers, 4968K bytes max (seg1).  Work_mem: 4861K bytes max.
   (slice2)    Executor memory: 4701K bytes avg x 2 workers, 4952K bytes max (seg0).  Work_mem: 4894K bytes max.
   (slice3)    Executor memory: 12428K bytes avg x 2 workers, 12428K bytes max (seg0).  Work_mem: 12375K bytes max.
 * (slice4)    Executor memory: 14021K bytes avg x 2 workers, 14021K bytes max (seg0).  Work_mem: 12493K bytes max, 221759K bytes wanted.
   (slice5)    Executor memory: 77K bytes avg x 2 workers, 77K bytes max (seg0).
   (slice6)    Executor memory: 323K bytes avg x 2 workers, 323K bytes max (seg0).  Work_mem: 257K bytes max.
   (slice7)    Executor memory: 39K bytes avg x 2 workers, 39K bytes max (seg0).
   (slice8)    Executor memory: 242K bytes (entry db).
 * (slice9)    Executor memory: 35344K bytes avg x 2 workers, 35360K bytes max (seg1).  Work_mem: 31389K bytes max, 37501K bytes wanted.
 Memory used:  128000kB
 Memory wanted:  3328681kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 10856.507 ms
(86 rows)

Time: 12779.991 ms (00:12.780)
```

There is only one share slice in this query, one producer in slice 1, three consumers in slice 9. However, when I turned GUC off, the query never returns, and the process situation looks like:

```
postgres  22285  22255  0 03:03 pts/1    00:00:00 psql -p9221
postgres  22288  20912  3 03:03 ?        00:00:03 postgres:  9221, postgres tpcds [local] con150 cmd16 EXPLAIN
postgres  22294  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60732) con150 seg0 cmd17 slice1 MPPEXEC SELECT
postgres  22295  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36177) con150 seg1 cmd17 slice1 MPPEXEC SELECT
postgres  22306  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60742) con150 seg0 idle
postgres  22307  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36187) con150 seg1 idle
postgres  22310  20939 11 03:03 ?        00:00:10 postgres:  9210, postgres tpcds 172.17.0.50(60745) con150 seg0 idle
postgres  22311  20950 12 03:03 ?        00:00:11 postgres:  9211, postgres tpcds 172.17.0.50(36190) con150 seg1 idle
postgres  22314  20939  5 03:03 ?        00:00:04 postgres:  9210, postgres tpcds 172.17.0.50(60748) con150 seg0 idle
postgres  22315  20950  5 03:03 ?        00:00:04 postgres:  9211, postgres tpcds 172.17.0.50(36193) con150 seg1 idle
postgres  22318  20939  1 03:03 ?        00:00:01 postgres:  9210, postgres tpcds 172.17.0.50(60750) con150 seg0 idle
postgres  22319  20950  2 03:03 ?        00:00:01 postgres:  9211, postgres tpcds 172.17.0.50(36195) con150 seg1 idle
postgres  22322  20912  0 03:03 ?        00:00:00 postgres:  9221, postgres tpcds [local] con150 seg-1 idle
postgres  22324  20939  0 03:03 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(60754) con150 seg0 idle
postgres  22325  20950  0 03:03 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(36199) con150 seg1 idle
postgres  22348  20939  0 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45936) con150 seg0 idle
postgres  22349  20950  0 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49614) con150 seg1 idle
postgres  22352  20939  4 03:05 ?        00:00:00 postgres:  9210, postgres tpcds 172.17.0.50(45939) con150 seg0 idle
postgres  22353  20950  4 03:05 ?        00:00:00 postgres:  9211, postgres tpcds 172.17.0.50(49617) con150 seg1 idle
```

According to my debugging, the stack of slice 1 processes looks like:

```
#0  0x00007fde606f94f3 in epoll_wait () from /lib64/libc.so.6
#1  0x0000000000d2eec1 in WaitEventSetWaitBlock (set=0x87d8fe0, cur_timeout=-1, occurred_events=0x7ffce695fe00, nevents=1) at latch.c:1081
#2  0x0000000000d2ed9a in WaitEventSetWait (set=0x87d8fe0, timeout=-1, occurred_events=0x7ffce695fe00, nevents=1, wait_event_info=0) at latch.c:1033
#3  0x0000000000d5987d in ConditionVariableSleep (cv=0x7fde540890b0, wait_event_info=0) at condition_variable.c:157
#4  0x0000000000b30a61 in shareinput_writer_waitdone (ref=0x87da950, nconsumers=1) at nodeShareInputScan.c:994
#5  0x0000000000b2fe89 in ExecEndShareInputScan (node=0x88c2ec0) at nodeShareInputScan.c:522
#6  0x0000000000ad63e8 in ExecEndNode (node=0x88c2ec0) at execProcnode.c:888
#7  0x0000000000b3237b in ExecEndSequence (node=0x88c2d80) at nodeSequence.c:132
#8  0x0000000000ad623f in ExecEndNode (node=0x88c2d80) at execProcnode.c:779
#9  0x0000000000b1772e in ExecEndSort (node=0x88c2658) at nodeSort.c:365
```

That is to say, the producer is waiting for consumers to wake it up, while the consumers didn't. According to further debugging, I found a **squelch** is triggered on the *Gather Motion* node upstream of three ShareInputScan consumer nodes. In the squelch logic of ShareInputScan, the consumer will notify producer only if `ndone == nsharers`:

```c
local_state->ndone++;

if (local_state->ndone == local_state->nsharers)
{
    shareinput_reader_notifydone(node->ref, sisc->nconsumers);
    local_state->closed = true;
}
```

While `ndone` will be accumulated one by one consumer, `nsharers` is initialized in ExecInitNode. However, GUC `execute_pruned_plan` affects the root node where the Executor starts to call `ExecInitNode`:

- `execute_pruned_plan` set to true: the initialization will start at the root node of slice 9, `nsharers` will be 3
- `execute_pruned_plan` set to false: the initialization will start at the root node of the whole plan tree, `nsharers` will be 4, then `ndone == nsharers` will never establish, because we only have three consumers, `ndone` will be 3 at most

According to my understanding, the algorithm should work well no matter this GUC is set to true or false. So I add some conditions in the process of initialization of `nsharers`: to accumulate `nsharers` only when initializing consumer nodes of current slice. Then this algorithm should be working fine.
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
## Problem
An error occurs in python lib when a plpython function is executed.
After our analysis, in the user's cluster, a plpython UDF 
was running with the unstable network, and got a timeout error:
`failed to acquire resources on one or more segments`.
Then a plpython UDF was run in the same session, and the UDF
failed with GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce
We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit from the current psql session.
- stop the postmaster of segment: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get error
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb and make postmaster runnable.
- call  `SELECT test_func();` again and get panic
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 
```

## Analysis
- There is an SPI call in test_func(): `plpy.execute()`. 
- Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin();
- Meanwhile, if the segment cannot receive the instruction from the coordinator,
  the subtransaction beginning procedure return fails.
- BUT! The Python processor does not know whether an error happened and
  does not clean its environment.
- Then the next plpython UDF in the same session will fail due to the wrong
  Python environment.

## Solution
- Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin()
- set the python error indicator by PLy_spi_exception_set()

backport from #16856


Co-authored-by: Chen Mulong <chenmulong@gmail.com>
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
An error occurs in python lib when a plpython function is executed.
After our analysis, in the user's cluster, a plpython UDF
was running with the unstable network, and got a timeout error:
`failed to acquire resources on one or more segments`.
Then a plpython UDF was run in the same session, and the UDF
failed with GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit from the current psql session.
- stop the postmaster of segment: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get error
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb and make postmaster runnable.
- call  `SELECT test_func();` again and get panic
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
```

- There is an SPI call in test_func(): `plpy.execute()`.
- Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin();
- Meanwhile, if the segment cannot receive the instruction from the coordinator,
  the subtransaction beginning procedure return fails.
- BUT! The Python processor does not know whether an error happened and
  does not clean its environment.
- Then the next plpython UDF in the same session will fail due to the wrong
  Python environment.

- Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin()
- set the python error indicator by PLy_spi_exception_set()

backport from #16856

Co-authored-by: Chen Mulong <chenmulong@gmail.com>
(cherry picked from commit 45d6ba8)

Co-authored-by: Zhang Hao <hzhang2@vmware.com>
Stolb27 pushed a commit that referenced this pull request Feb 15, 2024
…586)

My previous commit 8915cd0 caused coredump in some pipeline jobs.
Example stack:
```
Core was generated by `postgres:  7000, ic proxy process
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
(gdb) bt
#0  0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48
#1  pg_atomic_read_u32 (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics.h:247
#2  LWLockAttemptLock (mode=LW_EXCLUSIVE, lock=0x7f05a8c51100) at lwlock.c:751
#3  LWLockAcquire (lock=0x7f05a8c51100, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1188
#4  0x0000000000b32fff in ShmemInitStruct (name=name@entry=0x130e160 "", size=size@entry=4, foundPtr=foundPtr@entry=0x7ffcf94513bf) at shmem.c:412
#5  0x0000000000d6d18e in ic_proxy_server_main () at ic_proxy_main.c:545
#6  0x0000000000d6c219 in ICProxyMain (main_arg=<optimized out>) at ic_proxy_bgworker.c:36
#7  0x0000000000aa9caa in StartBackgroundWorker () at bgworker.c:955
#8  0x0000000000ab9407 in do_start_bgworker (rw=<optimized out>) at postmaster.c:6450
#9  maybe_start_bgworkers () at postmaster.c:6706
#10 0x0000000000abbc59 in ServerLoop () at postmaster.c:2095
#11 0x0000000000abd777 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x36e3650) at postmaster.c:1633
#12 0x00000000006e4764 in main (argc=5, argv=0x36e3650) at main.c:240
(gdb) p *ptr
Cannot access memory at address 0x7f05a8c51104
```

The root cause is I forgot to init SHM structure at CreateSharedMemoryAndSemaphores().
Fix it in this commit.
Stolb27 pushed a commit that referenced this pull request Feb 15, 2024
## Problem
An error occurs in python lib when a plpython function is executed.
After our analysis, in the user's cluster, a plpython UDF 
was running with the unstable network, and got a timeout error:
`failed to acquire resources on one or more segments`.
Then a plpython UDF was run in the same session, and the UDF
failed with GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce
We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit from the current psql session.
- stop the postmaster of segment: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get error
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb and make postmaster runnable.
- call  `SELECT test_func();` again and get panic
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 
```

## Analysis
- There is an SPI call in test_func(): `plpy.execute()`. 
- Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin();
- Meanwhile, if the segment cannot receive the instruction from the coordinator,
  the subtransaction beginning procedure return fails.
- BUT! The Python processor does not know whether an error happened and
  does not clean its environment.
- Then the next plpython UDF in the same session will fail due to the wrong
  Python environment.

## Solution
- Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin()
- set the python error indicator by PLy_spi_exception_set()


Co-authored-by: Chen Mulong <chenmulong@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.