
ADBDEV-56. Sync with upstream 5.10.0#15

Closed
leskin-in wants to merge 73 commits into adb-5.x from
ADBDEV-56

Conversation

@leskin-in

Merge changes from upstream 5.10.0 into adb-5.x.

The most important change in upstream (one that influences our build) is that the enable_filter_pushdown GUC parameter was renamed to gp_external_enable_filter_pushdown.

Warning: upstream also changes the build/test pipeline. See 7464c9f#diff-83df00db68da63c09b3530698dbcd090 for details.

Ashwin Agrawal and others added 30 commits June 15, 2018 14:14
For a CO table, storageAttributes.compress only conveys whether block compression should be applied. RLE is performed as stream compression within the block, so whether storageAttributes.compress is true or false does not relate to RLE at all. With rle_type compression, storageAttributes.compress is true for compression levels > 1, where block compression is performed along with stream compression. For compression level = 1, storageAttributes.compress is always false, as no block compression is applied. Since RLE does not relate to storageAttributes.compress, there is no reason to modify it based on rle_type compression.

The problem also manifests because the datumstream layer uses the AppendOnlyStorageAttributes in DatumStreamWrite (`acc->ao_attr.compress`) to decide the block type, whereas the cdb storage layer functions use the AppendOnlyStorageAttributes from AppendOnlyStorageWrite (`idesc->ds[i]->ao_write->storageAttributes.compress`). Given this difference, changing just one of them, and unnecessarily at that, is bound to cause issues during insert.

So, removing the unnecessary and incorrect update to
AppendOnlyStorageAttributes.

Test case showcases the failing scenario without the patch.
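The rule described above can be sketched as a small predicate. This is a hypothetical simplification for illustration only; the real decision lives in the AO storage layer in C, and the function name below is invented:

```python
def block_compression_enabled(compresstype, compresslevel):
    """Sketch: should storageAttributes.compress be set?

    For rle_type, level 1 means stream (RLE) compression only, so no
    block compression; levels > 1 add block compression on top of the
    stream compression. For other types, any compression type implies
    block compression.
    """
    if compresstype == "rle_type":
        return compresslevel > 1
    return compresstype is not None

print(block_compression_enabled("rle_type", 1))  # False: stream compression only
print(block_compression_enabled("rle_type", 3))  # True: block + stream compression
print(block_compression_enabled("zlib", 1))      # True: block compression
```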
* docs - gpbackup/gprestore new functionality.

--gpbackup new option --jobs to backup tables in parallel.
--gprestore --include-table* options support restoring views and sequences.

* docs - gpbackup/gprestore. fixed typos. Updated backup/restore of sequences and views

* docs - gpbackup/gprestore - clarified information on dependent objects.

* docs - gpbackup/gprestore - updated information on locking/quiescent state.

* docs - gpbackup/gprestore - clarify connection in --jobs option.
* docs - docs and updates for pgbouncer 1.8.1

* some edits requested by david

* add pgbouncer config page to see also, include directive

* add auth_hba_type config param

* ldap - add info to migrating section, remove ldap passwds

* remove ldap note
--change command that tests email notification to a psql command.
--remove old example that uses gmail public SMTP server
Add tests to ensure sane behavior when a subquery appears nested inside
a scalar expression. The intent is to check for correct results.

Bump ORCA version to 2.63.0

Signed-off-by: Shreedhar Hardikar <shardikar@pivotal.io>
(cherry picked from commit dd77c59)
* Edits to apply organizational improvements made in the HAWQ version, using consistent realm and domain names, and testing that procedures work.

* Convert tasks to topics to fix formatting. Clean up pg_ident.conf topic.

* Convert another task to topic

* Remove extraneous tag

* Formatting and minor edits

* - added $ or # prompts for all code blocks
- Reworked section "Mapping Kerberos Principals to Greenplum Database Roles" to describe, generally, a user's authentication process and to more clearly describe how a principal name is mapped to a Greenplum Database role name.

* - add krb_realm auth param

- add description of include_realm=1 for completeness
* docs - create ... external ... temp table

* update CREATE EXTERNAL TABLE sgml docs
* Change src/backend/access/external functions to extract and pass query constraints;
* Add a field with constraints to 'ExtProtocolData';
* Add 'pxffilters' to gpAux/extensions/pxf and modify the extension to use pushdown.

* Remove duplicate '=' check in PXF

Remove the check for a duplicate '=' in the parameters of an external table. Some databases (MS SQL, for example) may use '=' in a database name or other parameters. Now the PXF extension finds the first '=' in a parameter and treats the whole remaining string as the parameter value.

* disable pushdown by default
* Disallow passing of constraints of type boolean (the decoding fails on PXF side);

* Fix implicit AND expressions addition

Fix the implicit addition of an extra 'BoolExpr' to a list of expression items. Before, there was a check that the expression items list did not contain logical operators (and if it did, no extra implicit AND operators were added). This behaviour is incorrect. Consider the following query:

SELECT * FROM table_ex WHERE bool1=false AND id1=60003;

Such a query is translated into a list of three items: 'BoolExpr', 'Var' and 'OpExpr'.
Due to the presence of a 'BoolExpr', the extra implicit 'BoolExpr' was not added, and
we got an error "stack is not empty ...".

This commit changes the signatures of some internal pxffilters functions to fix this error.
We pass a number of required extra 'BoolExpr's to 'add_extra_and_expression_items'.

As 'BoolExpr's of different origin may be present in the list of expression items,
the mechanism of freeing the BoolExpr node changes.

The current mechanism of implicit AND expression addition is suitable only until
OR operators are introduced (we will then have to add those expressions to different
parts of the list, not just the end, as is done now).
Co-authored-by: Lav Jain <ljain@pivotal.io>
Co-authored-by: Ben Christel <bchristel@pivotal.io>
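The counting problem behind the fix can be modeled with a tiny stack simulation. This is a hypothetical model for illustration, not the pxffilters implementation: operands push one entry, NOT is unary (net zero), AND/OR are binary (net minus one), and the number of implicit ANDs still needed is whatever it takes to reduce the stack to a single node:

```python
def extra_ands_needed(items):
    """Sketch: given a postfix list of expression items, how many implicit
    ANDs must be appended so evaluation leaves exactly one node on the
    stack. 'var'/'op' push an operand; 'not' is unary; 'and'/'or' are
    binary. Hypothetical model of the pxffilters expression-item list."""
    depth = 0
    for item in items:
        if item in ("var", "op"):
            depth += 1            # operand pushed
        elif item in ("and", "or"):
            depth -= 1            # pops two operands, pushes one result
        elif item == "not":
            pass                  # pops one, pushes one
    return depth - 1

# WHERE bool1 = false AND id1 = 60003 arrives roughly as Var(bool1), NOT,
# OpExpr(id1 = 60003): one implicit AND is still required even though a
# BoolExpr (the NOT) is already present in the list.
print(extra_ands_needed(["var", "not", "op"]))  # 1
print(extra_ands_needed(["op", "op", "and"]))   # 0: fully parenthesized
```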
For long-running commands such as gpinitstandby with a large master data
directory, the server takes a long time to respond, so there is no activity
from the client to the server. If ClientAliveInterval is set, the server
reports a timeout after ClientAliveInterval seconds.

Setting a ServerAliveInterval value less than the ClientAliveInterval
forces the client to send a null message to the server, thereby avoiding
the timeout.

Co-authored-by: Jamie McAtamney <jmcatamney@pivotal.io>
Co-authored-by: Shoaib Lari <slari@pivotal.io>
(cherry picked from commit 1549359)
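The interval relationship above reduces to a one-line predicate. A minimal sketch (the function name is invented; 0 is taken to mean the keepalive is disabled, matching the ssh defaults):

```python
def timeout_avoided(server_alive_interval, client_alive_interval):
    """Sketch: the ssh client sends a null keepalive packet every
    ServerAliveInterval seconds, so the server's ClientAliveInterval
    timer never expires as long as the client interval is shorter and
    nonzero (0 disables the client keepalive)."""
    return 0 < server_alive_interval < client_alive_interval

print(timeout_avoided(30, 60))  # True: client pings before the server timer fires
print(timeout_avoided(0, 60))   # False: no client keepalives at all
print(timeout_avoided(90, 60))  # False: client pings too late
```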
Added a new test job to the pipeline to certify GPHDFS with the MAPR Hadoop distribution, and renamed the existing GPHDFS certification job to state that it tests with generic Hadoop. The MAPR cluster consists of one node deployed by CCP scripts into GCE. Backported from GPDB master.

- MAPR 5.2
- Parquet 1.8.1

Co-authored-by: Alexander Denissov <adenissov@pivotal.io>
Co-authored-by: Shivram Mani <smani@pivotal.io>
Co-authored-by: Francisco Guerrero <aguerrero@pivotal.io>
This is related to the work we have done to fix the sles11 and windows
compilation failures on master.

Co-authored-by: Jamie McAtamney <jmcatamney@pivotal.io>
Co-authored-by: Lisa Oakley <loakley@pivotal.io>
This is related to the work we have done to fix the sles11 and windows
compilation failures.

Co-authored-by: Lisa Oakley <loakley@pivotal.io>
Co-authored-by: Trevor Yacovone <tyacovone@pivotal.io>
The issue happens because of constant folding in the testexpr of the
SUBPLAN expression node. The testexpr may be reduced to a const, and any
PARAMs previously used in the testexpr disappear. However, the subplan
still remains.

This behavior is similar in upstream Postgres 10 and may have performance
implications. Leaving that aside for now, the constant folding produces
an elog(ERROR) when the plan has subplans and no PARAMs are used. This
check in `addRemoteExecParamsToParamList()` uses `context.params`, which
computes the PARAMs used in the plan, and `nIntPrm =
list_length(root->glob->paramlist)`, which is the number of PARAMs
declared/created.
Given the ERROR messages generated, the above check makes no sense.
Especially since it won't even trip for the InitPlan bug (mentioned in
the comments) as long as there is at least one PARAM in the query.

This commit removes this check since it doesn't correctly capture the
intent.

In theory, it could be replaced by one specifically aimed at
InitPlans, that is, find all the param ids used by InitPlans and then
make sure they are used in the plan. But we already do this and
remove any unused initplans in `remove_unused_initplans()`. So I don't
see the point of adding that.

Fixes #2839
* Extra docs for gp_external_enable_filter_pushdown

Add extra documentation for 'gp_external_enable_filter_pushdown' and the pushdown feature in PXF extension.

* Minor doc text fixes

Minor documentation text fixes, proposed by @dyozie.

* Clarify the pushdown support by PXF

Add the following information:
* List the PXF connectors that support pushdown;
* State that GPDB PXF extension supports pushdown;
* Add a list of conditions that need to be fulfilled for the pushdown feature to work when PXF protocol is used.

* Correct the list of PXF connectors with pushdown

* State that Hive and HBase PXF connectors support filter predicate pushdown;
* Remove references to JDBC and Apache Ignite PXF connectors, as proposed by @dyozie (these are not officially supported by Greenplum).
The gen pipeline script outputs a suggested command when setting up a dev
pipeline.  Currently the git remote and git branch have to be edited
before executing the command.  Since oftentimes the branch has already been
created and is tracking a remote, it is possible to guess those details.
The case statements attempt to prevent suggesting the production
branches, and fall back to the same string as before.

Authored-by: Jim Doty <jdoty@pivotal.io>
(cherry picked from commit ea12d3a)
- Introduce a new GUC gp_resource_group_bypass; when it is on,
  queries in this session will not be limited by resource groups
* docs - update system catalog maintenance information.

--Updated Admin. Guide and Best Practices for running REINDEX, VACUUM, and ANALYZE
--Added note to REINDEX reference about running ANALYZE after REINDEX.

* docs - edits for system catalog maintenance updates

* docs - update recommendation for running vacuum and analyze.

Update based on dev input.
If a segment exists in gp_segment_configuration but its ip address can
not be resolved we will run into a runtime error on gang creation:

    ERROR:  could not translate host name "segment-0a", port "40000" to
    address: Name or service not known (cdbutil.c:675)

This happens even if segment-0a is a mirror and is marked as down.  With
this error queries can not be executed, gpstart and gpstop will also
fail.

One way to trigger the issue:

- create a multiple segments cluster;
- remove sdw1's dns entry from /etc/hosts on mdw;
- kill postgres primary process on sdw1;

FTS can detect this error and automatically switch to mirror, but
queries can not be executed.

(cherry picked from commit dd861e7)
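The essence of the fix is to skip address resolution for segments that are marked down. A minimal sketch, with invented names and a tuple list standing in for gp_segment_configuration rows ('u' = up, 'd' = down):

```python
import socket

def resolve_segments(segments):
    """Sketch of the gang-creation fix described above: only resolve
    addresses for segments that are up; a down mirror with a stale or
    missing DNS entry must not abort query dispatch."""
    resolved = {}
    for host, status in segments:
        if status != 'u':
            continue  # down segment: never attempt DNS resolution
        try:
            resolved[host] = socket.gethostbyname(host)
        except socket.gaierror:
            raise RuntimeError(f"could not translate host name {host!r}")
    return resolved

# A down mirror with an unresolvable name no longer raises an error:
print(resolve_segments([("segment-0a", "d")]))  # {}
```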
Wang Hao and others added 18 commits July 18, 2018 10:34
gp_max_csv_line_length is a session-level GUC. When changed in a
session, it affects statements like SELECT * FROM <external_table>,
but it does not work for INSERT INTO table SELECT * FROM <external_table>.
For such a statement, the scan of the external table happens in a QE backend
process, not in the QD.
This fix adds GUC_GPDB_ADDOPT so that setting this GUC at the session level
affects both the QD and QE processes.
…… (#5285)

* docs - update gpbackup API - add segment instance and update backup directory information.

Also update API version to 0.3.0.

This will be ported to 5X_STABLE

* docs - gpbackup API - review updates and fixes for scope information

Also, cleanup edits.

* docs - gpbackup API - more review updates and fixes to scope information.
…ons. (#5050)

We have lately seen a lot of failures in test cases related to
partitioning, with errors like this:

select tablename, partitionlevel, partitiontablename, partitionname, partitionrank, partitionboundary from pg_partitions where tablename = 'mpp3079a';
ERROR:  cache lookup failed for relation 148532 (ruleutils.c:7172)

The culprit is that the view passes a relation OID to the
pg_get_partition_rule_def() function, and the function tries to perform a
syscache lookup on the relation (in flatten_reloptions()), but the lookup
fails because the relation was dropped concurrently by another transaction.
This race is possible, because the query runs with an MVCC snapshot, but
the syscache lookups use SnapshotNow.

This commit doesn't eliminate the race completely, but at least it makes it
narrower. A more reliable solution would've been to acquire a lock on the
table, but then that might block, which isn't nice either.

Another solution would've been to modify flatten_reloptions() to return
NULL instead of erroring out, if the lookup fails. That approach is taken
on the other lookups, but I'm reluctant to modify flatten_reloptions()
because it's inherited from upstream. Let's see how well this works in
practice first, before we do more drastic measures.
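The rejected alternative (returning NULL from the lookup instead of erroring out) can be sketched in a few lines. This is an illustration only, with a dict standing in for the SnapshotNow syscache and an invented function name:

```python
def flatten_reloptions_safe(relid, syscache):
    """Sketch of the alternative discussed above: return None instead of
    raising when a concurrently dropped relation is no longer found.
    'syscache' is a hypothetical dict standing in for syscache lookups."""
    rel = syscache.get(relid)
    if rel is None:
        return None  # relation dropped concurrently; don't raise
    return ", ".join(f"{k}={v}" for k, v in rel["reloptions"].items())

cache = {148532: {"reloptions": {"appendonly": "true"}}}
print(flatten_reloptions_safe(148532, cache))  # appendonly=true
print(flatten_reloptions_safe(999999, cache))  # None
```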
We've seen a lot of failures in the 'sreh' test in the pipeline, like this:

--- 263,269 ----
  FORMAT 'text' (delimiter '|')
  SEGMENT REJECT LIMIT 10000;
  SELECT * FROM sreh_ext;
! ERROR:  connection failed dummy_protocol://DUMMY_LOCATION
  INSERT INTO sreh_target SELECT * FROM sreh_ext;
  NOTICE:  Found 10 data formatting errors (10 or more input rows). Rejected related input data.
  SELECT count(*) FROM sreh_target;

I don't really know, but I'm guessing it could be because it sometimes
takes more than one second for gpfdist to fully start up, if there's a lot
of disk or other activity. Increase the sleep time from 1 to 3 seconds;
we'll see if that helps.

(cherry picked from commit bb8575a)
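A more robust alternative to bumping a fixed sleep is to poll a readiness check until it succeeds or a deadline passes. A sketch under the assumption that some callable (e.g. "can we connect to gpfdist's port?") can report readiness; the function name is invented:

```python
import time

def wait_until_ready(is_ready, timeout=3.0, interval=0.1):
    """Sketch: poll a readiness check instead of sleeping a fixed time.
    Returns True as soon as 'is_ready' succeeds, False if the timeout
    expires first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return is_ready()  # one last attempt at the deadline

# Simulate a server that becomes ready on the third poll:
polls = iter([False, False, True])
print(wait_until_ready(lambda: next(polls)))  # True
```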
- Previously, DNS was queried within the `Ping` utility constructor, so a DNS failure would always raise an exception.
- Now the DNS query is in the standard `run()` method, so a DNS failure raises an exception only optionally, depending on the `validateAfter` parameter.
- `Command` is declared as a new-style class so that `super(Ping, self).run()` can be called.

Co-authored-by: Larry Hamel <lhamel@pivotal.io>
Co-authored-by: Jemish Patel <jpatel@pivotal.io>
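The shape of the change can be sketched as follows. This is a hypothetical simplification of the gpMgmt `Ping`/`Command` utilities, not their actual code:

```python
import socket

class Ping:
    """Sketch: DNS resolution happens in run(), not in __init__, so
    constructing a Ping for an unresolvable host never raises; whether a
    resolution failure raises depends on the validateAfter parameter."""

    def __init__(self, host):
        self.host = host   # no DNS query here any more
        self.error = None

    def run(self, validateAfter=False):
        try:
            socket.gethostbyname(self.host)
        except socket.gaierror as e:
            self.error = e
            if validateAfter:
                raise
        return self

p = Ping("no-such-host.invalid")  # constructing never raises now
p.run()                           # validateAfter=False: error recorded, not raised
print(p.error is not None)
```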
The MyProc inDropTransaction flag was used to make sure concurrent AO vacuums
would not conflict with each other during the drop phase. Two concurrent AO
vacuums on the same relation were possible back in 4.3, where the different AO
vacuum phases (prepare, compaction, drop, cleanup) would interleave with each
other, and having two AO vacuum drop phases concurrently on the same AO relation
was dangerous. We now hold the ShareUpdateExclusiveLock through the entire AO
vacuum, which renders the inDropTransaction flag useless and disallows the
interleaving mechanism.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
* docs - correct log file locations in best practices

* edit requested by david
1. When doing harvesting, raise gp_max_csv_line_length to the
maximum legal value at the session level.
2. For queries longer than gp_max_csv_line_length, this workaround
replaces line breaks in the query text with spaces to prevent load
failure. It may change long query statements when they are loaded
into the history table, but that is still better than failing to
load or truncating the query text.

Co-authored-by: Teng Zhang tezhang@pivotal.io
Co-authored-by: Hao Wang haowang@pivotal.io
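The line-break replacement in step 2 can be sketched directly. The function name and the "does it fit" flag are illustrative additions, not the harvester's actual code:

```python
def sanitize_query_text(query, max_len):
    """Sketch of the harvesting workaround above: replace line breaks
    with spaces so a long multi-line statement loads as a single CSV
    field instead of failing. Also report whether the flattened text
    fits within the (hypothetical) max_len limit."""
    flattened = query.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    return flattened, len(flattened) <= max_len

text, fits = sanitize_query_text("SELECT *\nFROM t\nWHERE a = 1;", 1048576)
print(text)  # SELECT * FROM t WHERE a = 1;
print(fits)  # True
```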
- add a fast_match option in the gpload config file. If both reuse_tables
and fast_match are true, gpload will try to fast-match the external
table (without checking columns). If reuse_tables is false and
fast_match is true, it will print a warning message.
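The option interaction described above can be sketched as a small decision function. The function name and return strings are invented for illustration; the real logic lives in gpload:

```python
import warnings

def plan_reuse(reuse_tables, fast_match):
    """Sketch of the gpload reuse_tables/fast_match interaction."""
    if reuse_tables and fast_match:
        return "fast-match external table (skip column check)"
    if fast_match and not reuse_tables:
        warnings.warn("fast_match=true has no effect when reuse_tables=false")
    if reuse_tables:
        return "full match (compare columns)"
    return "create new external table"

print(plan_reuse(True, True))   # fast-match external table (skip column check)
print(plan_reuse(True, False))  # full match (compare columns)
```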
* docs - remove duplicate gphdfs/kerberos topic in best practices

* remove unused file
If autovacuum was triggered before ShmemVariableCache->latestCompletedXid was
updated by manually consuming xids, then autovacuum might not vacuum template0
with a proper transaction id to compare against. We made the test more reliable
by suspending a new fault injector (auto_vac_worker_before_do_autovacuum) right
before the autovacuum worker sets recentXid and starts doing the autovacuum. This
allows us to guarantee that autovacuum is comparing against a proper xid.

We also removed the loop in the test because vacuum_update_dat_frozen_xid fault
injector ensures the pg_database table has been updated.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
The following information will be saved in
${GREENPLUM_INSTALL_DIR}/etc/git-info.json:

* Root repo (uri, sha1)
* Submodules (submodule source path, sha1, tag)

Save git commits since last release tag into
${GREENPLUM_INSTALL_DIR}/etc/git-current-changelog.txt
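A plausible shape for the git-info.json payload, assembled from the fields listed above. The exact schema and field names are assumptions for illustration, based only on the items named in the commit message (uri/sha1 for the root repo; path/sha1/tag per submodule):

```python
import json

def build_git_info(root_uri, root_sha1, submodules):
    """Sketch: assemble a git-info.json document from the fields
    described above. 'submodules' is a list of (path, sha1, tag)
    tuples. The schema here is hypothetical."""
    return json.dumps(
        {
            "root": {"uri": root_uri, "sha1": root_sha1},
            "submodules": [
                {"path": p, "sha1": s, "tag": t} for p, s, t in submodules
            ],
        },
        indent=2,
    )

print(build_git_info("https://github.com/greenplum-db/gpdb", "abc123",
                     [("gpAux/extensions/pxf", "def456", "v1.0")]))
```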
@leskin-in leskin-in requested a review from acmnu August 6, 2018 08:17
@acmnu

acmnu commented Aug 9, 2018

Useless pull request. We did it another way.

@acmnu acmnu closed this Aug 9, 2018
@leskin-in leskin-in deleted the ADBDEV-56 branch September 26, 2018 09:56
leskin-in pushed a commit that referenced this pull request Dec 14, 2018
pg_stat_last_operation had two indexes:

* pg_statlastop_classid_objid_index on (classid oid_ops, objid oid_ops), and
* pg_statlastop_classid_objid_staactionname_index on (classid oid_ops, objid
  oid_ops, staactionname name_ops)

The first one is completely redundant with the second one. Remove it.
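The redundancy rule applied here (an index whose key columns are a strict prefix of another index's keys adds nothing for lookups) can be checked mechanically. A sketch with invented names; note it deliberately ignores uniqueness constraints, which the real decision would also have to consider:

```python
def redundant_indexes(indexes):
    """Sketch: find indexes whose key column list is a strict prefix of
    another index's key list. 'indexes' maps index name -> tuple of key
    columns. Uniqueness and operator classes are ignored in this model."""
    redundant = set()
    for a, a_cols in indexes.items():
        for b, b_cols in indexes.items():
            if a != b and len(a_cols) < len(b_cols) \
                    and b_cols[: len(a_cols)] == a_cols:
                redundant.add(a)
    return redundant

print(redundant_indexes({
    "pg_statlastop_classid_objid_index": ("classid", "objid"),
    "pg_statlastop_classid_objid_staactionname_index":
        ("classid", "objid", "staactionname"),
}))  # {'pg_statlastop_classid_objid_index'}
```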

Fixes assertion failure https://github.com/greenplum-db/gpdb/issues/6362.
The assertion was added in PostgreSQL 9.1, commit d2f60a3. The failure
happened on "VACUUM FULL pg_stat_last_operation", if the VACUUM FULL
itself added a new row to the table. The insertion also inserted entries
in the indexes, which tripped the assertion that checks that you don't try
to insert entries into an index that's currently being reindexed, or
pending reindexing:

> (gdb) bt
> #0  0x00007f02f5189783 in __select_nocancel () from /lib64/libc.so.6
> #1  0x0000000000be76ef in pg_usleep (microsec=30000000) at pgsleep.c:53
> #2  0x0000000000ad75aa in elog_debug_linger (edata=0x11bf760 <errordata>) at elog.c:5293
> #3  0x0000000000acdba4 in errfinish (dummy=0) at elog.c:675
> #4  0x0000000000acc3bf in ExceptionalCondition (conditionName=0xc15798 "!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", errorType=0xc156ef "FailedAssertion",
>     fileName=0xc156d0 "indexam.c", lineNumber=215) at assert.c:46
> #5  0x00000000004fded5 in index_insert (indexRelation=0x7f02f6b6daa0, values=0x7ffdb43915e0, isnull=0x7ffdb43915c0 "", heap_t_ctid=0x240bd64, heapRelation=0x24efa78,
>     checkUnique=UNIQUE_CHECK_YES) at indexam.c:215
> #6  0x00000000005bda59 in CatalogIndexInsert (indstate=0x240e5d0, heapTuple=0x240bd60) at indexing.c:136
> #7  0x00000000005bdaaa in CatalogUpdateIndexes (heapRel=0x24efa78, heapTuple=0x240bd60) at indexing.c:162
> #8  0x00000000005b2203 in MetaTrackAddUpdInternal (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX", rel=0x24efa78,
>     old_tuple=0x0) at heap.c:744
> #9  0x00000000005b229d in MetaTrackAddObject (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX") at heap.c:773
> #10 0x00000000005b2553 in MetaTrackUpdObject (classid=1259, objoid=6053, relowner=10, actionname=0xc51543 "VACUUM", subtype=0xc5153b "REINDEX") at heap.c:856
> #11 0x00000000005bd271 in reindex_index (indexId=6053, skip_constraint_checks=1 '\001') at index.c:3741
> #12 0x00000000005bd418 in reindex_relation (relid=6052, flags=2) at index.c:3870
> #13 0x000000000067ba71 in finish_heap_swap (OIDOldHeap=6052, OIDNewHeap=16687, is_system_catalog=1 '\001', swap_toast_by_content=0 '\000', swap_stats=1 '\001',
>     check_constraints=0 '\000', is_internal=1 '\001', frozenXid=821, cutoffMulti=1) at cluster.c:1667
> #14 0x0000000000679ed5 in rebuild_relation (OldHeap=0x7f02f6b7a6f0, indexOid=0, verbose=0 '\000') at cluster.c:648
> #15 0x0000000000679913 in cluster_rel (tableOid=6052, indexOid=0, recheck=0 '\000', verbose=0 '\000', printError=1 '\001') at cluster.c:461
> #16 0x0000000000717580 in vacuum_rel (onerel=0x0, relid=6052, vacstmt=0x2533c38, lmode=8, for_wraparound=0 '\000') at vacuum.c:2315
> #17 0x0000000000714ce7 in vacuumStatement_Relation (vacstmt=0x2533c38, relid=6052, relations=0x24c12f8, bstrategy=0x24c1220, do_toast=1 '\001', for_wraparound=0 '\000',
>     isTopLevel=1 '\001') at vacuum.c:787
> #18 0x0000000000714303 in vacuum (vacstmt=0x2403260, relid=0, do_toast=1 '\001', bstrategy=0x24c1220, for_wraparound=0 '\000', isTopLevel=1 '\001') at vacuum.c:337
> #19 0x0000000000969cd2 in standard_ProcessUtility (parsetree=0x2403260, queryString=0x24027e0 "vacuum full;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2403648,
>     completionTag=0x7ffdb4392550 "") at utility.c:804
> #20 0x00000000009691be in ProcessUtility (parsetree=0x2403260, queryString=0x24027e0 "vacuum full;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2403648,
>     completionTag=0x7ffdb4392550 "") at utility.c:373

In this scenario, we had just reindexed one of the indexes of
pg_stat_last_operation, and the metatrack update of that tried to insert a
row into the same table. But the second index in the table was pending
reindexing, which triggered the assertion.

After removing the redundant index, pg_stat_last_operation has only one
index, and that scenario no longer happens. This is a bit of a fragile fix,
because the problem will reappear as soon as you add a second index on the
table. But we have no plans of doing that, and I believe no harm would be
done in production builds with assertions disabled anyway. So this will
do for now.

Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
Reviewed-by: Shaoqi Bai <sbai@pivotal.io>
Reviewed-by: Jamie McAtamney <jmcatamney@pivotal.io>
leskin-in pushed a commit that referenced this pull request May 7, 2019
We found that if gp_segment_configuration is locked, then triggering FTS
will fail. We got the stack below:
	#2  0x0000000000a6bb29 in ExceptionalCondition at assert.c:66
	#3  0x0000000000aac19a in enable_timeout timeout.c:143
	#4  0x0000000000aacb6c in enable_timeout_after timeout.c:473
	#5  0x00000000008e86ef in ProcSleep at proc.c:1300
	#6  0x00000000008deb70 in WaitOnLock at lock.c:1894
	#7  0x00000000008e019e in LockAcquireExtended at lock.c:1205
	#8  0x00000000008dd2d8 in LockRelationOid at lmgr.c:102
	#9  0x000000000051c928 in heap_open at heapam.c:1083
	#10 0x0000000000b7feaf in getCdbComponentInfo at cdbutil.c:173
	#11 0x0000000000b81365 in cdbcomponent_getCdbComponents at cdbutil.c:606
	#12 0x00000000007603e1 in ftsMain at fts.c:351
	#13 0x0000000000760715 in ftsprobe_start at fts.c:121
	#14 0x00000000004cc7b0 in ServerLoop ()
	#15 0x00000000008769bf in PostmasterMain at postmaster.c:1531
	#16 0x000000000079098b in main ()
So the cause is that FTS hasn't initialized the timeout subsystem. Any
process that wants to use timeouts must call the initialization first.
This is the root cause of the gpexpand job failures on the master
pipeline in builds 71 and 79. We added this initialization in FTS and GDD.
deart2k pushed a commit that referenced this pull request May 17, 2019
darthunix pushed a commit that referenced this pull request Aug 24, 2021
…CREATE/ALTER resource group.

In some scenarios, the AccessExclusiveLock on table pg_resgroupcapability may leave database setup/recovery pending. Below is why we need to change the AccessExclusiveLock to an ExclusiveLock.

This lock on table pg_resgroupcapability is used to serialize concurrent updates of this table when running a "Create/Alter resource group" statement. There is a CPU limit: after modifying one resource group, we have to check that the total CPU usage of all resource groups doesn't exceed 100%.

Before this fix, an AccessExclusiveLock was used. Suppose a user is running an "Alter resource group" statement. The QD dispatches this statement to all QEs, so it is a two-phase commit (2PC) transaction. When the QD dispatches the "Alter resource group" statement, each QE acquires the AccessExclusiveLock on table pg_resgroupcapability, and it cannot release that lock until the 2PC distributed transaction commits.

In the second phase, the QD calls the function doNotifyingCommitPrepared to broadcast the "commit prepared" command to all QEs. Each QE has already finished the prepare phase, so this transaction is a prepared transaction. Suppose at this point a primary segment goes down and a mirror is promoted to primary.

The mirror gets the "promoted" message from the coordinator and recovers based on the xlog from the primary. In order to recover the prepared transaction, it reads the prepared-transaction log entry and acquires the AccessExclusiveLock on table pg_resgroupcapability. The callstack is:
#0 lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697
#1 ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757
#2 RecoverPreparedTransactions () at twophase.c:2214
#3 StartupXLOG () at xlog.c:8013
#4 StartupProcessMain () at startup.c:231
#5 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459
#6 StartChildProcess (type=StartupProcess) at postmaster.c:5917
#7 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581
#8 main (argc=7, argv=0x1d555b0) at main.c:240

After that, the database instance starts up and all related initialization functions are called. However, one of them, InitResGroups, acquires an AccessShareLock on table pg_resgroupcapability and does some initialization work. The callstack is:
#6 WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999
#7 LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192
#8 LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126
#9 relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56
#10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47
#11 InitResGroups () at resgroup.c:581
#12 InitResManager () at resource_manager.c:83
#13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284
#14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812
#15 BackendRun (port=, port=) at postmaster.c:4922
#16 BackendStartup (port=0x1c835d0) at postmaster.c:4607
#17 ServerLoop () at postmaster.c:1963
#18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589
#19 in main (argc=7, argv=0x1c595b0) at main.c:240

The AccessExclusiveLock is not released, and it is not compatible with any other lock, so the startup process will be pending on this lock and the mirror can't become primary successfully.

Even if users run "gprecoverseg" to recover the primary segment, the result is similar. The primary segment will recover from the xlog: it will recover the prepared transactions and acquire the AccessExclusiveLock on table pg_resgroupcapability, and then the startup process is pending on this lock. Only if users change the resource management type to "queue" will the function InitResGroups not be called (and thus not block), allowing the primary segment to start up normally.

After this fix, an ExclusiveLock is acquired when altering a resource group. In the above case, the startup process acquires an AccessShareLock; ExclusiveLock and AccessShareLock are compatible, so the startup process can run successfully. After startup, the QE gets the RECOVERY_COMMIT_PREPARED command from the QD, finishes the second phase of the distributed transaction, and releases the ExclusiveLock on table pg_resgroupcapability. The callstack is:
#0 lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758
#1 ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757
#2 FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704
#3 in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107
#4 performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279
#5 exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570
#6 PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482

The test case of this commit simulates a repro of this bug.
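The reasoning above hinges on PostgreSQL's lock-conflict matrix. A simplified sketch covering just the three modes discussed (the full matrix has eight modes; this subset follows the standard PostgreSQL conflict table):

```python
# Which lock modes conflict with which, restricted to the three modes
# discussed above. Per the PostgreSQL lock-conflict table: AccessShareLock
# conflicts only with AccessExclusiveLock; ExclusiveLock conflicts with
# everything except AccessShareLock; AccessExclusiveLock conflicts with all.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "ExclusiveLock": {
        "RowShareLock", "RowExclusiveLock", "ShareUpdateExclusiveLock",
        "ShareLock", "ShareRowExclusiveLock", "ExclusiveLock",
        "AccessExclusiveLock",
    },
    "AccessExclusiveLock": {
        "AccessShareLock", "RowShareLock", "RowExclusiveLock",
        "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
        "ExclusiveLock", "AccessExclusiveLock",
    },
}

def compatible(held, requested):
    return requested not in CONFLICTS[held]

# Why the fix works: the recovered prepared transaction now holds
# ExclusiveLock, and InitResGroups only needs AccessShareLock.
print(compatible("ExclusiveLock", "AccessShareLock"))        # True
print(compatible("AccessExclusiveLock", "AccessShareLock"))  # False
```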
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
## Problem
An error occurs in the Python library when a plpython function is executed.
After our analysis: in the user's cluster, a plpython UDF
was running over an unstable network and got a timeout error:
`failed to acquire resources on one or more segments`.
Then another plpython UDF was run in the same session, and that UDF
failed with a GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce
We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit from the current psql session.
- suspend the segment's postmaster by attaching a debugger: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get an error
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb so the postmaster can run again.
- call `SELECT test_func();` again and get a panic
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 
```

## Analysis
- There is an SPI call in test_func(): `plpy.execute()`.
- The coordinator then starts a subtransaction via PLy_spi_subtransaction_begin().
- If a segment cannot receive the instruction from the coordinator, beginning
  the subtransaction fails and raises an error.
- However, the Python interpreter is not informed that an error happened, so it
  does not clean up its state.
- The next plpython UDF in the same session then fails because of the corrupted
  Python environment.

## Solution
- Use a try-catch block to catch the exception raised by PLy_spi_subtransaction_begin().
- Set the Python error indicator via PLy_spi_exception_set().
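Conceptually, the fix makes the internal subtransaction failure visible to the Python interpreter as an ordinary exception, so a failed call leaves the session usable. Below is a minimal, self-contained Python analogue of that behavior; `SubxactError`, `begin_subtransaction`, and `spi_execute` are invented names for illustration, not GPDB code:

```python
# Minimal analogue of the fix (illustrative names, not GPDB code):
# an internal "begin subtransaction" failure must surface as a proper
# Python exception; otherwise the interpreter's error state is left
# inconsistent and the next UDF call in the session misbehaves.

class SubxactError(Exception):
    """Stands in for the error raised when segments are unreachable."""

def begin_subtransaction(segments_up):
    # Analogue of PLy_spi_subtransaction_begin() failing when a segment
    # cannot acquire resources.
    if not segments_up:
        raise SubxactError("failed to acquire resources on one or more segments")

def spi_execute(segments_up):
    # Analogue of plpy.execute() with the fix applied: the internal error
    # is caught (the try-catch in the C code) and re-raised as a
    # well-formed Python exception (the PLy_spi_exception_set() step),
    # so the interpreter's state stays clean.
    try:
        begin_subtransaction(segments_up)
    except SubxactError as e:
        raise RuntimeError(str(e)) from e
    return [{"pg_backend_pid": 12345}]

def test_func(segments_up):
    # Analogue of the reproducing UDF: an SPI call followed by yields.
    spi_execute(segments_up)
    for i in range(5):
        yield i

# First call fails cleanly; a retry in the same session still works.
try:
    list(test_func(segments_up=False))
except RuntimeError as e:
    print(e)                                # failed to acquire resources on one or more segments
print(list(test_func(segments_up=True)))    # [0, 1, 2, 3, 4]
```

Without the `try`/`except` in `spi_execute`, the failure would escape as the internal error type, which is the analogue of the interpreter never learning about the error at all.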

backport from #16856


Co-authored-by: Chen Mulong <chenmulong@gmail.com>
BenderArenadata pushed a commit that referenced this pull request Jan 24, 2024
(cherry picked from commit 45d6ba8)

Co-authored-by: Zhang Hao <hzhang2@vmware.com>
Stolb27 pushed a commit that referenced this pull request Feb 15, 2024