
[WIP] Cherry-pick ResourceGroup update from GreenPlum #440

Closed
wants to merge 45 commits

Conversation

foreyes
Collaborator

@foreyes foreyes commented May 17, 2024

fix #ISSUE_Number


Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists for submitting your pull request; please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • Link the related GitHub Issues or Discussions (if any).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

@CLAassistant

CLAassistant commented May 17, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 11 committers have signed the CLA.

✅ foreyes
❌ kainwen
❌ SmartKeyerror
❌ fairyfar
❌ RMTT
❌ andr-sokolov
❌ AJR-VMware
❌ beeender
❌ yanwr1
❌ Tao-T
❌ zxuejing
You have signed the CLA already but the status is still pending? Let us recheck it.

@avamingli avamingli added the cherry-pick (cherry-pick upstream commits) label May 17, 2024
@foreyes foreyes force-pushed the cherry-pick-rg branch 4 times, most recently from 4d25652 to 1d6398a on May 21, 2024 02:29
@foreyes foreyes force-pushed the cherry-pick-rg branch 5 times, most recently from 8dfaed6 to 4ecc6d9 on June 7, 2024 11:48
@foreyes foreyes force-pushed the cherry-pick-rg branch 5 times, most recently from b485135 to 4ac3137 on June 17, 2024 01:24
kainwen and others added 14 commits July 5, 2024 01:12
Greenplum introduced auto_stats a long time ago; please refer to
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/bAyw2KBP6yE/m/hmoWikrPAgAJ
for the details and the decision behind auto_stats.

auto_stats() is now invoked at the following 7 places:
  1. ProcessQuery()
  2. _SPI_pquery()
  3. postquel_end()
  4. ATExecExpandTableCTAS()
  5. ATExecSetDistributedBy()
  6. DoCopy()
  7. ExecCreateTableAs()

Previously, places 2 and 3 were hard-coded as inside-function,
and places 1 and 4~7 were hard-coded as not-inside-function.
Places 4~7 do not cover the case where COPY or CTAS
is called inside a procedural language.

Since auto_stats will be removed in the future, for now let's
just do a simple fix instead of a big refactor.

To correctly pass the inFunction parameter to auto_stats()
at places 4~7, we introduce executor_run_nesting_level to mark
whether the program is already under ExecutorRun(). Places 4~7
are taken directly as utility statements and will not call
ExecutorRun() unless they are inside a procedural language. This
technique is the same one used by the auto_explain extension.

For places 1~3 the context is clear, so we do not need to check
executor_run_nesting_level.

NOTE: Greenplum now supports CREATE PROCEDURE, invoked as
CALL proc() instead of SELECT proc(). To handle this case, we
also keep a static module variable that tracks the nesting level
of standard_ProcessUtility; if the nesting level is >= 2, we are
inside a procedure call.
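A minimal sketch of the nesting-level technique, in the style of auto_explain's hook; the function names and hook wiring here are illustrative assumptions, not the exact patch:

```c
#include "postgres.h"
#include "executor/executor.h"

static int executor_run_nesting_level = 0;

/* Wrap ExecutorRun so the counter is non-zero whenever we are inside it. */
static void
track_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
                  uint64 count, bool execute_once)
{
    executor_run_nesting_level++;
    PG_TRY();
    {
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
    }
    PG_FINALLY();
    {
        executor_run_nesting_level--;
    }
    PG_END_TRY();
}

/* At places 4~7: being under ExecutorRun() implies a procedural-language caller. */
static bool
called_inside_function(void)
{
    return executor_run_nesting_level > 0;
}
```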
After we changed the semantics of the resource group in #14562, it's time to add the implementation
of Linux Cgroup v2.

In the Linux kernel, there is no difference between Cgroup v1 and Cgroup v2 in the internal implementation;
the only change is the pseudo-file interface. Thanks to #14343, we just need to add a new file
and implement a new `cgroupOpsRoutine`, which can be found in `get_group_routine_v2()`.

Originally, we assumed that developers would have different Cgroup mount directories, so we
generated the test files dynamically. But we found this was over-engineering, so we moved some
test files from `input/` to `sql/`, which is more convenient for developers.

The implementation lives in `src/backend/utils/resgroup/cgroup-ops-linux-v2.c`. Cgroup version 2 uses
the "unified hierarchy": the pseudo-files of all controllers live in a single directory, instead of
per-controller directories such as "cpu,cpuacct", "cpuset", and "memory".

This is the biggest difference between the two versions, and it directly decides how we configure
them. Here is an example:

```
# Cgroup v1: one directory per controller
smart@stable:/sys/fs/cgroup$ ls
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

# Cgroup v2: unified hierarchy, every controller's pseudo-files in one directory
[root@localhost gpdb]# ls
cgroup.controllers  cgroup.max.descendants  cgroup.type    cpu.stat         cpuset.cpus            io.max         pids.current
cgroup.events       cgroup.procs            cpu.idle       cpu.uclamp.max   cpuset.cpus.effective  io.pressure    pids.events
cgroup.freeze       cgroup.stat             cpu.max        cpu.uclamp.min   cpuset.cpus.partition  io.prio.class  pids.max
cgroup.kill         cgroup.subtree_control  cpu.max.burst  cpu.weight       cpuset.mems            io.stat
cgroup.max.depth    cgroup.threads          cpu.pressure   cpu.weight.nice  cpuset.mems.effective  io.weight
```

The documentation for Cgroup v2 can be found at:

- [Control Group v2](https://docs.kernel.org/admin-guide/cgroup-v2.html)
- [Using cgroups-v2 to control distribution of CPU time for applications](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel)

For now, to enable Cgroup v2, configure the Greenplum database with:

```
gpconfig -c gp_resource_manager -v group-v2
```

and to use Cgroup v1, just use:

```
gpconfig -c gp_resource_manager -v group
```
This was an oversight in PR #14733.

As the title says, this PR just fixes the failing unit test test_unit_gpcheckresgroupimpl.py:
1. `pg_resgroup.dat` already defines the reserved resource group OIDs, so there is no need to redefine them.
2. `DEFAULTAUXILIARY_OID` was renamed to `SYSTEMRESGROUP_OID`.
…tatus() function (#14959)

In the case of insert-select, the cpu_usage field contained information about
the master only. This led to an error when analyzing the table, as the
contents of the field did not conform to JSON format.
The error was fixed by setting the proexeclocation property of the function
to 'i' (PROEXECLOCATION_INITPLAN).

https://github.com/greenplum-db/gpdb/issues/14154
We need to bypass queries that touch only `catalog` tables in resource
groups in some situations, e.g. database GUI clients (DBeaver) that
run catalog queries to fetch metadata.

The code just walks the parse tree, finds each `RangeVar`, and compares its
schemaname with `pg_catalog`; only a few cases are considered for now.
Because every query is walked once resource groups are enabled,
the overhead should be as small as possible.

The following catalog query will be bypassed:
```sql
SELECT n.oid,n.*,d.description FROM pg_catalog.pg_namespace n
LEFT OUTER JOIN pg_catalog.pg_description d ON d.objoid=n.oid AND
d.objsubid=0 AND d.classoid='pg_namespace'::regclass
WHERE nspname='public' ORDER BY nspname;
```
But queries without any `RangeVar`s will be ignored, such as:
```sql
SELECT 1;
SELECT some_udf();
```
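A minimal sketch of such a walk, under assumed names (not the exact patch code):

```c
#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/parsenodes.h"

/*
 * Return true as soon as we find a RangeVar that is not schema-qualified
 * with pg_catalog; a query qualifies for the bypass only when the walk
 * finds no such RangeVar. Names here are illustrative.
 */
static bool
non_catalog_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;

    if (IsA(node, RangeVar))
    {
        RangeVar *rv = (RangeVar *) node;

        return rv->schemaname == NULL ||
               strcmp(rv->schemaname, "pg_catalog") != 0;
    }
    return raw_expression_tree_walker(node, non_catalog_walker, context);
}
```

A statement would then be bypassed only when it contains at least one `RangeVar` and the walk returns false, which is why `RangeVar`-less queries like `SELECT 1;` are ignored.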
Since commit c25db0c supports resgroup with memory_limit, add it to the
gp_resgroup_config view.
In the previous version, we disabled the red-zone check for resource groups;
the resgroup used cgroup to kill processes whose memory consumption
exceeded the group's configuration.

In the new version of resgroup, we use redZoneChunks to determine
whether the current process is in the red zone or not, so remove the
resGroupId check.
Add a permission check for the file /sys/fs/cgroup/cgroup.procs.

Make sure gpstart -rai will definitely succeed if the customer uses gpconfig to set
gp_resource_manager to group-v2.
In the case of insert-select, the value column contained information about
the coordinator only. The error was fixed by setting the proexeclocation property of
the function to 'i' (PROEXECLOCATION_INITPLAN).
https://github.com/greenplum-db/gpdb/issues/14154

Executing the query "SELECT * FROM pg_resgroup_get_status_kv(NULL);" resulted
in a segmentation fault. A NULL check has been added.

The dumpResGroupInfo() function was called twice per query. Now it is called
only once.

The function code was also simplified.
…290)

In resource group mode, if the user requests more than statement_mem,
grant that; otherwise set statement_mem to the query's memory limit.
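Either way the sentence is read, the net effect is taking the larger of the two values; a hypothetical sketch (names are illustrative, not the exact code):

```c
#include "postgres.h"   /* for the Max() macro */

/*
 * Hypothetical sketch: pick the per-query memory in resource group mode.
 * statement_mem is the user-set GUC; group_query_mem is the limit derived
 * from the resource group's configuration.
 */
static uint64
resgroup_choose_query_mem(uint64 statement_mem, uint64 group_query_mem)
{
    return Max(statement_mem, group_query_mem);
}
```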
Add memory_usage into the function pg_resgroup_get_status() result.

postgres=# select * from gp_toolkit.gp_resgroup_status_per_host;
    rsgname    | groupid | hostname | cpu_usage | memory_usage
---------------+---------+----------+-----------+--------------
 admin_group   |    6438 | zero     |      0.09 |         5.55
 default_group |    6437 | zero     |      0.00 |         0.00
 system_group  |    6441 | zero     |      0.01 |        40.55
(3 rows)

It'll return the memory_usage in megabytes, to help the customer address memory problems.

For Linux Cgroup v1, the memory usage is read from `memory.usage_in_bytes`. For Linux Cgroup
v2, it's read from `memory.current`.
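A hypothetical sketch of that read path; the paths and the helper name are illustrative assumptions:

```c
#include <stdio.h>

/* Read the cgroup memory usage pseudo-file and convert bytes to megabytes. */
static double
read_memory_usage_mb(int cgroup_version)
{
    /* Hypothetical mount points; real paths depend on the configured hierarchy. */
    const char *path = (cgroup_version == 2)
        ? "/sys/fs/cgroup/gpdb/memory.current"
        : "/sys/fs/cgroup/memory/gpdb/memory.usage_in_bytes";
    FILE *f = fopen(path, "r");
    unsigned long long bytes = 0;

    if (f == NULL)
        return 0.0;
    if (fscanf(f, "%llu", &bytes) != 1)
        bytes = 0;
    fclose(f);
    return (double) bytes / (1024.0 * 1024.0);
}
```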
fix the broken pipeline
RMTT and others added 27 commits July 5, 2024 01:12
Fix the bypass-catalog unit test.

The GUC gp_resource_group_bypass_catalog_query doesn't
get its default value in the unit test environment, so I set the value in the unit test.

BTW, the unit test should return a non-zero value when cases fail.
This commit introduces a new GUC gp_resource_group_bypass_direct_dispatch,
which is true by default. After getting the plan, we can unassign (bypass)
from the resgroup (a sketch of the check follows this message):
  - for SelectStmt, direct dispatch means there are two slices, where the 1st
    slice is the QD and the 2nd is direct dispatch
  - for (Update|Delete|Insert)Stmt, direct dispatch means there is only one
    slice and that slice is direct dispatch

We also move some tests into the bypass suite.

Co-authored-by: Wenru Yan <ywenru@vmware.com>
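A hedged sketch of that slice check; the PlannedStmt/PlanSlice field names below are assumptions based on GPDB's planner structures, not verified against this patch:

```c
#include "postgres.h"
#include "nodes/plannodes.h"

/* Can this plan be unassigned (bypassed) from its resource group? */
static bool
can_bypass_direct_dispatch(PlannedStmt *stmt)
{
    switch (stmt->commandType)
    {
        case CMD_SELECT:
            /* two slices: slice 0 runs on the QD, slice 1 is direct dispatched */
            return stmt->numSlices == 2 &&
                   stmt->slices[1].directDispatch.isDirectDispatch;
        case CMD_INSERT:
        case CMD_UPDATE:
        case CMD_DELETE:
            /* a single slice that is itself direct dispatched */
            return stmt->numSlices == 1 &&
                   stmt->slices[0].directDispatch.isDirectDispatch;
        default:
            return false;
    }
}
```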
pygresql is not a dependency of plpython, so it may not exist when running
the regress tests. And since it is not in most distros' official repos, we
may use psycopg2 instead in the future.

In a future release, the Python versions for the management utilities and
plpython will be different. That means there would be no 3rd-party Python
libs for plpython by default.

Also, there is no need to use pygresql/psycopg2 in the isolation2 tests;
plpython comes with SPI APIs, and most things can be achieved easily
with them.

This PR removes the usage of pygresql in the isolation2 tests:
- Most pygresql calls can simply be replaced with plpython SPI calls.
- plpython SPI always starts a transaction, and resgroup doesn't
  like that. Use psql instead.
- Direct dispatch doesn't support transaction blocks; use multiple '-c'
  options with psql to solve that.
… gp_resgroup_status_per_host (#15688)

change "rsgname" to "groupname" on the view of gp_resgroup_status
and gp_resgroup_status_per_host, make them consistent.
Recently, the dev pipeline always failed at resource_group_v2_rhel8.
After some investigation, I found we must use the root role to configure Cgroup directly;
otherwise, `echo "+cpu" >> $basedir/cgroup.subtree_control` will fail.

So when we configure Cgroup v2, use `sudo su` to change the role.
After discussion, we decided to rename the resource group option
"cpu_hard_quota_limit" to "cpu_max_percent", and "cpu_soft_priority"
to "cpu_weight".
remove the max limit for min_cost in resource group to be consistent with the resource queue.
In Cgroup v1, we cannot set cpu.cfs_quota_us larger than the parent's; it throws an
"Invalid argument" error:

```
smart@postgres:/sys/fs/cgroup/cpu/gpdb/6441$ echo 370000 > cpu.cfs_quota_us
bash: echo: write error: Invalid argument
```

But this is allowed in Cgroup v2, so the interface between the two versions is not
exactly the same; this PR fixes that.
As the title says, rename "cpu_hard_quota_limit" to "cpu_max_percent".
The pg_resgroup_move_query() function implemented in 51ee26b has
several disadvantages:
1) Slot leaking. The slot acquired in the initiator process doesn't get freed
if the target process isn't alive, didn't receive the signal, or even
received it but was in idle state at the time.
2) A race condition between UnassignResGroup() called at the end of a
transaction (or, honestly, any other code) and the handler called immediately
after the USR1 signal.
3) Entrydb process moving doesn't work. Previously, the USR1 signal was
sent to only one process, target or entrydb. Entrydb moving was never
tested properly.

Improvements added to solve the first problem (a hedged sketch of the
handshake follows this list):
1) Feedback from the target process to the initiator. The target process can
set the initiator's latch to quickly interrupt pg_resgroup_move_query() from
waiting.
2) The existing movetoResSlot parameter acts like a mark. The initiator sets
it and waits on the latch. If movetoResSlot becomes NULL, slot control is on
the target process side.
3) Initiator PID. Used by the target process to get the initiator's latch.
4) A mutex, to guard all critical moveto* parameters from concurrent
changes.
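A minimal sketch of that initiator/target handshake; the structure, function names, and signaling shown here are illustrative assumptions, not the exact GPDB code:

```c
#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include <signal.h>

extern int gp_resource_group_move_timeout;   /* the new GUC, in ms */

/* Hypothetical shared state guarding the moveto* parameters. */
typedef struct MoveToContext
{
    slock_t     mutex;          /* guards the moveto* fields */
    void       *movetoResSlot;  /* set by initiator; NULLed by target on takeover */
    int         initiatorPid;   /* lets the target find the initiator's latch */
} MoveToContext;

/* Initiator: publish the slot, signal the target, wait for feedback. */
static void
initiator_publish_and_wait(MoveToContext *ctx, void *slot, pid_t target_pid)
{
    SpinLockAcquire(&ctx->mutex);
    ctx->movetoResSlot = slot;
    ctx->initiatorPid = MyProcPid;
    SpinLockRelease(&ctx->mutex);

    kill(target_pid, SIGUSR1);      /* simplified; GPDB uses proc-signal machinery */

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     gp_resource_group_move_timeout, 0);
}

/* Target (handler side): take slot ownership, then wake the initiator. */
static void *
target_take_slot(MoveToContext *ctx)
{
    void   *slot;
    PGPROC *initiator;

    SpinLockAcquire(&ctx->mutex);
    slot = ctx->movetoResSlot;
    ctx->movetoResSlot = NULL;      /* mark: slot control is now ours */
    initiator = BackendPidGetProc(ctx->initiatorPid);
    SpinLockRelease(&ctx->mutex);

    if (initiator != NULL)
        SetLatch(&initiator->procLatch);    /* feedback: stop the wait early */
    return slot;
}
```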

To solve the second problem, there was an attempt to use Postgres
interrupts. I was unhappy to learn that GPDB uses the raw InterruptPending
value to do some pre-cancellation of dispatched queries. GPDB assumes
InterruptPending can be triggered only by "negative" events, which lead
to cancellation. A temporary solution with an InterruptPending-like
"positive" flag showed that we may wait a long time for the next
CHECK_FOR_INTERRUPTS() call. For example, some fat insert may do group
moving only at the end of dispatching, which makes no sense.
Thus, I decided to decrease the probability of races by using an
additional IsTransactionState() check.
IMPORTANT NOTICE: the existing solution is still affected by race
conditions. The current handler implementation is an example of bad
design and should be reworked.

To solve the third problem, we now send signals to the target and entrydb
processes separately. We do UnassignResGroup() for one process only,
without cleaning all counters for a still-working entrydb process inside
the handler. Moreover, the entrydb moving logic is now separated from
segment moving, so it became more readable.

A new GUC (gp_resource_group_move_timeout) was added to limit the time
we wait for feedback from the target.

New regression tests show there is no slot leaking in some rare and
now correctly resolved cases.

* Move faulty assertions so they don't fire if the process is already moving.

* Fix wrong slot resetting of the entrydb process on moving (cloudberrydb#386)

The entrydb and main processes may share the same resgroup slot or, in other
words, they have the same session slot. When moving a process to a new group,
we free the old slot only when the last (in our case, entrydb) process has
moved to the new group. At the same time we zero the current session slot. The
`sessionResetSlot()` called from `UnassignResGroup()` didn't check which slot
it was zeroing. This caused a bug where we were unable to move the same
session twice (with an error like `ERROR: process XXX is in IDLE state`). The
reason: the new session slot set in the main process was zeroed by the entrydb
process. From now on, we analyze the current session slot before zeroing.

---------

Co-authored-by: Alexey Gordeev <goa@arenadata.io>
Use subprocess instead of paramiko in the isolation2 tests, and remove all
its dependencies.
In the previous investigation, I neglected the fact that the AutoVacuum worker is
launched by PostmasterMain and will invoke InitResManager to initialize the resource
group if it is enabled. It then uses StartTransactionCommand to create a transaction
to finish its job, which could put the AutoVacuum worker process into admin_group
instead of system_group; this PR fixes that.
Add block io limit for resource group.

GPDB resource group uses Linux cgroup as its backend, and gpdb now supports
cgroup v2. In cgroup v2, we can control block io limits (like iops) even for
non-direct io (in cgroup v1, it must be direct io), so gpdb can also use
block io limits in resource groups now.

In the cgroup v2 interface (a sysfs, default /sys/fs/cgroup), the limitation
is per-disk; you can find detailed information about this in:
  https://docs.kernel.org/admin-guide/cgroup-v2.html

In gpdb, tablespaces work like disks in the OS, so the gpdb block io limitation
is designed for tablespaces. The relation between gpdb tablespaces and cgroup is:
    tablespace -> disks -> cgroup io controller

DDLs for io limit:
 create resource group <rg_name> with(iolimit='iolimit_config_string');
 alter resource group <rg_name> set iolimit 'iolimit_config_string';
Typical cases:
 create resource group rg with (iolimit='ts1:rbps=100,rbps=13; \
ts2:rbps=9,iobps=1');
 create resource group rg with (iolimit='*:rbps=100,rbps=13');

Detailed syntax of io limit:
<iolimit_config_string> :: tablespace_io_config
                        :: tablespace_io_config;<iolimit_config_string>
<tablespace_io_config> :: tablespace:ioconfigs
<tablespace> :: identifier | *
<ioconfigs> :: ioconfig
            :: ioconfig,<ioconfigs>
<ioconfig>  :: key=value
<key>       :: wbps | rbps | wiops | riops
<value>     :: integer | max
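Once a tablespace is resolved to its backing disks, each parsed entry becomes one line written to the group's io.max file. A hypothetical sketch of that write (directory layout and helper name are illustrative assumptions):

```c
#include <stdio.h>

/*
 * Write one Cgroup v2 io.max line ("MAJ:MIN rbps=... wbps=... riops=... wiops=...")
 * for a disk backing a tablespace. A value of 0 here stands for "max" (no limit).
 */
static void
set_io_limit(const char *cgroup_dir, int major, int minor,
             long rbps, long wbps, long riops, long wiops)
{
    char  path[1024];
    FILE *f;

    snprintf(path, sizeof(path), "%s/io.max", cgroup_dir);
    f = fopen(path, "w");
    if (f == NULL)
        return;

    fprintf(f, "%d:%d", major, minor);
    if (rbps > 0)  fprintf(f, " rbps=%ld",  rbps);  else fprintf(f, " rbps=max");
    if (wbps > 0)  fprintf(f, " wbps=%ld",  wbps);  else fprintf(f, " wbps=max");
    if (riops > 0) fprintf(f, " riops=%ld", riops); else fprintf(f, " riops=max");
    if (wiops > 0) fprintf(f, " wiops=%ld", wiops); else fprintf(f, " wiops=max");
    fputc('\n', f);
    fclose(f);
}
```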

There is a dev thread at
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/l-5RpOz0fP8/m/mqAq2NH_DAAJ
where you can get more information.

The most important part is finding the disks of a tablespace. A tablespace is
just a directory, and finding the disks of a directory in Linux is not easy,
for example with some software RAID filesystems (like btrfs); but for
non-stacked block devices it is easy. GPDB currently only supports non-stacked
block devices, but in the future we should support common stacked block
devices and software RAID.

Reviewed By:
Zhenghua Lyu <kainwen@gmail.com>
Zhenglong Li <SmartKeyerror@gmail.com>
Haolin Wang <whaolin@vmware.com>
Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Sasasu <i@sasa.su>
TAO TANG

Co-authored-by: Zhenghua Lyu <kainwen@gmail.com>
…#16011)

These files are generated by the parser generator; let's ignore them.
This commit fixes the following warnings:
+ "unused" warnings caused by the token types (keyword types are
declared to be value types).

+ the results of "realpath" and "fscanf" not being handled.
It does not re-throw the ERROR in PG_CATCH(), and according to the current code
we do not need to use PG_TRY() to catch the dispatch error, so just remove them.
Currently, the QD will not attach a resource group when it's already running in
bypass mode, so things go wrong if we change the role, just like:

```
CREATE RESOURCE GROUP rg_test_group WITH (cpu_max_percent=10, concurrency=5);
CREATE ROLE rg_test_role RESOURCE GROUP rg_test_group;

SET ROLE rg_test_role;
CREATE TABLE rg_test_group_table(a int);

-- Since rg_test_group_table is empty, the total cost of this SQL will be less than min_cost,
-- so it'll run in bypass mode
SELECT count(*) FROM rg_test_group_table;
SELECT is_session_in_group(pg_backend_pid(), 'rg_test_group');
RESET ROLE;

-- Still running in bypass mode, but the role is not rg_test_role
SELECT count(*) FROM rg_test_group_table;
SELECT is_session_in_group(pg_backend_pid(), 'admin_group');
DROP TABLE rg_test_group_table;
```
Because some OSes do not include "io" in cgroup.subtree_control by
default, add it to fix the broken CI.
…16104)

The test cases in resgroup_cpu_max_percent were designed for
gp_resource_group_cpu_limit equal to 0.9, but I set it to 1 in #15699.

Because the test results allow a 10% deviation, it was possible to pass the
test incorrectly in most cases.
add some resource group test results into .gitignore
@my-ship-it
Contributor

After discussion, it is not needed anymore.

@my-ship-it my-ship-it closed this Aug 26, 2024