
[WIP] Cherry-pick ResourceGroup update from GreenPlum #440

Closed
wants to merge 45 commits

Conversation

foreyes
Collaborator

@foreyes foreyes commented May 17, 2024

fix #ISSUE_Number


Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists for submitting your pull request; please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • Link the related GitHub Issues or Discussions (if any).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

@CLAassistant

CLAassistant commented May 17, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 11 committers have signed the CLA.

✅ foreyes
❌ kainwen
❌ SmartKeyerror
❌ fairyfar
❌ RMTT
❌ andr-sokolov
❌ AJR-VMware
❌ beeender
❌ yanwr1
❌ Tao-T
❌ zxuejing
You have signed the CLA already but the status is still pending? Let us recheck it.

@avamingli avamingli added the cherry-pick (cherry-pick upstream commits) label May 17, 2024
@foreyes foreyes force-pushed the cherry-pick-rg branch 4 times, most recently from 4d25652 to 1d6398a on May 21, 2024 02:29
@foreyes foreyes force-pushed the cherry-pick-rg branch 5 times, most recently from 8dfaed6 to 4ecc6d9 on June 7, 2024 11:48
@foreyes foreyes force-pushed the cherry-pick-rg branch 5 times, most recently from b485135 to 4ac3137 on June 17, 2024 01:24
kainwen and others added 14 commits July 5, 2024 01:12
Greenplum introduced auto_stats a long time ago; please refer to
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/bAyw2KBP6yE/m/hmoWikrPAgAJ
for the details and the decision behind auto_stats.

auto_stats() is now invoked at the following 7 places:
  1. ProcessQuery()
  2. _SPI_pquery()
  3. postquel_end()
  4. ATExecExpandTableCTAS()
  5. ATExecSetDistributedBy()
  6. DoCopy()
  7. ExecCreateTableAs()

Previously, places 2 and 3 were hard-coded as inside-function,
and places 1 and 4~7 were hard-coded as not-inside-function.
Places 4~7 do not cover the case where COPY or CTAS
is called inside a procedural language.

Since auto_stats will be removed in the future, for now let's
just do a simple fix instead of a big refactor.

To correctly pass the inFunction parameter to auto_stats()
at places 4~7, we introduce executor_run_nesting_level to mark
whether the program is already under ExecutorRun(). Places 4~7
are taken directly as utility statements and will not call
ExecutorRun() unless they are inside a procedural language. This
technique is the same one used by the auto_explain extension.

For places 1~3 the context is clear, so we do not need to check
executor_run_nesting_level.

NOTE: Greenplum now supports CREATE PROCEDURE, invoked as
CALL proc() instead of SELECT proc(). To handle this case, we
also keep a static module variable that tracks the nesting level
of standard_ProcessUtility; if the nesting level is >= 2, we are
inside a procedure call.
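A minimal sketch of the nesting-level technique, in the style of auto_explain's hook; the function names and hook wiring here are illustrative assumptions, not the exact patch:

```c
#include "postgres.h"
#include "executor/executor.h"

static int executor_run_nesting_level = 0;

/* Wrap ExecutorRun so the counter is non-zero whenever we are inside it. */
static void
track_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
                  uint64 count, bool execute_once)
{
    executor_run_nesting_level++;
    PG_TRY();
    {
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
    }
    PG_FINALLY();
    {
        executor_run_nesting_level--;
    }
    PG_END_TRY();
}

/* At places 4~7: being under ExecutorRun() implies a procedural-language caller. */
static bool
called_inside_function(void)
{
    return executor_run_nesting_level > 0;
}
```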
After we changed the semantics of the resource group in #14562, it's time to add the implementation
of Linux Cgroup v2.

In the Linux kernel, there is no difference between Cgroup v1 and Cgroup v2 in the internal implementation;
the only change is the pseudo-file interface. Thanks to #14343, we just need to add a new file
and implement a new `cgroupOpsRoutine`, which can be found in `get_group_routine_v2()`.

Originally, we assumed that developers would have different Cgroup mount directories, so we
generated the test files dynamically. But we found this was over-engineering, so we moved some
test files from `input/` to `sql/`, which is more convenient for developers.

The implementation lives in `src/backend/utils/resgroup/cgroup-ops-linux-v2.c`. Cgroup version 2 uses
the "unified hierarchy": the pseudo-files of all controllers live in a single directory, instead of
per-controller directories such as "cpu,cpuacct", "cpuset", and "memory".

This is the biggest difference between the two versions, and it directly decides how we configure
them. Here is an example:

```
# Cgroup v1: one directory per controller
smart@stable:/sys/fs/cgroup$ ls
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

# Cgroup v2: unified hierarchy, every controller's pseudo-files in one directory
[root@localhost gpdb]# ls
cgroup.controllers  cgroup.max.descendants  cgroup.type    cpu.stat         cpuset.cpus            io.max         pids.current
cgroup.events       cgroup.procs            cpu.idle       cpu.uclamp.max   cpuset.cpus.effective  io.pressure    pids.events
cgroup.freeze       cgroup.stat             cpu.max        cpu.uclamp.min   cpuset.cpus.partition  io.prio.class  pids.max
cgroup.kill         cgroup.subtree_control  cpu.max.burst  cpu.weight       cpuset.mems            io.stat
cgroup.max.depth    cgroup.threads          cpu.pressure   cpu.weight.nice  cpuset.mems.effective  io.weight
```

The documentation for Cgroup v2 can be found at:

- [Control Group v2](https://docs.kernel.org/admin-guide/cgroup-v2.html)
- [Using cgroups-v2 to control distribution of CPU time for applications](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel)

For now, to enable Cgroup v2, configure the Greenplum database with:

```
gpconfig -c gp_resource_manager -v group-v2
```

and to use Cgroup v1, just use:

```
gpconfig -c gp_resource_manager -v group
```
This was an oversight in PR #14733.

As the title says, this PR just fixes the failing unit test test_unit_gpcheckresgroupimpl.py:
1. `pg_resgroup.dat` already defines the reserved resource group OIDs, so there is no need to redefine them.
2. `DEFAULTAUXILIARY_OID` was renamed to `SYSTEMRESGROUP_OID`.
…tatus() function (#14959)

In the case of insert-select, the cpu_usage field contained information about
the master only. This led to an error when analyzing the table, as the
contents of the field did not conform to JSON format.
The error was fixed by setting the proexeclocation property of the function
to 'i' (PROEXECLOCATION_INITPLAN).

https://github.com/greenplum-db/gpdb/issues/14154
We need to bypass queries that touch only `catalog` tables in resource
groups in some situations, e.g. database GUI clients (DBeaver) that
run catalog queries to fetch metadata.

The code just walks the parse tree, finds each `RangeVar`, and compares its
schemaname with `pg_catalog`; only a few cases are considered for now.
Because every query is walked once resource groups are enabled,
the overhead should be as small as possible.

The following catalog query will be bypassed:
```sql
SELECT n.oid,n.*,d.description FROM pg_catalog.pg_namespace n
LEFT OUTER JOIN pg_catalog.pg_description d ON d.objoid=n.oid AND
d.objsubid=0 AND d.classoid='pg_namespace'::regclass
WHERE nspname='public' ORDER BY nspname;
```
But queries without any `RangeVar`s will be ignored, such as:
```sql
SELECT 1;
SELECT some_udf();
```
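A minimal sketch of such a walk, under assumed names (not the exact patch code):

```c
#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/parsenodes.h"

/*
 * Return true as soon as we find a RangeVar that is not schema-qualified
 * with pg_catalog; a query qualifies for the bypass only when the walk
 * finds no such RangeVar. Names here are illustrative.
 */
static bool
non_catalog_walker(Node *node, void *context)
{
    if (node == NULL)
        return false;

    if (IsA(node, RangeVar))
    {
        RangeVar *rv = (RangeVar *) node;

        return rv->schemaname == NULL ||
               strcmp(rv->schemaname, "pg_catalog") != 0;
    }
    return raw_expression_tree_walker(node, non_catalog_walker, context);
}
```

A statement would then be bypassed only when it contains at least one `RangeVar` and the walk returns false, which is why `RangeVar`-less queries like `SELECT 1;` are ignored.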
Since commit c25db0c supports resgroup with memory_limit, add it to the
gp_resgroup_config view.
In the previous version, we disabled the red-zone check for resource groups;
the resgroup used cgroup to kill processes whose memory consumption
exceeded the group's configuration.

In the new version of resgroup, we use redZoneChunks to determine
whether the current process is in the red zone or not, so remove the
resGroupId check.
Add a permission check for the file /sys/fs/cgroup/cgroup.procs.

Make sure gpstart -rai will definitely succeed if the customer uses gpconfig to set
gp_resource_manager to group-v2.
In the case of insert-select, the value column contained information about
the coordinator only. The error was fixed by setting the proexeclocation property of
the function to 'i' (PROEXECLOCATION_INITPLAN).
https://github.com/greenplum-db/gpdb/issues/14154

Executing the query "SELECT * FROM pg_resgroup_get_status_kv(NULL);" resulted
in a segmentation fault. A NULL check has been added.

The dumpResGroupInfo() function was called twice per query. Now it is called
only once.

The function code was also simplified.
…290)

In resource group mode, if the user requests more than statement_mem,
grant that; otherwise set statement_mem to the query's memory limit.
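Either way the sentence is read, the net effect is taking the larger of the two values; a hypothetical sketch (names are illustrative, not the exact code):

```c
#include "postgres.h"   /* for the Max() macro */

/*
 * Hypothetical sketch: pick the per-query memory in resource group mode.
 * statement_mem is the user-set GUC; group_query_mem is the limit derived
 * from the resource group's configuration.
 */
static uint64
resgroup_choose_query_mem(uint64 statement_mem, uint64 group_query_mem)
{
    return Max(statement_mem, group_query_mem);
}
```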
Add memory_usage into the function pg_resgroup_get_status() result.

postgres=# select * from gp_toolkit.gp_resgroup_status_per_host;
    rsgname    | groupid | hostname | cpu_usage | memory_usage
---------------+---------+----------+-----------+--------------
 admin_group   |    6438 | zero     |      0.09 |         5.55
 default_group |    6437 | zero     |      0.00 |         0.00
 system_group  |    6441 | zero     |      0.01 |        40.55
(3 rows)

It'll return the memory_usage in megabytes, to help the customer address memory problems.

For Linux Cgroup v1, the memory usage is read from `memory.usage_in_bytes`. For Linux Cgroup
v2, it's read from `memory.current`.
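A hypothetical sketch of that read path; the paths and the helper name are illustrative assumptions:

```c
#include <stdio.h>

/* Read the cgroup memory usage pseudo-file and convert bytes to megabytes. */
static double
read_memory_usage_mb(int cgroup_version)
{
    /* Hypothetical mount points; real paths depend on the configured hierarchy. */
    const char *path = (cgroup_version == 2)
        ? "/sys/fs/cgroup/gpdb/memory.current"
        : "/sys/fs/cgroup/memory/gpdb/memory.usage_in_bytes";
    FILE *f = fopen(path, "r");
    unsigned long long bytes = 0;

    if (f == NULL)
        return 0.0;
    if (fscanf(f, "%llu", &bytes) != 1)
        bytes = 0;
    fclose(f);
    return (double) bytes / (1024.0 * 1024.0);
}
```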
fix the broken pipeline
RMTT and others added 27 commits July 5, 2024 01:12
Fix the bypass-catalog unit test.

The GUC gp_resource_group_bypass_catalog_query doesn't
get its default value in the unit test environment, so I set the value in the unit test.

BTW, the unit test should return a non-zero value when cases fail.
This commit introduces a new GUC gp_resource_group_bypass_direct_dispatch,
which is true by default. After getting the plan, we can unassign (bypass)
from the resgroup (a sketch of the check follows this message):
  - for SelectStmt, direct dispatch means there are two slices, where the 1st
    slice is the QD and the 2nd is direct dispatch
  - for (Update|Delete|Insert)Stmt, direct dispatch means there is only one
    slice and that slice is direct dispatch

We also move some tests into the bypass suite.

Co-authored-by: Wenru Yan <ywenru@vmware.com>
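A hedged sketch of that slice check; the PlannedStmt/PlanSlice field names below are assumptions based on GPDB's planner structures, not verified against this patch:

```c
#include "postgres.h"
#include "nodes/plannodes.h"

/* Can this plan be unassigned (bypassed) from its resource group? */
static bool
can_bypass_direct_dispatch(PlannedStmt *stmt)
{
    switch (stmt->commandType)
    {
        case CMD_SELECT:
            /* two slices: slice 0 runs on the QD, slice 1 is direct dispatched */
            return stmt->numSlices == 2 &&
                   stmt->slices[1].directDispatch.isDirectDispatch;
        case CMD_INSERT:
        case CMD_UPDATE:
        case CMD_DELETE:
            /* a single slice that is itself direct dispatched */
            return stmt->numSlices == 1 &&
                   stmt->slices[0].directDispatch.isDirectDispatch;
        default:
            return false;
    }
}
```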
pygresql is not a dependency of plpython, so it may not exist when running
the regress tests. And since it is not in most distros' official repos, we
may use psycopg2 instead in the future.

In a future release, the Python versions for the management utilities and
plpython will be different. That means there would be no 3rd-party Python
libs for plpython by default.

Also, there is no need to use pygresql/psycopg2 in the isolation2 tests;
plpython comes with SPI APIs, and most things can be achieved easily
with them.

This PR removes the usage of pygresql in the isolation2 tests:
- Most pygresql calls can simply be replaced with plpython SPI calls.
- plpython SPI always starts a transaction, and resgroup doesn't
  like that. Use psql instead.
- Direct dispatch doesn't support transaction blocks; use multiple '-c'
  options with psql to solve that.
… gp_resgroup_status_per_host (#15688)

change "rsgname" to "groupname" on the view of gp_resgroup_status
and gp_resgroup_status_per_host, make them consistent.
Recently, the dev pipeline always failed at resource_group_v2_rhel8.
After some investigation, I found we must use the root role to configure Cgroup directly;
otherwise, `echo "+cpu" >> $basedir/cgroup.subtree_control` will fail.

So when we configure Cgroup v2, use `sudo su` to change the role.
After discussion, we decided to rename the resource group option
"cpu_hard_quota_limit" to "cpu_max_percent", and "cpu_soft_priority"
to "cpu_weight".
remove the max limit for min_cost in resource group to be consistent with the resource queue.
In Cgroup v1, we cannot set cpu.cfs_quota_us larger than the parent's; it throws an
"Invalid argument" error:

```
smart@postgres:/sys/fs/cgroup/cpu/gpdb/6441$ echo 370000 > cpu.cfs_quota_us
bash: echo: write error: Invalid argument
```

But this is allowed in Cgroup v2, so the interface between the two versions is not
exactly the same; this PR fixes that.
As the title says, rename "cpu_hard_quota_limit" to "cpu_max_percent".
The pg_resgroup_move_query() function implemented in 51ee26b has
several disadvantages:
1) Slot leaking. The slot acquired in the initiator process doesn't get freed
if the target process isn't alive, didn't receive the signal, or even
received it but was in idle state at the time.
2) A race condition between UnassignResGroup() called at the end of a
transaction (or, honestly, any other code) and the handler called immediately
after the USR1 signal.
3) Entrydb process moving doesn't work. Previously, the USR1 signal was
sent to only one process, target or entrydb. Entrydb moving was never
tested properly.

Improvements added to solve the first problem (a hedged sketch of the
handshake follows this list):
1) Feedback from the target process to the initiator. The target process can
set the initiator's latch to quickly interrupt pg_resgroup_move_query() from
waiting.
2) The existing movetoResSlot parameter acts like a mark. The initiator sets
it and waits on the latch. If movetoResSlot becomes NULL, slot control is on
the target process side.
3) Initiator PID. Used by the target process to get the initiator's latch.
4) A mutex, to guard all critical moveto* parameters from concurrent
changes.
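A minimal sketch of that initiator/target handshake; the structure, function names, and signaling shown here are illustrative assumptions, not the exact GPDB code:

```c
#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include <signal.h>

extern int gp_resource_group_move_timeout;   /* the new GUC, in ms */

/* Hypothetical shared state guarding the moveto* parameters. */
typedef struct MoveToContext
{
    slock_t     mutex;          /* guards the moveto* fields */
    void       *movetoResSlot;  /* set by initiator; NULLed by target on takeover */
    int         initiatorPid;   /* lets the target find the initiator's latch */
} MoveToContext;

/* Initiator: publish the slot, signal the target, wait for feedback. */
static void
initiator_publish_and_wait(MoveToContext *ctx, void *slot, pid_t target_pid)
{
    SpinLockAcquire(&ctx->mutex);
    ctx->movetoResSlot = slot;
    ctx->initiatorPid = MyProcPid;
    SpinLockRelease(&ctx->mutex);

    kill(target_pid, SIGUSR1);      /* simplified; GPDB uses proc-signal machinery */

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     gp_resource_group_move_timeout, 0);
}

/* Target (handler side): take slot ownership, then wake the initiator. */
static void *
target_take_slot(MoveToContext *ctx)
{
    void   *slot;
    PGPROC *initiator;

    SpinLockAcquire(&ctx->mutex);
    slot = ctx->movetoResSlot;
    ctx->movetoResSlot = NULL;      /* mark: slot control is now ours */
    initiator = BackendPidGetProc(ctx->initiatorPid);
    SpinLockRelease(&ctx->mutex);

    if (initiator != NULL)
        SetLatch(&initiator->procLatch);    /* feedback: stop the wait early */
    return slot;
}
```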

To solve the second problem, there was an attempt to use Postgres
interrupts. I was unhappy to learn that GPDB uses the raw InterruptPending
value to do some pre-cancellation of dispatched queries. GPDB assumes
InterruptPending can be triggered only by "negative" events, which lead
to cancellation. A temporary solution with an InterruptPending-like
"positive" flag showed that we may wait a long time for the next
CHECK_FOR_INTERRUPTS() call. For example, some fat insert may do group
moving only at the end of dispatching, which makes no sense.
Thus, I decided to decrease the probability of races by using an
additional IsTransactionState() check.
IMPORTANT NOTICE: the existing solution is still affected by race
conditions. The current handler implementation is an example of bad
design and should be reworked.

To solve the third problem, we now send signals to the target and entrydb
processes separately. We do UnassignResGroup() for one process only,
without cleaning all counters for a still-working entrydb process inside
the handler. Moreover, the entrydb moving logic is now separated from
segment moving, so it became more readable.

A new GUC (gp_resource_group_move_timeout) was added to limit the time
we wait for feedback from the target.

New regression tests show there is no slot leaking in some rare and
now correctly resolved cases.

* Move faulty assertions so they don't fire if the process is already moving.

* Fix wrong slot resetting of the entrydb process on moving (cloudberrydb#386)

The entrydb and main processes may share the same resgroup slot or, in other
words, they have the same session slot. When moving a process to a new group,
we free the old slot only when the last (in our case, entrydb) process has
moved to the new group. At the same time we zero the current session slot. The
`sessionResetSlot()` called from `UnassignResGroup()` didn't check which slot
it was zeroing. This caused a bug where we were unable to move the same
session twice (with an error like `ERROR: process XXX is in IDLE state`). The
reason: the new session slot set in the main process was zeroed by the entrydb
process. From now on, we analyze the current session slot before zeroing.

---------

Co-authored-by: Alexey Gordeev <goa@arenadata.io>
Use subprocess instead of paramiko in the isolation2 tests, and remove all
its dependencies.
In the previous investigation, I neglected the fact that the AutoVacuum worker is
launched by PostmasterMain and will invoke InitResManager to initialize the resource
group if it is enabled. It then uses StartTransactionCommand to create a transaction
to finish its job, which could put the AutoVacuum worker process into admin_group
instead of system_group; this PR fixes that.
Add block io limit for resource group.

GPDB resource group uses Linux cgroup as its backend, and gpdb now supports
cgroup v2. In cgroup v2, we can control block io limits (like iops) even for
non-direct io (in cgroup v1, it must be direct io), so gpdb can also use
block io limits in resource groups now.

In the cgroup v2 interface (a sysfs, default /sys/fs/cgroup), the limitation
is per-disk; you can find detailed information about this in:
  https://docs.kernel.org/admin-guide/cgroup-v2.html

In gpdb, tablespaces work like disks in the OS, so the gpdb block io limitation
is designed for tablespaces. The relation between gpdb tablespaces and cgroup is:
    tablespace -> disks -> cgroup io controller

DDLs for io limit:
 create resource group <rg_name> with(iolimit='iolimit_config_string');
 alter resource group <rg_name> set iolimit 'iolimit_config_string';
Typical cases:
 create resource group rg with (iolimit='ts1:rbps=100,rbps=13; \
ts2:rbps=9,iobps=1');
 create resource group rg with (iolimit='*:rbps=100,rbps=13');

Detailed syntax of io limit:
<iolimit_config_string> :: tablespace_io_config
                        :: tablespace_io_config;<iolimit_config_string>
<tablespace_io_config> :: tablespace:ioconfigs
<tablespace> :: identifier | *
<ioconfigs> :: ioconfig
            :: ioconfig,<ioconfigs>
<ioconfig>  :: key=value
<key>       :: wbps | rbps | wiops | riops
<value>     :: integer | max
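Once a tablespace is resolved to its backing disks, each parsed entry becomes one line written to the group's io.max file. A hypothetical sketch of that write (directory layout and helper name are illustrative assumptions):

```c
#include <stdio.h>

/*
 * Write one Cgroup v2 io.max line ("MAJ:MIN rbps=... wbps=... riops=... wiops=...")
 * for a disk backing a tablespace. A value of 0 here stands for "max" (no limit).
 */
static void
set_io_limit(const char *cgroup_dir, int major, int minor,
             long rbps, long wbps, long riops, long wiops)
{
    char  path[1024];
    FILE *f;

    snprintf(path, sizeof(path), "%s/io.max", cgroup_dir);
    f = fopen(path, "w");
    if (f == NULL)
        return;

    fprintf(f, "%d:%d", major, minor);
    if (rbps > 0)  fprintf(f, " rbps=%ld",  rbps);  else fprintf(f, " rbps=max");
    if (wbps > 0)  fprintf(f, " wbps=%ld",  wbps);  else fprintf(f, " wbps=max");
    if (riops > 0) fprintf(f, " riops=%ld", riops); else fprintf(f, " riops=max");
    if (wiops > 0) fprintf(f, " wiops=%ld", wiops); else fprintf(f, " wiops=max");
    fputc('\n', f);
    fclose(f);
}
```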

There is a dev thread at
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/l-5RpOz0fP8/m/mqAq2NH_DAAJ
where you can get more information.

The most important part is finding the disks of a tablespace. A tablespace is
just a directory, and finding the disks of a directory in Linux is not easy,
for example with some software RAID filesystems (like btrfs); but for
non-stacked block devices it is easy. GPDB currently only supports non-stacked
block devices, but in the future we should support common stacked block
devices and software RAID.

Reviewed By:
Zhenghua Lyu <kainwen@gmail.com>
Zhenglong Li <SmartKeyerror@gmail.com>
Haolin Wang <whaolin@vmware.com>
Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Sasasu <i@sasa.su>
TAO TANG

Co-authored-by: Zhenghua Lyu <kainwen@gmail.com>
…#16011)

These files are generated by the parser generator; let's ignore them.
This commit fixes the following warnings:
+ "unused" warnings caused by the token types (keyword types are
declared to be value types).

+ the results of "realpath" and "fscanf" not being handled.
It does not re-throw the ERROR in PG_CATCH(), and according to the current code
we do not need to use PG_TRY() to catch the dispatch error, so just remove them.
Currently, the QD will not attach a resource group when it's already running in
bypass mode, so things go wrong if we change the role, just like:

```
CREATE RESOURCE GROUP rg_test_group WITH (cpu_max_percent=10, concurrency=5);
CREATE ROLE rg_test_role RESOURCE GROUP rg_test_group;

SET ROLE rg_test_role;
CREATE TABLE rg_test_group_table(a int);

-- Since rg_test_group_table is empty, the total cost of this SQL will be less than min_cost,
-- so it'll run in bypass mode
SELECT count(*) FROM rg_test_group_table;
SELECT is_session_in_group(pg_backend_pid(), 'rg_test_group');
RESET ROLE;

-- Still running in bypass mode, but the role is not rg_test_role
SELECT count(*) FROM rg_test_group_table;
SELECT is_session_in_group(pg_backend_pid(), 'admin_group');
DROP TABLE rg_test_group_table;
```
Because some OSes do not include "io" in cgroup.subtree_control by
default, add it to fix the broken CI.
…16104)

The test cases in resgroup_cpu_max_percent were designed for
gp_resource_group_cpu_limit equal to 0.9, but I set it to 1 in #15699.

Because the test results allow a 10% deviation, it was possible to pass the
test incorrectly in most cases.
add some resource group test results into .gitignore
@my-ship-it
Contributor

After discussion, it is not needed anymore.

@my-ship-it my-ship-it closed this Aug 26, 2024