
rgw: make incomplete multipart upload part of bucket check efficient #57083

Merged (1 commit, May 9, 2024)

Conversation

@ivancich (Member) commented Apr 24, 2024

Previously, the incomplete-multipart portion of bucket check would list all entries in the multipart namespace across all shards and then analyze them in memory before taking further action.

Since all index entries for a given multipart upload reside on the same shard by design, we can work shard by shard asynchronously. Furthermore, since all entries for a given multipart upload are sequential in the bucket index, we can use a small window to analyze each upload.

This should make the operation quicker and use much less memory in the worst cases.

Fixes: https://tracker.ceph.com/issues/65769
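To illustrate the windowed approach described above, here is a minimal Python sketch. It assumes a simplified `"<object>/<upload_id>/<part>"` key format (hypothetical; real RGW keys in the _multipart_ namespace are more complex), but relies on the same property the PR exploits: all parts of one upload sort adjacently within a single shard, so each upload can be analyzed with a window bounded by its own part count rather than buffering every shard's entries.

```python
from itertools import groupby

def upload_key(entry):
    # Hypothetical entry format for illustration: "<object>/<upload_id>/<part>".
    obj, upload_id, _part = entry.rsplit("/", 2)
    return (obj, upload_id)

def scan_shard(shard_entries, is_complete):
    """Scan one shard's sorted multipart entries, holding only one
    upload's entries in memory at a time (the 'small window')."""
    stale = []
    for key, parts in groupby(sorted(shard_entries), key=upload_key):
        window = list(parts)  # bounded by this upload's part count
        if not is_complete(key):
            stale.append((key, window))
    return stale
```

Because each shard can be scanned independently, the per-shard scans are also natural units for the asynchronous, bounded-concurrency dispatch the PR adds.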

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@ivancich ivancich requested a review from a team as a code owner April 24, 2024 22:26
@ivancich ivancich requested a review from cfsnyder April 24, 2024 22:30
@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch from ec09933 to 4994233 on April 24, 2024 22:36
@ivancich ivancich requested a review from mkogan1 April 24, 2024 22:37
@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch 2 times, most recently from 01d2953 to aa08528 on April 25, 2024 02:59
@ivancich (Member, Author):

jenkins test make check

@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch 2 times, most recently from fd20641 to 1f10776 on April 25, 2024 18:02
@ivancich (Member, Author):

jenkins test api

@cbodley (Contributor) left a comment:

very nice. a lot of things could benefit from concurrency like this, but it gets complicated with optional_yield

@adamemerson and i have been working on some concurrency primitives for use with c++20 coroutines (#50005 for example). something similar for optional_yield could help to simplify efforts like this by hiding the complexity needed to support the two different runtimes

edit: the co_throttle class from #49720 would be a better algorithm here because it enforces bounded concurrency like your use of max_aio

p.s. expecting minor conflicts from #55592 which removes our fork of the spawn library
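The bounded-concurrency shape cbodley describes (co_throttle enforcing a cap, like the PR's use of max_aio) can be sketched in a few lines of asyncio. All names here are illustrative, not Ceph's actual C++ API: a semaphore caps the number of in-flight per-shard operations while results come back in submission order.

```python
import asyncio

async def run_throttled(coros, max_aio=16):
    """Run coroutines with at most max_aio in flight: the same
    bounded-concurrency idea that co_throttle (or a max_aio
    semaphore) enforces. Names are illustrative, not Ceph's API."""
    sem = asyncio.Semaphore(max_aio)

    async def bounded(coro):
        async with sem:
            return await coro

    # gather preserves submission order in its results
    return await asyncio.gather(*(bounded(c) for c in coros))

async def check_shard(shard_id):
    # stand-in for an asynchronous per-shard index listing
    await asyncio.sleep(0)
    return shard_id * 2
```

The key design point is that concurrency is bounded up front rather than spawning one task per shard: with 1999 shards, an unbounded fan-out would defeat the memory savings the PR is after.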

Five review threads on src/rgw/driver/rados/rgw_bucket.cc (all resolved; four marked outdated).
@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch from 1f10776 to 6f46e87 on April 29, 2024 15:15
@ivancich ivancich requested a review from cbodley April 29, 2024 16:05
@cbodley (Contributor) left a comment:

looks great if it works. do we have any useful test coverage of this stuff?

Two review threads on src/rgw/driver/rados/rgw_bucket.cc (resolved as outdated).
@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch from 6f46e87 to 6f78ffd on April 29, 2024 16:58
@cbodley (Contributor) commented Apr 30, 2024

unittest-seastore (Timeout)

apparently fixed since; see https://tracker.ceph.com/issues/65585

@cbodley (Contributor) commented Apr 30, 2024

jenkins test make check

Previously the incomplete multipart portion of bucket check would list
all entries in the _multipart_ namespace across all shards and then
analyze them in memory before taking further action.

Since all index entries for a given multipart upload are all on the
same shard by design, we can work on this asynchronously shard by
shard. Furthermore since all entries for a given multipart upload are
sequential in the bucket index, we can use a small window to analyze
each of the uploads.

This should make the operation quicker and use much less memory in the
worst cases.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
@ivancich ivancich force-pushed the wip-optimize-bucket-check-mulitparts branch from 6f78ffd to ebe9893 on April 30, 2024 16:01
@ivancich (Member, Author):

I've rebased on latest main; we'll see if that fixes the CI issue(s).

@ivancich (Member, Author):

@mkogan1: I was looking into what it'd take to port this to Quincy. It'd be non-trivial. But maybe it'd be easy to make a version that simply processes each shard sequentially.

I was curious whether a HEAD radosgw-admin could operate safely on a Quincy cluster. It looks like one key data structure does change between the two versions -- cls_rgw_bucket_instance_entry -- and that is in the rgw_bucket_dir_header in which the stats are stored, and radosgw-admin bucket check --fix ... does update stats. So I don't think mismatching the tool's version with the cluster's version is viable.

@ivancich (Member, Author) commented May 1, 2024

jenkins test api

@cbodley (Contributor) commented May 1, 2024

https://jenkins.ceph.com/job/ceph-api/73295/

ERROR: setUpClass (tasks.mgr.dashboard.test_health.HealthTest)

commented on https://tracker.ceph.com/issues/47612, but it's over 3 years old now. cc @Pegonzal @epuertat can someone please look into this?

@cbodley (Contributor) commented May 1, 2024

jenkins test api

@cbodley (Contributor) commented May 1, 2024

https://jenkins.ceph.com/job/ceph-windows-pull-requests/39437/consoleFull

ceph_test_librbd.TestMigration.Stress2
[2024-04-30T20:01:03.000Z] subunit-trace reports failures: Command failed: type C:\workspace\test_results\subunit.out | subunit-trace
One or more test suites have failed

asked about this on #ceph-devel slack

@cbodley (Contributor) commented May 1, 2024

jenkins test windows

@ivancich (Member, Author) commented May 2, 2024

@mkogan1 Ignore my previous comment. Backporting to Quincy was non-trivial but not that hard, and I have a PR ready to go once we resolve this one. See: #57244

@ivancich (Member, Author) commented May 2, 2024

> https://jenkins.ceph.com/job/ceph-windows-pull-requests/39437/consoleFull
>
> ceph_test_librbd.TestMigration.Stress2
> [2024-04-30T20:01:03.000Z] subunit-trace reports failures: Command failed: type C:\workspace\test_results\subunit.out | subunit-trace
> One or more test suites have failed
>
> asked about this on #ceph-devel slack

Thanks, @cbodley!

@ivancich ivancich added the wip-eric-testing-1 for ivancich testing label May 6, 2024
@mkogan1 (Contributor) commented May 8, 2024

@ivancich hi, I ran a test of bucket check [--fix] performance and memory footprint before and after this PR's commit. The real memory footprint is lower, though the bucket check operation takes longer.

## Dataset : single bucket with 100M objects, 1999 shards:


❯ time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 100000000 -m cxip -bp b01b- -op 'folder01/stage01_' |& tee hsbench.log | stdbuf -o0 colrm 170 | ccze -Aonolookups)


"bucket": "b01b-000000000000", "num_shards": 1999


#### BEFORE PR:
~~~~~~~~~~~~~~~

❯ sudo time -v ./bin/radosgw-admin bucket check --bucket=b01b-000000000000
{
    "invalid_multipart_entries": [],
    "check_result": {
        "existing_header": {
            "usage": {
                "rgw.main": {
                    "size": 409600000000,
                    "size_actual": 409600000000,
                    "size_utilized": 409600000000,
                    "size_kb": 400000000,
                    "size_kb_actual": 400000000,
                    "size_kb_utilized": 400000000,
                    "num_objects": 100000000
                }
            }
        },
        "calculated_header": {
            "usage": {
                "rgw.main": {
                    "size": 409600000000,
                    "size_actual": 409600000000,
                    "size_utilized": 409600000000,
                    "size_kb": 400000000,
                    "size_kb_actual": 400000000,
                    "size_kb_utilized": 400000000,
                    "num_objects": 100000000
                }
            }
        }
    }
}
        Command being timed: "./bin/radosgw-admin bucket check --bucket=b01b-000000000000"
        User time (seconds): 0.52
        System time (seconds): 0.21
        Percent of CPU this job got: 1%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:57.88
        ^^^^^^^^^^^^^^^^^^^^^^^^^                    ~~~~~~~
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 52608
                                            ^^^^^
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 184
        Minor (reclaiming a frame) page faults: 4285
        Voluntary context switches: 18759
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 59608
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


❯ atop -1 4
    PID       TID   MINFLT    MAJFLT   VSTEXT    VSLIBS    VDATA    VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE     VGROW    RGROW    SWAPSZ   RUID        EUID         MEM    CMD        1/1
1807722         -      1/s       0/s    16.5M     27.7M   665.1M    692.0K     0.0K      4.0G    47.2M       0B        0B       0B        0B   root        root          0%    radosgw-admin
                                                                                                 ^^^^^

❯ sudo time -v ./bin/radosgw-admin bucket check --fix --bucket=b01b-000000000000
    ...
        Command being timed: "./bin/radosgw-admin bucket check --fix --bucket=b01b-000000000000"
        User time (seconds): 0.70
        System time (seconds): 0.25
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:49.75
                                                     ^^^^^^^
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 52960
                                            ^^^^^
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4432
        Voluntary context switches: 27435
        Involuntary context switches: 6
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


❯ atop -1 4
    PID       TID   MINFLT    MAJFLT   VSTEXT    VSLIBS    VDATA    VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE     VGROW    RGROW    SWAPSZ   RUID        EUID         MEM    CMD        1/1
1812859         -      0/s       0/s    16.5M     27.7M   665.0M    692.0K     0.0K      4.0G    47.6M       0B        0B       0B        0B   root        root          0%    radosgw-admin
                                                                                                 ^^^^^

#### AFTER PR:
~~~~~~~~~~~~~

❯ sudo time -v ./bin/radosgw-admin bucket check --bucket=b01b-000000000000
    ...
        Command being timed: "./bin/radosgw-admin bucket check --bucket=b01b-000000000000"
        User time (seconds): 0.95
        System time (seconds): 0.34
        Percent of CPU this job got: 1%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26.49
                                                     ^^^^^^^
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 52196
                                            ^^^^^
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4164
        Voluntary context switches: 19322
        Involuntary context switches: 3
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


❯ atop -1 4
    PID       TID   MINFLT    MAJFLT   VSTEXT    VSLIBS    VDATA    VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE     VGROW    RGROW    SWAPSZ   RUID        EUID         MEM    CMD        1/1
1843838         -      2/s       0/s    16.6M     27.7M   661.2M    692.0K     0.0K      4.0G    44.0M       0B        0B       0B        0B   root        root          0%    radosgw-admin
                                                                                                 ^^^^^

❯ sudo time -v ./bin/radosgw-admin bucket check --fix --bucket=b01b-000000000000
    ...
        Command being timed: "./bin/radosgw-admin bucket check --fix --bucket=b01b-000000000000"
        User time (seconds): 1.28
        System time (seconds): 0.43
        Percent of CPU this job got: 1%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20.75
                                                     ^^^^^^^
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 52240
                                            ^^^^^
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4175
        Voluntary context switches: 28772
        Involuntary context switches: 14
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


❯ atop -1 4
    PID       TID   MINFLT    MAJFLT   VSTEXT    VSLIBS    VDATA    VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE     VGROW    RGROW    SWAPSZ   RUID        EUID         MEM    CMD        1/1
1836143         -      1/s       0/s    16.6M     27.7M   661.2M    692.0K     0.0K      4.0G    44.3M       0B        0B       0B        0B   root        root          0%    radosgw-admin
                                                                                                 ^^^^^
