
improve trace client performance using concurrent resource load #5726

Merged — 4 commits merged into crossplane:master on Jul 13, 2024

Conversation

gotwarlost
Contributor

@gotwarlost gotwarlost commented May 22, 2024

Description of your changes

fixes #5707

Add a loader with configurable concurrency to load resources concurrently. The xrm client delegates resource loading to the loader and supports a functional option to set the concurrency.

Add a `--concurrency` flag for the `crank beta trace` command and configure the xrm client appropriately.
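For readers skimming the PR, here is a minimal sketch of the worker-pool pattern the loader uses: a bounded number of goroutines drain a channel of work items. The names (`workItem`, `load`) are illustrative only; see `cmd/crank/beta/trace/internal/resource/xrm/loader.go` in this PR for the real implementation, which walks the full resource tree.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// workItem is a stand-in for the (parent, child reference) pairs the real
// loader sends to its workers.
type workItem struct {
	name string
}

// load starts `concurrency` workers that drain a shared channel of work
// items, then waits for all of them to finish.
func load(ctx context.Context, items []workItem, concurrency int) {
	ch := make(chan workItem)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range ch {
				// The real loader fetches the resource from the API
				// server here and records its children.
				fmt.Println("loading", item.name)
			}
		}()
	}

submit:
	for _, it := range items {
		select {
		case ch <- it:
		case <-ctx.Done():
			break submit // stop submitting work if the caller gave up
		}
	}
	close(ch)
	wg.Wait()
}

func main() {
	load(context.Background(), []workItem{{"a"}, {"b"}, {"c"}}, 2)
}
```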

I have:

  - [x] Read and followed Crossplane's [contribution process].
  - [x] Run `make reviewable` to ensure this PR is ready for review.
  - [x] Added or updated unit tests.
  - [ ] Added or updated e2e tests.
  - [ ] Linked a PR or a [docs tracking issue] to [document this change].
  - [ ] Added backport release-x.y labels to auto-backport this PR.

@gotwarlost gotwarlost requested review from phisco and a team as code owners May 22, 2024 22:37
@gotwarlost gotwarlost requested a review from turkenh May 22, 2024 22:37
@gotwarlost
Contributor Author

gotwarlost commented May 22, 2024

A totally bad benchmark

This only serves to show that increasing concurrency does matter.

Some numbers from my laptop after hard-coding QPS and burst to 50 and 100 so that the kube client doesn't back off and become the bottleneck. There are ~420 children for the resource being traced.
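(For context on the "hard-coding QPS and burst" above: these are client-go's client-side rate-limit knobs on `rest.Config`. A minimal sketch using the 50/100 values from this benchmark; this is not code from the PR.)

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	cfg := &rest.Config{}
	cfg.QPS = 50    // sustained client-side requests per second
	cfg.Burst = 100 // short-term burst allowance before throttling kicks in
	fmt.Println("QPS:", cfg.QPS, "Burst:", cfg.Burst)
}
```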

$ for c in 1 2 3 4 5 10 20 30 40; do echo concurrency=$c;time crank beta trace --concurrency=$c \
        myfacade/facade-1>/dev/null; done

concurrency=1

real	1m29.923s
user	0m0.835s
sys	0m0.308s

concurrency=2

real	0m45.674s
user	0m0.773s
sys	0m0.301s

concurrency=3

real	0m32.130s
user	0m0.733s
sys	0m0.298s

concurrency=4

real	0m24.895s
user	0m0.753s
sys	0m0.254s

concurrency=5

real	0m20.722s
user	0m0.739s
sys	0m0.286s

concurrency=10

real	0m20.902s
user	0m0.727s
sys	0m0.267s

concurrency=20

real	0m10.015s
user	0m0.635s
sys	0m0.217s

concurrency=30

real	0m12.343s
user	0m0.613s
sys	0m0.214s

concurrency=40

real	0m13.128s
user	0m0.598s
sys	0m0.203s

As you can see, at some point increasing the concurrency stops making a difference and there are diminishing returns.

@gotwarlost gotwarlost force-pushed the trace-perf branch 2 times, most recently from 389d7cc to b5d8a44 on May 22, 2024 23:30
@gotwarlost
Contributor Author

Interesting - lint is failing for the Go 1.22 int-range usage, but `make reviewable` passed on my laptop...

@gotwarlost
Contributor Author

It seems to me that the linter error is spurious, since the code doesn't compile with that change (given 1.21 in go.mod).

@phisco
Contributor

phisco commented May 23, 2024

go.mod's directive is only the minimum required version; what matters is the version of the Go binary, which is set to 1.22.3 in CI.

I have to go through it more thoroughly, but in the meantime thanks for your contribution!

Just a breadcrumb: another possible optimization I left behind during the initial implementation was to use the cached discovery client; unfortunately, wiring up all the usual Kubernetes flags wasn't so straightforward given that we aren't using pflag/cobra. That would probably reduce the need to fine-tune the QPS and burst settings.
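(To illustrate the breadcrumb above: client-go ships a cached discovery wrapper that avoids repeated discovery round trips. A minimal sketch, assuming an in-memory cache and a kubeconfig-based config; this is not code from this PR, and wiring it into crank would still need the flag plumbing mentioned above.)

```go
package main

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Wrap the discovery client so repeated GVK/GVR lookups are served from
	// memory instead of issuing extra API requests.
	cached := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cached)
	_ = mapper
}
```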

@gotwarlost
Contributor Author

what matters is the version of the go binary

I'm not able to compile locally even with go 1.22.3:

$ git diff
diff --git a/cmd/crank/beta/trace/internal/resource/xrm/loader.go b/cmd/crank/beta/trace/internal/resource/xrm/loader.go
index d6686f40..795a6d27 100644
--- a/cmd/crank/beta/trace/internal/resource/xrm/loader.go
+++ b/cmd/crank/beta/trace/internal/resource/xrm/loader.go
@@ -79,7 +79,7 @@ func (l *loader) load(ctx context.Context, concurrency int) {
                concurrency = defaultConcurrency
        }
        var wg sync.WaitGroup
-       for i := 0; i < concurrency; i++ {
+       for range concurrency {
                wg.Add(1)
                go func() {
                        defer wg.Done()
$ go version

go version go1.22.3 darwin/arm64

$ go install ./cmd/crank
# github.com/crossplane/crossplane/cmd/crank/beta/trace/internal/resource/xrm
cmd/crank/beta/trace/internal/resource/xrm/loader.go:82:12: cannot range over concurrency (variable of type int): requires go1.22 or later (-lang was set to go1.21; check go.mod)
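(A minimal reproduction of the constraint, for readers following along: range over an integer is a go1.22 language feature, and the effective language version comes from the `go` directive in go.mod, not from the installed toolchain.)

```go
// This file builds only if the module's go.mod declares `go 1.22` or later;
// with `go 1.21` the compiler rejects the range-over-int below even when the
// installed toolchain is go1.22.x.
package main

import "fmt"

func main() {
	for i := range 3 { // requires language version go1.22
		fmt.Println(i)
	}
}
```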

@phisco
Contributor

phisco commented May 23, 2024

🤔 you are right, I didn't realise what it was complaining about 🤔

@phisco phisco self-assigned this May 25, 2024
Member

@jbw976 jbw976 left a comment


Definitely curious what the performance would be in your 500+ resource scenario @gotwarlost after #5737 is merged. That looks to be fixing a big recent regression in performance...

cmd/crank/beta/trace/trace.go — review comment (outdated, resolved)
@gotwarlost
Contributor Author

gotwarlost commented May 27, 2024

The second commit fixes the test structure and sorts children after load. It also sets QPS and burst, but I will get rid of that with a rebase from master.

@gotwarlost
Contributor Author

Not sure why the E2E tests are failing - the failing test cases seem unrelated to the changes in this PR.

@gotwarlost
Contributor Author

Hello, how can I help move this PR forward? Please let me know.

@phisco
Contributor

phisco commented Jun 12, 2024

Hi, sorry, I'll make sure to have a look at it tomorrow. In the meantime, could you rerun the benchmarks after the fix, as @jbw976 suggested?

@gotwarlost
Contributor Author

gotwarlost commented Jun 17, 2024

Benchmarks for three cases: before the QPS change, after the QPS change, and the current PR. Note that the QPS change does not affect performance, since resources are accessed sequentially at the current head.

# before qps/ burst was set
$ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
HEAD is now at 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x

$ go install ./cmd/crank
$ time crank beta trace facade-type/facade-name -s >/dev/null

real	1m28.786s
user	0m1.141s
sys	0m0.385s

# after qps/ burst was set
$ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
Previous HEAD position was 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x
HEAD is now at b404cd0c Merge pull request #5737 from phisco/gimme-more-requests

$ go install ./cmd/crank
$ time crank beta trace facade-type/facade-name -s >/dev/null # basically no changes, since access is one-at-a-time

real	1m29.491s
user	0m1.095s
sys	0m0.404s

# with concurrency
$ git checkout trace-perf
Previous HEAD position was b404cd0c Merge pull request #5737 from phisco/gimme-more-requests
Switched to branch 'trace-perf'
$ go install ./cmd/crank
$ for c in 1 2 4 8 16 32
> do
> echo Concurrency=$c
> time crank beta trace facade-type/facade-name -s --concurrency=$c >/dev/null
> done 

# output slightly formatted for newlines to make the timings appear next to the concurrency value

Concurrency=1
real	1m28.942s
user	0m1.247s
sys	0m0.439s

Concurrency=2
real	0m48.002s
user	0m1.034s
sys	0m0.363s

Concurrency=4
real	0m24.246s
user	0m0.943s
sys	0m0.337s

Concurrency=8
real	0m16.333s
user	0m0.975s
sys	0m0.347s

Concurrency=16
real	0m15.118s
user	0m0.998s
sys	0m0.371s

Concurrency=32
I0617 13:25:06.604160   96865 request.go:697] Waited for 1.020916333s due to client-side throttling, not priority and fairness, request: GET:https://49528D6033D107FEF93C8D493C0EDC30.gr7.eu-west-1.eks.amazonaws.com/apis/autoscaling.aws.upbound.io/v1beta1/grouptags/redacted

real	0m13.566s
user	0m0.986s
sys	0m0.352s

@jbw976
Member

jbw976 commented Jun 20, 2024

Nice @gotwarlost, thank you for running these benchmarks in your environment; that definitely shows an improvement when running with more concurrency! 💪 It's interesting that b404cd0 had no effect for you, but your explanation makes sense for environments where the RTT of each serialized network request is significant enough.

I wanted to take these changes for a test spin as well, and I got different results in my environment, which uses a local kind cluster where the network RTT from the kubectl client is very small.

TL;DR: Higher QPS/Burst (b404cd0) does make a significant difference for me in this local kind scenario, while higher concurrency doesn't make much difference. I still think we should proceed with your general approach of adding concurrency support, since it clearly makes an improvement for more real world environments 🎉 , but it is still interesting to note this behavior difference 🤓

To set up my perf testing scenario, I created a test repo with an XTracePerf XRD and composition that essentially creates a variable number of ConfigMap objects using provider-kubernetes and function-kcl. I used the steps and resources in this repo to create a claim/XR with 250 composed resources, then ran the same steps you did to test the same commits and the changes in this PR.

trace perf testing repo: https://github.com/jbw976/xtrace-perf

250 ConfigMap objects created:

❯ kubectl get configmap --no-headers=true | grep '^object[-][0-9]' | wc -l
     250

Test 914b83a (QPS and burst missing): 49 seconds

❯ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
❯ go install ./cmd/crank
❯ time crank beta trace traceperf/traceperf-tester > /dev/null
crank beta trace traceperf/traceperf-tester > /dev/null  0.42s user 0.16s system 1% cpu 49.001 total

Test b404cd0 (QPS and burst settings restored): 11.8 seconds

❯ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
❯ go install ./cmd/crank
❯ time crank beta trace traceperf/traceperf-tester > /dev/null
crank beta trace traceperf/traceperf-tester > /dev/null  0.28s user 0.14s system 3% cpu 11.867 total

Test this PR #5726 with new concurrency setting: Min: 11.08 seconds @ 8 concurrency, Max: 11.89 seconds @ 1 concurrency

❯ git checkout gotwarlost/trace-perf
HEAD is now at 3c14da317 Apply suggestions from code review
❯ go install ./cmd/crank
❯ for c in 1 2 4 8 16 32; do echo concurrency=$c; time crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null; done
concurrency=1
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.32s user 0.16s system 4% cpu 11.885 total
concurrency=2
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.13s system 3% cpu 11.104 total
concurrency=4
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.29s user 0.12s system 3% cpu 11.097 total
concurrency=8
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.28s user 0.11s system 3% cpu 11.083 total
concurrency=16
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.12s system 3% cpu 11.096 total
concurrency=32
I0619 18:12:18.001407   84861 request.go:697] Waited for 1.043695958s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:56038/apis/kubernetes.crossplane.io/v1alpha2/objects/object-143
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.13s system 3% cpu 11.114 total

I will take a full code review pass through the code changes as well, but I wanted to share those performance findings, as well as the reusable https://github.com/jbw976/xtrace-perf testing repo that could benefit others. Once again, I don't think these results indicate we shouldn't accept this change, as they are for a less important and less common local laptop cluster environment, with different network RTT characteristics. The concurrency support should definitely be beneficial for remote clusters 🙇‍♂️

@gotwarlost
Contributor Author

gotwarlost commented Jun 20, 2024

Those are interesting findings and certainly very different from what I have. I've noticed in the past that the network round-trip time massively affects performance.

On a kind cluster where we run E2E tests, I had never noticed a performance problem (everything loads in a few seconds); it is only when running the command on EKS clusters and other setups with higher network latency that all these settings become material.

My hypothesis is that resources load so fast in your kind cluster that the client's QPS/burst becomes the bottleneck, because it executes many more Kubernetes requests per second than it otherwise would with network latency in the way. In my setup, where network latency is much higher, individual resources take long enough to load that the client stays within the default QPS setting.
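(A rough back-of-the-envelope check of this reasoning, using the numbers from the earlier ~420-child EKS benchmark in this thread; illustrative only.)

```go
package main

import "fmt"

func main() {
	const (
		children      = 420.0
		sequentialSec = 90.0 // ~1m30s observed at concurrency=1
		qps           = 50.0 // client-side rate limit used in the benchmark
	)
	// With sequential loads the run is latency-bound: roughly 0.21s per request.
	perRequest := sequentialSec / children
	// The QPS limit alone would only impose a ~8.4s floor, far below 90s, so
	// the rate limiter never bites until concurrency removes the latency
	// bottleneck.
	qpsFloor := children / qps
	fmt.Printf("per-request latency ~= %.2fs, QPS-imposed floor ~= %.1fs\n", perRequest, qpsFloor)
}
```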

I'll re-run my tests for the facade composition on a kind cluster and report here. I suspect the behavior will be closer to what you have than what I previously published.

@jbw976
Member

jbw976 commented Jun 20, 2024

My hypothesis is that resources load so fast in your kind cluster that the client's QPS/burst becomes the bottleneck, because it executes many more Kubernetes requests per second than it otherwise would with network latency in the way. In my setup, where network latency is much higher, individual resources take long enough to load that the client stays within the default QPS setting.

Exactly @gotwarlost, I 💯 believe that hypothesis as well! It makes logical sense 🙇‍♂️

@gotwarlost
Contributor Author

Yes, I can reproduce results similar to yours on a kind cluster with ~480 resources:
55s drops to 13s after the QPS fix, with marginal to no improvement from concurrency.

$ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
HEAD is now at 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x

$ go install ./cmd/crank
$ crank beta trace facade-type/name -s

real	0m55.567s
user	0m0.609s
sys	0m0.215s

$ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
HEAD is now at b404cd0c Merge pull request #5737 from phisco/gimme-more-requests
$ go install ./cmd/crank
$ crank beta trace facade-type/name -s

real	0m12.911s
user	0m0.539s
sys	0m0.196s

$ git checkout trace-perf
Switched to branch 'trace-perf'
$ go install ./cmd/crank
$ for c in 1 2 4 8 16 32; do echo Concurrency=$c; time crank beta trace facade-type/name -s --concurrency=$c >/dev/null; done

Concurrency=1
real	0m12.847s
user	0m0.494s
sys	0m0.184s

Concurrency=2
real	0m11.714s
user	0m0.551s
sys	0m0.174s

Concurrency=4
real	0m11.585s
user	0m0.568s
sys	0m0.174s

Concurrency=8
real	0m11.409s
user	0m0.586s
sys	0m0.190s

Concurrency=16
real	0m10.978s
user	0m0.577s
sys	0m0.194s

Concurrency=32
I0620 12:00:41.315141    7075 request.go:697] Waited for 1.036958833s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:60368/apis/some/redacted/resource
real	0m10.040s
user	0m0.556s
sys	0m0.193s

Member

@jbw976 jbw976 left a comment


I took a pass through the code today too @gotwarlost, thanks again for this contribution! Most of my comments are around helping future readers of this code understand it quickly, i.e. simplifying things and adding some more plain-English comments to help people keep track of where they are within the nested concurrency/goroutines. Let me know what you think of these suggestions, thank you! 🙇‍♂️

fixes crossplane#5707

Add a loader with configurable concurrency to load resources in a concurrent
manner. The xrm client delegates resource loading to the loader and supports
a functional option to set the concurrency.

Add a `--concurrency` flag for the `crank beta trace` command and configure the
xrm client appropriately.

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
* sorts children after concurrent load using the same algorithm that
  crossplane uses to sort resource refs
* sets default QPS and burst when not set to the same values as a recent PR
* updates test structure to match existing tests

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
* remove context from the loader interface when not needed
* change member name from l to rl
* simplify tests to lose regexes
* add concurrency docs
* remove module global for channel capacity, pass in as parameter
* use range over ints instead of for loops since linter complains otherwise
* do not run goroutines when there are no child refs to add
* add more docs for sorting logic and why it is needed

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
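(The commit notes above mention sorting children after concurrent load. The reason is that with concurrent workers, children finish loading in a nondeterministic order, so the result must be re-sorted to stay deterministic. Below is a minimal sketch of sorting by a stable key; the key shown is illustrative, whereas the PR reuses Crossplane's existing resource-ref ordering.)

```go
package main

import (
	"fmt"
	"sort"
)

// ref is a simplified stand-in for a child resource reference.
type ref struct{ APIVersion, Kind, Namespace, Name string }

// sortRefs orders children by a stable key so the output does not depend on
// which worker goroutine finished first.
func sortRefs(refs []ref) {
	sort.Slice(refs, func(i, j int) bool {
		a, b := refs[i], refs[j]
		if a.APIVersion != b.APIVersion {
			return a.APIVersion < b.APIVersion
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
}

func main() {
	refs := []ref{{"v1", "B", "", "x"}, {"v1", "A", "", "y"}}
	sortRefs(refs)
	fmt.Println(refs)
}
```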
@gotwarlost
Contributor Author

Thanks for the detailed PR review @jbw976 - I resolved all but one of your comments, the code is much better for it!

Member

@jbw976 jbw976 left a comment


Awesome @gotwarlost, thank you very much for incorporating the feedback; I appreciate you taking the time to do that. I took it for another test spin today on a live cluster and verified that all the items are returned as expected across multiple runs, so it is all working reliably for me. I also ran the unit tests with count=1000 and those passed perfectly too, so I'm feeling pretty good about the correctness of this change.

Just one final question from me about the for-loop variable gotcha - if you can confirm your thinking that we're in the clear on that, that would be good!

for _, ref := range refs {
	l.ch <- workItem{
		parent: parent,
		child:  ref,
@jbw976 (Member)


Is this susceptible to the infamous go for loop variable gotcha? https://dev.to/kkentzo/the-golang-for-loop-gotcha-1n35

Looking a bit further, it appears this was fixed with go 1.22, which we appear to be using: https://github.com/crossplane/crossplane/blob/master/go.mod#L3

Do we completely not have to worry about this anymore? Or are there further steps we would need to take? Should we continue to be defensive about this regardless, out of habit/tradition? What do you think? 🤔
https://go.dev/blog/loopvar-preview

@gotwarlost (Contributor Author)


This code also works on Go 1.21, BTW, since there are no pointers or goroutines involved that use the loop variable outside its intended scope. Also, as you rightly said, Go 1.22 doesn't have this issue anymore.

I can put a ref := ref in the loop if you think that is more consistent with the codebase, but I don't know that it is necessary.
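(A minimal illustration of why the pattern above is safe on any Go version; this is not the PR's code.)

```go
package main

import "fmt"

// The classic gotcha only bites when a closure or pointer captures the loop
// variable itself, e.g. `go func() { use(ref) }()` before Go 1.22. Sending a
// value through a channel copies it at send time, so nothing references the
// loop variable after the iteration ends.
type workItem struct{ child string }

func main() {
	refs := []string{"a", "b", "c"}
	ch := make(chan workItem, len(refs))
	for _, ref := range refs {
		ch <- workItem{child: ref} // value copy, no captured reference
	}
	close(ch)
	for item := range ch {
		fmt.Println(item.child)
	}
}
```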

@jbw976 (Member)


Ah yes, looking more closely again, you are correct that this would also work OK before Go 1.22! A copy of the ref variable is sent through the channel, so ref is never used outside the scope of this loop 🙇‍♂️

No need to make a change here as far as I'm concerned!

Member

@jbw976 jbw976 left a comment


Thanks again for a great contribution that will make real-world usage of crossplane beta trace much more performant! A great addition for v1.17 🤩

@jbw976 jbw976 merged commit 2c4631f into crossplane:master Jul 13, 2024
16 of 17 checks passed