
improve trace client performance using concurrent resource load #5726

Merged — 4 commits merged into crossplane:master on Jul 13, 2024

Conversation

gotwarlost
Contributor

@gotwarlost gotwarlost commented May 22, 2024

Description of your changes

fixes #5707

Add a loader with configurable concurrency to load resources concurrently. The xrm client delegates resource loading to the loader and supports a functional option to set the concurrency.

Add a `--concurrency` flag for the `crank beta trace` command and configure the xrm client appropriately.
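For readers skimming the PR, here is a minimal sketch of the worker-pool pattern the loader uses: a bounded number of goroutines drain a channel of work items. The names (`workItem`, `load`) are illustrative only; see `cmd/crank/beta/trace/internal/resource/xrm/loader.go` in this PR for the real implementation, which walks the full resource tree.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// workItem is a stand-in for the (parent, child reference) pairs the real
// loader sends to its workers.
type workItem struct {
	name string
}

// load starts `concurrency` workers that drain a shared channel of work
// items, then waits for all of them to finish.
func load(ctx context.Context, items []workItem, concurrency int) {
	ch := make(chan workItem)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range ch {
				// The real loader fetches the resource from the API
				// server here and records its children.
				fmt.Println("loading", item.name)
			}
		}()
	}

submit:
	for _, it := range items {
		select {
		case ch <- it:
		case <-ctx.Done():
			break submit // stop submitting work if the caller gave up
		}
	}
	close(ch)
	wg.Wait()
}

func main() {
	load(context.Background(), []workItem{{"a"}, {"b"}, {"c"}}, 2)
}
```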

I have:

  - [x] Read and followed Crossplane's [contribution process].
  - [x] Run `make reviewable` to ensure this PR is ready for review.
  - [x] Added or updated unit tests.
  - [ ] Added or updated e2e tests.
  - [ ] Linked a PR or a [docs tracking issue] to [document this change].
  - [ ] Added backport release-x.y labels to auto-backport this PR.

@gotwarlost gotwarlost requested review from phisco and a team as code owners May 22, 2024 22:37
@gotwarlost gotwarlost requested a review from turkenh May 22, 2024 22:37
@gotwarlost
Contributor Author

gotwarlost commented May 22, 2024

A totally bad benchmark

This only serves to show that increasing concurrency does matter.

Some numbers from my laptop after hard-coding QPS and burst to 50 and 100 so that the kube client doesn't back off and become the bottleneck. There are ~420 children for the resource being traced.
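(For context on the "hard-coding QPS and burst" above: these are client-go's client-side rate-limit knobs on `rest.Config`. A minimal sketch using the 50/100 values from this benchmark; this is not code from the PR.)

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	cfg := &rest.Config{}
	cfg.QPS = 50    // sustained client-side requests per second
	cfg.Burst = 100 // short-term burst allowance before throttling kicks in
	fmt.Println("QPS:", cfg.QPS, "Burst:", cfg.Burst)
}
```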

$ for c in 1 2 3 4 5 10 20 30 40; do echo concurrency=$c;time crank beta trace --concurrency=$c \
        myfacade/facade-1>/dev/null; done

concurrency=1

real	1m29.923s
user	0m0.835s
sys	0m0.308s

concurrency=2

real	0m45.674s
user	0m0.773s
sys	0m0.301s

concurrency=3

real	0m32.130s
user	0m0.733s
sys	0m0.298s

concurrency=4

real	0m24.895s
user	0m0.753s
sys	0m0.254s

concurrency=5

real	0m20.722s
user	0m0.739s
sys	0m0.286s

concurrency=10

real	0m20.902s
user	0m0.727s
sys	0m0.267s

concurrency=20

real	0m10.015s
user	0m0.635s
sys	0m0.217s

concurrency=30

real	0m12.343s
user	0m0.613s
sys	0m0.214s

concurrency=40

real	0m13.128s
user	0m0.598s
sys	0m0.203s

As you can see, at some point increasing the concurrency stops making a difference and there are diminishing returns.

@gotwarlost gotwarlost force-pushed the trace-perf branch 2 times, most recently from 389d7cc to b5d8a44 on May 22, 2024 23:30
@gotwarlost
Contributor Author

Interesting - lint is failing for the Go 1.22 int-range usage, but `make reviewable` passed on my laptop...

@gotwarlost
Contributor Author

It seems to me that the linter error is spurious, since the code doesn't compile with that change (given 1.21 in go.mod).

@phisco
Contributor

phisco commented May 23, 2024

go.mod's directive is only the minimum required version; what matters is the version of the Go binary, which is set to 1.22.3 in CI.

I have to go through it more thoroughly, but in the meantime thanks for your contribution!

Just a breadcrumb: another possible optimization I left behind during the initial implementation was to use the cached discovery client; unfortunately, wiring up all the usual Kubernetes flags wasn't so straightforward given that we aren't using pflag/cobra. That would probably reduce the need to fine-tune the QPS and burst settings.
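(To illustrate the breadcrumb above: client-go ships a cached discovery wrapper that avoids repeated discovery round trips. A minimal sketch, assuming an in-memory cache and a kubeconfig-based config; this is not code from this PR, and wiring it into crank would still need the flag plumbing mentioned above.)

```go
package main

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Wrap the discovery client so repeated GVK/GVR lookups are served from
	// memory instead of issuing extra API requests.
	cached := memory.NewMemCacheClient(dc)
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(cached)
	_ = mapper
}
```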

@gotwarlost
Contributor Author

what matters is the version of the go binary

I'm not able to compile locally even with go 1.22.3:

$ git diff
diff --git a/cmd/crank/beta/trace/internal/resource/xrm/loader.go b/cmd/crank/beta/trace/internal/resource/xrm/loader.go
index d6686f40..795a6d27 100644
--- a/cmd/crank/beta/trace/internal/resource/xrm/loader.go
+++ b/cmd/crank/beta/trace/internal/resource/xrm/loader.go
@@ -79,7 +79,7 @@ func (l *loader) load(ctx context.Context, concurrency int) {
                concurrency = defaultConcurrency
        }
        var wg sync.WaitGroup
-       for i := 0; i < concurrency; i++ {
+       for range concurrency {
                wg.Add(1)
                go func() {
                        defer wg.Done()
$ go version

go version go1.22.3 darwin/arm64

$ go install ./cmd/crank
# github.com/crossplane/crossplane/cmd/crank/beta/trace/internal/resource/xrm
cmd/crank/beta/trace/internal/resource/xrm/loader.go:82:12: cannot range over concurrency (variable of type int): requires go1.22 or later (-lang was set to go1.21; check go.mod)
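(A minimal reproduction of the constraint, for readers following along: range over an integer is a go1.22 language feature, and the effective language version comes from the `go` directive in go.mod, not from the installed toolchain.)

```go
// This file builds only if the module's go.mod declares `go 1.22` or later;
// with `go 1.21` the compiler rejects the range-over-int below even when the
// installed toolchain is go1.22.x.
package main

import "fmt"

func main() {
	for i := range 3 { // requires language version go1.22
		fmt.Println(i)
	}
}
```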

@phisco
Contributor

phisco commented May 23, 2024

🤔 you are right, I didn't realise what it was complaining about 🤔

@phisco phisco self-assigned this May 25, 2024
Member

@jbw976 jbw976 left a comment


Definitely curious what the performance would be in your 500+ resource scenario @gotwarlost after #5737 is merged. That looks to be fixing a big recent regression in performance...

cmd/crank/beta/trace/trace.go — review comment (outdated, resolved)
@gotwarlost
Contributor Author

gotwarlost commented May 27, 2024

The second commit fixes the test structure and sorts children after load. It also sets QPS and burst, but I will get rid of that with a rebase from master.

@gotwarlost
Contributor Author

Not sure why the E2E tests are failing - the failing test cases seem unrelated to the changes in this PR.

@gotwarlost
Contributor Author

Hello, how can I help move this PR forward? Please let me know.

@phisco
Contributor

phisco commented Jun 12, 2024

Hi, sorry, I'll make sure to have a look at it tomorrow. In the meantime, could you rerun the benchmarks after the fix, as @jbw976 suggested?

@gotwarlost
Contributor Author

gotwarlost commented Jun 17, 2024

Benchmarks for three cases: before the QPS change, after the QPS change, and the current PR. Note that the QPS change does not affect performance, since resources are accessed sequentially at the current head.

# before qps/ burst was set
$ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
HEAD is now at 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x

$ go install ./cmd/crank
$ time crank beta trace facade-type/facade-name -s >/dev/null

real	1m28.786s
user	0m1.141s
sys	0m0.385s

# after qps/ burst was set
$ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
Previous HEAD position was 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x
HEAD is now at b404cd0c Merge pull request #5737 from phisco/gimme-more-requests

$ go install ./cmd/crank
$ time crank beta trace facade-type/facade-name -s >/dev/null # basically no changes, since access is one-at-a-time

real	1m29.491s
user	0m1.095s
sys	0m0.404s

# with concurrency
$ git checkout trace-perf
Previous HEAD position was b404cd0c Merge pull request #5737 from phisco/gimme-more-requests
Switched to branch 'trace-perf'
$ go install ./cmd/crank
$ for c in 1 2 4 8 16 32
> do
> echo Concurrency=$c
> time crank beta trace facade-type/facade-name -s --concurrency=$c >/dev/null
> done 

# output slightly formatted for newlines to make the timings appear next to the concurrency value

Concurrency=1
real	1m28.942s
user	0m1.247s
sys	0m0.439s

Concurrency=2
real	0m48.002s
user	0m1.034s
sys	0m0.363s

Concurrency=4
real	0m24.246s
user	0m0.943s
sys	0m0.337s

Concurrency=8
real	0m16.333s
user	0m0.975s
sys	0m0.347s

Concurrency=16
real	0m15.118s
user	0m0.998s
sys	0m0.371s

Concurrency=32
I0617 13:25:06.604160   96865 request.go:697] Waited for 1.020916333s due to client-side throttling, not priority and fairness, request: GET:https://49528D6033D107FEF93C8D493C0EDC30.gr7.eu-west-1.eks.amazonaws.com/apis/autoscaling.aws.upbound.io/v1beta1/grouptags/redacted

real	0m13.566s
user	0m0.986s
sys	0m0.352s

@jbw976
Member

jbw976 commented Jun 20, 2024

Nice @gotwarlost, thank you for running these benchmarks in your environment; that definitely shows an improvement when running with more concurrency! 💪 It's interesting that b404cd0 had no effect for you, but your explanation makes sense for environments where the RTT of each serialized network request is significant enough.

I wanted to take these changes for a test spin as well, and I got different results in my environment, which uses a local kind cluster where the network RTT from the kubectl client is very small.

TL;DR: Higher QPS/Burst (b404cd0) does make a significant difference for me in this local kind scenario, while higher concurrency doesn't make much difference. I still think we should proceed with your general approach of adding concurrency support, since it clearly makes an improvement for more real world environments 🎉 , but it is still interesting to note this behavior difference 🤓

To set up my perf testing scenario, I created a test repo with an XTracePerf XRD and composition that essentially creates a variable number of ConfigMap objects using provider-kubernetes and function-kcl. I used the steps and resources in this repo to create a claim/XR with 250 composed resources, then ran the same steps you did to test the same commits and the changes in this PR.

trace perf testing repo: https://github.com/jbw976/xtrace-perf

250 ConfigMap objects created:

❯ kubectl get configmap --no-headers=true | grep '^object[-][0-9]' | wc -l
     250

Test 914b83a (QPS and burst missing): 49 seconds

❯ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
❯ go install ./cmd/crank
❯ time crank beta trace traceperf/traceperf-tester > /dev/null
crank beta trace traceperf/traceperf-tester > /dev/null  0.42s user 0.16s system 1% cpu 49.001 total

Test b404cd0 (QPS and burst settings restored): 11.8 seconds

❯ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
❯ go install ./cmd/crank
❯ time crank beta trace traceperf/traceperf-tester > /dev/null
crank beta trace traceperf/traceperf-tester > /dev/null  0.28s user 0.14s system 3% cpu 11.867 total

Test this PR #5726 with new concurrency setting: Min: 11.08 seconds @ 8 concurrency, Max: 11.89 seconds @ 1 concurrency

❯ git checkout gotwarlost/trace-perf
HEAD is now at 3c14da317 Apply suggestions from code review
❯ go install ./cmd/crank
❯ for c in 1 2 4 8 16 32; do echo concurrency=$c; time crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null; done
concurrency=1
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.32s user 0.16s system 4% cpu 11.885 total
concurrency=2
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.13s system 3% cpu 11.104 total
concurrency=4
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.29s user 0.12s system 3% cpu 11.097 total
concurrency=8
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.28s user 0.11s system 3% cpu 11.083 total
concurrency=16
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.12s system 3% cpu 11.096 total
concurrency=32
I0619 18:12:18.001407   84861 request.go:697] Waited for 1.043695958s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:56038/apis/kubernetes.crossplane.io/v1alpha2/objects/object-143
crank beta trace traceperf/traceperf-tester --concurrency=$c > /dev/null  0.31s user 0.13s system 3% cpu 11.114 total

I will take a full code review pass through the code changes as well, but I wanted to share those performance findings, as well as the reusable https://github.com/jbw976/xtrace-perf testing repo that could benefit others. Once again, I don't think these results indicate we shouldn't accept this change, as they are for a less important and less common local laptop cluster environment, with different network RTT characteristics. The concurrency support should definitely be beneficial for remote clusters 🙇‍♂️

@gotwarlost
Contributor Author

gotwarlost commented Jun 20, 2024

Those are interesting findings and certainly very different from what I have. I've noticed in the past that the network round-trip time massively affects performance.

On a kind cluster where we run E2E tests, I had never noticed a performance problem (everything loads in a few seconds); it is only when running the command on EKS clusters and other setups with higher network latency that all these settings become material.

My hypothesis is that resources load so fast in your kind cluster that the client's QPS/burst becomes the bottleneck, because it executes many more Kubernetes requests per second than it otherwise would with network latency in the way. In my setup, where network latency is much higher, individual resources take long enough to load that the client stays within the default QPS setting.
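(A rough back-of-the-envelope check of this reasoning, using the numbers from the earlier ~420-child EKS benchmark in this thread; illustrative only.)

```go
package main

import "fmt"

func main() {
	const (
		children      = 420.0
		sequentialSec = 90.0 // ~1m30s observed at concurrency=1
		qps           = 50.0 // client-side rate limit used in the benchmark
	)
	// With sequential loads the run is latency-bound: roughly 0.21s per request.
	perRequest := sequentialSec / children
	// The QPS limit alone would only impose a ~8.4s floor, far below 90s, so
	// the rate limiter never bites until concurrency removes the latency
	// bottleneck.
	qpsFloor := children / qps
	fmt.Printf("per-request latency ~= %.2fs, QPS-imposed floor ~= %.1fs\n", perRequest, qpsFloor)
}
```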

I'll re-run my tests for the facade composition on a kind cluster and report here. I suspect the behavior will be closer to what you have than what I previously published.

@jbw976
Member

jbw976 commented Jun 20, 2024

My hypothesis is that resources load so fast in your kind cluster that the client's QPS/burst becomes the bottleneck, because it executes many more Kubernetes requests per second than it otherwise would with network latency in the way. In my setup, where network latency is much higher, individual resources take long enough to load that the client stays within the default QPS setting.

Exactly @gotwarlost, I 💯 believe that hypothesis as well! It makes logical sense 🙇‍♂️

@gotwarlost
Contributor Author

Yes, I can reproduce results similar to yours on a kind cluster with ~480 resources:
55s drops to 13s after the QPS fix, with marginal to no improvement from concurrency.

$ git checkout 914b83a3c65e73aef00c9b96aafd57be01dbe98e
HEAD is now at 914b83a3 Merge pull request #5733 from crossplane/renovate/master-github.com-golang-jwt-jwt-v5-5.x

$ go install ./cmd/crank
$ crank beta trace facade-type/name -s

real	0m55.567s
user	0m0.609s
sys	0m0.215s

$ git checkout b404cd0c67edd859c47df06a6e3c718d8951127d
HEAD is now at b404cd0c Merge pull request #5737 from phisco/gimme-more-requests
$ go install ./cmd/crank
$ crank beta trace facade-type/name -s

real	0m12.911s
user	0m0.539s
sys	0m0.196s

$ git checkout trace-perf
Switched to branch 'trace-perf'
$ go install ./cmd/crank
$ for c in 1 2 4 8 16 32; do echo Concurrency=$c; time crank beta trace facade-type/name -s --concurrency=$c >/dev/null; done

Concurrency=1
real	0m12.847s
user	0m0.494s
sys	0m0.184s

Concurrency=2
real	0m11.714s
user	0m0.551s
sys	0m0.174s

Concurrency=4
real	0m11.585s
user	0m0.568s
sys	0m0.174s

Concurrency=8
real	0m11.409s
user	0m0.586s
sys	0m0.190s

Concurrency=16
real	0m10.978s
user	0m0.577s
sys	0m0.194s

Concurrency=32
I0620 12:00:41.315141    7075 request.go:697] Waited for 1.036958833s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:60368/apis/some/redacted/resource
real	0m10.040s
user	0m0.556s
sys	0m0.193s

Member

@jbw976 jbw976 left a comment


I took a pass through the code today too @gotwarlost, thanks again for this contribution! Most of my comments are around helping future readers of this code understand it quickly, i.e. simplifying things and adding some more plain-English comments to help people keep track of where they are within the nested concurrency/goroutines. Let me know what you think of these suggestions, thank you! 🙇‍♂️

fixes crossplane#5707

Add a loader with configurable concurrency to load resources in a concurrent
manner. The xrm client delegates resource loading to the loader and supports
a functional option to set the concurrency.

Add a `--concurrency` flag for the `crank beta trace` command and configure the
xrm client appropriately.

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
* sorts children after concurrent load using the same algorithm that
  crossplane uses to sort resource refs
* sets default QPS and burst when not set to the same values as a recent PR
* updates test structure to match existing tests

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
* remove context from the loader interface when not needed
* change member name from l to rl
* simplify tests to lose regexes
* add concurrency docs
* remove module global for channel capacity, pass in as parameter
* use range over ints instead of for loops since linter complains otherwise
* do not run goroutines when there are no child refs to add
* add more docs for sorting logic and why it is needed

Signed-off-by: gotwarlost <kananthmail-github@yahoo.com>
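(The commit notes above mention sorting children after concurrent load. The reason is that with concurrent workers, children finish loading in a nondeterministic order, so the result must be re-sorted to stay deterministic. Below is a minimal sketch of sorting by a stable key; the key shown is illustrative, whereas the PR reuses Crossplane's existing resource-ref ordering.)

```go
package main

import (
	"fmt"
	"sort"
)

// ref is a simplified stand-in for a child resource reference.
type ref struct{ APIVersion, Kind, Namespace, Name string }

// sortRefs orders children by a stable key so the output does not depend on
// which worker goroutine finished first.
func sortRefs(refs []ref) {
	sort.Slice(refs, func(i, j int) bool {
		a, b := refs[i], refs[j]
		if a.APIVersion != b.APIVersion {
			return a.APIVersion < b.APIVersion
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
}

func main() {
	refs := []ref{{"v1", "B", "", "x"}, {"v1", "A", "", "y"}}
	sortRefs(refs)
	fmt.Println(refs)
}
```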
@gotwarlost
Contributor Author

Thanks for the detailed PR review @jbw976 - I resolved all but one of your comments, the code is much better for it!

Member

@jbw976 jbw976 left a comment


Awesome @gotwarlost, thank you very much for incorporating the feedback; I appreciate you taking the time to do that. I took it for another test spin today on a live cluster and verified that all the items are returned as expected across multiple runs, so it is all working reliably for me. I also ran the unit tests with count=1000 and those passed perfectly too, so I'm feeling pretty good about the correctness of this change.

Just one final question from me about the for-loop variable gotcha - if you can confirm your thinking that we're in the clear on that, that would be good!

for _, ref := range refs {
	l.ch <- workItem{
		parent: parent,
		child:  ref,
@jbw976 (Member)


Is this susceptible to the infamous go for loop variable gotcha? https://dev.to/kkentzo/the-golang-for-loop-gotcha-1n35

Looking a bit further, it appears this was fixed with go 1.22, which we appear to be using: https://github.com/crossplane/crossplane/blob/master/go.mod#L3

Do we completely not have to worry about this anymore? Or are there further steps we would need to take? Should we continue to be defensive about this regardless, out of habit/tradition? What do you think? 🤔
https://go.dev/blog/loopvar-preview

@gotwarlost (Contributor Author)


This code also works on Go 1.21, BTW, since there are no pointers or goroutines involved that use the loop variable outside its intended scope. Also, as you rightly said, Go 1.22 doesn't have this issue anymore.

I can put a ref := ref in the loop if you think that is more consistent with the codebase, but I don't know that it is necessary.
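(A minimal illustration of why the pattern above is safe on any Go version; this is not the PR's code.)

```go
package main

import "fmt"

// The classic gotcha only bites when a closure or pointer captures the loop
// variable itself, e.g. `go func() { use(ref) }()` before Go 1.22. Sending a
// value through a channel copies it at send time, so nothing references the
// loop variable after the iteration ends.
type workItem struct{ child string }

func main() {
	refs := []string{"a", "b", "c"}
	ch := make(chan workItem, len(refs))
	for _, ref := range refs {
		ch <- workItem{child: ref} // value copy, no captured reference
	}
	close(ch)
	for item := range ch {
		fmt.Println(item.child)
	}
}
```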

@jbw976 (Member)


Ah yes, looking more closely again, you are correct that this would also work OK before Go 1.22! A copy of the ref variable is sent through the channel, so ref is never used outside the scope of this loop 🙇‍♂️

No need to make a change here as far as I'm concerned!

Member

@jbw976 jbw976 left a comment


Thanks again for a great contribution that will make real-world usage of crossplane beta trace much more performant! A great addition for v1.17 🤩

@jbw976 jbw976 merged commit 2c4631f into crossplane:master Jul 13, 2024
16 of 17 checks passed