Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move perf test command to separate subcommand #2168

Merged
merged 1 commit into from
Dec 15, 2023

Conversation

marseel
Copy link
Contributor

@marseel marseel commented Dec 8, 2023

I really tried to be mindful of splitting that into commits, but unfortunately, I ended up refactoring the whole thing 😢

Summary:

  • Move cilium connectivity test --perf to separate subcommand cilium connectivity perf
  • Allow to run host/pod tests together (before you could only run one of), while also allow to test host-to-pod, pod-to-host etc.
  • Before NodeSelector was totally not working in perf test and we were forcing users to use one zone for nodes, now nodeSelector works and we just warn in case nodes are in different zones (so now we can test both in-zone and across-zones)
  • Added possibility to export results in json format that can be visualized with perfdash - this will be used for tracking datapath performance in CI
  • Visualize operation rate next to latency in console output (as they were from the same netperf test)
  • Always cleanup perf pods before performing tests - to make sure we respect nodeSelector
  • Refactored netperf integration, it was full of hacks.

Help for perf subcommand:

cilium connectivity perf --help
Test network performance

Usage:
  cilium connectivity perf [flags]

Flags:
  -d, --debug                             Show debug messages
      --deployment-pod-annotations json   Add annotations to the connectivity pods, e.g. '{"client":{"foo":"bar"}}' (default {})
      --duration duration                 Duration for the Performance test to run (default 10s)
  -h, --help                              help for perf
      --host-net                          Test host network
      --node-selector stringToString      Restrict connectivity pods to nodes matching this label (default [])
      --performance-image string          Image path to use for performance (default "quay.io/cilium/network-perf:a816f935930cb2b40ba43230643da4d5751a5711@sha256:679d3a370c696f63884da4557a4466f3b5569b4719bb4f86e8aac02fbe390eea")
      --pod-net                           Test pod network (default true)
      --report-dir string                 Directory to save perf results in json format
      --samples int                       Number of Performance samples to capture (how many times to run each test) (default 1)
      --test-namespace string             Namespace to perform the connectivity in (default "cilium-test")

Global Flags:
      --context string     Kubernetes configuration context
  -n, --namespace string   Namespace Cilium is running in (default "kube-system")

Example output of full tests:

🔥 Network Performance Test Summary:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 Scenario        | Node       | Test            | Duration        | Min             | Mean            | Max             | P50             | P90             | P99             | Transaction rate OP/s
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
📋 pod-to-pod      | same-node  | TCP_RR          | 1s              | 16µs            | 36.6µs          | 6.983ms         | 22µs            | 54µs            | 116µs           | 27126.67    
📋 pod-to-pod      | same-node  | UDP_RR          | 1s              | 14µs            | 30.04µs         | 1.442ms         | 17µs            | 48µs            | 90µs            | 33049.72    
📋 pod-to-pod      | same-node  | TCP_CRR         | 1s              | 286µs           | 490.63µs        | 7.024ms         | 458µs           | 560µs           | 1.125ms         | 2034.74     
📋 pod-to-host     | same-node  | TCP_RR          | 1s              | 12µs            | 31.94µs         | 1.674ms         | 38µs            | 44µs            | 81µs            | 31080.33    
📋 pod-to-host     | same-node  | UDP_RR          | 1s              | 12µs            | 31.07µs         | 1.704ms         | 37µs            | 44µs            | 88µs            | 31974.69    
📋 pod-to-host     | same-node  | TCP_CRR         | 1s              | 190µs           | 333.39µs        | 6.135ms         | 313µs           | 396µs           | 710µs           | 2991.20     
📋 host-to-pod     | same-node  | TCP_RR          | 1s              | 12µs            | 32.51µs         | 7.725ms         | 37µs            | 45µs            | 99µs            | 30553.66    
📋 host-to-pod     | same-node  | UDP_RR          | 1s              | 12µs            | 27.8µs          | 1.226ms         | 22µs            | 42µs            | 83µs            | 35702.75    
📋 host-to-pod     | same-node  | TCP_CRR         | 1s              | 187µs           | 361.23µs        | 9.06ms          | 319µs           | 415µs           | 1.02ms          | 2761.72     
📋 host-to-host    | same-node  | TCP_RR          | 1s              | 12µs            | 30.9µs          | 4.123ms         | 35µs            | 42µs            | 83µs            | 32146.29    
📋 host-to-host    | same-node  | UDP_RR          | 1s              | 11µs            | 28.42µs         | 2.547ms         | 34µs            | 40µs            | 101µs           | 34959.06    
📋 host-to-host    | same-node  | TCP_CRR         | 1s              | 62µs            | 146.65µs        | 8.981ms         | 127µs           | 172µs           | 506µs           | 6794.35     
📋 pod-to-pod      | other-node | TCP_RR          | 1s              | 331µs           | 730.65µs        | 13ms            | 650µs           | 1.037ms         | 2.125ms         | 1363.17     
📋 pod-to-pod      | other-node | UDP_RR          | 1s              | 310µs           | 636.93µs        | 4.923ms         | 581µs           | 891µs           | 1.45ms          | 1564.81     
📋 pod-to-pod      | other-node | TCP_CRR         | 1s              | 936µs           | 2.12513ms       | 9.839ms         | 1.585ms         | 5.566ms         | 6.85ms          | 469.93      
📋 pod-to-host     | other-node | TCP_RR          | 1s              | 243µs           | 529.46µs        | 7.377ms         | 499µs           | 727µs           | 1.058ms         | 1881.60     
📋 pod-to-host     | other-node | UDP_RR          | 1s              | 205µs           | 506.11µs        | 16.028ms        | 414µs           | 684µs           | 2.066ms         | 1969.76     
📋 pod-to-host     | other-node | TCP_CRR         | 1s              | 607µs           | 1.17934ms       | 15.836ms        | 1.085ms         | 1.529ms         | 2.5ms           | 845.96      
📋 host-to-pod     | other-node | TCP_RR          | 1s              | 254µs           | 537.74µs        | 18.134ms        | 481µs           | 736µs           | 1.287ms         | 1851.63     
📋 host-to-pod     | other-node | UDP_RR          | 1s              | 222µs           | 489.01µs        | 4.161ms         | 446µs           | 696µs           | 1.181ms         | 2036.72     
📋 host-to-pod     | other-node | TCP_CRR         | 1s              | 590µs           | 1.11054ms       | 10.26ms         | 1.03ms          | 1.436ms         | 2.3ms           | 898.38      
📋 host-to-host    | other-node | TCP_RR          | 1s              | 223µs           | 503.58µs        | 4.419ms         | 452µs           | 685µs           | 1.45ms          | 1977.50     
📋 host-to-host    | other-node | UDP_RR          | 1s              | 203µs           | 430.1µs         | 4.677ms         | 388µs           | 600µs           | 1.05ms          | 2315.76     
📋 host-to-host    | other-node | TCP_CRR         | 1s              | 529µs           | 949.29µs        | 8.49ms          | 894µs           | 1.252ms         | 1.9ms           | 1049.70     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
📋 Scenario        | Node       | Test            | Duration        | Throughput Mb/s 
-------------------------------------------------------------------------------------
📋 pod-to-pod      | same-node  | TCP_STREAM      | 1s              | 605.80       
📋 pod-to-pod      | same-node  | UDP_STREAM      | 1s              | 443.89       
📋 pod-to-host     | same-node  | TCP_STREAM      | 1s              | 1102.85      
📋 pod-to-host     | same-node  | UDP_STREAM      | 1s              | 612.62       
📋 host-to-pod     | same-node  | TCP_STREAM      | 1s              | 1364.10      
📋 host-to-pod     | same-node  | UDP_STREAM      | 1s              | 645.54       
📋 host-to-host    | same-node  | TCP_STREAM      | 1s              | 6928.44      
📋 host-to-host    | same-node  | UDP_STREAM      | 1s              | 493.50       
📋 pod-to-pod      | other-node | TCP_STREAM      | 1s              | 350.29       
📋 pod-to-pod      | other-node | UDP_STREAM      | 1s              | 141.66       
📋 pod-to-host     | other-node | TCP_STREAM      | 1s              | 1401.42      
📋 pod-to-host     | other-node | UDP_STREAM      | 1s              | 169.30       
📋 host-to-pod     | other-node | TCP_STREAM      | 1s              | 3802.40      
📋 host-to-pod     | other-node | UDP_STREAM      | 1s              | 166.53       
📋 host-to-host    | other-node | TCP_STREAM      | 1s              | 3894.40      
📋 host-to-host    | other-node | UDP_STREAM      | 1s              | 202.85       
-------------------------------------------------------------------------------------

Fixes #1111
Fixes #2114

@marseel marseel force-pushed the netperf_with_perfdash_integration branch from 12a0bec to f0ff431 Compare December 8, 2023 13:59
@marseel marseel added the kind/enhancement This would improve or streamline existing functionality. label Dec 8, 2023
@marseel marseel force-pushed the netperf_with_perfdash_integration branch from f0ff431 to de07125 Compare December 8, 2023 14:04
@marseel marseel changed the title Netperf with perfdash integration Move perf test command to separate subcommand Dec 8, 2023
@marseel marseel force-pushed the netperf_with_perfdash_integration branch from de07125 to c82a67c Compare December 8, 2023 14:10
@marseel marseel force-pushed the netperf_with_perfdash_integration branch from c82a67c to 9abae58 Compare December 8, 2023 14:15
@marseel
Copy link
Contributor Author

marseel commented Dec 8, 2023

/cc @darox @learnitall
as you worked on the perf test before, I would appreciate your reviews.

@marseel marseel added the dont-merge/preview-only Only for preview or testing, don't merge it. label Dec 11, 2023
@marseel marseel marked this pull request as ready for review December 11, 2023 09:34
@marseel marseel requested review from a team as code owners December 11, 2023 09:34
@marseel marseel force-pushed the netperf_with_perfdash_integration branch from 05dfd80 to 9248a9a Compare December 11, 2023 09:52
@darox
Copy link
Contributor

darox commented Dec 11, 2023

/cc @darox @learnitall as you worked on the perf test before, I would appreciate your reviews.

Great, I planned to work on refactoring, but you were faster. Planning to have a look at it this week (obviously customer tickets have priority).

@marseel
Copy link
Contributor Author

marseel commented Dec 11, 2023

I think we can diregard EKS (tunnel) / EKS (tunnel) Installation and Connectivity Test (classic) (pull_request_target) failure as it pulls in-cluster-test-scripts from main, where it tries to run cilium connectivity test --perf --perf-duration 1s

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marseel Nice work! Just a few small picks...

d.Latency["p90"][0], d.Metric,
d.Latency["p99"][0], d.Metric)
ct.Header("🔥 Network Performance Test Summary:")
ct.Logf("%s", strings.Repeat("-", 200))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be cleaner to define a PerfResult type vs the simple map and hang Printer methods to output the different perf tables. Also you could hang helpers aka nodeString.
We could perhaps leverage a table printing lib and use it for tabular output.
Lastly we could sort the results to make it for a quick perf scan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that using some kind of table lib would be great.

Copy link
Contributor Author

@marseel marseel Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I definitely agree with using some lib for printing table and also cleaning up this mess, but I think it's out of scope of this PR. I am happy to create follow-up issue for that.
Re sorting, results are already sorted in PerfResults - that's why I changed it from map to list (before they weren't sorted) + added sorting here

sort.SliceStable(ct.perfServerPod, func(i, j int) bool {

While I would do it differently normally and sort results in the presentation layer, I wanted to do it as simply as possible for now as this part was not really my goal of improving console presentation format.

connectivity/perf/benchmarks/netperf/perfpod.go Outdated Show resolved Hide resolved
connectivity/perf/common/metrics.go Show resolved Hide resolved
res.Latency = nil
// Verify that throughput unit is 10^6bits/s
if values[8] != "10^6bits/s" {
a.Fatal("Unable to process netperf result")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this call should return an err and let the call site decide vs bombing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's worth it - we already handle a bunch of errors through a.Fatal - FYI this is just failing a single action, not the whole program.

Copy link
Contributor

@learnitall learnitall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this, I'm very excited for the perfdash integration. I have a couple of suggestions and nits, please let me know what you think.

d.Latency["p90"][0], d.Metric,
d.Latency["p99"][0], d.Metric)
ct.Header("🔥 Network Performance Test Summary:")
ct.Logf("%s", strings.Repeat("-", 200))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that using some kind of table lib would be great.

connectivity/perf/common/metrics.go Outdated Show resolved Hide resolved
connectivity/perf/common/metrics.go Outdated Show resolved Hide resolved
connectivity/perf/common/metrics.go Outdated Show resolved Hide resolved
internal/cli/cmd/connectivity.go Outdated Show resolved Hide resolved
internal/cli/cmd/connectivity.go Show resolved Hide resolved
connectivity/perf/benchmarks/netperf/perfpod.go Outdated Show resolved Hide resolved
connectivity/perf/benchmarks/netperf/perfpod.go Outdated Show resolved Hide resolved
@marseel
Copy link
Contributor Author

marseel commented Dec 12, 2023

Thanks for the reviews, I pushed changed as separate commit to make it easier for you to review it - I will squash them later on.

Copy link
Contributor

@derailed derailed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marseel SG - Thanks for the updates!

@darox
Copy link
Contributor

darox commented Dec 13, 2023

/cc @darox @learnitall as you worked on the perf test before, I would appreciate your reviews.

Great, I planned to work on refactoring, but you were faster. Planning to have a look at it this week (obviously customer tickets have priority).

I see you removed me from the reviewers. Makes sense, the others have deeper Go knowledge than me.

Before, perf test was part of `connectivity test` subcommand.
Now, perf test has separate `connectivity perf` subcommand.
We allow to run both host and pod network perf test in a single
command.
We allow to test host-to-pod type of traffic too.
Perf test now respects nodeSelector and allows to run for nodes in two
different zones.
Always force deploy test pods for perf test to respect nodeSelector.
You can export perf test results in json format, which is compatible
with perfdash.

Fixes: #1111
Fixes: #2114

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@marseel
Copy link
Contributor Author

marseel commented Dec 13, 2023

Since I removed pull_request: {} from workflows I expect that the test will be failing right now -but this is fine, they were passing before with it.

Copy link
Contributor

@ldelossa ldelossa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

@marseel marseel removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Dec 13, 2023
@marseel
Copy link
Contributor Author

marseel commented Dec 13, 2023

I'm marking this PR as ready-to-merge as tests passed before with pull_request, but they do not pass currently with pull_request_target as I am modyfing in-cluster-test-scripts

@marseel marseel added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Dec 13, 2023
@nebril nebril merged commit 4e51788 into main Dec 15, 2023
7 of 13 checks passed
@nebril nebril deleted the netperf_with_perfdash_integration branch December 15, 2023 11:07
@pchaigno
Copy link
Member

I believe this pull request broke main. At least, the following is currently failing for me on main locally and in my rebased PR:

$ make && ./cilium connectivity perf --duration 1s
CGO_ENABLED=0 go build  \
	-ldflags "-w -s \
	-X 'github.com/cilium/cilium-cli/cli.Version=v0.15.18'" \
	-o cilium \
	./cmd/cilium
unknown flag: --duration

@pchaigno
Copy link
Member

Hm, my local clone wasn't on latest main and after pulling it doesn't fail anymore. And apparently GitHub sometimes lies about PRs being up-to-date because I had this:
Screenshot from 2023-12-15 18-49-28
Yet I just updated the branch after rebasing...

So false alert (and weird GH bug). Sorry for the noise!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge connectivity test --perf-* flags Split Performance Tests into their Own Sub-Command
9 participants