Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsproxy: Improve regex pattern #20246

Merged
merged 5 commits into from Jan 30, 2023
Merged

Conversation

odinuge
Copy link
Member

@odinuge odinuge commented Jun 19, 2022

TL;DR: This is now V3 of #20214.

The full list of changes can be found in the commits (tho. this is based off #21288);

  • Extract regex anchors to the beginning and end of the full pattern
  • Remove the use of regex groups since we don't use them
  • Short circuit matchPatterns with just a wildcard and match all fqdns
  • Skip running .MatchString on wildcard regexes.

The overall perf change for all commits in this PR vs. #21288 (all without the lru cache, and ignore the time/op, since they are running in docker for mac 😢 ):

$ benchstat <pre-this-pr> <this-pr>
name                             old time/op          new time/op          delta
_perEPAllow_setPortRulesForID-7           77.4s ± 8%           83.0s ±17%     ~     (p=0.310 n=5+5)

name                             old B(HeapInUse)/op  new B(HeapInUse)/op  delta
_perEPAllow_setPortRulesForID-7           97.3M ± 1%           74.2M ± 1%  -23.75%  (p=0.008 n=5+5)

name                             old alloc/op         new alloc/op         delta
_perEPAllow_setPortRulesForID-7          78.6GB ± 0%          73.0GB ± 0%   -7.18%  (p=0.008 n=5+5)

name                             old allocs/op        new allocs/op        delta
_perEPAllow_setPortRulesForID-7            167M ± 0%            108M ± 0%  -35.37%  (p=0.008 n=5+5)

After this we plan doing;

  • Switch from string concatenation to a byte buffer in GeneratePattern
  • Consider the regex for matchPatterns + map[string]struct{} for matchNames.
  • Replacing the [.] with \. to reduce the regex size
  • Replacing the * replacement from [-a-zA-Z0-9_]* to [^.]*
  • ...

--

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

Fixes: #issue-number

dnsproxy: Improve regex used for matching dns queries by reducing its complexity and size to save memory and speed up matching

@odinuge odinuge requested a review from a team June 19, 2022 16:08
@odinuge odinuge requested a review from a team as a code owner June 19, 2022 16:08
@odinuge odinuge requested review from jibi and jrajahalme June 19, 2022 16:08
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 19, 2022
@odinuge odinuge changed the title [WIP][DNM] dnsproxy: Improve dnsproxy regex pattern [WIP][DNM] dnsproxy: Improve regex pattern Jun 19, 2022
@odinuge odinuge force-pushed the ou-regex-improvements-v2 branch 3 times, most recently from 0830dc1 to 311ed71 Compare June 19, 2022 16:12
@christarazi christarazi self-requested a review June 20, 2022 00:01
@christarazi christarazi added kind/performance There is a performance impact of this. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Jun 20, 2022
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! It would be great if we can get benchmarking code added to the tests. There should already be a framework that tests at a high-level inside pkg/fqdn/dnsproxy/proxy_test.go. However, it would be good to add more piece-meal benchmarks corresponding to the code changes in this PR.

Benchmark_perEPAllow_setPortRulesForID_large is skipped by default, but if you remove the t.Skip() and add the CNPs as a yaml to testdata/cnps-large.yaml and run go test, you should be able to get some results.

@odinuge
Copy link
Member Author

odinuge commented Jun 20, 2022

Hey @christarazi! Thanks! Did the initial testing on the v1.9 release branch, since thats what we are using atm., so didn't see those benchmarks.

I don't have a set of policies at hand atm., I've just been creating a set of domains to push through. Do you have the CNP list you were using in that PR?

Also, discussed this with a colleague, and we decided to try using a set (essentially a map[string]bool) for the fqdn names, and then keeping the regex for patterns. That was waaaaay faster on the synthetic tests as well, and the memory allocations aren't even comparable. Will need to test in some real environment first tho. Will hopefully do some quick implementation this week, and will ship my benchmarking code as soon as I can, hopefully before end of week.

@odinuge
Copy link
Member Author

odinuge commented Jun 21, 2022

Did a quick test now with this PR vs. master, with a lru size of 128 with the same dataset as @christarazi was using with his initial testing;

# This PR
$ go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/fqdn/dnsproxy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark_perEPAllow_setPortRulesForID_large
Before (N=1)
Alloc = 32 MiB  HeapInuse = 53 MiB      Sys = 982 MiB   NumGC = 15
After (N=1)
Alloc = 369 MiB HeapInuse = 378 MiB     Sys = 982 MiB   NumGC = 37
Before (N=1)
Alloc = 34 MiB  HeapInuse = 52 MiB      Sys = 1017 MiB  NumGC = 48
After (N=1)
Alloc = 237 MiB HeapInuse = 300 MiB     Sys = 1017 MiB  NumGC = 71
Benchmark_perEPAllow_setPortRulesForID_large-7                 1        5772238001 ns/op        2161473136 B/op 23052832 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/fqdn/dnsproxy      21.727s
# Baseline / master
root@eab730ccc555:/code# go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/fqdn/dnsproxy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark_perEPAllow_setPortRulesForID_large
Before (N=1)
Alloc = 32 MiB  HeapInuse = 50 MiB      Sys = 990 MiB   NumGC = 14
After (N=1)
Alloc = 464 MiB HeapInuse = 484 MiB     Sys = 990 MiB   NumGC = 35
Before (N=1)
Alloc = 36 MiB  HeapInuse = 54 MiB      Sys = 1101 MiB  NumGC = 45
After (N=1)
Alloc = 527 MiB HeapInuse = 538 MiB     Sys = 1101 MiB  NumGC = 66
Benchmark_perEPAllow_setPortRulesForID_large-7                 1        5331359995 ns/op        3019854640 B/op 26485069 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/fqdn/dnsproxy      23.638s

We clearly see the heap usage is very much lower with this PR. Still need some tweaking tho. Will try to grab a bigger dataset and check the diff

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, this is a great step towards a massive performance improvement in the Agent!

I understand this is a WIP :), and I have several comments about the code, but overall the idea is sound!

pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy.go Outdated Show resolved Hide resolved
pkg/fqdn/dnsproxy/proxy_test.go Outdated Show resolved Hide resolved
@christarazi christarazi added release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 21, 2022
@aanm
Copy link
Member

aanm commented Jul 8, 2022

@odinuge I've marked the PR as draft since it's in WIP. Feel free to mark it as ready for review once it's done. Thank you!

@aanm aanm marked this pull request as draft July 8, 2022 21:54
@odinuge
Copy link
Member Author

odinuge commented Jul 13, 2022

I've marked the PR as draft since it's in WIP. Feel free to mark it as ready for review once it's done. Thank you!

Thanks @aanm! And thanks @christarazi for the review. Did a overall cleanup, but as discussed AFK, we will look at #20396 first.

Sneak peak of the benchmarks, where heap size goes from ~1800MiB to ~550MiB ;

root@43144d6cc6bb:/code# go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/fqdn/dnsproxy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark_perEPAllow_setPortRulesForID_large
Before Setup (N=1,EPs=20,cache=128)
Alloc = 3 MiB   HeapInuse = 5 MiB       Sys = 20 MiB    NumGC = 6
Before Test (N=1,EPs=20,cache=128)
Alloc = 37 MiB  HeapInuse = 51 MiB      Sys = 988 MiB   NumGC = 18
After Test (N=1,EPs=20,cache=128)
Alloc = 1701 MiB        HeapInuse = 1837 MiB    Sys = 2964 MiB  NumGC = 46
Benchmark_perEPAllow_setPortRulesForID_large-7                 1        15223701428 ns/op       10021494552 B/op        46482958 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/fqdn/dnsproxy      20.774s

With this PR;

root@43144d6cc6bb:/code# go test -tags privileged_tests -v -run '^$' -bench Benchmark_perEPAllow_setPortRulesForID_large -benchmem -benchtime 1x -memprofile memprofile.out ./pkg/fqdn/dnsproxy
goos: linux
goarch: amd64
pkg: github.com/cilium/cilium/pkg/fqdn/dnsproxy
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark_perEPAllow_setPortRulesForID_large
Before Setup (N=1,EPs=20,cache=128)
Alloc = 3 MiB   HeapInuse = 5 MiB       Sys = 19 MiB    NumGC = 6
Before Test (N=1,EPs=20,cache=128)
Alloc = 37 MiB  HeapInuse = 49 MiB      Sys = 938 MiB   NumGC = 18
After Test (N=1,EPs=20,cache=128)
Alloc = 409 MiB HeapInuse = 542 MiB     Sys = 939 MiB   NumGC = 50
Benchmark_perEPAllow_setPortRulesForID_large-7                 1        11047563811 ns/op       4351742336 B/op 40233104 allocs/op
PASS
ok      github.com/cilium/cilium/pkg/fqdn/dnsproxy      17.455s

pkg/fqdn/dns/dns.go Outdated Show resolved Hide resolved
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.12.0 Jul 15, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Dec 5, 2022
@odinuge odinuge marked this pull request as ready for review December 5, 2022 20:05
@odinuge odinuge requested a review from a team as a code owner December 5, 2022 20:05
@odinuge odinuge requested review from christarazi and removed request for jibi and jrajahalme December 6, 2022 09:13
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I just realized is that once this lands, will it change the behavior of existing policies? Is there a potential it might break on existing matchpatterns in user environments?

pkg/fqdn/dnsproxy/proxy.go Show resolved Hide resolved
@christarazi
Copy link
Member

christarazi commented Dec 6, 2022

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathConfig AutoDirectNodeRoutes Check direct connectivity with per endpoint routes

Failure Output

FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.12.5 Dec 6, 2022
@odinuge
Copy link
Member Author

odinuge commented Dec 6, 2022

Hi, thanks!

One thing I just realized is that once this lands, will it change the behavior of existing policies? Is there a potential it might break on existing matchpatterns in user environments?

This should be a noop for users, since what the regex will match is the same, its just the regex pattern that is different. It should also be fully backwards compatible with older versions with the restore feature as well (unless I'm missing anything obvious..).

I think the unit-tests here and here should cover all cases, and make sure that the end user behavior is the same.

@github-actions
Copy link

github-actions bot commented Jan 6, 2023

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 6, 2023
@christarazi christarazi removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 6, 2023
@christarazi
Copy link
Member

/test

This now wraps the overall regex in anchors and keep the distinct
rules/fqdns as normal non-anchored regexes. It also removes the regex
groups since we don't use them. This makes the final regex much smaller,
(the heap in the benchmark is ~20% smaller), and makes it more
performant when used for matching.

$ benchstat <pre-this-commit> <this-commit>
name                             old time/op          new time/op          delta
_perEPAllow_setPortRulesForID-7           77.4s ± 8%           84.8s ±21%     ~     (p=0.310 n=5+5)

name                             old B(HeapInUse)/op  new B(HeapInUse)/op  delta
_perEPAllow_setPortRulesForID-7           97.3M ± 1%           78.3M ± 1%  -19.51%  (p=0.008 n=5+5)

name                             old alloc/op         new alloc/op         delta
_perEPAllow_setPortRulesForID-7          78.6GB ± 0%          77.2GB ± 0%   -1.86%  (p=0.008 n=5+5)

name                             old allocs/op        new allocs/op        delta
_perEPAllow_setPortRulesForID-7            167M ± 0%            113M ± 0%  -32.47%  (p=0.008 n=5+5)

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
If there is a wildcard rule in the ruleset, return a regex matching
everything early. This will reduce the size in memory, and the perf of
the dnsproxy with such rules in general.

$ benchstat <pre-this-commit> <this-commit>
name                             old time/op          new time/op          delta
_perEPAllow_setPortRulesForID-7           84.8s ±21%           83.0s ±17%    ~     (p=0.690 n=5+5)

name                             old B(HeapInUse)/op  new B(HeapInUse)/op  delta
_perEPAllow_setPortRulesForID-7           78.3M ± 1%           74.2M ± 1%  -5.27%  (p=0.008 n=5+5)

name                             old alloc/op         new alloc/op         delta
_perEPAllow_setPortRulesForID-7          77.2GB ± 0%          73.0GB ± 0%  -5.42%  (p=0.008 n=5+5)

name                             old allocs/op        new allocs/op        delta
_perEPAllow_setPortRulesForID-7            113M ± 0%            108M ± 0%  -4.28%  (p=0.008 n=5+5)

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
If we know the regex match all input, we don't really need to run the
matching. This will reduce the overall memory usage and allocations in
this code path.

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Also switch to %q to get quotes around it as well.

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Reuse escapeRegexpCharacters func that implement the same escaping to
avoid duplication.

Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@aanm
Copy link
Member

aanm commented Jan 27, 2023

/test

@aanm aanm self-assigned this Jan 29, 2023
@aanm
Copy link
Member

aanm commented Jan 30, 2023

Travis failed with #23314

@aanm aanm merged commit 0ca71e0 into cilium:master Jan 30, 2023
@odinuge
Copy link
Member Author

odinuge commented Jan 30, 2023

Thanks @aanm!

We deployed this to production here over the last few weeks, and on the hosts with the biggest l7 rulests and the most dns queries, including a matchPattern: * rule for some stuff, we saw a memory reduction of more than 50% (the graph is per host/node, and the drop is when deploying a version with this patchset).

Screen Shot 2023-01-30 at 11 52 51 AM

@odinuge odinuge deleted the ou-regex-improvements-v2 branch January 30, 2023 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. kind/community-contribution This was a contribution made by a community member. kind/performance There is a performance impact of this. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants