chore: adds memoize implementation for regexes and ahocorasick #836
Currently we allocate memory for every regex we compile. However, there are cases where the same regex is compiled over and over, e.g. corazawaf/coraza-caddy#76. Here we implement the memoize pattern so that compiled regexes can be reused, reducing memory consumption.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #836 +/- ##
==========================================
- Coverage 81.59% 77.36% -4.24%
==========================================
Files 159 157 -2
Lines 9013 8738 -275
==========================================
- Hits 7354 6760 -594
- Misses 1412 1749 +337
+ Partials 247 229 -18
We should include a flush or reset function so we can delete everything after a web server reload. That function must be exported. I would include it in experimental/coraza |
There is a |
WAF close is not coming in v3, as we cannot break the API; we should just export all helpers in experimental until we can merge them. But closing the cache is different, as it is global and not per-WAF.
Not a breaking change if we don't force people to use it. ATM there is no close method and people still live with it. Although adding |
If adding the method to the interface is too concerning we could:
|
I believe we can't do this for proxy-wasm but I do believe we might have the same issue with different wafs for authorities. Any thought @anuraaga ? |
Is the question whether we can add a Close method? As I mentioned before, we could have documented it better already, but I don't think it's too late to document that the WAF interface is not for external implementation, without worrying about a major version; it's still very early. For this PR, though, I don't see how a WAF Close method would help since, IIUC, the cache is global, not per WAF. Having a global method in experimental for now seems OK.
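To make the "global method in experimental" idea concrete, here is a minimal sketch of a process-wide reset. It assumes the cache is a package-level `sync.Map`; the names `builderCache` and `ResetMemoizeCache` are hypothetical and are not taken from the PR.

```go
package main

import "sync"

// builderCache stands in for the global memoize cache (hypothetical name).
var builderCache sync.Map

// ResetMemoizeCache drops every cached entry. A connector could call it
// after a web server reload so stale regexes can be garbage collected.
// Hypothetical helper: the exported name in the PR may differ.
func ResetMemoizeCache() {
	builderCache.Range(func(key, _ interface{}) bool {
		// Deleting during Range is permitted by sync.Map.
		builderCache.Delete(key)
		return true
	})
}
```

Because the cache is global, a single call clears the entries of every WAF in the process, which is why a per-WAF Close method would not map cleanly onto it.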
internal/memoize/cache.go
Outdated
//go:build !tinygo

// Highly inspired in https://github.com/patrickmn/go-cache/blob/master/cache.go
This is ok but this file looks like an obvious mutex-guarded map, not sure it's worth referencing anything. Anyways, it looks like a sync.Map should be much faster for this; its docs specifically mention the write-once, read-many case of a cache.
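A rough sketch of what a `sync.Map`-backed regex cache could look like follows. It is not the PR's code: `regexCache` and `cachedRegexp` are hypothetical names, and the sketch accepts that two goroutines racing on the same new pattern may both compile it once before one result wins.

```go
package main

import (
	"regexp"
	"sync"
)

// regexCache maps pattern source to *regexp.Regexp (hypothetical name).
var regexCache sync.Map

// cachedRegexp returns a shared compiled regex for pattern.
func cachedRegexp(pattern string) (*regexp.Regexp, error) {
	// Fast path: the read-mostly case sync.Map is optimized for.
	if v, ok := regexCache.Load(pattern); ok {
		return v.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err // errors are not cached
	}
	// If another goroutine stored the pattern first, reuse its value so
	// every caller ends up sharing a single *regexp.Regexp.
	actual, _ := regexCache.LoadOrStore(pattern, re)
	return actual.(*regexp.Regexp), nil
}
```

Calling `cachedRegexp` twice with the same pattern returns the same pointer, so the second call allocates nothing for the compiled program.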
PTAL @anuraaga @jptosso I did some simplifications and removed the I believe in coraza-caddy and WAF in general is that you don’t do radical changes between deployments, e.g. if you change your ruleset you won’t change 100 regexes, mostly just a couple so it is still OK to keep the global cache that will keep alive a few remaining regexes until next deployment. If we want to really be precise we should do something like |
@@ -456,7 +457,12 @@ func (r *Rule) AddVariable(v variables.RuleVariable, key string, iscount bool) e
	var re *regexp.Regexp
	if len(key) > 2 && key[0] == '/' && key[len(key)-1] == '/' {
		key = key[1 : len(key)-1]
		re = regexp.MustCompile(key)

		if vare, err := memoize.Do(key, func() (interface{}, error) { return regexp.Compile(key) }); err != nil {
It would be worth extracting a function for the two usages.
You mean for the regex and binaryregex?
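One way the extraction suggested above might look is sketched here. `memoizeDo` is a deliberately tiny, unsynchronized stand-in for the PR's `memoize.Do` (only the call shape is taken from the diff), and `compileMemoized` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// memo and memoizeDo are simplified stand-ins for the real memoize package.
var memo = map[string]interface{}{}

func memoizeDo(key string, fn func() (interface{}, error)) (interface{}, error) {
	if v, ok := memo[key]; ok {
		return v, nil
	}
	v, err := fn()
	if err != nil {
		return nil, err
	}
	memo[key] = v
	return v, nil
}

// compileMemoized hides the closure and the type assertion so that each
// call site shrinks to a single line (hypothetical helper).
func compileMemoized(pattern string) (*regexp.Regexp, error) {
	v, err := memoizeDo(pattern, func() (interface{}, error) {
		return regexp.Compile(pattern)
	})
	if err != nil {
		return nil, fmt.Errorf("compiling %q: %w", pattern, err)
	}
	return v.(*regexp.Regexp), nil
}
```

Both `AddVariable` and `AddVariableNegation` could then call `compileMemoized(key)` instead of repeating the `memoize.Do` closure.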
internal/memoize/memoize.go
Outdated
func makeDoer(cache *sync.Map, group *singleflight.Group) func(string, func() (interface{}, error)) (interface{}, error, bool) {
	return func(key string, fn func() (interface{}, error)) (interface{}, error, bool) {
		// Check cache
		value, found := cache.Load(key)
Nit, can combine two lines
Ah we need to add a ci build with the new flag. In particular with tinygo
@anuraaga we do that already, see: As for tinygo, we already run tests for tinygo https://github.com/corazawaf/coraza/blob/main/.github/workflows/tinygo.yml#L49 and we don't have to enable |
internal/memoize/README.md
Outdated
the same process and hence same regexes are being compiled
over and over.

Currently it is opt-in under the `memoize_regex` build tag
Nit: what about also adding a one-line description in the main readme, under https://github.com/corazawaf/coraza#build-tags?
Ah, missed the magefile change. But doesn't this need to apply to proxy-wasm? If it's only Caddy, that seems narrower than I expected. Yeah, I know I recommended sync.Map but forgot about TinyGo; we should probably have a non-sync version for it so we still have memoization with Envoy, where the memory penalty of multiple WAFs is way higher than with Caddy.
Believe it or not @anuraaga, I was thinking exactly the same thing an hour or so ago. We don't need to be concurrency-safe in TinyGo, as you would launch one WAF at a time, so this could be applied perfectly well to proxy-wasm.
BTW, TinyGo does have an implementation of sync.Map, which of course doesn't actually synchronize: https://github.com/tinygo-org/tinygo/blob/release/src/sync/map.go. So I think the current code should work, but we should make sure in CI.
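For comparison, a non-synchronized variant of the kind discussed for TinyGo might look like the sketch below. It assumes a single-threaded runtime; `tinyCache` and `Do` are illustrative names, and in a real layout the file would presumably carry a `//go:build tinygo` constraint next to a `sync.Map` variant in a sibling file.

```go
package main

// The build constraint is intentionally omitted so that this sketch
// compiles with any Go toolchain.

// tinyCache is a plain map with no locking, on the assumption that only
// one goroutine ever touches it (hypothetical name).
var tinyCache = map[string]interface{}{}

// Do returns the cached value for key and calls fn only on a miss.
// Errors are not cached, so a failing fn is retried on the next call.
func Do(key string, fn func() (interface{}, error)) (interface{}, error) {
	if v, ok := tinyCache[key]; ok {
		return v, nil
	}
	v, err := fn()
	if err != nil {
		return nil, err
	}
	tinyCache[key] = v
	return v, nil
}
```

The signature matches the call shape used in the diff, so call sites would not need to know which variant was compiled in.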
I will check and also add memoization for aho-corasick trees.
internal/corazawaf/rule.go
Outdated
@@ -521,7 +527,11 @@ func (r *Rule) AddVariableNegation(v variables.RuleVariable, key string) error {
	var re *regexp.Regexp
	if len(key) > 2 && key[0] == '/' && key[len(key)-1] == '/' {
		key = key[1 : len(key)-1]
		re = regexp.MustCompile(key)
		if vare, err := memoize.Do(key, func() (interface{}, error) { return regexp.Compile(key) }); err != nil {
			panic(err)
We should never panic; you can return an error here.
Agree, however this should be fixed in main; I am just reproducing the MustCompile behaviour.
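As a sketch of the error-returning alternative raised in this thread, the `/.../` key handling from the diff can be written with `regexp.Compile` and a wrapped error instead of a panic. `parseKeyRegex` is a hypothetical helper, not a function from the Coraza codebase.

```go
package main

import (
	"fmt"
	"regexp"
)

// parseKeyRegex compiles key when it is written as /pattern/ and returns
// (nil, nil) when key is a plain, non-regex key. Invalid patterns are
// reported as errors rather than panics.
func parseKeyRegex(key string) (*regexp.Regexp, error) {
	if len(key) > 2 && key[0] == '/' && key[len(key)-1] == '/' {
		re, err := regexp.Compile(key[1 : len(key)-1])
		if err != nil {
			return nil, fmt.Errorf("invalid regex key %q: %w", key, err)
		}
		return re, nil
	}
	return nil, nil
}
```

Since `AddVariableNegation` already returns `error`, a caller could simply propagate the error from such a helper.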
…t synced as no concurrency.
PTAL @anuraaga |
@@ -106,6 +106,9 @@ have compatibility guarantees across minor versions - use with care.
  the operator with `plugins.RegisterOperator` to reduce binary size / startup overhead.
* `coraza.rule.multiphase_valuation` - enables evaluation of rule variables in the phases that they are ready, not
  only the phase the rule is defined for.
* `memoize_builders` - enables memoization of builders for regex and aho-corasick
Ping @anuraaga
Ping @llinder @anuraaga @jptosso