-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a global RegexManager #240
Conversation
63b2697
to
b9f217d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think overall this approach is a good idea.
I noticed a few places where the indentation is a bit messed up. I'd recommend cargo fmt
, but currently it would produce a bunch of unrelated changes, which is not ideal. Could you at least go over the parts you've touched and make sure the indentation is 4 spaces? If it's easier you can copy your code changes into another file and format that one specifically.
|
||
#[cfg(test)] | ||
fn matches_test(&self, request: &request::Request) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of not making breaking changes until the next semver bump, let's leave the matches
and get_regex
signatures untouched. We can add matches_with_regex_manager
in addition to these.
get_regex
can re-compile the regex whenever it's called.matches
could just be a thin wrapper aroundmatches_with_regex_manager
with an emptyRegexManager
; defined in-line with a#[deprecated]
attribute.
That way, we don't need any #[cfg(test)]
methods, and tests wouldn't have to be modified at all either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to force all the instances to use a correct RegexManager
(except the tests).
The thing that I'm afraid that we occasionally call the wrong function in the production code.
Like returning in old signature fn matches(&self, request: &request::Request) -> bool
results in still usin it in some part of adblock-rust or 3rd party product.
That's why I believe we'd better to make breaking changes here.
A singleton as alternative doesn't sounds good..
@antonok-edm PTAL |
@@ -3283,17 +3329,11 @@ mod match_tests { | |||
{ | |||
let filter = r#"/^https?:\/\/([0-9a-z\-]+\.)?(9anime|animeland|animenova|animeplus|animetoon|animewow|gamestorrent|goodanime|gogoanime|igg-games|kimcartoon|memecenter|readcomiconline|toonget|toonova|watchcartoononline)\.[a-z]{2,4}\/(?!([Ee]xternal|[Ii]mages|[Ss]cripts|[Uu]ploads|ac|ajax|assets|combined|content|cov|cover|(img\/bg)|(img\/icon)|inc|jwplayer|player|playlist-cat-rss|static|thumbs|wp-content|wp-includes)\/)(.*)/$image,other,script,~third-party,xmlhttprequest,domain=~animeland.hu"#; | |||
let network_filter = NetworkFilter::parse(filter, true, Default::default()).unwrap(); | |||
let regex = Arc::try_unwrap(network_filter.get_regex()).unwrap(); | |||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed because there is no way to get the CompiledRegex
type.
Do it important here to check it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is just checking that the filter has a compiled regex. The equivalent would be checking that an external RegexManager
compiles the regex for it (after a matches
call). I don't think it's critical in this test, but we probably should have at least one test somewhere to confirm that the regex count increases as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I see.
I've modified the test to check get_compiled_regex_count()
before/after calling is_match
.
src/filters/network.rs
Outdated
for (_, v) in &self.map { | ||
println!("{}", v.regex) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we don't want this in production code; I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the correction.
Originally, that code was supposed to be in the next PR, but after the plan was changed.
So I messed up the branches and some test code went here.
Now I've changed the solution to preserve the map entries but to drop the regexs themselves.
This gives extra information about which reg expressions were discarded.
src/filters/network.rs
Outdated
self.map.clear(); | ||
let now = self.now; | ||
self.map.retain(|_, v| now - v.last_used < std::time::Duration::from_secs(180)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.map.clear()
will completely empty the hashmap before the retain
operation, which ends up being a no-op. I assume we just want the retain
, so I removed the clear
.
63c66d5
to
ee1978d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to break RegexManager
out into its own file; just another top-level regex_manager.rs
would be fine. If you can do that and add a test or two there, I think it should be good to go.
@@ -3283,17 +3329,11 @@ mod match_tests { | |||
{ | |||
let filter = r#"/^https?:\/\/([0-9a-z\-]+\.)?(9anime|animeland|animenova|animeplus|animetoon|animewow|gamestorrent|goodanime|gogoanime|igg-games|kimcartoon|memecenter|readcomiconline|toonget|toonova|watchcartoononline)\.[a-z]{2,4}\/(?!([Ee]xternal|[Ii]mages|[Ss]cripts|[Uu]ploads|ac|ajax|assets|combined|content|cov|cover|(img\/bg)|(img\/icon)|inc|jwplayer|player|playlist-cat-rss|static|thumbs|wp-content|wp-includes)\/)(.*)/$image,other,script,~third-party,xmlhttprequest,domain=~animeland.hu"#; | |||
let network_filter = NetworkFilter::parse(filter, true, Default::default()).unwrap(); | |||
let regex = Arc::try_unwrap(network_filter.get_regex()).unwrap(); | |||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is just checking that the filter has a compiled regex. The equivalent would be checking that an external RegexManager
compiles the regex for it (after a matches
call). I don't think it's critical in this test, but we probably should have at least one test somewhere to confirm that the regex count increases as expected.
src/filters/network.rs
Outdated
@@ -151,13 +152,15 @@ impl From<&request::RequestType> for NetworkFilterMask { | |||
pub enum CompiledRegex { | |||
Compiled(Regex), | |||
CompiledSet(RegexSet), | |||
None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this new entry instead of using Option<CompiledRegex>
.
This allow to avoid growing the map entry to +8 bytes (because of extra Option
).
Motivation: https://adeschamps.github.io/enum-size
Moved |
src/filters/network.rs
Outdated
MatchAll, | ||
RegexParsingError(regex::Error), | ||
} | ||
|
||
impl CompiledRegex { | ||
pub fn is_match(&self, pattern: &str) -> bool { | ||
match &self { | ||
CompiledRegex::None => {panic!("use of CompiledRegex::None")} // No value, invalid call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the kind of thing I'd prefer to see behind the debug-info
flag, so we can guarantee it won't cause crashes for users in the wild.
Can we do this?
- Rename
CompiledRegex::None
toCompiledRegex::Discarded
- Put
CompiledRegex::Discarded
behind a#[cfg(feature = "debug-info")]
- Completely remove the
CompiledRegex
incleanup
whendebug-info
is disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory we can, I also though about the similar approach.
But I believe we need discarded map entries to make a better decision in future (=stop discarding the entries that we often recreate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I believe we need discarded map entries to make a better decision in future (=stop discarding the entries that we often recreate.
I think that makes sense as an approach, but I'd strongly prefer to add panics like this only when necessary. Unless you're planning to implement that strategy in this PR, let's leave it out and we can revisit later.
There could also be alternatives to this pattern that wouldn't panic (maybe we can store hashes of the discarded rules in a bloom filter or hashmap?). But we should look at that holistically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to Option<CompiledRegex>
.
@antonok-edm Any other comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last few comments and it should be good to go
@antonok-edm PTAL |
Draft to implement the storage Try to optimize Use AHasher Clean up and stats Rename and cleanup
f32a785
to
952df88
Compare
For brave/brave-browser#26892
The idea is to use global regular expression storage instead of individual ones.
This allows to challenge the long-term regexp memory consumption problem in the next PRs.
A new
RegexManager
is in charge of matching/spawning/deleting regex instances.NetworkFilter
is enforced to match usingRegexManager
.Plans:
Note:
get_debug_info()
method is supposed to be exposed to something likebrave://adblock-internals
.