Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a global RegexManager #240

Merged
merged 14 commits into from
Jan 16, 2023
Merged

Add a global RegexManager #240

merged 14 commits into from
Jan 16, 2023

Conversation

atuchin-m
Copy link
Collaborator

@atuchin-m atuchin-m commented Dec 8, 2022

For brave/brave-browser#26892

The idea is to use global regular expression storage instead of individual ones.
This allows to challenge the long-term regexp memory consumption problem in the next PRs.
A new RegexManager is in charge of matching/spawning/deleting regex instances. NetworkFilter is enforced to match using RegexManager.

Plans:

  • track the last regex usage and drop the instance it after a few time;
  • use less effective, but more memory-saving algorithms for some of the reg exps until it becomes a hot path;

Note:

  • Using a standard hasher for u64 instead of AHasher results in 3% performance drop.. Not sure AHasher is the best option. I am open to other suggestions.
  • Replacing a standard hasher in other places could also give a perf boost, worst trying.
  • get_debug_info() method is supposed to be exposed to something like brave://adblock-internals.

Copy link
Collaborator

@antonok-edm antonok-edm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think overall this approach is a good idea.

I noticed a few places where the indentation is a bit messed up. I'd recommend cargo fmt, but currently it would produce a bunch of unrelated changes, which is not ideal. Could you at least go over the parts you've touched and make sure the indentation is 4 spaces? If it's easier you can copy your code changes into another file and format that one specifically.

src/filters/network.rs Outdated Show resolved Hide resolved
Comment on lines +1102 to +1062

#[cfg(test)]
fn matches_test(&self, request: &request::Request) -> bool;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of not making breaking changes until the next semver bump, let's leave the matches and get_regex signatures untouched. We can add matches_with_regex_manager in addition to these.

  • get_regex can re-compile the regex whenever it's called.
  • matches could just be a thin wrapper around matches_with_regex_manager with an empty RegexManager; defined in-line with a #[deprecated] attribute.

That way, we don't need any #[cfg(test)] methods, and tests wouldn't have to be modified at all either.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to force all the instances to use a correct RegexManager (except the tests).
The thing that I'm afraid that we occasionally call the wrong function in the production code.
Like returning in old signature fn matches(&self, request: &request::Request) -> bool results in still usin it in some part of adblock-rust or 3rd party product.

That's why I believe we'd better to make breaking changes here.
A singleton as alternative doesn't sounds good..

Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
src/blocker.rs Outdated Show resolved Hide resolved
@atuchin-m
Copy link
Collaborator Author

@antonok-edm PTAL

@@ -3283,17 +3329,11 @@ mod match_tests {
{
let filter = r#"/^https?:\/\/([0-9a-z\-]+\.)?(9anime|animeland|animenova|animeplus|animetoon|animewow|gamestorrent|goodanime|gogoanime|igg-games|kimcartoon|memecenter|readcomiconline|toonget|toonova|watchcartoononline)\.[a-z]{2,4}\/(?!([Ee]xternal|[Ii]mages|[Ss]cripts|[Uu]ploads|ac|ajax|assets|combined|content|cov|cover|(img\/bg)|(img\/icon)|inc|jwplayer|player|playlist-cat-rss|static|thumbs|wp-content|wp-includes)\/)(.*)/$image,other,script,~third-party,xmlhttprequest,domain=~animeland.hu"#;
let network_filter = NetworkFilter::parse(filter, true, Default::default()).unwrap();
let regex = Arc::try_unwrap(network_filter.get_regex()).unwrap();
assert!(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because there is no way to get the CompiledRegex type.
Do it important here to check it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is just checking that the filter has a compiled regex. The equivalent would be checking that an external RegexManager compiles the regex for it (after a matches call). I don't think it's critical in this test, but we probably should have at least one test somewhere to confirm that the regex count increases as expected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I see.
I've modified the test to check get_compiled_regex_count() before/after calling is_match.

Comment on lines 351 to 353
for (_, v) in &self.map {
println!("{}", v.regex)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we don't want this in production code; I'll remove it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the correction.
Originally, that code was supposed to be in the next PR, but after the plan was changed.
So I messed up the branches and some test code went here.

Now I've changed the solution to preserve the map entries but to drop the regexs themselves.
This gives extra information about which reg expressions were discarded.

Comment on lines 348 to 349
self.map.clear();
let now = self.now;
self.map.retain(|_, v| now - v.last_used < std::time::Duration::from_secs(180));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.map.clear() will completely empty the hashmap before the retain operation, which ends up being a no-op. I assume we just want the retain, so I removed the clear.

@antonok-edm antonok-edm force-pushed the use-global-regex-storage branch 2 times, most recently from 63c66d5 to ee1978d Compare January 6, 2023 04:08
Copy link
Collaborator

@antonok-edm antonok-edm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to break RegexManager out into its own file; just another top-level regex_manager.rs would be fine. If you can do that and add a test or two there, I think it should be good to go.

@@ -3283,17 +3329,11 @@ mod match_tests {
{
let filter = r#"/^https?:\/\/([0-9a-z\-]+\.)?(9anime|animeland|animenova|animeplus|animetoon|animewow|gamestorrent|goodanime|gogoanime|igg-games|kimcartoon|memecenter|readcomiconline|toonget|toonova|watchcartoononline)\.[a-z]{2,4}\/(?!([Ee]xternal|[Ii]mages|[Ss]cripts|[Uu]ploads|ac|ajax|assets|combined|content|cov|cover|(img\/bg)|(img\/icon)|inc|jwplayer|player|playlist-cat-rss|static|thumbs|wp-content|wp-includes)\/)(.*)/$image,other,script,~third-party,xmlhttprequest,domain=~animeland.hu"#;
let network_filter = NetworkFilter::parse(filter, true, Default::default()).unwrap();
let regex = Arc::try_unwrap(network_filter.get_regex()).unwrap();
assert!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is just checking that the filter has a compiled regex. The equivalent would be checking that an external RegexManager compiles the regex for it (after a matches call). I don't think it's critical in this test, but we probably should have at least one test somewhere to confirm that the regex count increases as expected.

@@ -151,13 +152,15 @@ impl From<&request::RequestType> for NetworkFilterMask {
pub enum CompiledRegex {
Compiled(Regex),
CompiledSet(RegexSet),
None,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this new entry instead of using Option<CompiledRegex>.
This allow to avoid growing the map entry to +8 bytes (because of extra Option).

Motivation: https://adeschamps.github.io/enum-size

@atuchin-m
Copy link
Collaborator Author

Moved RegexManager to regex_manager.rs .

MatchAll,
RegexParsingError(regex::Error),
}

impl CompiledRegex {
pub fn is_match(&self, pattern: &str) -> bool {
match &self {
CompiledRegex::None => {panic!("use of CompiledRegex::None")} // No value, invalid call.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of thing I'd prefer to see behind the debug-info flag, so we can guarantee it won't cause crashes for users in the wild.

Can we do this?

  • Rename CompiledRegex::None to CompiledRegex::Discarded
  • Put CompiledRegex::Discarded behind a #[cfg(feature = "debug-info")]
  • Completely remove the CompiledRegex in cleanup when debug-info is disabled

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory we can, I also though about the similar approach.
But I believe we need discarded map entries to make a better decision in future (=stop discarding the entries that we often recreate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I believe we need discarded map entries to make a better decision in future (=stop discarding the entries that we often recreate.

I think that makes sense as an approach, but I'd strongly prefer to add panics like this only when necessary. Unless you're planning to implement that strategy in this PR, let's leave it out and we can revisit later.

There could also be alternatives to this pattern that wouldn't panic (maybe we can store hashes of the discarded rules in a bloom filter or hashmap?). But we should look at that holistically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Option<CompiledRegex>.

src/filters/regex_manager.rs Outdated Show resolved Hide resolved
@atuchin-m
Copy link
Collaborator Author

@antonok-edm Any other comments?

Copy link
Collaborator

@antonok-edm antonok-edm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last few comments and it should be good to go

src/filters/regex_manager.rs Outdated Show resolved Hide resolved
src/filters/regex_manager.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
@atuchin-m
Copy link
Collaborator Author

@antonok-edm PTAL

@antonok-edm antonok-edm merged commit 066eb36 into master Jan 16, 2023
@antonok-edm antonok-edm deleted the use-global-regex-storage branch January 16, 2023 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants