Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider replacing Regex with simpler scanner #228

Closed
cloneable opened this issue Jul 31, 2021 · 8 comments · Fixed by #229
Closed

Consider replacing Regex with simpler scanner #228

cloneable opened this issue Jul 31, 2021 · 8 comments · Fixed by #229

Comments

@cloneable
Copy link
Contributor

I was checking a binary with cargo bloat and to my surprise I noticed that log4rs pulled in regex which turned out to be 4x as large as log4rs itself. So I was curious what log4rs is using regex for. I could only find one line with a reg exp matcher that's not even that complex.

$ grep -rni regex .
./Cargo.toml:21:file_appender = ["parking_lot", "simple_writer", "pattern_encoder", "regex"]
./Cargo.toml:22:rolling_file_appender = ["parking_lot", "simple_writer", "pattern_encoder", "regex"]
./Cargo.toml:70:regex = { version = "1", optional = true }
./src/append/mod.rs:29:        let matcher = regex::Regex::new(r#"\$ENV\{([\w][\w|\d|\.|_]*)\}"#).unwrap();

So I'm wondering if you have more plans for regex crate or if you would be willing to replace this one use with a handwritten scanner?

@estk
Copy link
Owner

estk commented Aug 3, 2021

@cloneable this seems like a great idea! Any chance you'd like to open a PR on it?

@cloneable
Copy link
Contributor Author

Ok, I'll give it a shot. Seems a good chance to improve my Rust in this area.

@estk
Copy link
Owner

estk commented Aug 5, 2021

@cloneable Great to hear!
Feel free to open a Draft PR and ask any and all questions you'd like there.

@cloneable
Copy link
Contributor Author

Great, thanks!

The current pattern for an env var name is this: [\w][\w|\d|\.|_]*

(Btw, I believe the pipes are matched too. You probably meant to use parentheses or no pipes in brackets. Also, the period loses it's special meaning in brackets and doesn't need to be escaped. The first \w doesn't need to be in brackets.)

Question: \w actually implies a lot, including \d. I guess the original intent was to match a letter at the beginning and then zero or more letters, digits, periods and underscores. One could argue that it's now too late to restrict this, but let me know if you want to do that as is would simplify my code.

https://docs.rs/regex/1.5.4/regex/#perl-character-classes-unicode-friendly

@estk
Copy link
Owner

estk commented Aug 6, 2021

@cloneable I agree with your assessment, I think the logic should basically be completely replaced with just a dumb find which scans from "$ENV{" to the first "}". Whether the thing inside the "$ENV{...}" is valid I dont think we really care.

@cloneable
Copy link
Contributor Author

OK, I'll adjust the draft PR.

@cloneable
Copy link
Contributor Author

@estk, allowing anything between { } means we allow $ENV{} too. Users might think it's possible to use nested $ENVs, like e.g. $ENV{endpoint-$ENV{instance_number}}.
The new code would try to look for an env var "endpoint-$ENV{instance_number". (I can try to turn the code into recursive lookup, but I kinda don't want this to turn into a rabbit hole.)

I think it's best to replicate what the regular expression currently allows. Or tighten that to ascii and additionally allow current behavior via feature. Wdyt?

@estk
Copy link
Owner

estk commented Aug 7, 2021 via email

@estk estk closed this as completed in #229 Aug 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants