-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove/Reduce use of Regex in ObjectRegistry/Library #9264
Conversation
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches. Added tests for the new derivative classes and got all unit tests to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of new complexity in the public API. If I were to rate solutions on various axes, with 10 as perfect score: ("aux" refers to ease of ongoing maintenance not related to compatibility and safety)
Existing regex interface: usability=7, compatibility=7, safety=4, aux=9
Integrate better RE library: usability=7, compatibility=9, safety=5, aux=5
This PR PatternEntry etc. assuming Regexes eventually removed:
usability=5, compatibility=10, safety=7, aux=7
While regexes are easy to get wrong and showing compatibility issues, they are at least concise and pretty well understood. I find a lot of cognitive load for this PatternEntry stuff because although some common patterns are covered, there's still some complexity in getting it, and without a new derived class, there's not nearly the generality of regex operators nor (I would argue) potential safety benefits of limiting expressiveness.
My proposal (related to yours but perhaps with advantages) would use builder pattern instead of subclasses. Example:
->Register<SomeType>(ObjectLibrary::Entry(my_primary_name)
.AltName(old_name1)
.AltName(old_name2)
.RequiredSeparator("://")
.SuffixFilter([](const Slice& suffix) { ... }),
This would allow us to do some enhanced checking (ambiguity?), and even indexing, because we know specific prefixes associated with entries. There would be some rigidity where it could be useful to take advantage of that, and the format is somewhat prescriptive, but flexibility in the tail parts where we don't care so much.
Because it's IMHO easier to find what you want when dealing with variation within a single class, something like this I might rate usability=7, compatibility=10, safety=up to 9, aux=8
|
||
class RegexEntry : public Entry { | ||
public: | ||
static std::unique_ptr<Entry> Create(const std::string& name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of a reason not to return unique_ptr<RegexEntry>
and let the likely conversions happen in the caller.
bool is_enabled(Mode mode) const { return (mode_ & mode) == mode; } | ||
virtual bool matches_pattern(const std::string& target, | ||
size_t length) const { | ||
if (!is_enabled(kMatchPattern)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason for these implementation details (and likely more) to be in header file.
Cleaned up multiple entry implementation classes for pattern matching (leaving just PatternEntry). Added setter methods and suggested in the PR comments.
@mrambacher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promising, with some clarifications
if (nlen == tlen) { // The lengths are the same | ||
return (name_only_ || patterns_.empty()) && name == target; | ||
} else if (patterns_.empty()) { | ||
return nlen == tlen && name == target; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In context, this case can never return true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nlen==tlen at line 22 is meant to prevent the comparison of strings on the second half. If the nlen==tlen, it will be caught at line 19 but if they are not equal then the check here could speed up the check.
I re-ordered the cases to simplify the code slightly.
// -Pattern: Comparable to [name][suffix].+ | ||
// -Number: Comparable to [name][suffix].[0-9]+ | ||
// -AltName: Comparable to ([name]|[alt]) | ||
// -NameOnly: Comparable to ([name][pattern]* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this (missing close )
and what is pattern
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- A suffix implies an exact match of the pattern.
- A pattern implies a match of the pattern followed by one or more characters
- A number implies a match of a pattern followed by a number
- Alternative names implies a match of the "name" or the "alternate" (followed by whatever suffix/pattern/number rules are active)
- NameOnly means that we can match just the name (or alternates) without the pattern rules (the patterns are optional).
I updated the comment to hopefully make it clearer/less confusing.
} | ||
|
||
// Creates a new pattern entry for "name". If name_only is true, | ||
// Matches will return true if name==target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description of name_only
is confusing, because won't Matches
return true on name==target
regardless of name_only
--until this PatternEntry is modified in some way? If I modify it, is name_only
aspect modified? And only
is suggestive of "only if" which doesn't seem to be the case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed "name_only" to "optional" and added/updated comments
// -Number: Comparable to [name][suffix].[0-9]+ | ||
// -AltName: Comparable to ([name]|[alt]) | ||
// -NameOnly: Comparable to ([name][pattern]* | ||
// Multiple patterns can be combined and cause multiple matches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird that AddName
is an OR, while AddPattern
etc. are CONCAT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer "AddName" and "AppendPattern" or "AddAlternateName"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference, but I think we can avoid using the same verb for different things (like avoiding same noun for different things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed "AddName" to "AlternateName" and "AddPattern" to "AddSeparator"
ASSERT_FALSE(entry.Matches("AB::1##")); | ||
ASSERT_FALSE(entry.Matches("AB::1##2")); | ||
ASSERT_FALSE(entry.Matches("AA##1::")); | ||
ASSERT_TRUE(entry.Matches("AA::1##")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, "AA::###" would not be a match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Comparable test added.
virtual const char* Name() const = 0; | ||
}; | ||
|
||
// Class for matching target strings to a pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth pointing out a key difference from regexes:
- In matching, the left-most match for one piece of a pattern is the only one considered (e.g. if next separator is "::" then "::" cannot be in the current piece), thus
- No backtracking is needed for matching, so matching is reliably efficient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to PatternEntry. LMK if you want more and what else you would like to see
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
// Entries consist of a name that starts the pattern and attributes | ||
// The following attributes can be added to the entry: | ||
// -Suffix: Comparable to name(pattern) | ||
// -Pattern: Comparable to name(pattern).+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also find it confusing that "pattern" is used in three ways: PatternEntry (general), Pattern subtype, and pattern string to match. The last I think we can call "separator" as in my discussion about difference vs. regexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed "AddPattern" and associated variables to "Separator". The class is still called PatternEntry but most other references have been removed.
Add comments and rename methods/variables.
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
// Class for matching target strings to a pattern. | ||
// Entries consist of a name that starts the pattern and attributes | ||
// The following attributes can be added to the entry: | ||
// -Suffix: Comparable to name(suffic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo suffix
// | ||
// Note that though this class does provide some regex-style matching, | ||
// it is not a full regex parser and has some key differences: | ||
// - Separatorss are matched left-most. For example, an entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo Separators
} | ||
|
||
template <typename T> | ||
const FactoryFunc<T>& Register(const PatternEntry& pattern, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Entry
appears to be publicly extensible, inviting custom matching implementations, which I don't really object to except that we are also using value semantics (copy) for a kind of Entry here. At least in public facing API, you should be consistent about value vs. owned pointer semantics for the entire hierarchy. IMHO mixing extensibility with value semantics in a hierarchy is confusing (because you obviously can't have both in the same place at the same time).
Fix typos. Make Entry private to ObjectLibrary
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good enough now, except GitHub reporting merge conflict
Thanks
Also, this will be a breaking API change to some small set of users. Needs to be mentioned in HISTORY.md. I'll see if there's any objection to this landing in a minor release. |
And can we remove regex.h public header? Could be a follow-up PR... |
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mrambacher has updated the pull request. You must reimport the pull request before landing. |
@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Would it be possible to move the breaking plugin API change to the next major release? The change gives us a bit of a headache as it forces zenfs to add an compile time API-version flag and implement preprocessing to support both the old version and the new version of the API, as two of our users (terarkdb and percona mysql server) is not using the latest rocksdb(as of now). see westerndigitalcorporation/zenfs#125 If breaking API changes are needed, defining macros in the rocksdb public headers (something like ROCKSDB_REG_API_V2 in this case) indicating what version of the API is used would mitigate the pain, as that would at least remove the requirement on the user to set the right compile-time flags. For any changes in the public API could you give us(plugin maintainers) a heads up of the change? We do rely on a stable API. Maybe add us as reviewers? |
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.
Added tests for the new derivative classes and got all unit tests to pass.
Resolves #9225.