Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove/Reduce use of Regex in ObjectRegistry/Library #9264

Closed
wants to merge 11 commits into from

Conversation

mrambacher
Copy link
Contributor

Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.

Added tests for the new derivative classes and got all unit tests to pass.

Resolves #9225.

Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex.  For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes.  For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes.  There is also a class for Customizable::IndividualId format matches.

Added tests for the new derivative classes and got all unit tests to pass.
Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of new complexity in the public API. If I were to rate solutions on various axes, with 10 as perfect score: ("aux" refers to ease of ongoing maintenance not related to compatibility and safety)

Existing regex interface: usability=7, compatibility=7, safety=4, aux=9
Integrate better RE library: usability=7, compatibility=9, safety=5, aux=5
This PR PatternEntry etc. assuming Regexes eventually removed:
usability=5, compatibility=10, safety=7, aux=7

While regexes are easy to get wrong and showing compatibility issues, they are at least concise and pretty well understood. I find a lot of cognitive load for this PatternEntry stuff because although some common patterns are covered, there's still some complexity in getting it, and without a new derived class, there's not nearly the generality of regex operators nor (I would argue) potential safety benefits of limiting expressiveness.

My proposal (related to yours but perhaps with advantages) would use builder pattern instead of subclasses. Example:

   ->Register<SomeType>(ObjectLibrary::Entry(my_primary_name)
                                          .AltName(old_name1)
                                          .AltName(old_name2)
                                          .RequiredSeparator("://")
                                          .SuffixFilter([](const Slice& suffix) { ... }),

This would allow us to do some enhanced checking (ambiguity?), and even indexing, because we know specific prefixes associated with entries. There would be some rigidity where it could be useful to take advantage of that, and the format is somewhat prescriptive, but flexibility in the tail parts where we don't care so much.

Because it's IMHO easier to find what you want when dealing with variation within a single class, something like this I might rate usability=7, compatibility=10, safety=up to 9, aux=8


class RegexEntry : public Entry {
public:
static std::unique_ptr<Entry> Create(const std::string& name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a reason not to return unique_ptr<RegexEntry> and let the likely conversions happen in the caller.

bool is_enabled(Mode mode) const { return (mode_ & mode) == mode; }
virtual bool matches_pattern(const std::string& target,
size_t length) const {
if (!is_enabled(kMatchPattern)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason for these implementation details (and likely more) to be in header file.

Cleaned up multiple entry implementation classes for pattern matching (leaving just PatternEntry).  Added setter methods and suggested in the PR comments.
@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promising, with some clarifications

if (nlen == tlen) { // The lengths are the same
return (name_only_ || patterns_.empty()) && name == target;
} else if (patterns_.empty()) {
return nlen == tlen && name == target;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In context, this case can never return true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nlen==tlen at line 22 is meant to prevent the comparison of strings on the second half. If the nlen==tlen, it will be caught at line 19 but if they are not equal then the check here could speed up the check.

I re-ordered the cases to simplify the code slightly.

// -Pattern: Comparable to [name][suffix].+
// -Number: Comparable to [name][suffix].[0-9]+
// -AltName: Comparable to ([name]|[alt])
// -NameOnly: Comparable to ([name][pattern]*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this (missing close ) and what is pattern?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • A suffix implies an exact match of the pattern.
  • A pattern implies a match of the pattern followed by one or more characters
  • A number implies a match of a pattern followed by a number
  • Alternative names implies a match of the "name" or the "alternate" (followed by whatever suffix/pattern/number rules are active)
  • NameOnly means that we can match just the name (or alternates) without the pattern rules (the patterns are optional).

I updated the comment to hopefully make it clearer/less confusing.

}

// Creates a new pattern entry for "name". If name_only is true,
// Matches will return true if name==target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description of name_only is confusing, because won't Matches return true on name==target regardless of name_only--until this PatternEntry is modified in some way? If I modify it, is name_only aspect modified? And only is suggestive of "only if" which doesn't seem to be the case here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed "name_only" to "optional" and added/updated comments

// -Number: Comparable to [name][suffix].[0-9]+
// -AltName: Comparable to ([name]|[alt])
// -NameOnly: Comparable to ([name][pattern]*
// Multiple patterns can be combined and cause multiple matches.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird that AddName is an OR, while AddPattern etc. are CONCAT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer "AddName" and "AppendPattern" or "AddAlternateName"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, but I think we can avoid using the same verb for different things (like avoiding same noun for different things).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed "AddName" to "AlternateName" and "AddPattern" to "AddSeparator"

ASSERT_FALSE(entry.Matches("AB::1##"));
ASSERT_FALSE(entry.Matches("AB::1##2"));
ASSERT_FALSE(entry.Matches("AA##1::"));
ASSERT_TRUE(entry.Matches("AA::1##"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, "AA::###" would not be a match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Comparable test added.

virtual const char* Name() const = 0;
};

// Class for matching target strings to a pattern.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth pointing out a key difference from regexes:

  • In matching, the left-most match for one piece of a pattern is the only one considered (e.g. if next separator is "::" then "::" cannot be in the current piece), thus
  • No backtracking is needed for matching, so matching is reliably efficient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to PatternEntry. LMK if you want more and what else you would like to see

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

// Entries consist of a name that starts the pattern and attributes
// The following attributes can be added to the entry:
// -Suffix: Comparable to name(pattern)
// -Pattern: Comparable to name(pattern).+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find it confusing that "pattern" is used in three ways: PatternEntry (general), Pattern subtype, and pattern string to match. The last I think we can call "separator" as in my discussion about difference vs. regexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed "AddPattern" and associated variables to "Separator". The class is still called PatternEntry but most other references have been removed.

Add comments and rename methods/variables.
@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

// Class for matching target strings to a pattern.
// Entries consist of a name that starts the pattern and attributes
// The following attributes can be added to the entry:
// -Suffix: Comparable to name(suffic)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo suffix

//
// Note that though this class does provide some regex-style matching,
// it is not a full regex parser and has some key differences:
// - Separatorss are matched left-most. For example, an entry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo Separators

}

template <typename T>
const FactoryFunc<T>& Register(const PatternEntry& pattern,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entry appears to be publicly extensible, inviting custom matching implementations, which I don't really object to except that we are also using value semantics (copy) for a kind of Entry here. At least in public facing API, you should be consistent about value vs. owned pointer semantics for the entire hierarchy. IMHO mixing extensibility with value semantics in a hierarchy is confusing (because you obviously can't have both in the same place at the same time).

Fix typos.  Make Entry private to ObjectLibrary
@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good enough now, except GitHub reporting merge conflict

Thanks

@pdillinger
Copy link
Contributor

Also, this will be a breaking API change to some small set of users. Needs to be mentioned in HISTORY.md. I'll see if there's any objection to this landing in a minor release.

@pdillinger
Copy link
Contributor

And can we remove regex.h public header? Could be a follow-up PR...

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mrambacher has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@mrambacher has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@yhr
Copy link
Contributor

yhr commented Jan 4, 2022

@mrambacher @pdillinger ,

Would it be possible to move the breaking plugin API change to the next major release?

The change gives us a bit of a headache as it forces zenfs to add an compile time API-version flag and implement preprocessing to support both the old version and the new version of the API, as two of our users (terarkdb and percona mysql server) is not using the latest rocksdb(as of now).

see westerndigitalcorporation/zenfs#125

If breaking API changes are needed, defining macros in the rocksdb public headers (something like ROCKSDB_REG_API_V2 in this case) indicating what version of the API is used would mitigate the pain, as that would at least remove the requirement on the user to set the right compile-time flags.

For any changes in the public API could you give us(plugin maintainers) a heads up of the change? We do rely on a stable API. Maybe add us as reviewers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants