
I think we should remove Regex #9

Closed
Zac-HD opened this issue May 7, 2022 · 14 comments · Fixed by #7

Comments

@Zac-HD
Member

Zac-HD commented May 7, 2022

I tried writing up Regex docs that would describe how people actually want to use them, and...

Regex(regex_pattern=p, regex_flags=x) implies that the string should contain a
match for p with flags x, at any position in the string. If you want the
full string to match, or the match to be at the start or end of the string,
you can use boundary markers like ^...$.

Regex() can be used with unicode strings or byte strings; if either are allowed
we suggest using one Regex() item for each type, and therefore ignoring Regex()
items with the wrong string type.

We do not specify the pattern syntax: libraries may choose to interpret it as
the Python stdlib re module, regex package, ECMAscript syntax, etc.,
and we encourage them to clearly document their chosen interpretation.
The meaning of the regex_flags argument is also implementation-defined.

I think this is sufficiently-implementation-defined that we should just, well, make downstream implementations define their own Regex metadata with more precise semantics. Either that, or we explicitly separate PythonRegex(pattern: str|bytes, flags: int) from JSONschemaRegex(pattern: str) and make people choose one.
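To make the second option concrete, something like this rough sketch (the names and fields are just the ones floated above, nothing settled):

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class PythonRegex:
    # pattern interpreted with Python stdlib `re` semantics
    pattern: Union[str, bytes]
    flags: int = 0

@dataclass(frozen=True)
class JSONschemaRegex:
    # pattern interpreted with ECMAScript semantics, as JSON Schema expects
    pattern: str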

@adriangb
Contributor

adriangb commented May 8, 2022

Regex() can be used with unicode strings or byte strings; if either are allowed
we suggest using one Regex() item for each type, and therefore ignoring Regex()
items with the wrong string type.

Is this Union[Annotated[bytes, Regex(b"foo")], Annotated[str, Regex("bar")]]? If so, would Regex() be ignored in this situation?

@samuelcolvin
Contributor

The str vs. bytes thing is hard: in dirty-equals I have logic to allow both ways round - e.g. a str regex can be used to check bytes.

However, we don't pretend anywhere else that we deal with bad types - e.g. we have no explanation of what Gt means on a str.

On balance I think we should keep this, but perhaps remove regex_flags and instead allow and store kwargs which libraries can use as they wish.

@Zac-HD
Member Author

Zac-HD commented May 9, 2022

Is this Union[Annotated[bytes, Regex(b"foo")], Annotated[str, Regex("bar")]]? If so, would Regex() be ignored in this situation?

I was thinking that Annotated[str|bytes, Regex("abc"), Regex(b"xyz")] would require a unicode string to contain abc (but not xyz), and the converse for bytestrings. This is definitely getting confusing enough that I'm counting it against having Regex in the base annotated-types package though.

However we don't pretend anywhere else that we deal with bad types - e.g. we have no explanation for rust Gt means on a str.

No, we do, strings are comparable. Annotated[str, Gt("a")] means that x > "a" must be true, which is e.g. the case for "b" but not for "Z".
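Quick REPL check of that ordering, for the record:

>>> "b" > "a"
True
>>> "Z" > "a"  # uppercase letters sort before lowercase in code-point order
False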

The str vs. bytes thing is hard: in dirty-equals I have logic to allow both ways round - e.g. a str regex can be used to check bytes. ... On balance I think we should keep this, but perhaps remove regex_flags and instead allow and store kwargs which libraries can use as they wish.

My point is that I think we both have clear and incompatible interpretations: regex-for-schemas and regex-for-strategies are just basically different use-cases, and I've run up against the differences many times with hypothesis-jsonschema. On that basis I'd prefer to keep annotated-types as clear as possible and have Regex() defined downstream for now.

@adriangb
Contributor

adriangb commented May 9, 2022

Annotated[str|bytes, Regex("abc"), Regex(b"xyz")]

I would say this is a bad usage of the metadata, and it should instead be Annotated[bytes, Regex(b"xyz")] | Annotated[str, Regex("abc")]. I have no formal training in type systems, though, so I'm not sure how to justify this; I just think that if you're applying a constraint to a union it should be applicable to every member of the union, or you should move it into the individual members.

I'll throw one more idea out there: Regex(pattern=r"...", flavor="pcre"). Not sure if this helps.
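Purely as a sketch of that flavor idea (the field names and flavor strings below are made up, not a proposal):

from dataclasses import dataclass

@dataclass(frozen=True)
class Regex:
    pattern: str
    flavor: str = "python"  # hypothetical values: "python", "ecmascript", "pcre", ...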

@samuelcolvin I'm curious how this is handled in Pydantic.
My understanding is that Pydantic's regex ends up being both a Python constraint and a JSON Schema constraint.

All of this said, if something may cause issues down the road or is not fully baked I am 100% in favor of removing it for now.
Punting on these sorts of decisions, especially when we have very few voices thus far, will lead to a better library long term.
I say this despite wanting to have a Regex here since I think it would be quite useful.

@Zac-HD
Member Author

Zac-HD commented May 11, 2022

All of this said, if something may cause issues down the road or is not fully baked I am 100% in favor of removing it for now. Punting on these sorts of decisions, especially when we have very few voices thus far, will lead to a better library long term. I say this despite wanting to have a Regex here since I think it would be quite useful.

Fully agree - I really really want it, but I want it to be good more than I want it right now.

@pschanely

pschanely commented Nov 7, 2022

Another random idea! Unsure about JSON Schema regexes, but re and regex compiled patterns are introspectable via the .pattern and .flags members. Regex() could be defined to take anything with a .search() method (e.g. used like Regex(re.compile('a*b'))), and tools could then case-split on the type.
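A minimal sketch of what I mean (nothing official, just illustrating the shape):

import re

class Regex:
    def __init__(self, searchable) -> None:
        # anything with a .search() method: re.Pattern, regex.Pattern, ...
        self.searchable = searchable

    def __call__(self, value) -> bool:
        return self.searchable.search(value) is not None

r = Regex(re.compile('a*b'))
print(r.searchable.pattern, r.searchable.flags)  # introspectable by tools that know the type
print(r('xaab'))  # True - plain-Python fallback for everyone else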

@Zac-HD
Member Author

Zac-HD commented Nov 7, 2022

The same pattern can match different strings between Python and JS (required by jsonschema) regex engines! Currently everyone handles this by politely ignoring the problem, but I don't think that's acceptable at the very bottom of the stack where I hope annotated-types can interop between everything correctly.
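Concrete example of the divergence, using the $ anchor (Python side shown; the JS behaviour is in the comment):

import re

print(bool(re.search(r"abc$", "abc\n")))  # True: Python's $ also matches just before a trailing newline
# In JavaScript: /abc$/.test("abc\n") evaluates to false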

@samuelcolvin
Contributor

Just as an example, pydantic-core (which will be used by pydantic, which will use annotated-types) uses the "regex" Rust crate to perform regex matching, and that crate (from memory, don't blindly trust me) doesn't implement look-aheads, to improve performance.

This is not a weird edge case; look-aheads are fairly commonly used and powerful.
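For example, a perfectly ordinary password-style pattern relies on look-aheads - it runs fine under Python's re, but an engine without look-around support would reject it at compile time (illustrative only, check the crate docs rather than trusting me):

import re

# require at least one digit and one lowercase letter, total length >= 8
password_ok = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")
print(bool(password_ok.search("hunter2passwd")))  # True in Python's re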

@Zac-HD
Member Author

Zac-HD commented Nov 7, 2022

Regex performance is pretty tricky! Algorithmically, you can check whether a string matches some 'regular language' in worst-case linear time (in the length of the string); but lookahead or lookbehind and a few other constructs aren't actually regular in the strict computer-science meaning.

For bonus points, Python doesn't actually implement the fast algorithm, so even strictly-regular expressions can take exponential time. This is, uh, suboptimal. The rust regex crate does guarantee linear-time though, so you're safe from denial-of-service issues there.
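Quick demonstration of the blow-up in CPython's re (timings vary by machine, but the growth is exponential in the input length):

import re
import time

pattern = re.compile(r"^(a+)+$")  # 'regular' in the CS sense, but invites backtracking
for n in (18, 20, 22, 24):
    start = time.perf_counter()
    pattern.match("a" * n + "!")  # forced to fail, so the engine backtracks exponentially
    print(n, round(time.perf_counter() - start, 3), "seconds")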

@adriangb
Contributor

adriangb commented Nov 8, 2022

Since this is on a bit of a tangent: this is a great article with some eye-opening (to me at least) ideas about regex engines: https://blog.burntsushi.net/transducers/

@pschanely

pschanely commented Nov 8, 2022

Ah, OK, the Rust regex use case is making this clearer for me. BTW: it feels like it might be useful to be able to run pydantic-core-flavored regex independently? (Heck, even a trivial package that wraps the Rust regex crate might be pretty popular!)

Anyway, I don't feel like I understand the goals of annotated-types well enough to take a real position of any kind. (My main interest is whether it would be a good idea to make CrossHair aware of these types; if so, I need very unambiguous interpretations.) Perhaps my only insight is that a handy cooperative loose-coupling pattern might be "a callable thing that can be introspected if you support that type". Heck, regex introspection is even possible with something like Predicate(re.compile("abc").search), via the method's __self__ member:

>>> re.compile("abc").search.__self__.pattern
'abc'
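
So a tool could case-split with something like this (a sketch; extract_pattern is a made-up name, and it assumes annotated_types.Predicate stores its callable as .func):

import re
from typing import Optional

from annotated_types import Predicate

def extract_pattern(p: Predicate) -> Optional[re.Pattern]:
    # if the predicate is a bound method of a compiled pattern, recover the pattern
    owner = getattr(p.func, "__self__", None)
    return owner if isinstance(owner, re.Pattern) else None

pred = Predicate(re.compile("abc").search)
print(extract_pattern(pred).pattern)  # 'abc'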

@Zac-HD
Member Author

Zac-HD commented Nov 8, 2022

I like all these ideas, want it to be useful for crosshair, and have accordingly taken some notes to update the documentation 😊

@adriangb
Contributor

adriangb commented Nov 8, 2022

I think the issue with that hack is that it is then up to the tool to determine whether it supports that regex or not. For the pydantic-core case mentioned above, I think pydantic-core would just have to error if the pattern or flags aren't supported by Rust's regex crate. The harder cases are where there is no error but the results are different; it would be pretty hard for the tool to figure that out.

If we were to go down this route, I guess we could do something like:

import re
from annotated_types import Predicate

class Regex(Predicate):
    def __init__(self, pattern: re.Pattern[str]) -> None:
        super().__init__(pattern.match)  # fallback: the predicate just calls the compiled pattern
        self.pattern = pattern  # tools that can do better can introspect this instead

That way the fallback is to do the call in Python, but if the tool decides it can do better, it is free to implement it differently by introspecting Regex.pattern.
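i.e. roughly this on the consuming side (sketch only, reusing the Regex class above; check() is just a stand-in for whatever the tool does):

from annotated_types import Predicate

def check(value: str, constraint: Predicate) -> bool:
    pattern = getattr(constraint, "pattern", None)
    if pattern is not None:
        # a tool with its own engine could re-compile pattern.pattern here instead
        return pattern.match(value) is not None
    return bool(constraint.func(value))  # plain-Python fallback for any other Predicate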

@Zac-HD
Member Author

Zac-HD commented Nov 8, 2022

I think the issue with that hack is that it is then up to the tool to determine whether it supports that regex or not. For the pydantic-core case mentioned above, I think pydantic-core would just have to error if the pattern or flags aren't supported by Rust's regex crate. The harder cases are where there is no error but the results are different; it would be pretty hard for the tool to figure that out.

I think this is just an inherent problem of 'there are way too many regex syntaxes' - for example my js-regex project died because I didn't have the spare time to handle charclasses-in-groups, and hypothesis-jsonschema has a horrible hack plumbed through so that $ is handled the JS way (end of string; in Python it also allows a trailing newline).

The proper way to handle this, at least for truly-regular expressions, would be to extend greenery with parsers and unparsers for each alternative syntax that we want to support. See python-jsonschema/hypothesis-jsonschema#85 and HypothesisWorks/hypothesis#3479.

If we were to go down this route, I guess we could do something like class Regex(Predicate):
So that the fallback is to do the call in Python, but if the tool decides that it can do better then it is free to implement it differently by introspecting Regex.pattern.

-1 on this because you can already introspect the contents of a standard Predicate as @pschanely shows above, and I have a very strong desire to keep the API very, very small and clean.
