
I think we should remove Regex #9

Closed
Zac-HD opened this issue May 7, 2022 · 14 comments · Fixed by #7

Comments

@Zac-HD
Member

Zac-HD commented May 7, 2022

I tried writing up Regex docs that would describe how people actually want to use them, and...

Regex(regex_pattern=p, regex_flags=x) implies that the string should contain a
match for p with flags x, at any position in the string. If you want the
full string to match, or the match to be at the start or end of the string,
you can use boundary markers like ^...$.

Regex() can be used with unicode strings or byte strings; if either are allowed
we suggest using one Regex() item for each type, and therefore ignoring Regex()
items with the wrong string type.

We do not specify the pattern syntax: libraries may choose to interpret it as
the Python stdlib re module, regex package, ECMAscript syntax, etc.,
and we encourage them to clearly document their chosen interpretation.
The meaning of the regex_flags argument is also implementation-defined.

I think this is sufficiently-implementation-defined that we should just, well, make downstream implementations define their own Regex metadata with more precise semantics. Either that, or we explicitly separate PythonRegex(pattern: str|bytes, flags: int) from JSONschemaRegex(pattern: str) and make people choose one.
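To make the second option concrete, something like this rough sketch (the names and fields are just the ones floated above, nothing settled):

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class PythonRegex:
    # pattern interpreted with Python stdlib `re` semantics
    pattern: Union[str, bytes]
    flags: int = 0

@dataclass(frozen=True)
class JSONschemaRegex:
    # pattern interpreted with ECMAScript semantics, as JSON Schema expects
    pattern: str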

@adriangb
Contributor

adriangb commented May 8, 2022

Regex() can be used with unicode strings or byte strings; if either are allowed
we suggest using one Regex() item for each type, and therefore ignoring Regex()
items with the wrong string type.

Is this Union[Annotated[bytes, Regex(b"foo")], Annotated[str, Regex("bar")]]? If so, would Regex() be ignored in this situation?

@samuelcolvin
Contributor

The str vs. bytes thing is hard: in dirty-equals I have logic to allow both ways round - e.g. a str regex can be used to check bytes.

However, we don't pretend anywhere else that we deal with bad types - e.g. we have no explanation of what Gt means on a str.

On balance I think we should keep this, but perhaps remove regex_flags and instead allow and store kwargs which libraries can use as they wish.

@Zac-HD
Member Author

Zac-HD commented May 9, 2022

Is this Union[Annotated[bytes, Regex(b"foo")], Annotated[str, Regex("bar")]]? If so, would Regex() be ignored in this situation?

I was thinking that Annotated[str|bytes, Regex("abc"), Regex(b"xyz")] would require a unicode string to contain abc (but not xyz), and the converse for bytestrings. This is definitely getting confusing enough that I'm counting it against having Regex in the base annotated-types package though.

However we don't pretend anywhere else that we deal with bad types - e.g. we have no explanation for rust Gt means on a str.

No, we do, strings are comparable. Annotated[str, Gt("a")] means that x > "a" must be true, which is e.g. the case for "b" but not for "Z".
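Quick REPL check of that ordering, for the record:

>>> "b" > "a"
True
>>> "Z" > "a"  # uppercase letters sort before lowercase in code-point order
False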

The str vs. bytes thing is hard: in dirty-equals I have logic to allow both ways round - e.g. a str regex can be used to check bytes. ... On balance I think we should keep this, but perhaps remove regex_flags and instead allow and store kwargs which libraries can use as they wish.

My point is that I think we both have clear and incompatible interpretations: regex-for-schemas and regex-for-strategies are just basically different use-cases, and I've run up against the differences many times with hypothesis-jsonschema. On that basis I'd prefer to keep annotated-types as clear as possible and have Regex() defined downstream for now.

@adriangb
Contributor

adriangb commented May 9, 2022

Annotated[str|bytes, Regex("abc"), Regex(b"xyz")]

I would say this is a bad usage of the metadata, and it should instead be Annotated[bytes, Regex(b"xyz")] | Annotated[str, Regex("abc")]. I have no formal training in type systems, though, so I'm not sure how to justify this; I just think that if you're applying a constraint to a union it should be applicable to every member of the union, or you should move it into the individual members.

I'll throw one more idea out there: Regex(pattern=r"...", flavor="pcre"). Not sure if this helps.
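Purely as a sketch of that flavor idea (the field names and flavor strings below are made up, not a proposal):

from dataclasses import dataclass

@dataclass(frozen=True)
class Regex:
    pattern: str
    flavor: str = "python"  # hypothetical values: "python", "ecmascript", "pcre", ...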

@samuelcolvin I'm curious how this is handled in Pydantic.
My understanding is that Pydantic's regex ends up being both a Python constraint and a JSON Schema constraint.

All of this said, if something may cause issues down the road or is not fully baked I am 100% in favor of removing it for now.
Punting on these sorts of decisions, especially when we have very few voices thus far, will lead to a better library long term.
I say this despite wanting to have a Regex here since I think it would be quite useful.

@Zac-HD
Member Author

Zac-HD commented May 11, 2022

All of this said, if something may cause issues down the road or is not fully baked I am 100% in favor of removing it for now. Punting on these sorts of decisions, especially when we have very few voices thus far, will lead to a better library long term. I say this despite wanting to have a Regex here since I think it would be quite useful.

Fully agree - I really really want it, but I want it to be good more than I want it right now.

@pschanely

pschanely commented Nov 7, 2022

Another random idea! Unsure about JSON Schema regexes, but re and regex compiled patterns are introspectable via the .pattern and .flags members. Regex() could be defined to take anything with a .search() method (e.g. used like Regex(re.compile('a*b'))), and tools could then case-split on the type.
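A minimal sketch of what I mean (nothing official, just illustrating the shape):

import re

class Regex:
    def __init__(self, searchable) -> None:
        # anything with a .search() method: re.Pattern, regex.Pattern, ...
        self.searchable = searchable

    def __call__(self, value) -> bool:
        return self.searchable.search(value) is not None

r = Regex(re.compile('a*b'))
print(r.searchable.pattern, r.searchable.flags)  # introspectable by tools that know the type
print(r('xaab'))  # True - plain-Python fallback for everyone else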

@Zac-HD
Member Author

Zac-HD commented Nov 7, 2022

The same pattern can match different strings between Python and JS (required by jsonschema) regex engines! Currently everyone handles this by politely ignoring the problem, but I don't think that's acceptable at the very bottom of the stack where I hope annotated-types can interop between everything correctly.
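Concrete example of the divergence, using the $ anchor (Python side shown; the JS behaviour is in the comment):

import re

print(bool(re.search(r"abc$", "abc\n")))  # True: Python's $ also matches just before a trailing newline
# In JavaScript: /abc$/.test("abc\n") evaluates to false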

@samuelcolvin
Contributor

Just as an example, pydantic-core (which will be used by pydantic, which will use annotated-types) uses the "regex" Rust crate to perform regex matching, and that crate (from memory, don't blindly trust me) doesn't implement look-aheads, to improve performance.

This is not a weird edge case; look-aheads are fairly commonly used and powerful.
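For example, a perfectly ordinary password-style pattern relies on look-aheads - it runs fine under Python's re, but an engine without look-around support would reject it at compile time (illustrative only, check the crate docs rather than trusting me):

import re

# require at least one digit and one lowercase letter, total length >= 8
password_ok = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")
print(bool(password_ok.search("hunter2passwd")))  # True in Python's re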

@Zac-HD
Member Author

Zac-HD commented Nov 7, 2022

Regex performance is pretty tricky! Algorithmically, you can check whether a string matches some 'regular language' in worst-case linear time (in the length of the string); but lookahead or lookbehind and a few other constructs aren't actually regular in the strict computer-science meaning.

For bonus points, Python doesn't actually implement the fast algorithm, so even strictly-regular expressions can take exponential time. This is, uh, suboptimal. The rust regex crate does guarantee linear-time though, so you're safe from denial-of-service issues there.
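Quick demonstration of the blow-up in CPython's re (timings vary by machine, but the growth is exponential in the input length):

import re
import time

pattern = re.compile(r"^(a+)+$")  # 'regular' in the CS sense, but invites backtracking
for n in (18, 20, 22, 24):
    start = time.perf_counter()
    pattern.match("a" * n + "!")  # forced to fail, so the engine backtracks exponentially
    print(n, round(time.perf_counter() - start, 3), "seconds")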

@adriangb
Contributor

adriangb commented Nov 8, 2022

Since this is on a bit of a tangent: this is a great article with some eye-opening (to me at least) ideas about regex engines: https://blog.burntsushi.net/transducers/

@pschanely

pschanely commented Nov 8, 2022

Ah, OK, the Rust regex use case is making this clearer for me. BTW: it feels like it might be useful to be able to run pydantic-core-flavored regex independently? (Heck, even a trivial package that wraps the Rust regex crate might be pretty popular!)

Anyway, I don't feel like I understand the goals of annotated-types well enough to take a real position of any kind. (My main interest is whether it would be a good idea to make CrossHair aware of these types; if so, I need very unambiguous interpretations.) Perhaps my only insight is that a handy cooperative loose-coupling pattern might be "a callable thing that can be introspected if you support that type". Heck, regex introspection is even possible with something like Predicate(re.compile("abc").search), via the method's __self__ member:

>>> re.compile("abc").search.__self__.pattern
'abc'
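
So a tool could case-split with something like this (a sketch; extract_pattern is a made-up name, and it assumes annotated_types.Predicate stores its callable as .func):

import re
from typing import Optional

from annotated_types import Predicate

def extract_pattern(p: Predicate) -> Optional[re.Pattern]:
    # if the predicate is a bound method of a compiled pattern, recover the pattern
    owner = getattr(p.func, "__self__", None)
    return owner if isinstance(owner, re.Pattern) else None

pred = Predicate(re.compile("abc").search)
print(extract_pattern(pred).pattern)  # 'abc'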

@Zac-HD
Member Author

Zac-HD commented Nov 8, 2022

I like all these ideas, want it to be useful for crosshair, and have accordingly taken some notes to update the documentation 😊

@adriangb
Contributor

adriangb commented Nov 8, 2022

I think the issue with that hack is that it is then up to the tool to determine whether it supports that regex or not. For the pydantic-core case mentioned above, I think pydantic-core would just have to error if the pattern or flags aren't supported by Rust's regex crate. The harder cases are where there is no error but the results are different; it would be pretty hard for the tool to figure that out.

If we were to go down this route, I guess we could do something like:

import re
from annotated_types import Predicate

class Regex(Predicate):
    def __init__(self, pattern: re.Pattern[str]) -> None:
        super().__init__(pattern.match)  # fallback: the predicate just calls the compiled pattern
        self.pattern = pattern  # tools that can do better can introspect this instead

That way the fallback is to do the call in Python, but if the tool decides it can do better, it is free to implement it differently by introspecting Regex.pattern.
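i.e. roughly this on the consuming side (sketch only, reusing the Regex class above; check() is just a stand-in for whatever the tool does):

from annotated_types import Predicate

def check(value: str, constraint: Predicate) -> bool:
    pattern = getattr(constraint, "pattern", None)
    if pattern is not None:
        # a tool with its own engine could re-compile pattern.pattern here instead
        return pattern.match(value) is not None
    return bool(constraint.func(value))  # plain-Python fallback for any other Predicate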

@Zac-HD
Member Author

Zac-HD commented Nov 8, 2022

I think the issue with that hack is that it is then up to the tool to determine whether it supports that regex or not. For the pydantic-core case mentioned above, I think pydantic-core would just have to error if the pattern or flags aren't supported by Rust's regex crate. The harder cases are where there is no error but the results are different; it would be pretty hard for the tool to figure that out.

I think this is just an inherent problem of 'there are way too many regex syntaxes' - for example my js-regex project died because I didn't have the spare time to handle charclasses-in-groups, and hypothesis-jsonschema has a horrible hack plumbed through so that $ is handled the JS way (end of string; in Python it also allows a trailing newline).

The proper way to handle this, at least for truly-regular expressions, would be to extend greenery with parsers and unparsers for each alternative syntax that we want to support. See python-jsonschema/hypothesis-jsonschema#85 and HypothesisWorks/hypothesis#3479.

If we were to go down this route, I guess we could do something like class Regex(Predicate):
So that the fallback is to do the call in Python, but if the tool decides that it can do better then it is free to implement it differently by introspecting Regex.pattern.

-1 on this because you can already introspect the contents of a standard Predicate as @pschanely shows above, and I have a very strong desire to keep the API very, very small and clean.
