I think we should remove Regex
#9
Is this
The str vs. bytes thing is hard. In dirty-equals I have logic to allow both ways round, e.g. a str regex can be used to check bytes. However, we don't pretend anywhere else that we deal with bad types, e.g. we have no explanation for what Gt means on a str. On balance I think we should keep this, but perhaps remove
I was thinking that
No, we do, strings are comparable.
My point is that I think we both have clear and incompatible interpretations: regex-for-schemas and regex-for-strategies are just basically different use-cases, and I've run up against the differences many times with
I would say this is a bad usage of the metadata, and instead should be

I'll throw one more idea out there: @samuelcolvin I'm curious how this is handled in Pydantic.

All of this said, if something may cause issues down the road or is not fully baked, I am 100% in favor of removing it for now.
Fully agree - I really really want it, but I want it to be good more than I want it right now.
Another random idea! Unsure about JSONschema regexes, but
The same pattern can match different strings under the Python and JS (required by jsonschema) regex engines! Currently everyone handles this by politely ignoring the problem, but I don't think that's acceptable at the very bottom of the stack, where I hope annotated-types can interop between everything correctly.
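To make the engine divergence concrete, here is one well-known difference: `\d` matches any Unicode decimal digit in Python 3 by default, while in JavaScript (ECMA-262, which JSON Schema specifies) it means only `[0-9]`. A small sketch, using Python's `re.ASCII` flag to approximate the JS behaviour:

```python
import re

# "٣" is ARABIC-INDIC DIGIT THREE (U+0663), a Unicode decimal digit.
# Python's \d accepts it by default...
assert re.fullmatch(r"\d", "٣") is not None

# ...but a JS-style ASCII-only \d (approximated here with re.ASCII) rejects it,
# so the same pattern string accepts different inputs on the two engines.
assert re.fullmatch(r"\d", "٣", re.ASCII) is None
```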
Just as an example, pydantic-core (which will be used by pydantic, which will use annotated-types) uses the "regex" Rust crate to perform regex matching, which (from memory, don't blindly trust me) doesn't implement lookaheads, to improve performance. This is not a weird edge case; lookaheads are fairly commonly used and powerful.
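For readers unfamiliar with the construct in question, a quick illustration of a lookahead in Python's `re` (which supports it) — linear-time engines such as the Rust `regex` crate deliberately omit look-around to keep worst-case matching linear:

```python
import re

# Lookahead assertion: match "foo" only when it is immediately
# followed by "bar", without consuming "bar".
pattern = re.compile(r"foo(?=bar)")

assert pattern.match("foobar") is not None  # lookahead succeeds
assert pattern.match("foobaz") is None      # lookahead fails, no match
```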
Regex performance is pretty tricky! Algorithmically, you can check whether a string matches some 'regular language' in worst-case linear time (in the length of the string); but lookahead, lookbehind, and a few other constructs aren't actually regular in the strict computer-science meaning. For bonus points, Python doesn't actually implement the fast algorithm, so even strictly-regular expressions can take exponential time. This is, uh, suboptimal. The Rust
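A small sketch of the exponential behaviour mentioned above: the language matched by `(a+)+$` is regular (it is just `a+$`), yet CPython's backtracking engine explores exponentially many ways to split the run of `a`s before concluding a match fails:

```python
import re
import time

# Classic pathological pattern for backtracking engines: nested quantifiers.
pattern = re.compile(r"(a+)+$")

for n in (12, 16, 20):
    s = "a" * n + "b"  # the trailing "b" forces every attempt to fail
    start = time.perf_counter()
    assert pattern.match(s) is None
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f}s")  # time roughly doubles with each extra "a"
```

A linear-time engine (RE2, the Rust regex crate) handles the same pattern in time proportional to the input length.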
Since this is on a bit of a tangent, this is a great article that has some eye opening (to me at least) ideas about regex engines: https://blog.burntsushi.net/transducers/ |
Ah, ok, the Rust regex use case is making this clearer for me.

BTW: feels like it might be useful to be able to run pydantic-core-flavored-regex independently? (heck, even making a trivial package that wraps the Rust regex crate might be pretty popular!)

Anyway, I don't feel like I understand the goals of annotated-types enough to take a real position of any kind. (My main interest is whether it would be a good idea to make CrossHair aware of these types; if so, I need very unambiguous interpretations.)

Perhaps my only insight is that a handy cooperative loose-coupling pattern might be "callable thing that can be introspected if you support that type". Heck, regex introspection is even possible with something like
I like all these ideas, want it to be useful for crosshair, and have accordingly taken some notes to update the documentation 😊 |
I think the issue with that hack is that it is then up to the tool to determine if they support that regex or not. For the case mentioned above of

If we were to go down this route, I guess we could do something like:

```python
class Regex(Predicate):
    def __init__(self, pattern: re.Pattern[str]) -> None:
        super().__init__(pattern.match)
        self.pattern = pattern
```

So that the fallback is to do the call in Python, but if the tool decides that it can do better then it is free to implement it differently by introspecting
I think this is just an inherent problem of 'there are way too many regex syntaxes' - for example my

The proper way to handle this, at least for truly-regular expressions, would be to extend
-1 on this because you can already introspect the contents of a standard
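For reference, the stdlib's compiled `re.Pattern` already exposes the metadata a tool would want to introspect:

```python
import re

p = re.compile(r"^[a-z]+$", re.IGNORECASE)

assert p.pattern == r"^[a-z]+$"  # the source pattern string
assert p.flags & re.IGNORECASE   # the flags it was compiled with
assert p.groups == 0             # number of capture groups
```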
I tried writing up `Regex` docs that would describe how people actually want to use them, and... I think this is sufficiently implementation-defined that we should just, well, make downstream implementations define their own Regex metadata with more precise semantics. Either that, or we explicitly separate `PythonRegex(pattern: str | bytes, flags: int)` from `JSONschemaRegex(pattern: str)` and make people choose one.
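Purely as a sketch of what that split might look like, neither class exists in annotated-types and the shapes below are hypothetical:

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical: Python-engine semantics, so bytes patterns and re flags
# are meaningful.
@dataclass(frozen=True)
class PythonRegex:
    pattern: Union[str, bytes]
    flags: int = 0

# Hypothetical: JSON Schema semantics, so only ECMA-262 str patterns.
@dataclass(frozen=True)
class JSONschemaRegex:
    pattern: str

py = PythonRegex(r"\d+", re.ASCII)
js = JSONschemaRegex("[0-9]+")
assert py.flags == re.ASCII
assert js.pattern == "[0-9]+"
```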