AVRO-2906: Traversal validation #936
Attn: @kojiromike, this PR is in relation to this email thread on the dev list |
Some typos
Commenting on the code first, because I'm not in a place where I can run this just now.
I think it's great. I like the bfs and overall approach. I do think there's a fair amount of repetition in the individual validate methods, in that they're all basically `return node if <valid> else None`. It makes me think that the `valid: bool` primitive can still be of value.
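One way the repetition mentioned above could be factored out is to keep each check as a plain bool predicate and wrap it once. This is only a sketch: `ValidationNode` here is a simplified stand-in, and `node_if` and the two validators are hypothetical names, not code from this PR.

```python
from collections import namedtuple

# Hypothetical stand-in for the node type the traversal passes around.
ValidationNode = namedtuple("ValidationNode", ["schema", "datum", "name"])


def node_if(predicate):
    """Turn a bool-returning predicate into a 'node or None' validator."""
    def validator(node):
        return node if predicate(node.datum) else None
    return validator


# Each validator now only states its bool condition.
validate_boolean = node_if(lambda datum: isinstance(datum, bool))
validate_string = node_if(lambda datum: isinstance(datum, str))

node = ValidationNode(schema="string", datum="hello", name=None)
result_ok = validate_string(node)    # the node itself
result_bad = validate_boolean(node)  # None
```

The "node or None" contract stays in one place, so each schema type only has to say what makes a datum valid.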
I also think we've reinvented functools.singledispatch. The fact of that predates this pr, but maybe now's the moment to think about it.
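For readers unfamiliar with it, this is roughly what dispatching validation through `functools.singledispatch` could look like. The schema classes below are heavily simplified, hypothetical stand-ins (not the classes in `avro.schema`), and `is_valid` is an illustrative name.

```python
from functools import singledispatch


# Hypothetical, heavily simplified schema classes (not avro.schema's).
class Schema:
    pass


class StringSchema(Schema):
    pass


class ArraySchema(Schema):
    def __init__(self, items):
        self.items = items


@singledispatch
def is_valid(schema, datum):
    """Fallback for schema types with no registered validator."""
    raise TypeError("no validator for {}".format(type(schema).__name__))


@is_valid.register(StringSchema)
def _(schema, datum):
    return isinstance(datum, str)


@is_valid.register(ArraySchema)
def _(schema, datum):
    # Dispatch again on the item schema for every element.
    return isinstance(datum, list) and all(
        is_valid(schema.items, item) for item in datum
    )


strings = ArraySchema(StringSchema())
good = is_valid(strings, ["a", "b"])
bad = is_valid(strings, ["a", 1])
```

`singledispatch` picks the implementation from the type of the first argument, which is the same job a dict lookup keyed on schema type does by hand.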
The validation tree that this PR illuminates is the same tree as the Schema object tree, right? What if validation was a method of the schema object?
Absolutely this is a disguised form of singledispatch. My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well). It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place. I'm pretty sure I can make another pass at this that uses Any other thoughts while we look at this first pass? |
As a warning this will most likely have conflicts with the changes in #933 just because I'm moving all the exception classes to a dedicated module and clarifying the import syntax. |
Not well-formed ones, anyway. I think that moving validation to a method on schema types opens up so many doors that it'll be more an exercise in avoiding the temptation to blow up the scope of this PR. So I'll try to be conservative and reserve my tangents for later tickets. ;) |
appreciated :) |
@kojiromike, I've moved the validators over to schema objects and have all tests passing again. There might be some cleaning of repetition that could be done by careful restructuring of the class hierarchies, but I didn't want to go too deep into that. In particular the logical types could use some work. I would love input especially on an approach to the I'm open to suggestions. |
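As a rough picture of what "validators on schema objects" can mean, here is a minimal sketch using an abstract `validate` method. The class names and the "return self or None" contract are assumptions made for illustration; the real `avro.schema` classes carry much more state.

```python
import abc


class Schema(abc.ABC):
    """Hypothetical base class; avro.schema.Schema has more to it."""

    @abc.abstractmethod
    def validate(self, datum):
        """Return self if datum matches this schema, else None."""


class IntSchema(Schema):
    def validate(self, datum):
        # bool is a subclass of int in Python, so exclude it explicitly.
        ok = isinstance(datum, int) and not isinstance(datum, bool)
        return self if ok else None


class NullSchema(Schema):
    def validate(self, datum):
        return self if datum is None else None


int_schema = IntSchema()
matched = int_schema.validate(42)
rejected = int_schema.validate(True)
```

With `abc.abstractmethod`, a subclass that forgets to define `validate` cannot be instantiated, which surfaces the omission early.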
@kojiromike I think this is pretty well ready to go. If you feel the same way, I assume the next steps are to create a JIRA issue and attach this to it. Can you think of any testing we might want to do beyond the existing schema tests? |
Let's see what the coverage looks like, but if we're after the debuggability of a hairy, nested record or union schema, then we might need a couple more cases. |
There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix? |
This failure is not caused by this PR. The Windows build is perennially flaky and to be honest I don't know exactly what it's based on that causes this kind of behavior. I am surprised that at least readonly access to Travis isn't universal. (That said, you can set up TravisCI on your fork pretty easily if you want to have full access to what it does.)
yeah, I was able to see that much, but it didn't mean much to me. It did look like a failed setup, didn't appear even to get so far as running tests, but I figured I'd ask in case I was missing something. What's the protocol for proceeding when there are semi-expected test failures like this? |
Hello! I've been either relaunching the failing job until it works, or ignoring the failures on the Windows container... We have a JIRA AVRO-2847 tracking this, but I haven't been able to figure it out (or reproduce it locally). You should be able to see the error logs in the container though. My apologies for this unfortunate state, I relaunched your PR until it's green! |
@kojiromike so all the tests pass now (thanks to @RyanSkraba!). It's unclear to me what the next steps are. The contributing docs in the avro wiki appear to suggest that one can either open a PR here or submit a patch via Jira. Should I create a patch and submit it via Jira? In addition, one of the claims of this PR is that it will be more memory efficient because of moving away from recursive processing. Do I need to add a test for a more deeply nested schema that demonstrates this savings? Should I do some performance analysis that shows that this approach hasn't slowed down the parsing process for large schemas? |
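A test for the deep-nesting claim might follow the shape below. This sketch does not touch Avro at all: it uses plain nested lists and two hypothetical depth functions, and it only demonstrates that an explicit work list avoids the interpreter's recursion limit. It does not measure memory.

```python
import sys


def nested(depth):
    """Build [[[...[]...]]] by wrapping an empty list `depth` times."""
    value = []
    for _ in range(depth):
        value = [value]
    return value


def depth_recursive(value):
    # One Python call per nesting level.
    return 1 + max((depth_recursive(v) for v in value), default=0)


def depth_iterative(value):
    # Explicit work list instead of the call stack.
    deepest, stack = 0, [(value, 1)]
    while stack:
        current, level = stack.pop()
        deepest = max(deepest, level)
        stack.extend((child, level + 1) for child in current)
    return deepest


deep = nested(sys.getrecursionlimit() * 2)
iterative_depth = depth_iterative(deep)
try:
    depth_recursive(deep)
    recursion_failed = False
except RecursionError:
    recursion_failed = True
```

A real test would build a deeply nested Avro schema instead of lists and call the validation entry point on it.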
@cewing if you have time to add additional tests that demonstrate the memory properties or performance characteristics that would be excellent. I planned to take a deeper look at this on the weekend, but life has got in the way recently. Apologies. I'll try to get to it in the next couple weeks. |
@cewing I'm going to look at this today. Trying to come up with some good manual test cases. Also, do you want to rebase -i and edit your commits to include the ticket number? |
Oh, by the way, you do not need to support Python 2 anymore in this PR. |
@@ -493,6 +516,17 @@ def __eq__(self, that):
 class PrimitiveSchema(Schema):
     """Valid primitive types are in PRIMITIVE_TYPES."""

     _validators = {
While I started this approach, I realize now that these lambdas don't show up clearly in test coverage reports. We can leave it this way for now, but maybe in the future I should move these to named functions.
yeah, I'd like that.
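A sketch of what the named-function version could look like follows. The keys and checks are assumptions for illustration, not the contents of the PR's `_validators` dict. The likely reason lambdas read poorly in coverage reports is that line-based coverage marks the line holding a lambda as executed once the dict is built, whether or not the lambda body ever runs.

```python
# Hypothetical validator table for illustration; the keys and checks are
# assumptions, not the contents of the PR's _validators dict.

def _is_null(datum):
    return datum is None


def _is_boolean(datum):
    return isinstance(datum, bool)


def _is_string(datum):
    return isinstance(datum, str)


_validators = {
    "null": _is_null,
    "boolean": _is_boolean,
    "string": _is_string,
}

checks = [
    _validators["null"](None),
    _validators["boolean"](0),
    _validators["string"]("avro"),
]
```

Each function body now sits on its own line, so an untested validator shows up as an uncovered line.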
OK, I'm satisfied with this changeset. The tests we have today are pretty decent, I think. For clarity, we might need to shuffle where the tests are around (test_io tests stuff in schema.py, etc). But all that can be done in a future change.
@cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this. |
I’m on the road today, but I do intend to rebase and add issue numbers. I will also update that abstract method.
Typed painstakingly with my thumbs and the active hindrance of autocorrect.
… On Aug 15, 2020, at 8:42 AM, Michael A. Smith ***@***.***> wrote:
@cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this.
fc0dc92
to
3b673c8
Compare
3b673c8
to
6accc84
Compare
@kojiromike the rebase is complete, only one commit now, and it's got the avro issue number in the first line. I held off on the question of abstract method for the base |
@cewing hmm, github is still reporting conflicts. To be sure, did you rebase against |
@kojiromike, I have not yet rebased to master. I was waiting to do that until we resolved the question of the |
Save that work for another time. Maybe I'll do it as part of getting rid of all the Python 2 polyfills. |
Use schema-type specific iterators and validators to allow a breadth-first traversal of a full schema, validating each node as you go. The benefit of this approach is that it allows us to pinpoint the specific part of the schema that has failed validation. Where previously the error message for a large schema would print the entire datum as well as the full schema and say "this is not that", this new approach will print the specific sub-schema that has failed in order to allow more informative errors.

A second improvement is that by traversing the schema instead of processing it recursively, the algorithm is more efficient in use of system resources. In particular for schemas that have lots of nested parts, this will make a difference.

Make the required changes to pass tests in all supported Python versions. This commit removes type hints present in the first commit in order to allow using the code in older Python versions. In addition:
* the use of `str` has been replaced by the compatible `unicode`.
* the ValidationNode namedtuple has been re-expressed in syntax available in all supported Python versions.
* the use of a custom InvalidEvent exception has been replaced by using AvroTypeException.
* all specific single-type validators have been replaced by partials of _validate_type with a tuple of one or more type objects.

Fix typos and raise StopIteration as suggested in code review.

Move the responsibility for validation to the Schema class. Each schema subclass will be responsible for its own validation. This simplifies the structure of io.py, removes the dict lookup of validators, and reduces somewhat the repetition that was in io.py.

Move validators to a class attribute and update method code. This makes things look a little bit cleaner than having the validators right in the midst of the method.

Add arg spec docs to docstring for base Schema class.

Clean up mistakes.
* Fix a docstring to be a more accurate statement of reality.
* Remove an unused import.
* Remove extra blank lines.
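The commit message above mentions building single-type validators as partials of `_validate_type` with a tuple of types. The name comes from the commit message, but the signature and the derived validator names below are assumptions made for this sketch.

```python
from functools import partial


def _validate_type(tp, datum):
    """Return True if datum is an instance of any type in the tuple tp.

    Assumed signature; only the function name appears in the commit message.
    """
    return isinstance(datum, tp)


# Hypothetical per-type validators built by fixing the first argument.
_validate_string = partial(_validate_type, (str,))
_validate_bytes = partial(_validate_type, (bytes, bytearray))

string_ok = _validate_string("avro")
bytes_ok = _validate_bytes(bytearray(b"avro"))
bytes_bad = _validate_bytes("avro")
```

Because `isinstance` accepts a tuple of types, one helper covers validators that allow a single type and those that allow several.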
6accc84
to
10c7deb
Compare
@kojiromike, great. Fixed. All set to merge when you're ready. Thank you for helping to get me through this first contribution! |
Argh, a different, but still unrelated test failure. Can you push an empty commit to retrigger build? |
running now, @kojiromike |
This PR replaces the existing `validate` function in `avro.io` with a new version that uses a traversal-based approach rather than a recursive approach. It also establishes the concept of a "validator" function that handles validation of various schema types and an "iterator" function, which powers the traversal of specific schema types.

The point of the work is two-fold. First, by traversing rather than recursing, exceptions raised by validation can be raised immediately where the problem happened, allowing for error messages that are much more localized. This is an advantage when working with very large schemas. Second, by using traversal instead of recursion, this approach is more conservative of system resources, especially for deeply nested schemas.
The goal of this pr specifically is to spur discussion of the approach we've taken and to seek approval from the community for the change. I anticipate that it will not be acceptable entirely as-is, and will be happy to make any requested changes should the approach be approved.
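To make the traversal idea in the PR summary concrete, here is a minimal breadth-first sketch over schemas written as plain dicts. It is not the PR's implementation: `Node`, `children`, `is_valid`, the path strings, and the dict-based schema shape are all assumptions chosen to keep the example self-contained.

```python
from collections import deque, namedtuple

# Hypothetical node type: the sub-schema, the datum at that position, and a
# path used only for the error message.
Node = namedtuple("Node", ["schema", "datum", "path"])


def children(node):
    """Yield child nodes for container schemas (the 'iterator' role)."""
    schema, datum = node.schema, node.datum
    if schema["type"] == "record":
        for field in schema["fields"]:
            name = field["name"]
            yield Node(field["type"], datum.get(name), node.path + "." + name)
    elif schema["type"] == "array":
        for i, item in enumerate(datum):
            yield Node(schema["items"], item, "{}[{}]".format(node.path, i))


def is_valid(node):
    """Check one node only, not its children (the 'validator' role)."""
    expected = {"record": dict, "array": list, "string": str, "int": int}
    return isinstance(node.datum, expected[node.schema["type"]])


def validate(schema, datum):
    queue = deque([Node(schema, datum, "$")])
    while queue:
        node = queue.popleft()  # FIFO order gives breadth-first traversal
        if not is_valid(node):
            raise TypeError("{}: {!r} is not {}".format(
                node.path, node.datum, node.schema["type"]))
        queue.extend(children(node))
    return True


schema = {"type": "record", "fields": [
    {"name": "id", "type": {"type": "int"}},
    {"name": "tags", "type": {"type": "array", "items": {"type": "string"}}},
]}

ok = validate(schema, {"id": 1, "tags": ["a", "b"]})
try:
    validate(schema, {"id": 1, "tags": ["a", 2]})
except TypeError as exc:
    message = str(exc)  # "$.tags[1]: 2 is not string"
```

The error names only the failing position and its sub-schema, rather than echoing the whole datum and the whole schema.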