AVRO-2906: Traversal validation #936

cewing · 2020-07-24T20:31:02Z

This PR replaces the existing validate function in avro.io with a new version that uses a traversal-based approach rather than a recursive approach. It also establishes the concept of a "validator" function that handles validation of various schema types and an "iterator" function, which powers the traversal of specific schema types.

The point of the work is two-fold. First, by traversing rather than recursing, exceptions raised by validation can be raised immediately where the problem happened, allowing for error messages that are much more localized. This is an advantage when working with very large schemas. Second, by using traversal instead of recursion, this approach is more conservative of system resources, especially for deeply nested schemas.
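As a rough illustration of the traversal idea described above (this is not the code in the PR), the sketch below validates a datum breadth-first with an explicit queue, so a failure can be reported for the exact node where it occurs and no Python recursion is involved. The dict-based mini-schema format, the `PRIMITIVES` table, and the path strings are all assumptions made for this example.

```python
from collections import deque

# Hypothetical mini-schema format for this sketch only: a primitive is a
# string name; records and arrays are dicts, loosely modeled on Avro JSON.
PRIMITIVES = {"int": int, "string": str, "boolean": bool}


def validate(schema, datum):
    """Validate datum breadth-first; raise ValueError naming the failing path."""
    queue = deque([(schema, datum, "$")])
    while queue:
        node_schema, node_datum, path = queue.popleft()
        if isinstance(node_schema, str):
            # Primitive node: check the Python type and stop descending.
            expected = PRIMITIVES[node_schema]
            if not isinstance(node_datum, expected):
                raise ValueError(f"{path}: expected {node_schema}, got {node_datum!r}")
        elif node_schema["type"] == "record":
            if not isinstance(node_datum, dict):
                raise ValueError(f"{path}: expected record, got {node_datum!r}")
            # Enqueue each field instead of recursing into it.
            for field in node_schema["fields"]:
                name = field["name"]
                queue.append((field["type"], node_datum.get(name), f"{path}.{name}"))
        elif node_schema["type"] == "array":
            if not isinstance(node_datum, list):
                raise ValueError(f"{path}: expected array, got {node_datum!r}")
            for index, item in enumerate(node_datum):
                queue.append((node_schema["items"], item, f"{path}[{index}]"))
        else:
            raise ValueError(f"{path}: unsupported schema {node_schema!r}")
    return True
```

With a record containing an array of strings, a bad element is reported by its own path (for example `$.tags[1]`) rather than by dumping the whole datum and schema.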

The goal of this PR specifically is to spur discussion of the approach we've taken and to seek approval from the community for the change. I anticipate that it will not be acceptable entirely as-is, and will be happy to make any requested changes should the approach be approved.

Jira

[x] AVRO-2906

Tests

[x] Validation testing is pretty good already, so this PR does not add any tests. If more are required in order to verify the process works, I will take responsibility for adding them.

Commits

[-] My commits are well formed, but as yet there is no JIRA issue so they do not reference one. I can fix that in any final PR to be submitted.

Documentation

[x] This PR does not add any new functionality. It does however alter existing functionality, at least in terms of how it is output. I am very open to discussion of the best way to document those changes should they be accepted.

cewing · 2020-07-24T20:33:42Z

Attn: @kojiromike, this PR is in relation to this email thread on the dev list

kojiromike

Some typos

lang/py/avro/io.py

kojiromike

Commenting on the code first, because I'm not in a place where I can run this just now.

I think it's great. I like the BFS and overall approach. I do think there's a fair amount of repetition in the individual validate methods, in that they're all basically return node if <valid> else None. It makes me think that that valid:bool primitive can still be of value.

I also think we've reinvented functools.singledispatch. The fact of that predates this pr, but maybe now's the moment to think about it.

The validation tree that this PR illuminates is the same tree as the Schema object tree, right? What if validation was a method of the schema object?
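For readers unfamiliar with the comparison, here is a minimal sketch of what dispatching validation with functools.singledispatch could look like. The schema classes and the `is_valid` function are hypothetical stand-ins for illustration, not the real avro.schema types or anything in this PR.

```python
from functools import singledispatch


# Hypothetical stand-ins for schema classes; not the real avro.schema types.
class Schema:
    pass


class BooleanSchema(Schema):
    pass


class StringSchema(Schema):
    pass


@singledispatch
def is_valid(schema, datum):
    """Fallback when no validator is registered for this schema class."""
    raise NotImplementedError(f"no validator for {type(schema).__name__}")


@is_valid.register(BooleanSchema)
def _(schema, datum):
    # Dispatch happens on the type of the first argument (the schema).
    return isinstance(datum, bool)


@is_valid.register(StringSchema)
def _(schema, datum):
    return isinstance(datum, str)
```

Making validation a method on each schema class gets the same per-type dispatch through ordinary polymorphism, which is the alternative raised in the question above.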

cewing · 2020-07-24T23:23:18Z

Absolutely this is a disguised form of singledispatch. My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well). It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place.

I'm pretty sure I can make another pass at this that uses validate as a method of the schema class. That would be pretty clean.

Any other thoughts while we look at this first pass?

kojiromike · 2020-07-27T12:15:50Z

As a warning, this will most likely conflict with the changes in #933, because I'm moving all the exception classes to a dedicated module and clarifying the import syntax.

kojiromike · 2020-07-27T12:17:54Z

> Absolutely this is a disguised form of singledispatch. My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well). It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place.
>
> I'm pretty sure I can make another pass at this that uses validate as a method of the schema class. That would be pretty clean.
>
> Any other thoughts while we look at this first pass?

Not well-formed ones, anyway. I think that moving validation to a method on schema types opens up so many doors that it'll be more an exercise in avoiding the temptation to blow up the scope of this PR. So I'll try to be conservative and reserve my tangents for later tickets. ;)

cewing · 2020-07-27T17:48:32Z

> I'll try to be conservative and reserve my tangents for later tickets. ;)

appreciated :)

cewing · 2020-07-27T23:25:14Z

@kojiromike, I've moved the validators over to schema objects and have all tests passing again. There might be some cleaning of repetition that could be done by careful restructuring of the class hierarchies, but I didn't want to go too deep into that. In particular the logical types could use some work.

I would love input especially on an approach to the PrimitiveSchema type. I followed the example laid out in .match(self, writer), but the result is not one I'm particularly happy about. It basically moves about half of the original _VALIDATORS mapping into that structure. I'm thinking perhaps something like a class attribute of validator_type that maps the schema type to the builtin type-object to be used. But that still doesn't solve the problem for the compound tests for int and long.

I'm open to suggestions.
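One possible shape for the class-attribute idea floated above, sketched under assumptions: each schema type maps to the accepted Python type plus an optional extra check, which is one way to cover the compound tests for int and long. The class name, the `validator_type` layout, and the "return self or None" convention are hypothetical; only the 32-bit and 64-bit signed ranges come from the Avro specification.

```python
INT_MIN, INT_MAX = -(1 << 31), (1 << 31) - 1      # Avro int: 32-bit signed
LONG_MIN, LONG_MAX = -(1 << 63), (1 << 63) - 1    # Avro long: 64-bit signed


def _in_range(low, high):
    """Build a range check to run after the isinstance test passes."""
    def check(datum):
        return low <= datum <= high
    return check


class PrimitiveSchemaSketch:
    """Hypothetical stand-in; not the PrimitiveSchema class from this PR."""

    # schema type -> (accepted Python type, optional extra check)
    validator_type = {
        "null": (type(None), None),
        "boolean": (bool, None),
        "string": (str, None),
        "int": (int, _in_range(INT_MIN, INT_MAX)),
        "long": (int, _in_range(LONG_MIN, LONG_MAX)),
    }

    def __init__(self, type_name):
        self.type = type_name

    def validate(self, datum):
        """Return self when datum is valid, otherwise None."""
        accepted, extra = self.validator_type[self.type]
        if not isinstance(datum, accepted):
            return None
        if extra is not None and not extra(datum):
            return None
        return self
```

The type check and the range check stay separate, so most primitives need only a type object and the int and long entries add a named range function.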

lang/py/avro/schema.py

cewing · 2020-07-28T18:24:43Z

@kojiromike I think this is pretty well ready to go. If you feel the same way, I assume the next steps are to create a JIRA issue and attach this to it.

Can you think of any testing we might want to do beyond the existing schema tests?

kojiromike · 2020-07-29T17:37:19Z

> Can you think of any testing we might want to do beyond the existing schema tests?

Let's see what the coverage looks like, but if we're after the debuggability of a hairy, nested record or union schema, then we might need a couple more cases.

cewing · 2020-07-29T22:15:41Z

There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix?

kojiromike · 2020-07-30T13:38:29Z

> There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix?

This failure is not caused by this PR. The Windows build is perennially flaky, and to be honest I don't know exactly what about it causes this kind of behavior.

I am surprised that at least readonly access to Travis isn't universal. (That said, you can set up TravisCI on your fork pretty easily if you want to have full access to what it does.)

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

  Restore failed in 95.21 ms for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj.

  Restore completed in 1.23 sec for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\perf\Avro.perf.csproj.

Build FAILED.

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

cewing · 2020-07-30T19:01:00Z

yeah, I was able to see that much, but it didn't mean much to me. It did look like a failed setup, didn't appear even to get so far as running tests, but I figured I'd ask in case I was missing something.

What's the protocol for proceeding when there are semi-expected test failures like this?

RyanSkraba · 2020-08-03T08:44:37Z

Hello! I've been either relaunching the failing job until it works, or ignoring the failures on the Windows container... We have a JIRA AVRO-2847 tracking this, but I haven't been able to figure it out (or reproduce it locally).

You should be able to see the error logs in the container though.

My apologies for this unfortunate state, I relaunched your PR until it's green!

cewing · 2020-08-04T16:28:10Z

@kojiromike so all the tests pass now (thanks to @RyanSkraba!). It's unclear to me what the next steps are. The contributing docs in the avro wiki appear to suggest that one can either open a PR here or submit a patch via Jira. Should I create a patch and submit it via Jira?

In addition, one of the claims of this PR is that it will be more memory efficient because of moving away from recursive processing. Do I need to add a test for a more deeply nested schema that demonstrates this savings? Should I do some performance analysis that shows that this approach hasn't slowed down the parsing process for large schemas?

kojiromike · 2020-08-04T18:01:00Z

@cewing if you have time to add additional tests that demonstrate the memory properties or performance characteristics that would be excellent. I planned to take a deeper look at this on the weekend, but life has got in the way recently. Apologies. I'll try to get to it in the next couple weeks.

kojiromike · 2020-08-14T14:56:05Z

@cewing I'm going to look at this today. Trying to come up with some good manual test cases. Also, do you want to rebase -i and edit your commits to include the ticket number?

kojiromike · 2020-08-14T15:03:35Z

Oh, by the way, you do not need to support Python 2 anymore in this PR.

lang/py/avro/schema.py

kojiromike · 2020-08-15T15:35:26Z

lang/py/avro/schema.py

@@ -493,6 +516,17 @@ def __eq__(self, that):
 class PrimitiveSchema(Schema):
    """Valid primitive types are in PRIMITIVE_TYPES."""

+    _validators = {


While I started this approach, I realize now that these lambdas don't show up clearly in test coverage reports. We can leave it this way for now, but maybe in the future I should move these to named functions.

yeah, I'd like that.

kojiromike

OK, I'm satisfied with this changeset. The tests we have today are pretty decent, I think. For clarity, we might need to shuffle the tests around (test_io tests stuff in schema.py, etc.). But all that can be done in a future change.

kojiromike · 2020-08-15T15:41:42Z

@cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this.

cewing · 2020-08-15T20:48:16Z

I’m on the road today, but I do intend to rebase and add issue numbers. I will also update that abstract method. Typed painstakingly with my thumbs and the active hindrance of autocorrect.

cewing · 2020-08-17T21:41:00Z

@kojiromike the rebase is complete, only one commit now, and it's got the avro issue number in the first line.

I held off on the question of abstract method for the base Schema.validate, see my comment and question above. I am still happy to do that work once the question is resolved.

kojiromike · 2020-08-18T13:34:02Z

@cewing hmm, github is still reporting conflicts. To be sure, did you rebase against apache/master?

cewing · 2020-08-18T17:10:13Z

@kojiromike, I have not yet rebased to master. I was waiting to do that until we resolved the question of the abstractmethod approach. Given that the base Schema class is not an ABC, writing it as an abstractmethod is not allowed. Do we want to update the Schema class to be an ABC in this PR, or save that work for another time?
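A small sketch of the constraint being discussed, with hypothetical class names: Python accepts the abstractmethod decorator on any class, but it is only enforced when the class uses ABCMeta (for example by inheriting from ABC). On a plain base class the decorator has no practical effect, which is why the question of converting Schema to an ABC comes up.

```python
from abc import ABC, abstractmethod


class PlainBase:
    """Hypothetical non-ABC base: the decorator is accepted but not enforced."""

    @abstractmethod
    def validate(self, datum):
        ...


class AbcBase(ABC):
    """Hypothetical ABC base: instantiation fails until validate is overridden."""

    @abstractmethod
    def validate(self, datum):
        ...


class Concrete(AbcBase):
    def validate(self, datum):
        return self


plain = PlainBase()          # succeeds: nothing checks the abstract marker
concrete = Concrete()        # succeeds: validate is implemented

try:
    AbcBase()
    abc_instantiable = True
except TypeError:
    abc_instantiable = False  # ABCMeta refuses to instantiate abstract classes
```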

kojiromike · 2020-08-18T18:31:07Z

Save that work for another time. Maybe I'll do it as part of getting rid of all the Python 2 polyfills.

Use schema-type specific iterators and validators to allow a breadth-first traversal of a full schema, validating each node as you go.

The benefit of this approach is that it allows us to pinpoint the specific part of the schema that has failed validation. Where previously the error message for a large schema would print the entire datum as well as the full schema and say "this is not that", this new approach will print the specific sub-schema that has failed in order to allow more informative errors.

A second improvement is that by traversing the schema instead of processing it recursively, the algorithm is more efficient in use of system resources. In particular for schemas that have lots of nested parts, this will make a difference.

Make the required changes to pass tests in all supported Python versions.

This commit removes type hints present in the first commit in order to allow using the code in older Python versions. In addition:

* the use of `str` has been replaced by the compatible `unicode`.
* the ValidationNode namedtuple has been re-expressed in syntax available in all supported Python versions.
* the use of a custom InvalidEvent exception has been replaced by using AvroTypeException
* all specific single-type validators have been replaced by partials of _validate_type with a tuple of one or more type objects.

Fix typos and raise StopIteration as suggested in code review

Move the responsibility for validation to the Schema class.

Each schema subclass will be responsible for its own validation. This simplifies the structure of io.py, removes the dict lookup of validators, and reduces somewhat the repetition that was in io.py.

Move validators to a class attribute and update method code.

This makes things look a little bit cleaner than having the validators right in the midst of the method.

Add arg spec docs to docstring for base Schema class.

Clean up mistakes.

* Fix a docstring to be a more accurate statement of reality.
* Remove an unused import.
* Remove extra blank lines.

cewing · 2020-08-18T20:23:28Z

@kojiromike, great. Fixed. All set to merge when you're ready. Thank you for helping to get me through this first contribution!

kojiromike · 2020-08-19T00:25:49Z

Argh, a different, but still unrelated test failure. Can you push an empty commit to retrigger build? git commit --allow-empty --fixup HEAD is how I have done this kind of thing.

cewing · 2020-08-19T18:07:46Z

running now, @kojiromike

kojiromike self-requested a review July 24, 2020 20:49

kojiromike reviewed Jul 24, 2020 (three threads on lang/py/avro/io.py, now outdated)

kojiromike reviewed Jul 24, 2020

probot-autolabeler bot added the Python label Jul 24, 2020

kojiromike reviewed Jul 28, 2020 (one thread on lang/py/avro/schema.py, now outdated)

kojiromike reviewed Jul 28, 2020 (one thread on lang/py/avro/schema.py)

cewing changed the title from "Speculative: Traversal validation" to "AVRO-2906: Traversal validation" Jul 29, 2020

cewing marked this pull request as ready for review July 29, 2020 19:51

kojiromike reviewed Aug 15, 2020 (one thread on lang/py/avro/schema.py)

kojiromike reviewed Aug 15, 2020

kojiromike approved these changes Aug 15, 2020

cewing force-pushed the traversal-validation branch from fc0dc92 to 3b673c8 August 17, 2020 17:44

cewing force-pushed the traversal-validation branch from 3b673c8 to 6accc84 August 17, 2020 17:52

cewing force-pushed the traversal-validation branch from 6accc84 to 10c7deb August 18, 2020 20:18

fixup! AVRO-2906: Convert validation to a traversal-based approach

87cf1ef

kojiromike merged commit efb1231 into apache:master Aug 19, 2020
