Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-2906: Traversal validation #936

Merged
merged 2 commits into from Aug 19, 2020

Conversation

cewing
Copy link
Contributor

@cewing cewing commented Jul 24, 2020

This PR replaces the existing validate function in avro.io with a new version that uses a traversal-based approach rather than a recursive approach. It also establishes the concept of a "validator" function that handles validation of various schema types and an "iterator" function, which powers the traversal of specific schema types.

The point of the work is two-fold. First, by traversing rather than recursing, exceptions raised by validation can be raised immediately where the problem happened, allowing for error messages that are much more localized. This is an advantage when working with very large schemas. Second, by using traversal instead of recursion, this approach is more conservative of system resources, especially for deeply nested schemas.

The goal of this pr specifically is to spur discussion of the approach we've taken and to seek approval from the community for the change. I anticipate that it will not be acceptable entirely as-is, and will be happy to make any requested changes should the approach be approved.

Jira

Tests

  • [ x] Validation testing is pretty good already, so this PR does not add any tests. If more are required in order to verify the process works, I will take responsibility for adding them.

Commits

  • [ -] My commits are well formed, but as yet there is no JIRA issue so they do not reference one. I can fix that in any final PR to be submitted.

Documentation

  • [ x] This PR does not add any new functionality. It does however alter existing functionality, at least in terms of how it is output. I am very open to discussion of the best way to document those changes should they be accepted.

@cewing
Copy link
Contributor Author

cewing commented Jul 24, 2020

Attn: @kojiromike, this PR is in relation to this email thread on the dev list

@kojiromike kojiromike self-requested a review July 24, 2020 20:49
Copy link
Contributor

@kojiromike kojiromike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typos

lang/py/avro/io.py Outdated Show resolved Hide resolved
lang/py/avro/io.py Outdated Show resolved Hide resolved
lang/py/avro/io.py Outdated Show resolved Hide resolved
Copy link
Contributor

@kojiromike kojiromike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting on the code first, because I'm not in a place where I can run this just now.

I think it's great. I like the bfs and overall approach. I do think there's a fair amount of repetition in the individual validate methods that they're all basically return node if <valid> else None. It makes me think that that valid:bool primitive can be still be of value.

I also think we've reinvented functools.singledispatch. The fact of that predates this pr, but maybe now's the moment to think about it.

The validation tree that this PR illuminates is the same tree as the Schema object tree, right? What if validation was a method of the schema object?

@cewing
Copy link
Contributor Author

cewing commented Jul 24, 2020

Absolutely this is a disguised form of singledispatch. My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well). It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place.

I'm pretty sure I can make another pass at this that uses validate as a method of the schema class. That would be pretty clean.

Any other thoughts while we look at this first pass?

@kojiromike
Copy link
Contributor

kojiromike commented Jul 27, 2020

As a warning this will most likely have conflicts with the changes in #933 just because I'm moving all the exception classes to a dedicated module and clarifying the import syntax.

@kojiromike
Copy link
Contributor

Absolutely this is a disguised form of singledispatch. My first approach here was really designed to be a drop-in replacement for the existing approach, which is why it keeps things like dict lookup of validator by schema type (and now by logical_type as well). It was originally meant to be monkey patched into place so that we could solve our problem with error messages while the longer process of this conversation took place.

I'm pretty sure I can make another pass at this that uses validate as a method of the schema class. That would be pretty clean.

Any other thoughts while we look at this first pass?

Not well-formed ones, anyway. I think that moving validation to a method on schema types opens up so many doors that it'll be more an exercise in avoiding the temptation to blow up the scope of this PR. So I'll try to be conservative and reserve my tangents for later tickets. ;)

@cewing
Copy link
Contributor Author

cewing commented Jul 27, 2020

I'll try to be conservative and reserve my tangents for later tickets. ;)

appreciated :)

@cewing
Copy link
Contributor Author

cewing commented Jul 27, 2020

@kojiromike, I've moved the validators over to schema objects and have all tests passing again. There might be some cleaning of repetition that could be done by careful restructuring of the class hierarchies, but I didn't want to go too deep into that. In particular the logical types could use some work.

I would love input especially on an approach to the PrimitiveSchema type. I followed the example laid out in .match(self, writer), but the result is not one I'm particularly happy about. It basically moves about half of the original _VALIDATORS mapping into that structure. I'm thinking perhaps something like a class attribute of validator_type that maps the schema type to the builtin type-object to be used. But that still doesn't solve the problem for the compound tests for int and long.

I'm open to suggestions.

lang/py/avro/schema.py Outdated Show resolved Hide resolved
@cewing
Copy link
Contributor Author

cewing commented Jul 28, 2020

@kojiromike I think this is pretty well ready to go. If you feel the same way, I assume the next steps are to create a JIRA issue and attach this to it.

Can you think of any testing we might want to do beyond the existing schema tests?

@kojiromike
Copy link
Contributor

Can you think of any testing we might want to do beyond the existing schema tests?

Let's see what the coverage looks like, but if we're after the debuggability of a hairy, nested record or union schema, then we might need a couple more cases.

@cewing cewing changed the title Speculative: Traversal validation AVRO-2906: Traversal validation Jul 29, 2020
@cewing cewing marked this pull request as ready for review July 29, 2020 19:51
@cewing
Copy link
Contributor Author

cewing commented Jul 29, 2020

There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix?

@kojiromike
Copy link
Contributor

There appears to be a failure in testing on a windows container. The other two runs of the tests are passing, however. Not sure what to do about that. Would someone with access look at the test results and let me know if there's something I can fix?

This failure is not caused by this PR. The Windows build is perennial flaky and to be honest I don't know exactly what it's based on that causes this kind of behavior.

I am surprised that at least readonly access to Travis isn't universal. (That said, you can set up TravisCI on your fork pretty easily if you want to have full access to what it does.)

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

  Restore failed in 95.21 ms for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj.

  Restore completed in 1.23 sec for C:\Users\travis\build\apache\avro\lang\csharp\src\apache\perf\Avro.perf.csproj.

Build FAILED.

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1603: Apache.Avro depends on Newtonsoft.Json (>= 10.0.3) but Newtonsoft.Json 10.0.3 was not found. An approximate best match of Newtonsoft.Json 11.0.2 was resolved. [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package nunit3testadapter. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package NUnit.ConsoleRunner. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package Microsoft.NET.Test.Sdk. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

C:\Users\travis\build\apache\avro\lang\csharp\src\apache\test\Avro.test.csproj : error NU1101: Unable to find package System.CodeDom. No packages exist with this id in source(s): Microsoft Visual Studio Offline Packages [C:\Users\travis\build\apache\avro\lang\csharp\Avro.sln]

@cewing
Copy link
Contributor Author

cewing commented Jul 30, 2020

yeah, I was able to see that much, but it didn't mean much to me. It did look like a failed setup, didn't appear even to get so far as running tests, but I figured I'd ask in case I was missing something.

What's the protocol for proceeding when there are semi-expected test failures like this?

@RyanSkraba
Copy link
Contributor

Hello! I've been either relaunching the failing job until it works, or ignoring the failures on the Windows container... We have a JIRA AVRO-2847 tracking this, but I haven't been able to figure it out (or reproduce it locally).

You should be able to see the error logs in the container though.

My apologies for this unfortunate state, I relaunched your PR until it's green!

@cewing
Copy link
Contributor Author

cewing commented Aug 4, 2020

@kojiromike so all the tests pass now (thanks to @RyanSkraba!). It's unclear to me what the next steps are. The contributing docs in the avro wiki appear to suggest that one can either open a PR here or submit a patch via Jira. Should I create a patch and submit it via Jira?

In addition, one of the claims of this PR is that it will be more memory efficient because of moving away from recursive processing. Do I need to add a test for a more deeply nested schema that demonstrates this savings? Should I do some performance analysis that shows that this approach hasn't slowed down the parsing process for large schemas?

@kojiromike
Copy link
Contributor

@cewing if you have time to add additional tests that demonstrate the memory properties or performance characteristics that would be excellent. I planned to take a deeper look at this on the weekend, but life has got in the way recently. Apologies. I'll try to get to it in the next couple weeks.

@kojiromike
Copy link
Contributor

@cewing I'm going to look at this today. Trying to come up with some good manual test cases. Also, do you want to rebase -i and edit your commits to include the ticket number?

@kojiromike
Copy link
Contributor

Oh, by the way, you do not need to support Python 2 anymore in this PR.

@@ -493,6 +516,17 @@ def __eq__(self, that):
class PrimitiveSchema(Schema):
"""Valid primitive types are in PRIMITIVE_TYPES."""

_validators = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I started this approach, I realize now that these lambdas don't show up clearly in test coverage reports. We can leave it this way for now, but maybe in the future I should move these to named functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I'd like that.

Copy link
Contributor

@kojiromike kojiromike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm satisfied with this changeset. The tests we have today are pretty decent, I think. For clarity, we might need to shuffle where the tests are around (test_io tests stuff in schema.py, etc). But all that can be done in a future change.

@kojiromike
Copy link
Contributor

@cewing LMK if you plan to do any other changes. When you're ready, LMK and I can merge this.

@cewing
Copy link
Contributor Author

cewing commented Aug 15, 2020 via email

@cewing
Copy link
Contributor Author

cewing commented Aug 17, 2020

@kojiromike the rebase is complete, only one commit now, and it's got the avro issue number in the first line.

I held off on the question of abstract method for the base Schema.validate, see my comment and question above. I am still happy to do that work once the question is resolved.

@kojiromike
Copy link
Contributor

@cewing hmm, github is still reporting conflicts. To be sure, did you rebase against apache/master?

@cewing
Copy link
Contributor Author

cewing commented Aug 18, 2020

@kojiromike, I have not yet rebased to master. I was waiting to do that until we resolved the question of the abstractmethod approach. Given that the base Schema class is not an ABC, writing it as an abstractmethod is not allowed. Do we want to update the Schema class to be an ABC in this PR, or save that work for another time?

@kojiromike
Copy link
Contributor

Save that work for another time. Maybe I'll do it as part of getting rid of all the Python 2 polyfills.

Use schema-type specific iterators and validators to allow a
breadth-first traversal of a full schema, validating each node
as you go.

The benefit of this approach is that it allows us to pin-point
the specific part of the schema that has failed validation.
Where previously the error message for a large schema would print
the entire datum as well as the full schema and say "this is not
that", this new approach will print the specific sub-schema that has
failed in order to allow more informative errors.

A second improvement is that by traversing the schema instead of
processing it recursively, the algorithm is more efficient in use
of system resources.  In particular for schemas that have lots of
nested parts, this will make a difference.

Make the required changes to pass tests in all supported python versions.

This commit removes type hints present in the first commit in order to
allow using the code in older Python versions.

In addition:
  * the use of `str` has been replaced by the compatible `unicode`.
  * the ValidationNode namedtuple has been re-expressed in syntax available
    in all supported Python versions.
  * the use of a custom InvalidEvent exception has been replace by using
    AvroTypeException
  * all specific single-type validators have been replaced by partials of
    _validate_type with a tuple of one or more type objects.

Fix typos and raise StopIteration as suggested in code review

Move the responsibility for validation to the Schema class.

Each schema subclass will be responsible for its own validation. This
simplifies the structure of io.py, removes the dict lookup of validators,
and reduces somewhat the repetition that was in io.py.

Move validators to a class attribute and update method code.

This makes things look a little bit cleaner than having the validators right in the midst of the method.

Add arg spec docs to docstring for base Schema class.

Clean up mistakes.

* Fix a docstring to be a more accurate statement of reality.
* Remove an unused import.
* Remove extra blank lines.
@cewing
Copy link
Contributor Author

cewing commented Aug 18, 2020

@kojiromike, great. Fixed. All set to merge when you're ready. Thank you for helping to get me through this first contribution!

@kojiromike
Copy link
Contributor

Argh, a different, but still unrelated test failure. Can you push an empty commit to retrigger build? git commit --allow-empty --fixup HEAD is how I have done this kind of thing.

@cewing
Copy link
Contributor Author

cewing commented Aug 19, 2020

running now, @kojiromike

@kojiromike kojiromike merged commit efb1231 into apache:master Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants