Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-1938: Add fingerprinting support to Python implementation #1181

Merged
merged 17 commits into from Jul 17, 2023

Conversation

subhashb
Copy link
Contributor

@subhashb subhashb commented Apr 6, 2021

This PR adds support for generating fingerprints from schema objects. Schema fingerprints are extracted by invoking the fingerprint method on the schema object. By default, fingerprints are generated with the CRC-64 algorithm. Optionally, an algorithm, specified by the algorithm name used by hashlib, can be supplied.

All algorithms supported by hashlib are made available, but Avro recommends using one among CRC-32, MD5, and SHA256 as per needs.

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests:
    • Fingerprint test cases in FingerprintTestCase class
    • Miscellaneous fingerprint test cases in TestMisc class: test_unsupported_fingerprint_algorithm and test_less_popular_fingerprint_algorithm test methods

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    Functionality confirms to Avro Specification.

With this change, Schema fingerprints can be extracted by
invoking the `fingerprint` method on the schema object. By default,
fingerprints will be generated with the CRC-64 algorithm. Optinally,
the algorithm can be supplied.

All algorithms supported by hashlib are available, but Avro
recommends using one among CRC-32, MD5, and SHA256 as per needs.
@github-actions github-actions bot added the Python label Apr 6, 2021
lang/py/avro/schema.py Outdated Show resolved Hide resolved
@subhashb subhashb changed the title AVRO-1938: Add support for fingerprinting schemas AVRO-1938: Add fingerprinting support to Python implementation Apr 6, 2021
@kojiromike
Copy link
Contributor

@subhashb In a recent PR I've moved the linting system over to use black. I guess if you autoformat your code with black it'll make the conflicts pretty easy to fix.

@subhashb
Copy link
Contributor Author

subhashb commented Jul 8, 2021

@kojiromike

Good on moving to black! 🎉 🎉
I have changed the code to use frozenset instead of set and fixed conflicts after running black.

@kojiromike
Copy link
Contributor

@subhashb could you please take a crack at adding type hints to the changes here?

lang/py/avro/schema.py Outdated Show resolved Hide resolved
lang/py/avro/schema.py Outdated Show resolved Hide resolved
lang/py/avro/schema.py Outdated Show resolved Hide resolved
lang/py/avro/schema.py Outdated Show resolved Hide resolved
lang/py/avro/test/test_schema.py Fixed Show fixed Hide fixed
lang/py/avro/test/test_schema.py Fixed Show fixed Hide fixed
Addresses PR 1181 review comments. Methods within Fingerprint mixin
have been made available at the module level, including static
variables used in fingerprinting. This PR has been synced with latest
master.
@subhashb
Copy link
Contributor Author

@kojiromike All comments have been addressed, along with type hints where applicable.

@subhashb
Copy link
Contributor Author

@RyanSkraba @kojiromike Can you please help me understand what could be causing this error? Do we need to flush caches?

As fas as I can see, these failures are not due to changes in this PR's branch.

@martin-g
Copy link
Member

There is some problem with Python since recently (today ?!).
It fails even on master branch with these errors: https://github.com/apache/avro/actions/workflows/test-lang-py.yml

@subhashb
Copy link
Contributor Author

@martin-g Could it have something to do with this change: pytest-dev/pytest-xdist#821 ?

tox uses pytest-xdist internally. I am unsure; the last pytest-xdist release was 11 days ago. Either way, we may have to clear the cache at the repo level and retry. I reinitialized my virtualenv, and tests still succeed, so I assume the error is because of broken dependencies.

@RyanSkraba If you agree, we will need to clear the cache manually.

@subhashb
Copy link
Contributor Author

@kojiromike Synced with latest master. Can you please approve workflow runs and review the PR?

lang/py/avro/schema.py Outdated Show resolved Hide resolved
@subhashb
Copy link
Contributor Author

@kojiromike Addressed review comments.

lang/py/avro/schema.py Outdated Show resolved Hide resolved
@subhashb
Copy link
Contributor Author

@kojiromike Addressed further review comments and left a note about try/except format instead of assertRaises.

@kojiromike kojiromike merged commit f504265 into apache:master Jul 17, 2023
20 checks passed
martin-g pushed a commit that referenced this pull request Jul 17, 2023
* AVRO-1938 Add support for fingerprinting schemas

With this change, Schema fingerprints can be extracted by
invoking the `fingerprint` method on the schema object. By default,
fingerprints will be generated with the CRC-64 algorithm. Optinally,
the algorithm can be supplied.

All algorithms supported by hashlib are available, but Avro
recommends using one among CRC-32, MD5, and SHA256 as per needs.

* AVRO-1938 Fix issue with AbstractSet typecheck

* Format with black

* Freeze Supported Algorithms Set

This commit addresses review comments and freezes the supported
fingerprinting algorithms set.

* Minor lint fix with black

* Address Typecheck issues with Frozenset

* Fold Fingerprint Mixin within Schema

Addresses PR 1181 review comments. Methods within Fingerprint mixin
have been made available at the module level, including static
variables used in fingerprinting. This PR has been synced with latest
master.

* Add type hints to fingerprint methods/variables

* Fix incorrect import sorting in schema.py to pass lint check

* Address @kojiromike Jul 16 review comments

* Address @kojiromike Jul 16 review comments - 2

* Address @kojiromike Jul 17 review comments

* Fix black lint issue

(cherry picked from commit f504265)
RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
…e#1181)

* AVRO-1938 Add support for fingerprinting schemas

With this change, Schema fingerprints can be extracted by
invoking the `fingerprint` method on the schema object. By default,
fingerprints will be generated with the CRC-64 algorithm. Optinally,
the algorithm can be supplied.

All algorithms supported by hashlib are available, but Avro
recommends using one among CRC-32, MD5, and SHA256 as per needs.

* AVRO-1938 Fix issue with AbstractSet typecheck

* Format with black

* Freeze Supported Algorithms Set

This commit addresses review comments and freezes the supported
fingerprinting algorithms set.

* Minor lint fix with black

* Address Typecheck issues with Frozenset

* Fold Fingerprint Mixin within Schema

Addresses PR 1181 review comments. Methods within Fingerprint mixin
have been made available at the module level, including static
variables used in fingerprinting. This PR has been synced with latest
master.

* Add type hints to fingerprint methods/variables

* Fix incorrect import sorting in schema.py to pass lint check

* Address @kojiromike Jul 16 review comments

* Address @kojiromike Jul 16 review comments - 2

* Address @kojiromike Jul 17 review comments

* Fix black lint issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants