Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3065: Introduce UUID logical type to Python implementation #1148

Merged
merged 3 commits into from Apr 16, 2021

Conversation

subhashb
Copy link
Contributor

This PR introduces the UUID logical type implementation that was missing in the
primary python implementation. A new UUIDSchema has been introduced that
validates for valid UUID4 string values.

Closes: https://issues.apache.org/jira/browse/AVRO-3065

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests:
    • Test for valid UUID Logical Type in Schema
    • Test for valid UUID in values

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

This PR introduces the UUID logical type implementation that was missing in the
primary python implementation. A new `UUIDSchema` has been introduced, and test
cases for schema and io have been updated.

Closes: https://issues.apache.org/jira/browse/AVRO-3065
@RyanSkraba RyanSkraba changed the title AVRO-3065 Introduce UUID logical type to Python implementation AVRO-3065: Introduce UUID logical type to Python implementation Mar 26, 2021
@@ -80,7 +80,7 @@
'{"type": "long", "logicalType": "timestamp-micros"}',
datetime.datetime(2000, 1, 18, 2, 2, 1, 123499, tzinfo=avro.timezones.tst)
),
('{"type": "string", "logicalType": "uuid"}', u'12345abcd'),
('{"type": "string", "logicalType": "uuid"}', u'570feebe-2bbc-4937-98df-285944e1dbbd'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test case, should this be:

Suggested change
('{"type": "string", "logicalType": "uuid"}', u'570feebe-2bbc-4937-98df-285944e1dbbd'),
('{"type": "string", "logicalType": "uuid"}', uuid.UUID('570feebe-2bbc-4937-98df-285944e1dbbd')),

I'm not sure if this suggestion is correct, but it's what I would expect once UUID logical type is implemented!

Since this is a new feature, should we go with the expectation that the in-memory representation of a UUID datum is a UUID object instead of a string? Should we accept either/or when serializing, but always deserialize to one or the other? What do you think?

lang/py/avro/schema.py Outdated Show resolved Hide resolved
Copy link
Contributor

@RyanSkraba RyanSkraba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

My comment on the test case is that if we're (finally!) introducing the UUID type to python, we might use the uuid type for these data (like we use datetime instances for the date and time logical types).

This could break code for developers that have schemas with uuid logical types and are happy to treat them as strings for now, though. Much like using avro-1.7.7 would happily ignore date logical types and treat them as ints...

If we decide there's no real advantage to using uuid instances, this looks good though!

@subhashb
Copy link
Contributor Author

Hey @RyanSkraba!

Thank you for the review.

Yes, I was thinking primarily about backward compatibility when I represented UUID as a string.

Some aspects (and past experiences) that made string representation a better option for me:

  • UUID objects are not serializable by default.
  • A project persisting objects (with a UUID logical field) from Avro today will need to be enhanced to handle UUID objects (you mentioned this too) manually
  • Marshmallow's UUID field accepts a string but always returns a UUID object
  • Django has a UUID Field, but Postgres is the only database with native UUID data type support. Rest interpret the value as a string. Using the Django UUID field still requires you to generate the UUID yourself.
  • SQLAlchemy treats UUIDs as strings and needs as_uuid=True to explicitly treat the value as a UUID object.

Having said that, UUID object representation might be a better long-term option. Please let me know what you think.

This commit changes UUID validation to accept UUID values of any
version, instead of being locked to version 4.

Also, return `None` on failed validation instead of `False`, for
consistency with others.
@subhashb
Copy link
Contributor Author

f7ec849 updates the validation to accept UUIDs of different versions, instead of being locked to version 4.

@RyanSkraba
Copy link
Contributor

Thanks for the explanation -- your reasoning is sound. Especially since the uuid package doesn't really add a lot of functionality over strings (in comparison to using datetime instances over numbers).

@RyanSkraba
Copy link
Contributor

Hey, if nobody has any other comments, let's just merge this! Thanks for the contribution!

@RyanSkraba RyanSkraba merged commit 0c9a60e into apache:master Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants