-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AVRO-2474][WIP] - unit analysis for Avro python schema #841
Conversation
Adds a "unit" field to Avro schema of type 'record' on a per-field basis. Units that are compatible are automatically converted. Units that are not compatible cause schema matching to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a ton of work and care went into it. I have some hesitation about the amount of very strict runtime type checking going on. I think that leads to a potentially brittle result that could be avoided by a combination of duck typing and abstract base classes. Type annotations may help too, but we don't test those automatically yet.
Besides that my comments are largely about style.
Very nice work.
""" | ||
As defined in the Avro specification, we call the schema encoded | ||
in the data the "writer's schema", and the schema expected by the | ||
reader the "reader's schema". | ||
""" | ||
self._writers_schema = writers_schema | ||
self._readers_schema = readers_schema | ||
if (unit_db is not None) and (not isinstance(unit_db, UnitAnalysisDB)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using comment-style type annotations in lieu of runtime type checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you referring to this?
https://www.python.org/dev/peps/pep-0484/#type-comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We still don't automatically check type hints yet. (It's something I very much want to do, see AVRO-2387, but it's a complex task to get right so it's been slow.) That said, I think the entire python avro implementation suffers from overly strict runtime type checking, and I'd like us to get better at that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Now that we've dropped Python 2 support, type hints will work! Feel free to add actual type hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a good idea! I will probably not do any further work on this unless/until the avro committee votes this feature in, but I'll add type hints if we decide to proceed with the PR.
from avro import constants, schema, timezones | ||
import avro | ||
from avro import constants, schema, timezones, units | ||
from avro.units import UnitAnalysisDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider accessing this from units
since it makes name spacing clearer and units was imported in the previous line.
@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder): | |||
else: | |||
self.skip_data(field.type, decoder) | |||
|
|||
if self.unit_db is not None: | |||
wsid = str(id(writers_schema)) | |||
for rfname in read_record.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider dropping the .keys()
part.
@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder): | |||
else: | |||
self.skip_data(field.type, decoder) | |||
|
|||
if self.unit_db is not None: | |||
wsid = str(id(writers_schema)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the technical motivation for using the memory address of the writers schema here. Would its fingerprint be sufficient, or does it have to be precisely the same exact writers schema object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema fingerprint might work too - I just wanted something to uniquely identify the schema, that was cheap to compute. I'm not very familiar with the properties of schema fingerprints, except that they seem to come in multiple hash sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation for tagging it at all was to save redundant compute - don't resolve things that have already been resolved. It touches on the fact that schema resolution happens for each data element, which itself seems redundant to me, except in the particular case where unions are in play.
https://issues.apache.org/jira/browse/AVRO-2748
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the memoization is a good idea. I'm just asking if the fingerprint would be an effective key. If it is, then two instances created in running schema.parse
on the same schema string would have the same wsid, and so the memoization would have a wider net.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confident fingerprint would work. It might also be more portable across avro language implementations. My main concern is that computing a hash on a schema is itself expensive, at least compared to "get the memory address of this object". Are schema fingerprints themselves memoized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, sorry, i don't think python has a fingerprint implementation at all just yet. There was work on it, but I can't find it in the code now (looking on my phone, anyway). As I recall, it ought to be quite memoizable.
""" | ||
self._unit_map = {} | ||
for ud in udefs: | ||
self._unit_map[ud.name] = ud |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider writing this as a dictionary comprehension.
lmap[u] = e | ||
return CanonicalUnit(self.coef.mul(rhs.coef), lmap) | ||
|
||
def div(self, rhs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be implemented as operators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried doing these with operators, and I kept getting some kind of compile error, so I just went with named methods. It's not really a user-facing API anyway.
|
||
class BaseUnitDef(UnitDef): | ||
def __init__(self, prop, name, abbv): | ||
super().__init__(prop, name, abbv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary?
abbv = property(lambda self: self._abbv) | ||
|
||
def __repr__(self): | ||
return str(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the built in repr insufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the built-in repr wasn't giving me a human readable string, IIRC. either that or I was doing something wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. I'd suggest steering clear of making repr
and str
the same. Or more's the point, please be very careful not to let the repr
look like any other type of object. I think there was a case above where a Units type object had a repr
that made it look exactly like a dictionary. That's definitely a situation to avoid.
expr = property(lambda self: self._expr) | ||
|
||
def __str__(self): | ||
return "DerivedUnitDef(%s, %s, %s, %s)" % \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should be the repr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, I need to rationalize the way I'm defining repr
vs str
return DerivedUnitDef(obj, uname.name, uabbv.name, ucoef, uexpr) | ||
raise UnitParseException('Unrecognized unit type: "%s"' % (utype)) | ||
|
||
UNIT_NAME_REGEX = re.compile("^([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put these at the top of the module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed - there are also a few other places I ought to be making use of constants
@kojiromike thank you for reviewing! I can certainly investigate use of abstract base classes. Most of the code in At some point this will require attention to unit testing and documentation, and I could use some guidance on where each of those lives and how they should be composed. |
Is the avro community still interested in this, or should I close it? |
Make sure you have checked all steps below.
Jira
Tests
This feature requires unit testing, however the initial draft has not added tests. This PR should not be approved or merged until testing is added.
Commits
Documentation
This feature would require additional documentation which I haven't written yet.
The following code demonstrates unit analysis with python schema:
The program above will output: