Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AVRO-2474][WIP] - unit analysis for Avro python schema #841

Closed
wants to merge 1 commit into from

Conversation

erikerlandson
Copy link

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    This feature requires unit testing, however the initial draft has not added tests. This PR should not be approved or merged until testing is added.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    This feature would require additional documentation which I haven't written yet.

The following code demonstrates unit analysis with python schema:

#!/usr/bin/env python3

import avro.schema, avro.units
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# An example of a unit definition db
# format described in JSON Schema:
# https://github.com/erikerlandson/unit-analysis-json-schema/blob/master/unit-analysis-schema.json
udb = avro.units.parse("""
[
  {"unit": "base", "name": "meter", "abbv": "m"},

  {"unit": "base", "name": "second", "abbv": "s"},

  {"unit": "derived", "name": "minute", "coef": 60, "expr": "second"},

  {"unit": "derived", "name": "hour", "coef": 60, "expr": "minute"},

  {"unit": "derived", "name": "foot", "coef": 0.3048, "expr": "meter"},
    
  {"unit": "derived", "name": "mile", "coef": 5280, "expr": "foot"},

  {"unit": "derived",
   "name": "liter", "abbv": "L",
   "coef": {"coef": "rational", "num": 1, "den": 1000},
   "expr": {"lhs": "meter", "op": "^", "rhs": 3}
  },

  {"unit": "derived",
   "name": "gravity", "abbv": "g",
   "coef": 9.8,
   "expr": {"lhs": "meter", "op": "/", "rhs": {"lhs": "second", "op": "^", "rhs": 2}}
  },

  {"unit": "derived", "name": "kilo", "coef": 1000, "expr": "unitless"}
]
""")

# An Avro schema with additional "unit" expressions.
# unit expression format is same as in unit db above
writeschema = avro.schema.parse("""
{
     "type": "record",
     "name": "demo_units",
     "fields": [
       { "name": "distance",     "type": "float",  "unit": {"lhs": "kilo", "op": "*", "rhs": "meter"} },
       { "name": "velocity",     "type": "float",  "unit": {"lhs": "meter", "op": "/", "rhs": "second"} },
       { "name": "acceleration", "type": "float",  "unit": {"lhs": "foot", "op": "/", "rhs": {"lhs": "second", "op": "^", "rhs": 2}} }
     ]
}
""")

# write some data assuming the write schema above:
writer = DataFileWriter(open("/tmp/unitdata.avro", "wb"), DatumWriter(), writeschema)
writer.append({"distance": 5, "velocity": 20, "acceleration": 32})
writer.close()

# Now define a reader schema which expects different (but compatible) units:
readschema = avro.schema.parse("""
{
     "type": "record",
     "name": "demo_units",
     "fields": [
       { "name": "distance",     "type": "float",  "unit": "mile" },
       { "name": "velocity",     "type": "float",  "unit": {"lhs": "mile", "op": "/", "rhs": "hour"} },
       { "name": "acceleration", "type": "float",  "unit": "gravity" }
     ]
}
""")

# read our data back in, with read schema having differing units:
reader = DataFileReader(open("/tmp/unitdata.avro", "rb"), \
                        DatumReader(writers_schema=writeschema, \
                                    readers_schema=readschema, \
                                    unit_db=udb))

# Print out the data - verify that proper unit conversions were applied
for record in reader:
    print(record)
reader.close()

The program above will output:

{'distance': 3.1068559611866697, 'velocity': 44.73872584108805, 'acceleration': 0.995265306122449}

Adds a "unit" field to Avro schema of type 'record' on a per-field
basis. Units that are compatible are automatically converted. Units
that are not compatible cause schema matching to fail.
@erikerlandson
Copy link
Author

cc @nielsbasjes @RyanSkraba

@Fokko Fokko requested a review from kojiromike March 29, 2020 10:35
Copy link
Contributor

@kojiromike kojiromike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a ton of work and care went into it. I have some hesitation about the amount of very strict runtime type checking going on. I think that leads to a potentially brittle result that could be avoided by a combination of duck typing and abstract base classes. Type annotations may help too, but we don't test those automatically yet.

Besides that my comments are largely about style.

Very nice work.

"""
As defined in the Avro specification, we call the schema encoded
in the data the "writer's schema", and the schema expected by the
reader the "reader's schema".
"""
self._writers_schema = writers_schema
self._readers_schema = readers_schema
if (unit_db is not None) and (not isinstance(unit_db, UnitAnalysisDB)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using comment-style type annotations in lieu of runtime type checking.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We still don't automatically check type hints yet. (It's something I very much want to do, see AVRO-2387, but it's a complex task to get right so it's been slow.) That said, I think the entire python avro implementation suffers from overly strict runtime type checking, and I'd like us to get better at that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Now that we've dropped Python 2 support, type hints will work! Feel free to add actual type hints.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a good idea! I will probably not do any further work on this unless/until the avro committee votes this feature in, but I'll add type hints if we decide to proceed with the PR.

from avro import constants, schema, timezones
import avro
from avro import constants, schema, timezones, units
from avro.units import UnitAnalysisDB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider accessing this from units since it makes name spacing clearer and units was imported in the previous line.

@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder):
else:
self.skip_data(field.type, decoder)

if self.unit_db is not None:
wsid = str(id(writers_schema))
for rfname in read_record.keys():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider dropping the .keys() part.

@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder):
else:
self.skip_data(field.type, decoder)

if self.unit_db is not None:
wsid = str(id(writers_schema))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the technical motivation for using the memory address of the writers schema here. Would its fingerprint be sufficient, or does it have to be precisely the same exact writers schema object?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema fingerprint might work too - I just wanted something to uniquely identify the schema, that was cheap to compute. I'm not very familiar with the properties of schema fingerprints, except that they seem to come in multiple hash sizes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation for tagging it at all was to save redundant compute - don't resolve things that have already been resolved. It touches on the fact that schema resolution happens for each data element, which itself seems redundant to me, except in the particular case where unions are in play.
https://issues.apache.org/jira/browse/AVRO-2748

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the memoization is a good idea. I'm just asking if the fingerprint would be an effective key. If it is, then two instances created in running schema.parse on the same schema string would have the same wsid, and so the memoization would have a wider net.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confident fingerprint would work. It might also be more portable across avro language implementations. My main concern is that computing a hash on a schema is itself expensive, at least compared to "get the memory address of this object". Are schema fingerprints themselves memoized?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, sorry, i don't think python has a fingerprint implementation at all just yet. There was work on it, but I can't find it in the code now (looking on my phone, anyway). As I recall, it ought to be quite memoizable.

"""
self._unit_map = {}
for ud in udefs:
self._unit_map[ud.name] = ud
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider writing this as a dictionary comprehension.

lmap[u] = e
return CanonicalUnit(self.coef.mul(rhs.coef), lmap)

def div(self, rhs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be implemented as operators?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried doing these with operators, and I kept getting some kind of compile error, so I just went with named methods. It's not really a user-facing API anyway.


class BaseUnitDef(UnitDef):
def __init__(self, prop, name, abbv):
super().__init__(prop, name, abbv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary?

abbv = property(lambda self: self._abbv)

def __repr__(self):
return str(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the built in repr insufficient?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the built-in repr wasn't giving me a human readable string, IIRC. either that or I was doing something wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting. I'd suggest steering clear of making repr and str the same. Or more's the point, please be very careful not to let the repr look like any other type of object. I think there was a case above where a Units type object had a repr that made it look exactly like a dictionary. That's definitely a situation to avoid.

expr = property(lambda self: self._expr)

def __str__(self):
return "DerivedUnitDef(%s, %s, %s, %s)" % \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be the repr?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I need to rationalize the way I'm defining repr vs str

return DerivedUnitDef(obj, uname.name, uabbv.name, ucoef, uexpr)
raise UnitParseException('Unrecognized unit type: "%s"' % (utype))

UNIT_NAME_REGEX = re.compile("^([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*$")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put these at the top of the module

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - there are also a few other places I ought to be making use of constants

@erikerlandson
Copy link
Author

@kojiromike thank you for reviewing! I can certainly investigate use of abstract base classes. Most of the code in units.py isn't user/api facing, and so I'm unsure how much payoff there would be. However a few objects like UnitAnalysisDB are more exposed and might in theory appear in differing implementations.

At some point this will require attention to unit testing and documentation, and I could use some guidance on where each of those lives and how they should be composed.

@erikerlandson
Copy link
Author

Is the avro community still interested in this, or should I close it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants