[AVRO-2474][WIP] - unit analysis for Avro python schema #841

erikerlandson · 2020-03-08T00:55:00Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Avro Jira issues and references them in the PR title.
- https://issues.apache.org/jira/browse/AVRO-2474

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
This feature requires unit testing; however, the initial draft does not add any tests yet. This PR should not be approved or merged until tests are added.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
This feature would require additional documentation which I haven't written yet.

The following code demonstrates unit analysis with a Python Avro schema:

#!/usr/bin/env python3

import avro.schema, avro.units
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# An example of a unit definition db
# format described in JSON Schema:
# https://github.com/erikerlandson/unit-analysis-json-schema/blob/master/unit-analysis-schema.json
udb = avro.units.parse("""
[
  {"unit": "base", "name": "meter", "abbv": "m"},

  {"unit": "base", "name": "second", "abbv": "s"},

  {"unit": "derived", "name": "minute", "coef": 60, "expr": "second"},

  {"unit": "derived", "name": "hour", "coef": 60, "expr": "minute"},

  {"unit": "derived", "name": "foot", "coef": 0.3048, "expr": "meter"},
    
  {"unit": "derived", "name": "mile", "coef": 5280, "expr": "foot"},

  {"unit": "derived",
   "name": "liter", "abbv": "L",
   "coef": {"coef": "rational", "num": 1, "den": 1000},
   "expr": {"lhs": "meter", "op": "^", "rhs": 3}
  },

  {"unit": "derived",
   "name": "gravity", "abbv": "g",
   "coef": 9.8,
   "expr": {"lhs": "meter", "op": "/", "rhs": {"lhs": "second", "op": "^", "rhs": 2}}
  },

  {"unit": "derived", "name": "kilo", "coef": 1000, "expr": "unitless"}
]
""")

# An Avro schema with additional "unit" expressions.
# unit expression format is same as in unit db above
writeschema = avro.schema.parse("""
{
     "type": "record",
     "name": "demo_units",
     "fields": [
       { "name": "distance",     "type": "float",  "unit": {"lhs": "kilo", "op": "*", "rhs": "meter"} },
       { "name": "velocity",     "type": "float",  "unit": {"lhs": "meter", "op": "/", "rhs": "second"} },
       { "name": "acceleration", "type": "float",  "unit": {"lhs": "foot", "op": "/", "rhs": {"lhs": "second", "op": "^", "rhs": 2}} }
     ]
}
""")

# write some data assuming the write schema above:
writer = DataFileWriter(open("/tmp/unitdata.avro", "wb"), DatumWriter(), writeschema)
writer.append({"distance": 5, "velocity": 20, "acceleration": 32})
writer.close()

# Now define a reader schema which expects different (but compatible) units:
readschema = avro.schema.parse("""
{
     "type": "record",
     "name": "demo_units",
     "fields": [
       { "name": "distance",     "type": "float",  "unit": "mile" },
       { "name": "velocity",     "type": "float",  "unit": {"lhs": "mile", "op": "/", "rhs": "hour"} },
       { "name": "acceleration", "type": "float",  "unit": "gravity" }
     ]
}
""")

# read our data back in, with read schema having differing units:
reader = DataFileReader(open("/tmp/unitdata.avro", "rb"),
                        DatumReader(writers_schema=writeschema,
                                    readers_schema=readschema,
                                    unit_db=udb))

# Print out the data - verify that proper unit conversions were applied
for record in reader:
    print(record)
reader.close()

The program above will output:

{'distance': 3.1068559611866697, 'velocity': 44.73872584108805, 'acceleration': 0.995265306122449}
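As a sanity check, the printed values can be reproduced by hand from the coefficients in the unit database. The sketch below uses plain arithmetic only and does not touch the proposed avro.units module; the constant and variable names are made up for illustration.

```python
import math

# Conversion factors to the base units, taken from the unit db above.
FOOT = 0.3048        # meters per foot
MILE = 5280 * FOOT   # meters per mile
HOUR = 60 * 60       # seconds per hour (60 minutes of 60 seconds)
GRAVITY = 9.8        # (meter / second^2) per "gravity"
KILO = 1000          # unitless prefix

# The record as written: kilo*meter, meter/second, foot/second^2.
written = {"distance": 5, "velocity": 20, "acceleration": 32}

# The record as read: mile, mile/hour, gravity.
converted = {
    "distance": written["distance"] * KILO / MILE,
    "velocity": written["velocity"] / (MILE / HOUR),
    "acceleration": written["acceleration"] * FOOT / GRAVITY,
}
print(converted)
```

The results agree with the output above to within floating-point rounding.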

Adds a "unit" field to Avro schema of type 'record' on a per-field basis. Units that are compatible are automatically converted. Units that are not compatible cause schema matching to fail.

erikerlandson · 2020-03-08T19:21:31Z

cc @nielsbasjes @RyanSkraba

kojiromike

This looks like a ton of work and care went into it. I have some hesitation about the amount of very strict runtime type checking going on. I think that leads to a potentially brittle result that could be avoided by a combination of duck typing and abstract base classes. Type annotations may help too, but we don't test those automatically yet.

Besides that my comments are largely about style.

Very nice work.

kojiromike · 2020-03-30T01:07:20Z

lang/py/avro/io.py

    """
    As defined in the Avro specification, we call the schema encoded
    in the data the "writer's schema", and the schema expected by the
    reader the "reader's schema".
    """
    self._writers_schema = writers_schema
    self._readers_schema = readers_schema
+    if (unit_db is not None) and (not isinstance(unit_db, UnitAnalysisDB)):


Consider using comment-style type annotations in lieu of runtime type checking.

are you referring to this?
https://www.python.org/dev/peps/pep-0484/#type-comments

Yes. We still don't automatically check type hints yet. (It's something I very much want to do, see AVRO-2387, but it's a complex task to get right so it's been slow.) That said, I think the entire python avro implementation suffers from overly strict runtime type checking, and I'd like us to get better at that.

Update: Now that we've dropped Python 2 support, type hints will work! Feel free to add actual type hints.

Thanks, that's a good idea! I will probably not do any further work on this unless/until the avro committee votes this feature in, but I'll add type hints if we decide to proceed with the PR.
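For reference, a minimal sketch of what type hints in place of the isinstance check might look like. Both classes below are simplified stand-ins written for this example, not the real avro.io.DatumReader or the PR's UnitAnalysisDB.

```python
from typing import Optional


class UnitAnalysisDB:
    """Hypothetical stub standing in for the PR's avro.units.UnitAnalysisDB."""


class DatumReader:
    """Simplified stand-in for avro.io.DatumReader (assumption, not the real class)."""

    def __init__(
        self,
        writers_schema: Optional[object] = None,
        readers_schema: Optional[object] = None,
        unit_db: Optional[UnitAnalysisDB] = None,
    ) -> None:
        # No isinstance() check here: a static checker such as mypy can
        # enforce the annotation, and a duck-typed unit db still works at runtime.
        self._writers_schema = writers_schema
        self._readers_schema = readers_schema
        self.unit_db = unit_db


reader = DatumReader(unit_db=UnitAnalysisDB())
```

The annotations document the expected type without rejecting compatible objects at runtime, which is the brittleness the review is concerned about.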

kojiromike · 2020-03-30T01:09:58Z

lang/py/avro/io.py

-from avro import constants, schema, timezones
+import avro
+from avro import constants, schema, timezones, units
+from avro.units import UnitAnalysisDB


Consider accessing this from units since it makes name spacing clearer and units was imported in the previous line.

kojiromike · 2020-03-30T01:14:05Z

lang/py/avro/io.py

@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder):
      else:
        self.skip_data(field.type, decoder)

+    if self.unit_db is not None:
+      wsid = str(id(writers_schema))
+      for rfname in read_record.keys():


Consider dropping the .keys() part.
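To illustrate the suggestion: iterating over a dict already yields its keys, so the two loops below are equivalent. The record contents are invented for the example.

```python
# A stand-in for the decoded record; the field values are illustrative.
read_record = {"distance": 3.1, "velocity": 44.7, "acceleration": 0.99}

# Explicit .keys() call, as in the diff.
with_keys = [rfname for rfname in read_record.keys()]

# Iterating the dict directly yields the same keys in the same order.
without_keys = [rfname for rfname in read_record]
```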

kojiromike · 2020-03-30T01:15:29Z

lang/py/avro/io.py

@@ -936,6 +945,18 @@ def read_record(self, writers_schema, readers_schema, decoder):
      else:
        self.skip_data(field.type, decoder)

+    if self.unit_db is not None:
+      wsid = str(id(writers_schema))


I'm not sure I understand the technical motivation for using the memory address of the writers schema here. Would its fingerprint be sufficient, or does it have to be precisely the same exact writers schema object?

Schema fingerprint might work too - I just wanted something to uniquely identify the schema, that was cheap to compute. I'm not very familiar with the properties of schema fingerprints, except that they seem to come in multiple hash sizes.

The motivation for tagging it at all was to save redundant compute - don't resolve things that have already been resolved. It touches on the fact that schema resolution happens for each data element, which itself seems redundant to me, except in the particular case where unions are in play.
https://issues.apache.org/jira/browse/AVRO-2748

Right, the memoization is a good idea. I'm just asking if the fingerprint would be an effective key. If it is, then two instances created in running schema.parse on the same schema string would have the same wsid, and so the memoization would have a wider net.

I'm confident fingerprint would work. It might also be more portable across avro language implementations. My main concern is that computing a hash on a schema is itself expensive, at least compared to "get the memory address of this object". Are schema fingerprints themselves memoized?

Ugh, sorry, I don't think Python has a fingerprint implementation at all just yet. There was work on it, but I can't find it in the code now (looking on my phone, anyway). As I recall, it ought to be quite memoizable.

kojiromike · 2020-03-30T01:16:41Z

lang/py/avro/units.py

+    """
+    self._unit_map = {}
+    for ud in udefs:
+      self._unit_map[ud.name] = ud


Consider writing this as a dictionary comprehension.
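A sketch of the suggested rewrite, using a namedtuple as a hypothetical stand-in for the PR's unit definition objects:

```python
from collections import namedtuple

# Hypothetical stand-in for the PR's unit definition classes.
UnitDef = namedtuple("UnitDef", ["name", "abbv"])
udefs = [UnitDef("meter", "m"), UnitDef("second", "s")]

# Loop form, as in the diff.
unit_map_loop = {}
for ud in udefs:
    unit_map_loop[ud.name] = ud

# Equivalent dictionary comprehension.
unit_map = {ud.name: ud for ud in udefs}
```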

kojiromike · 2020-03-30T01:47:02Z

lang/py/avro/units.py

+        lmap[u] = e
+    return CanonicalUnit(self.coef.mul(rhs.coef), lmap)
+
+  def div(self, rhs):


Should these be implemented as operators?

I actually tried doing these with operators, and I kept getting some kind of compile error, so I just went with named methods. It's not really a user-facing API anyway.
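For what it's worth, here is a minimal sketch of how operator methods could look. One detail that is easy to trip over: in Python 3 the division operator dispatches to `__truediv__`, while `__div__` is Python 2 only. The class below is a simplification (a float coefficient instead of the PR's coefficient objects) and is not the PR's CanonicalUnit.

```python
class CanonicalUnit:
    """Simplified sketch: a float coefficient and a map of base unit -> exponent."""

    def __init__(self, coef, units):
        self.coef = coef
        # Drop zero exponents so cancelled units disappear from the map.
        self.units = {u: e for u, e in units.items() if e != 0}

    def __mul__(self, rhs):
        if not isinstance(rhs, CanonicalUnit):
            return NotImplemented
        merged = dict(self.units)
        for u, e in rhs.units.items():
            merged[u] = merged.get(u, 0) + e
        return CanonicalUnit(self.coef * rhs.coef, merged)

    def __truediv__(self, rhs):
        if not isinstance(rhs, CanonicalUnit):
            return NotImplemented
        merged = dict(self.units)
        for u, e in rhs.units.items():
            merged[u] = merged.get(u, 0) - e
        return CanonicalUnit(self.coef / rhs.coef, merged)

    def __pow__(self, exponent):
        scaled = {u: e * exponent for u, e in self.units.items()}
        return CanonicalUnit(self.coef ** exponent, scaled)

    def __repr__(self):
        return "CanonicalUnit(%r, %r)" % (self.coef, self.units)


meter = CanonicalUnit(1.0, {"meter": 1})
second = CanonicalUnit(1.0, {"second": 1})
foot = CanonicalUnit(0.3048, {"meter": 1})

# foot / second^2, written with operators instead of named methods.
accel = foot / second ** 2
```

Since these classes are not user-facing, named methods are also a reasonable choice; operators mainly make expressions like the last line easier to read.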

kojiromike · 2020-03-30T01:48:11Z

lang/py/avro/units.py

+
+class BaseUnitDef(UnitDef):
+  def __init__(self, prop, name, abbv):
+    super().__init__(prop, name, abbv)


This seems unnecessary?
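To spell out the point: a subclass that only forwards its arguments to super().__init__() can omit the method, because the parent's __init__ is inherited. The classes below are simplified stand-ins for the ones in the diff.

```python
class UnitDef:
    """Simplified stand-in for the PR's UnitDef."""

    def __init__(self, prop, name, abbv):
        self._prop = prop
        self._name = name
        self._abbv = abbv

    name = property(lambda self: self._name)
    abbv = property(lambda self: self._abbv)


class BaseUnitDef(UnitDef):
    """No __init__ needed: UnitDef.__init__ is inherited unchanged."""


bu = BaseUnitDef({"unit": "base"}, "meter", "m")
```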

kojiromike · 2020-03-30T01:48:52Z

lang/py/avro/units.py

+  abbv = property(lambda self: self._abbv)
+
+  def __repr__(self):
+    return str(self)


Is the built in repr insufficient?

the built-in repr wasn't giving me a human readable string, IIRC. either that or I was doing something wrong.

That's interesting. I'd suggest steering clear of making repr and str the same. Or, more to the point, please be very careful not to let the repr look like any other type of object. I think there was a case above where a Units type object had a repr that made it look exactly like a dictionary. That's definitely a situation to avoid.

kojiromike · 2020-03-30T01:49:17Z

lang/py/avro/units.py

+  expr = property(lambda self: self._expr)
+
+  def __str__(self):
+    return "DerivedUnitDef(%s, %s, %s, %s)" % \


Seems like this should be the repr?

agreed, I need to rationalize the way I'm defining repr vs str
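One common way to split the two, sketched under the assumption of a simplified four-argument constructor (the PR's class takes an extra prop argument): __repr__ returns an unambiguous, constructor-like string, and __str__ returns something readable. The __str__ format here is invented for the example.

```python
class DerivedUnitDef:
    """Simplified stand-in for the PR's DerivedUnitDef."""

    def __init__(self, name, abbv, coef, expr):
        self.name = name
        self.abbv = abbv
        self.coef = coef
        self.expr = expr

    def __repr__(self):
        # Unambiguous and clearly not a dict or any other built-in type.
        return "DerivedUnitDef(%r, %r, %r, %r)" % (
            self.name, self.abbv, self.coef, self.expr)

    def __str__(self):
        # Human-readable form (illustrative format only).
        return "%s (%s) = %s * %s" % (self.name, self.abbv, self.coef, self.expr)


minute = DerivedUnitDef("minute", "min", 60, "second")
```

With this split, print(minute) shows the readable form, while the interactive prompt, containers, and logs that use %r show the constructor-like form.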

kojiromike · 2020-03-30T01:52:19Z

lang/py/avro/units.py

+    return DerivedUnitDef(obj, uname.name, uabbv.name, ucoef, uexpr)
+  raise UnitParseException('Unrecognized unit type: "%s"' % (utype))
+
+UNIT_NAME_REGEX = re.compile("^([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*$")


Please put these at the top of the module

agreed - there are also a few other places I ought to be making use of constants
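A small sketch of the suggested layout, with the constant defined right after the imports. The pattern is also rewritten with character classes, which should match the same names as the original alternation; is_valid_unit_name is a hypothetical helper added for the example.

```python
import re

# Module-level constants go at the top of the module, after the imports.
# Same language as "^([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*$", written more compactly.
UNIT_NAME_REGEX = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def is_valid_unit_name(name):
    """Return True if name starts with a letter and is otherwise alphanumeric."""
    return UNIT_NAME_REGEX.match(name) is not None
```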

erikerlandson · 2020-03-30T23:54:41Z

@kojiromike thank you for reviewing! I can certainly investigate the use of abstract base classes. Most of the code in units.py isn't user/API facing, so I'm unsure how much payoff there would be. However, a few objects like UnitAnalysisDB are more exposed and might in theory appear in differing implementations.

At some point this will require attention to unit testing and documentation, and I could use some guidance on where each of those lives and how they should be composed.

erikerlandson · 2021-02-20T16:16:45Z

Is the Avro community still interested in this, or should I close it?

AVRO-2474: Add unit analysis to python Avro schema

9b78dcd

Adds a "unit" field to Avro schema of type 'record' on a per-field basis. Units that are compatible are automatically converted. Units that are not compatible cause schema matching to fail.

probot-autolabeler bot added the Python label Mar 8, 2020

Fokko requested a review from kojiromike March 29, 2020 10:35

kojiromike reviewed Mar 30, 2020

This was referenced Jul 22, 2020

Attempt to support scala.js erikerlandson/coulomb#97

Merged

Redesign QuantityParser to not require reflective compilation erikerlandson/coulomb#99

Closed

erikerlandson closed this Mar 6, 2021
