Support serializing core astropy classes with YAML #5486

taldcroft · 2016-11-19T18:42:03Z

This adds support for safe YAML serializing (dump and load) of:

np.ndarray
Quantity, Angle, Latitude, Longitude
Time, TimeDelta
EarthLocation
SkyCoord

The current implementation does not allow for arbitrary subclasses of these objects in order to ensure that the load process is "safe" and only uses trusted constructors.

This PR is on the path to allowing mixin columns in ECSV that fully round-trip. It also relates somewhat to #5471.

@mhvk @cdeil @astrofrog

taldcroft · 2016-11-19T18:45:18Z

FYI, here is a somewhat complicated example:

In [7]:     c = SkyCoord([[1,2],[3,4]], [[5,6], [7,8]],
   ...:                  unit='deg', frame=frame,
   ...:                  obstime=Time.now(),
   ...:                  location=EarthLocation(1000, 2000, 3000, unit=u.km))

In [8]: print(dump(c))
!astropy.coordinates.sky_coordinate.SkyCoord
alt: !astropy.units.Quantity
  class: Latitude
  unit: &id001 !astropy.units.Unit {name: deg}
  value: !numpy.ndarray
    __ndarray__: !!binary |
      QUFBQUFBQUFGRUFBQUFBQUFBQVlRQUFBQUFBQUFCeEFBQUFBQUFBQUlFQT0=
    dtype: float64
    shape: !!python/tuple [2, 2]
az: !astropy.units.Quantity
  class: Longitude
  unit: *id001
  value: !numpy.ndarray
    __ndarray__: !!binary |
      QUFBQUFBQUE4RDhBQUFBQUFBQUFRQUFBQUFBQUFBaEFBQUFBQUFBQUVFQT0=
    dtype: float64
    shape: !!python/tuple [2, 2]
frame: altaz
location: !astropy.coordinates.earth.EarthLocation
  ellipsoid: WGS84
  x: !astropy.units.Quantity
    class: Quantity
    unit: &id002 !astropy.units.Unit {name: km}
    value: 1000.0
  y: !astropy.units.Quantity
    class: Quantity
    unit: *id002
    value: 2000.0
  z: !astropy.units.Quantity
    class: Quantity
    unit: *id002
    value: 3000.0
obstime: !astropy.time.Time {format: datetime, in_subfmt: '*', jd1: 2457711.5, jd2: 0.7801903949421296,
  out_subfmt: '*', precision: 3, scale: utc}
obswl: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: micron}
  value: 1.0
pressure: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: hPa}
  value: 0.0
relative_humidity: 0
representation: spherical
temperature: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: deg_C}
  value: 0.0

mhvk · 2016-11-19T20:09:22Z

I like where this is going! But I'll need to think more to comment very sensibly. For now,

For SkyCoord, it turns out to be very handy that it has a list of things that are required to reproduce it faithfully. Perhaps we should have the same for other classes? E.g., in Time._apply, the list you have here is also used.
Slightly relatedly, to what extent could we combine the information on how to serialise the objects with what pickle does? In the end, it needs the same information. Similarly, any effort to serialise in JSON would need the same information as well. Could we rely more on self.__dict__?
There is something to be said for letting classes provide their own interface; this could perhaps be made part of info?

p.s. You'd need to add coords.Distance, u.Magnitude, u.Dex, u.Decibel to the list of Quantity subclasses that should be supported (where I'd need to check your code would work for the function quantities)

taldcroft · 2016-11-19T22:45:35Z

For SkyCoord, it turns out to be very handy that it has a list of things that are required to reproduce it faithfully. Perhaps we should have the same for other classes? E.g., in Time._apply, the list you have here is also used.

The basic strategy here was mostly looking at the class constructor API and then supplementing that with knowledge from methods like Time._apply. But for instance with SkyCoord one doesn't get anything very useful by looking at SkyCoord._apply since it basically delegates to __init__ with an input SkyCoord object.

Maybe your point here is that it would be handy if these objects had an info.yaml_representer method that returned a map (dict) of what is need to construct a copy (but see my response to your third point)?

Slightly relatedly, to what extent could we combine the information on how to serialise the objects with what pickle does? In the end, it needs the same information. Similarly, any effort to serialise in JSON would need the same information as well. Could we rely more on self.dict?

One could rely more on self.__dict__, but I think that is a slippery path to re-writing pickle. Note that I believe yaml does use pickle for dumping arbitrary objects, but interestingly yaml fails to dump a Unit object:

>>> yaml.dump(u.m)
/Users/aldcroft/anaconda/envs/astropy3/lib/python3.3/copyreg.py in _reduce_ex(self, proto)
     63     else:
     64         if base is self.__class__:
---> 65             raise TypeError("can't pickle %s objects" % base.__name__)
     66         state = base(self)
     67     args = (self.__class__, base, state)

TypeError: can't pickle int objects

So the only way I could figure out how to accomplish YAML dump/load in a reasonable time was this hand-curated way. But it is true that the representer functions could easily be adapted to JSON and that was something I had in mind.

There is something to be said for letting classes provide their own interface; this could perhaps be made part of info?

I thought about this as well (and even wrote code for that at one point). The issue I have is that not all classes that we want to serialize will have an info attribute, and some like np.ndarray are not even under our control. So at the end of the day it is probably just easier to have all the representers and constructors in one module?

taldcroft · 2016-11-19T22:48:33Z

BTW, I just remembered that this functionality requires the latest release of YAML (3.12) which fixes a problem where the dumper code is trying to compare ndarray objects to None or () and this makes Py3 unhappy. (This is an odd story, because when running interactively that just generates a warning, but under pytest it turns into an exception. I don't understand that.)

MSeifert04 · 2016-11-20T14:32:00Z

This is an odd story, because when running interactively that just generates a warning, but under pytest it turns into an exception. I don't understand that.

What kind of warning? For example deprecation warnings are automatically turned into exceptions in astropy-conftest.

pllim · 2016-11-21T16:26:36Z

Travis failed with AssertionError (files not closed) and AttributeError.

mhvk · 2016-11-21T18:55:37Z

@taldcroft - the problem with basic yaml not working on units is that UnitBase has a meta class InheritDocstrings. With our normal installation, even just

yaml.dump(u.UnitBase)

fails, while if I remove @six.add_metaclass(InheritDocstrings), it works.

(This may well be a bug in yaml.)

mhvk · 2016-11-21T19:18:21Z

The problem is indeed the metaclass with yaml. E.g., in astropy 1.2.1, I can do

yaml.dump(Time)
# "!!python/name:astropy.time.core.Time ''\n"

but in current master it gives the same error as for u.UnitBase because Time now has the ShapedLikeNDArray metaclass.

mhvk · 2016-11-21T19:40:34Z

I don't know how relevant it is, but here is a work-around for the inability to serialize metaclasses:

import yaml
import astropy.units as u
class AstropyDumper(yaml.Dumper):
    pass

AstropyDumper.add_multi_representer(type, AstropyDumper.yaml_representers[type])
yaml.dump(u.m, Dumper=AstropyDumper)
# "!!python/object/apply:astropy.units.core._recreate_irreducible_unit\nargs:\n- !!python/name:astropy.units.core.IrreducibleUnit ''\n- [m, meter]\n- true\nstate:\n  __doc__: 'meter: base unit of length in SI'\n  _format: {}\n  _long_names: [meter]\n  _names: [m, meter]\n  _short_names: [m]\n"

(It seems python-yaml requires one to sign-up to submit bug reports; sadly, this means they won't get one...)

taldcroft · 2016-11-21T20:12:34Z

@mhvk - interesting and illuminating. So will you OK in the end with a "hand-curated" approach which assumes that we can put a selected set of attributes into a flat dict and then use this in __init__ (with some possible post-processing as in Time)?

As you can see the raw dump of u.m is a bit of a mess that I don't really want to dive into. In other words, I'm not sure how to keep this idea both simple and safe (no arbitrary code) without using the curated approach.

mhvk · 2016-11-21T20:28:01Z

astropy/utils/tests/test_yaml.py

+@pytest.mark.parametrize('c', [u.m, u.m / u.s, u.hPa, u.dimensionless_unscaled])
+def test_unit(c):
+    cy = load(dump(c))
+    assert c == cy


For u.m, the units should be identical. So, do add assert c is cy for that case (maybe separate test).

Actually, this should be true for all the units that are defined, i.e., all but u.m / u.s.

mhvk · 2016-11-21T20:38:14Z

@taldcroft - yes, good to get back to the higher-level question. I think it is fine to take a staged approach, i.e., start with what you have here, with stuff is hardcoded in the encoder/decoder, and try later to generalize it.

eteq · 2016-11-29T06:41:29Z

@taldcroft - this is very neat and a promising step!

Two large-scale questions:

Why *_constructor/*_representer functions instead of <class>._yaml_repr_/<class>._yaml_init_ or similar methods? The latter strikes me as a good way to not have to manage the registry? (at least on the representer side, anyway)
This is fairly YAML-specific and meant to stay that way, right? So the load/dump names should perhaps be load_yaml/dump_yaml? (a bit nitpicking for a WIP, but it might be less work down the road to make such changes sooner)

taldcroft · 2016-11-29T13:46:23Z

Why _constructor/_representer functions instead of .yaml_repr/.yaml_init or similar methods? The latter strikes me as a good way to not have to manage the registry? (at least on the representer side, anyway)

There is at least one class (numpy) where we can't take that approach. (Yes, upstream patch is possible but not really useful/practical here).

But more to the point, having a registry gets a little tricky or unsavory when you think about details. First, I think it would involve requiring some mixin metaclass for every YAML-enabled class in order to do the registry. So I'm not super keen on that, but maybe people don't mind, e.g. putting this YAML metaclass on Unit?

Then it gets slightly tricky with the AstropyDumper/Loader classes. You see that right now all the supported classes get explicitly put into the class with class method calls when importing utils.yaml. If the classes register themselves, I'm not sure it is possible to do this without importing yaml (the pyyaml version), and we definitely don't want to require importing yaml in order to import Quantity or Time.

So all in all I think the outside-in approach I've done is just conceptually simpler to develop, and probably maintain. One can actually see all the constructors and representers in one place.

This is fairly YAML-specific and meant to stay that way, right? So the load/dump names should perhaps be load_yaml/dump_yaml? (a bit nitpicking for a WIP, but it might be less work down the road to make such changes sooner)

About naming, I'm just following existing naming convention for the real yaml, e.g.:

from astropy.utils import yaml
...
out = yaml.dump(astropy_obj)

So effectively this yaml is exactly like the real yaml but it is astropy-enabled.

taldcroft · 2016-11-29T13:59:05Z

BTW much of the above was written thinking about everything happening automagically via a registry. But taking a hybrid approach of defining <class>._represent_as_dict() and <class>._construct_from_dict which are essentially the existing functions, and then registering them manually in utils/yaml.py would be straightforward and probably make more sense. (Except for numpy, of course).

That way those methods could in theory be used for JSON as well with just a little tweaking, and might also help out in making ECSV serialize mixin columns completely.

mhvk · 2016-11-29T14:29:19Z

On where to put things: above, I suggested similar that a class could provide the interface since then one doesn't have to keep a list. But at the same time, I don't think we want to make classes too cluttered with things that do not speak to their basic functionality (i.e., I think we should not have Unit._to_yaml...). This in part since for Unit/Quantity at least, I feel they should eventually go upstream (to scipy if not numpy). So, if we go the class-route, I'd propose to make this part of info (this not that strange -- it already helps to define how items are formatted within tables; one can start to see it as a general property where information relevant to other parts of astropy is stored).

In any case, I don't think this discussion has to hold up the PR: one can always move to a on-the-class scheme from the current outside-in approach (and, as @taldcroft notes, there will be special cases for objects like numpy arrays anyway).

taldcroft · 2016-12-03T17:51:21Z

@mhvk - made some progress, I hope you will like the changes.

taldcroft · 2016-12-04T18:07:39Z

This is ready for review now. There is only one obvious issue now, namely in building the docs locally I get these warnings:

/Users/aldcroft/git/astropy/docs/api/astropy.io.misc.yaml.AstropyDumper.rst:7: WARNING: py:class reference target not found: yaml.dumper.SafeDumper
/Users/aldcroft/git/astropy/docs/api/astropy.io.misc.yaml.AstropyLoader.rst:7: WARNING: py:class reference target not found: yaml.loader.SafeLoader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.representer.SafeRepresenter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.composer.Composer
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.constructor.SafeConstructor
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.reader.Reader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.emitter.Emitter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.dumper.SafeDumper
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.parser.Parser
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.serializer.Serializer
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.scanner.Scanner
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.loader.SafeLoader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.representer.BaseRepresenter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.resolver.Resolver
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.constructor.BaseConstructor
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.resolver.BaseResolver

It looks like in generating the API docs it wants to find definitions for various PyYaml objects. These definitions don't exist anywhere AFAIK. So I have no idea how to fix this.

taldcroft · 2016-12-04T18:11:28Z

The only other thing I can think of is a question about whether to inject info attributes into Unit and EarthLocation. That would make all of the serialized object classes use info for the core represent / construct operations. It makes no real difference for YAML stuff apart from introducing consistency. I'm slightly concerned about unexpected side effects, however, esp. for Unit.

taldcroft · 2016-12-04T21:49:46Z

Test failures:

Travis: unrelated timeouts and expected doc build failure mentioned above. Actual code tests (py2.7 and py3.5 with optional dependencies) are passing.
CircleCI: failure to build, presumably unrelated
AppVeyor: float format difference in docstring, need # doctest: +FLOAT_CMP

taldcroft · 2016-12-06T03:02:56Z

@mhvk - I think I have addressed all comments. Cross fingers that tests and doc build pass.

taldcroft · 2016-12-06T14:05:29Z

@mhvk - the test failure is related to the following funny that appears only in Python 3. This makes no sense to me and clearly I don't understand something. On Python 2 the byte order is changed in the way that I would expect.

In [13]: f = bytes(np.array([[0,1], [2,3]], order='F').data)

In [14]: f
Out[14]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'

In [15]: c = bytes(np.array([[0,1], [2,3]], order='C').data)

In [16]: c
Out[16]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'

In [17]: f == c
Out[17]: True

taldcroft · 2016-12-06T14:21:02Z

@mhvk - I found a workaround though I still don't understand the original problem.

mhvk · 2016-12-06T18:06:07Z

astropy/io/misc/yaml.py

+        obj = np.ascontiguousarray(obj)
+    order = 'F' if np.isfortran(obj) else 'C'
+
+    data_b64 = base64.b64encode(bytes(obj.ravel(order=order).data))


I think this trick is good, but thought we might be able to use more of the numpy internals. So, I looked at the code for np.save (which gets one to https://github.com/numpy/numpy/blob/master/numpy/lib/format.py#L577 eventually), and what is does for fortran-order is array.T.tofile(...). This suggests we might just use array.T.tostring() without the call to bytes. So, one could rewrite the above as:

if np.isfortran(obj): obj = obj.T order = 'F' else: order = 'C' data_b64 = base64.b64encode(obj.tostring())

mhvk · 2016-12-07T02:58:58Z

A really nice addition!!

taldcroft added Affects-release coordinates Enhancement time units utils labels Nov 19, 2016

taldcroft added this to the v1.3.0 milestone Nov 19, 2016

mhvk reviewed Nov 21, 2016

View reviewed changes

taldcroft added Ready-for-final-review and removed Work in progress labels Dec 4, 2016

taldcroft added 7 commits December 5, 2016 20:08

Move SkyCoord constructor/representer into info

8e7cdae

Move Quantity constructor/representer into info [skip ci]

7803dd8

Move yaml from utils to io.misc

3b58907

Add test of load_all

c16ca58

Add initial docs

5a1366e

Get rid of Quantity subclassing, use explicit class declarations

5bd93b2

Update changelog

7a8acce

taldcroft assigned mhvk Dec 6, 2016

taldcroft added 3 commits December 5, 2016 20:45

Move EarthLocation constructor/representer into info

3f778c1

Add nitpick exceptions for PyYaml classes

6ac0e1e

Remove unnecessary u (unicode) tags

bfca5ec

taldcroft mentioned this pull request Dec 6, 2016

Upgrade extern PLY version to 3.9 #5526

Merged

taldcroft added 2 commits December 5, 2016 21:08

Various minor fixes suggested in review

15ef5ba

Improve ndarray serialization

16b982d

taldcroft force-pushed the yaml-serialize branch from 05cc7a6 to 16b982d Compare December 6, 2016 03:00

mhvk reviewed Dec 6, 2016

View reviewed changes

eteq approved these changes Dec 6, 2016

View reviewed changes

Fix issue with numpy Fortran data order using Py3

c2bc8ea

taldcroft force-pushed the yaml-serialize branch from 5e44f5b to c2bc8ea Compare December 6, 2016 21:12

taldcroft merged commit 27f107a into astropy:master Dec 7, 2016

taldcroft deleted the yaml-serialize branch December 7, 2016 01:55

taldcroft added the whatsnew-needed label Dec 7, 2016

mhvk mentioned this pull request Dec 14, 2016

ecsv files fail to write with units in metadata #5604

Closed

mhvk mentioned this pull request Mar 1, 2017

Can't instantiate Quantity with logarithmic units. #5851

Closed

mhvk mentioned this pull request Apr 18, 2017

Quantity arrays do not play nice with Bokeh server #5970

Closed

pllim mentioned this pull request May 1, 2020

MNT: Make yaml import truly optional in io.misc.yaml #10257

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support serializing core astropy classes with YAML #5486

Support serializing core astropy classes with YAML #5486

taldcroft commented Nov 19, 2016

taldcroft commented Nov 19, 2016

mhvk commented Nov 19, 2016

taldcroft commented Nov 19, 2016

taldcroft commented Nov 19, 2016

MSeifert04 commented Nov 20, 2016

pllim commented Nov 21, 2016

mhvk commented Nov 21, 2016

mhvk commented Nov 21, 2016

mhvk commented Nov 21, 2016 •

edited

Loading

taldcroft commented Nov 21, 2016

mhvk Nov 21, 2016

mhvk Nov 21, 2016

mhvk commented Nov 21, 2016

eteq commented Nov 29, 2016 •

edited

Loading

taldcroft commented Nov 29, 2016

taldcroft commented Nov 29, 2016

mhvk commented Nov 29, 2016

taldcroft commented Dec 3, 2016

taldcroft commented Dec 4, 2016

taldcroft commented Dec 4, 2016

taldcroft commented Dec 4, 2016 •

edited

Loading

taldcroft commented Dec 6, 2016

taldcroft commented Dec 6, 2016

taldcroft commented Dec 6, 2016

mhvk Dec 6, 2016

mhvk commented Dec 7, 2016

Support serializing core astropy classes with YAML #5486

Support serializing core astropy classes with YAML #5486

Conversation

taldcroft commented Nov 19, 2016

taldcroft commented Nov 19, 2016

mhvk commented Nov 19, 2016

taldcroft commented Nov 19, 2016

taldcroft commented Nov 19, 2016

MSeifert04 commented Nov 20, 2016

pllim commented Nov 21, 2016

mhvk commented Nov 21, 2016

mhvk commented Nov 21, 2016

mhvk commented Nov 21, 2016 • edited Loading

taldcroft commented Nov 21, 2016

mhvk Nov 21, 2016

Choose a reason for hiding this comment

mhvk Nov 21, 2016

Choose a reason for hiding this comment

mhvk commented Nov 21, 2016

eteq commented Nov 29, 2016 • edited Loading

taldcroft commented Nov 29, 2016

taldcroft commented Nov 29, 2016

mhvk commented Nov 29, 2016

taldcroft commented Dec 3, 2016

taldcroft commented Dec 4, 2016

taldcroft commented Dec 4, 2016

taldcroft commented Dec 4, 2016 • edited Loading

taldcroft commented Dec 6, 2016

taldcroft commented Dec 6, 2016

taldcroft commented Dec 6, 2016

mhvk Dec 6, 2016

Choose a reason for hiding this comment

mhvk commented Dec 7, 2016

mhvk commented Nov 21, 2016 •

edited

Loading

eteq commented Nov 29, 2016 •

edited

Loading

taldcroft commented Dec 4, 2016 •

edited

Loading