Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support serializing core astropy classes with YAML #5486

Merged
merged 21 commits into from
Dec 7, 2016

Conversation

taldcroft
Copy link
Member

This adds support for safe YAML serializing (dump and load) of:

  • np.ndarray
  • Quantity, Angle, Latitude, Longitude
  • Time, TimeDelta
  • EarthLocation
  • SkyCoord

The current implementation does not allow for arbitrary subclasses of these objects in order to ensure that the load process is "safe" and only uses trusted constructors.

This PR is on the path to allowing mixin columns in ECSV that fully round-trip. It also relates somewhat to #5471.

@mhvk @cdeil @astrofrog

@taldcroft
Copy link
Member Author

FYI, here is a somewhat complicated example:

In [7]:     c = SkyCoord([[1,2],[3,4]], [[5,6], [7,8]],
   ...:                  unit='deg', frame=frame,
   ...:                  obstime=Time.now(),
   ...:                  location=EarthLocation(1000, 2000, 3000, unit=u.km))

In [8]: print(dump(c))
!astropy.coordinates.sky_coordinate.SkyCoord
alt: !astropy.units.Quantity
  class: Latitude
  unit: &id001 !astropy.units.Unit {name: deg}
  value: !numpy.ndarray
    __ndarray__: !!binary |
      QUFBQUFBQUFGRUFBQUFBQUFBQVlRQUFBQUFBQUFCeEFBQUFBQUFBQUlFQT0=
    dtype: float64
    shape: !!python/tuple [2, 2]
az: !astropy.units.Quantity
  class: Longitude
  unit: *id001
  value: !numpy.ndarray
    __ndarray__: !!binary |
      QUFBQUFBQUE4RDhBQUFBQUFBQUFRQUFBQUFBQUFBaEFBQUFBQUFBQUVFQT0=
    dtype: float64
    shape: !!python/tuple [2, 2]
frame: altaz
location: !astropy.coordinates.earth.EarthLocation
  ellipsoid: WGS84
  x: !astropy.units.Quantity
    class: Quantity
    unit: &id002 !astropy.units.Unit {name: km}
    value: 1000.0
  y: !astropy.units.Quantity
    class: Quantity
    unit: *id002
    value: 2000.0
  z: !astropy.units.Quantity
    class: Quantity
    unit: *id002
    value: 3000.0
obstime: !astropy.time.Time {format: datetime, in_subfmt: '*', jd1: 2457711.5, jd2: 0.7801903949421296,
  out_subfmt: '*', precision: 3, scale: utc}
obswl: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: micron}
  value: 1.0
pressure: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: hPa}
  value: 0.0
relative_humidity: 0
representation: spherical
temperature: !astropy.units.Quantity
  class: Quantity
  unit: !astropy.units.Unit {name: deg_C}
  value: 0.0

@mhvk
Copy link
Contributor

mhvk commented Nov 19, 2016

I like where this is going! But I'll need to think more to comment very sensibly. For now,

  1. For SkyCoord, it turns out to be very handy that it has a list of things that are required to reproduce it faithfully. Perhaps we should have the same for other classes? E.g., in Time._apply, the list you have here is also used.
  2. Slightly relatedly, to what extent could we combine the information on how to serialise the objects with what pickle does? In the end, it needs the same information. Similarly, any effort to serialise in JSON would need the same information as well. Could we rely more on self.__dict__?
  3. There is something to be said for letting classes provide their own interface; this could perhaps be made part of info?

p.s. You'd need to add coords.Distance, u.Magnitude, u.Dex, u.Decibel to the list of Quantity subclasses that should be supported (where I'd need to check your code would work for the function quantities)

@taldcroft
Copy link
Member Author

  1. For SkyCoord, it turns out to be very handy that it has a list of things that are required to reproduce it faithfully. Perhaps we should have the same for other classes? E.g., in Time._apply, the list you have here is also used.

The basic strategy here was mostly looking at the class constructor API and then supplementing that with knowledge from methods like Time._apply. But for instance with SkyCoord one doesn't get anything very useful by looking at SkyCoord._apply since it basically delegates to __init__ with an input SkyCoord object.

Maybe your point here is that it would be handy if these objects had an info.yaml_representer method that returned a map (dict) of what is need to construct a copy (but see my response to your third point)?

  1. Slightly relatedly, to what extent could we combine the information on how to serialise the objects with what pickle does? In the end, it needs the same information. Similarly, any effort to serialise in JSON would need the same information as well. Could we rely more on self.dict?

One could rely more on self.__dict__, but I think that is a slippery path to re-writing pickle. Note that I believe yaml does use pickle for dumping arbitrary objects, but interestingly yaml fails to dump a Unit object:

>>> yaml.dump(u.m)
/Users/aldcroft/anaconda/envs/astropy3/lib/python3.3/copyreg.py in _reduce_ex(self, proto)
     63     else:
     64         if base is self.__class__:
---> 65             raise TypeError("can't pickle %s objects" % base.__name__)
     66         state = base(self)
     67     args = (self.__class__, base, state)

TypeError: can't pickle int objects

So the only way I could figure out how to accomplish YAML dump/load in a reasonable time was this hand-curated way. But it is true that the representer functions could easily be adapted to JSON and that was something I had in mind.

  1. There is something to be said for letting classes provide their own interface; this could perhaps be made part of info?

I thought about this as well (and even wrote code for that at one point). The issue I have is that not all classes that we want to serialize will have an info attribute, and some like np.ndarray are not even under our control. So at the end of the day it is probably just easier to have all the representers and constructors in one module?

@taldcroft
Copy link
Member Author

BTW, I just remembered that this functionality requires the latest release of YAML (3.12) which fixes a problem where the dumper code is trying to compare ndarray objects to None or () and this makes Py3 unhappy. (This is an odd story, because when running interactively that just generates a warning, but under pytest it turns into an exception. I don't understand that.)

@MSeifert04
Copy link
Contributor

This is an odd story, because when running interactively that just generates a warning, but under pytest it turns into an exception. I don't understand that.

What kind of warning? For example deprecation warnings are automatically turned into exceptions in astropy-conftest.

@pllim
Copy link
Member

pllim commented Nov 21, 2016

Travis failed with AssertionError (files not closed) and AttributeError.

@mhvk
Copy link
Contributor

mhvk commented Nov 21, 2016

@taldcroft - the problem with basic yaml not working on units is that UnitBase has a meta class InheritDocstrings. With our normal installation, even just

yaml.dump(u.UnitBase)

fails, while if I remove @six.add_metaclass(InheritDocstrings), it works.

(This may well be a bug in yaml.)

@mhvk
Copy link
Contributor

mhvk commented Nov 21, 2016

The problem is indeed the metaclass with yaml. E.g., in astropy 1.2.1, I can do

yaml.dump(Time)
# "!!python/name:astropy.time.core.Time ''\n"

but in current master it gives the same error as for u.UnitBase because Time now has the ShapedLikeNDArray metaclass.

@mhvk
Copy link
Contributor

mhvk commented Nov 21, 2016

I don't know how relevant it is, but here is a work-around for the inability to serialize metaclasses:

import yaml
import astropy.units as u
class AstropyDumper(yaml.Dumper):
    pass

AstropyDumper.add_multi_representer(type, AstropyDumper.yaml_representers[type])
yaml.dump(u.m, Dumper=AstropyDumper)
# "!!python/object/apply:astropy.units.core._recreate_irreducible_unit\nargs:\n- !!python/name:astropy.units.core.IrreducibleUnit ''\n- [m, meter]\n- true\nstate:\n  __doc__: 'meter: base unit of length in SI'\n  _format: {}\n  _long_names: [meter]\n  _names: [m, meter]\n  _short_names: [m]\n"

(It seems python-yaml requires one to sign-up to submit bug reports; sadly, this means they won't get one...)

@taldcroft
Copy link
Member Author

@mhvk - interesting and illuminating. So will you OK in the end with a "hand-curated" approach which assumes that we can put a selected set of attributes into a flat dict and then use this in __init__ (with some possible post-processing as in Time)?

As you can see the raw dump of u.m is a bit of a mess that I don't really want to dive into. In other words, I'm not sure how to keep this idea both simple and safe (no arbitrary code) without using the curated approach.

@pytest.mark.parametrize('c', [u.m, u.m / u.s, u.hPa, u.dimensionless_unscaled])
def test_unit(c):
cy = load(dump(c))
assert c == cy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For u.m, the units should be identical. So, do add assert c is cy for that case (maybe separate test).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this should be true for all the units that are defined, i.e., all but u.m / u.s.

@mhvk
Copy link
Contributor

mhvk commented Nov 21, 2016

@taldcroft - yes, good to get back to the higher-level question. I think it is fine to take a staged approach, i.e., start with what you have here, with stuff is hardcoded in the encoder/decoder, and try later to generalize it.

@eteq
Copy link
Member

eteq commented Nov 29, 2016

@taldcroft - this is very neat and a promising step!

Two large-scale questions:

  1. Why *_constructor/*_representer functions instead of <class>._yaml_repr_/<class>._yaml_init_ or similar methods? The latter strikes me as a good way to not have to manage the registry? (at least on the representer side, anyway)
  2. This is fairly YAML-specific and meant to stay that way, right? So the load/dump names should perhaps be load_yaml/dump_yaml? (a bit nitpicking for a WIP, but it might be less work down the road to make such changes sooner)

@taldcroft
Copy link
Member Author

Why _constructor/_representer functions instead of .yaml_repr/.yaml_init or similar methods? The latter strikes me as a good way to not have to manage the registry? (at least on the representer side, anyway)

There is at least one class (numpy) where we can't take that approach. (Yes, upstream patch is possible but not really useful/practical here).

But more to the point, having a registry gets a little tricky or unsavory when you think about details. First, I think it would involve requiring some mixin metaclass for every YAML-enabled class in order to do the registry. So I'm not super keen on that, but maybe people don't mind, e.g. putting this YAML metaclass on Unit?

Then it gets slightly tricky with the AstropyDumper/Loader classes. You see that right now all the supported classes get explicitly put into the class with class method calls when importing utils.yaml. If the classes register themselves, I'm not sure it is possible to do this without importing yaml (the pyyaml version), and we definitely don't want to require importing yaml in order to import Quantity or Time.

So all in all I think the outside-in approach I've done is just conceptually simpler to develop, and probably maintain. One can actually see all the constructors and representers in one place.

This is fairly YAML-specific and meant to stay that way, right? So the load/dump names should perhaps be load_yaml/dump_yaml? (a bit nitpicking for a WIP, but it might be less work down the road to make such changes sooner)

About naming, I'm just following existing naming convention for the real yaml, e.g.:

from astropy.utils import yaml
...
out = yaml.dump(astropy_obj)

So effectively this yaml is exactly like the real yaml but it is astropy-enabled.

@taldcroft
Copy link
Member Author

BTW much of the above was written thinking about everything happening automagically via a registry. But taking a hybrid approach of defining <class>._represent_as_dict() and <class>._construct_from_dict which are essentially the existing functions, and then registering them manually in utils/yaml.py would be straightforward and probably make more sense. (Except for numpy, of course).

That way those methods could in theory be used for JSON as well with just a little tweaking, and might also help out in making ECSV serialize mixin columns completely.

@mhvk
Copy link
Contributor

mhvk commented Nov 29, 2016

On where to put things: above, I suggested similar that a class could provide the interface since then one doesn't have to keep a list. But at the same time, I don't think we want to make classes too cluttered with things that do not speak to their basic functionality (i.e., I think we should not have Unit._to_yaml...). This in part since for Unit/Quantity at least, I feel they should eventually go upstream (to scipy if not numpy). So, if we go the class-route, I'd propose to make this part of info (this not that strange -- it already helps to define how items are formatted within tables; one can start to see it as a general property where information relevant to other parts of astropy is stored).

In any case, I don't think this discussion has to hold up the PR: one can always move to a on-the-class scheme from the current outside-in approach (and, as @taldcroft notes, there will be special cases for objects like numpy arrays anyway).

@taldcroft
Copy link
Member Author

@mhvk - made some progress, I hope you will like the changes.

@taldcroft
Copy link
Member Author

This is ready for review now. There is only one obvious issue now, namely in building the docs locally I get these warnings:

/Users/aldcroft/git/astropy/docs/api/astropy.io.misc.yaml.AstropyDumper.rst:7: WARNING: py:class reference target not found: yaml.dumper.SafeDumper
/Users/aldcroft/git/astropy/docs/api/astropy.io.misc.yaml.AstropyLoader.rst:7: WARNING: py:class reference target not found: yaml.loader.SafeLoader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.representer.SafeRepresenter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.composer.Composer
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.constructor.SafeConstructor
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.reader.Reader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.emitter.Emitter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.dumper.SafeDumper
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.parser.Parser
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.serializer.Serializer
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.scanner.Scanner
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.loader.SafeLoader
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.representer.BaseRepresenter
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.resolver.Resolver
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.constructor.BaseConstructor
/Users/aldcroft/git/astropy/docs/io/misc.rst:74: WARNING: py:class reference target not found: yaml.resolver.BaseResolver

It looks like in generating the API docs it wants to find definitions for various PyYaml objects. These definitions don't exist anywhere AFAIK. So I have no idea how to fix this.

@taldcroft
Copy link
Member Author

The only other thing I can think of is a question about whether to inject info attributes into Unit and EarthLocation. That would make all of the serialized object classes use info for the core represent / construct operations. It makes no real difference for YAML stuff apart from introducing consistency. I'm slightly concerned about unexpected side effects, however, esp. for Unit.

@taldcroft
Copy link
Member Author

taldcroft commented Dec 4, 2016

Test failures:

  • Travis: unrelated timeouts and expected doc build failure mentioned above. Actual code tests (py2.7 and py3.5 with optional dependencies) are passing.
  • CircleCI: failure to build, presumably unrelated
  • AppVeyor: float format difference in docstring, need # doctest: +FLOAT_CMP

@taldcroft
Copy link
Member Author

@mhvk - I think I have addressed all comments. Cross fingers that tests and doc build pass.

@taldcroft
Copy link
Member Author

@mhvk - the test failure is related to the following funny that appears only in Python 3. This makes no sense to me and clearly I don't understand something. On Python 2 the byte order is changed in the way that I would expect.

In [13]: f = bytes(np.array([[0,1], [2,3]], order='F').data)

In [14]: f
Out[14]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'

In [15]: c = bytes(np.array([[0,1], [2,3]], order='C').data)

In [16]: c
Out[16]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'

In [17]: f == c
Out[17]: True

@taldcroft
Copy link
Member Author

@mhvk - I found a workaround though I still don't understand the original problem.

obj = np.ascontiguousarray(obj)
order = 'F' if np.isfortran(obj) else 'C'

data_b64 = base64.b64encode(bytes(obj.ravel(order=order).data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this trick is good, but thought we might be able to use more of the numpy internals. So, I looked at the code for np.save (which gets one to https://github.com/numpy/numpy/blob/master/numpy/lib/format.py#L577 eventually), and what is does for fortran-order is array.T.tofile(...). This suggests we might just use array.T.tostring() without the call to bytes. So, one could rewrite the above as:

if np.isfortran(obj):
    obj = obj.T
    order = 'F'
else:
    order = 'C'

data_b64 = base64.b64encode(obj.tostring())

@taldcroft taldcroft merged commit 27f107a into astropy:master Dec 7, 2016
@taldcroft taldcroft deleted the yaml-serialize branch December 7, 2016 01:55
@mhvk
Copy link
Contributor

mhvk commented Dec 7, 2016

A really nice addition!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants