Support serializing core astropy classes with YAML #5486
FYI, here is a somewhat complicated example:
I like where this is going! But I'll need to think more to comment very sensibly. For now,
p.s. You'd need to add
The basic strategy here was mostly looking at the class constructor API and then supplementing that with knowledge from methods like Maybe your point here is that it would be handy if these objects had an
One could rely more on
So the only way I could figure out how to accomplish YAML dump/load in a reasonable time was this hand-curated way. But it is true that the representer functions could easily be adapted to JSON and that was something I had in mind.
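To illustrate the hand-curated idea and the remark that the representers could be adapted to JSON, here is a minimal sketch using only the standard library. The `Angle` class, the `!angle` tag, and the helper names are hypothetical stand-ins, not code from this PR; the point is only that one representer and one constructor per class, kept in a single table, are enough for a round trip through a flat dict.

```python
import json


class Angle:
    """Hypothetical stand-in for a class with a simple constructor API."""

    def __init__(self, value, unit='deg'):
        self.value = value
        self.unit = unit


def _angle_representer(obj):
    # Hand-picked attributes that are sufficient to rebuild the object.
    return {'value': obj.value, 'unit': obj.unit}


def _angle_constructor(mapping):
    # The flat dict maps directly onto the constructor arguments.
    return Angle(**mapping)


# One place that lists every supported tag with its class and helpers.
REGISTRY = {'!angle': (Angle, _angle_representer, _angle_constructor)}


def dumps(obj):
    """Serialize a supported object as a tagged flat dict in JSON."""
    for tag, (cls, represent, _) in REGISTRY.items():
        if type(obj) is cls:
            return json.dumps({'tag': tag, 'data': represent(obj)})
    raise TypeError(f'cannot serialize {type(obj).__name__}')


def loads(text):
    """Rebuild an object, using only constructors listed in REGISTRY."""
    node = json.loads(text)
    if node['tag'] not in REGISTRY:
        raise ValueError(f"unknown tag {node['tag']!r}")
    _, _, construct = REGISTRY[node['tag']]
    return construct(node['data'])


angle = loads(dumps(Angle(10.5, 'deg')))
```

Swapping the JSON calls for YAML representer and constructor registration would leave the two per-class helper functions unchanged, which is the reuse the comment above has in mind.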
I thought about this as well (and even wrote code for that at one point). The issue I have is that not all classes that we want to serialize will have an
BTW, I just remembered that this functionality requires the latest release of PyYAML (3.12) which fixes a problem where the dumper code is trying to compare ndarray objects to
What kind of warning? For example deprecation warnings are automatically turned into exceptions in astropy-conftest.
Travis failed with
@taldcroft - the problem with basic yaml not working on units is that
fails, while if I remove (This may well be a bug in
The problem is indeed the metaclass with yaml. E.g., in astropy 1.2.1, I can do
but in current master it gives the same error as for
I don't know how relevant it is, but here is a work-around for the inability to serialize metaclasses:
(It seems python-yaml requires one to sign up to submit bug reports; sadly, this means they won't get one...)
@mhvk - interesting and illuminating. So will you be OK in the end with a "hand-curated" approach which assumes that we can put a selected set of attributes into a flat dict and then use this in As you can see the raw dump of
@pytest.mark.parametrize('c', [u.m, u.m / u.s, u.hPa, u.dimensionless_unscaled])
def test_unit(c):
    cy = load(dump(c))
    assert c == cy
For `u.m`, the units should be identical. So, do add `assert c is cy` for that case (maybe separate test).
Actually, this should be true for all the units that are defined, i.e., all but `u.m / u.s`.
@taldcroft - yes, good to get back to the higher-level question. I think it is fine to take a staged approach, i.e., start with what you have here, with stuff hardcoded in the encoder/decoder, and try later to generalize it.
@taldcroft - this is very neat and a promising step! Two large-scale questions:
There is at least one class (numpy) where we can't take that approach. (Yes, upstream patch is possible but not really useful/practical here). But more to the point, having a registry gets a little tricky or unsavory when you think about details. First, I think it would involve requiring some mixin metaclass for every YAML-enabled class in order to do the registry. So I'm not super keen on that, but maybe people don't mind, e.g. putting this YAML metaclass on Then it gets slightly tricky with the So all in all I think the outside-in approach I've done is just conceptually simpler to develop, and probably maintain. One can actually see all the constructors and representers in one place.
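For readers unfamiliar with what a metaclass-based registry would involve, the sketch below shows the general shape under stated assumptions. `YamlRegistryMeta`, `Serializable`, `UnitMeta`, and the tags are all hypothetical names invented for illustration; nothing here is taken from astropy or PyYAML.

```python
class YamlRegistryMeta(type):
    """Hypothetical metaclass that records every class declaring a yaml_tag."""

    registry = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        tag = namespace.get('yaml_tag')
        if tag is not None:
            YamlRegistryMeta.registry[tag] = cls


class Serializable(metaclass=YamlRegistryMeta):
    """Hypothetical mixin base; subclasses opt in by setting yaml_tag."""

    yaml_tag = None


class Time(Serializable):
    yaml_tag = '!hypothetical/time'


# A class that already has its own metaclass needs a combined metaclass,
# otherwise Python raises a metaclass-conflict TypeError.
class UnitMeta(type):
    pass


class CombinedMeta(UnitMeta, YamlRegistryMeta):
    pass


class Unit(Serializable, metaclass=CombinedMeta):
    yaml_tag = '!hypothetical/unit'
```

The registration itself is automatic, but each participating class has to inherit from the mixin, and classes with an existing metaclass need an extra combined metaclass. That is the kind of intrusion the comment above weighs against keeping all constructors and representers in one outside-in module.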
About naming, I'm just following existing naming convention for the real
So effectively this
BTW much of the above was written thinking about everything happening automagically via a registry. But taking a hybrid approach of defining That way those methods could in theory be used for JSON as well with just a little tweaking, and might also help out in making ECSV serialize mixin columns completely.
On where to put things: above, I suggested something similar, namely that a class could provide the interface, since then one doesn't have to keep a list. But at the same time, I don't think we want to make classes too cluttered with things that do not speak to their basic functionality (i.e., I think we should not have In any case, I don't think this discussion has to hold up the PR: one can always move to an on-the-class scheme from the current outside-in approach (and, as @taldcroft notes, there will be special cases for objects like numpy arrays anyway).
@mhvk - made some progress, I hope you will like the changes.
This is ready for review now. There is only one obvious issue now, namely in building the docs locally I get these warnings:
It looks like in generating the API docs it wants to find definitions for various PyYaml objects. These definitions don't exist anywhere AFAIK. So I have no idea how to fix this.
The only other thing I can think of is a question about whether to inject
Test failures:
Force-pushed from 05cc7a6 to 16b982d.
@mhvk - I think I have addressed all comments. Cross fingers that tests and doc build pass.
@mhvk - the test failure is related to the following funny that appears only in Python 3. This makes no sense to me and clearly I don't understand something. On Python 2 the byte order is changed in the way that I would expect.
@mhvk - I found a workaround though I still don't understand the original problem.
obj = np.ascontiguousarray(obj)
order = 'F' if np.isfortran(obj) else 'C'

data_b64 = base64.b64encode(bytes(obj.ravel(order=order).data))
I think this trick is good, but thought we might be able to use more of the numpy internals. So, I looked at the code for `np.save` (which gets one to https://github.com/numpy/numpy/blob/master/numpy/lib/format.py#L577 eventually), and what it does for Fortran order is `array.T.tofile(...)`. This suggests we might just use `array.T.tostring()` without the call to `bytes`. So, one could rewrite the above as:
if np.isfortran(obj):
    obj = obj.T
    order = 'F'
else:
    order = 'C'
data_b64 = base64.b64encode(obj.tostring())
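A self-contained round-trip sketch of the transpose trick suggested above may help when checking it. The function names and dict keys are hypothetical and this is not the PR's implementation. It assumes simple (non-structured) dtypes and uses `tobytes()`, the current spelling of `tostring()`, which was removed in NumPy 2.0.

```python
import base64

import numpy as np


def encode_ndarray(obj):
    """Pack an array into a flat dict of plain Python values (hypothetical helper)."""
    shape = list(obj.shape)  # record the shape before any transpose
    if np.isfortran(obj):
        # The transpose of a Fortran-ordered array is C-contiguous, so
        # tobytes() returns the bytes in the original memory order.
        obj = obj.T
        order = 'F'
    else:
        order = 'C'
    return {
        'buffer': base64.b64encode(obj.tobytes()).decode('ascii'),
        'dtype': obj.dtype.str,  # includes the byte order, e.g. '>i4'
        'shape': shape,
        'order': order,
    }


def decode_ndarray(mapping):
    """Rebuild the array from the dict produced by encode_ndarray."""
    raw = base64.b64decode(mapping['buffer'])
    view = np.ndarray(shape=tuple(mapping['shape']), dtype=mapping['dtype'],
                      buffer=raw, order=mapping['order'])
    # The view over `raw` is read-only; copy it, keeping the memory layout.
    return view.copy(order=mapping['order'])


original = np.asfortranarray(np.arange(6, dtype='>i4').reshape(2, 3))
restored = decode_ndarray(encode_ndarray(original))
```

Storing `dtype.str` rather than the dtype name keeps the byte order explicit in the serialized form.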
Force-pushed from 5e44f5b to c2bc8ea.
A really nice addition!!
This adds support for safe YAML serializing (dump and load) of:
The current implementation does not allow for arbitrary subclasses of these objects in order to ensure that the load process is "safe" and only uses trusted constructors.
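One way to picture the subclass restriction is a lookup keyed on the exact type, sketched below. The `Column` and `AnnotatedColumn` classes and the `represent` helper are hypothetical and only illustrate the policy; they are not the PR's code.

```python
class Column:
    """Hypothetical trusted class with a hand-written representer."""

    def __init__(self, name):
        self.name = name


class AnnotatedColumn(Column):
    """Hypothetical subclass carrying state the representer knows nothing about."""

    def __init__(self, name, note):
        super().__init__(name)
        self.note = note


# Representers are keyed by exact type, so subclasses never match by accident.
TRUSTED_REPRESENTERS = {Column: lambda obj: {'name': obj.name}}


def represent(obj):
    """Return the flat dict for a trusted type, refusing anything else."""
    try:
        representer = TRUSTED_REPRESENTERS[type(obj)]  # exact match, not isinstance()
    except KeyError:
        raise TypeError(
            f'no trusted representer for {type(obj).__name__}') from None
    return representer(obj)
```

With an `isinstance()` check instead, `AnnotatedColumn` would be dumped as a plain `Column` and its `note` attribute would be dropped without any error.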
This PR is on the path to allowing mixin columns in ECSV that fully round-trip. It also relates somewhat to #5471.
@mhvk @cdeil @astrofrog