
Add the JsonableData data plugin #5017

Merged

Conversation

sphuber
Contributor

@sphuber sphuber commented Jul 9, 2021

Fixes #4975

This data plugin is designed to make it easy to wrap instances of
arbitrary Python classes and store them as a node in the database. The
only requirement is that the class implements the as_dict method,
which should return a JSON-serializable dictionary representing the
instance, i.e. one that can be serialized by the JSONEncoder of the
json built-in module.

If the class also implements a from_dict method that can consume the
dictionary returned by as_dict to reconstruct an instance, the original
instance wrapped by the node (even when loaded from the database) can
be recovered through the obj property.

The MSONable class, provided by the monty library, is designed to
make arbitrary classes easily JSON-serializable such that they can be
serialized and, for example, stored in a database. The class is used
extensively in pymatgen and many of its classes are MSONable. Since
pymatgen classes are often used in the current AiiDA community, it
would be nice if these objects could be easily stored in the provenance
graph.

The new MsonableData data plugin wraps any instance of an MSONable
class. When constructed, it calls as_dict, which by definition returns
a JSON-serializable version of the object, which can therefore be
stored in the node's attributes. The obj property returns the original
MSONable instance by deserializing it from the serialized version
stored in the attributes.
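As a rough illustration of the intended usage (a sketch only: it assumes pymatgen is installed, an AiiDA profile is loaded, e.g. inside a verdi shell, and that the plugin is registered under the msonable entry point):

from aiida.plugins import DataFactory
from aiida.orm import load_node
from pymatgen.core import Molecule

MsonableData = DataFactory('msonable')  # entry point name assumed for this sketch

# Any MSONable instance, e.g. a pymatgen Molecule, can be wrapped directly.
molecule = Molecule(['H', 'H'], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])
node = MsonableData(molecule)  # as_dict is called and the result is stored in the node attributes
node.store()

# Loading the node back gives access to the original object through the obj property.
loaded = load_node(node.uuid)
print(type(loaded.obj))  # the Molecule is reconstructed via from_dict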

@sphuber sphuber requested review from zhubonan and mbercx July 9, 2021 07:16

"""

def __init__(self, obj, *args, **kwargs):
Member

Add type checking

Contributor Author

I added type checking, but since we are very flexible, all we can say is typing.Any pretty much. Or is there some better typing we can have still?

Member

@chrisjsewell chrisjsewell Aug 1, 2021

Well the best way is to define a Protocol: https://docs.python.org/3/library/typing.html#typing.Protocol, although you need the typing_extensions backport package for python < 3.8
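For example, a sketch of such a protocol (the name here is made up, not necessarily what was implemented):

from typing import Any, Dict

try:
    from typing import Protocol  # available from Python 3.8
except ImportError:
    from typing_extensions import Protocol  # backport for Python < 3.8

class AsDictProtocol(Protocol):
    """Hypothetical structural type for the objects the data plugin accepts."""

    def as_dict(self) -> Dict[str, Any]:
        """Return a JSON-serializable dictionary representation of the instance."""
        ...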

@sphuber
Contributor Author

sphuber commented Jul 9, 2021

@zhubonan it would be great if you could give this a spin and see whether this is useful for your use case.

@zhubonan
Contributor

zhubonan commented Jul 9, 2021

Thanks for this. I will have a go!

@zhubonan
Contributor

zhubonan commented Jul 9, 2021

I just had a try and it works great for me. It can be very useful for storing the provenance of, say, a phase diagram construction performed with pymatgen. At the moment I do this by manually converting AiiDA's records into ComputedEntry on the fly and making the phase diagram.

With this new capability, the pymatgen entries (which have no provenance at all, so I have to manually add in the UUID...) can be stored and linked to AiiDA's workflows with the provenance, and each phase diagram will also automatically have the full provenance record!

One thing that I realised is that fireworks actually uses a different protocol than the standard MSONable class, in which case the objects are loaded by FwSerializer. This is needed to allow code refactoring, so the class can be dynamically loaded:

https://github.com/materialsproject/fireworks/blob/a69911f9b99d025c322baa3dd003370b7a21dafb/fireworks/utilities/fw_serializers.py#L183

However, I think it is too code specific and not really a standard. I am happy with just having the MsonableData class in aiida-core.

@zhubonan
Contributor

zhubonan commented Jul 9, 2021

I have found a problem with further testing. Not all MSONable objects can be serialised with standard JSON. For example, some pymatgen objects contain NumPy arrays and NaN values; the latter cannot be stored in PostgreSQL directly because JSON does not support NaN.

I can see that they have implemented special decoders/encoders for these cases to get around it:

https://github.com/materialsvirtuallab/monty/blob/b6204db0ddebb51441b397cd8c79418022ea8830/monty/json.py#L248

Example content with nan:

{'@module': 'pymatgen.entries.computed_entries',
 '@class': 'ComputedEntry',
 'energy': -8.93381717,
 'composition': defaultdict(float, {'Li': 1.0, 'O': 1.0}),
 'energy_adjustments': [{'@module': 'pymatgen.entries.computed_entries',
   '@class': 'ConstantEnergyAdjustment',
   '@version': '2020.10.20',
   'value': -0.70229,
   'uncertainty': nan,
   'name': 'MP Anion Correction',
   'cls': {'@module': 'pymatgen.entries.compatibility',
    '@class': 'MaterialsProjectCompatibility',
    '@version': '2020.10.20'},
   'description': 'Constant energy adjustment'}],
 'parameters': {'run_type': 'GGA',
  'is_hubbard': False,
  'pseudo_potential': {'functional': 'PBE',
   'labels': ['Li_sv', 'O'],
   'pot_type': 'paw'},
  'hubbards': {},
  'potcar_symbols': ['PBE Li_sv', 'PBE O'],
  'oxide_type': 'oxide'},
 'data': {'oxide_type': 'oxide'},
 'entry_id': 'mp-1097030',
 'correction': -0.70229}

@zhubonan
Contributor

zhubonan commented Jul 9, 2021

Oh, actually it is Python's JSON encoder/decoder that does the work of sorting out NaN and infinity:

https://docs.python.org/3/library/json.html#json.JSONDecoder
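For reference, the stdlib behaviour in question can be seen with a couple of lines:

import json

# By default the stdlib encoder emits the non-standard tokens NaN/Infinity/-Infinity...
print(json.dumps({'uncertainty': float('nan'), 'bound': float('inf')}))  # {"uncertainty": NaN, "bound": Infinity}

# ...and refuses outright when strict JSON is requested.
try:
    json.dumps({'uncertainty': float('nan')}, allow_nan=False)
except ValueError as exc:
    print(exc)  # Out of range float values are not JSON compliant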

@sphuber
Contributor Author

sphuber commented Jul 9, 2021

I have found a problem with further testing. Not all MSONable objects can be serialised with standard JSON. For example, some pymatgen objects contain NumPy arrays and NaN values; the latter cannot be stored in PostgreSQL directly because JSON does not support NaN.

Hmm, for numpy arrays, what we could think of doing is after calling as_dict on the object, but before storing it on the attributes of the node, to first loop over the serialized content, remove all numpy arrays and store them in the repository of the node instead. We leave a placeholder in the attributes such that upon deserialization, we replace those placeholders with the numpy arrays read from the repository. This would obviously be less performant, but I think it should work. Do you have an example of an MSONable class that contains a numpy array, which is easy to mock, so I can add this to the tests for the implementation?
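A rough sketch of that placeholder idea (purely illustrative; none of these helpers exist in aiida-core):

import numpy as np

def extract_arrays(data, arrays, prefix=''):
    """Recursively replace numpy arrays with placeholder keys, collecting the arrays separately."""
    if isinstance(data, np.ndarray):
        key = f'placeholder_array{prefix}'
        arrays[key] = data  # would be written to the node repository, e.g. as ``<key>.npy``
        return key
    if isinstance(data, dict):
        return {k: extract_arrays(v, arrays, f'{prefix}.{k}') for k, v in data.items()}
    if isinstance(data, list):
        return [extract_arrays(v, arrays, f'{prefix}.{i}') for i, v in enumerate(data)]
    return data

serialized = {'data': {'a': np.arange(10)}}  # stand-in for what as_dict might return
arrays = {}
attributes = extract_arrays(serialized, arrays)
print(attributes)    # {'data': {'a': 'placeholder_array.data.a'}}
print(list(arrays))  # ['placeholder_array.data.a']
# Deserialization would do the inverse: swap each placeholder back for the array read from the repository.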

Oh, actually it is Python's JSON encoder/decoder that does the work of sorting out NaN and infinity:

Regarding the NaN values, we indeed don't support this because PostgreSQL doesn't support it in the JSONB fields: https://www.postgresql.org/docs/current/datatype-json.html#JSON-TYPE-MAPPING-TABLE
But then it is probably because NaN is not a valid JSON token. So it is not so much a problem of the Python encoder/decoders, but merely that we cannot store it in the attributes. The only solution I see is to serialize this value (and others, such as infinity etc.) to some kind of string token that then also has to be deserialized. But I am not sure if this is going to work.

@zhubonan
Contributor

Hmm, for numpy arrays, what we could think of doing is after calling as_dict on the object,
I think the monty package solves this by making the NumPy arrays part of the JSON representation:

https://guide.materialsvirtuallab.org/monty/_modules/monty/json.html#MontyEncoder

We should be able to add this to a pair of encoding and decoding functions that process the dict before serialization and after deserialization. The ability to store infinite values can also be added that way - they are just saved as tokens like NaN and -Infinity - but perhaps it should be added to the parent Dict class. This is what Python does anyway when encoding/decoding JSON.

https://github.com/python/cpython/blob/3.9/Lib/json/encoder.py#L204

@zhubonan
Contributor

Sorry, I think I have mixed some things up.

Just to clarify, the support of NumPy arrays is not something we need to do on our side. Individual classes should implement as_dict to be able to support it. In pymatgen, I have seen classes get around this by doing a round trip using the MontyEncoder to get a string and then loading it back as plain Python:

https://github.com/materialsproject/pymatgen/blob/5d600ca610d1a4d3e1532293e5f87062332b9a26/pymatgen/entries/computed_entries.py#L549

If we really want this for any class, we could add this functionality by doing the same round-trip thing or by calling the default method of the MontyEncoder. But I think it is not necessary in most cases.
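The round trip being described boils down to something like this (MontyEncoder and MontyDecoder are the actual monty classes; the example data is made up):

import json
import numpy as np
from monty.json import MontyEncoder, MontyDecoder

raw = {'values': np.arange(3), 'label': 'example'}

# Encode: MontyEncoder turns the numpy array into a plain, JSON-compatible structure.
plain = json.loads(json.dumps(raw, cls=MontyEncoder))
print(plain['values'])  # a plain dict with '@module'/'@class' keys instead of an ndarray

# Decode: MontyDecoder reverses the transformation when loading the JSON string.
restored = json.loads(json.dumps(raw, cls=MontyEncoder), cls=MontyDecoder)
print(type(restored['values']))  # <class 'numpy.ndarray'>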

So the only thing we need to add here is for processing the nan and inf values. I have raised a PR to the branch of this PR @sphuber.

@sphuber
Contributor Author

sphuber commented Jul 14, 2021

Just to clarify, the support of NumPy arrays is not something we need to do on our side. Individual classes should implement as_dict to be able to support it.

So the pymatgen class that you failed to wrap in the MsonableData class was failing because it did not serialize the numpy arrays it contains? I guess that if we specify that as_dict should return something that is JSON-serializable, then that makes sense and technically we would be correct in raising for classes that don't respect this. However, if many pymatgen classes do not actually return properly JSON-serializable dictionaries, then that defeats the purpose of our class a bit, doesn't it? If the goal is to allow wrapping "any" MSONable class without any changes, then it would be a bit pointless if it fails for most. Do you have an idea roughly what percentage of classes don't work with the current implementation? Did you just happen to find one of a few corner cases? Or are there quite a number?

@zhubonan
Contributor

zhubonan commented Jul 14, 2021

No, what I meant is that in my case they just fail because of things like NaN or -inf not being JSON serializable. The numpy issue is a similar but different problem.

Like you said, the as_dict method of the class should take care of converting NumPy objects. But very often it does not, and the end result can still be serialized with their MontyEncoder when saving it to a JSON file.

For example:

import numpy as np
from monty.json import MSONable


class MsonableClass(MSONable):
    """Dummy class that implements the ``MSONable`` interface."""

    def __init__(self, data):
        """Construct a new object."""
        self._data = data

    @property
    def data(self):
        """Return the data of this instance."""
        return self._data

    def as_dict(self):
        """Represent the object as a JSON-serializable dictionary."""
        return {
            '@module': self.__class__.__module__,
            '@class': self.__class__.__name__,
            'data': self._data,
        }

    @classmethod
    def from_dict(cls, d):
        """Reconstruct an instance from a serialized version."""
        return cls(d['data'])

obj = MsonableClass({'a': np.arange(10)})
  • Calling as_dict does not necessarily give a JSON-serializable dictionary - the numpy/datetime objects are still there. The returned dictionary cannot be dumped into a file using the standard encoder.
  • Calling to_json would return the JSON string representation, encoded using MontyEncoder, where the numpy object is serialized, and the NaN, inf objects are converted to string.
  • Calling from_dict would automatically go through the MontyDecoder to reconstruct the numpy, datetime objects, etc.

Because we don't store the content in the database as a string, but rather pass the JSON directly to PostgreSQL, we don't go through the encoder/decoder stages, but they are needed for many of the classes to work.

As you can see, there is an asymmetry - as_dict does not always give a JSON-serializable object (although it should!) as it does not go through the encoder, but from_dict does process and recover things like numpy arrays and datetime objects.

To have the same level of support, we need to make sure that when serializing we also go through the encoder. This can be done by processing objects using the default method that converts each object into the JSON-serializable format (I will update the code).

In addition, similar pre-processing and post-processing steps are needed for the inf and NaN values. They are implicitly handled by the JSONEncoder in the stdlib when dumping to/loading from JSON strings, but as we don't actually dump the object into strings, we have to process them explicitly here.
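A minimal sketch of such pre-/post-processing (illustrative only; the token names and helper functions are made up, not necessarily what ended up in aiida-core):

import math

def encode_specials(data):
    """Replace NaN and infinite floats with string tokens so the structure can be stored as JSONB."""
    if isinstance(data, float) and math.isnan(data):
        return 'float_nan'
    if isinstance(data, float) and math.isinf(data):
        return 'float_inf' if data > 0 else 'float_-inf'
    if isinstance(data, dict):
        return {k: encode_specials(v) for k, v in data.items()}
    if isinstance(data, list):
        return [encode_specials(v) for v in data]
    return data

def decode_specials(data):
    """Inverse of ``encode_specials``: turn the string tokens back into floats."""
    tokens = {'float_nan': float('nan'), 'float_inf': float('inf'), 'float_-inf': float('-inf')}
    if isinstance(data, str) and data in tokens:
        return tokens[data]
    if isinstance(data, dict):
        return {k: decode_specials(v) for k, v in data.items()}
    if isinstance(data, list):
        return [decode_specials(v) for v in data]
    return data

print(encode_specials({'uncertainty': float('nan'), 'bound': float('inf')}))
# {'uncertainty': 'float_nan', 'bound': 'float_inf'}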

@chrisjsewell chrisjsewell marked this pull request as draft July 21, 2021 09:42
@chrisjsewell
Member

chrisjsewell commented Jul 21, 2021

Two notes that came to mind whilst discussing with @ramirezfranciscof:

  1. Personally, I don't think we should need pymatgen/monty as a dependency here, nor even probably call this MSONable. It can be more general than that: just wrapping any class with an as_dict (instance) method, that creates a dict (which includes the special @module and @class keys), and a from_dict (class) method.
  2. What happens if you try to load this with e.g. a Pymatgen version that has changed in a breaking manner, whereby the class is no longer in the same module, or the from_dict input dict structure has changed? I think this is going to be an issue for "long term" storage, no? As in, you try to load this stuff from an archive, but it fails and you don't know why. I guess this is a similar issue to having data plugins, whereby there is no current functionality for migrations. There though, we do at least record the version of the plugin, and so here I think we should do the same thing (trying to capture the module's version; a rough sketch follows below).
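A rough sketch of what capturing the source package's version could look like (illustrative only; the helper is made up):

import importlib

def get_source_version(obj):
    """Best-effort lookup of the version of the package that defines ``obj``'s class."""
    root_module = type(obj).__module__.split('.')[0]
    try:
        return getattr(importlib.import_module(root_module), '__version__', None)
    except ImportError:
        return None

# The returned string could then be stored alongside the serialized dictionary in the node attributes.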

@zhubonan
Contributor

zhubonan commented Jul 21, 2021

What happens if you try to load this with e.g. a Pymatgen version that has changed in a breaking manner, whereby the class is no longer in the same module, or the from_dict input dict structure has changed.

I think that MSONable does try to record the module version (though it is not enforced), and when deserialising it is not checked.

Personally, I don't think we should need pymatgen/monty as a dependency here

I agree. There is no need to check whether an object is actually a subclass of MSONable. As long as it is MSONable-like it should work. However, we would still want to pre-process the return value of as_dict so that it can be stored by PostgreSQL as JSON, e.g. a workaround for the NaN and -inf values.

@sphuber
Contributor Author

sphuber commented Jul 21, 2021

Note that the whole idea of this plugin was to make pymatgen classes that implement the MSONable interface easy to wrap and store in AiiDA. If we remove the requirement for accepted objects to be instances of MSONable and accept any class that implements as_dict and from_dict, I think we have to discuss the interface in more detail. A big problem with the current implementation was that what we really need is a method that returns a serialized version of the object that can be stored as JSON. The as_dict method of the MSONable class does not do this, and so we still need to do a lot of work, including multiple round-trip JSON (de)serializations. So technically we need as_json and from_json methods. This would make it very general and clearly define what objects can successfully be wrapped, but it means none of the pymatgen objects can be automatically wrapped, which was the original motivation for this.

@zhubonan
Contributor

The as_dict method of the MSONable class does not do this, and so we still need to do a lot of work, including multiple round-trip JSON (de)serializations.

I probably got you confused... From the specification of the MSONable class (link), the returned value is supposed to be JSON serialisable (using Python's encoder for the NaN and inf). However, the problem is that an MSONable object may have attributes that are dictionaries/lists of MSONable objects or numpy arrays; in this case returning a simple, unprocessed dictionary for calling the constructor can break the JSON serialisation requirement. If that is the case it is down to the subclasses to implement as_dict correctly. One solution in some pymatgen classes is to override MSONable.as_dict with additional steps: make a JSON string using the MontyEncoder and reload it using the plain JSONDecoder. Doing such round trips makes what as_dict returns JSON serialisable (using the standard JSONEncoder). However, if one's class is simple enough, the as_dict method does not need to implement this.

What I meant is that we do not have to make monty a dependency and there is no need to check whether an object is a subclass of MSONable. We just implement the interface for serialisation and deserialisation: store the output of as_dict in the database, and provide a convenient method for recreating the object from the stored data (identify the class and call from_dict).

We can remove the step using the MontyEncoder during serialisation - this just means that classes not implementing as_dict correctly would not work, which is fine. For deserialisation, the default from_dict method of MSONable does call the MontyDecoder, so we can also remove the additional step of calling the MontyDecoder here. Again, this only means that classes not implementing the original MSONable interface correctly will not work, which I think is fine.

@sphuber
Contributor Author

sphuber commented Jul 27, 2021

I probably got you confused...

I am afraid I am still confused 😅

From the specification of the MSONable class (link), the returned value is supposed to be JSON serialisable (using Python's encoder for the NaN and inf).

Exactly, but the problem is that some (if not many or all) implementations do not respect this requirement at all.

However, the problem is that an MSONable object may have attributes that are dictionaries/lists of MSONable objects or numpy arrays; in this case returning a simple, unprocessed dictionary for calling the constructor can break the JSON serialisation requirement. If that is the case it is down to the subclasses to implement as_dict correctly. One solution in some pymatgen classes is to override MSONable.as_dict with additional steps: make a JSON string using the MontyEncoder and reload it using the plain JSONDecoder. Doing such round trips makes what as_dict returns JSON serialisable (using the standard JSONEncoder). However, if one's class is simple enough, the as_dict method does not need to implement this.

I agree, but apparently some classes don't do this. The whole reason we went in this direction is because you reported that you found that certain pymatgen classes don't respect this requirement properly.

I am fine with adding a more generic class that is not necessarily linked to MSONable but accepts anything that respects the spec of as_dict and from_dict. However, that would mean that many pymatgen classes cannot be wrapped as-is, as their implementations are incorrect.

@zhubonan
Contributor

Exactly, but the problem is that some (if not many or all) implementations do not respect this requirement at all.

I agree, but apparently some classes don't do this. The whole reason we went in this direction is because you reported that you found that certain pymatgen classes don't respect this requirement properly.

I was wrong about this - most of the pymatgen classes do have overrides for as_dict and provisions for potential problems. Initially, I looked at MSONable.as_dict and thought surely that would not work for all MSONable subclasses. In fact, most of them do have overrides. So the workaround we need here is just to deal with the NaN, inf and -inf values.

@zhubonan
Contributor

@sphuber I have made some suggestions on the changes here: sphuber#13.
I think that we just have to assume that the target class has implemented as_dict and from_dict properly. That should be the case for most (intended) use cases.

Here is an ad-hoc test script I use:

# Run with ``verdi run`` or inside ``verdi shell``, where DataFactory and load_node are available.
from pymatgen.ext.matproj import MPRester
from pymatgen.analysis.phase_diagram import PhaseDiagram

MsonableData = DataFactory('msonable')

rester = MPRester()
data = rester.get_entries_in_chemsys('Li-O')
pd = PhaseDiagram(data)
pentry = MsonableData(pd)
pentry.store()
node = load_node(pentry.uuid)
output = node.obj

which pulls some data entries from the Materials Project, creates a phase diagram and stores it in AiiDA. The PhaseDiagram class's as_dict assembles the dictionary by calling the as_dict of each object (see here). The same result can be achieved by doing loads(dumps(pd, cls=MontyEncoder)), but the former is the proper implementation.

@sphuber sphuber force-pushed the feature/4975/msonable-data-plugin branch from 385b4a4 to d8f88f2 Compare August 1, 2021 08:20
@codecov

codecov bot commented Aug 1, 2021

Codecov Report

Merging #5017 (d8f88f2) into develop (3ad0712) will decrease coverage by 0.14%.
The diff coverage is 96.30%.

❗ Current head d8f88f2 differs from pull request most recent head 9537ef4. Consider uploading reports for the commit 9537ef4 to get more accurate results

@@             Coverage Diff             @@
##           develop    #5017      +/-   ##
===========================================
- Coverage    80.39%   80.25%   -0.13%     
===========================================
  Files          529      516      -13     
  Lines        36882    36814      -68     
===========================================
- Hits         29649    29543     -106     
- Misses        7233     7271      +38     
Flag Coverage Δ
django 74.74% <96.30%> (-0.15%) ⬇️
sqlalchemy 73.65% <96.30%> (-0.14%) ⬇️


Impacted Files Coverage Δ
aiida/orm/nodes/data/jsonable.py 96.30% <96.30%> (ø)
aiida/engine/exceptions.py 0.00% <0.00%> (-100.00%) ⬇️
aiida/manage/tests/unittest_classes.py 0.00% <0.00%> (-63.63%) ⬇️
aiida/manage/database/integrity/__init__.py 50.00% <0.00%> (-50.00%) ⬇️
aiida/cmdline/utils/ascii_vis.py 0.00% <0.00%> (-16.36%) ⬇️
aiida/manage/tests/__init__.py 88.54% <0.00%> (-11.46%) ⬇️
aiida/tools/data/array/kpoints/__init__.py 88.68% <0.00%> (-11.32%) ⬇️
aiida/tools/importexport/dbimport/__init__.py 94.12% <0.00%> (-5.88%) ⬇️
aiida/cmdline/params/options/contextualdefault.py 50.00% <0.00%> (-4.54%) ⬇️
aiida/cmdline/params/types/workflow.py 71.43% <0.00%> (-3.57%) ⬇️
... and 102 more


zhubonan
zhubonan previously approved these changes Aug 1, 2021
Contributor

@zhubonan zhubonan left a comment

Looks good to me. Decoupling from the MSONable class is sensible, as the as_dict/from_dict interface is a general one - MSONable simply includes convenient default implementations. The issue with common but non-standard JSON values such as -inf, inf and nan is also addressed.

@sphuber sphuber force-pushed the feature/4975/msonable-data-plugin branch from d8f88f2 to 8104943 Compare August 1, 2021 13:32
@sphuber sphuber marked this pull request as ready for review August 1, 2021 13:37
@sphuber
Contributor Author

sphuber commented Aug 1, 2021

I think we have converged on the best design. I have updated the PR to reflect this and have renamed all classes to properly reflect the fact that we support anything that is JSON-serializable. In that sense it is no longer directly tied to monty's MSONable, which also makes it less of a problem that certain pymatgen classes won't be compatible, since their as_dict methods do not return a JSON-serializable dictionary.

@chrisjsewell I also addressed the remarks from your initial PR so this PR is now ready for final review.

@sphuber sphuber changed the title Add the MsonableData data plugin Add the JsonableData data plugin Aug 2, 2021
@sphuber sphuber force-pushed the feature/4975/msonable-data-plugin branch 3 times, most recently from b5a38cb to 6513a79 Compare August 2, 2021 07:32
@sphuber
Contributor Author

sphuber commented Aug 2, 2021

@chrisjsewell typing is updated. Ready for final review

@sphuber
Contributor Author

sphuber commented Aug 12, 2021

@chrisjsewell I will be merging this tomorrow. Let me know if you still want to review it.

Member

@chrisjsewell chrisjsewell left a comment

Cheers, just a minor nitpick


"""
try:
return self._obj
Member

@chrisjsewell chrisjsewell Aug 12, 2021

Actually, wait: when would this raise an AttributeError? Do we not want to always reload the object, in case the attributes have changed?

Member

Ah yeah, the AttributeError would be for when it is loaded, I guess, but what if you did:

data = JsonableData(obj)
data.set_attribute('x', 1)
data.obj

would the user expect the obj to be updated?

Contributor Author

Good point. This is only really a problem for unstored nodes, though, and I doubt that users will be manually changing the attributes of the serialized object. Since the whole point of this class is to allow simple wrapping of serializable instances, there seems little point in manually changing the attributes after construction but before storing. If we do want to support this as a use case, then we should indeed always fully reconstruct the obj from the database. This would introduce a significant performance hit since it needs to hit the database and deserialize the entire object. I would argue that this is not worth it, since these nodes will most often be used to retrieve the obj from a stored node, rather than mutating unstored versions of them.

TL;DR: I would keep the current implementation and at most add a warning in the obj property docstring that the object is cached, so messing with the attributes manually can break things.

Member

Yeh that sounds fine to me 👍

This data plugin is designed to make it easy to wrap instances of
arbitrary Python classes and store them as a node in the database. The
only requirement is that the class implements the `as_dict` method,
which should return a JSON-serializable dictionary representing the
instance, i.e. one that can be serialized by the `JSONEncoder` of the
`json` built-in module.

If the class also implements a `from_dict` method that can consume the
dictionary returned by `as_dict` to reconstruct an instance, the
original instance wrapped by the node (even when loaded from the
database) can be recovered through the `obj` property.
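To make the generalized contract concrete, here is a minimal sketch of a plain Python class that JsonableData can wrap (the Circle class is made up for illustration; the import path follows the file added in this PR, and a loaded AiiDA profile is assumed):

from aiida.orm.nodes.data.jsonable import JsonableData

class Circle:
    """Toy class implementing the as_dict/from_dict contract described above."""

    def __init__(self, radius):
        self.radius = radius

    def as_dict(self):
        """Return a JSON-serializable dictionary, including the special class identification keys."""
        return {
            '@module': self.__class__.__module__,
            '@class': self.__class__.__name__,
            'radius': self.radius,
        }

    @classmethod
    def from_dict(cls, d):
        """Reconstruct an instance from the dictionary produced by as_dict."""
        return cls(d['radius'])

node = JsonableData(Circle(1.5))
node.store()
print(node.obj.radius)  # 1.5 -- the wrapped instance is accessible through the obj property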
@sphuber sphuber force-pushed the feature/4975/msonable-data-plugin branch from 6513a79 to 9537ef4 Compare August 12, 2021 14:24
@sphuber sphuber merged commit 79d2d27 into aiidateam:develop Aug 12, 2021
@sphuber sphuber deleted the feature/4975/msonable-data-plugin branch August 12, 2021 16:28
Successfully merging this pull request may close these issues.

✨ NEW: MSONableNode data type