Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the EnumData data plugin to easily store Enum instances #5225

Merged
merged 1 commit into from
Nov 24, 2021

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Nov 9, 2021

Fixes #5224

The Enum instance is represented by two attributes in the database:

  • value: the enum value
  • identifier: the string representation of the enum's class

The original enum instance can be obtained from the node through the
value property. This will except if the enum class can no longer be
imported or the stored value is no longer a valid enum value. This can
be the case if the implementation of the class changed since the
original node was stored.

The plugin is named EnumData, and not Enum, because the latter would
overlap with the Python built-in class Enum from the enum module.
The downside is that this differs from the naming of other Python base
type analogs, such as Str and Float, which simply use a capitalized
version of the type for the data plugin class.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 9, 2021

@giovannipizzi here example implementation of what we discussed.

@codecov
Copy link

codecov bot commented Nov 9, 2021

Codecov Report

Merging #5225 (34cab28) into develop (832ddba) will increase coverage by 6.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5225      +/-   ##
===========================================
+ Coverage    75.20%   81.27%   +6.07%     
===========================================
  Files          533      534       +1     
  Lines        37377    37423      +46     
===========================================
+ Hits         28107    30410    +2303     
+ Misses        9270     7013    -2257     
Flag Coverage Δ
django 76.15% <100.00%> (?)
sqlalchemy 75.23% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/orm/__init__.py 100.00% <ø> (ø)
aiida/orm/nodes/__init__.py 100.00% <ø> (ø)
aiida/orm/nodes/data/__init__.py 100.00% <100.00%> (ø)
aiida/orm/nodes/data/enum.py 100.00% <100.00%> (ø)
aiida/manage/tests/main.py 88.48% <0.00%> (+0.93%) ⬆️
aiida/manage/manager.py 80.45% <0.00%> (+1.12%) ⬆️
aiida/backends/general/migrations/utils.py 86.02% <0.00%> (+2.08%) ⬆️
aiida/backends/testbase.py 93.23% <0.00%> (+2.55%) ⬆️
aiida/backends/__init__.py 90.91% <0.00%> (+18.19%) ⬆️
aiida/tools/importexport/dbimport/main.py 94.12% <0.00%> (+23.53%) ⬆️
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 832ddba...34cab28. Read the comment docs.

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Do I understand correctly that to get the actual value, one would need to do node.value.value? It's a bit convoluted (but I guess that, if we want to use this implantation, there's not much to improve for this?)

One thing I don't like too much is that .value is a property that raises if the python definition is not there (very common: people just import Data nodes and might not have the python implementation coming with it). I think in most cases, just having the underlying enum value is enough.

Suggestion:

  • have a raw_value or enum_value property on the class that just returns self.get_attribute(self.KEY_VALUE) and never raises
  • if we believe that .value should return the actual Enum, replace .value with an actual method .get_enum(). I know this goes in contradiction with the base classes, but e.g. this is not JSON-serializable (the value might be, though). So it's probably OK not to consider this a base class, and to have an explicit method if you really want to reconstruct a class that might not be there (like StructureData.get_ase()).

Actually, thinking better to it (and to my first point), maybe I would just have .value return the value of the Enum, i.e. this replaces what I was suggesting to call .raw_value above?

So you can do node.value and this would work like Enum.value that is quite intuitive and also avoids to have to do node.value.value. And if you really care about the Enum, you do node.get_enum()

@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-
"""Data plugin that allows to easily wrap an :class:`enum.Enum` instance."""
import enum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any remote risk that this tries to import itself? In this case, should we rename the module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but instead of renaming the module, I have changed the import to from enum import Enum. That should be fine, no?

@sphuber
Copy link
Contributor Author

sphuber commented Nov 10, 2021

Thanks for the feedback @giovannipizzi . I agree that .value.value is probably going to be ugly. I have changed the interface quite a bit (see commit message for explanation) to make it respect the interface of normal enum's. There are now the get_enum and get_member methods to retrieve the enum and the enum member from the node, respectively.

@ramirezfranciscof
Copy link
Member

Do you mind me asking what is the use case for this? Initially I would have thought that the natural thing is to store a whole enum class instead of a single member, since typically these groupings of constants make sense together. Now I'm thinking maybe you want it to also track a specific element of the enum that was selected, for example, as an input for a calculation. Still it seems pretty fragile, in the sense that many untracked modifications in the environment can break get_enum no? Wouldn't it be better to store the whole class there instead of just the identifier?

@sphuber
Copy link
Contributor Author

sphuber commented Nov 23, 2021

Do you mind me asking what is the use case for this?

As any other use case for an Enum. It is to communicate the value for a variable that can take a number of values in a concise way without passing around magic strings or numbers. Just like we use enums in normal Python code, it may be useful to have them as an input to a Process. But since they are currently not storable, you have to transform them into a Str node probably, and then in the process deserialize it yourself back to the enum member. With this plugin, this is made more transparent.

Initially I would have thought that the natural thing is to store a whole enum class instead of a single member, since typically these groupings of constants make sense together.

This wouldn't make sense to me honestly. Sure, like this you can always load the node and deserialize the actual member, even if the original class was modified or not even available anymore. But the whole purpose of this plugin is to allow actually using it inside a process and in that case the enum has to be available in the environment or the code wouldn't work anyway. So serializing the entire class and storing that on the node won't help you there.

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a minor thing to fix in the doctsring - I'm pre-approving, but great if you could fix it before merging


The class ``Color`` is an enumeration (or enum). The attributes ``Color.RED`` and ``Color.GREEN`` are enumeration
members (or enum members) and are functionally constants. The enum members have names and values: the name of
``Color.RED`` is ``RED`` and the value of ``Color.BLUE`` is ``3``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Colo.BLUE is not in the example above

An `Enum` member is represented by three attributes in the database:

 * `name`: the member's name
 * `value`: the member's value
 * `identifier`: the string representation of the enum's identifier

The original enum member can be obtained from the node through the
`get_member` method. This will except if the enum class can no longer be
imported or the stored value is no longer a valid enum value. This can
be the case if the implementation of the class changed since the
original node was stored.

The plugin is named `EnumData`, and not `Enum`, because the latter would
overlap with the Python built-in class `Enum` from the `enum` module.
The downside is that this differs from the naming of other Python base
type analogs, such as `Str` and `Float`, which simply use a capitalized
version of the type for the data plugin class.
@sphuber sphuber force-pushed the feature/5224/enum-data-plugin branch from 680f065 to 34cab28 Compare November 24, 2021 16:14
@sphuber sphuber merged commit defc4f0 into aiidateam:develop Nov 24, 2021
@chrisjsewell
Copy link
Member

Eurgh, @sphuber I very much agree with @ramirezfranciscof, it is fragile; I don't how he feels about it, but I don't think you adequately answered he's questions.
Take this example,

In [1]: from plumpy.loaders import get_object_loader

In [2]: import enum

In [3]: class DummyEnum(enum.Enum):
   ...:     """Dummy enum for testing."""
   ...: 
   ...:     OPTION_A = 'a'
   ...:     OPTION_B = 'b'
   ...: 

In [4]: get_object_loader().identify_object(DummyEnum)
Out[4]: '__main__:DummyEnum'

This is what will be stored under identifier in the database, and so basically very fragile if you want to call get_enum later, it will be very easy to "lose" the original definition (especially for archived databases).
Previously, we haven't used get_object_loader for anything outside of checkpoints, which I would say is a special case because they are more "transient" (indeed they are stripped on exporting/importing archives)

@chrisjsewell
Copy link
Member

chrisjsewell commented Nov 24, 2021

A few more notes:

  1. although we use Enum quite a bit in aiida, a new possible alternative is https://docs.python.org/3/library/typing.html#typing.Literal

  2. In general, IMO we should strive to not have anything (outside of checkpoints) stored in the database/repository that is Python specific.

A possible more general approach for a data type that stored "validated" fields, would be to store them in some way alongside a jsonschema (I already kind of do this in some of my plugins), and enum would be a subset of such a data type (https://json-schema.org/understanding-json-schema/reference/generic.html?highlight=enum#enumerated-values)

{
  "enum": ["red", "amber", "green"]
}

Naturally, you would want to deduplicate these schemas: I guess the easiest way currently would be to save them in the new repository.
But obviously, given such a jsonschema and its value, it would be easy to construct an enum/literal, without the need for the original class object

@sphuber
Copy link
Contributor Author

sphuber commented Nov 25, 2021

This is what will be stored under identifier in the database, and so basically very fragile if you want to call get_enum later, it will be very easy to "lose" the original definition (especially for archived databases).

I clearly recognized this problem and the class deals with this "properly", as in the best way possible, by saying it cannot import the class. Note that this does not necessarily stop you from still using the node, the original name and value are also stored and can be retrieved from the name and value properties and can be used directly.

But my main point is, even if you were to store the entire representation, how would you then actually use it in code? The way you would use this is something like the following:

from aiida.engine import Process
from aiida.orm import EnumData
from some.module import SomeEnum

class SomeProcess(Process):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.input('some_enum', valid_type=EnumData)

    def run(self):
        if self.inputs.some_enum == SomeEnum.VALUE_ONE:
            # do something
        elif self.inputs.some_enum == SomeEnum.VALUE_TWO:
            # do something else

So in this example, if you presume that the EnumData cannot load the enum class and so get_member will fail, that means that from some.module import SomeEnum is not importable. But that means this code is broken anyway. If you would implement it as you suggest and store the entire representation, how would that help you? You can then load the instance, but you cannot use it, because the code cannot import it and so you cannot compare to it, making it useless anyway.

My point is that the example you provide of declaring an enum in a shell and then storing a member with the EnumData of course that will break, but that is not the use case for this plugin. But at the same time, when used as described in the example it works perfectly fine and serves an actual purpose.

@sphuber sphuber deleted the feature/5224/enum-data-plugin branch November 25, 2021 08:39
@chrisjsewell
Copy link
Member

when used as described in the example it works perfectly fine and serves an actual purpose.

yes exactly, as I've already said, it prioritises transient use over long term storage, and is fragile to changes in module code.
I would not recommend anyone use this in their plugin

@ramirezfranciscof
Copy link
Member

@sphuber perhaps we could maybe make a meeting to discuss this better live? So then you can explain to us a bit more the details of the use case you had in mind when you discussed with @giovannipizzi , and then we might understand this better and/or perhaps brainstorm some ideas that could make the information stored here a bit more robust.

We could carve a couple of minutes during the coding week so as to not take more time from your regular schedule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a data plugin to store Enum instances
4 participants