
This issue was moved to a discussion.


Presenting asdf-pydantic, create ASDF tags with pydantic models. #1507

Closed
4 of 5 tasks
ketozhang opened this issue Mar 28, 2023 · 10 comments

Comments

@ketozhang

After pip install asdf-pydantic you can do something like this:

from asdf_pydantic import AsdfPydanticModel

class Rectangle(AsdfPydanticModel):
    _tag = "asdf://asdf-pydantic/examples/tags/rectangle-1.0.0"

    width: float
    height: float

# After creating and installing the extension ...

af = asdf.AsdfFile()
af["rect"] = Rectangle(width=1, height=1)
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {
    author: The ASDF Developers,
    homepage: 'http://github.com/asdf-format/asdf',
    name: asdf,
    version: 2.14.3}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {
        name: asdf,
        version: 2.14.3}
  - !core/extension_metadata-1.0.0 {
    extension_class: mypackage.shapes.ShapesExtension,
    extension_uri: 'asdf://asdf-pydantic/shapes/extensions/shapes-1.0.0'}
rect: !<asdf://asdf-pydantic/shapes/tags/rectangle-1.0.0> {
    height: 1.0,
    width: 1.0}
...

Features

  • Create ASDF tags from your pydantic models with batteries (converters) included.
  • Validates data models as you create them, not only when reading and writing ASDF files.
  • Preserves Python types when deserializing ASDF files.
  • All the cool things that come with pydantic (e.g., JSON encoders, JSON schemas, pydantic types).
  • Generates ASDF schemas for you. (TBD, talk to me if you have ideas)
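As a minimal sketch of the create-time validation point, using plain pydantic's BaseModel as a stand-in for AsdfPydanticModel (which subclasses it):

```python
from pydantic import BaseModel, ValidationError

# BaseModel stands in for AsdfPydanticModel here; the create-time
# validation shown below is inherited from pydantic itself.
class Rectangle(BaseModel):
    width: float
    height: float

rect = Rectangle(width=1, height=1)  # valid data is coerced to float
assert rect.width == 1.0

# Bad data fails immediately, long before any ASDF read or write.
try:
    Rectangle(width="not a number", height=1)
except ValidationError:
    print("rejected at model creation")
```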

Rationale

Defining and prototyping composite types with Python standard types (UserDict, dataclasses, TypedDict) is okay, but nothing beats the flexibility of pydantic. With ASDF you're already in the mental space of working with a YAML or JSON hierarchical structure, which feels natural with pydantic.

There is big potential beyond integrating with the Python ecosystem. For example, tags created with asdf-pydantic are already readily JSON-serializable (other existing tags, like astropy's, may require custom serialization), and the JSON schema comes for free. Initial adoption via JSON schema in another language (e.g., C++, C#, Java, JavaScript) could be much faster than with ASDF schema; and, not to forget, JSON schema is the most common schema format for HTTP APIs.
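For instance, pydantic can emit a JSON schema for a model directly (shown here with the v1-era `.schema()` API, which was current at the time of this thread, and a plain BaseModel stand-in):

```python
from pydantic import BaseModel

class Rectangle(BaseModel):
    width: float
    height: float

# .schema() returns a plain dict following JSON Schema conventions;
# ASDF tag/URI information would still need to be layered on top.
schema = Rectangle.schema()
assert schema["properties"]["width"]["type"] == "number"
assert set(schema["required"]) == {"width", "height"}
```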


I am looking for community interest and user feedback, as well as contributors to incubate this project if the ASDF team is interested.

@braingram
Contributor

Thanks for sharing!

I'm excited to try out your project. I'm curious, did you encounter any issues with ASDF (including any non-intuitive API) while developing the project (or are there features that if added to ASDF would make this project easier/cleaner)?

@WilliamJamieson
Contributor

I agree that this is a good idea. I have been playing with the idea of fusing asdf to pydantic since I started working with ASDF.

Several thoughts:

  1. The generation of ASDF schemas is a necessity for interoperability with non-python based ASDF libraries. One of the major points of using ASDF is that data being written to and read from an ASDF file will be validated. pydantic would ensure this if a file is written using python; however, without a schema to transmit the information, any non-python library would have no method to handle validation of data. Note that I realize there is no current roadmap for a non-python based ASDF library, but that is one of the lofty principles that ASDF attempts to follow.

    • To this end, pydantic does provide an interface for generating JSON schemas: https://docs.pydantic.dev/usage/schema/.
    • This is a purely JSON-schema-esque Python dictionary, so tag information will need to be populated.
    • Also, correctly referencing existing ASDF schemas will be difficult.
  2. Versioning will need to be implemented. Since ASDF is a file format, anything of this type will need to support backwards compatibility with opening older versions of the model objects. Thus if you tweak your AsdfPydanticModel, you have to have a system in place to open previously written data using this model (ASDF is supposed to be an archival format).

    • In theory, basic "rules" can be in place, wherein changes to AsdfPydanticModel based objects need to be done in a flexible enough way. However, there is no way to really ensure this. (I would be happy to see a way to ensure this).
    • Basically, schemas are a way to preserve previous iterations of object representations outside the code governing the Python objects they represent.
  3. I am not sure how this would be able to handle non-builtin and non-AsdfPydanticModel based objects. In your example, would you be able to change it so that the width and height are astropy.units.Quantity objects? The hand-off between pydantic and ASDF, especially for validation, might be difficult.

  4. How does subclassing work? That is if you subclass an existing model, one would have to explicitly set a new tag otherwise deserialization will become an issue (which object would it be deserialized to?). When generating schemas from the AsdfPydanticModel would the schemas be essentially duplicated or would they reference something in common?

  5. Should there be an interface to "wrap" existing Python objects? Part of the reason for all the duplication is that users of ASDF do not necessarily control the objects they wish to write to ASDF files. It would be great to just define something like AsdfPydanticModel but declare that it represents some other Python object for ASDF, wherein one basically defines something akin to the to_yaml_tree or from_yaml_tree interfaces of ASDF converters. Then one can at least combine creation of schemas and converters into a single object.

There are more things to consider once you dive in.

@ketozhang
Author

One part I decided to ignore was optimizing entry point installation as pointed out in various pages of the ASDF docs.

What resulted is that asdf-pydantic implements a Converter that contains a dynamic mapping between tags and types, to be referenced later when the user creates their Extension or extension manifest. I've not yet tested whether the performance loss is significant enough for me to change how users statically specify URIs, types, etc.


Maybe a nice-to-have is that the API or the docs do not encourage building a generic Converter without adding your own fixings. I have no specific request here, but the limitations made me design asdf-pydantic's converter as (1) a singleton that (2) stores a mapping between tags and types.

@ketozhang
Author

@WilliamJamieson Let me first address the questions that I can answer quickly, then give more thought to the others (especially the schema and versioning questions):

  3. I am not sure how this would be able to handle non-builtin and non-AsdfPydanticModel based objects.

Oh I was so thrilled when this worked amazingly. See this passing test script that uses astropy.units.Quantity.
https://github.com/ketozhang/asdf-pydantic/blob/main/tests/patterns/astropy_types_test.py

Look through the other test scripts as well.

  4. How does subclassing work? That is if you subclass an existing model, one would have to explicitly set a new tag otherwise deserialization will become an issue (which object would it be deserialized to?). When generating schemas from the AsdfPydanticModel would the schemas be essentially duplicated or would they reference something in common?

The rule to follow is:

  1. Every defined AsdfPydanticModel should be a tag.
  2. If a field's type is an AsdfPydanticModel, it is referenced as a tag.
  3. If a field's type is not an AsdfPydanticModel but is compatible with the standard ASDF schema or installed extensions, the proper tag and/or serialization is used.
  4. If a field's type is pydantic.BaseModel but not (2), then its fields are treated as in (3).
  5. If a field is none of (2), (3), or (4), then serialization will fail.

With those rules, subclassing follows (1), and a new tag needs to be defined (not yet enforced in asdf-pydantic).
For schemas, I have limited exposure to YAML schema inheritance, and this does not exist in JSON schema.

  5. Should there be an interface to "wrap" existing Python objects?

If we agree the minimum the user must do is to define the object's data model in an AsdfPydanticModel, then pydantic already provides the wrapper:

import asdf
from pydantic import parse_obj_as
from asdf_pydantic import AsdfPydanticModel

class MyForeignObjectAdapterModel(AsdfPydanticModel):
  ...

my_foreign_object = ...

af = asdf.AsdfFile({
  "obj": MyForeignObjectAdapterModel.parse_obj(my_foreign_object)
})

# The same works for a collection of foreign objects:
sequence_of_foreign_objects = ...

af = asdf.AsdfFile({
  "objs": parse_obj_as(list[MyForeignObjectAdapterModel], sequence_of_foreign_objects)
})

Now, how complicated the adapter is depends on your object. All Python standard types are adaptable for free. You may take advantage of the object's __dict__ if it is defined.

@WilliamJamieson
Contributor

@ketozhang it's great that you have already considered many of the same things I have. I'll take the time to browse through your code and play with it.

There are two more things that I think would be useful:

  1. It would be nice to be able to treat these models as proxies for the AsdfFile object. Namely, each AsdfPydanticModel instance contains all the information needed for this (assume the object is rooted with the keyword root or something more specific). So one should be able to access the AsdfFile which would be created by serializing the object instance. This would make non-serialization related features, such as search, info, etc. immediately accessible as you generate datamodels. That is, in your example there would be no need for:
af = asdf.AsdfFile()
af["rect"] = Rectangle(width=1, height=1)

Instead, it would already exist (or be generated on the fly and cached) with access given via something like Rectangle(width=1, height=1).asdf_file. Then the access to ASDF features becomes pretty direct.

  1. The ability to store "object (not instance) specific metadata".

    • Right now for the Roman Space Telescope, we store descriptive metadata about all datamodel instances in the rad schemas. For example, we encode the MAST archive's database endpoints and metadata directly in the schemas, e.g. https://github.com/spacetelescope/rad/blob/2f909e39b46d2ebb6ddb50e033b8e11afcb40c7c/src/rad/resources/schemas/aperture-1.0.0.yaml#L23-L25. This information is never set by the datamodel instances as it remains constant across all instances. However, this information is still accessible through the schema_info command on the AsdfFile object.
    • The current implementation of using this information is awkward because it is easy to search out and find this data given a valid ASDF file, but it is rather difficult to interrogate the schemas independently of an ASDF file instance to back out the data in a way which can be related to a file instance (schemas are very flat in that they do not really define an inheritance structure). This "reverse interrogation" is important so that the MAST archive can be updated to accommodate new data.
    • What is nice about defining everything via these models is that Python defines a nice inheritance structure. This means when python imports the module containing a datamodel, the nested structure is essentially fleshed out by Python before any instance is created. This can then be interrogated to figure out this metadata.

@WilliamJamieson
Contributor

  3. I am not sure how this would be able to handle non-builtin and non-AsdfPydanticModel based objects.

Oh I was so thrilled when this worked amazingly. See this passing test script that uses astropy.units.Quantity. ketozhang/asdf-pydantic@main/tests/patterns/astropy_types_test.py

I just had a chance to look over the source code for asdf_pydantic and it is pretty slick.

However, as your example tests currently stand, ASDF is not actually validating the data as it is being written to the ASDF file, nor is it validating the data as it is read from the ASDF file. Granted, pydantic will validate when the data gets placed into the object, so it will throw an error if the data inside it is not correct. But this will not occur during the "validation" portions of serialize/deserialize; instead it happens during object creation/conversion from YAML. More importantly, non-asdf-pydantic objects contained within asdf-pydantic objects will experience no validation.

I had to dig really deep and "watch" ASDF run its validation on the data as it is reading/writing. Basically, due to an oversight, asdf-pydantic does not trigger validation errors!

(I'm going to point at the latest PyPI release tag to explain what is going on; however, the current dev version is rapidly evolving in this area, see #1490 for what might be changing)

During tree validation ASDF validates from the root node down through all the leaf nodes in a depth-first-search. When a "tag" is encountered during validation, as is the case for the tagged objects in your example, we hit the following block of code:

asdf/asdf/schema.py

Lines 306 to 312 in efafac8

if self.serialization_context.extension_manager.handles_tag_definition(tag):
    tag_def = self.serialization_context.extension_manager.get_tag_definition(tag)
    schema_uris = tag_def.schema_uris
else:
    schema_uris = [self.ctx.tag_mapping(tag)]
    if schema_uris[0] == tag:
        schema_uris = []

In your particular case, you have a valid ASDF extension so it drops to:

asdf/asdf/schema.py

Lines 307 to 308 in efafac8

    tag_def = self.serialization_context.extension_manager.get_tag_definition(tag)
    schema_uris = tag_def.schema_uris

In your case, the schema_uris variable happens to be an empty list. When it hits the code that validates further down the tree:

asdf/asdf/schema.py

Lines 314 to 320 in efafac8

# Must validate against all schema_uris
for schema_uri in schema_uris:
    try:
        with self.resolver.resolving(schema_uri) as resolved:
            yield from self.descend(instance, resolved)
    except RefResolutionError:
        warnings.warn(f"Unable to locate schema file for '{tag}': '{schema_uri}'", AsdfWarning)

It iterates over an empty list and so does not descend further. Moreover, it does not raise any kind of alarms. This means that ASDF cannot validate anything contained within the subtree result of your AsdfPydanticModel.

I don't think we quite recognized the consequences of allowing tags which don't have a schema for ASDF to validate the contents against. However, this makes it much simpler to just skip writing a schema for an object and instead just define a tag. @eslavich might be able to comment on why this was allowed in the first place.

In this case, pydantic will ensure some data integrity, though only for objects defined via asdf-pydantic. But flawed data in non-asdf-pydantic-defined objects, which may "deserialize without issue", will remain unrecognized as invalid.

Thus, once asdf-pydantic can correctly create ASDF schemas, it should pass them as part of the extension process to ASDF to avoid these edge cases.

@ketozhang
Author

ketozhang commented Mar 30, 2023

@WilliamJamieson I encourage you to submit an issue to the repo for those suggestions that asdf-pydantic does not yet support.

Thank you. There are many things you mention that asdf-pydantic can already do, and I need to write them down in the documentation.

Data model metadata

The ability to store "object (not instance) specific metadata".

This is already supported. A field annotated with typing.ClassVar is automatically shared by every instance and inherited by every subclass. By default, this field is not written to ASDF; it only exists in the Python object. We could choose to store all ClassVar fields in the ASDF schema as you've done, but I am not aware of whether this is standard practice in ASDF.
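A minimal sketch of that behavior, using plain pydantic BaseModel as a stand-in and a hypothetical archive_catalog attribute (typing.ClassVar annotations are excluded from pydantic's per-instance fields):

```python
from typing import ClassVar
from pydantic import BaseModel

class Aperture(BaseModel):
    # Hypothetical class-level metadata: shared by all instances,
    # inherited by subclasses, and never part of the serialized data.
    archive_catalog: ClassVar[str] = "ScienceCommon.aperture"

    name: str

ap = Aperture(name="WFI_01")
assert "archive_catalog" not in ap.dict()  # not instance data
assert Aperture.archive_catalog == "ScienceCommon.aperture"
```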

Versioning and backward compatibility

The constraint is that you can only have one version of a Python package installed. If you want backward compatibility, the general rule is that the latest version of your Python package must be a superset of the extension, tag, and schema versions.

My take is that because your Python package must claim to support an extension at some version, to be backward compatible it must also register older versions of the extension in its manifest. You could use various strategies to make AsdfPydanticModel flexible, but it could be a much more stable approach to archive the source code itself in your Python package (e.g., put it in model/archive/shapes_1_0_0.py).

The fallback in asdf is to parse the data as Python native types. That works pretty well for a lot of archival purposes.
Another fallback is dynamic generation of models from an ASDF schema, which is possible with pydantic but out of scope for asdf-pydantic right now. However, I am not sure how this is more helpful than the asdf fallback. Again, you're not going to get much use out of your AsdfPydanticModel if you don't have the old version of the class existing elsewhere.

I'm open to ideas. How has the community implemented backward compatibility?

Are AsdfPydanticModel fields being validated in ASDF?

@WilliamJamieson, replying to the comment just before this. ASDF is not doing any validation on the AsdfPydanticModel object itself because the tests I have do not add a TagDefinition to the Extension. However, from my tests, ASDF is validating the fields of AsdfPydanticModel.

Try this example where I maliciously changed astropy.Quantity.value to a string in the YAML file.

Click for example
import asdf
import astropy.units as u
from asdf.extension import Extension
from astropy.units import Quantity

from asdf_pydantic import AsdfPydanticConverter, AsdfPydanticModel


class DataPoint(AsdfPydanticModel):
    _tag = "asdf://asdf-pydantic/examples/tags/datapoint-1.0.0"

    distance: Quantity[u.m]


AsdfPydanticConverter.add_models(DataPoint)


class TestExtension(Extension):
    extension_uri = "asdf://asdf-pydantic/examples/extensions/test-1.0.0"

    converters = [AsdfPydanticConverter()]  # type: ignore
    tags = [*AsdfPydanticConverter().tags]  # type: ignore


asdf.get_config().add_extension(TestExtension())

with open("test.asdf", "rb") as fp:
    asdf.open(fp)
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.14.4}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.14.4}
  - !core/extension_metadata-1.0.0 {extension_class: astropy_types_test.setup_module.<locals>.TestExtension,
    extension_uri: 'asdf://asdf-pydantic/examples/extensions/test-1.0.0'}
  - !core/extension_metadata-1.0.0
    extension_class: asdf_astropy._manifest.CompoundManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf-astropy, version: 0.4.0}
positions:
- !<asdf://asdf-pydantic/examples/tags/datapoint-1.0.0>
  distance: !unit/quantity-1.1.0 {datatype: float64, unit: !unit/unit-1.0.0 m, value: "foobar"}
  time: !time/time-1.1.0 2023-01-01T00:00:00.000
- !<asdf://asdf-pydantic/examples/tags/datapoint-1.0.0>
  distance: !unit/quantity-1.1.0 {datatype: float64, unit: !unit/unit-1.0.0 m, value: 1.0}
  time: !time/time-1.1.0 2023-01-01T01:00:00.000
...

As you pointed out, and as I've read in the docs, schema-less tags are possible and not discouraged. At least with asdf-pydantic you get some validation even if you decide to go schema-less.

@ketozhang
Author

I welcome any further questions and feedback here until the thread is closed.

For the usual bug reports and suggestions, do consider opening an issue at https://github.com/ketozhang/asdf-pydantic/issues.

@WilliamJamieson
Contributor

Are AsdfPydanticModel fields being validated in ASDF?

@WilliamJamieson, replying to the comment just before this. ASDF is not doing any validation on the AsdfPydanticModel object itself because the tests I have do not add a TagDefinition to the Extension. However, from my tests, ASDF is validating the fields of AsdfPydanticModel.

Try this example where I maliciously changed astropy.Quantity.value to a string in the YAML file.

Yes, you are correct that non-AsdfPydanticModel objects supported by ASDF contained within an AsdfPydanticModel do in fact get validated. I missed the fact that the validation process keeps track of all the nodes in the tree which have been visited for validation. If there were a schema for the tag, validation would continue down the tree, checking nodes against that schema and marking them off. When no schema is provided, the nodes under the tagged node are simply not touched, so validation picks them up at some later point because they have not been marked as validated. At that point, those objects get validated.

I guess I should not attempt to figure these things out while very tired. In any case, asdf-pydantic objects will get fully validated so long as ASDF is willing to trust pydantic to validate the data when the AsdfPydanticModel is initialized.

I wonder if it is possible to modify AsdfPydanticModel to raise the same ValidationError as ASDF does? That way libraries reading ASDF files will only need to worry about catching one error type.
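One possible shape for that idea, sketched with a stand-in error type (AsdfValidationError here is hypothetical, as are the model and the from_yaml_tree helper) and plain pydantic:

```python
from pydantic import BaseModel, ValidationError as PydanticValidationError

class AsdfValidationError(Exception):
    """Hypothetical stand-in for the error type ASDF raises on invalid data."""

class Rectangle(BaseModel):
    width: float
    height: float

def from_yaml_tree(node: dict) -> Rectangle:
    # Translate pydantic's error into the single error family callers expect.
    try:
        return Rectangle(**node)
    except PydanticValidationError as exc:
        raise AsdfValidationError(str(exc)) from exc

try:
    from_yaml_tree({"width": "oops", "height": 1})
except AsdfValidationError:
    print("caught one unified error type")
```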

@WilliamJamieson
Contributor

WilliamJamieson commented Mar 30, 2023

Versioning and backward compatibility

The constraint is that you can only have one version of a Python package installed. If you want backward compatibility, the general rule is that the latest version of your Python package must be a superset of the extension, tag, and schema versions.

My take is that because your Python package must claim to support an extension at some version, to be backward compatible it must also register older versions of the extension in its manifest. You could use various strategies to make AsdfPydanticModel flexible, but it could be a much more stable approach to archive the source code itself in your Python package (e.g., put it in model/archive/shapes_1_0_0.py).

The fallback in asdf is to parse the data as Python native types. That works pretty well for a lot of archival purposes. Another fallback is dynamic generation of models from an ASDF schema, which is possible with pydantic but out of scope for asdf-pydantic right now. However, I am not sure how this is more helpful than the asdf fallback. Again, you're not going to get much use out of your AsdfPydanticModel if you don't have the old version of the class existing elsewhere.

I'm open to ideas. How has the community implemented backward compatibility?

I'll open an issue directly in asdf-pydantic for this, where further discussion can occur; see ketozhang/asdf-pydantic#3.

Honestly, I think the simplest approach to versioning AsdfPydanticModels is to simply require the version number of the model to match the version number of the package. This will properly reflect possible differences in an object, and one could issue a warning if the version of a tag in an existing ASDF file does not match the current version of the model but the data is still compatible with it. If pydantic's validation fails, the user will be able to quickly determine the version of the extension needed and then manually extract and forward-port the data (if possible). In practice, I doubt the data contained in any model will drastically change over time.
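The warn-on-mismatch idea could look roughly like this (PACKAGE_VERSION, the helper name, and the tag-suffix convention are all assumptions for illustration):

```python
import warnings

PACKAGE_VERSION = "1.2.0"  # assumed current version of the model package

def check_tag_version(tag: str) -> str:
    """Warn when a file's tag version differs from the package version, but keep reading."""
    # ASDF tag URIs conventionally end in "-<version>", e.g. ...rectangle-1.0.0
    file_version = tag.rsplit("-", 1)[-1]
    if file_version != PACKAGE_VERSION:
        warnings.warn(
            f"tag version {file_version} does not match package version "
            f"{PACKAGE_VERSION}; attempting a compatible read"
        )
    return file_version

version = check_tag_version("asdf://asdf-pydantic/examples/tags/rectangle-1.0.0")
assert version == "1.0.0"
```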

@asdf-format asdf-format locked and limited conversation to collaborators Dec 14, 2023
@braingram braingram converted this issue into discussion #1706 Dec 14, 2023
