Add Pydantic model for Channels #386

Closed
tcompa opened this issue Jun 5, 2023 · 24 comments · Fixed by #410
Labels
High Priority Current Priorities & Blocking Issues JSON Schemas

Comments

tcompa commented Jun 5, 2023

Branching from #377.

For the moment I could only obtain something useful via pydantic, while dataclasses and plain classes got me nowhere.
A basic example is this script:

from typing import Optional
import json
from pydantic import BaseModel
from pydantic.decorator import ValidatedFunction
from lib_args_schemas import _remove_args_kwargs_properties
from lib_args_schemas import _remove_pydantic_internals

class Channel(BaseModel):
    wavelength_id: str
    label: str
    start: Optional[int]
    end: Optional[int]

def function(channels: list[Channel]):
    pass

schema = ValidatedFunction(function, config=None).model.schema()
schema = _remove_args_kwargs_properties(schema)
schema = _remove_pydantic_internals(schema)
print(json.dumps(schema, indent=2, sort_keys=True))

that prints this schema

{
  "additionalProperties": false,
  "definitions": {
    "Channel": {
      "properties": {
        "end": {
          "title": "End",
          "type": "integer"
        },
        "label": {
          "title": "Label",
          "type": "string"
        },
        "start": {
          "title": "Start",
          "type": "integer"
        },
        "wavelength_id": {
          "title": "Wavelength Id",
          "type": "string"
        }
      },
      "required": [
        "wavelength_id",
        "label"
      ],
      "title": "Channel",
      "type": "object"
    }
  },
  "properties": {
    "channels": {
      "items": {
        "$ref": "#/definitions/Channel"
      },
      "title": "Channels",
      "type": "array"
    }
  },
  "required": [
    "channels"
  ],
  "title": "Function",
  "type": "object"
}

Notes:

  1. We should make sure that this schema can be supported in fractal-web.
  2. This approach has side effects:
    • Pydantic becomes a primary dependency for the tasks (adding to the discussion in #375, JSON Schema version and pydantic v2);
    • We will address attributes via channel.label rather than channel["label"] (not a big deal, but propagating this into the fractal_tasks_core.lib subpackage could be annoying, because suddenly any other package using our lib would have to depend on pydantic at runtime).
  3. We won't be able to fully exploit the flexibility of JSON Schemas (e.g. adding mutually exclusive class attributes, or other more complex definitions), or at least not easily. This does not actually depend on the pydantic option discussed here; it is related to our automatic creation of schemas. There may be a point in the future where we can move (or at least copy) some logical checks from the task Python function into the JSON Schemas of its arguments, but we'll have to decide where/how this is done (by hand? is the relevant info stored somewhere?). A sketch of what such a schema-level check could look like is below.
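
Continuing the script above, a hypothetical hand-written addition (not something our schema generation produces) that expresses a mutually-exclusive-style constraint: require that start and end appear together or not at all:

# Hypothetical hand-written post-processing of the generated schema:
# require that `start` and `end` are either both present or both absent.
schema["definitions"]["Channel"]["oneOf"] = [
    {"required": ["start", "end"]},
    {"not": {"anyOf": [{"required": ["start"]}, {"required": ["end"]}]}},
]
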
@tcompa tcompa added the High Priority Current Priorities & Blocking Issues label Jun 5, 2023

tcompa commented Jun 6, 2023

For the record, and assuming we proceed as described above:

If we add a default value to one of the internal variables (e.g. `wavelength_id: str = "something"`), then this is correctly written into the JSON Schema, meaning that it could be made available in fractal-web.

Note, however, that this will not be used by fractal-server when populating WorkflowTask.args with the defaults from the JSON Schemas, since only the "first-level" defaults are read and assigned to args (ref fractal-analytics-platform/fractal-server#639). This looks like very intuitive behavior to me, but let's make sure we all agree.
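
A minimal sketch of what this looks like (the default value here is hypothetical, for illustration only):

from typing import Optional
import json
from pydantic import BaseModel

class Channel(BaseModel):
    wavelength_id: str = "A01_C01"  # hypothetical default, for illustration
    label: str
    start: Optional[int]
    end: Optional[int]

# In the task-arguments schema this default ends up under
# definitions/Channel, i.e. not at the "first level", so fractal-server
# skips it when populating WorkflowTask.args.
print(json.dumps(Channel.schema()["properties"]["wavelength_id"], indent=2))
# {"title": "Wavelength Id", "default": "A01_C01", "type": "string"}
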

jluethi commented Jun 6, 2023

This approach has side effects:
Pydantic becomes a primary dependency for the tasks (adding to the discussion in #375);

I guess that's a price that should be fine to pay, especially if we go the pydantic.v1 route once reasonable?

We will address attributes via channel.label, rather than channel["label"] (not a big deal, but propagating this into the fractal_tasks_core.lib subpackage could be annoying, because suddenly any other package using our lib will have to depend on pydantic at runtime).

I'd even call this a potential win; channel.label seems like the better way to reference it.

We won't be able to fully exploit the flexibility of JSON Schemas (e.g. adding mutually exclusive class attributes, or other more complex definitions), or at least not easily. This does not actually depend on the Pydantic option discussed here, but it is related to our automatic creation of schemas.

Pity, that would be neat.

There may be a point in the future where we can move (or at least copy) some logical checks from the task Python function to the JSON Schemas of its arguments, but we'll have to decide where/how this operation is made (is it done by hand? is the relevant info stored somewhere?)

That would be great to make the tasks a bit simpler, i.e. having less input validation at the start.

Note, however, that this will not be used by fractal-server when populating WorkflowTask.args with the defaults in the JSON Schemas, since only the "first-level" defaults will be read and assigned to args (ref fractal-analytics-platform/fractal-server#639). This looks a very intuitive behavior to me, but let's make sure we all agree.

Can we briefly discuss this tomorrow? Not sure I fully understand this and I'd be a big fan of the workflow that's downloaded actually containing all the args that were run (though if there's no way to change the defaults on the server-side anymore, the risks here do decrease a bit)

tcompa commented Jun 7, 2023

I'm generally OK with moving in this direction, but this also forces us to make a stricter choice between the two following options:

  1. Whoever wants to use the task as a Python function (outside Fractal) will have to create objects of the Channel type, e.g. as in

from fractal_tasks_core.channels import Channel
from fractal_tasks_core.tasks.create_ome_zarr import create_ome_zarr

my_channels = [Channel(...something...)]
create_ome_zarr(..., my_channels, ...)

  2. Since we are now saying that pydantic may become an actual dependency, we can switch from an optional coerce_and_validate=True/False (currently part of the fractal_tasks_core.tasks._utils.run_fractal_task function) to always coercing & validating task arguments.

Up to now we have kept pydantic on the run_fractal_task side only, but if it enters the main function definition then I no longer see compelling reasons for that.
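
A rough sketch of option 2 with pydantic v1's validate_arguments (the stub task below is made up, just to show the mechanism):

from pydantic import BaseModel, validate_arguments

class Channel(BaseModel):
    wavelength_id: str
    label: str

@validate_arguments
def task_stub(channels: list[Channel]):
    # `channels` arrives as a list of Channel instances even if the caller
    # passed plain dicts; invalid input raises a pydantic ValidationError.
    return channels

task_stub(channels=[{"wavelength_id": "A01_C01", "label": "DAPI"}])
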

tcompa commented Jun 7, 2023

Can we briefly discuss this tomorrow? Not sure I fully understand this and I'd be a big fan of the workflow that's downloaded actually containing all the args that were run (though if there's no way to change the defaults on the server-side anymore, the risks here do decrease a bit)

Sure, let's rediscuss it.

What I'm saying, briefly, is that if there is a default for a specific Channel attribute (e.g. start=0), this is irrelevant for fractal-server: it is not a default value for a function parameter, but a default value for an attribute of the type hint of a function parameter. This makes sense to me: fractal-server will never have to introduce an additional channel, but the user will (e.g. in fractal-web).

tcompa commented Jun 7, 2023

As of the call this morning, we'll proceed with this approach for Channel. Some notes:

  1. The definition will be part of https://github.com/fractal-analytics-platform/fractal-tasks-core/blob/main/fractal_tasks_core/lib_channels.py.
  2. The model will perform some validation (either directly or through a Pydantic validator), and in this way we'll be able to remove some of the preliminary checks from the task functions. For instance, validate_allowed_channel_input won't have to verify that wavelength_id is defined, since it will be a required attribute of the class (see the sketch after this list).
  3. Should the model name simply be Channel?
  4. Descriptions of attributes of internal models (e.g. the description of Channel.label) won't propagate to the JSON Schema, for the moment. We can open another issue for that task.
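
A minimal sketch of point 2 (the label rule is hypothetical, just to show the mechanism):

from pydantic import BaseModel, ValidationError, validator

class Channel(BaseModel):
    wavelength_id: str
    label: str

    @validator("label")
    def label_not_empty(cls, value):
        # Hypothetical check that moves out of the task function into the model.
        if not value:
            raise ValueError("label must not be empty")
        return value

try:
    Channel(label="DAPI")  # missing wavelength_id
except ValidationError as e:
    print(e)  # "wavelength_id: field required" -- no manual check needed
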

@jluethi feel free to mention other cases where this approach may be relevant, apart from Channel.

@tcompa tcompa changed the title from "JSON Schemas for complex arguments" to "Introduce Pydantic model for Channels" Jun 7, 2023

jluethi commented Jun 7, 2023

Should the model name simply be Channel?

Given that the channel contains metadata that we put in the omero metadata and should be quite general, I'm happy with going with Channel here.

Description of attributes of internal models (e.g. the description of Channel.label) won't propagate to the JSON Schema, for the moment. We can open another issue for that task.

Ok, those would be useful to have as well eventually, let's track that separately.

@jluethi feel free to mention other cases where this approach may be relevant, apart from Channel.

I will keep an eye on this when reviewing the core tasks' default values. One example that comes to mind: could we now constrain model_type in the cellpose task to models.MODEL_NAMES, i.e. a defined list of valid models? (Maybe even something where we can describe the valid options for a potential drop-down selection, but that's a new feature wish.)
Using models.MODEL_NAMES was not possible with the pydantic approach we were using before; to be tested whether it works now.
I'll see if I find other examples.

tcompa commented Jun 8, 2023

For the record, here is the schema for schema['properties']['omero']['properties']['channels']['items'] from ome-ngff 0.4:

{
  "type": "object",
  "properties": {
    "window": {
      "type": "object",
      "properties": {
        "end": { "type": "number" },
        "max": { "type": "number" },
        "min": { "type": "number" },
        "start": { "type": "number" }
      },
      "required": ["start", "min", "end", "max"]
    },
    "label": { "type": "string" },
    "family": { "type": "string" },
    "color": { "type": "string" },
    "active": { "type": "boolean" }
  },
  "required": ["window", "color"]
}

to which we will certainly add some optional first-level properties like:

  • wavelength_id: str
  • coefficient: int = 1
  • inverted: bool = False

There is a (minor) decision to take, concerning the window attribute.
The correct way to go is that we have a nested model, as in

from typing import Optional
from pydantic import BaseModel

class ChannelWindow(BaseModel):
    # numeric, as in the ome-ngff schema above
    min: float
    max: float
    start: float
    end: float

class Channel(BaseModel):
    wavelength_id: str
    window: ChannelWindow
    label: Optional[str]
    ...

which brings us closer to the ome-ngff schema for omero channels.

This choice, however, will break our current examples, because we introduced some additional first-level attributes like start and end that we later assigned to window["start"] and window["end"].
That is, what we currently have as

{
...
"start": 0,
...
}

will need to become something like

{
"window": {"start": 0, ...}
}

I'm in favor of this change, to move our model closer to the specs, but it's not strictly needed; I can avoid introducing it if we find the breaking change too annoying (@jluethi).
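
For concreteness, the breaking change amounts to a transformation like this (migrate_channel_dict is a hypothetical helper, not part of the PR):

def migrate_channel_dict(old: dict) -> dict:
    # Move flat `start`/`end` keys into a nested `window` dict.
    new = {key: value for key, value in old.items() if key not in ("start", "end")}
    window = {key: old[key] for key in ("start", "end") if key in old}
    if window:
        new["window"] = window
    return new

assert migrate_channel_dict({"wavelength_id": "A01_C01", "start": 0, "end": 800}) == {
    "wavelength_id": "A01_C01",
    "window": {"start": 0, "end": 800},
}
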

tcompa commented Jun 8, 2023

Some additional context:

What we are currently doing goes in the same direction as https://github.com/JaneliaSciComp/pydantic-ome-ngff, which in my view is very relevant for long-term integration with ome-ngff through formal validation against known schemas (or Pydantic models).

In the current (very preliminary!) status of pydantic-ome-ngff:

omero and bioformats2raw are not currently supported, but contributions adding support for those models would be welcome.

That said, I think it'll be worth mentioning this example on that repo, at some point, to show our interest and possibly provide additional motivation. To be clear, I don't think we should depend on that package any time soon, but if it is transformed into an "official" encoding of the specs (we are not there yet), then I'd be quite interested in using it.

tcompa commented Jun 8, 2023

Some more context:

This code block from https://github.com/czbiohub/iohub/blob/main/iohub/ngff_meta.py#L231 speaks for itself:

class WindowDict(TypedDict):
    """Dictionary of contrast limit settings"""

    start: float
    end: float
    min: float
    max: float


class ChannelMeta(MetaBase):
    """Channel display settings without clear documentation from the NGFF spec.
    https://docs.openmicroscopy.org/omero/5.6.1/developers/Web/WebGateway.html#imgdata
    """

    active: bool = False
    coefficient: float = 1.0
    color: ColorType = "FFFFFF"
    family: str = "linear"
    inverted: bool = False
    label: str = None
    window: WindowDict = None

jluethi commented Jun 8, 2023

wavelength_id: str
coefficient: int = 1
inverted: bool = False

I'm aware of wavelength_id. We don't really do anything with coefficient (don't even know what this is) and inverted, right?

I'm in favor of this change, to move our model closer to the specs, but it's not strictly needed - thus I can avoid introducing it if we are annoyed by the breaking change (@jluethi).

Now's the time for such breaking changes :) I'm also in favor of moving close to their model.

We can't fully adopt their spec for our input though: we will always calculate the correct min & max, never ask the user (they are based on whether the image is uint8 or uint16). Thus, we will have a slightly different input spec than their definition for what we request as inputs. I'm also not foreseeing asking the user for coefficient (whatever that actually does).

Other than that, big fan of moving closer to them and, in the far future, even depending on a package like that!

tcompa commented Jun 8, 2023

I'm aware of wavelength_id. We don't really do anything with coefficient (don't even know what this is) and inverted, right?

We have always been using these two default values (coefficient=1 and inverted=False) since the major channel refactor. They were also present in the first examples we discussed, and they are part of the transitional OME-NGFF omero metadata.

I can take them away, if there is any reason, or leave them as they are.

https://docs.openmicroscopy.org/omero/5.6.1/developers/Web/WebGateway.html#imgdata

tcompa commented Jun 8, 2023

We can't fully adopt their spec for our input though: We will always calculate the correct min & max, never ask the user

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

tcompa commented Jun 8, 2023

We can't fully adopt their spec for our input though: We will always calculate the correct min & max, never ask the user

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

This will also be the case for colormap, which we are assigning as part of our logic (and not requiring from the user).
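
As a sketch of that min & max logic (assuming they simply span the integer dtype range; the actual implementation may differ):

import numpy as np

def window_bounds(dtype) -> dict:
    # min/max span the full range of the integer image dtype.
    info = np.iinfo(dtype)
    return {"min": int(info.min), "max": int(info.max)}

print(window_bounds(np.uint8))   # {'min': 0, 'max': 255}
print(window_bounds(np.uint16))  # {'min': 0, 'max': 65535}
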

jluethi commented Jun 8, 2023

I can take them away, if there is any reason, or leave them as they are.

I'd leave them hard-coded until we figure out if we want to do something with them

Yes, indeed. We will have to set them as optional, and then assign them inside our own logic. But that's a reasonable trade-off, IMO.

Agreed, that's a good trade-off for these things! Same for colormap!

tcompa commented Jun 9, 2023

Worth mentioning:

As of 153fb06, we have a test for define_omero_channels (the function that goes from a list of our own custom Channel objects to a standard JSON object to be written into the zarr metadata), which directly validates the output against the OME-NGFF JSON Schema.

This makes it clear, once again, that introducing a deeper integration of the OME-NGFF schemas will be a very interesting idea for the future.
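
Roughly, such a validation looks like the following sketch (using the jsonschema package; the actual test may fetch or pin the schema differently, and the sample channel is made up):

import json
from urllib.request import urlopen

from jsonschema import validate

# Extract the `omero` sub-schema from the OME-NGFF 0.4 image schema.
url = "https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/image.schema"
with urlopen(url) as f:
    image_schema = json.load(f)
omero_schema = image_schema["properties"]["omero"]

sample_omero = {
    "channels": [
        {
            "color": "00FFFF",
            "window": {"min": 0, "max": 65535, "start": 110, "end": 800},
        }
    ]
}
validate(instance=sample_omero, schema=omero_schema)  # raises on non-compliance
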

jluethi commented Jun 9, 2023

Very nice!

tcompa commented Jun 9, 2023

Quick question: the channel.window attribute is required in ome-ngff.

Should we fail when we try to process an existing zarr file that does not comply (i.e. that doesn't have the window attribute)?
Or should we also make window an optional attribute?

(I guess the latter, or we will be a bit more strict in what we can process...)

jluethi commented Jun 9, 2023

As far as input to fractal tasks is concerned, the window should be something users can optionally specify. They may specify a start & end (=> affects display of the image).
They should likely never change the min & max (based on uint8 & uint16 etc.)

I thought the omero metadata supports not having the start & end specified. If they are not specified, the napari viewer just handles that as far as I know.

Thus, we only optionally ask the user for start & end. We always provide a window, but so far this window always included min & max, while start & end were optional.

jluethi commented Jun 9, 2023

Also, the omero metadata is labeled as transitional, see https://ngff.openmicroscopy.org/latest/index.html#omero-md

Thus, I would not expect that all OME-Zarrs we ever process will have omero metadata, so I wouldn't be too strict here.

tcompa commented Jun 9, 2023

As far as input to a fractal tasks are concerned, the window should be something users can optionally specify. They may specify a start & end (=> affects display of the image).

What about an existing zarr? Does it have to have window specified?

EDIT: the answer was in the second comment...

tcompa commented Jun 9, 2023

Thus, we only optionally ask start & end from user. We always provide a window but so far, this window always included min & max, while start & end were optional

Agreed. This is the current behavior (in the PR) for when we create a zarr ourselves.

tcompa commented Jun 12, 2023

Also, the omero metadata is labeled as transitional, see https://ngff.openmicroscopy.org/latest/index.html#omero-md
Thus, I would not expect that all OME-Zarrs we ever process will have an omero metadata. I wouldn't be too strict here thus.

I'm now merging #410, but another caveat is in order (we can continue in a different issue if it requires discussion) about how to handle existing OME-Zarrs.

Right now, in #410, I used the new Channel model in multiple places:

  1. Obviously for the arguments of create_ome_zarr and create_ome_zarr_multiplex
  2. In many other tasks (yokogawa-to-zarr, illumination correction, napari workflows, cellpose) which need to read channels from an existing OME-Zarr.

For case 1, it's OK to be as strict as possible (that is, a user cannot include arbitrary metadata in a newly created OME-Zarr), but this could be an issue for case 2. What should we do if we receive an OME-Zarr created outside Fractal, that includes some non-standard metadata?

With the current version of #410, all these metadata would be discarded upon reading from disk. This is not a severe issue at the moment, because we only read the channel metadata (in case 2) and never write it back to disk. It would become an issue if we had a task that also writes channel metadata back to disk, because in that case we would be losing some of the original non-standard metadata.
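
One possible mitigation (not what #410 does; shown only as a pydantic-v1 sketch) would be to let the model carry unknown keys through, via extra = "allow":

from pydantic import BaseModel

class Channel(BaseModel):
    wavelength_id: str
    label: str

    class Config:
        extra = "allow"  # keep unknown channel metadata instead of dropping it

channel = Channel.parse_obj(
    {"wavelength_id": "A01_C01", "label": "DAPI", "custom_key": 42}
)
print(channel.dict())  # custom_key survives a read/write round trip
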

jluethi commented Jun 12, 2023

Good points.

There are 2 areas where this may become relevant:

  1. In the context of taking arbitrary external OME-Zarr files as input => how do we parse their metadata correctly? Do our assumptions work in this broader context? Extract attributes from ome-zarr rather than from metadata (whenever possible) #351
  2. In the context of using the MD converter task => does that still work? I think it should. See https://github.com/jluethi/faim-hcs/tree/fractal-roi-tables

@tcompa tcompa changed the title from "Introduce Pydantic model for Channels" to "Add Pydantic model for Channels" Jun 13, 2023

tcompa commented Jun 13, 2023

For the record, what was described in #386 (comment) is now tested as part of fractal-demos, with create-ome-zarr arguments that look like

{
    "allowed_channels": [
        {
            "color": "00FFFF",                 // renamed from `colormap` to `color`, as per specs
            "wavelength_id": "A01_C01",
            "label": "DAPI",
            "window":{                         // this nesting of `window` is new
                "start": 110,
                "end": 800
            }
        },
        {
            "color": "FF00FF",
            "wavelength_id": "A01_C02",
            "label": "nanog",
            "window": {
                "start": 110,
                "end": 290
            }
        },
        {
            "color": "FFFF00",
            "wavelength_id": "A02_C03",
            "label": "Lamin B1",
            "window": {
                "start": 110,
                "end": 1600
            }
        }
    ],
    "coarsening_xy": 2,
    "num_levels": 5,
    "image_extension": "png"
}
