
Conversation

@jjnesbitt
Member

This adds support for setting Config dynamically (for vendorization, for example), without needing to rely on module import order.

@yarikoptic I believe this removes the need for the clear_dandischema_modules_and_set_env_vars conftest function, as config no longer relies on module import, but I'm still looking into that, which is why this is draft.
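For context, the import-order-free pattern being described can be sketched roughly as follows. This is a hypothetical simplification, not the actual dandischema.conf code; the single `id_pattern` field and the `DANDISCHEMA_ID_PATTERN` env var name are illustrative assumptions:

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Config:
    # Hypothetical field; falls back to an env var, then a built-in default
    id_pattern: str = field(
        default_factory=lambda: os.environ.get("DANDISCHEMA_ID_PATTERN", "DANDI")
    )


_instance_config: Optional[Config] = None


def set_instance_config(config: Config) -> None:
    """May be called before *or* after importing the models module."""
    global _instance_config
    _instance_config = config


def get_instance_config() -> Config:
    """Resolve the config lazily, at validation time, not at import time."""
    global _instance_config
    if _instance_config is None:
        _instance_config = Config()  # falls back to env vars / defaults
    return _instance_config
```

Because the models would call `get_instance_config()` inside validators rather than reading a module-level constant, the import order of `dandischema.conf` and `dandischema.models` stops mattering.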

@codecov

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 96.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.18%. Comparing base (b78da9e) to head (fbf75b1).

Files with missing lines          Patch %   Lines
dandischema/models.py             93.75%    4 Missing ⚠️
dandischema/conf.py               96.42%    1 Missing ⚠️
dandischema/tests/conftest.py     95.65%    1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (b78da9e) and HEAD (fbf75b1). Click for more details.

HEAD has 6 uploads less than BASE
Flag        BASE (b78da9e)   HEAD (fbf75b1)
unittests   54               48
Additional details and impacted files
@@               Coverage Diff                @@
##           devendorize     #316       +/-   ##
================================================
- Coverage        97.86%   87.18%   -10.69%     
================================================
  Files               18       18               
  Lines             2249     2263       +14     
================================================
- Hits              2201     1973      -228     
- Misses              48      290      +242     
Flag        Coverage Δ
unittests   87.18% <96.00%> (-10.69%) ⬇️


Member

@candleindark candleindark left a comment

At this point, I haven't checked all the details for correctness, but I can see that with this approach, the calling-order restriction of dandischema.conf.set_instance_config() in relation to the import of dandischema.models has indeed been lifted. However, to achieve that, the validations affected by the value of the instance config in dandischema.conf have been moved into custom Pydantic validators, i.e., the field_validators. The validation behaviors of those validators exist only at Python runtime and are not encoded in the corresponding JSON Schema schemas of the models.

I think, in general, we want to move validation out of custom Pydantic validators into validation that can be encoded into JSON Schema schemas. @yarikoptic, your input?

doi: str = Field(
title="DOI",
json_schema_extra={"readOnly": True, "nskey": DANDI_NSKEY},
default="",
Member

We require a doi value when the DANDI DOI pattern is not available. Defaulting to `""` will fail the check_id method below, but the feedback the user gets will be different.

Member Author

So what should default be set to here? Prior to this, it seemed that either pattern was set to DANDI_DOI_PATTERN, or it was set to the default pattern and default was set to "". Should default be set dynamically to match that?

Member

I am actually talking about the default value of the doi field. Without this proposed change, the doi field doesn't have a default when DANDI_DOI_PATTERN is None which means that the doi field is required in that situation. With this proposed change, the doi field always has a default which means the doi field is never required.
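The distinction being made here can be reproduced with a minimal pydantic sketch (illustrative model names, not the real Dandiset classes):

```python
from pydantic import BaseModel, ValidationError


class WithDefault(BaseModel):
    doi: str = ""  # always has a default, so the field is never required


class WithoutDefault(BaseModel):
    doi: str  # no default, so the field is required


# The defaulted model accepts an empty construction...
print(WithDefault().doi)  # falls back to ""

# ...while the default-less model rejects it as missing a required field.
try:
    WithoutDefault()
except ValidationError as e:
    print("doi" in str(e))
```

So always supplying `default=""` silently changes `doi` from "required when no DANDI DOI pattern is configured" to "never required".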

Member Author

Ah I see, that's an oversight on my part.

@jjnesbitt
Member Author

jjnesbitt commented Jul 31, 2025

The validation behaviors of those validators exist only in Python runtime and are not encoded in the corresponding JSON Schema schemas of the models.

Good point. This could be addressed using field_serializer. I can take a stab at that.

I think, in general, we want to move validation out of custom Pydantic validators into validation that can be encoded into JSON

I don't think they're necessarily opposed to one another. The custom validators I added are simply using regex, which can be re-encoded into strings once the model is serialized.
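As a sketch of that re-encoding (a hypothetical standalone model, with the config lookup replaced by a module-level constant), the regex a validator enforces at runtime can be written back into the generated schema via a model-level __get_pydantic_json_schema__ hook, similar to the one dandischema already overrides:

```python
import re

from pydantic import BaseModel, field_validator


ID_PATTERN = r"^(DANDI|dandi):\d{6}$"  # stand-in for a config-derived pattern


class Sketch(BaseModel):
    id: str

    @field_validator("id")
    @classmethod
    def check_id(cls, value: str) -> str:
        # Runtime half: enforce the regex during validation
        if re.match(ID_PATTERN, value) is None:
            raise ValueError(f"ID does not match pattern {ID_PATTERN}")
        return value

    @classmethod
    def __get_pydantic_json_schema__(cls, core_schema, handler):
        # Schema half: re-encode the same regex as a plain "pattern" string
        json_schema = handler(core_schema)
        json_schema = handler.resolve_ref_schema(json_schema)
        json_schema["properties"]["id"]["pattern"] = ID_PATTERN
        return json_schema
```

In the real models the constant would instead be computed from get_instance_config() at schema-generation time, which is what keeps the schema in sync with the runtime validator.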

@candleindark
Member

candleindark commented Jul 31, 2025

The validation behaviors of those validators exist only in Python runtime and are not encoded in the corresponding JSON Schema schemas of the models.

Good point. This could be addressed using field_serializer. I can take a stab at that.

You are right. We can do that, but it will make the eventual transition to LinkML more complex, and from then on we would have to manage the serialization of the elements involved ourselves instead of letting Pydantic do it.

I think, in general, we want to move validation out of custom Pydantic validators into validation that can be encoded into JSON

I don't think they're necessarily opposed to one another. The custom validators I added are simply using regex, which can be re-encoded into strings once the model is serialized.

I like the idea of separating the model from the code, so I am hesitant to move more of the model specs into the code via those custom validators and their respective serializers.

I want to know why we want these customizations in the first place. I see that these changes allow dandischema.conf.set_instance_config() to be called after, as well as before, the import of dandischema.models. Do we need this ability, though? Without these changes, dandischema.models can be initialized correctly as long as the env vars corresponding to the fields of dandischema.conf.Config are set when dandi-archive is launched, and dandischema.models doesn't need to change for the duration of a dandi-archive run, i.e., dandischema.conf.set_instance_config() doesn't need to be called in dandi-archive at all. As for dandi-cli, per a recent discussion with @yarikoptic, we no longer need an instance-specific dandischema.models. Thus, at this point, dandischema.conf.set_instance_config() doesn't need to be called within dandi-cli either.

@yarikoptic
Member

re

I think, in general, we want to move validation out of custom Pydantic validators into validation that can be encoded into JSON Schema schemas. @yarikoptic, your input?

and related

that will make the eventual transitioning to LinkML more complex, and from now on we have to manage the serialization of those involved elements instead of being done by Pydantic.
...
I like the idea of separation of model from code, so I am hesitant in moving more of the model specs to the code by using those custom validators and respective serializers.

Although I am with you on the ultimate desires/design, an immediate target is to provide support for multiple instances with the current setup of pydantic + jsonschema, with a minimal amount of "user visible changes" (i.e. not changing much, if anything, in the "DANDI vendorized schema"). So I would say -- we can go ahead and move to a few more Python validations and serializers for now. It would also help to identify such points better for when we re-approach expressing this in LinkML again.

@yarikoptic
Member

re

As for dandi-cli, per recent discussion with @yarikoptic, we no longer need an instance specific dandischema.models

We should not need it, but did we check, after relaxing all the regexes, that the client works fine?

@candleindark candleindark force-pushed the devendorize branch 2 times, most recently from 261f0f4 to c5f6327 on August 4, 2025 03:37
@yarikoptic
Member

We merged other developments into devendorize (which should also be passing tests now) and now conflicts have come up. @jjnesbitt - would you prefer to update this branch yourself, or would you be OK with @candleindark attempting that? If OK, would you prefer a rebase or a merge?

@jjnesbitt
Member Author

We merged other developments into devendorize (which should also be passing tests now) and now conflicts have come up. @jjnesbitt - would you prefer to update this branch yourself, or would you be OK with @candleindark attempting that? If OK, would you prefer a rebase or a merge?

I'm okay handling conflicts, but are we going forward with this branch? It seemed that @candleindark had major objections.

@candleindark
Member

candleindark commented Aug 8, 2025

We merged other developments into devendorize (which should also be passing tests now) and now conflicts have come up. @jjnesbitt - would you prefer to update this branch yourself, or would you be OK with @candleindark attempting that? If OK, would you prefer a rebase or a merge?

I'm okay handling conflicts, but are we going forward with this branch? It seemed that @candleindark had major objections.

Yes. We are going forward with this. Thanks for bringing up the idea and the PR. After some consideration, we think it's best to avoid the "brittle" setup that depends on the import order.

You can rebase this PR and make all the tests pass, or I can send a PR to this PR, whichever works better for you. Let me know which way you prefer.

Comment on lines +1628 to +1643
    @staticmethod
    def get_id_pattern():
        conf = get_instance_config()
        sub_pattern = conf.id_pattern + "|" + conf.id_pattern.lower()
        pattern = rf"^({sub_pattern}):\d{{6}}(/(draft|\d+\.\d+\.\d+))$"

        return pattern

    @field_validator("id")
    @classmethod
    def check_id(cls, value: str) -> str:
        pattern = cls.get_id_pattern()
        if re.match(pattern, value) is None:
            raise ValueError(f"ID does not match pattern {pattern}")

        return value
Member Author

What I really wanted to do was the following:

    @field_pattern("id")
    @staticmethod
    def get_id_pattern():
        conf = get_instance_config()
        sub_pattern = conf.id_pattern + "|" + conf.id_pattern.lower()
        pattern = rf"^({sub_pattern}):\d{{6}}(/(draft|\d+\.\d+\.\d+))$"

        return pattern

Where field_pattern would be a new decorator that does two things:

  1. Creates a field_validator for the specified fields ("id"), that just runs re.match with the pattern returned from the decorated function get_id_pattern on the supplied fields ("id").
  2. Tags the function/class in a way that allows for pattern to be injected into the rendered schema automatically

I tried to implement this but couldn't find a way to do so. Perhaps in the future this can be updated.
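While not the decorator envisioned above, a partial approximation of the same idea is possible with pydantic's callable form of json_schema_extra, which is evaluated at schema-generation time; a single pattern-producing function can then feed both halves. The names here are illustrative, and the config lookup is replaced by a static pattern:

```python
import re

from pydantic import BaseModel, Field, field_validator


def get_id_pattern() -> str:
    # Stand-in for the config-driven pattern function discussed above
    return r"^(DANDI|dandi):\d{6}$"


class Sketch(BaseModel):
    # Half 2: inject the current pattern into the rendered schema;
    # the lambda runs each time the JSON schema is generated
    id: str = Field(
        json_schema_extra=lambda schema: schema.update(pattern=get_id_pattern())
    )

    # Half 1: enforce the same pattern at validation time
    @field_validator("id")
    @classmethod
    def check_id(cls, value: str) -> str:
        pattern = get_id_pattern()
        if re.match(pattern, value) is None:
            raise ValueError(f"ID does not match pattern {pattern}")
        return value
```

This still requires wiring the two halves by hand per field, rather than the single `field_pattern` decorator the comment describes.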

@jjnesbitt
Member Author

@candleindark I unfortunately don't have time to get this PR to 100%. However, it is almost there (90%). As far as I can tell, the big remaining item is this comment you left regarding the doi default value. I believe a default could be supplied in the check_doi method, but the issue is that other parts of the testing code make use of a statically defined instance config. To properly fix how DOI is handled, I think those constants (and anywhere they're used) need to be updated to use a dynamic instance config.

Do you think you can take this on?

@candleindark
Member

Do you think you can take this on?

Sure. I will take it from this point on. Thanks for helping out.

@yarikoptic
Member

ok, so it seems just the vendorized CI runs (where we pretend to be running on a specific vendorized instance) are failing, e.g. in these tests:

FAILED dandischema/tests/test_metadata.py::test_requirements[obj1-PublishedDandiset-missingfields1]
FAILED dandischema/tests/test_metadata.py::test_requirements[obj2-PublishedDandiset-missingfields2]
FAILED dandischema/tests/test_metadata.py::test_requirements[obj3-PublishedDandiset-missingfields3]
FAILED dandischema/tests/test_models.py::test_dandimeta_1 - assert 5 == 6
FAILED dandischema/tests/test_models.py::test_vendorization[config_dict0-[A-Z][-A-Z]*-10\\.\\d{4,}-valid_vendored_fields0-invalid_vendored_fields0]
FAILED dandischema/tests/test_models.py::test_vendorization[config_dict2-DANDI-10\\.\\d{4,}-valid_vendored_fields2-invalid_vendored_fields2]
FAILED dandischema/tests/test_models.py::test_vendorization[config_dict3-[A-Z][-A-Z]*-10\\.\\d{4,}-valid_vendored_fields3-invalid_vendored_fields3]

@jjnesbitt you didn't try to run the tests while having specified an instance "outside" of the environment, right?

  • The support of the use of the `|` operator to express optional types was introduced in Python 3.10. The lowest supported Python in this project is currently 3.9. Let's delay the use of `|` to express optional types until after dropping Python 3.9.
  • So that it behaves the same as `dandischema.models.DANDI_INSTANCE_URL_PATTERN`, which it is replacing. Incidentally, this commit also renames the local variable `pattern` to `instance_url` to reflect the nature of the assigned value.
  • Rename property `published_version_pattern` to `published_version_url_pattern`. The new name is more consistent with the `PUBLISHED_VERSION_URL_PATTERN` constant that existed in `dandischema.models`, which the property is replacing.
  • Realign the definition of `Config.dandi_doi_pattern` to `dandischema.models.DANDI_DOI_PATTERN`, which it is replacing.
  • Remove the special handling of importing `dandischema.models` before `set_instance_config()` is called. The whole point of the containing PR is to remove the reliance on import order, so such handling will no longer be needed.
[(license_.name, license_.value) for license_ in _INSTANCE_CONFIG.licenses],
[
    (license_.name, license_.value)
    for license_ in get_instance_config().licenses
]
Member

@candleindark candleindark Aug 19, 2025

This line prevents dandischema from becoming truly dynamic, which is the purpose of this PR, and there is no way around it as long as LicenseType is defined as an Enum at the module level. Once this definition of LicenseType has executed, changes in the value returned by get_instance_config() do not alter the value of LicenseType.

A way to make LicenseType dynamic, as suggested by ChatGPT, is to define it as a custom type with hooks for Pydantic to validate and generate JSON schema, such as the following.

# types.py (or nearby)
from pydantic import GetJsonSchemaHandler
from pydantic.json_schema import JsonSchemaValue
from pydantic_core import core_schema, PydanticCustomError
from dandischema.conf import get_instance_config

def _current_license_values() -> list[str]:
    # your config objects already have .value equal to "scheme:id"
    return [lic.value for lic in get_instance_config().licenses]

class DynamicLicense(str):
    @classmethod
    def __get_pydantic_core_schema__(cls, _source, _handler):
        def validate(v):
            s = str(v)
            allowed = _current_license_values()
            if s not in allowed:
                raise PydanticCustomError(
                    'license_value',
                    'Invalid license: {val}. Allowed: {allowed}',
                    {'val': s, 'allowed': allowed},
                )
            return s
        return core_schema.no_info_after_validator_function(
            validate, core_schema.str_schema()
        )

    @classmethod
    def __get_pydantic_json_schema__(cls, core_schema, handler: GetJsonSchemaHandler) -> JsonSchemaValue:
        schema = handler(core_schema)
        schema['enum'] = _current_license_values()
        schema['title'] = 'License'
        return schema

Though I have yet to test this approach fully, it looks viable to me, but very messy and opaque. However, the more crucial question for me is whether we should redefine LicenseType as a custom type in order to achieve the goal of this PR, or whether we should just accept the current state of #294, which has a restriction on import order. If we take this approach, and other enum classes need to be made dynamic in the future, they will have to be redefined in the same way as well. Should we take this approach, @yarikoptic?

Member

Couldn't this alternative work for immediate needs/use-cases:

  • instead of directly assigning a list here, we come up with a function like `def assign_enums(instance_config)`, which we call here after getting the instance_config.
  • inside that function we set all desired enums, like this one, to config-based values
  • in `set_instance_config` we add a flag `assign_enums: bool = False`, and if `assign_enums` is set: import dandischema.models and call `assign_enums` with that new config

This way we

  • would not have circular import
  • would be able to adjust all those enums (if more than just LicenseType would need to be set).

WDYT @jjnesbitt about this situation and how to solve it?

Overall, @candleindark and I feel that the added complexity outweighs the benefit we might get from "dynamic configuration" at the moment, and we would prefer to go with the much simpler original solution of doing the instance setup once at import time (which is what is in the devendorize branch) for now, to avoid all possible gotchas due to the added complexity in the config life cycle. WDYT?

Member

The other point I want to bring up, after realizing it while handling this LicenseType issue, is that in making models.py fully dynamic, we have made the use of set_instance_config() a potential pitfall. One now has to be extremely careful with it: for set_instance_config() to take effect, the instance config must always be read from a function that is executed each time a model entity is evaluated, and any failure to do so will lead to a models.py that is only partially dynamic and holds definitions inconsistent with the instance config. Case in point: the error in the current definition of LicenseType was overlooked.

Member Author

The way I can think of to solve this problem within the framework of this PR is to change the type of license to a str and define the following field validator for license:

    @field_validator("license")
    @classmethod
    def check_license(cls, value: str) -> str:
        license_values = [x.value for x in get_instance_config().licenses]
        if value not in license_values:
            raise ValueError(f"License {value} not valid")

        return value

and then add this in DandiBaseModel.__get_pydantic_json_schema__:

            if prop == "License":
                value["items"]["enum"] = [
                    x.value for x in get_instance_config().licenses
                ]

This solution is messy and has its issues, so if you'd rather go with a static import approach, you're free to do so. However, I'll just point out that there's already loads of arbitrary logic in the __get_pydantic_json_schema__ method, as well as many other places.

The larger issue is that you're trying to take a 2000-line file filled with pydantic type definitions and make it configurable with minimal changes. As far as I can tell, this is not something that's really done in pydantic. You'd be better off creating a function that returns the properly configured classes at runtime, but that has its own issues.

🤷

Member

Wouldn't it also affect the meditor, so we lose the drop-down and require users to enter that string?

Indeed, the task we are trying to do has a big "footprint", but IMHO it is quite easy to achieve with import-time customization. After we achieve the desired effects of allowing multiple instances and making the backend and frontend configurable with just a few new settings, we will look into overhauling this setup, likely with a switch to LinkML as the source of the model. Potentially, again, just keeping it a singular version customized at import time.

Member

I think that solution wouldn't affect the meditor, since the JSON schema for the field is customized in DandiBaseModel.__get_pydantic_json_schema__. However, it is less desirable than the automated JSON schema generation we would get if we defined LicenseType as an Enum.

@yarikoptic
Member

ok, per my comment above, we will postpone the attempt to make it work "properly" via dynamic configuration and stick to import-time configuration. I will close this for now, but it should be kept in mind whenever we reapproach this.

@yarikoptic yarikoptic closed this Aug 22, 2025