Skip to content

fix: Fix how supervisor validation works for cascading reindexing templates in compaction supervisors#19223

Merged
capistrant merged 5 commits intoapache:masterfrom
capistrant:fix-reindexing-bug
Apr 16, 2026
Merged

fix: Fix how supervisor validation works for cascading reindexing templates in compaction supervisors#19223
capistrant merged 5 commits intoapache:masterfrom
capistrant:fix-reindexing-bug

Conversation

@capistrant
Copy link
Copy Markdown
Contributor

@capistrant capistrant commented Mar 27, 2026

Description

Spec validation during spec POST

Added validateSpec to SupervisorSpec. It has a noop default implementation, which follows a similar validation pattern that exists with validateSpecUpdateTo in SupervisorSpec. We call this when the spec is posted. This gives us the early feedback to client if they try to submit an invalid spec.

regular spec validation

In earlier iterations of this validation refactoring, compaction supervisor specs were validated on deserialization (constructor validation). This status could become stale since the spec validity is tied to cluster dynamic config in addition to the spec for compaction. Now we are constantly validating the status of the spec, so if a change to cluster dynamic config change makes a spec invalid, it will be quickly and clearly reflected in the status of the supervisor that is now invalid. in the past the spec would need to be deserialized again for it to reflect a now invalid status.

Cascading reindexing validation changes

The cascading reindexing template doesn't fit well into the existing supervisor validation that exists for the other compaction supervisor types. Ended up refactoring validation of a spec into the compaction config interface itself. the older config types delegate to the same validation that has always been called for them. the cascading config leverages some of the same underlying code but only does a subset of validations due to the more dynamic nature of config creation.

Release note

fixing an issue in a never before released feature so no fix note needed.

We could mention a more correct validation of compaction supervisor specs who are utilizing the default compaction engine configured for the cluster. If the default engine changes making a spec invalid, we will now catch it before creating tasks with the invalid config.


Key changed/added classes in this PR
  • DataSourceCompactionConfig
  • CatalogDataSourceCompactionConfig
  • InlineSchemaDataSourceCompactionConfig
  • CascadingReindexingTemplate
  • SupervisorResource
  • SupervisorSpec

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@capistrant capistrant added the Bug label Mar 27, 2026
@capistrant capistrant added this to the 37.0.0 milestone Mar 27, 2026
{
if (compactionConfig == null) {
return CompactionConfigValidationResult.failure("Cannot be null");
} else if (compactionConfig instanceof CascadingReindexingTemplate) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better to avoid the instanceof by putting the config in control of its own validation, such as by adding an interface method on DataSourceCompactionConfig like:

CompactionConfigValidationResult validate(ClusterCompactionConfig clusterConfig)

Btw, I know this code was pre-existing, but the existing path uses getLatestClusterConfig() to validate. Isn't that strange? It means that whether a CompactionSupervisorSpec is readable or not will depend on the current cluster configuration. If the cluster config changes then a currently-running supervisor could stop having a valid spec. Also, validation happens only on construction of the spec when it's deserialized, even if the cluster configuration later changes in a way that makes a previously-valid spec now-invalid.

Wouldn't it be better to validate when we run the spec in resetCompactionJobQueue, rather than on spec deserialization?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to validate when we run the spec in resetCompactionJobQueue, rather than on spec deserialization?

hmm this is a good point. I guess one downside is delayed feedback after spec submission 🤔 . And for folks who specify an engine in spec they actually don't depend on the cluster config so kind of sad to wait to get feedback. but probably even worse is to have a spec that was good quietly become invalid if the cluster config changes. so I suppose I've talked myself into moving it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe we just do it in both spots 😅

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worries me to have validation dependent on dynamic configuration in the constructor of CompactionSupervisorSpec. What happens if dynamic configuration changes in such a way that a previously-valid spec is now invalid? Will it now fail to show up in the supervisor list API? It will be tough for the user to fix it if they can't see it.

I believe the ideal approach would be to validate on submission (in SupervisorResource#specPost) and at runtime (in resetCompactionJobQueue) but not on construction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some local testing to get clarity on behavior -

It will not fail to show up. if a spec goes from valid to invalid because of the dynamic config, the next time it is deserialized it will flip to INVALID_SPEC state but still shows up. I'm still taking this suggestion under advisement though. even if it isn't awful UX currently, I think you're on to something in suggesting we rethink the location of validtion

Copy link
Copy Markdown
Contributor Author

@capistrant capistrant Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm ok I tinkered with things to get the validation out of the constructor and into POST time, plus other status/state checks for the supervisor. net it gets it out of only checking during deserialization and now is much more responsive in surfacing config changes that make a spec invalid/valid. before you would need a failover, service restart, etc. to get the status to reflect the change in the supervisor data. now it is pretty rapidly reflective of the actual state

let me know if this feels more in the right direction, thanks!

@capistrant capistrant requested a review from gianm April 2, 2026 13:43
@cecemei
Copy link
Copy Markdown
Contributor

cecemei commented Apr 7, 2026

@capistrant @gianm whats the status for this PR? are we ready to merge?

@capistrant
Copy link
Copy Markdown
Contributor Author

@capistrant @gianm whats the status for this PR? are we ready to merge?

@cecemei I am working on addressing Gian's latest comments. I was on vacation last week, sorry for the delay.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, latest changes LGTM.

@capistrant capistrant merged commit 0e91c6f into apache:master Apr 16, 2026
63 of 64 checks passed
@github-actions github-actions Bot modified the milestones: 37.0.0, 38.0.0 Apr 16, 2026
capistrant added a commit to capistrant/incubator-druid that referenced this pull request Apr 16, 2026
…plates in compaction supervisors (apache#19223)

* Fix how supervisor validation works for cascading reindexing templates in compaction supervisors

* some refactoring of compaction spec validation

* Add a validation check again before creating jobs

* Make compaction spec validation more dynamic and fresh compared to previous approach

* Fixup a place validation wasn't beging called after removing it from compaction config constructor

(cherry picked from commit 0e91c6f)
@capistrant capistrant modified the milestones: 38.0.0, 37.0.0 Apr 16, 2026
cecemei pushed a commit that referenced this pull request Apr 17, 2026
…plates in compaction supervisors (#19223) (#19334)

* Fix how supervisor validation works for cascading reindexing templates in compaction supervisors

* some refactoring of compaction spec validation

* Add a validation check again before creating jobs

* Make compaction spec validation more dynamic and fresh compared to previous approach

* Fixup a place validation wasn't beging called after removing it from compaction config constructor

(cherry picked from commit 0e91c6f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants