Fix registry incremental extract dropping modules from catalog #65975
Merged
potiuk merged 1 commit into apache:main (Apr 27, 2026)
Conversation
Two related bugs in `dev/registry/extract_parameters.py` and `dev/registry/merge_registry_data.py` combined to corrupt the global module catalog on every single-provider incremental run:

1. `extract_parameters.py --provider X` skipped writing `modules.json` entirely, on the assumption that "the output would be incomplete and would clobber the full modules.json from a previous build".
2. `merge_registry_data.py` then drove module drops off the IDs in `new_providers`, so for every provider listed in `providers.json` that wasn't replaced in `new_modules` (because it was empty), the merge dropped that provider's modules from the catalog and replaced them with nothing.

Net effect: every successful incremental auto-trigger silently removed the targeted provider's modules from the global module catalog (e.g. all of amazon's 377 entries).

The `--provider` argument also accepted only a single ID string, not the space-separated input that `extract_metadata.py` and `extract_connections.py` already support and that `registry-build.yml` advertises in its workflow_dispatch input.

Changes:

`extract_parameters.py`:
- Add a `_parse_requested_providers(arg)` helper mirroring the sibling scripts' parser.
- Rename `only_provider: str | None` -> `requested_providers: set[str] | None`.
- Always write `modules.json`; in `--provider` mode it covers only the requested providers (partial). The merge step is responsible for combining it with the existing catalog without dropping non-targeted providers' entries.
- `runtime_modules.json` (debug stats) is still skipped in `--provider` mode since stats would be partial -- preserving the prior behaviour.

`merge_registry_data.py`:
- Drive module drops off the provider IDs present in `new_modules`, not `new_providers`. An empty or missing `new_modules` therefore preserves all existing modules instead of silently wiping the targeted provider's entries.
Tests:
- New `TestParseRequestedProviders` covering None, `""`, whitespace-only, single, multi, extra-whitespace, and duplicate cases.
- Updated `test_missing_new_modules_file` to assert preservation (the previous assertion codified the bug).
- New `test_empty_new_modules_preserves_existing` (empty list payload).
- New `test_partial_new_modules_replaces_only_listed_providers` (mixed scenario: `providers.json` updates two providers but only one has fresh module data).
shahar1 (Contributor) approved these changes Apr 27, 2026
potiuk approved these changes Apr 27, 2026
Summary
Two related bugs in `dev/registry/extract_parameters.py` and `dev/registry/merge_registry_data.py` combined to silently corrupt the global module catalog on every single-provider incremental run, plus a multi-provider input contract that's currently a lie.

Bug 1 — `--provider` doesn't accept space-separated input

`extract_metadata.py` (line 377) and `extract_connections.py` (line 218) both accept space-separated provider IDs. `extract_parameters.py` treats the whole string as a single ID and errors out. The `registry-build.yml` workflow_dispatch input documents `provider` as "Provider ID(s) for incremental build (space-separated, empty = full build)" — a multi-provider dispatch fails with `"provider 'amazon google ...' not found in provider.yaml files"`.

Bug 2 — `modules.json` wipe on incremental

In `--provider X` mode, `extract_parameters.py` skipped writing `modules.json` entirely on the assumption that "the output would be incomplete and would clobber the full modules.json from a previous build". `merge_registry_data.py` then drove module drops off `new_providers`'s IDs.

Net effect: for every provider listed in `new_providers`, the merge removed that provider's module entries from the global catalog and replaced them with nothing. Every successful single-provider auto-trigger from `publish-docs-to-s3.yml`'s `update-registry` job silently removed the targeted provider's modules (e.g. all of amazon's 377 entries) from the catalog used by `/explore/`, the per-provider operator listings, and PageFind search. A subsequent full build restored everything; in between, the registry was degraded.

Design rationale

Why always write `modules.json` (even partial in `--provider` mode)? The merge step is the right place to enforce "preserve non-targeted providers." Making the extractor responsible for "do I clobber?" coupled extraction state to merge state. The fix decouples them: the extractor always writes what it discovered (full or partial); the merge replaces only the providers actually present in the new payload.

Why drive merge drops off `new_modules` instead of `new_providers`? It's the conservative invariant: if you have replacement modules, replace; if you don't, leave existing alone. The previous logic assumed `new_providers` and `new_modules` track each other perfectly, which is wrong when extraction skips writes (or, post-fix, when a provider has no module data to write).

Trade-off: a provider that genuinely went from N modules to 0 between releases (refactored away, deprecated to config-only) leaves stale entries in the catalog until a full build clears them. That's preferable to the prior behaviour, where every single-provider incremental update silently wiped the targeted provider's modules. A comment was added in `merge_registry_data.py` documenting the trade-off.

Sibling-extractor parity: `extract_metadata.py` and `extract_connections.py` already split `--provider` on whitespace. The new `_parse_requested_providers` helper mirrors that pattern.

Gotchas

- `runtime_modules.json` (debug stats only) is still skipped in `--provider` mode — partial stats would be misleading. No production code reads this file.
- `--provider " "` (whitespace-only) parses to an empty set, which is falsy, so the downstream `if requested_providers` check skips the filter — same as the sibling scripts. Probably an accidental full rebuild if a CI variable is unset, but this matches existing behaviour.

Known follow-ups

- `registry-build.yml`'s `workflow_dispatch` is still not exercised end-to-end in CI. That's a separate orchestration question, not in scope here.
- The `print(f"Wrote ... modules ({scope_label}) ...")` log was hoisted out of the loop body since `scope_label` is loop-invariant.
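For reference, the space-separated contract advertised by `registry-build.yml` maps onto standard argparse wiring roughly like this. This is a hypothetical sketch: the `--provider` argument name matches the PR, but the defaults and help text are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(description="Registry extractor (sketch)")
parser.add_argument(
    "--provider",
    default=None,
    help="Provider ID(s) for incremental build (space-separated, empty = full build)",
)

# A multi-provider workflow_dispatch passes one string containing spaces.
args = parser.parse_args(["--provider", "amazon google"])

# Pre-fix, the whole string was looked up as one provider ID and failed
# ("provider 'amazon google ...' not found in provider.yaml files").
# Post-fix: split on whitespace; None stays None (full build), and a
# whitespace-only value collapses to a falsy empty set (the gotcha above).
requested = set(args.provider.split()) if args.provider is not None else None
```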