
Fix registry incremental extract dropping modules from catalog#65975

Merged
potiuk merged 1 commit into apache:main from astronomer:registry-extract-multi-provider
Apr 27, 2026
Conversation


@kaxil kaxil commented Apr 27, 2026

Summary

Two related bugs in dev/registry/extract_parameters.py and dev/registry/merge_registry_data.py combined to silently corrupt the global module catalog on every single-provider incremental run, plus a multi-provider input contract that registry-build.yml advertises but extract_parameters.py doesn't honor.

Bug 1 — --provider doesn't accept space-separated input

extract_metadata.py (line 377) and extract_connections.py (line 218) both accept space-separated provider IDs. extract_parameters.py treats the whole string as a single ID and errors out. The registry-build.yml workflow_dispatch input documents provider as "Provider ID(s) for incremental build (space-separated, empty = full build)" — a multi-provider dispatch fails with "provider 'amazon google ...' not found in provider.yaml files".

Bug 2 — modules.json wipe on incremental

In --provider X mode, extract_parameters.py skipped writing modules.json entirely, on the assumption that "the output would be incomplete and would clobber the full modules.json from a previous build". merge_registry_data.py then drove module drops off the IDs in new_providers:

new_ids = {p["id"] for p in new_providers}
merged_modules = [m for m in existing_modules if m["provider_id"] not in new_ids]
merged_modules.extend(new_modules)  # often empty

Net effect: for every provider listed in new_providers, the merge removed that provider's module entries from the global catalog and replaced them with nothing. Every successful single-provider auto-trigger from publish-docs-to-s3.yml's update-registry job silently removed the targeted provider's modules (e.g. all of amazon's 377 entries) from the catalog used by /explore/, the per-provider operator listings, and PageFind search. A subsequent full build restored everything; in between, the registry was degraded.

Design rationale

Why always write modules.json (even partial in --provider mode)? The merge step is the right place to enforce "preserve non-targeted providers." Making the extractor responsible for "do I clobber?" coupled extraction state to merge state. The fix decouples them: extractor always writes what it discovered (full or partial); merge replaces only the providers actually present in the new payload.

Why drive merge drops off new_modules instead of new_providers? It's the conservative invariant: if you have replacement modules, replace; if you don't, leave existing alone. The previous logic assumed new_providers and new_modules track each other perfectly, which is wrong when extraction skips writes (or, post-fix, when a provider has no module data to write).
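The corrected invariant can be sketched as follows (a minimal reconstruction from the description and the snippet above, not the PR's verbatim diff):

```python
def merge_modules(existing_modules: list[dict],
                  new_modules: list[dict]) -> list[dict]:
    """Replace only providers that actually appear in new_modules.

    Sketch of the fixed merge: drops are keyed off the replacement
    payload itself, so an empty or missing new_modules preserves the
    existing catalog instead of wiping the targeted provider's entries.
    """
    replaced_ids = {m["provider_id"] for m in new_modules}
    merged = [m for m in existing_modules
              if m["provider_id"] not in replaced_ids]
    merged.extend(new_modules)
    return merged
```

With this shape, `merge_modules(existing, [])` is the identity on `existing`, which is exactly the conservative behaviour the trade-off paragraph below argues for.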

Trade-off: a provider that genuinely went from N modules to 0 between releases (refactored away, deprecated to config-only) leaves stale entries in the catalog until a full build clears them. That's preferable to the prior behaviour where every single-provider incremental update silently wiped the targeted provider's modules. Comment added in merge_registry_data.py documenting the trade-off.

Sibling-extractor parity: extract_metadata.py and extract_connections.py already split --provider on whitespace. The new _parse_requested_providers helper mirrors that pattern.

Gotchas

  • runtime_modules.json (debug stats only) is still skipped in --provider mode — partial stats would be misleading. No production code reads this file.
  • The new --provider " " (whitespace-only) case parses to an empty set, which is falsy, so downstream if requested_providers skips the filter — same as the sibling scripts. Probably an accidental full rebuild if a CI variable is unset, but matches existing behaviour.

Known follow-ups

  • Multi-provider dispatch via registry-build.yml's workflow_dispatch is still not exercised end-to-end in CI. That's a separate orchestration question, not in scope here.
  • The single-line print(f"Wrote ... modules ({scope_label}) ...") log was hoisted out of the loop body since scope_label is loop-invariant.

Two related bugs in `dev/registry/extract_parameters.py` and
`dev/registry/merge_registry_data.py` that combined to corrupt the global
module catalog on every single-provider incremental run:

1. `extract_parameters.py --provider X` skipped writing `modules.json`
   entirely on the assumption that "the output would be incomplete and
   would clobber the full modules.json from a previous build".

2. `merge_registry_data.py` then drove module drops off the IDs in
   `new_providers`, so for every provider listed in `providers.json`
   that wasn't replaced in `new_modules` (because it was empty), the
   merge dropped that provider's modules from the catalog and replaced
   them with nothing. Net effect: every successful incremental
   auto-trigger silently removed the targeted provider's modules from
   the global module catalog (e.g. all of amazon's 377 entries).

The `--provider` argument also accepted only a single ID string, not the
space-separated input that `extract_metadata.py` and
`extract_connections.py` already support and `registry-build.yml`
advertises in its workflow_dispatch input.

Changes:

`extract_parameters.py`:
- Add `_parse_requested_providers(arg)` helper mirroring the sibling
  scripts' parser.
- Rename `only_provider: str | None` -> `requested_providers: set[str] | None`.
- Always write `modules.json`; in `--provider` mode it covers only the
  requested providers (partial). The merge step is responsible for
  combining with the existing catalog without dropping non-targeted
  providers' entries.
- `runtime_modules.json` (debug stats) is still skipped in `--provider`
  mode since stats would be partial -- preserving the prior behaviour.

`merge_registry_data.py`:
- Drive module drops off the provider IDs present in `new_modules`,
  not `new_providers`. Empty/missing `new_modules` therefore preserves
  all existing modules instead of silently wiping the targeted
  provider's entries.

Tests:
- New `TestParseRequestedProviders` covering None, "", whitespace-only,
  single, multi, extra-whitespace, and duplicate cases.
- Updated `test_missing_new_modules_file` to assert preservation
  (previous assertion codified the bug).
- New `test_empty_new_modules_preserves_existing` (empty list payload).
- New `test_partial_new_modules_replaces_only_listed_providers`
  (mixed scenario: providers.json updates two providers but only one
  has fresh module data).
@shahar1
Contributor

shahar1 commented Apr 27, 2026

Encountered this issue here; you preceded me with a fix :) (#65976)

@potiuk potiuk merged commit 83b4c09 into apache:main Apr 27, 2026
141 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Airflow Registry Apr 27, 2026

Labels

area:dev-tools area:registry backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch

Projects

Status: Done
