Skip to content

Fix: dataset deletion cleanup gaps#21

Merged
AlexianMasson merged 6 commits into
mainfrom
fix/dataset-deletion-cleanup-gaps
Jun 3, 2026
Merged

Fix: dataset deletion cleanup gaps#21
AlexianMasson merged 6 commits into
mainfrom
fix/dataset-deletion-cleanup-gaps

Conversation

@AlexianMasson
Copy link
Copy Markdown
Collaborator

Audit of what each action creates vs. what deletion cleaned up — the gaps found on the deletion path, fixed:

  • GeoServer ACL rules: deletion removed the feature type but left its .r/.w rules. Since final_table_name is reused by get_available_table_name, old rules could silently apply to a future dataset. delete_layer_acl is now called alongside delete_layer (best-effort, 404-tolerant).
  • In-flight runs: no runs were cancelled before deletion; a run finishing afterwards would recreate tables with nothing tracking them. cancel_ingestion_dag now force-fails the scheduled ingestion DAG, process_dag and staging_dag runs first.
  • Stale ingestion DAG: the ingestion_<id> DAG was only deleted when a schedule was still set at deletion time. Deletion now always calls delete_dag (404-tolerant).
  • Run history: staging_dag/process_dag run records (dag runs, task instances, XComs carrying table names) were never removed. purge_dataset_dag_runs deletes them via run_id_pattern LIKE before the IntegrityLink row is deleted.
  • Org-shared resources: the GeoServer workspace {org}, datastore {org}_ds and the org PostgreSQL schema outlived all datasets forever. After the last dataset of an org is deleted, the service attempts (best-effort) non-recursive REST deletes (GeoServer refuses when non-empty → harmless) and DROP SCHEMA … RESTRICT (skipped for shared staging/data schemas, guard co-located with the schema config).

Known limitation: failed-run log files on the Airflow volume are unreachable from the backend and remain.

Dataset deletion removed the feature type but left the layer's
.r/.w ACL rules in GeoServer. Since the final table name becomes
reusable once the row is deleted, a stale rule could silently apply
to a future dataset published under the same name.

Add GeoServerService.delete_layer_acl (best-effort, 404-tolerant)
and call it from DatasetDeletionService next to delete_layer.
DAG deletion was guarded by the schedule still being set. When the
schedule had been cleared before the dataset was deleted, the stale
ingestion_<id> DAG (metadata + run history) stayed in Airflow forever.

Call delete_dag unconditionally — it already treats 404 as success.
A staging_dag or process_dag run still in flight when the dataset was
deleted would recreate the staging/final table after cleanup (its
callbacks 404, leaving an orphaned table with no GS/GN artifacts).

Force-fail all runs of the dataset first (best-effort): the scheduled
ingestion DAG, process_dag runs (prefix '<id>_') and staging_dag runs
(prefix '<id>' — the first staging run id is exactly the uuid).
staging_dag/process_dag run records (with task instances and XComs
holding staging/final table names) outlived the dataset forever.

Add purge_dataset_dag_runs, matching run ids by SQL LIKE prefix
('<id>%' for staging_dag, '<id>_%' for process_dag), called best-effort
before the IntegrityLink row is deleted.

Known limitation: failed-run log files under /opt/airflow/logs live on
the Airflow volume and cannot be reached from the backend; success-run
logs are already removed by the DAG success callback.
The per-org GeoServer workspace ({org}) and datastore ({org}_ds) and
the org PostgreSQL schema survived forever, even after the org's last
dataset was deleted.

When no IntegrityLink remains for the organization, delete the
datastore and workspace via raw non-recursive REST calls (GeoServer
refuses to delete non-empty resources, so anything still in use
survives; geoservercloud's own delete helpers hardcode recurse=true and
are deliberately avoided) and DROP SCHEMA ... RESTRICT the org schema.
The shared 'staging' and default 'data' schemas are never touched.
The never-drop guard hardcoded the 'staging' literal inside the
deletion service; when get_staging_schema becomes configurable the
guard would silently stop covering it. is_shared_schema now derives
from get_staging_schema/DEFAULT_DATA_SCHEMA in core.config.
Copy link
Copy Markdown
Contributor

@tonio tonio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I wouldn't have guess there was so many things to cleanup !

@AlexianMasson
Copy link
Copy Markdown
Collaborator Author

Nice, I wouldn't have guess there was so many things to cleanup !

I was the first to be surprised !

@AlexianMasson AlexianMasson merged commit c1fbf6e into main Jun 3, 2026
4 checks passed
@AlexianMasson AlexianMasson deleted the fix/dataset-deletion-cleanup-gaps branch June 3, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants