Description:
The resource_delete_total and resource_delete_duration_seconds metrics are incremented on every reconcile cycle for resources that don't exist and were never created.
When using a Deployment-based EnvoyProxy, the controller appears to call deleteDaemonSet(), deleteHPA(), and deletePDB() on every reconcile, because the createOrUpdate* functions seem to treat a nil render as a signal to delete.
After running for a few days, I'm seeing ~25k deletes for DaemonSet, HPA, and PDB - while the Deployment (which actually exists) shows zero deletes as expected.
This makes the metrics noisy and makes actual delete operations hard to track. It is also misleading on dashboards and does not match the description of the metric:
Basically, if a resource is not actually deleted, then I wouldn't expect it to be counted as such.
Repro steps:
- Deploy Envoy Gateway with default config (Deployment mode, no HPA/PDB)
- Create a Gateway and let it reconcile for a while
- Query metrics:
resource_delete_total
resource_delete_duration_seconds_sum
- Observe DaemonSet/HPA/PDB counters climbing while Deployment stays at 0
Environment:
- Envoy Gateway v1.7.0
- Kubernetes 1.34.2
- Default EnvoyProxy config (Deployment with 2 replicas, no autoscaling or PDB)
Logs:
N/A - no errors, just metric behavior
A possible fix would be to check whether DeleteAllOf actually deleted anything before recording metrics, or to skip the delete call entirely if a Get shows the resource doesn't exist.
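
To make the second option concrete, here is a minimal, self-contained sketch of the "Get first, record metrics only on a real delete" idea. It is not the Envoy Gateway code: resourceClient, deleteMetrics, deleteIfExists, fakeClient, and errNotFound are all hypothetical names that stand in for the real Kubernetes client, the real metric recorders, and a NotFound API error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// resourceClient is a hypothetical, minimal stand-in for the Kubernetes client.
type resourceClient interface {
	Get(kind, name string) error
	Delete(kind, name string) error
}

// deleteMetrics is a hypothetical stand-in for resource_delete_total and
// resource_delete_duration_seconds, keyed by resource kind.
type deleteMetrics struct {
	total    map[string]int
	duration map[string]time.Duration
}

func newDeleteMetrics() *deleteMetrics {
	return &deleteMetrics{
		total:    map[string]int{},
		duration: map[string]time.Duration{},
	}
}

// deleteIfExists deletes the resource only if a Get finds it, and records
// metrics only when a delete actually succeeded. It reports whether
// something was deleted.
func deleteIfExists(c resourceClient, m *deleteMetrics, kind, name string) (bool, error) {
	if err := c.Get(kind, name); err != nil {
		if errors.Is(err, errNotFound) {
			return false, nil // nothing to delete, so nothing to record
		}
		return false, err
	}

	start := time.Now()
	if err := c.Delete(kind, name); err != nil {
		if errors.Is(err, errNotFound) {
			return false, nil // removed by someone else between Get and Delete
		}
		return false, err
	}

	m.total[kind]++
	m.duration[kind] += time.Since(start)
	return true, nil
}

// fakeClient is an in-memory client used only to exercise the sketch.
type fakeClient struct{ objects map[string]bool }

func (f *fakeClient) Get(kind, name string) error {
	if !f.objects[kind+"/"+name] {
		return errNotFound
	}
	return nil
}

func (f *fakeClient) Delete(kind, name string) error {
	key := kind + "/" + name
	if !f.objects[key] {
		return errNotFound
	}
	delete(f.objects, key)
	return nil
}

func main() {
	// Only an HPA exists; the DaemonSet and PDB were never created.
	c := &fakeClient{objects: map[string]bool{"HorizontalPodAutoscaler/envoy": true}}
	m := newDeleteMetrics()

	// Simulate three reconcile cycles that each attempt all three deletes.
	kinds := []string{"DaemonSet", "HorizontalPodAutoscaler", "PodDisruptionBudget"}
	for i := 0; i < 3; i++ {
		for _, kind := range kinds {
			if _, err := deleteIfExists(c, m, kind, "envoy"); err != nil {
				fmt.Println("delete failed:", err)
			}
		}
	}

	// prints "0 1 0": only the HPA that really existed is counted, once.
	fmt.Println(m.total["DaemonSet"], m.total["HorizontalPodAutoscaler"], m.total["PodDisruptionBudget"])
}
```

The trade-off in this sketch is one extra read per resource kind on each reconcile. That is probably cheap if the Get is served from an informer cache, but that is an assumption about how the controller's client is set up.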