ENG-3593: Eliminate redundant System fetches in /system/upsert #8080
Conversation
Each upsert call previously issued four `Fetching resource` (`get_resource`) round-trips per row on the UPDATE path:

1. `upsert_system`: existence check
2. `update_system`: re-fetch with relationships
3. `update_resource`: pre-UPDATE sanity (inside `crud.update_resource`)
4. `update_resource`: post-UPDATE return (inside `crud.update_resource`)

Three of those are removable without changing behavior:

- Thread the System loaded by `upsert_system`'s existence check through to `update_system` as an optional `existing_system` parameter (eliminates #2). The direct `PUT /system` endpoint caller continues to work; the new parameter is optional and falls back to the original fetch when absent.
- Inline the UPDATE in `update_system` rather than calling `crud.update_resource`, using the ORM object we already hold and a single `db.refresh()` to pick up any DB-side coercions (eliminates #3 and #4).

Result: 4 -> 1 fetch per updated row. `db.refresh()` does not flow through `crud.get_resource`, so it does not emit `Fetching resource` log lines.

Also adds per-axis change-detection logging in `_audit_system_changes` (general / privacy_declarations / data_flow). DeepDiff calls are now captured once into named variables and reused for both the existing `SystemHistory` write gate and a new `System change detection` debug log line. No behavior change in the audit path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
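The pass-through shape described above can be sketched with stdlib stand-ins. Here `System`, `get_resource`, `update_system`, and `upsert_system` are simplified stand-ins for the real functions in `src/fides/api/db/system.py`, and the counter only illustrates the round-trip arithmetic:

```python
import asyncio
from typing import List, Optional

FETCH_COUNT = 0  # counts simulated get_resource round-trips

class System:
    def __init__(self, fides_key: str) -> None:
        self.fides_key = fides_key

async def get_resource(fides_key: str) -> System:
    # Stand-in for crud.get_resource; each call is one DB round-trip.
    global FETCH_COUNT
    FETCH_COUNT += 1
    return System(fides_key)

async def update_system(
    resource: dict, existing_system: Optional[System] = None
) -> System:
    # Fall back to a fetch only when the caller did not pre-load the row,
    # so the direct PUT /system path keeps working unchanged.
    return existing_system or await get_resource(resource["fides_key"])

async def upsert_system(resources: List[dict]) -> None:
    for resource in resources:
        existing = await get_resource(resource["fides_key"])  # existence check
        # UPDATE path: thread the already-loaded row through, no re-fetch.
        await update_system(resource, existing_system=existing)

asyncio.run(upsert_system([{"fides_key": "a"}, {"fides_key": "b"}]))
assert FETCH_COUNT == 2  # one fetch per row instead of two
```

Because `existing_system` defaults to `None`, callers that never pre-loaded the row (the direct endpoint path) behave exactly as before.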
Codecov Report

✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (84.99%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

```
@@ Coverage Diff @@
##             main    #8080   +/-   ##
=======================================
  Coverage   84.98%   84.99%
=======================================
  Files         633      633
  Lines       41737    41753     +16
  Branches     4886     4887      +1
=======================================
+ Hits        35471    35487     +16
- Misses       5157     5158      +1
+ Partials     1109     1108      -1
```
- Fix mypy narrowing of `existing_system: Optional[System]` in
`update_system` by introducing a typed local `system: System` populated
from either the passed-in argument or `get_resource`. Avoids `assert`
in app code and keeps the function body unchanged in shape.
- Rename changelog fragment to `8080-...yaml` (filename must start with
PR number per `noxfiles/changelog_nox.py:121`) and add the required
`pr: 8080` field.
- Add three regression tests in
`tests/ctl/core/test_system_history.py::TestUpsertSystemFetchOptimization`:
- `test_upsert_passes_existing_system_to_update` — mock-based,
verifies `upsert_system` threads the loaded System through.
- `test_update_system_with_existing_system_persists_changes` —
functional, verifies the new code path actually persists + audits.
- `test_upsert_system_emits_one_fetch_per_updated_system` — uses
`loguru_caplog` to assert exactly one `Fetching resource` log per
system on the UPDATE path. Guards against accidental re-introduction
of a redundant `get_resource` call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
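The mypy narrowing fix from the first bullet can be sketched as follows. The types and function bodies are illustrative stand-ins; only the typed-local pattern itself mirrors the change:

```python
from typing import Optional

class System:
    """Stand-in for the ORM model; illustration only."""

def get_resource(fides_key: str) -> System:
    # Stand-in for crud.get_resource.
    return System()

def update_system(
    fides_key: str, existing_system: Optional[System] = None
) -> System:
    # A typed local narrows Optional[System] -> System for mypy
    # without putting an `assert` in application code.
    system: System = (
        existing_system if existing_system is not None else get_resource(fides_key)
    )
    return system
```

The rest of the function body can then use `system` unconditionally, keeping its original shape.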
The test re-used the same SystemSchema instances across the prime and measure passes, but `update_system` mutates its inputs by stripping `privacy_declarations` after upserting them. The second pass then hit `AttributeError: 'System' object has no attribute 'privacy_declarations'` inside `validate_privacy_declarations`.

Build a fresh payload for each pass while keeping the fides_keys stable so both passes hit the same DB rows.

Verified by running the test suite in the local fides container — all 9 tests in tests/ctl/core/test_system_history.py pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
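The fix can be sketched as below. `build_payload` is a hypothetical helper and `fake_update_system` stands in for the real `update_system`'s input mutation; only the fresh-payload-per-pass idea mirrors the actual change:

```python
def build_payload(n: int = 3) -> list:
    # Fresh objects on every call; the fides_keys stay stable so the
    # prime and measure passes target the same DB rows.
    return [
        {"fides_key": f"system_{i}", "privacy_declarations": [{"name": "decl"}]}
        for i in range(n)
    ]

def fake_update_system(resource: dict) -> None:
    # Stand-in for the mutation in the real update_system, which strips
    # privacy_declarations after upserting them.
    resource.pop("privacy_declarations", None)

prime = build_payload()
for resource in prime:
    fake_update_system(resource)

# Re-using `prime` here would hand update_system resources that are
# already missing privacy_declarations; a fresh payload avoids that
# while still hitting the same fides_keys.
measure = build_payload()
```
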
/code-review
Code Review: Eliminate redundant System fetches in /system/upsert
The optimization goal is clear and well-motivated. The existing_system pass-through in upsert_system is correct, and pre-computing the three DeepDiff results in _audit_system_changes (avoiding triple re-computation) is a nice cleanup. The test coverage is a good addition.
A few items to address before merge:
Bug — test fixture teardown is a no-op (both fixtures)
delete(System).where(...) in both system fixtures (line 37 pre-existing, line 164 new) constructs but never executes the delete statement — rows accumulate in the test DB across runs. See inline comment on line 164.
Atomicity gap between UPDATE and audit write
The system data UPDATE and the SystemHistory audit entries commit in separate transactions. A failure between them leaves the system updated with no history record. This was structurally the same before, but since the UPDATE is now inlined it would be straightforward to merge both operations into one async with db.begin() block. See inline comment on lines 239–251.
Missing SQLAlchemyError guard on inline UPDATE
The removed update_resource caught SQLAlchemyError and raised a domain-level QueryError. The inlined replacement lets raw SQLAlchemy errors propagate. See inline comment on lines 239–248.
Log-string coupling in regression test
test_upsert_system_emits_one_fetch_per_updated_system infers fetch counts by scanning for "Fetching resource" in logs. If the log message changes this gives a false pass. Patching get_resource and counting calls is a more reliable approach. See inline comment on line 247.
Minor: return type annotation improvement
update_system now correctly returns Tuple[System, bool] (was Tuple[Dict, bool]). The route handler at src/fides/api/v1/endpoints/system.py:277 still annotates -> Dict but returns the System ORM object — that inconsistency predates this PR, but it's worth a follow-up cleanup.
```python
async with db.begin():
    log.debug(
        "Updating resource",
        sql_model="System",
        fides_key=resource.fides_key,
    )
    await db.execute(
        sql_update(System.__table__)
        .where(System.fides_key == resource.fides_key)
        .values(resource_dict)
    )
```
src/fides/api/db/system.py:239-248
The old update_resource wrapped the execute() in a try/except SQLAlchemyError and re-raised a domain-level QueryError with a structured log message. The inlined version lets raw SQLAlchemy exceptions propagate to callers unchecked. Worth adding the same guard:
```python
from sqlalchemy.exc import SQLAlchemyError
from fides.api.util import errors

try:
    await db.execute(
        sql_update(System.__table__)
        .where(System.fides_key == resource.fides_key)
        .values(resource_dict)
    )
except SQLAlchemyError as e:
    log.exception(f"Failed to update System with error: '{e}'")
    raise errors.QueryError()
```

```python
fides_key=fk,
organization_fides_key="default_organization",
name=f"Fetch count test {i}",
system_type="test",
```
tests/ctl/core/test_system_history.py:247
test_upsert_system_emits_one_fetch_per_updated_system relies on the literal string "Fetching resource" appearing in log output to count DB fetches. This is a fragile coupling to crud.get_resource's internal log message — if that string ever changes or the log level is raised, the assertion len(matching) == 1 will always pass even if the regression is re-introduced.
A more robust approach would be to patch get_resource itself and count invocations:
```python
with patch("fides.api.db.system.get_resource", wraps=get_resource) as mock_get:
    await upsert_system(resources=build_payload(), db=async_session, ...)

# Each system should only be fetched once (existence check in upsert_system).
assert mock_get.call_count == len(fides_keys)
```

This directly measures what the optimization is meant to guarantee rather than inferring it through log messages.
- Add SQLAlchemyError -> errors.QueryError guard around the inline UPDATE in update_system, mirroring the guard the prior crud.update_resource call had. Without this the inlined UPDATE lets raw SQLAlchemy errors propagate, a small behavior regression.
- Refactor test_upsert_system_emits_one_fetch_per_updated_system to patch crud.get_resource at the call site (with `wraps=` so behavior is unchanged) and assert mock.call_count, instead of scanning loguru output for the literal 'Fetching resource' string. Directly measures the metric the optimization is meant to guarantee and isn't coupled to log message wording.

Skipped on purpose to keep scope tight:

- Fixture teardown delete() that's never executed -- the same no-op pattern exists in the pre-existing fixture at line 37; out of scope.
- Atomicity gap between the inline UPDATE and the SystemHistory write in separate `async with db.begin()` blocks -- structurally pre-existing (was the same shape via crud.update_resource); tracked for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
except SQLAlchemyError as exc:
    # Mirrors the guard the prior `crud.update_resource` call had.
    log.exception(f"Failed to update System with error: '{exc}'")
    raise errors.QueryError()
```
These lines are uncovered by tests; can we add a test for the exception case?
[Claude]: added in f447f65941 — test_update_system_raises_query_error_on_sqlalchemy_error patches async_session.execute to raise SQLAlchemyError only on the System UPDATE and asserts errors.QueryError propagates.
Addresses erosselli's review on PR #8080: the inlined UPDATE's `except SQLAlchemyError` branch was uncovered. New test patches `async_session.execute` to raise on the System UPDATE only and asserts `errors.QueryError` propagates, matching the guard the prior `crud.update_resource` call provided. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
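The shape of that exception-path test can be sketched with stdlib stand-ins. `SQLAlchemyError` and `QueryError` below are local stand-ins for `sqlalchemy.exc.SQLAlchemyError` and `fides.api.util.errors.QueryError`, and `inline_update` mimics only the guard, not the real `update_system`:

```python
import asyncio
from unittest.mock import AsyncMock

class SQLAlchemyError(Exception):
    """Stand-in for sqlalchemy.exc.SQLAlchemyError."""

class QueryError(Exception):
    """Stand-in for fides.api.util.errors.QueryError."""

async def inline_update(db) -> None:
    # Shape of the guarded inline UPDATE in update_system.
    try:
        await db.execute("UPDATE systems ...")  # hypothetical statement
    except SQLAlchemyError:
        raise QueryError()

async def run_test() -> bool:
    db = AsyncMock()
    # Patch execute to raise, as the real test does on the System UPDATE.
    db.execute.side_effect = SQLAlchemyError("boom")
    try:
        await inline_update(db)
    except QueryError:
        return True  # the guard converted the driver error
    return False
```
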
Ticket ENG-3593
Description Of Changes
Reduces per-row `get_resource` calls in `/system/upsert` from 4 → 1 on the UPDATE path, and adds per-axis change-detection logging in `_audit_system_changes` so we can quantify the steady-state no-op rate before considering a "skip-unchanged" optimization.

Each updated system on the UPDATE path was previously issuing four `Fetching resource` round-trips:

1. `upsert_system`: existence check
2. `update_system`: re-fetch with relationships
3. `update_resource`: pre-UPDATE sanity (inside `crud.update_resource`)
4. `update_resource`: post-UPDATE return (inside `crud.update_resource`)

Three are pure dedup: none of them inspect the row for anything except "does it exist" or "what is its current state," both of which the caller already has by step (1).
This was identified as a contributor to the system memory baseline during the recent `system/upsert` perf investigation; reducing fetches lowers both wall time and peak memory on this endpoint.

Code Changes

- `upsert_system` keeps the `System` returned by its existence check and threads it to `update_system` via a new optional `existing_system=` parameter. The direct `PUT /system` endpoint caller (no pre-loaded System) continues to work: the parameter defaults to `None` and falls back to the original fetch.
- `update_system` replaces its call to `crud.update_resource` with an inline `sql_update(System.__table__).where(...).values(resource_dict)` plus `db.refresh(existing_system)`. The refresh re-reads the row so DB-side coercions (timestamps, JSONB normalization, etc.) are still picked up; the `_audit_system_changes` comparison stays equivalent to the prior behavior.
- `_audit_system_changes` captures each `DeepDiff` once into named variables (`general_diff`, `privacy_diff`, `data_flow_diff`) and emits a structured `System change detection` debug log per system with `general_changed`, `privacy_declarations_changed`, `data_flow_changed`, `any_changed`, plus affected-paths counts per axis. No behavior change: same `if`-gates, same `SystemHistory` writes; this is purely observability.
- `db.refresh()` does not flow through `crud.get_resource`, so it does not emit a `Fetching resource` log line. The `Fetching resource` count on the same payload drops from ~4N to ~N (the remaining one is `upsert_system`'s existence check, which a follow-up could batch).

Performance verification (local benchmark)
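The compute-once pattern in `_audit_system_changes` can be sketched as follows; a shallow dict compare stands in for `DeepDiff`, and the function names and field layout are simplified stand-ins:

```python
def shallow_diff(before: dict, after: dict) -> dict:
    # Stand-in for DeepDiff: map of changed keys -> (old, new) pairs.
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }

def audit_system_changes(before: dict, after: dict) -> dict:
    # Each diff is computed once into a named variable and reused for
    # both the history-write gate and the change-detection log payload.
    general_diff = shallow_diff(before["general"], after["general"])
    privacy_diff = shallow_diff(
        before["privacy_declarations"], after["privacy_declarations"]
    )
    data_flow_diff = shallow_diff(before["data_flow"], after["data_flow"])
    detection = {
        "general_changed": bool(general_diff),
        "privacy_declarations_changed": bool(privacy_diff),
        "data_flow_changed": bool(data_flow_diff),
    }
    detection["any_changed"] = any(detection.values())
    return detection
```
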
reproduce_upsert_perf.py
Methodology: build a payload of N Systems × M declarations, prime once (INSERT path), then upsert the same payload three times to exercise the UPDATE path on every row with zero data drift. Run on `main` and this branch in turn against the same local backend / DB. Backend at DEBUG log level so `Fetching resource` / `Updating resource` lines emit. Reproduction script will be attached as a comment on this PR.

Payload: 180 systems × 3 declarations each.
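The prime-then-measure loop can be sketched as below. `run_benchmark` and its `upsert` callable are hypothetical; the real script drives `/system/upsert` over HTTP:

```python
import time

def run_benchmark(upsert, payload, passes: int = 3) -> list:
    # Prime once so every subsequent pass takes the UPDATE path.
    upsert(payload)
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        upsert(payload)  # identical payload -> zero data drift
        timings.append(time.perf_counter() - start)
    return timings
```
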
| | `main` | this branch |
| --- | --- | --- |
| `Fetching resource` per call | | |
| `Updating resource` per call | | |
| `System change detection` per call | | |

The fetch reduction matches the call-graph trace exactly. The wall-time reduction (~46%) is less than the fetch-count reduction (~75%), meaning roughly half of the original UPDATE-path wall time is non-fetch work (DeepDiff, model_dump, deepcopy of the existing snapshot) — which is exactly the headroom available for the higher-risk "skip-unchanged" follow-up.
Steps to Confirm
- `tests/ctl/core/test_system_history.py`: covers `update_system` + `_audit_system_changes` directly. The critical case is `test_no_changes` (idempotent upsert ⇒ zero `SystemHistory` rows). All six cases should pass: `test_system_information_changes`, `test_privacy_declaration_changes`, `test_ingress_egress_changes`, `test_multiple_changes`, `test_no_changes`, `test_automatic_system_update`.
- `tests/ctl/core/test_system.py` and `tests/ops/api/v1/endpoints/test_system.py` for broader coverage and to exercise the `PUT /system` path (which uses the `existing_system=None` fallback).
- Run the benchmark: `Fetching resource` logs per measure call drop (down from ~720), wall time roughly halved on local Postgres, and one `System change detection` line per row with `any_changed=False` on idempotent payloads.

Pre-Merge Checklist
- `CHANGELOG.md` updated
- Up to date with `main`
- `downgrade()` migration is correct and works

Followups
- Replace `upsert_system`'s per-row existence check with a single batched `SELECT ... WHERE fides_key IN (...)` plus `selectinload(System.privacy_declarations)`. Collapses the remaining N fetches to ~1.
- "Skip-unchanged" optimization (would change `updated_at` semantics) — the `System change detection` logs added here will quantify the opportunity first. The benchmark above shows ~1.65s/call of non-fetch wall time still on the table.
- Guard any such change with (1) regression tests (`test_system_history.py`), and (2) DB-call-count assertions to prevent silent regression of the fetch-count reduction.
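The round-trip arithmetic behind the batching follow-up can be sketched with dict stand-ins. In SQLAlchemy the batched version would be a single `select(System).where(System.fides_key.in_(keys))` with `selectinload(System.privacy_declarations)`; the counter below only illustrates the N-to-1 collapse:

```python
FETCH_QUERIES = 0  # counts simulated DB round-trips

def get_resource(table: dict, fides_key: str):
    # Per-row lookup: one round-trip per key.
    global FETCH_QUERIES
    FETCH_QUERIES += 1
    return table.get(fides_key)

def get_resources_batched(table: dict, fides_keys: list) -> dict:
    # One round-trip for the whole payload, the shape a
    # `SELECT ... WHERE fides_key IN (...)` query would take.
    global FETCH_QUERIES
    FETCH_QUERIES += 1
    return {k: table[k] for k in fides_keys if k in table}

table = {f"s{i}": {"fides_key": f"s{i}"} for i in range(180)}
keys = list(table)
per_row = [get_resource(table, k) for k in keys]  # 180 simulated queries
batched = get_resources_batched(table, keys)      # 1 simulated query
```
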