test(database): regression test for sqla engine creation (#27897)#40237
test(database): regression test for sqla engine creation (#27897)#40237rusackas wants to merge 4 commits into
Conversation
Code Review Agent Run #958a56Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #40237 +/- ##
==========================================
- Coverage 64.20% 64.01% -0.19%
==========================================
Files 2592 2592
Lines 139232 138949 -283
Branches 32327 32214 -113
==========================================
- Hits 89389 88952 -437
- Misses 48308 48459 +151
- Partials 1535 1538 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… once per URL Closes #27897 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
74c0769 to
125624e
Compare
Wires up _ENGINE_CACHE — a module-level dict keyed by (database_id, str(sqlalchemy_url), repr(sorted(engine_kwargs.items()))) — so that successive _get_sqla_engine(nullpool=False) calls reuse the same Engine instance instead of building a fresh one each invocation. Per SQLAlchemy docs the engine is meant to live for the process lifetime; recreating defeats every pool an operator configures via DB_CONNECTION_MUTATOR (the original bug report's duckdb queue-size-1 seeing multiple simultaneous connections). nullpool=True engines are skipped — those are intentionally poolless and there's nothing to reuse. The regression test added in the prior commit clears _ENGINE_CACHE in its setup so test ordering can't smuggle a stale entry past the assertion. Closes #27897
aminghadersohi
left a comment
There was a problem hiding this comment.
Thanks for this fix — the regression test is excellent and the cache comment block is unusually thorough. A few observations:
HIGH — thread safety on first access
_ENGINE_CACHE is a module-level dict with no locking (core.py:101). Under multi-threaded gunicorn workers the check-then-set at lines 587–589 is not atomic: two concurrent first-access calls for the same URL can both see a cache miss, both call create_engine, and both write — leaving a briefly-duplicated connection pool. Python's GIL keeps the dict from corrupting, but the "one engine per process per URL" guarantee becomes probabilistic rather than strict.
Suggested fix — protect lazy-init with a lock:
import threading
_ENGINE_CACHE: dict[tuple[int | None, str, str], Engine] = {}
_ENGINE_CACHE_LOCK = threading.Lock()
# in _get_sqla_engine, replace the cache block with:
with _ENGINE_CACHE_LOCK:
if cached := _ENGINE_CACHE.get(cache_key):
return cached
engine = create_engine(sqlalchemy_url, **engine_kwargs)
_ENGINE_CACHE[cache_key] = engine
return engineMEDIUM — self.id is None for unsaved instances
The cache key uses self.id as its first element (core.py:583). Database.id is None before the object is saved to the database. Two distinct unsaved Database instances with the same URI and engine kwargs would silently share a cache entry. Production code always saves before connecting, so this is unlikely, but worth documenting or guarding (e.g. skip caching when self.id is None).
NITs
dict[tuple[Any, ...], Engine]→ the key is always(int | None, str, str), so a more specific annotation is available.cached = ...; if cached is not None: return cached→ walrus:if cached := _ENGINE_CACHE.get(cache_key): return cached- Inline import at
core_test.py:555follows the existing file convention but worth cleaning up file-wide in a follow-up.
Praise
The regression test is exactly right: clears the global cache before running, mocks create_engine at the correct import path, calls twice, asserts call_count == 1 with a descriptive failure message. The docstring cites the SQLAlchemy docs and the originating issue. The nullpool guard is also correct — ephemeral engines should not be cached and you've handled both the lookup skip and the write skip.
Code Review Agent Run #904d8aActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
| # size-1 queue). Skip the cache when ``nullpool`` is True — those | ||
| # engines are intentionally poolless and there's nothing to reuse. | ||
| cache_key: tuple[Any, ...] | None = None | ||
| if not nullpool: |
There was a problem hiding this comment.
I think the cache is no longer engaged now. The new code is gated by if not nullpool: but every callsite of get_sqla_engine/_get_sqla_engine in superset/ accepts the default nullpool=True. The nullpool=True branch then forces poolclass=NullPool at :521, which is the exact behavior the issue says defeats DB_CONNECTION_MUTATOR's pool.
There was a problem hiding this comment.
Spot on — confirmed: every in-tree caller uses the default nullpool=True, so the cache as written was dormant exactly where it needs to engage. Fixed in 0555735 by removing the if not nullpool: gate so caching happens regardless. Even a NullPool engine has nontrivial construction cost (URL parsing, dialect resolution, connect_args setup, re-running DB_CONNECTION_MUTATOR), and the operator pool config #27897 is about can only persist if the engine object itself is reused.
Also updated the regression test to call with the default nullpool=True so it actually exercises the production path — if it had done that originally, you wouldn't have had to catch this for me. Thanks for the careful read!
|
This question isn’t related to the pull request. I can only help with questions about the PR’s code or comments. |
| # (database_id, str(sqlalchemy_url), repr(sorted(engine_kwargs.items()))). | ||
| # Populated only when ``nullpool=False`` — pooled engines are the only ones | ||
| # that benefit from process-wide reuse. | ||
| _ENGINE_CACHE: dict[tuple[Any, ...], Engine] = {} |
There was a problem hiding this comment.
will editing a Database (password rotation, host change, SSH tunnel reconfig) or deleting it ever pop the cache here?
There was a problem hiding this comment.
Good question. The cache key is (database_id, str(sqlalchemy_url), repr(sorted(engine_kwargs.items()))):
- Password rotation / host change / SSH tunnel reconfig:
sqlalchemy_urlis the decrypted URL (built from the latest Database fields each call), andengine_kwargsincludes whateverDB_CONNECTION_MUTATORproduces from the latest Database state. So a rotated password or changed host means a different key on the next call → cache miss → fresh engine. Existing in-flight requests on the old engine continue against the old credentials until they finish, which matches SQLAlchemy's normal behavior. - Deletion: stale entries linger until process restart — that's a memory footprint concern (a few hundred bytes per dead entry) rather than a correctness one. I'd rather not couple this PR to a SQLAlchemy event hook in
Database.__delete__since the right invalidation surface is bigger than just delete (would also want it on rename/clone), and worth scoping as a follow-up.
Updated the cache header comment to make the URL/kwargs key contract explicit.
@sadpandajoe (blocker): the previous cache was gated by `if not nullpool:` but every in-tree callsite of `get_sqla_engine` / `_get_sqla_engine` uses the default `nullpool=True`, which forces `poolclass=NullPool` at L521 — i.e. the cache was dormant exactly where it needs to engage to fix #27897. Remove the gate; cache regardless of `nullpool`. Even a NullPool engine has nontrivial construction cost (URL parsing, dialect resolution, connect_args setup, re-running DB_CONNECTION_MUTATOR), and the operator pool config that #27897 is about can only persist if the engine object itself is reused. @aminghadersohi (HIGH): thread-safety on first access. Two concurrent calls under multi-threaded gunicorn could both miss the cache, both call `create_engine`, and both write. Wrap lookup + write in a `threading.Lock` so the "one engine per process per (id, URL, kwargs)" guarantee is strict, not probabilistic. @aminghadersohi (MEDIUM): `self.id is None` on unsaved instances. Two distinct in-memory `Database` objects with the same URI would have collided on a shared cache entry. Skip caching when `self.id is None`. @aminghadersohi (NITs): - Tightened `dict[tuple[Any, ...], Engine]` → `dict[tuple[int, str, str], Engine]`. - Replaced `cached = ...; if cached is not None:` with walrus. Cache invalidation (sadpandajoe's L101 question): password rotation, host change, or any mutation that touches the decrypted URL or DB_CONNECTION_MUTATOR-produced kwargs naturally falls through to a fresh engine because the cache key includes both. Stale entries remain until process restart — that's a memory footprint concern but not a correctness one, and worth a follow-up rather than blocking this PR. Updated regression test to exercise the production default path (`nullpool=True`) instead of `nullpool=False`, which would have masked sadpandajoe's finding. Added a second test asserting unsaved instances don't cache. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the thorough review @aminghadersohi! All four addressed in 0555735:
Also addressed @sadpandajoe's blocker (the cache was gated by Pre-commit (ruff/mypy/pylint) all pass locally. |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
aminghadersohi
left a comment
There was a problem hiding this comment.
All three concerns from the previous round are addressed: guards both the cache read and write (HIGH); the gate prevents unsaved instances from sharing cache entries, with a dedicated regression test (MEDIUM); and the type annotation is now with the walrus operator in the lookup (NITs). LGTM.
aminghadersohi
left a comment
There was a problem hiding this comment.
All three concerns from the previous round are addressed: threading.Lock() guards both the cache read and write (HIGH); the self.id is not None gate prevents unsaved instances from sharing cache entries, with a dedicated regression test (MEDIUM); and the type annotation is now dict[tuple[int, str, str], Engine] with the walrus operator in the lookup (NITs). LGTM.
Code Review Agent Run #daa67fActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
This is a test-only PR opened as a TDD-style validation of issue #27897.
#27897 (filed 2024-04) reports that
Database._get_sqla_enginecallscreate_engineon every invocation instead of caching the engine per URL. Per SQLAlchemy docs an engine is meant to be created once per process per URL so its connection pool can do its job; the current behavior defeats the pool every timeDB_CONNECTION_MUTATORconfigures one.This PR adds one regression test on
Database._get_sqla_engine:test_get_sqla_engine_caches_engine_per_url— patchescreate_engine, calls_get_sqla_engine(nullpool=False)twice for the same URL, and assertscreate_enginewas called exactly once. Will fail until the engine is cached.How to interpret CI
Database._get_sqla_engine(or hoist into a module-level dict keyed by sqlalchemy_url + connect_args).TESTING INSTRUCTIONS
ADDITIONAL INFORMATION
🤖 Generated with Claude Code