fix: Performance and reliability improvements for database and caching#565
filthyrake merged 3 commits into `dev`
Conversation
(#544, #545, #546, #549, #553)

- Add 30s timeout to all database retry operations to prevent indefinite hangs (#549)
- Replace VOD list COUNT query with window function for O(1) pagination (#544)
- Replace chat pagination COUNT query with limit+1 pattern (#546)
- Add sharded locks (16 shards) to bitrate cache to reduce contention (#545)
- Add max size limit (1000 entries) to bitrate cache to prevent memory leaks (#553)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes based on specialized agent reviews:

db_retry.py (Issue #549 review feedback):
- Add total_deadline parameter (default 60s) to bound cumulative retry time
- Document that write timeouts are NOT retried (write may have committed)
- Check deadline before each retry attempt
- Ensure sleep doesn't exceed remaining deadline

live_metrics.py (Issue #545/#553 review feedback):
- Replace all-locks eviction with single eviction lock
- Eviction now acquires only per-shard locks when deleting
- Prevents concurrent evictions without blocking normal cache operations
- Normal cache reads/writes can proceed during eviction

live_schemas.py + studio_chat.py:
- Document that ChatMessageListResponse.total is count of returned messages
- Clarify this is intentional for cursor-based pagination (use has_more instead)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR improves reliability and performance across the studio/live stack by adding database operation timeouts, optimizing pagination queries, and reducing contention/unbounded growth in the bitrate cache.
Changes:
- Add per-operation DB query timeouts and a total retry deadline in `db_retry`.
- Optimize VOD list pagination by using a `COUNT(*) OVER()` window function.
- Optimize chat message pagination by switching from COUNT queries to a `limit + 1` approach.
- Reduce bitrate cache lock contention via sharded locks and cap cache growth with a max-size eviction strategy.
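The `limit + 1` pattern can be sketched in plain Python, independent of the PR's actual SQLAlchemy code (the `rows_fetcher` callable here is a hypothetical stand-in for the database query):

```python
def paginate_limit_plus_one(rows_fetcher, limit):
    """Cursor pagination without COUNT: fetch limit+1 rows and use the
    extra row only as a probe for whether another page exists."""
    rows = rows_fetcher(limit + 1)  # one extra row as the has_more probe
    has_more = len(rows) > limit
    return rows[:limit], has_more

# Usage with a fake data source of 25 rows:
data = list(range(25))
page, has_more = paginate_limit_plus_one(lambda n: data[:n], limit=10)
```

The page returned to the client never includes the probe row; the caller exposes `has_more` instead of a total count, which is what makes the separate COUNT query unnecessary.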
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| api/studio_vod.py | Replaces separate COUNT query with COUNT(*) OVER() and derives total from returned rows. |
| api/studio_chat.py | Removes COUNT query for chat pagination, uses limit+1 to compute has_more, and changes total semantics. |
| api/live_schemas.py | Updates chat list schema docs to reflect cursor pagination and new total meaning. |
| api/live_metrics.py | Introduces sharded locks and max-size eviction for bitrate cache. |
| api/db_retry.py | Adds per-query timeouts and an overall retry deadline to prevent indefinite hangs. |
Comments suppressed due to low confidence (1)
api/studio_chat.py:234
- The new pagination query relies on `WHERE stream_id = ? AND deleted_at IS NULL` plus `id < before_id` and `ORDER BY id DESC LIMIT …`. There isn't an index matching this pattern (the existing migration creates `ix_chat_messages_stream_created` on `(stream_id, created_at)`), so this can still degrade into inefficient scans on large chats. Consider adding a migration for a composite/partial index such as `(stream_id, id DESC)` with `WHERE deleted_at IS NULL` (or including `deleted_at` in the index) to support the new access path.
```python
query = (
    sa.select(
        chat_messages,
        users.c.username.label("username"),
    )
    .select_from(
        chat_messages.outerjoin(users, chat_messages.c.user_id == users.c.id)
    )
    .where(chat_messages.c.stream_id == stream["id"])
    .where(chat_messages.c.deleted_at.is_(None))  # Exclude deleted messages
    .order_by(chat_messages.c.id.desc())
    .limit(effective_limit + 1)  # Fetch one extra to detect has_more
)
if before_id:
    query = query.where(chat_messages.c.id < before_id)
```
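The suggested composite/partial index can be sketched against SQLite as a stand-in engine (the real project would add it through a migration, and the index name here is illustrative):

```python
import sqlite3

# Minimal schema mirroring the columns the review comment mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_messages ("
    " id INTEGER PRIMARY KEY, stream_id INTEGER, deleted_at TEXT)"
)
# Composite partial index matching the new access path:
# equality on stream_id, descending range scan on id, deleted rows excluded.
conn.execute(
    "CREATE INDEX ix_chat_messages_stream_active "
    "ON chat_messages (stream_id, id DESC) "
    "WHERE deleted_at IS NULL"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM chat_messages "
    "WHERE stream_id = ? AND deleted_at IS NULL AND id < ? "
    "ORDER BY id DESC LIMIT 50",
    (1, 100),
).fetchall()
# The plan should report a search using ix_chat_messages_stream_active
print(plan)
```

Because the query's `deleted_at IS NULL` predicate matches the index's `WHERE` clause, the partial index is eligible, and the `(stream_id, id DESC)` column order serves both the equality filter and the `ORDER BY id DESC` without a sort.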
api/studio_chat.py
```python
# Note: 'total' is count of returned messages, NOT total in database.
# This is intentional for cursor-based pagination - use 'has_more' for pagination.
# (Issue #546 - removed expensive COUNT query for performance)
total=len(messages),
has_more=has_more,
```
This changes the meaning of the total field from “total messages matching the query” to “messages returned in this page”. That’s a backwards-incompatible API change and the field name is likely to be misinterpreted by clients. Consider either (a) renaming the field (e.g., returned/count) and keeping total semantics, or (b) making total optional/omitted and introducing a new field for per-page count, to avoid breaking existing consumers.
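Option (b) can be sketched with a plain dataclass (the real schema uses Pydantic, and the `returned` field name is illustrative, not the PR's):

```python
from dataclasses import dataclass
from typing import Optional


# Sketch of option (b): keep `total` on the wire for compatibility but
# stop populating it, and expose the per-page count under a new name.
@dataclass
class ChatMessageListResponse:
    messages: list
    has_more: bool
    returned: int = 0             # count of messages in this page
    total: Optional[int] = None   # deprecated; use has_more for pagination


resp = ChatMessageListResponse(messages=["a", "b"], has_more=True, returned=2)
```

Existing consumers that read `total` still find the field present (as `null`), while new consumers get an unambiguous per-page count.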
```python
remaining_after_expired = len(_bitrate_cache) - len(keys_to_evict)
if remaining_after_expired > BITRATE_CACHE_MAX_SIZE:
    # Sort non-expired entries by expiry time (oldest first)
    non_expired = [
        (k, v) for k, v in _bitrate_cache.items()
        if k not in keys_to_evict
    ]
    non_expired.sort(key=lambda x: x[1][1])
    evict_count = remaining_after_expired - BITRATE_CACHE_MAX_SIZE
    keys_to_evict.extend(k for k, _ in non_expired[:evict_count])
```
In _enforce_cache_size_limit, non_expired filtering does if k not in keys_to_evict where keys_to_evict is a list. That makes eviction selection potentially O(n^2) as cache size grows past the limit. Consider using a set(keys_to_evict) for membership checks (and/or building non_expired from a snapshot like list(_bitrate_cache.items())) to keep eviction overhead predictable.
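The set-based fix the reviewer suggests can be sketched as a standalone function (cache values are assumed to be `(bitrate, expiry)` tuples, matching the sort key in the quoted code; the function name is illustrative):

```python
BITRATE_CACHE_MAX_SIZE = 1000  # matches the limit described in the PR


def select_evictions(cache, now, max_size=BITRATE_CACHE_MAX_SIZE):
    """Pick keys to evict: all expired entries, plus the soonest-expiring
    survivors needed to get back under max_size. Uses a set so the
    'not already marked' check is O(1) instead of O(n)."""
    keys_to_evict = {k for k, (_, expiry) in cache.items() if expiry <= now}
    remaining = len(cache) - len(keys_to_evict)
    if remaining > max_size:
        # Sort surviving entries by expiry (soonest first) and evict just
        # enough of them to satisfy the size cap.
        non_expired = sorted(
            ((k, v) for k, v in cache.items() if k not in keys_to_evict),
            key=lambda kv: kv[1][1],
        )
        keys_to_evict.update(k for k, _ in non_expired[: remaining - max_size])
    return keys_to_evict
```

With a set, each membership check during the filter is constant time, so the eviction pass stays O(n log n) from the sort rather than degrading to O(n²).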
```python
for attempt in range(max_retries + 1):
    # Check total deadline before each attempt (Issue #549 review feedback)
    if total_deadline is not None:
        elapsed = time.monotonic() - start_time
        if elapsed >= total_deadline:
            logger.error(
                f"Database operation exceeded total deadline of "
                f"{total_deadline}s after {attempt} attempts"
            )
            raise DatabaseTimeoutError(
                f"Operation exceeded total deadline of {total_deadline}s"
            )
```
New timeout behavior is introduced (per-attempt asyncio.wait_for(...) raising DatabaseTimeoutError, plus the new total_deadline enforcement in execute_with_retry), but the existing db-retry test suite doesn’t cover these paths. Add unit tests in tests/test_db_retry.py to validate: (1) query timeouts raise DatabaseTimeoutError, and (2) total_deadline stops retries once exceeded.
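The deadline behavior such a test would pin down can be sketched with a standalone re-implementation of the retry loop (hypothetical, since `api/db_retry.py`'s internals differ in logging and error classification):

```python
import asyncio
import time


class DatabaseTimeoutError(Exception):
    pass


async def execute_with_retry(op, max_retries=5, total_deadline=0.2, delay=0.05):
    """Sketch of the deadline-bounded retry loop described in the PR."""
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        # Check the total deadline before each attempt
        if time.monotonic() - start >= total_deadline:
            raise DatabaseTimeoutError(
                f"Operation exceeded total deadline of {total_deadline}s"
            )
        try:
            return await op()
        except OSError:
            # Sleep no longer than the remaining deadline budget
            remaining = total_deadline - (time.monotonic() - start)
            await asyncio.sleep(min(delay, max(remaining, 0.0)))
    raise DatabaseTimeoutError("retries exhausted")


async def always_fails():
    raise OSError("transient")


def deadline_stops_retries():
    try:
        asyncio.run(execute_with_retry(always_fails))
    except DatabaseTimeoutError:
        return True
    return False
```

A real test in `tests/test_db_retry.py` would additionally assert that per-query timeouts surface as `DatabaseTimeoutError` without being retried.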
```python
rows = await fetch_all_with_retry(query)

# Extract total from first row (all rows have same total_count)
total = rows[0]["total_count"] if rows else 0
vods = [vod_to_response(dict(row)) for row in rows]
```
total is derived from the first returned row’s total_count. When the requested page is empty (e.g., page is past the last page), rows is empty and this returns total = 0, which is incorrect and can break pagination UI logic. Consider falling back to a separate COUNT query when rows is empty (or restructure the query so total can be returned even for empty pages).
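The suggested fallback can be sketched as a small helper, with `count_fn` standing in for a separate COUNT query (both names are illustrative, not the PR's code):

```python
def derive_total(rows, count_fn):
    """Prefer the COUNT(*) OVER() value carried by every returned row;
    fall back to a separate count when the page came back empty."""
    if rows:
        # Every row carries the same window-function total
        return rows[0]["total_count"]
    # Empty page (e.g. page number past the end): window count unavailable
    return count_fn()


rows = [{"id": 1, "total_count": 42}, {"id": 2, "total_count": 42}]
assert derive_total(rows, count_fn=lambda: 0) == 42
assert derive_total([], count_fn=lambda: 42) == 42
```

The fallback query only runs on empty pages, so the common case still avoids the second round trip.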
```python
# Add window function for total count (Issue #544)
# This computes total in a single query, avoiding a separate COUNT query
query = query.add_columns(
    sa.func.count().over().label("total_count")
)
```
This endpoint filters by videos.deleted_at IS NULL, optional videos.status, and (for non-admins) videos.owner_id OR live_streams.owner_id, then orders by videos.created_at DESC. There is no composite index covering these access patterns, so the window COUNT(*) OVER() can still require large scans/sorts on bigger datasets. Consider adding a migration for an index that matches the common filters/order (e.g., a composite/partial index involving owner_id, deleted_at, status, and created_at).
Changes based on Copilot PR review:

1. studio_chat.py + live_schemas.py:
   - Make `total` field Optional[int], return None instead of count
   - Documents that field is deprecated, use `has_more` for pagination
   - Maintains backwards API compatibility (field still exists)
2. live_metrics.py:
   - Convert keys_to_evict list to set for O(1) membership checks
   - Fixes potential O(n²) eviction performance issue
3. studio_vod.py:
   - Add fallback COUNT query when page is empty
   - Prevents incorrect total=0 for pages past last page
   - Add comment noting index requirement for optimal performance
4. tests/test_db_retry.py:
   - Add tests for DatabaseTimeoutError behavior
   - Add tests for total_deadline parameter
   - Verify timeouts are not retried
   - Verify deadline stops retries when exceeded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary

- `COUNT(*) OVER()` window function for efficient O(1) pagination

Test plan
Closes #544, #545, #546, #549, #553
🤖 Generated with Claude Code