
fix: Performance and reliability improvements for database and caching#565

Merged
filthyrake merged 3 commits into dev from fix/performance-reliability-improvements on Feb 2, 2026

Conversation

@filthyrake
Owner

Summary

Test plan

  • All db_retry tests pass (31/31)
  • All modified modules import cleanly
  • Verify VOD list pagination still works correctly
  • Verify chat message pagination still works correctly
  • Verify bitrate estimation under concurrent load

Closes #544, #545, #546, #549, #553

🤖 Generated with Claude Code

filthyrake and others added 2 commits February 1, 2026 14:48
fix: Performance and reliability improvements for database and caching (#544, #545, #546, #549, #553)

- Add 30s timeout to all database retry operations to prevent indefinite hangs (#549)
- Replace VOD list COUNT query with window function for O(1) pagination (#544)
- Replace chat pagination COUNT query with limit+1 pattern (#546)
- Add sharded locks (16 shards) to bitrate cache to reduce contention (#545)
- Add max size limit (1000 entries) to bitrate cache to prevent memory leaks (#553)
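The limit+1 pattern in the third bullet can be sketched as follows (helper names are illustrative, not the repo's actual code): fetch one row beyond the page size; if the surplus row comes back, another page exists, and the extra row is dropped before returning.

```python
def paginate(fetch_rows, limit):
    """Return (page, has_more) without ever running a COUNT query."""
    rows = fetch_rows(limit + 1)    # ask the backend for one extra row
    has_more = len(rows) > limit    # extra row present => another page exists
    return rows[:limit], has_more

# Example with an in-memory stand-in for the database:
data = list(range(10))
page, has_more = paginate(lambda n: data[:n], limit=4)          # first page
last, more = paginate(lambda n: data[8:8 + n], limit=4)         # final page
```

The trade-off is that the true total is never computed, which is why the PR pairs this with a `has_more` flag instead of a `total` count.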

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes based on specialized agent reviews:

db_retry.py (Issue #549 review feedback):
- Add total_deadline parameter (default 60s) to bound cumulative retry time
- Document that write timeouts are NOT retried (write may have committed)
- Check deadline before each retry attempt
- Ensure sleep doesn't exceed remaining deadline
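The four bullets above can be sketched as a single loop. This is a minimal synchronous stand-in (the real db_retry.py is presumably async, and all names here are assumptions):

```python
import time


class DatabaseTimeoutError(Exception):
    pass


def execute_with_retry(op, max_retries=3, total_deadline=60.0, base_delay=0.1):
    """Retry op() with backoff, bounded by a cumulative deadline."""
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        # Check the cumulative deadline before every attempt.
        if total_deadline is not None:
            if time.monotonic() - start >= total_deadline:
                raise DatabaseTimeoutError(
                    f"Operation exceeded total deadline of {total_deadline}s"
                )
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise
            # Exponential backoff, but never sleep past the deadline.
            delay = base_delay * (2 ** attempt)
            if total_deadline is not None:
                remaining = total_deadline - (time.monotonic() - start)
                delay = min(delay, max(remaining, 0.0))
            time.sleep(delay)
```

A `total_deadline` of 0 fails immediately; `None` disables the cumulative bound and leaves only `max_retries` as the limit.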

live_metrics.py (Issue #545/#553 review feedback):
- Replace all-locks eviction with single eviction lock
- Eviction now acquires only per-shard locks when deleting
- Prevents concurrent evictions without blocking normal cache operations
- Normal cache reads/writes can proceed during eviction

live_schemas.py + studio_chat.py:
- Document that ChatMessageListResponse.total is count of returned messages
- Clarify this is intentional for cursor-based pagination (use has_more instead)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR improves reliability and performance across the studio/live stack by adding database operation timeouts, optimizing pagination queries, and reducing contention/unbounded growth in the bitrate cache.

Changes:

  • Add per-operation DB query timeouts and a total retry deadline in db_retry.
  • Optimize VOD list pagination by using a COUNT(*) OVER() window function.
  • Optimize chat message pagination by switching from COUNT queries to a limit + 1 approach.
  • Reduce bitrate cache lock contention via sharded locks and cap cache growth with a max-size eviction strategy.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

  • api/studio_vod.py: Replaces separate COUNT query with COUNT(*) OVER() and derives total from returned rows.
  • api/studio_chat.py: Removes COUNT query for chat pagination, uses limit+1 to compute has_more, and changes total semantics.
  • api/live_schemas.py: Updates chat list schema docs to reflect cursor pagination and the new total meaning.
  • api/live_metrics.py: Introduces sharded locks and max-size eviction for the bitrate cache.
  • api/db_retry.py: Adds per-query timeouts and an overall retry deadline to prevent indefinite hangs.
Comments suppressed due to low confidence (1)

api/studio_chat.py:234

  • The new pagination query relies on WHERE stream_id = ? AND deleted_at IS NULL plus id < before_id and ORDER BY id DESC LIMIT …. There isn’t an index matching this pattern (existing migration creates ix_chat_messages_stream_created on (stream_id, created_at)), so this can still degrade into inefficient scans on large chats. Consider adding a migration for a composite/partial index such as (stream_id, id DESC) with WHERE deleted_at IS NULL (or including deleted_at in the index) to support the new access path.
    query = (
        sa.select(
            chat_messages,
            users.c.username.label("username"),
        )
        .select_from(
            chat_messages.outerjoin(users, chat_messages.c.user_id == users.c.id)
        )
        .where(chat_messages.c.stream_id == stream["id"])
        .where(chat_messages.c.deleted_at.is_(None))  # Exclude deleted messages
        .order_by(chat_messages.c.id.desc())
        .limit(effective_limit + 1)  # Fetch one extra to detect has_more
    )

    if before_id:
        query = query.where(chat_messages.c.id < before_id)
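A migration along the lines the reviewer suggests might look like this. This is a hypothetical Alembic sketch; the PR does not actually add it, and the index name is invented.

```python
# Hypothetical Alembic migration for the suggested access path:
# (stream_id, id DESC) restricted to non-deleted rows.
import sqlalchemy as sa
from alembic import op


def upgrade():
    op.create_index(
        "ix_chat_messages_stream_id_desc",     # invented name
        "chat_messages",
        ["stream_id", sa.text("id DESC")],
        postgresql_where=sa.text("deleted_at IS NULL"),  # partial index
    )


def downgrade():
    op.drop_index(
        "ix_chat_messages_stream_id_desc",
        table_name="chat_messages",
    )
```

The `postgresql_where` clause makes this a partial index on PostgreSQL; on backends without partial-index support, `deleted_at` would instead need to be an indexed column.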


Comment on lines +256 to +260
# Note: 'total' is count of returned messages, NOT total in database.
# This is intentional for cursor-based pagination - use 'has_more' for pagination.
# (Issue #546 - removed expensive COUNT query for performance)
total=len(messages),
has_more=has_more,

Copilot AI Feb 2, 2026


This changes the meaning of the total field from “total messages matching the query” to “messages returned in this page”. That’s a backwards-incompatible API change and the field name is likely to be misinterpreted by clients. Consider either (a) renaming the field (e.g., returned/count) and keeping total semantics, or (b) making total optional/omitted and introducing a new field for per-page count, to avoid breaking existing consumers.

Comment on lines +176 to +185
remaining_after_expired = len(_bitrate_cache) - len(keys_to_evict)
if remaining_after_expired > BITRATE_CACHE_MAX_SIZE:
    # Sort non-expired entries by expiry time (oldest first)
    non_expired = [
        (k, v) for k, v in _bitrate_cache.items()
        if k not in keys_to_evict
    ]
    non_expired.sort(key=lambda x: x[1][1])
    evict_count = remaining_after_expired - BITRATE_CACHE_MAX_SIZE
    keys_to_evict.extend(k for k, _ in non_expired[:evict_count])

Copilot AI Feb 2, 2026


In _enforce_cache_size_limit, non_expired filtering does if k not in keys_to_evict where keys_to_evict is a list. That makes eviction selection potentially O(n^2) as cache size grows past the limit. Consider using a set(keys_to_evict) for membership checks (and/or building non_expired from a snapshot like list(_bitrate_cache.items())) to keep eviction overhead predictable.
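The suggested fix can be sketched as a toy stand-alone function (not the repo's actual `_enforce_cache_size_limit`): build a set once, then filter a snapshot of the cache items against it.

```python
def select_non_expired(cache_items, keys_to_evict):
    """Pick non-expired entries, oldest expiry first, in ~O(n log n)."""
    evict_set = set(keys_to_evict)          # O(1) membership instead of O(n)
    non_expired = [
        (k, v) for k, v in cache_items      # iterate over a snapshot
        if k not in evict_set
    ]
    non_expired.sort(key=lambda item: item[1][1])  # sort by expiry time
    return non_expired

# Each value is (bitrate, expires_at):
items = [("a", (1, 5.0)), ("b", (2, 1.0)), ("c", (3, 3.0))]
result = select_non_expired(items, ["a"])
```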

Comment on lines 176 to +201
for attempt in range(max_retries + 1):
    # Check total deadline before each attempt (Issue #549 review feedback)
    if total_deadline is not None:
        elapsed = time.monotonic() - start_time
        if elapsed >= total_deadline:
            logger.error(f"Database operation exceeded total deadline of {total_deadline}s after {attempt} attempts")
            raise DatabaseTimeoutError(f"Operation exceeded total deadline of {total_deadline}s")


Copilot AI Feb 2, 2026


New timeout behavior is introduced (per-attempt asyncio.wait_for(...) raising DatabaseTimeoutError, plus the new total_deadline enforcement in execute_with_retry), but the existing db-retry test suite doesn’t cover these paths. Add unit tests in tests/test_db_retry.py to validate: (1) query timeouts raise DatabaseTimeoutError, and (2) total_deadline stops retries once exceeded.

Comment on lines 238 to 242
rows = await fetch_all_with_retry(query)

# Extract total from first row (all rows have same total_count)
total = rows[0]["total_count"] if rows else 0
vods = [vod_to_response(dict(row)) for row in rows]

Copilot AI Feb 2, 2026


total is derived from the first returned row’s total_count. When the requested page is empty (e.g., page is past the last page), rows is empty and this returns total = 0, which is incorrect and can break pagination UI logic. Consider falling back to a separate COUNT query when rows is empty (or restructure the query so total can be returned even for empty pages).

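The fallback the reviewer suggests can be sketched as follows (hypothetical helper, not the repo's code): take the window-function total from the first row when the page is non-empty, and only run a separate COUNT on the rare empty-page path.

```python
def total_from_page(rows, count_query):
    """Derive the total row count for a paginated result."""
    if rows:
        return rows[0]["total_count"]   # COUNT(*) OVER() value, same on every row
    return count_query()                # empty page: fall back to one COUNT query

# Stand-in data: a populated page and an empty page past the end.
rows = [{"id": 1, "total_count": 42}, {"id": 2, "total_count": 42}]
full_page_total = total_from_page(rows, lambda: 42)
empty_page_total = total_from_page([], lambda: 42)
```

The fallback costs one extra query, but only for requests past the last page, so the common case keeps the single-query behavior of issue #544.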
Comment on lines +229 to 233
# Add window function for total count (Issue #544)
# This computes total in a single query, avoiding a separate COUNT query
query = query.add_columns(
    sa.func.count().over().label("total_count")
)

Copilot AI Feb 2, 2026


This endpoint filters by videos.deleted_at IS NULL, optional videos.status, and (for non-admins) videos.owner_id OR live_streams.owner_id, then orders by videos.created_at DESC. There is no composite index covering these access patterns, so the window COUNT(*) OVER() can still require large scans/sorts on bigger datasets. Consider adding a migration for an index that matches the common filters/order (e.g., a composite/partial index involving owner_id, deleted_at, status, and created_at).

Changes based on Copilot PR review:

1. studio_chat.py + live_schemas.py:
   - Make `total` field Optional[int], return None instead of count
   - Document that the field is deprecated; use `has_more` for pagination
   - Maintain backwards API compatibility (field still exists)

2. live_metrics.py:
   - Convert keys_to_evict list to set for O(1) membership checks
   - Fixes potential O(n²) eviction performance issue

3. studio_vod.py:
   - Add fallback COUNT query when page is empty
   - Prevents incorrect total=0 for pages past last page
   - Add comment noting index requirement for optimal performance

4. tests/test_db_retry.py:
   - Add tests for DatabaseTimeoutError behavior
   - Add tests for total_deadline parameter
   - Verify timeouts are not retried
   - Verify deadline stops retries when exceeded
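Tests along these lines could be shaped roughly as follows. This is a self-contained sketch with a stand-in async retry helper, not the repo's actual tests/test_db_retry.py; every name here is an assumption.

```python
import asyncio
import time


class DatabaseTimeoutError(Exception):
    pass


async def execute_with_retry(op, total_deadline=60.0):
    # Minimal stand-in modeling only the behaviors under test:
    # a cumulative deadline, and timeouts that are NOT retried.
    start = time.monotonic()
    while True:
        if time.monotonic() - start >= total_deadline:
            raise DatabaseTimeoutError("total deadline exceeded")
        try:
            return await op()
        except asyncio.TimeoutError:
            raise DatabaseTimeoutError("query timed out")  # no retry
        except Exception:
            await asyncio.sleep(0.01)                      # retry path


async def _always_fails():
    raise RuntimeError("transient error")


async def _times_out():
    raise asyncio.TimeoutError


def test_total_deadline_stops_retries():
    try:
        asyncio.run(execute_with_retry(_always_fails, total_deadline=0.05))
    except DatabaseTimeoutError:
        return True
    return False


def test_timeout_not_retried():
    try:
        asyncio.run(execute_with_retry(_times_out))
    except DatabaseTimeoutError:
        return True
    return False
```

In a real pytest suite these would assert with `pytest.raises(DatabaseTimeoutError)` against the actual api/db_retry.py helper rather than returning booleans.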

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@filthyrake filthyrake merged commit f022fe9 into dev Feb 2, 2026
2 of 3 checks passed
@filthyrake filthyrake deleted the fix/performance-reliability-improvements branch February 2, 2026 00:33
@filthyrake filthyrake review requested due to automatic review settings March 23, 2026 23:11