fix: Performance and reliability improvements for database and caching#565
filthyrake merged 3 commits into `dev`
Conversation
(#544, #545, #546, #549, #553)

- Add 30s timeout to all database retry operations to prevent indefinite hangs (#549)
- Replace VOD list COUNT query with window function for O(1) pagination (#544)
- Replace chat pagination COUNT query with limit+1 pattern (#546)
- Add sharded locks (16 shards) to bitrate cache to reduce contention (#545)
- Add max size limit (1000 entries) to bitrate cache to prevent memory leaks (#553)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes based on specialized agent reviews:

db_retry.py (Issue #549 review feedback):
- Add total_deadline parameter (default 60s) to bound cumulative retry time
- Document that write timeouts are NOT retried (write may have committed)
- Check deadline before each retry attempt
- Ensure sleep doesn't exceed remaining deadline

live_metrics.py (Issue #545/#553 review feedback):
- Replace all-locks eviction with single eviction lock
- Eviction now acquires only per-shard locks when deleting
- Prevents concurrent evictions without blocking normal cache operations
- Normal cache reads/writes can proceed during eviction

live_schemas.py + studio_chat.py:
- Document that ChatMessageListResponse.total is count of returned messages
- Clarify this is intentional for cursor-based pagination (use has_more instead)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR improves reliability and performance across the studio/live stack by adding database operation timeouts, optimizing pagination queries, and reducing contention/unbounded growth in the bitrate cache.
Changes:
- Add per-operation DB query timeouts and a total retry deadline in `db_retry`.
- Optimize VOD list pagination by using a `COUNT(*) OVER()` window function.
- Optimize chat message pagination by switching from COUNT queries to a `limit + 1` approach.
- Reduce bitrate cache lock contention via sharded locks and cap cache growth with a max-size eviction strategy.
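The `limit + 1` pattern can be sketched in plain Python, independent of the PR's actual SQLAlchemy code (the `rows_fetcher` callable here is a hypothetical stand-in for the database query):

```python
def paginate_limit_plus_one(rows_fetcher, limit):
    """Cursor pagination without COUNT: fetch limit+1 rows and use the
    extra row only as a probe for whether another page exists."""
    rows = rows_fetcher(limit + 1)  # one extra row as the has_more probe
    has_more = len(rows) > limit
    return rows[:limit], has_more

# Usage with a fake data source of 25 rows:
data = list(range(25))
page, has_more = paginate_limit_plus_one(lambda n: data[:n], limit=10)
```

The page returned to the client never includes the probe row; the caller exposes `has_more` instead of a total count, which is what makes the separate COUNT query unnecessary.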
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| api/studio_vod.py | Replaces separate COUNT query with COUNT(*) OVER() and derives total from returned rows. |
| api/studio_chat.py | Removes COUNT query for chat pagination, uses limit+1 to compute has_more, and changes total semantics. |
| api/live_schemas.py | Updates chat list schema docs to reflect cursor pagination and new total meaning. |
| api/live_metrics.py | Introduces sharded locks and max-size eviction for bitrate cache. |
| api/db_retry.py | Adds per-query timeouts and an overall retry deadline to prevent indefinite hangs. |
Comments suppressed due to low confidence (1)
api/studio_chat.py:234
- The new pagination query relies on `WHERE stream_id = ? AND deleted_at IS NULL` plus `id < before_id` and `ORDER BY id DESC LIMIT …`. There isn't an index matching this pattern (the existing migration creates `ix_chat_messages_stream_created` on `(stream_id, created_at)`), so this can still degrade into inefficient scans on large chats. Consider adding a migration for a composite/partial index such as `(stream_id, id DESC)` with `WHERE deleted_at IS NULL` (or including `deleted_at` in the index) to support the new access path.
```python
query = (
    sa.select(
        chat_messages,
        users.c.username.label("username"),
    )
    .select_from(
        chat_messages.outerjoin(users, chat_messages.c.user_id == users.c.id)
    )
    .where(chat_messages.c.stream_id == stream["id"])
    .where(chat_messages.c.deleted_at.is_(None))  # Exclude deleted messages
    .order_by(chat_messages.c.id.desc())
    .limit(effective_limit + 1)  # Fetch one extra to detect has_more
)
if before_id:
    query = query.where(chat_messages.c.id < before_id)
```
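The suggested composite/partial index can be sketched against SQLite as a stand-in engine (the real project would add it through a migration, and the index name here is illustrative):

```python
import sqlite3

# Minimal schema mirroring the columns the review comment mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_messages ("
    " id INTEGER PRIMARY KEY, stream_id INTEGER, deleted_at TEXT)"
)
# Composite partial index matching the new access path:
# equality on stream_id, descending range scan on id, deleted rows excluded.
conn.execute(
    "CREATE INDEX ix_chat_messages_stream_active "
    "ON chat_messages (stream_id, id DESC) "
    "WHERE deleted_at IS NULL"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM chat_messages "
    "WHERE stream_id = ? AND deleted_at IS NULL AND id < ? "
    "ORDER BY id DESC LIMIT 50",
    (1, 100),
).fetchall()
# The plan should report a search using ix_chat_messages_stream_active
print(plan)
```

Because the query's `deleted_at IS NULL` predicate matches the index's `WHERE` clause, the partial index is eligible, and the `(stream_id, id DESC)` column order serves both the equality filter and the `ORDER BY id DESC` without a sort.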
api/studio_chat.py
```python
# Note: 'total' is count of returned messages, NOT total in database.
# This is intentional for cursor-based pagination - use 'has_more' for pagination.
# (Issue #546 - removed expensive COUNT query for performance)
total=len(messages),
has_more=has_more,
```
This changes the meaning of the total field from “total messages matching the query” to “messages returned in this page”. That’s a backwards-incompatible API change and the field name is likely to be misinterpreted by clients. Consider either (a) renaming the field (e.g., returned/count) and keeping total semantics, or (b) making total optional/omitted and introducing a new field for per-page count, to avoid breaking existing consumers.
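Option (b) can be sketched with a plain dataclass (the real schema uses Pydantic, and the `returned` field name is illustrative, not the PR's):

```python
from dataclasses import dataclass
from typing import Optional


# Sketch of option (b): keep `total` on the wire for compatibility but
# stop populating it, and expose the per-page count under a new name.
@dataclass
class ChatMessageListResponse:
    messages: list
    has_more: bool
    returned: int = 0             # count of messages in this page
    total: Optional[int] = None   # deprecated; use has_more for pagination


resp = ChatMessageListResponse(messages=["a", "b"], has_more=True, returned=2)
```

Existing consumers that read `total` still find the field present (as `null`), while new consumers get an unambiguous per-page count.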
```python
remaining_after_expired = len(_bitrate_cache) - len(keys_to_evict)
if remaining_after_expired > BITRATE_CACHE_MAX_SIZE:
    # Sort non-expired entries by expiry time (oldest first)
    non_expired = [
        (k, v) for k, v in _bitrate_cache.items()
        if k not in keys_to_evict
    ]
    non_expired.sort(key=lambda x: x[1][1])
    evict_count = remaining_after_expired - BITRATE_CACHE_MAX_SIZE
    keys_to_evict.extend(k for k, _ in non_expired[:evict_count])
```
In _enforce_cache_size_limit, non_expired filtering does if k not in keys_to_evict where keys_to_evict is a list. That makes eviction selection potentially O(n^2) as cache size grows past the limit. Consider using a set(keys_to_evict) for membership checks (and/or building non_expired from a snapshot like list(_bitrate_cache.items())) to keep eviction overhead predictable.
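The set-based fix the reviewer suggests can be sketched as a standalone function (cache values are assumed to be `(bitrate, expiry)` tuples, matching the sort key in the quoted code; the function name is illustrative):

```python
BITRATE_CACHE_MAX_SIZE = 1000  # matches the limit described in the PR


def select_evictions(cache, now, max_size=BITRATE_CACHE_MAX_SIZE):
    """Pick keys to evict: all expired entries, plus the soonest-expiring
    survivors needed to get back under max_size. Uses a set so the
    'not already marked' check is O(1) instead of O(n)."""
    keys_to_evict = {k for k, (_, expiry) in cache.items() if expiry <= now}
    remaining = len(cache) - len(keys_to_evict)
    if remaining > max_size:
        # Sort surviving entries by expiry (soonest first) and evict just
        # enough of them to satisfy the size cap.
        non_expired = sorted(
            ((k, v) for k, v in cache.items() if k not in keys_to_evict),
            key=lambda kv: kv[1][1],
        )
        keys_to_evict.update(k for k, _ in non_expired[: remaining - max_size])
    return keys_to_evict
```

With a set, each membership check during the filter is constant time, so the eviction pass stays O(n log n) from the sort rather than degrading to O(n²).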
```python
for attempt in range(max_retries + 1):
    # Check total deadline before each attempt (Issue #549 review feedback)
    if total_deadline is not None:
        elapsed = time.monotonic() - start_time
        if elapsed >= total_deadline:
            logger.error(
                f"Database operation exceeded total deadline of "
                f"{total_deadline}s after {attempt} attempts"
            )
            raise DatabaseTimeoutError(
                f"Operation exceeded total deadline of {total_deadline}s"
            )
```
New timeout behavior is introduced (per-attempt asyncio.wait_for(...) raising DatabaseTimeoutError, plus the new total_deadline enforcement in execute_with_retry), but the existing db-retry test suite doesn’t cover these paths. Add unit tests in tests/test_db_retry.py to validate: (1) query timeouts raise DatabaseTimeoutError, and (2) total_deadline stops retries once exceeded.
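The deadline behavior such a test would pin down can be sketched with a standalone re-implementation of the retry loop (hypothetical, since `api/db_retry.py`'s internals differ in logging and error classification):

```python
import asyncio
import time


class DatabaseTimeoutError(Exception):
    pass


async def execute_with_retry(op, max_retries=5, total_deadline=0.2, delay=0.05):
    """Sketch of the deadline-bounded retry loop described in the PR."""
    start = time.monotonic()
    for attempt in range(max_retries + 1):
        # Check the total deadline before each attempt
        if time.monotonic() - start >= total_deadline:
            raise DatabaseTimeoutError(
                f"Operation exceeded total deadline of {total_deadline}s"
            )
        try:
            return await op()
        except OSError:
            # Sleep no longer than the remaining deadline budget
            remaining = total_deadline - (time.monotonic() - start)
            await asyncio.sleep(min(delay, max(remaining, 0.0)))
    raise DatabaseTimeoutError("retries exhausted")


async def always_fails():
    raise OSError("transient")


def deadline_stops_retries():
    try:
        asyncio.run(execute_with_retry(always_fails))
    except DatabaseTimeoutError:
        return True
    return False
```

A real test in `tests/test_db_retry.py` would additionally assert that per-query timeouts surface as `DatabaseTimeoutError` without being retried.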
```python
rows = await fetch_all_with_retry(query)

# Extract total from first row (all rows have same total_count)
total = rows[0]["total_count"] if rows else 0
vods = [vod_to_response(dict(row)) for row in rows]
```
total is derived from the first returned row’s total_count. When the requested page is empty (e.g., page is past the last page), rows is empty and this returns total = 0, which is incorrect and can break pagination UI logic. Consider falling back to a separate COUNT query when rows is empty (or restructure the query so total can be returned even for empty pages).
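The suggested fallback can be sketched as a small helper, with `count_fn` standing in for a separate COUNT query (both names are illustrative, not the PR's code):

```python
def derive_total(rows, count_fn):
    """Prefer the COUNT(*) OVER() value carried by every returned row;
    fall back to a separate count when the page came back empty."""
    if rows:
        # Every row carries the same window-function total
        return rows[0]["total_count"]
    # Empty page (e.g. page number past the end): window count unavailable
    return count_fn()


rows = [{"id": 1, "total_count": 42}, {"id": 2, "total_count": 42}]
assert derive_total(rows, count_fn=lambda: 0) == 42
assert derive_total([], count_fn=lambda: 42) == 42
```

The fallback query only runs on empty pages, so the common case still avoids the second round trip.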
```python
# Add window function for total count (Issue #544)
# This computes total in a single query, avoiding a separate COUNT query
query = query.add_columns(
    sa.func.count().over().label("total_count")
)
```
This endpoint filters by videos.deleted_at IS NULL, optional videos.status, and (for non-admins) videos.owner_id OR live_streams.owner_id, then orders by videos.created_at DESC. There is no composite index covering these access patterns, so the window COUNT(*) OVER() can still require large scans/sorts on bigger datasets. Consider adding a migration for an index that matches the common filters/order (e.g., a composite/partial index involving owner_id, deleted_at, status, and created_at).
Changes based on Copilot PR review:

1. studio_chat.py + live_schemas.py:
   - Make `total` field Optional[int], return None instead of count
   - Documents that field is deprecated, use `has_more` for pagination
   - Maintains backwards API compatibility (field still exists)
2. live_metrics.py:
   - Convert keys_to_evict list to set for O(1) membership checks
   - Fixes potential O(n²) eviction performance issue
3. studio_vod.py:
   - Add fallback COUNT query when page is empty
   - Prevents incorrect total=0 for pages past last page
   - Add comment noting index requirement for optimal performance
4. tests/test_db_retry.py:
   - Add tests for DatabaseTimeoutError behavior
   - Add tests for total_deadline parameter
   - Verify timeouts are not retried
   - Verify deadline stops retries when exceeded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary

- `COUNT(*) OVER()` window function for efficient O(1) pagination

Test plan
Closes #544, #545, #546, #549, #553
🤖 Generated with Claude Code