fix: non-blocking background prefetch for thumbnail fallback (#67)#70
Conversation
…ailures Add 20 tests across two files that identify and reproduce the root cause of thumbnail loading failures on large catalogs (500+ artworks). Root cause: client-server timeout race condition. The server-side individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the TV is struggling, but the frontend's AbortController timeout is 30s. The client always aborts before receiving the response, never sees the fallback:true flag, so the progress bar never appears. Three retries create a ~96s loop ending in 'Tap to retry' on all thumbnails. test_issue67_diagnosis.py (14 tests): - Thumbnail key matching with samsungtvws format - Individual fallback cascade timing (20 TV calls per failed batch) - TV operation timeout and D2D socket orphaning - Lock starvation during concurrent fallback operations - Large catalog scenario (63 TV hits per batch of 20) - Client timeout race: proves 1 failing ID = 58s worst-case (>30s) test_reproduce_nitrowolf.py (6 tests): - Server response exceeds client timeout (53s vs 30s) — matches logs - Click-to-retry reads from disk cache (2.6ms) — matches user report - Progress bar never shown (client aborts before response) - Three-attempt retry loop (96s wall time → 'Tap to retry') - Concurrent batches compounding the timeout problem - Settings endpoint blocked by TV lock during fallback Uses 50x time scaling for fast test execution while preserving the timing relationships that cause the failures. Refs: #67, #68, #69
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24a94a32d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| data = resp.json() | ||
|
|
||
| # 1. Server response took longer than the client timeout | ||
| assert server_time > SCALED_CLIENT_TIMEOUT, ( |
There was a problem hiding this comment.
Avoid locking in the thumbnail timeout regression
Because this reproduction test is part of the normal pytest suite, this assertion makes the current timeout race the expected behavior: any production fix that brings /api/thumbnails responses back under the frontend timeout will fail CI here before it can ship. If the goal is diagnostic coverage for #67, mark these reproductions as expected failures or assert the desired bounded-response behavior instead of requiring the server to stay slower than the client.
Useful? React with 👍 / 👎.
| def hanging_fn(art): | ||
| import threading | ||
| # Simulate the D2D socket blocking on recv for a long time | ||
| time.sleep(10) |
There was a problem hiding this comment.
Avoid leaving a sleeping executor thread in the suite
In this timeout test, _tv_op cancels the await on asyncio.to_thread, but the worker thread running hanging_fn keeps sleeping for seconds after the assertion. When this test is run in the full suite, that non-daemon executor thread can delay event-loop cleanup or process exit by the full 10s sleep; use a much shorter blocking interval or a controllable synchronization primitive instead so the diagnostic test does not add a fixed 10s stall to every run.
Useful? React with 👍 / 👎.
- Mark reproduction test classes with @pytest.mark.xfail(strict=False) so a production fix for #67 won't break CI. Tests currently show as XPASS (bug still exists); once fixed they'll become XFAIL (expected). - Replace time.sleep(10) in hanging_fn with threading.Event.wait(2) and cancel.set() after assertions, so the executor thread exits promptly instead of stalling the suite for 10s.
Replace synchronous individual thumbnail fallback (which blocked 53s+ for large catalogs) with asyncio.create_task() background prefetch. Server changes: - Add _prefetch_thumbnails() background task that fetches thumbnails individually via _tv_op() and caches to disk - Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches - get_thumbnails_batch() now returns immediately with fallback=True and all uncached IDs as missing, spawning background prefetch Client changes: - In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries with escalating backoff) to pick up newly cached thumbnails Test updates: - conftest: reset _thumb_prefetch_in_progress and _tv_lock per test - test_api_endpoints: verify immediate response + background cache - test_issue67_diagnosis: update 5 diagnostic tests for new async behavior (fast response, background individual calls, disk cache)
There was a problem hiding this comment.
Pull request overview
This PR changes the thumbnail fallback flow to avoid blocking /api/thumbnails responses when batch fetching fails: the server now returns immediately (with fallback: true and missing IDs) and prefetches missing thumbnails asynchronously into the disk cache. The frontend retry loop is adjusted to take advantage of this by allowing more attempts in fallback mode to pick up cached results. The PR also adds two new large test modules to diagnose and reproduce the previously reported issue #67 behavior and updates existing endpoint tests/fixtures to match the new behavior.
Changes:
- Backend: make thumbnail fallback non-blocking by scheduling background per-ID prefetch and tracking in-progress IDs to avoid duplicate work.
- Frontend: increase retry attempts in fallback mode and adjust retry delay logic so the client can pick up newly cached thumbnails.
- Tests: add diagnostic + reproduction suites for #67 and update existing thumbnail fallback tests/fixtures for background prefetch semantics.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
server.py |
Implements background thumbnail prefetch on batch failure and tracks in-progress IDs. |
index.html |
Updates thumbnail batch retry strategy when fallback mode is detected. |
tests/conftest.py |
Resets new global prefetch-tracking state and lock per test. |
tests/test_api_endpoints.py |
Updates thumbnail fallback tests to expect immediate response + background caching. |
tests/test_issue67_diagnosis.py |
Adds diagnostic tests that explain timing/lock/timeout dynamics for #67. |
tests/test_reproduce_nitrowolf.py |
Adds xfail reproduction tests matching Nitrowolf’s reported symptoms and timing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -490,16 +509,16 @@ async def get_thumbnails_batch(body: dict): | |||
| except Exception as e: | |||
| log.warning("Batch thumbnail fetch failed, falling back to individual: %s", e) | |||
| # Background thumbnail prefetch — tracks IDs already being fetched so | ||
| # concurrent/retry requests don't spawn duplicate work. | ||
| _thumb_prefetch_in_progress: set[str] = set() | ||
|
|
||
|
|
||
| async def _prefetch_thumbnails(content_ids: list[str]) -> None: | ||
| """Fetch thumbnails individually in the background and cache to disk.""" | ||
| for cid in content_ids: |
| // When the server is prefetching thumbnails in the background | ||
| // (fallback mode), allow more retries with shorter delays so the | ||
| // client picks up cached results as they become available. | ||
| const MAX_ATTEMPTS = _thumbFallbackSeen ? 8 : 3; | ||
| const retryDelay = (a) => _thumbFallbackSeen ? 3000 : 2000 * (a + 1); |
| # Give the background task time to run | ||
| import asyncio | ||
| await asyncio.sleep(0.2) |
| # Give the background task time to run | ||
| import asyncio | ||
| await asyncio.sleep(0.2) |
| async def test_526_artworks_connection_count(self, client, mock_tv, tmp_data_dir, monkeypatch): | ||
| """With 524 artworks, the frontend sends 26 batches of 20 + 1 batch of 4. |
Samsung Frame WebSocket connections go stale after ~30-60s of inactivity, causing BrokenPipeError when the user interacts with the TV after browsing Atmosphere results or other idle periods. - Track _tv_last_used timestamp, updated after each successful _tv_op - _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s) - Stale connections are proactively closed and reopened instead of waiting for BrokenPipeError on the first attempt - Reset _tv_last_used in test fixtures for proper isolation
- Fix misleading log: 'falling back to individual' → 'scheduling background prefetch' (Copilot on server.py:524) - Fix misleading comment: 'shorter delays' → 'fixed 3s interval' with note about normal escalating backoff (Copilot on index.html:2879) - Replace fixed sleep(0.2) with bounded polling loop in fallback tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424) - Rename test_526 → test_524 to match docstring catalog size (Copilot on test_issue67_diagnosis.py:281)
|
The previously cached ones that weren't showing up until click now show up. However, most of them are still in Tap to retry status. The progress bar at the bottom did appear and count down but no new thumbnails showed up. What is the process of grabbing thumbnails? Are you just grabbing the image and then generating the thumbnail from that image or is there an actual thumbnail stored on the Frame device you're grabbing? |
Fix: Thumbnail loading timeout race (issue #67)
Problem
When the Samsung TV's batch thumbnail endpoint (
get_thumbnail_list) fails, the server fell back to fetching each thumbnail individually — blocking the HTTP response for 53+ seconds while the frontend times out at 30s.Root cause
The server's individual-fallback path makes up to
N × 3sequential_tv_opcalls (one per artwork, 3 retry attempts each), all while holding the response open. For 20 artworks that's ~53 seconds of blocking.Fix (production code changes)
server.py
get_thumbnails_batch()now returns immediately with{"fallback": true, "missing": [...]}when the batch call fails_prefetch_thumbnails) fetches individual thumbnails asynchronously and caches them to disk_thumb_prefetch_in_progress) prevents concurrent background fetches for the same content IDs_tv_last_usedand proactively reconnects when idle >TV_CONN_MAX_IDLE(30s)index.html
fallback: trueis received, the frontend retries up to 8 times at fixed 3s intervals (vs normal 3 attempts with escalating backoff)Tests
xfail(strict=False)since the fix changes the behavior)Review feedback addressed
sleep(0.2)in tests with bounded polling loopsthreading.Eventwith bounded waitCloses #67