fix: non-blocking background prefetch for thumbnail fallback (#67) by danmunz · Pull Request #70 · danmunz/docent

danmunz · 2026-06-02T15:04:02Z

Fix: Thumbnail loading timeout race (issue #67)

Problem

When the Samsung TV's batch thumbnail endpoint (get_thumbnail_list) fails, the server fell back to fetching each thumbnail individually — blocking the HTTP response for 53+ seconds while the frontend times out at 30s.

Root cause

The server's individual-fallback path makes up to N × 3 sequential _tv_op calls (one per artwork, 3 retry attempts each), all while holding the response open. For 20 artworks that's ~53 seconds of blocking.

Fix (production code changes)

server.py

get_thumbnails_batch() now returns immediately with {"fallback": true, "missing": [...]} when the batch call fails
A background task (_prefetch_thumbnails) fetches individual thumbnails asynchronously and caches them to disk
Deduplication set (_thumb_prefetch_in_progress) prevents concurrent background fetches for the same content IDs
WebSocket staleness detection: tracks _tv_last_used and proactively reconnects when idle > TV_CONN_MAX_IDLE (30s)

index.html

When fallback: true is received, the frontend retries up to 8 times at fixed 3s intervals (vs normal 3 attempts with escalating backoff)
This gives the background prefetch time to populate the cache

Tests

test_reproduce_nitrowolf.py: Reproduces the exact failure scenario from issue bug: TV operations fail with ConnectionFailure / socket closed on large catalogs #67 (marked xfail(strict=False) since the fix changes the behavior)
test_issue67_diagnosis.py: Tests for timeout behavior, lock contention, WebSocket staleness detection, and large-catalog scenarios
test_api_endpoints.py: Updated fallback tests to verify background prefetch behavior with polling-based assertions

Review feedback addressed

Fixed misleading log message ("falling back to individual" → "scheduling background prefetch")
Fixed misleading code comment about delay durations
Replaced fixed sleep(0.2) in tests with bounded polling loops
Fixed test name/docstring mismatch (526 → 524)
Hanging thread cleanup already uses threading.Event with bounded wait

Closes #67

…ailures Add 20 tests across two files that identify and reproduce the root cause of thumbnail loading failures on large catalogs (500+ artworks). Root cause: client-server timeout race condition. The server-side individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the TV is struggling, but the frontend's AbortController timeout is 30s. The client always aborts before receiving the response, never sees the fallback:true flag, so the progress bar never appears. Three retries create a ~96s loop ending in 'Tap to retry' on all thumbnails. test_issue67_diagnosis.py (14 tests): - Thumbnail key matching with samsungtvws format - Individual fallback cascade timing (20 TV calls per failed batch) - TV operation timeout and D2D socket orphaning - Lock starvation during concurrent fallback operations - Large catalog scenario (63 TV hits per batch of 20) - Client timeout race: proves 1 failing ID = 58s worst-case (>30s) test_reproduce_nitrowolf.py (6 tests): - Server response exceeds client timeout (53s vs 30s) — matches logs - Click-to-retry reads from disk cache (2.6ms) — matches user report - Progress bar never shown (client aborts before response) - Three-attempt retry loop (96s wall time → 'Tap to retry') - Concurrent batches compounding the timeout problem - Settings endpoint blocked by TV lock during fallback Uses 50x time scaling for fast test execution while preserving the timing relationships that cause the failures. Refs: #67, #68, #69

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24a94a32d2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T15:08:07Z

+        data = resp.json()
+
+        # 1. Server response took longer than the client timeout
+        assert server_time > SCALED_CLIENT_TIMEOUT, (


Avoid locking in the thumbnail timeout regression

Because this reproduction test is part of the normal pytest suite, this assertion makes the current timeout race the expected behavior: any production fix that brings /api/thumbnails responses back under the frontend timeout will fail CI here before it can ship. If the goal is diagnostic coverage for #67, mark these reproductions as expected failures or assert the desired bounded-response behavior instead of requiring the server to stay slower than the client.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-02T15:08:07Z

+        def hanging_fn(art):
+            import threading
+            # Simulate the D2D socket blocking on recv for a long time
+            time.sleep(10)


Avoid leaving a sleeping executor thread in the suite

In this timeout test, _tv_op cancels the await on asyncio.to_thread, but the worker thread running hanging_fn keeps sleeping for seconds after the assertion. When this test is run in the full suite, that non-daemon executor thread can delay event-loop cleanup or process exit by the full 10s sleep; use a much shorter blocking interval or a controllable synchronization primitive instead so the diagnostic test does not add a fixed 10s stall to every run.

Useful? React with 👍 / 👎.

- Mark reproduction test classes with @pytest.mark.xfail(strict=False) so a production fix for #67 won't break CI. Tests currently show as XPASS (bug still exists); once fixed they'll become XFAIL (expected). - Replace time.sleep(10) in hanging_fn with threading.Event.wait(2) and cancel.set() after assertions, so the executor thread exits promptly instead of stalling the suite for 10s.

Replace synchronous individual thumbnail fallback (which blocked 53s+ for large catalogs) with asyncio.create_task() background prefetch. Server changes: - Add _prefetch_thumbnails() background task that fetches thumbnails individually via _tv_op() and caches to disk - Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches - get_thumbnails_batch() now returns immediately with fallback=True and all uncached IDs as missing, spawning background prefetch Client changes: - In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries with escalating backoff) to pick up newly cached thumbnails Test updates: - conftest: reset _thumb_prefetch_in_progress and _tv_lock per test - test_api_endpoints: verify immediate response + background cache - test_issue67_diagnosis: update 5 diagnostic tests for new async behavior (fast response, background individual calls, disk cache)

Copilot

Pull request overview

This PR changes the thumbnail fallback flow to avoid blocking /api/thumbnails responses when batch fetching fails: the server now returns immediately (with fallback: true and missing IDs) and prefetches missing thumbnails asynchronously into the disk cache. The frontend retry loop is adjusted to take advantage of this by allowing more attempts in fallback mode to pick up cached results. The PR also adds two new large test modules to diagnose and reproduce the previously reported issue #67 behavior and updates existing endpoint tests/fixtures to match the new behavior.

Changes:

Backend: make thumbnail fallback non-blocking by scheduling background per-ID prefetch and tracking in-progress IDs to avoid duplicate work.
Frontend: increase retry attempts in fallback mode and adjust retry delay logic so the client can pick up newly cached thumbnails.
Tests: add diagnostic + reproduction suites for #67 and update existing thumbnail fallback tests/fixtures for background prefetch semantics.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`server.py`	Implements background thumbnail prefetch on batch failure and tracks in-progress IDs.
`index.html`	Updates thumbnail batch retry strategy when fallback mode is detected.
`tests/conftest.py`	Resets new global prefetch-tracking state and lock per test.
`tests/test_api_endpoints.py`	Updates thumbnail fallback tests to expect immediate response + background caching.
`tests/test_issue67_diagnosis.py`	Adds diagnostic tests that explain timing/lock/timeout dynamics for #67.
`tests/test_reproduce_nitrowolf.py`	Adds xfail reproduction tests matching Nitrowolf’s reported symptoms and timing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -490,16 +509,16 @@ async def get_thumbnails_batch(body: dict):
        except Exception as e:
            log.warning("Batch thumbnail fetch failed, falling back to individual: %s", e)


+# Background thumbnail prefetch — tracks IDs already being fetched so
+# concurrent/retry requests don't spawn duplicate work.
+_thumb_prefetch_in_progress: set[str] = set()
+
+
+async def _prefetch_thumbnails(content_ids: list[str]) -> None:
+    """Fetch thumbnails individually in the background and cache to disk."""
+    for cid in content_ids:


+      // When the server is prefetching thumbnails in the background
+      // (fallback mode), allow more retries with shorter delays so the
+      // client picks up cached results as they become available.
+      const MAX_ATTEMPTS = _thumbFallbackSeen ? 8 : 3;
+      const retryDelay = (a) => _thumbFallbackSeen ? 3000 : 2000 * (a + 1);


+        # Give the background task time to run
+        import asyncio
+        await asyncio.sleep(0.2)


+        # Give the background task time to run
+        import asyncio
+        await asyncio.sleep(0.2)


+    async def test_526_artworks_connection_count(self, client, mock_tv, tmp_data_dir, monkeypatch):
+        """With 524 artworks, the frontend sends 26 batches of 20 + 1 batch of 4.


Samsung Frame WebSocket connections go stale after ~30-60s of inactivity, causing BrokenPipeError when the user interacts with the TV after browsing Atmosphere results or other idle periods. - Track _tv_last_used timestamp, updated after each successful _tv_op - _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s) - Stale connections are proactively closed and reopened instead of waiting for BrokenPipeError on the first attempt - Reset _tv_last_used in test fixtures for proper isolation

- Fix misleading log: 'falling back to individual' → 'scheduling background prefetch' (Copilot on server.py:524) - Fix misleading comment: 'shorter delays' → 'fixed 3s interval' with note about normal escalating backoff (Copilot on index.html:2879) - Replace fixed sleep(0.2) with bounded polling loop in fallback tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424) - Rename test_526 → test_524 to match docstring catalog size (Copilot on test_issue67_diagnosis.py:281)

Nitrowolf · 2026-06-02T19:32:37Z

The previously cached ones that weren't showing up until click now show up. However, most of them are still in Tap to retry status. The progress bar at the bottom did appear and count down but no new thumbnails showed up.

What is the process of grabbing thumbnails? Are you just grabbing the image and then generating the thumbnail from that image or is there an actual thumbnail stored on the Frame device you're grabbing?

danmunz mentioned this pull request Jun 2, 2026

fix: consistent API shape & Set-based progress tracking for thumbnail fallback #69

Merged

4 tasks

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

danmunz added 2 commits June 2, 2026 11:36

danmunz changed the title ~~test: diagnostic and reproduction tests for issue #67 thumbnail failures~~ fix: non-blocking background prefetch for thumbnail fallback (#67) Jun 2, 2026

danmunz requested a review from Copilot June 2, 2026 16:01

Copilot started reviewing on behalf of danmunz June 2, 2026 16:02 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

danmunz added 2 commits June 2, 2026 12:14

danmunz merged commit 08607d2 into main Jun 2, 2026

danmunz mentioned this pull request Jun 2, 2026

fix: throttle background prefetch and add circuit breaker (#67) #72

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: non-blocking background prefetch for thumbnail fallback (#67)#70

fix: non-blocking background prefetch for thumbnail fallback (#67)#70
danmunz merged 5 commits into
mainfrom
test/issue-67-diagnosis

danmunz commented Jun 2, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Nitrowolf commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -490,16 +509,16 @@ async def get_thumbnails_batch(body: dict):
		except Exception as e:
		log.warning("Batch thumbnail fetch failed, falling back to individual: %s", e)

		async def test_526_artworks_connection_count(self, client, mock_tv, tmp_data_dir, monkeypatch):
		"""With 524 artworks, the frontend sends 26 batches of 20 + 1 batch of 4.

Conversation

danmunz commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix: Thumbnail loading timeout race (issue #67)

Problem

Root cause

Fix (production code changes)

Tests

Review feedback addressed

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Nitrowolf commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danmunz commented Jun 2, 2026 •

edited

Loading