Skip to content

fix: non-blocking background prefetch for thumbnail fallback (#67)#70

Merged
danmunz merged 5 commits into
mainfrom
test/issue-67-diagnosis
Jun 2, 2026
Merged

fix: non-blocking background prefetch for thumbnail fallback (#67)#70
danmunz merged 5 commits into
mainfrom
test/issue-67-diagnosis

Conversation

@danmunz
Copy link
Copy Markdown
Owner

@danmunz danmunz commented Jun 2, 2026

Fix: Thumbnail loading timeout race (issue #67)

Problem

When the Samsung TV's batch thumbnail endpoint (get_thumbnail_list) fails, the server fell back to fetching each thumbnail individually — blocking the HTTP response for 53+ seconds while the frontend times out at 30s.

Root cause

The server's individual-fallback path makes up to N × 3 sequential _tv_op calls (one per artwork, 3 retry attempts each), all while holding the response open. For 20 artworks that's ~53 seconds of blocking.

Fix (production code changes)

server.py

  • get_thumbnails_batch() now returns immediately with {"fallback": true, "missing": [...]} when the batch call fails
  • A background task (_prefetch_thumbnails) fetches individual thumbnails asynchronously and caches them to disk
  • Deduplication set (_thumb_prefetch_in_progress) prevents concurrent background fetches for the same content IDs
  • WebSocket staleness detection: tracks _tv_last_used and proactively reconnects when idle > TV_CONN_MAX_IDLE (30s)

index.html

  • When fallback: true is received, the frontend retries up to 8 times at fixed 3s intervals (vs normal 3 attempts with escalating backoff)
  • This gives the background prefetch time to populate the cache

Tests

  • test_reproduce_nitrowolf.py: Reproduces the exact failure scenario from issue bug: TV operations fail with ConnectionFailure / socket closed on large catalogs #67 (marked xfail(strict=False) since the fix changes the behavior)
  • test_issue67_diagnosis.py: Tests for timeout behavior, lock contention, WebSocket staleness detection, and large-catalog scenarios
  • test_api_endpoints.py: Updated fallback tests to verify background prefetch behavior with polling-based assertions

Review feedback addressed

  • Fixed misleading log message ("falling back to individual" → "scheduling background prefetch")
  • Fixed misleading code comment about delay durations
  • Replaced fixed sleep(0.2) in tests with bounded polling loops
  • Fixed test name/docstring mismatch (526 → 524)
  • Hanging thread cleanup already uses threading.Event with bounded wait

Closes #67

…ailures

Add 20 tests across two files that identify and reproduce the root cause
of thumbnail loading failures on large catalogs (500+ artworks).

Root cause: client-server timeout race condition. The server-side
individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the
TV is struggling, but the frontend's AbortController timeout is 30s.
The client always aborts before receiving the response, never sees the
fallback:true flag, so the progress bar never appears. Three retries
create a ~96s loop ending in 'Tap to retry' on all thumbnails.

test_issue67_diagnosis.py (14 tests):
- Thumbnail key matching with samsungtvws format
- Individual fallback cascade timing (20 TV calls per failed batch)
- TV operation timeout and D2D socket orphaning
- Lock starvation during concurrent fallback operations
- Large catalog scenario (63 TV hits per batch of 20)
- Client timeout race: proves 1 failing ID = 58s worst-case (>30s)

test_reproduce_nitrowolf.py (6 tests):
- Server response exceeds client timeout (53s vs 30s) — matches logs
- Click-to-retry reads from disk cache (2.6ms) — matches user report
- Progress bar never shown (client aborts before response)
- Three-attempt retry loop (96s wall time → 'Tap to retry')
- Concurrent batches compounding the timeout problem
- Settings endpoint blocked by TV lock during fallback

Uses 50x time scaling for fast test execution while preserving the
timing relationships that cause the failures.

Refs: #67, #68, #69
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24a94a32d2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

data = resp.json()

# 1. Server response took longer than the client timeout
assert server_time > SCALED_CLIENT_TIMEOUT, (
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid locking in the thumbnail timeout regression

Because this reproduction test is part of the normal pytest suite, this assertion makes the current timeout race the expected behavior: any production fix that brings /api/thumbnails responses back under the frontend timeout will fail CI here before it can ship. If the goal is diagnostic coverage for #67, mark these reproductions as expected failures or assert the desired bounded-response behavior instead of requiring the server to stay slower than the client.

Useful? React with 👍 / 👎.

Comment thread tests/test_issue67_diagnosis.py Outdated
def hanging_fn(art):
import threading
# Simulate the D2D socket blocking on recv for a long time
time.sleep(10)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid leaving a sleeping executor thread in the suite

In this timeout test, _tv_op cancels the await on asyncio.to_thread, but the worker thread running hanging_fn keeps sleeping for seconds after the assertion. When this test is run in the full suite, that non-daemon executor thread can delay event-loop cleanup or process exit by the full 10s sleep; use a much shorter blocking interval or a controllable synchronization primitive instead so the diagnostic test does not add a fixed 10s stall to every run.

Useful? React with 👍 / 👎.

danmunz added 2 commits June 2, 2026 11:36
- Mark reproduction test classes with @pytest.mark.xfail(strict=False)
  so a production fix for #67 won't break CI. Tests currently show as
  XPASS (bug still exists); once fixed they'll become XFAIL (expected).

- Replace time.sleep(10) in hanging_fn with threading.Event.wait(2)
  and cancel.set() after assertions, so the executor thread exits
  promptly instead of stalling the suite for 10s.
Replace synchronous individual thumbnail fallback (which blocked 53s+
for large catalogs) with asyncio.create_task() background prefetch.

Server changes:
- Add _prefetch_thumbnails() background task that fetches thumbnails
  individually via _tv_op() and caches to disk
- Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches
- get_thumbnails_batch() now returns immediately with fallback=True
  and all uncached IDs as missing, spawning background prefetch

Client changes:
- In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries
  with escalating backoff) to pick up newly cached thumbnails

Test updates:
- conftest: reset _thumb_prefetch_in_progress and _tv_lock per test
- test_api_endpoints: verify immediate response + background cache
- test_issue67_diagnosis: update 5 diagnostic tests for new async
  behavior (fast response, background individual calls, disk cache)
@danmunz danmunz changed the title test: diagnostic and reproduction tests for issue #67 thumbnail failures fix: non-blocking background prefetch for thumbnail fallback (#67) Jun 2, 2026
@danmunz danmunz requested a review from Copilot June 2, 2026 16:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the thumbnail fallback flow to avoid blocking /api/thumbnails responses when batch fetching fails: the server now returns immediately (with fallback: true and missing IDs) and prefetches missing thumbnails asynchronously into the disk cache. The frontend retry loop is adjusted to take advantage of this by allowing more attempts in fallback mode to pick up cached results. The PR also adds two new large test modules to diagnose and reproduce the previously reported issue #67 behavior and updates existing endpoint tests/fixtures to match the new behavior.

Changes:

  • Backend: make thumbnail fallback non-blocking by scheduling background per-ID prefetch and tracking in-progress IDs to avoid duplicate work.
  • Frontend: increase retry attempts in fallback mode and adjust retry delay logic so the client can pick up newly cached thumbnails.
  • Tests: add diagnostic + reproduction suites for #67 and update existing thumbnail fallback tests/fixtures for background prefetch semantics.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
server.py Implements background thumbnail prefetch on batch failure and tracks in-progress IDs.
index.html Updates thumbnail batch retry strategy when fallback mode is detected.
tests/conftest.py Resets new global prefetch-tracking state and lock per test.
tests/test_api_endpoints.py Updates thumbnail fallback tests to expect immediate response + background caching.
tests/test_issue67_diagnosis.py Adds diagnostic tests that explain timing/lock/timeout dynamics for #67.
tests/test_reproduce_nitrowolf.py Adds xfail reproduction tests matching Nitrowolf’s reported symptoms and timing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread server.py Outdated
@@ -490,16 +509,16 @@ async def get_thumbnails_batch(body: dict):
except Exception as e:
log.warning("Batch thumbnail fetch failed, falling back to individual: %s", e)
Comment thread server.py
Comment on lines +464 to +471
# Background thumbnail prefetch — tracks IDs already being fetched so
# concurrent/retry requests don't spawn duplicate work.
_thumb_prefetch_in_progress: set[str] = set()


async def _prefetch_thumbnails(content_ids: list[str]) -> None:
"""Fetch thumbnails individually in the background and cache to disk."""
for cid in content_ids:
Comment thread index.html
Comment on lines +2875 to +2879
// When the server is prefetching thumbnails in the background
// (fallback mode), allow more retries with shorter delays so the
// client picks up cached results as they become available.
const MAX_ATTEMPTS = _thumbFallbackSeen ? 8 : 3;
const retryDelay = (a) => _thumbFallbackSeen ? 3000 : 2000 * (a + 1);
Comment thread tests/test_api_endpoints.py Outdated
Comment on lines +393 to +395
# Give the background task time to run
import asyncio
await asyncio.sleep(0.2)
Comment thread tests/test_api_endpoints.py Outdated
Comment on lines +422 to +424
# Give the background task time to run
import asyncio
await asyncio.sleep(0.2)
Comment thread tests/test_issue67_diagnosis.py Outdated
Comment on lines +280 to +281
async def test_526_artworks_connection_count(self, client, mock_tv, tmp_data_dir, monkeypatch):
"""With 524 artworks, the frontend sends 26 batches of 20 + 1 batch of 4.
danmunz added 2 commits June 2, 2026 12:14
Samsung Frame WebSocket connections go stale after ~30-60s of
inactivity, causing BrokenPipeError when the user interacts with
the TV after browsing Atmosphere results or other idle periods.

- Track _tv_last_used timestamp, updated after each successful _tv_op
- _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s)
- Stale connections are proactively closed and reopened instead of
  waiting for BrokenPipeError on the first attempt
- Reset _tv_last_used in test fixtures for proper isolation
- Fix misleading log: 'falling back to individual' → 'scheduling
  background prefetch' (Copilot on server.py:524)
- Fix misleading comment: 'shorter delays' → 'fixed 3s interval'
  with note about normal escalating backoff (Copilot on index.html:2879)
- Replace fixed sleep(0.2) with bounded polling loop in fallback
  tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424)
- Rename test_526 → test_524 to match docstring catalog size
  (Copilot on test_issue67_diagnosis.py:281)
@Nitrowolf
Copy link
Copy Markdown

The previously cached ones that weren't showing up until click now show up. However, most of them are still in Tap to retry status. The progress bar at the bottom did appear and count down but no new thumbnails showed up.

What is the process of grabbing thumbnails? Are you just grabbing the image and then generating the thumbnail from that image or is there an actual thumbnail stored on the Frame device you're grabbing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: TV operations fail with ConnectionFailure / socket closed on large catalogs

3 participants