fix: persistent TV connection + eliminate concurrent socket races #68
Merged
Backend:
- Reuse a single persistent WebSocket connection instead of opening/closing one per TV operation (was 11+ connect/disconnect cycles per page load)
- Remove _fetch_thumbnails_sync, the unguarded background pre-fetch that opened a concurrent WebSocket without _tv_lock, crashing the TV's socket server on large catalogs (524+ artworks)
- Route /api/art and /api/art/refresh through _tv_op so all TV access is serialized through the single persistent connection
- Release _tv_lock between retry attempts so failed operations don't starve user actions for up to 58 seconds
- On failure, close and reopen the connection (auto-reconnect)

Frontend:
- Increase thumbnail batch size from 10 to 50 (reduces 53 serial requests to ~11 for a 524-artwork catalog)
- Increase thumbnail request timeout from 12s to 30s for larger batches
- Add updateTvStatus(): the TV status indicator now updates on 502 errors instead of being set once at init and never updated
- Centralize TV status display logic (updateArtworkCount uses updateTvStatus)

Also includes: /health endpoint, Docker HEALTHCHECK, safe JSON loading with corruption recovery, atomic write backups, and related tests.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8681b6ce09
…table Addresses Codex review on PR #68: curl -f in Docker HEALTHCHECK only fails on non-2xx, but /health returned 200 even with status 'error'. Now returns 503 so Docker correctly marks the container unhealthy.
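The status-code fix described in this commit can be illustrated with a small framework-free sketch. The function name build_health_response, its parameters, and the rule for deciding when the status is 'error' are assumptions made for illustration; only the 200-versus-503 behaviour comes from the commit message.

```python
def build_health_response(data_files, data_dir_writable):
    """Return (http_status, body) for a /health-style check.

    data_files maps a file label to "ok" or an error description
    (hypothetical shape). Any problem yields 503 so that `curl -f`
    exits non-zero and Docker marks the container unhealthy.
    """
    problems = [name for name, state in data_files.items() if state != "ok"]
    if not data_dir_writable:
        problems.append("data_dir")

    status = "ok" if not problems else "error"
    body = {"status": status, "data_files": data_files}
    http_status = 200 if status == "ok" else 503
    return http_status, body
```

With this shape, a healthy check returns 200 and any failed check returns 503 alongside the same JSON body, so the body stays useful for debugging while the status code drives the HEALTHCHECK.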
danmunz added a commit that referenced this pull request Jun 2, 2026
…#67)

* test: add diagnostic and reproduction tests for issue #67 thumbnail failures

Add 20 tests across two files that identify and reproduce the root cause of thumbnail loading failures on large catalogs (500+ artworks).

Root cause: client-server timeout race condition. The server-side individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the TV is struggling, but the frontend's AbortController timeout is 30s. The client always aborts before receiving the response and never sees the fallback:true flag, so the progress bar never appears. Three retries create a ~96s loop ending in 'Tap to retry' on all thumbnails.

test_issue67_diagnosis.py (14 tests):
- Thumbnail key matching with samsungtvws format
- Individual fallback cascade timing (20 TV calls per failed batch)
- TV operation timeout and D2D socket orphaning
- Lock starvation during concurrent fallback operations
- Large catalog scenario (63 TV hits per batch of 20)
- Client timeout race: proves 1 failing ID = 58s worst-case (>30s)

test_reproduce_nitrowolf.py (6 tests):
- Server response exceeds client timeout (53s vs 30s), matches logs
- Click-to-retry reads from disk cache (2.6ms), matches user report
- Progress bar never shown (client aborts before response)
- Three-attempt retry loop (96s wall time → 'Tap to retry')
- Concurrent batches compounding the timeout problem
- Settings endpoint blocked by TV lock during fallback

Uses 50x time scaling for fast test execution while preserving the timing relationships that cause the failures.

Refs: #67, #68, #69

* fix: address Codex review feedback on PR #70
- Mark reproduction test classes with @pytest.mark.xfail(strict=False) so a production fix for #67 won't break CI. Tests currently show as XPASS (bug still exists); once fixed they'll become XFAIL (expected).
- Replace time.sleep(10) in hanging_fn with threading.Event.wait(2) and cancel.set() after assertions, so the executor thread exits promptly instead of stalling the suite for 10s.
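The last bullet above, swapping a fixed sleep for a cancellable wait, follows a common test pattern. The sketch below is an assumed reconstruction: hanging_fn and cancel are named in the commit message, while demo(), the executor setup, and the 0.05s timeout are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

cancel = threading.Event()


def hanging_fn():
    # Simulates a stuck TV call: blocks for up to 2s, but returns as
    # soon as the test sets `cancel`, unlike time.sleep(10).
    cancel.wait(2)
    return "finished"


def demo():
    """Run hanging_fn with a short timeout, then release the worker."""
    start = time.monotonic()
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(hanging_fn)

    timed_out = False
    try:
        future.result(timeout=0.05)  # the caller gives up quickly
    except FutureTimeout:
        timed_out = True

    # Assertions would go here in a real test; afterwards, unblock the
    # worker thread so it does not stall the rest of the suite.
    cancel.set()
    executor.shutdown(wait=True)
    return timed_out, future.result(), time.monotonic() - start
```

The caller still observes a timeout, but the worker thread exits as soon as the event is set rather than running out a fixed sleep.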
* fix: non-blocking background prefetch for thumbnail fallback (#67)

Replace synchronous individual thumbnail fallback (which blocked 53s+ for large catalogs) with asyncio.create_task() background prefetch.

Server changes:
- Add _prefetch_thumbnails() background task that fetches thumbnails individually via _tv_op() and caches to disk
- Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches
- get_thumbnails_batch() now returns immediately with fallback=True and all uncached IDs as missing, spawning background prefetch

Client changes:
- In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries with escalating backoff) to pick up newly cached thumbnails

Test updates:
- conftest: reset _thumb_prefetch_in_progress and _tv_lock per test
- test_api_endpoints: verify immediate response + background cache
- test_issue67_diagnosis: update 5 diagnostic tests for new async behavior (fast response, background individual calls, disk cache)

* fix: proactively reconnect stale TV WebSocket connections

Samsung Frame WebSocket connections go stale after ~30-60s of inactivity, causing BrokenPipeError when the user interacts with the TV after browsing Atmosphere results or other idle periods.
- Track _tv_last_used timestamp, updated after each successful _tv_op
- _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s)
- Stale connections are proactively closed and reopened instead of waiting for BrokenPipeError on the first attempt
- Reset _tv_last_used in test fixtures for proper isolation

* refactor: address PR review comments from Copilot and Codex
- Fix misleading log: 'falling back to individual' → 'scheduling background prefetch' (Copilot on server.py:524)
- Fix misleading comment: 'shorter delays' → 'fixed 3s interval' with note about normal escalating backoff (Copilot on index.html:2879)
- Replace fixed sleep(0.2) with bounded polling loop in fallback tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424)
- Rename test_526 → test_524 to match docstring catalog size (Copilot on test_issue67_diagnosis.py:281)
Summary
Fixes the critical TV connection failures reported by u/Feastweasel (524 artworks, Docker) where every TV operation failed with ConnectionFailure/socket closed, and the UI froze for 10–45 seconds during thumbnail loading. Also ships several related improvements accumulated during the investigation.

Root Cause
Six interconnected bugs in the TV communication layer, detailed in #67. The two critical ones:
- _fetch_thumbnails_sync bypassed _tv_lock: it opened an unguarded concurrent WebSocket that crashed the TV's socket server
- Each _tv_op call opened/closed a new WebSocket: 56+ connect/disconnect cycles per page load with 524 artworks

How Each Issue Is Resolved
Closes #67 — TV operations fail with ConnectionFailure / socket closed on large catalogs
The primary bug. Six root causes, all addressed:
- Root cause: _fetch_thumbnails_sync bypasses _tv_lock → concurrent WebSocket crashes TV. Fix: […] /api/thumbnails endpoint.
- Root cause: _tv_op opens/closes a new WebSocket (56+ cycles). Fix: _ensure_tv_connection() reuses a single WebSocket across all operations. Auto-reconnects on failure via _close_tv_connection(). Reduced from 11 connections to 2 in testing.
- Root cause: _tv_lock serializes ALL ops → lock starvation. Fix: releases _tv_lock during the delay, so other operations can proceed instead of starving for 58+ seconds.
- Root cause: TV status indicator set once at init and never updated. Fix: updateTvStatus() function flips the indicator to "TV went to sleep" on any 502 error, and back to online when operations succeed.

Closes #19 — Add /health endpoint for liveness checks
Added GET /health, a lightweight endpoint that checks data file integrity, directory writability, and uptime without attempting a TV connection. Returns structured JSON:

{"status": "ok", "version": "1.0.1", "tv_ip": "192.168.1.24", "data_dir": "/data", "data_files": {"collections": "ok", ...}, "uptime_seconds": 3600}

Also added a Docker HEALTHCHECK directive that polls /health every 30s (plus curl in the Dockerfile to support it).

Closes #57 — Corrupted JSON data files crash the server with no recovery
All five JSON data files (collections.json, artwork_meta.json, ai_config.json, api_usage.json, drive_sync.json) now use _safe_load_json():
- Catches JSONDecodeError/ValueError, logs the error, backs up the corrupt file as filename.corrupt.TIMESTAMP, and returns sensible defaults so the server keeps running.
- […] lifespan() so corruption is detected immediately, not on first user request.
- _atomic_write_json() now copies the existing file to .bak before overwriting, creating a one-deep recovery point.

Closes #63 — Jump-to-current-artwork button
Added a floating action button (FAB) in the bottom-right corner:
- _updateCurrentCardObserver(): visible when currentId is set, hidden when no artwork is displayed.

Closes #47 — AI settings UX: unclear relationship between Vision, Analysis, and API keys
Added inline .settings-hint help text throughout the AI Settings modal.

What Changed
Backend (server.py)
- _ensure_tv_connection() / _close_tv_connection() reuse a single connection across all TV operations
- Removed _fetch_thumbnails_sync: eliminated the unguarded background pre-fetch
- Routed /api/art and /api/art/refresh through _tv_op: all TV access through the persistent connection
- Release _tv_lock between retries: no more 58-second lock starvation
- /health endpoint: lightweight status check without TV connection
- _safe_load_json(): corruption recovery with automatic backup
- _atomic_write_json(): backs up existing file before overwriting
- […] lifespan()

Frontend (index.html)
- updateTvStatus(): dynamic TV status indicator on 502 errors
- scrollToCurrentCard(), _updateCurrentCardObserver(), compact-on-scroll
- .settings-hint inline help text with provider links

Infrastructure
- HEALTHCHECK: polls /health every 30s
- curl for healthcheck

Tests
- conftest.py: reset _tv_conn / _tv_art between tests

Test Results
Reproduced on a local setup with 61 artworks. Same test (3 concurrent page-load requests + 6 concurrent thumbnail batches + 1 user select):