fix: persistent TV connection + eliminate concurrent socket races by danmunz · Pull Request #68 · danmunz/docent

danmunz · 2026-06-02T00:20:44Z

Summary

Fixes the critical TV connection failures reported by u/Feastweasel (524 artworks, Docker) where every TV operation failed with ConnectionFailure / socket closed, and the UI froze for 10–45 seconds during thumbnail loading. Also ships several related improvements accumulated during the investigation.

Root Cause

Six interconnected bugs in the TV communication layer, detailed in #67. The two critical ones:

Background thumbnail pre-fetch bypassed _tv_lock — opened an unguarded concurrent WebSocket that crashed the TV's socket server
Every _tv_op opened/closed a new WebSocket — 56+ connect/disconnect cycles per page load with 524 artworks

How Each Issue Is Resolved

Closes #67 — TV operations fail with ConnectionFailure / socket closed on large catalogs

The primary bug. Six root causes, all addressed:

Root Cause	Fix
`_fetch_thumbnails_sync` bypasses `_tv_lock` → concurrent WebSocket crashes TV	Removed entirely. No more background pre-fetch. Thumbnails are fetched on demand through the lock-protected `/api/thumbnails` endpoint.
Every `_tv_op` opens/closes a new WebSocket (56+ cycles)	Persistent connection. `_ensure_tv_connection()` reuses a single WebSocket across all operations. Auto-reconnects on failure via `_close_tv_connection()`. Reduced from 11 connections to 2 in testing.
Single `_tv_lock` serializes ALL ops → lock starvation	Lock released between retries. The retry loop now releases `_tv_lock` during the delay, so other operations can proceed instead of starving for 58+ seconds.
"TV is listening" indicator set once, never updated	Dynamic status updates. New `updateTvStatus()` function flips the indicator to "TV went to sleep" on any 502 error, and back to online when operations succeed.
Failed retries hold lock for up to 58 seconds	Same fix as lock starvation — lock released between attempts.
Frontend fires ~53 serial thumbnail requests for 524 artworks	Batch size 10 → 50. Reduces to ~11 requests. Timeout raised 12s → 30s to accommodate.

Closes #19 — Add /health endpoint for liveness checks

Added GET /health — a lightweight endpoint that checks data file integrity, directory writability, and uptime without attempting a TV connection. Returns structured JSON:

{"status": "ok", "version": "1.0.1", "tv_ip": "192.168.1.24", "data_dir": "/data", "data_files": {"collections": "ok", ...}, "uptime_seconds": 3600}

Also added a Docker HEALTHCHECK directive that polls /health every 30s (plus curl in the Dockerfile to support it).

Closes #57 — Corrupted JSON data files crash the server with no recovery

All five JSON data files (collections.json, artwork_meta.json, ai_config.json, api_usage.json, drive_sync.json) now use _safe_load_json():

Corruption recovery: catches JSONDecodeError / ValueError, logs the error, backs up the corrupt file as filename.corrupt.TIMESTAMP, and returns sensible defaults so the server keeps running.
Startup validation: all JSON files are loaded at startup in lifespan() so corruption is detected immediately, not on first user request.
Atomic write backups: _atomic_write_json() now copies the existing file to .bak before overwriting, creating a one-deep recovery point.

Closes #63 — Jump-to-current-artwork button

Added a floating action button (FAB) in the bottom-right corner:

Scrolls to the currently displayed artwork with smooth scroll, accounting for header and group nav height.
Auto-shows/hides via _updateCurrentCardObserver() — visible when currentId is set, hidden when no artwork is displayed.
Compact on scroll — shrinks when scrolled past 200px to stay unobtrusive.
Filter-aware — if the current artwork is filtered out, shows a toast: "Current artwork is not visible — try clearing filters."

Closes #47 — AI settings UX: unclear relationship between Vision, Analysis, and API keys

Added inline .settings-hint help text throughout the AI Settings modal:

Under Identification: explains it's optional, what Vision does (reverse-image search for hints), and that analysis works without it.
Under Analysis: explains it's required for AI features, what it produces (title, artist, year, medium, etc.), and that Vision results are passed as context when enabled.
Under each API key field: direct "Get a key →" links to the provider's console.
Under model selectors: brief guidance ("Sonnet is best for most users. Haiku is cheapest." / "Pick whichever you already have a key for.").

What Changed

Backend (`server.py`)

Persistent WebSocket connection — _ensure_tv_connection() / _close_tv_connection() reuse a single connection across all TV operations
Removed _fetch_thumbnails_sync — eliminated the unguarded background pre-fetch
Routed /api/art and /api/art/refresh through _tv_op — all TV access through the persistent connection
Released _tv_lock between retries — no more 58-second lock starvation
Added /health endpoint — lightweight status check without TV connection
_safe_load_json() — corruption recovery with automatic backup
_atomic_write_json() — backs up existing file before overwriting
Startup validation — all JSON files loaded and checked in lifespan()

Frontend (`index.html`)

Thumbnail batch size 10 → 50, timeout 12s → 30s
updateTvStatus() — dynamic TV status indicator on 502 errors
Jump-to-current FAB — scrollToCurrentCard(), _updateCurrentCardObserver(), compact-on-scroll
AI settings hints — .settings-hint inline help text with provider links

Infrastructure

Docker HEALTHCHECK — polls /health every 30s
Dockerfile — added curl for healthcheck

Tests

conftest.py — reset _tv_conn / _tv_art between tests
New tests for health endpoint and file persistence (90 total, all passing)

Test Results

Reproduced on a local setup with 61 artworks. Same test (3 concurrent page-load requests + 6 concurrent thumbnail batches + 1 user select):

Metric	Before	After
WebSocket connections	11	2
Thumbnail batch failures	6/6	6/6 (TV API issue, not connection)
Background pre-fetch race	Present	Eliminated
Lock held during retries	Up to 58s	Released between attempts

Backend: - Reuse a single persistent WebSocket connection instead of opening/closing one per TV operation (was 11+ connect/disconnect cycles per page load) - Remove _fetch_thumbnails_sync — the unguarded background pre-fetch that opened a concurrent WebSocket without _tv_lock, crashing the TV's socket server on large catalogs (524+ artworks) - Route /api/art and /api/art/refresh through _tv_op so all TV access is serialized through the single persistent connection - Release _tv_lock between retry attempts so failed operations don't starve user actions for up to 58 seconds - On failure, close and reopen the connection (auto-reconnect) Frontend: - Increase thumbnail batch size from 10 to 50 (reduces 53 serial requests to ~11 for a 524-artwork catalog) - Increase thumbnail request timeout from 12s to 30s for larger batches - Add updateTvStatus() — TV status indicator now updates on 502 errors instead of being set once at init and never updated - Centralize TV status display logic (updateArtworkCount uses updateTvStatus) Also includes: /health endpoint, Docker HEALTHCHECK, safe JSON loading with corruption recovery, atomic write backups, and related tests.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8681b6ce09

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…table Addresses Codex review on PR #68: curl -f in Docker HEALTHCHECK only fails on non-2xx, but /health returned 200 even with status 'error'. Now returns 503 so Docker correctly marks the container unhealthy.

…#67) * test: add diagnostic and reproduction tests for issue #67 thumbnail failures Add 20 tests across two files that identify and reproduce the root cause of thumbnail loading failures on large catalogs (500+ artworks). Root cause: client-server timeout race condition. The server-side individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the TV is struggling, but the frontend's AbortController timeout is 30s. The client always aborts before receiving the response, never sees the fallback:true flag, so the progress bar never appears. Three retries create a ~96s loop ending in 'Tap to retry' on all thumbnails. test_issue67_diagnosis.py (14 tests): - Thumbnail key matching with samsungtvws format - Individual fallback cascade timing (20 TV calls per failed batch) - TV operation timeout and D2D socket orphaning - Lock starvation during concurrent fallback operations - Large catalog scenario (63 TV hits per batch of 20) - Client timeout race: proves 1 failing ID = 58s worst-case (>30s) test_reproduce_nitrowolf.py (6 tests): - Server response exceeds client timeout (53s vs 30s) — matches logs - Click-to-retry reads from disk cache (2.6ms) — matches user report - Progress bar never shown (client aborts before response) - Three-attempt retry loop (96s wall time → 'Tap to retry') - Concurrent batches compounding the timeout problem - Settings endpoint blocked by TV lock during fallback Uses 50x time scaling for fast test execution while preserving the timing relationships that cause the failures. Refs: #67, #68, #69 * fix: address Codex review feedback on PR #70 - Mark reproduction test classes with @pytest.mark.xfail(strict=False) so a production fix for #67 won't break CI. Tests currently show as XPASS (bug still exists); once fixed they'll become XFAIL (expected). - Replace time.sleep(10) in hanging_fn with threading.Event.wait(2) and cancel.set() after assertions, so the executor thread exits promptly instead of stalling the suite for 10s. * fix: non-blocking background prefetch for thumbnail fallback (#67) Replace synchronous individual thumbnail fallback (which blocked 53s+ for large catalogs) with asyncio.create_task() background prefetch. Server changes: - Add _prefetch_thumbnails() background task that fetches thumbnails individually via _tv_op() and caches to disk - Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches - get_thumbnails_batch() now returns immediately with fallback=True and all uncached IDs as missing, spawning background prefetch Client changes: - In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries with escalating backoff) to pick up newly cached thumbnails Test updates: - conftest: reset _thumb_prefetch_in_progress and _tv_lock per test - test_api_endpoints: verify immediate response + background cache - test_issue67_diagnosis: update 5 diagnostic tests for new async behavior (fast response, background individual calls, disk cache) * fix: proactively reconnect stale TV WebSocket connections Samsung Frame WebSocket connections go stale after ~30-60s of inactivity, causing BrokenPipeError when the user interacts with the TV after browsing Atmosphere results or other idle periods. - Track _tv_last_used timestamp, updated after each successful _tv_op - _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s) - Stale connections are proactively closed and reopened instead of waiting for BrokenPipeError on the first attempt - Reset _tv_last_used in test fixtures for proper isolation * refactor: address PR review comments from Copilot and Codex - Fix misleading log: 'falling back to individual' → 'scheduling background prefetch' (Copilot on server.py:524) - Fix misleading comment: 'shorter delays' → 'fixed 3s interval' with note about normal escalating backoff (Copilot on index.html:2879) - Replace fixed sleep(0.2) with bounded polling loop in fallback tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424) - Rename test_526 → test_524 to match docstring catalog size (Copilot on test_issue67_diagnosis.py:281)

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread Dockerfile

danmunz added 2 commits June 1, 2026 20:25

fix: return 503 from /health when data files are corrupt or dir unwri…

5f1dc84

…table Addresses Codex review on PR #68: curl -f in Docker HEALTHCHECK only fails on non-2xx, but /health returned 200 even with status 'error'. Now returns 503 so Docker correctly marks the container unhealthy.

docs: update README for /health endpoint, Docker HEALTHCHECK, test count

7b8dc54

danmunz merged commit 910821b into main Jun 2, 2026

danmunz deleted the fix/tv-connection-stability branch June 2, 2026 00:28

Copilot AI mentioned this pull request Jun 2, 2026

fix: consistent API shape & Set-based progress tracking for thumbnail fallback #69

Merged

4 tasks

danmunz mentioned this pull request Jun 2, 2026

fix: non-blocking background prefetch for thumbnail fallback (#67) #70

Merged

danmunz mentioned this pull request Jun 4, 2026

fix: resolve prefetch timeout and retry path failures (#67) #75

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: persistent TV connection + eliminate concurrent socket races#68

fix: persistent TV connection + eliminate concurrent socket races#68
danmunz merged 3 commits into
mainfrom
fix/tv-connection-stability

danmunz commented Jun 2, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danmunz commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

How Each Issue Is Resolved

Closes #67 — TV operations fail with ConnectionFailure / socket closed on large catalogs

Closes #19 — Add /health endpoint for liveness checks

Closes #57 — Corrupted JSON data files crash the server with no recovery

Closes #63 — Jump-to-current-artwork button

Closes #47 — AI settings UX: unclear relationship between Vision, Analysis, and API keys

What Changed

Backend (server.py)

Frontend (index.html)

Infrastructure

Tests

Test Results

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

danmunz commented Jun 2, 2026 •

edited

Loading

Backend (`server.py`)

Frontend (`index.html`)