Skip to content

fix: persistent TV connection + eliminate concurrent socket races#68

Merged
danmunz merged 3 commits into
mainfrom
fix/tv-connection-stability
Jun 2, 2026
Merged

fix: persistent TV connection + eliminate concurrent socket races#68
danmunz merged 3 commits into
mainfrom
fix/tv-connection-stability

Conversation

@danmunz
Copy link
Copy Markdown
Owner

@danmunz danmunz commented Jun 2, 2026

Summary

Fixes the critical TV connection failures reported by u/Feastweasel (524 artworks, Docker) where every TV operation failed with ConnectionFailure / socket closed, and the UI froze for 10–45 seconds during thumbnail loading. Also ships several related improvements accumulated during the investigation.

Root Cause

Six interconnected bugs in the TV communication layer, detailed in #67. The two critical ones:

  1. Background thumbnail pre-fetch bypassed _tv_lock — opened an unguarded concurrent WebSocket that crashed the TV's socket server
  2. Every _tv_op opened/closed a new WebSocket — 56+ connect/disconnect cycles per page load with 524 artworks

How Each Issue Is Resolved

Closes #67 — TV operations fail with ConnectionFailure / socket closed on large catalogs

The primary bug. Six root causes, all addressed:

Root Cause Fix
_fetch_thumbnails_sync bypasses _tv_lock → concurrent WebSocket crashes TV Removed entirely. No more background pre-fetch. Thumbnails are fetched on demand through the lock-protected /api/thumbnails endpoint.
Every _tv_op opens/closes a new WebSocket (56+ cycles) Persistent connection. _ensure_tv_connection() reuses a single WebSocket across all operations. Auto-reconnects on failure via _close_tv_connection(). Reduced from 11 connections to 2 in testing.
Single _tv_lock serializes ALL ops → lock starvation Lock released between retries. The retry loop now releases _tv_lock during the delay, so other operations can proceed instead of starving for 58+ seconds.
"TV is listening" indicator set once, never updated Dynamic status updates. New updateTvStatus() function flips the indicator to "TV went to sleep" on any 502 error, and back to online when operations succeed.
Failed retries hold lock for up to 58 seconds Same fix as lock starvation — lock released between attempts.
Frontend fires ~53 serial thumbnail requests for 524 artworks Batch size 10 → 50. Reduces to ~11 requests. Timeout raised 12s → 30s to accommodate.

Closes #19 — Add /health endpoint for liveness checks

Added GET /health — a lightweight endpoint that checks data file integrity, directory writability, and uptime without attempting a TV connection. Returns structured JSON:

{"status": "ok", "version": "1.0.1", "tv_ip": "192.168.1.24", "data_dir": "/data", "data_files": {"collections": "ok", ...}, "uptime_seconds": 3600}

Also added a Docker HEALTHCHECK directive that polls /health every 30s (plus curl in the Dockerfile to support it).

Closes #57 — Corrupted JSON data files crash the server with no recovery

All five JSON data files (collections.json, artwork_meta.json, ai_config.json, api_usage.json, drive_sync.json) now use _safe_load_json():

  • Corruption recovery: catches JSONDecodeError / ValueError, logs the error, backs up the corrupt file as filename.corrupt.TIMESTAMP, and returns sensible defaults so the server keeps running.
  • Startup validation: all JSON files are loaded at startup in lifespan() so corruption is detected immediately, not on first user request.
  • Atomic write backups: _atomic_write_json() now copies the existing file to .bak before overwriting, creating a one-deep recovery point.

Closes #63 — Jump-to-current-artwork button

Added a floating action button (FAB) in the bottom-right corner:

  • Scrolls to the currently displayed artwork with smooth scroll, accounting for header and group nav height.
  • Auto-shows/hides via _updateCurrentCardObserver() — visible when currentId is set, hidden when no artwork is displayed.
  • Compact on scroll — shrinks when scrolled past 200px to stay unobtrusive.
  • Filter-aware — if the current artwork is filtered out, shows a toast: "Current artwork is not visible — try clearing filters."

Closes #47 — AI settings UX: unclear relationship between Vision, Analysis, and API keys

Added inline .settings-hint help text throughout the AI Settings modal:

  • Under Identification: explains it's optional, what Vision does (reverse-image search for hints), and that analysis works without it.
  • Under Analysis: explains it's required for AI features, what it produces (title, artist, year, medium, etc.), and that Vision results are passed as context when enabled.
  • Under each API key field: direct "Get a key →" links to the provider's console.
  • Under model selectors: brief guidance ("Sonnet is best for most users. Haiku is cheapest." / "Pick whichever you already have a key for.").

What Changed

Backend (server.py)

  • Persistent WebSocket connection_ensure_tv_connection() / _close_tv_connection() reuse a single connection across all TV operations
  • Removed _fetch_thumbnails_sync — eliminated the unguarded background pre-fetch
  • Routed /api/art and /api/art/refresh through _tv_op — all TV access through the persistent connection
  • Released _tv_lock between retries — no more 58-second lock starvation
  • Added /health endpoint — lightweight status check without TV connection
  • _safe_load_json() — corruption recovery with automatic backup
  • _atomic_write_json() — backs up existing file before overwriting
  • Startup validation — all JSON files loaded and checked in lifespan()

Frontend (index.html)

  • Thumbnail batch size 10 → 50, timeout 12s → 30s
  • updateTvStatus() — dynamic TV status indicator on 502 errors
  • Jump-to-current FABscrollToCurrentCard(), _updateCurrentCardObserver(), compact-on-scroll
  • AI settings hints.settings-hint inline help text with provider links

Infrastructure

  • Docker HEALTHCHECK — polls /health every 30s
  • Dockerfile — added curl for healthcheck

Tests

  • conftest.py — reset _tv_conn / _tv_art between tests
  • New tests for health endpoint and file persistence (90 total, all passing)

Test Results

Reproduced on a local setup with 61 artworks. Same test (3 concurrent page-load requests + 6 concurrent thumbnail batches + 1 user select):

Metric Before After
WebSocket connections 11 2
Thumbnail batch failures 6/6 6/6 (TV API issue, not connection)
Background pre-fetch race Present Eliminated
Lock held during retries Up to 58s Released between attempts

Backend:
- Reuse a single persistent WebSocket connection instead of opening/closing
  one per TV operation (was 11+ connect/disconnect cycles per page load)
- Remove _fetch_thumbnails_sync — the unguarded background pre-fetch that
  opened a concurrent WebSocket without _tv_lock, crashing the TV's socket
  server on large catalogs (524+ artworks)
- Route /api/art and /api/art/refresh through _tv_op so all TV access is
  serialized through the single persistent connection
- Release _tv_lock between retry attempts so failed operations don't starve
  user actions for up to 58 seconds
- On failure, close and reopen the connection (auto-reconnect)

Frontend:
- Increase thumbnail batch size from 10 to 50 (reduces 53 serial requests
  to ~11 for a 524-artwork catalog)
- Increase thumbnail request timeout from 12s to 30s for larger batches
- Add updateTvStatus() — TV status indicator now updates on 502 errors
  instead of being set once at init and never updated
- Centralize TV status display logic (updateArtworkCount uses updateTvStatus)

Also includes: /health endpoint, Docker HEALTHCHECK, safe JSON loading with
corruption recovery, atomic write backups, and related tests.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8681b6ce09

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Dockerfile
danmunz added 2 commits June 1, 2026 20:25
…table

Addresses Codex review on PR #68: curl -f in Docker HEALTHCHECK
only fails on non-2xx, but /health returned 200 even with status
'error'. Now returns 503 so Docker correctly marks the container
unhealthy.
@danmunz danmunz merged commit 910821b into main Jun 2, 2026
@danmunz danmunz deleted the fix/tv-connection-stability branch June 2, 2026 00:28
danmunz added a commit that referenced this pull request Jun 2, 2026
…#67)

* test: add diagnostic and reproduction tests for issue #67 thumbnail failures

Add 20 tests across two files that identify and reproduce the root cause
of thumbnail loading failures on large catalogs (500+ artworks).

Root cause: client-server timeout race condition. The server-side
individual fallback (PR #69) takes 53s+ for a batch of 20 IDs when the
TV is struggling, but the frontend's AbortController timeout is 30s.
The client always aborts before receiving the response, never sees the
fallback:true flag, so the progress bar never appears. Three retries
create a ~96s loop ending in 'Tap to retry' on all thumbnails.

test_issue67_diagnosis.py (14 tests):
- Thumbnail key matching with samsungtvws format
- Individual fallback cascade timing (20 TV calls per failed batch)
- TV operation timeout and D2D socket orphaning
- Lock starvation during concurrent fallback operations
- Large catalog scenario (63 TV hits per batch of 20)
- Client timeout race: proves 1 failing ID = 58s worst-case (>30s)

test_reproduce_nitrowolf.py (6 tests):
- Server response exceeds client timeout (53s vs 30s) — matches logs
- Click-to-retry reads from disk cache (2.6ms) — matches user report
- Progress bar never shown (client aborts before response)
- Three-attempt retry loop (96s wall time → 'Tap to retry')
- Concurrent batches compounding the timeout problem
- Settings endpoint blocked by TV lock during fallback

Uses 50x time scaling for fast test execution while preserving the
timing relationships that cause the failures.

Refs: #67, #68, #69

* fix: address Codex review feedback on PR #70

- Mark reproduction test classes with @pytest.mark.xfail(strict=False)
  so a production fix for #67 won't break CI. Tests currently show as
  XPASS (bug still exists); once fixed they'll become XFAIL (expected).

- Replace time.sleep(10) in hanging_fn with threading.Event.wait(2)
  and cancel.set() after assertions, so the executor thread exits
  promptly instead of stalling the suite for 10s.

* fix: non-blocking background prefetch for thumbnail fallback (#67)

Replace synchronous individual thumbnail fallback (which blocked 53s+
for large catalogs) with asyncio.create_task() background prefetch.

Server changes:
- Add _prefetch_thumbnails() background task that fetches thumbnails
  individually via _tv_op() and caches to disk
- Add _thumb_prefetch_in_progress dedup set to avoid redundant fetches
- get_thumbnails_batch() now returns immediately with fallback=True
  and all uncached IDs as missing, spawning background prefetch

Client changes:
- In fallback mode, retry up to 8 times at 3s intervals (vs 3 retries
  with escalating backoff) to pick up newly cached thumbnails

Test updates:
- conftest: reset _thumb_prefetch_in_progress and _tv_lock per test
- test_api_endpoints: verify immediate response + background cache
- test_issue67_diagnosis: update 5 diagnostic tests for new async
  behavior (fast response, background individual calls, disk cache)

* fix: proactively reconnect stale TV WebSocket connections

Samsung Frame WebSocket connections go stale after ~30-60s of
inactivity, causing BrokenPipeError when the user interacts with
the TV after browsing Atmosphere results or other idle periods.

- Track _tv_last_used timestamp, updated after each successful _tv_op
- _ensure_tv_connection checks idle time against TV_CONN_MAX_IDLE (30s)
- Stale connections are proactively closed and reopened instead of
  waiting for BrokenPipeError on the first attempt
- Reset _tv_last_used in test fixtures for proper isolation

* refactor: address PR review comments from Copilot and Codex

- Fix misleading log: 'falling back to individual' → 'scheduling
  background prefetch' (Copilot on server.py:524)
- Fix misleading comment: 'shorter delays' → 'fixed 3s interval'
  with note about normal escalating backoff (Copilot on index.html:2879)
- Replace fixed sleep(0.2) with bounded polling loop in fallback
  tests to avoid CI flakiness (Copilot on test_api_endpoints.py:395,424)
- Rename test_526 → test_524 to match docstring catalog size
  (Copilot on test_issue67_diagnosis.py:281)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant