bug: TV operations fail with ConnectionFailure / socket closed on large catalogs #67

@danmunz

Report

From u/Feastweasel (v1.0.1, 524 artworks, Docker):

Now it doesn't work at all. It reads the cached thumbs, but that's it.

Every TV operation (select, matte, thumbnails) fails with ConnectionFailure and {'reason': 'socket closed'}. The "TV is listening" indicator shows green, the TV is awake, and AI analysis (which doesn't touch the TV) works fine. Docker stop also works now.

Additionally, the UI freezes for 10-45 seconds when clicking buttons (e.g., Settings) while thumbnail fetches are in progress. Operations execute out of order — a stalled Settings dialog pops up over a later-opened modal.

Root Cause Analysis

There are six interconnected bugs, with Bug 1 as the likely primary trigger.

Bug 1 (Critical): Background thumbnail pre-fetch bypasses _tv_lock

_fetch_thumbnails_sync() (~line 239) runs in a thread pool executor (~line 451) and opens its own WebSocket connection without acquiring _tv_lock. This creates a concurrent WebSocket to the TV while _tv_lock-protected operations are also connecting.

The Samsung Frame's WebSocket server does not handle concurrent connections well — it drops one or both, causing {'reason': 'socket closed'}. Every subsequent _tv_op then fails because the TV's WebSocket server is in a confused/recovering state. The retry logic makes this worse by hammering the TV with more connection attempts.

This is the smoking gun. The background pre-fetch races with the frontend's /api/thumbnails batch requests, and the TV kills all sockets.

Bug 2 (Critical): Every _tv_op opens and closes a new WebSocket

_tv_op() (~line 166) creates a brand new SamsungTVWS instance, opens a WebSocket (art.open()), runs the operation, and calls tv.close() every single time.

With 524 artworks on first load:

  • /api/info → 1 WebSocket open/close
  • /api/art → 1 WebSocket open/close
  • /api/mattes → 1 WebSocket open/close
  • Background _fetch_thumbnails_sync → 1 long-lived WebSocket (no lock!)
  • ~53 /api/thumbnails batches → 53 WebSocket open/close cycles
  • = ~56 WebSocket connect/disconnect cycles in rapid succession, plus the long-lived prefetch socket

The TV's WebSocket server likely rate-limits or crashes under this churn.
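
The tally above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes all 524 thumbnails end up being requested in batches of 10, and `connection_counts` is a hypothetical helper, not code from the project.

```python
import math


def connection_counts(artworks, batch_size=10, page_load_calls=3):
    """Estimate WebSocket usage for one fresh page load."""
    batches = math.ceil(artworks / batch_size)   # /api/thumbnails requests
    cycles = page_load_calls + batches           # open/close cycles via _tv_op
    return {
        "thumbnail_batches": batches,
        "open_close_cycles": cycles,
        "total_sockets": cycles + 1,             # +1 for the background prefetch
    }


print(connection_counts(524))
# {'thumbnail_batches': 53, 'open_close_cycles': 56, 'total_sockets': 57}
```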

Bug 3 (High): Global _tv_lock serializes ALL TV ops — lock starvation

_tv_lock is a single asyncio.Lock() (~line 88). Every TV operation competes for it: thumbnails, mattes, select, info, art list, favorites, slideshow, filters — everything.

With 53 thumbnail batches queued, a user's "change matte" request sits at position 54 in the queue. Each batch takes 2-4 seconds (connection + fetch + close), so the user waits 60+ seconds for their click to execute.

This explains the 10-45 second stall on the Settings button and the out-of-order execution: the settings fetch was queued behind thumbnail batches, and by the time it completed, the user had opened another modal.

Bug 4 (Medium): "TV is listening" indicator never updates

The frontend calls /api/info once during init() (~line 2405 in index.html). If it succeeds, the green dot is set permanently. There is no heartbeat, no periodic recheck, no update on failure. The user sees "TV is listening" while every subsequent operation fails.
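
A server-side status tracker is one possible basis for a live indicator. The sketch below is an assumption about how that could look, not existing Docent code: `TVStatus`, its method names, and the green/yellow/red states are all hypothetical, and the 30-second staleness window simply mirrors the heartbeat interval proposed later in this issue.

```python
import time


class TVStatus:
    """Tracks whether the most recent TV operation succeeded."""

    def __init__(self, stale_after_s=30.0, clock=time.monotonic):
        self._clock = clock
        self._stale_after_s = stale_after_s
        self._last_ok = None     # time of the last success, or None
        self._failed = False     # True once an op fails after its retries

    def record_success(self):
        self._last_ok = self._clock()
        self._failed = False

    def record_failure(self):
        self._failed = True

    def state(self):
        if self._failed or self._last_ok is None:
            return "red"
        if self._clock() - self._last_ok > self._stale_after_s:
            return "yellow"      # no recent confirmation; a heartbeat is due
        return "green"
```

The frontend could then poll an endpoint that returns `state()` instead of trusting the single /api/info call made during init().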

Bug 5 (High): Failed retries hold the lock for up to ~58 seconds

When a _tv_op fails and retries (3 attempts × TV_TIMEOUT + WoL delay), it holds _tv_lock the entire time. Worst case: 3 attempts × (TV_TIMEOUT + 8 s WoL delay) + 2 retry pauses × 2 s = ~58 seconds of lock hold per failed operation, which implies a TV_TIMEOUT of roughly 10 s. With multiple failed operations queued, the total lockout can exceed several minutes.
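
The worst-case figure can be checked with a short calculation. The TV_TIMEOUT value of 10 s used here is an assumption inferred from the ~58 s total, not read from server.py, and `worst_case_lock_hold` is a hypothetical helper.

```python
def worst_case_lock_hold(attempts, timeout_s, wol_delay_s, retry_pause_s):
    """Seconds the lock is held if every attempt times out."""
    per_attempt = timeout_s + wol_delay_s         # connect timeout + WoL wait
    pauses = (attempts - 1) * retry_pause_s       # pauses between attempts
    return attempts * per_attempt + pauses


# Assumed values: TV_TIMEOUT = 10 s (inferred), WoL delay = 8 s, pause = 2 s.
print(worst_case_lock_hold(3, 10, 8, 2))  # 3 * (10 + 8) + 2 * 2 = 58
```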

Bug 6 (High): Frontend fires ~53 serial thumbnail requests for 524 artworks

The IntersectionObserver + loadThumbnailBatch() (~line 2773 in index.html) slices visible thumbnails into batches of 10 and fires them sequentially. Each batch → one _tv_op() → one WebSocket round-trip.

Timeline of Failure (524 artworks, fresh start)

0.0s  /api/info         → lock acquired, WebSocket #1 open/close → GREEN DOT ✅
0.5s  /api/art          → lock acquired, WebSocket #2 open/close → 524 items
1.0s  Background prefetch → WebSocket #3 opened WITHOUT LOCK ⚠️
1.0s  /api/mattes       → waiting for lock...
1.2s  /api/thumbnails batch 1 → waiting for lock...
1.4s  /api/thumbnails batch 2 → waiting for lock...
      ...
      /api/mattes finally gets lock → WebSocket #4 vs concurrent #3 → SOCKET CLOSED 💥
      All subsequent _tv_ops fail — TV WebSocket is confused
      Each failure triggers 3 retries × 2s delay = lock held 10-20s per failure
      User clicks Settings → queued at position 50+ → waits 60+ seconds
      User clicks Matte → queued behind Settings → opens after Settings, out of order

Proposed Fixes

Immediate (P0)

  1. Make _fetch_thumbnails_sync acquire _tv_lock — or better, remove it entirely and let the existing /api/thumbnails endpoint handle all fetches. No concurrent unguarded connections.

  2. Reuse a persistent WebSocket connection instead of open/close per operation. Create a connection pool (size 1) or a long-lived SamsungTVWS instance that reconnects on failure. This eliminates the 56+ connect/disconnect churn.
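
A size-1 "pool" with reconnect-on-failure could look roughly like the sketch below. It is deliberately library-agnostic: `PersistentConnection` is a hypothetical wrapper, and `connect` is an assumed factory that would build and open the TV client; the only thing required of the returned object is a `close()` method.

```python
from typing import Any, Callable, Optional


class PersistentConnection:
    """Holds one long-lived connection and rebuilds it after a failure."""

    def __init__(self, connect: Callable[[], Any]):
        self._connect = connect            # hypothetical factory for the TV client
        self._conn: Optional[Any] = None

    def run(self, op: Callable[[Any], Any]) -> Any:
        if self._conn is None:
            self._conn = self._connect()   # connect lazily, once
        try:
            return op(self._conn)
        except Exception:
            # Drop the broken connection so the next call reconnects.
            self.close()
            raise

    def close(self) -> None:
        conn, self._conn = self._conn, None
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass                       # the socket may already be dead
```

Calls would still need to go through _tv_lock (or a single worker), since the wrapper itself does nothing to prevent two callers from using the socket at once.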

Short-term (P1)

  1. Separate thumbnail lock from user-action lock — or use a priority queue so user-initiated actions (select, matte, settings) jump ahead of background thumbnail fetches. Alternatively, cancel in-progress thumbnail batches when a user action arrives.

  2. Larger thumbnail batches — fetch all missing thumbnails in a single _tv_op call instead of batches of 10. One WebSocket round-trip instead of 53.

  3. Update "TV is listening" on failure — if any _tv_op fails after exhausting retries, flip the indicator to red/yellow. Add a periodic heartbeat (every 30s).
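
For the first item in this list, a priority queue drained by a single worker is one way to let user actions jump ahead of thumbnail batches. This is a minimal sketch under that assumption; `submit`, `tv_worker`, and the `USER`/`BACKGROUND` constants are hypothetical names, and the queued callables are assumed to be quick, non-blocking functions.

```python
import asyncio
import itertools

USER, BACKGROUND = 0, 1            # lower number runs first
_counter = itertools.count()       # tie-breaker keeps FIFO order per priority


async def submit(queue, priority, func):
    """Queue a callable and wait for its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((priority, next(_counter), func, fut))
    return await fut


async def tv_worker(queue):
    """Single consumer: the only task that ever talks to the TV."""
    while True:
        _prio, _seq, func, fut = await queue.get()
        try:
            fut.set_result(func())
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            queue.task_done()
```

Here `queue` is an asyncio.PriorityQueue. With one worker there is never more than one operation in flight, so the worker also replaces the global lock; a user click only waits for the batch currently running, not for the whole backlog.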

Medium-term (P2)

  1. Non-blocking retry — release _tv_lock between retry attempts so other operations can proceed while waiting for the TV to wake up.

  2. Request deduplication — if the user clicks a button while thumbnails are loading, cancel or deprioritize the thumbnail queue.
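
The non-blocking retry in the first item could be structured as below: each attempt runs under the lock, but the pause between attempts happens with the lock released. This is a sketch, not the current retry code; `tv_op_with_retry` is a hypothetical name, and the built-in ConnectionError stands in for the library's ConnectionFailure.

```python
import asyncio


async def tv_op_with_retry(lock, op, attempts=3, pause_s=2.0):
    """Run `op` under `lock`, sleeping between attempts with the lock released."""
    last_exc = None
    for attempt in range(attempts):
        async with lock:
            try:
                return await op()
            except ConnectionError as exc:   # stand-in for ConnectionFailure
                last_exc = exc
        # The lock is released here, so queued operations can run during the pause.
        if attempt < attempts - 1:
            await asyncio.sleep(pause_s)
    raise last_exc
```

The trade-off is that another operation may reach the TV while it is still waking up and fail as well, so this probably works best combined with the status tracking and prioritization ideas above.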

Affected code

| Location | Issue |
| --- | --- |
| server.py ~L239-268 | _fetch_thumbnails_sync: unguarded concurrent WebSocket |
| server.py ~L166-173 | _tv_op: new WebSocket per call |
| server.py ~L88 | _tv_lock: single global lock |
| server.py ~L175-190 | Retry logic holds lock during delays |
| server.py ~L451-452 | Background pre-fetch launch (no lock) |
| index.html ~L2773-2807 | Thumbnail batching (groups of 10) |
| index.html ~L2405-2416 | TV status indicator (set once, never updated) |

Environment

  • Docent v1.0.1 (Docker)
  • 524 artworks on Samsung Frame
  • TV awake, on local network
  • AI analysis (OpenAI) works fine (no TV connection needed)

Local Reproduction (61 artworks)

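Reproduced on a local setup with only 61 artworks (vs Feastweasel's 524). Results confirm the serialization and lock-starvation root causes; the background pre-fetch race (Bug 1) was not triggered in this run because there were no new IDs to fetch.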

Test: Simulated page load + 6 concurrent thumbnail batches + user action

=== PAGE LOAD (3 concurrent requests) ===
19:59:48 → 19:59:54  (6 seconds for 3 requests that should take ~1s each)
Requests serialized through _tv_lock — each waited for the previous one.

=== THUMBNAIL STORM (6 concurrent batches of 10) ===
batch1 = 3.12s    ← queued behind batches that arrived first
batch2 = 2.08s
batch3 = 1.04s    ← first to acquire lock
batch4 = 4.16s
batch5 = 5.18s
batch6 = 6.22s    ← last in queue, waited for all 5 before it
Total wall time: 6 seconds (perfectly serialized staircase)

=== USER SELECT (queued behind all batches) ===
select = 1.05s    ← waited until all 6 batches drained
Started at 20:00:00, right after storm ended at 20:00:00

Key findings

| Metric | 61 artworks (local) | 524 artworks (projected) |
| --- | --- | --- |
| WebSocket connections opened | 11 in 12 seconds | ~56+ in rapid succession |
| Thumbnail batch requests | 6 | ~53 |
| Lock starvation (user action delayed) | ~7 seconds | ~53 seconds |
| Background pre-fetch race | Not triggered (0 new IDs) | Fires on every fresh start |

All 6 thumbnail batch fetches failed

WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1

Every batch hit get_thumbnail_list error -1. Thumbnails only appear to work because they're served from disk cache. The live TV thumbnail fetch path is broken even on a small catalog.

Conclusion

The lock starvation and the failing live thumbnail fetches are fully reproducible at 61 artworks. At 524 artworks, the lock starvation scales linearly (~53s), and the background pre-fetch race (Bug 1) would also trigger on fresh starts, compounding into the total failure Feastweasel reported.
