bug: TV operations fail with ConnectionFailure / socket closed on large catalogs #67

## Report

From u/Feastweasel (v1.0.1, 524 artworks, Docker):

> Now it doesn't work at all. It reads the cached thumbs, but that's it.

Every TV operation (`select`, `matte`, `thumbnails`) fails with `ConnectionFailure` and `{'reason': 'socket closed'}`. The "TV is listening" indicator shows green, the TV is awake, and AI analysis (which doesn't touch the TV) works fine. Docker stop also works now.

Additionally, the UI freezes for 10-45 seconds when clicking buttons (e.g., Settings) while thumbnail fetches are in progress. Operations execute out of order — a stalled Settings dialog pops up over a later-opened modal.

## Root Cause Analysis

There are **6 interconnected bugs**, with #1 as the likely primary trigger.

### Bug 1 (Critical): Background thumbnail pre-fetch bypasses `_tv_lock`

`_fetch_thumbnails_sync()` (~line 239) runs in a thread pool executor (~line 451) and opens its **own WebSocket connection without acquiring `_tv_lock`**. This creates a concurrent WebSocket to the TV while `_tv_lock`-protected operations are also connecting.

The Samsung Frame's WebSocket server does not handle concurrent connections well — it drops one or both, causing `{'reason': 'socket closed'}`. Every subsequent `_tv_op` then fails because the TV's WebSocket server is in a confused/recovering state. The retry logic makes this worse by hammering the TV with more connection attempts.

**This is the smoking gun.** The background pre-fetch races with the frontend's `/api/thumbnails` batch requests, and the TV kills all sockets.
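
A minimal sketch of how the pre-fetch could be serialized with everything else. Because `_tv_lock` is an `asyncio.Lock`, a worker thread cannot acquire it directly; one option is for the coroutine that dispatches the blocking call to hold the lock for the duration. `FakeTV`, `guarded`, `unguarded`, and `run_demo` are hypothetical names used only for illustration, and the fake simply records how many "connections" overlap.

```python
import asyncio
import threading
import time


class FakeTV:
    """Hypothetical stand-in for the TV: records peak concurrent connections."""

    def __init__(self):
        self._guard = threading.Lock()
        self.active = 0
        self.peak = 0

    def session(self, seconds=0.02):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(seconds)  # simulate a blocking WebSocket exchange
        with self._guard:
            self.active -= 1


async def guarded(lock, tv):
    # The coroutine holds the asyncio lock while the blocking call
    # runs in a worker thread, so only one session is open at a time.
    async with lock:
        await asyncio.get_running_loop().run_in_executor(None, tv.session)


async def unguarded(tv):
    # Mirrors the reported bug: the executor call is not serialized.
    await asyncio.get_running_loop().run_in_executor(None, tv.session)


async def run_demo(use_lock):
    tv = FakeTV()
    lock = asyncio.Lock()
    if use_lock:
        jobs = [guarded(lock, tv) for _ in range(4)]
    else:
        jobs = [unguarded(tv) for _ in range(4)]
    await asyncio.gather(*jobs)
    return tv.peak


peak_with_lock = asyncio.run(run_demo(True))
peak_without_lock = asyncio.run(run_demo(False))
print(peak_with_lock, peak_without_lock)
```

With the lock held, the peak stays at one connection; without it, the sessions are free to overlap, which is the condition the TV appears to reject.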

### Bug 2 (Critical): Every `_tv_op` opens and closes a new WebSocket

`_tv_op()` (~line 166) creates a brand new `SamsungTVWS` instance, opens a WebSocket (`art.open()`), runs the operation, and calls `tv.close()` — **every single time**.

With 524 artworks on first load:
- `/api/info` → 1 WebSocket open/close
- `/api/art` → 1 WebSocket open/close
- `/api/mattes` → 1 WebSocket open/close
- Background `_fetch_thumbnails_sync` → 1 long-lived WebSocket (no lock!)
- ~53 `/api/thumbnails` batches → 53 WebSocket open/close cycles
- = **~56+ WebSocket connect/disconnect cycles in rapid succession**

The TV's WebSocket server likely rate-limits or crashes under this churn.
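
The connection count above can be checked with a few lines of arithmetic. The batch size of 10 and the three startup endpoints come from this report; the variable names are illustrative only.

```python
import math

ARTWORKS = 524
BATCH_SIZE = 10  # batch size reported for loadThumbnailBatch()

thumbnail_batches = math.ceil(ARTWORKS / BATCH_SIZE)
startup_calls = 3  # /api/info, /api/art, /api/mattes
short_lived = startup_calls + thumbnail_batches
total_connections = short_lived + 1  # plus the long-lived background pre-fetch

print(thumbnail_batches, short_lived, total_connections)  # 53 56 57
```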

### Bug 3 (High): Global `_tv_lock` serializes ALL TV ops — lock starvation

`_tv_lock` is a single `asyncio.Lock()` (~line 88). Every TV operation competes for it: thumbnails, mattes, select, info, art list, favorites, slideshow, filters — everything.

With 53 thumbnail batches queued, a user's "change matte" request sits at position 54 in the queue. Each batch takes 2-4 seconds (connection + fetch + close), so the user waits **60+ seconds** for their click to execute.

This explains the 10-45 second stall on the Settings button and the out-of-order execution: the settings fetch was queued behind thumbnail batches, and by the time it completed, the user had opened another modal.

### Bug 4 (Medium): "TV is listening" indicator never updates

The frontend calls `/api/info` once during `init()` (~line 2405 in index.html). If it succeeds, the green dot is set permanently. There is no heartbeat, no periodic recheck, no update on failure. The user sees "TV is listening" while every subsequent operation fails.

### Bug 5 (High): Failed retries hold the lock for up to ~58 seconds

When a `_tv_op` fails, it retries (3 attempts, each bounded by `TV_TIMEOUT` plus the WoL delay) and holds `_tv_lock` the entire time. Worst case: `3 × (timeout + 8s) + 2 × 2s`, which comes to ~58 seconds if the timeout is 10 seconds. With multiple failed operations queued, the total lockout can exceed several minutes.
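
The worst-case figure can be reproduced as follows. The 10-second timeout is an assumption inferred from the ~58 second total, not a value read from `server.py`, and `worst_case_lock_hold` is a hypothetical helper.

```python
def worst_case_lock_hold(attempts, timeout_s, wol_delay_s, retry_delay_s):
    """Seconds the lock is held if every attempt times out."""
    per_attempt = timeout_s + wol_delay_s
    # Delays occur between attempts, so there is one fewer delay than attempts.
    return attempts * per_attempt + (attempts - 1) * retry_delay_s


# Assumed values: 3 attempts, 10 s timeout, 8 s WoL delay, 2 s between retries
single = worst_case_lock_hold(3, 10, 8, 2)
print(single)       # 58
print(single * 5)   # 290, i.e. five queued failures approach five minutes
```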

### Bug 6 (High): Frontend fires ~53 serial thumbnail requests for 524 artworks

The `IntersectionObserver` + `loadThumbnailBatch()` (~line 2773 in index.html) slices visible thumbnails into batches of 10 and fires them **sequentially**. Each batch → one `_tv_op()` → one WebSocket round-trip.

## Timeline of Failure (524 artworks, fresh start)

```
0.0s  /api/info         → lock acquired, WebSocket #1 open/close → GREEN DOT ✅
0.5s  /api/art          → lock acquired, WebSocket #2 open/close → 524 items
1.0s  Background prefetch → WebSocket #3 opened WITHOUT LOCK ⚠️
1.0s  /api/mattes       → waiting for lock...
1.2s  /api/thumbnails batch 1 → waiting for lock...
1.4s  /api/thumbnails batch 2 → waiting for lock...
      ...
      /api/mattes finally gets lock → WebSocket #4 vs concurrent #3 → SOCKET CLOSED 💥
      All subsequent _tv_ops fail — TV WebSocket is confused
      Each failure triggers 3 retries × 2s delay = lock held 10-20s per failure
      User clicks Settings → queued at position 50+ → waits 60+ seconds
      User clicks Matte → queued behind Settings → opens after Settings, out of order
```

## Proposed Fixes

### Immediate (P0)

1. **Make `_fetch_thumbnails_sync` acquire `_tv_lock`** — or better, remove it entirely and let the existing `/api/thumbnails` endpoint handle all fetches. No concurrent unguarded connections.

2. **Reuse a persistent WebSocket connection** instead of open/close per operation. Create a connection pool (size 1) or a long-lived `SamsungTVWS` instance that reconnects on failure. This eliminates the 56+ connect/disconnect churn.
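
A rough sketch of the "pool of size 1" idea from fix 2, under stated assumptions: `PersistentConnection`, `FakeSocket`, and `make_factory` are hypothetical, and a real implementation would wrap whatever connection object and exception types the TV library actually uses.

```python
class PersistentConnection:
    """Keeps one connection open and reconnects once if an operation fails."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory returning a connection
        self._conn = None
        self.opens = 0

    def _ensure(self):
        if self._conn is None:
            self._conn = self._connect()
            self.opens += 1
        return self._conn

    def run(self, operation):
        try:
            return operation(self._ensure())
        except ConnectionError:
            # Drop the dead socket and retry exactly once on a fresh one.
            self._conn = None
            return operation(self._ensure())


class FakeSocket:
    """Hypothetical connection that can be told to fail on its Nth request."""

    def __init__(self, fail_on_call=None):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def request(self, name):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("socket closed")
        return f"ok:{name}"


def make_factory():
    created = []

    def connect():
        # Only the first socket is made to fail, on its third request.
        sock = FakeSocket(fail_on_call=3 if not created else None)
        created.append(sock)
        return sock

    return connect


pool = PersistentConnection(make_factory())
results = [pool.run(lambda c, i=i: c.request(f"batch{i}")) for i in range(56)]
print(pool.opens)  # 2
```

Here 56 operations cost two connection opens instead of 56. Retrying on a fresh socket re-runs the operation, so this pattern is only safe for operations that can be repeated without side effects.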

### Short-term (P1)

3. **Separate thumbnail lock from user-action lock** — or use a priority queue so user-initiated actions (select, matte, settings) jump ahead of background thumbnail fetches. Alternatively, cancel in-progress thumbnail batches when a user action arrives.

4. **Larger thumbnail batches** — fetch all missing thumbnails in a single `_tv_op` call instead of batches of 10. One WebSocket round-trip instead of 53.

5. **Update "TV is listening" on failure** — if any `_tv_op` fails after exhausting retries, flip the indicator to red/yellow. Add a periodic heartbeat (every 30s).
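
One possible shape for the priority idea in fix 3, sketched with `asyncio.PriorityQueue` and a single worker that is the only task allowed to talk to the TV. The names `tv_worker`, `submit`, `USER`, and `BACKGROUND` are hypothetical, and the sleep stands in for a real TV round-trip.

```python
import asyncio
import itertools

USER, BACKGROUND = 0, 1  # lower number runs first


async def tv_worker(queue, log):
    """Single consumer: the only task allowed to talk to the TV."""
    while True:
        priority, seq, name, done = await queue.get()
        await asyncio.sleep(0.01)  # simulated TV round-trip
        log.append(name)
        done.set_result(None)
        queue.task_done()


async def main():
    queue = asyncio.PriorityQueue()
    counter = itertools.count()  # tie-breaker keeps FIFO order within a priority
    log = []
    loop = asyncio.get_running_loop()

    def submit(priority, name):
        done = loop.create_future()
        queue.put_nowait((priority, next(counter), name, done))
        return done

    pending = [submit(BACKGROUND, f"thumbs{i}") for i in range(5)]
    worker = asyncio.create_task(tv_worker(queue, log))
    await asyncio.sleep(0)  # let the worker start on the first batch
    pending.append(submit(USER, "select"))  # user click arrives mid-storm
    await asyncio.gather(*pending)
    worker.cancel()
    return log


order = asyncio.run(main())
print(order)
```

The user's `select` waits only for the batch already in flight rather than for the whole thumbnail queue. A queue with one consumer also removes the need for a separate lock, since the worker serializes access by construction.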

### Medium-term (P2)

6. **Non-blocking retry** — release `_tv_lock` between retry attempts so other operations can proceed while waiting for the TV to wake up.

7. **Request deduplication** — if the user clicks a button while thumbnails are loading, cancel or deprioritize the thumbnail queue.
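
Fix 6 could look roughly like the following, where the lock is held only while an attempt is running and released during the back-off. `tv_op_with_retry`, `flaky`, and `quick` are hypothetical names, the delays are shortened for the demo, and `ConnectionError` stands in for whatever the TV library raises.

```python
import asyncio


async def tv_op_with_retry(lock, operation, attempts=3, retry_delay=0.01):
    """Retry `operation`, holding `lock` only while an attempt is running."""
    last_error = None
    for attempt in range(attempts):
        async with lock:
            try:
                return await operation()
            except ConnectionError as exc:
                last_error = exc
        # The lock is released here, so other callers can run during back-off.
        if attempt < attempts - 1:
            await asyncio.sleep(retry_delay)
    raise last_error


async def main():
    lock = asyncio.Lock()
    events = []
    failures = iter([True, True, False])  # fail twice, then succeed

    async def flaky():
        events.append("flaky attempt")
        if next(failures):
            raise ConnectionError("socket closed")
        return "ok"

    async def quick():
        await asyncio.sleep(0.005)  # arrives during the first back-off
        async with lock:
            events.append("quick op")

    result, _ = await asyncio.gather(tv_op_with_retry(lock, flaky), quick())
    return result, events


result, events = asyncio.run(main())
print(result, events)
```

The quick operation slips in between the first and second attempts instead of waiting for all three to finish.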

## Affected code

| Location | Issue |
|----------|-------|
| `server.py` ~L239-268 | `_fetch_thumbnails_sync` — unguarded concurrent WebSocket |
| `server.py` ~L166-173 | `_tv_op` — new WebSocket per call |
| `server.py` ~L88 | `_tv_lock` — single global lock |
| `server.py` ~L175-190 | Retry logic holds lock during delays |
| `server.py` ~L451-452 | Background pre-fetch launch (no lock) |
| `index.html` ~L2773-2807 | Thumbnail batching (groups of 10) |
| `index.html` ~L2405-2416 | TV status indicator (set once, never updated) |

## Environment

- Docent v1.0.1 (Docker)
- 524 artworks on Samsung Frame
- TV awake, on local network
- AI analysis (OpenAI) works fine (no TV connection needed)

## Local Reproduction (61 artworks)

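Reproduced on a local setup with only 61 artworks (vs Feastweasel's 524). The results confirm the lock serialization and the thumbnail fetch failures; the background pre-fetch race (Bug 1) did not trigger locally because there were no new IDs to fetch.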

### Test: Simulated page load + 6 concurrent thumbnail batches + user action

```
=== PAGE LOAD (3 concurrent requests) ===
19:59:48 → 19:59:54  (6 seconds for 3 requests that should take ~1s each)
Requests serialized through _tv_lock — each waited for the previous one.

=== THUMBNAIL STORM (6 concurrent batches of 10) ===
batch1 = 3.12s    ← queued behind batches that arrived first
batch2 = 2.08s
batch3 = 1.04s    ← first to acquire lock
batch4 = 4.16s
batch5 = 5.18s
batch6 = 6.22s    ← last in queue, waited for all 5 before it
Total wall time: 6 seconds (perfectly serialized staircase)

=== USER SELECT (queued behind all batches) ===
select = 1.05s    ← waited until all 6 batches drained
Started at 20:00:00, right after storm ended at 20:00:00
```
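
The staircase pattern is what a single `asyncio.Lock` produces for any set of concurrent callers. The following simulation, with a hypothetical `batch` coroutine and a shortened hold time, shows the same shape without a TV.

```python
import asyncio
import time


async def batch(lock, name, hold_s, finished, start):
    async with lock:
        await asyncio.sleep(hold_s)  # simulated connect + fetch + close
    finished[name] = time.perf_counter() - start


async def main(batches=6, hold_s=0.02):
    lock = asyncio.Lock()
    finished = {}
    start = time.perf_counter()
    await asyncio.gather(
        *(batch(lock, f"batch{i + 1}", hold_s, finished, start) for i in range(batches))
    )
    return finished


finished = asyncio.run(main())
for name, elapsed in finished.items():
    print(f"{name} = {elapsed:.2f}s")
```

Each batch finishes roughly one hold time after the previous one, so total wall time grows linearly with the number of queued batches.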

### Key findings

| Metric | 61 artworks (local) | 524 artworks (projected) |
|---|---|---|
| WebSocket connections opened | 11 in 12 seconds | ~56+ in rapid succession |
| Thumbnail batch requests | 6 | ~53 |
| Lock starvation (user action delayed) | ~7 seconds | ~53 seconds |
| Background pre-fetch race | Not triggered (0 new IDs) | Fires on every fresh start |

### All 6 thumbnail batch fetches failed

```
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
WARNING docent: Batch thumbnail fetch failed: `get_thumbnail_list` request failed with error number -1
```

Every batch failed with `get_thumbnail_list` error -1. Thumbnails only appear to work because they are served from the disk cache; the live TV thumbnail fetch path is broken even on a small catalog.

### Conclusion

The lock starvation and the thumbnail fetch failures are **fully reproducible** at 61 artworks. At 524 artworks, the lock starvation scales linearly (~53s), and the background pre-fetch race (Bug #1), which did not trigger locally, would also fire on fresh starts, compounding into the total failure Feastweasel reported.

