Probe concurrency guardrails and operations#43

Merged
wpak-ai merged 4 commits into develop from w3/issue-07-09-probe-concurrency-ops
May 19, 2026

@henry0816191
Collaborator

@henry0816191 henry0816191 commented May 19, 2026

Summary

Closes week_3 issues 07, 08, and 09 in one PR: harden and document the asyncio/thread concurrency model, fix Last-Modified edge cases in ISO probes, and add operator-facing probe documentation plus machine-readable per-cycle logging.

Changes

Issue 07 — Implicit single-thread contract (ISOProber._stats)

  • Document the _stats event-loop-only invariant on the attribute; extend ISOProber class docstring.
  • Guard _stats with threading.Lock via _bump_stat, _reset_stats, and public snapshot_stats().
  • Note that WG21Index.papers is replaced wholesale on every refresh() (do not mutate in place).
  • Add test_run_cycle_stats_integrity_under_concurrency to verify stat buckets sum to URL count under concurrent asyncio.gather.
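
The lock-guarded counter pattern described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from sources.py: the `ProbeStats` class name and the stat keys are hypothetical, and only the method names `_bump_stat`, `_reset_stats`, and `snapshot_stats` come from this PR.

```python
import threading
from typing import Dict


class ProbeStats:
    """Hypothetical stand-in for the stats-handling part of ISOProber."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stats: Dict[str, int] = {}

    def _bump_stat(self, key: str, amount: int = 1) -> None:
        # The read-modify-write happens under the lock, so increments
        # from different threads cannot be lost.
        with self._lock:
            self._stats[key] = self._stats.get(key, 0) + amount

    def _reset_stats(self) -> None:
        # Swap in a fresh dict instead of clearing the old one, so a
        # snapshot handed out earlier is never mutated afterwards.
        with self._lock:
            self._stats = {}

    def snapshot_stats(self) -> Dict[str, int]:
        # Return a copy; callers never see the live dict.
        with self._lock:
            return dict(self._stats)
```

Callers outside the prober (for example the monitor) would read counters only through `snapshot_stats()`, which is why mutating the returned dict has no effect on the live counters.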

Issue 08 — asyncio.to_thread guardrails

  • Add run_blocking_io() in concurrency.py (wraps asyncio.to_thread with an explicit safety contract).
  • Use it in monitor.py for matches_for_users, with a call-site comment explaining why it is safe (DB pool only, no shared source state).
  • Add docs/architecture.md and a Concurrency subsection in CONTRIBUTING.md (rules for event loop vs threads, new data sources, open-std extension point).
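
For readers unfamiliar with the pattern, a thin wrapper of this kind might look like the sketch below. It assumes the helper simply forwards to `asyncio.to_thread`; the exact signature and docstring wording in concurrency.py are not shown in this PR description, and `slow_lookup` is a made-up blocking function used only for the demo.

```python
import asyncio
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


async def run_blocking_io(func: Callable[..., T], /, *args: Any, **kwargs: Any) -> T:
    """Run a blocking callable in a worker thread and await its result.

    Assumed safety contract: ``func`` does pure blocking I/O (for example
    a DB query on its own pooled connection) and must not touch state
    owned by the event loop.
    """
    return await asyncio.to_thread(func, *args, **kwargs)


def slow_lookup(paper_id: str) -> str:
    # Hypothetical blocking call standing in for a database query.
    time.sleep(0.01)
    return paper_id.upper()


async def main() -> str:
    # The event loop stays free to run other tasks while the thread blocks.
    return await run_blocking_io(slow_lookup, "p1234r0")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Routing every offload through one named helper gives reviewers a single place to look for the contract, and makes raw `asyncio.to_thread` calls easy to spot in a diff.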

Issue 09 — Probe operations + Last-Modified

  • Add docs/probe-operations.md: normal envelope (~1,600–2,000 HEADs/cycle), hot/cold behavior, tuning (HTTP_CONCURRENCY, POLL_INTERVAL_MINUTES, POLL_OVERRUN_COOLDOWN_SECONDS), degradation signals, and troubleshooting (overrun, >5% errors, 429s).
  • Emit PROBE-CYCLE-SUMMARY JSON per probe cycle (cycle_requests, cycle_duration_s, hot_probes, cold_probes, errors, hit_total, etc.).
  • Fix _probe_one Last-Modified handling: naive datetimes → UTC; bad/unparseable LM → hit_no_lm (treated as recent, no silent drops).
  • Update ProbeHit.is_recent docstring; invert bad-LM test; add naive-LM test.
  • Link README probing section → docs/probe-operations.md.
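
As an illustration of how a per-cycle summary line could be emitted and consumed, here is a small sketch. The field names come from the bullet above; the `PROBE-CYCLE-SUMMARY <json>` line layout, the helper names, and the sample numbers are assumptions, not the format actually produced by sources.py.

```python
import json
from typing import Optional

SUMMARY_PREFIX = "PROBE-CYCLE-SUMMARY"
ERROR_RATE_THRESHOLD = 0.05  # the ">5% errors" troubleshooting signal


def format_cycle_summary(stats: dict, cycle_duration_s: float) -> str:
    """Render one cycle's counters as a single greppable log line."""
    payload = {"cycle_duration_s": round(cycle_duration_s, 3), **stats}
    return f"{SUMMARY_PREFIX} {json.dumps(payload, sort_keys=True)}"


def parse_cycle_summary(line: str) -> Optional[dict]:
    """Return the JSON payload of a summary line, or None for other lines."""
    _, marker, rest = line.partition(SUMMARY_PREFIX)
    if not marker:
        return None
    return json.loads(rest)


def error_rate_exceeded(summary: dict) -> bool:
    """True when errors make up more than 5% of the cycle's requests."""
    requests = summary.get("cycle_requests", 0)
    return requests > 0 and summary.get("errors", 0) / requests > ERROR_RATE_THRESHOLD


if __name__ == "__main__":
    # Made-up sample values, chosen inside the documented request envelope.
    sample = {"cycle_requests": 1800, "hot_probes": 300, "cold_probes": 1500,
              "errors": 100, "hit_total": 4}
    line = format_cycle_summary(sample, 212.5)
    print(line)
    print(error_rate_exceeded(parse_cycle_summary(line)))
```

Keeping the summary on one line with a fixed marker means an operator can filter logs with a plain text search and feed the remainder of the line to any JSON parser.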

Test plan

  • uv run pytest tests/ -q --cov=paperscout --cov-fail-under=90
  • test_probe_one_bad_last_modified_header
  • test_probe_one_naive_last_modified_recent
  • test_run_cycle_stats_integrity_under_concurrency
  • uv run pre-commit run --all-files (before merge)

Related Issues

close #37
close #38
close #39

Summary by CodeRabbit

  • Documentation

    • Added detailed architecture and probe-operations guides; README updated to reference operational thresholds, logs, and concurrency rules.
  • Bug Fixes

    • Improved handling of missing or unparsable Last-Modified headers so recent discoveries are counted, alerted, and fetched correctly.
  • Refactor

    • Centralized blocking I/O handling and introduced safe snapshotting and single-line cycle summary logging for robust concurrent operation.
  • Tests

    • Expanded tests for header parsing and concurrent probe stats integrity.

@henry0816191 henry0816191 self-assigned this May 19, 2026
@henry0816191 henry0816191 requested a review from wpak-ai as a code owner May 19, 2026 15:53
@coderabbitai

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c06dfa25-09c6-491e-bf04-7bfb0a669803

📥 Commits

Reviewing files that changed from the base of the PR and between 78cf1f3 and f4d9faa.

📒 Files selected for processing (9)
  • CONTRIBUTING.md
  • README.md
  • docs/architecture.md
  • docs/probe-operations.md
  • src/paperscout/concurrency.py
  • src/paperscout/models.py
  • src/paperscout/monitor.py
  • src/paperscout/sources.py
  • tests/test_sources.py

📝 Walkthrough

Adds concurrency guidance and a run_blocking_io helper; makes ISOProber stats lock-protected, with snapshot reporting and a JSON summary per cycle; normalizes naive Last-Modified values to UTC and folds unparseable ones into the “no-LM” (treated as recent) path; updates the monitor to use snapshot_stats and run_blocking_io; and adds operator docs and tests.

Changes

Concurrency Safety & Probe Operations

Layer / File(s) Summary
Concurrency Infrastructure & Safety Documentation
src/paperscout/concurrency.py, docs/architecture.md, CONTRIBUTING.md, README.md, docs/probe-operations.md
Introduces run_blocking_io() async helper for offloading blocking I/O to worker threads. Documents the concurrency model (event loop for probing, threads only for pure blocking I/O), explicit rules against accessing internal _stats or _papers from threads, and guidance on safe extension patterns.
Monitor Integration with Public APIs
src/paperscout/monitor.py
Replaces direct self.prober._stats access with calls to snapshot_stats() in both seeding and poll bookkeeping. Centralizes blocking I/O dispatch by importing and using run_blocking_io() instead of raw asyncio.to_thread() for database operations, with updated documentation on safety.
Thread-Safe Stats Collection & Logging
src/paperscout/sources.py, tests/test_sources.py
Adds threading.Lock-protected _stats dict with _bump_stat() and _reset_stats() helpers and a public snapshot_stats() API. Routes all counter increments through _bump_stat(). Replaces old completion logs with a single structured JSON PROBE-CYCLE-SUMMARY.
Last-Modified Parsing & Hit Classification
src/paperscout/sources.py, src/paperscout/models.py, tests/test_sources.py
Parses naive Last-Modified datetimes as UTC; treats parse failures as last_modified=None and classifies the hit as recent (hit_no_lm). Updates ProbeHit.is_recent docstring and adjusts/extends tests accordingly, including a concurrency-focused run_cycle stats integrity test.

Sequence Diagram

sequenceDiagram
  participant EventLoop as asyncio Event Loop
  participant Monitor as Monitor.scheduler
  participant RunBlocking as run_blocking_io
  participant Worker as WorkerThread
  participant DB as PostgreSQL

  Monitor->>RunBlocking: await run_blocking_io(matches_for_users)
  RunBlocking->>Worker: asyncio.to_thread(callable)
  Worker->>DB: execute blocking query (own connection)
  DB-->>Worker: query result
  Worker-->>RunBlocking: return result
  RunBlocking-->>Monitor: resume with result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

bug, documentation

Suggested reviewers

  • wpak-ai

Poem

🐰 I hopped through docs and threading rules,

I taught the loop to trust its tools.
Locks keep counters neat and right,
LM wears UTC at night.
A JSON log records each probe's flight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Probe concurrency guardrails and operations' accurately reflects the main changes: adding concurrency guardrails (locks, documentation, run_blocking_io), operator documentation, and probe operations guidance. |
| Description check | ✅ Passed | The PR description comprehensively covers all changes across three linked issues (#37, #38, #39), provides a detailed test plan, and follows the required template structure with Summary and Test plan sections. |
| Linked Issues check | ✅ Passed | All code changes align with linked issue requirements: Issue #37 (guard _stats with Lock, add snapshot_stats, test concurrency) is implemented [lines in sources.py +62/-13, test_sources.py +65/-6]; Issue #38 (add run_blocking_io, use in monitor.py, document in CONTRIBUTING.md/architecture.md) is implemented [concurrency.py +22/-0, monitor.py +7/-5, CONTRIBUTING.md +10/-0, architecture.md +36/-0]; Issue #39 (probe-operations.md, PROBE-CYCLE-SUMMARY logging, Last-Modified fixes, tests) is implemented [probe-operations.md +70/-0, sources.py fixes, test updates]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the three linked issues: concurrency model hardening, guardrails, documentation, Last-Modified handling fixes, and supporting tests; no unrelated changes detected. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/paperscout/sources.py`:
- Around line 583-593: The broad except Exception around parsedate_to_datetime
should be narrowed to only the exceptions that parsing can raise; replace
"except Exception:" with "except (TypeError, ValueError):" so only bad/invalid
Last-Modified inputs are handled (keeping the existing fallback: last_modified =
None and is_recent = True), and allow other unexpected exceptions to propagate;
look for the parsedate_to_datetime call and the variables last_modified and
is_recent in this block to apply the change.
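
The narrowed handler suggested above could look roughly like this sketch. `parse_last_modified` and `is_recent` are hypothetical free functions written for illustration (the real logic lives inside `_probe_one`), and the 24-hour recency window is an assumption, not a value taken from the project.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_last_modified(header: Optional[str]) -> Optional[datetime]:
    """Parse a Last-Modified header; return None if missing or unusable."""
    if header is None:
        return None
    try:
        parsed = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Only parse failures are swallowed; anything else propagates.
        return None
    if parsed.tzinfo is None:
        # Naive result (for example a "-0000" offset): interpret as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_recent(last_modified: Optional[datetime], now: datetime,
              window: timedelta = timedelta(hours=24)) -> bool:
    """A hit with no usable Last-Modified is treated as recent."""
    if last_modified is None:
        return True
    return now - last_modified <= window
```

Catching both exception types covers interpreter differences: older Python versions raise `TypeError` for unparseable input, while newer ones raise `ValueError`.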

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 631af90f-8ee8-4e40-adae-a80fa913e7c8

📥 Commits

Reviewing files that changed from the base of the PR and between 78cf1f3 and b9d6ed5.

📒 Files selected for processing (9)
  • CONTRIBUTING.md
  • README.md
  • docs/architecture.md
  • docs/probe-operations.md
  • src/paperscout/concurrency.py
  • src/paperscout/models.py
  • src/paperscout/monitor.py
  • src/paperscout/sources.py
  • tests/test_sources.py

Comment thread src/paperscout/sources.py
@henry0816191
Collaborator Author

@coderabbitai full review again

@coderabbitai

coderabbitai Bot commented May 19, 2026

@henry0816191 Sure! I'll kick off a full review of PR #43 right away.

✅ Actions performed

Full review triggered.

@wpak-ai
Contributor

wpak-ai commented May 19, 2026

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Reviews resumed.

@wpak-ai
Contributor

wpak-ai commented May 19, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Full review triggered.

@wpak-ai wpak-ai merged commit fc400af into develop May 19, 2026
9 checks passed
@wpak-ai wpak-ai deleted the w3/issue-07-09-probe-concurrency-ops branch May 19, 2026 17:19