
fix: record errors on missing deps + safe JSON serialization (#472, #473) #474

Merged

axisrow merged 6 commits into main from
fix/472-473-missing-dependencies-json-serialization
on Apr 19, 2026

Conversation

@axisrow (Owner) commented Apr 19, 2026

Summary

Changes

Issue #472 — Missing dependency handling

src/services/pipeline_nodes/handlers.py:

  • RetrieveContextHandler, ImageGenerateHandler, NotifyHandler, SearchQueryTriggerHandler now call context.record_error(code="missing_dependency", ...) instead of silent logger.info/warning + return

src/services/unified_dispatcher.py:

  • _handle_photo_due and _handle_photo_auto now return CollectionTaskStatus.FAILED instead of COMPLETED when services are not configured

Issue #473 — JSON serialization safety

New file src/utils/json.py:

  • safe_json_dumps() function with fallbacks for:
    • datetime, date → .isoformat()
    • bytes → .hex()
    • Pydantic v2 models → .model_dump()
    • Unknown types → repr()
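Based on the fallback list above, a minimal sketch of what such a helper might look like (the actual implementation in src/utils/json.py may differ in signature and details, and the repr() fallback was later revised during review):

```python
import json
from datetime import date, datetime


def _default(o: object) -> object:
    """Fallback converter for types json.dumps cannot handle natively."""
    if isinstance(o, (datetime, date)):
        return o.isoformat()
    if isinstance(o, bytes):
        return o.hex()
    # Pydantic v2 models expose model_dump(); duck-typed here to keep the
    # sketch free of a hard pydantic dependency.
    model_dump = getattr(o, "model_dump", None)
    if callable(model_dump):
        return model_dump()
    return repr(o)  # last resort; see the review discussion below


def safe_json_dumps(obj: object, **kwargs: object) -> str:
    """json.dumps that degrades gracefully instead of raising TypeError."""
    return json.dumps(obj, default=_default, **kwargs)
```

The key mechanism is the `default=` hook: `json.dumps` calls it only for objects it cannot serialize itself, so plain str/int/dict payloads are unaffected.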

Replaced json.dumps with safe_json_dumps in:

  • src/database/repositories/collection_tasks.py
  • src/database/repositories/telegram_commands.py
  • src/web/routes/pipelines.py
  • src/services/agent_provider_service.py
  • src/services/image_provider_service.py
  • src/agent/manager.py

Closes

Test plan

  • ruff check passes
  • pytest tests/ passes (parallel + serial)
  • Manual: pipeline with missing search_engine now shows error in context
  • Manual: channel collect with service messages containing datetime/bytes no longer crashes

🤖 Generated with Claude Code

Hapi and others added 2 commits April 19, 2026 11:42
…rnal data

- #472: Replace silent skips with explicit error recording in pipeline handlers
  - RetrieveContextHandler, ImageGenerateHandler, NotifyHandler, SearchQueryTriggerHandler now use context.record_error with code="missing_dependency"
  - Unified dispatcher photo handlers now return FAILED instead of COMPLETED when services are missing

- #473: Add safe_json_dumps utility and replace unsafe json.dumps calls
  - New src/utils/json.py with safe_json_dumps handling datetime, bytes, Pydantic models
  - Replace 28 json.dumps calls across collection_tasks, telegram_commands, pipelines, agent_provider_service, image_provider_service, agent/manager

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ect FAILED

Aligns tests with the behaviour change in #472: dispatcher now marks tasks as
FAILED (not COMPLETED) when PhotoTaskService / PhotoAutoUploadService are absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@axisrow (Owner, Author) commented Apr 19, 2026

What still needs follow-up from #472

This PR closes the main part of the issue, but not all of the HIGH risks described in #472. Below is a list of what should be done either in this PR or in a follow-up.

1. PublishService: pool.get_client_by_phone() without a None guard

src/services/publish_service.py:~87: client_pool is passed to the constructor typed as object (not optional), but is used internally without an is None check. If someone passes None (as the CLI did before #470), we get an AttributeError instead of a clear error.

Fix: either type the parameter as required and add a runtime check in __init__, or add an explicit if self._client_pool is None: raise ConfigurationError(...) in publish_run.

2. UnifiedDispatcher._handle_content_publish: unguarded self._client_pool

src/services/unified_dispatcher.py:~683: pool = self._client_pool is used without a None check. The task ends up marked COMPLETED, or crashes, instead of FAILED.

Fix: add a check at the start of the handler, mirroring _handle_photo_due from this PR:

if self._client_pool is None:
    await self._tasks.update_collection_task(
        task.id,
        CollectionTaskStatus.FAILED,
        error="client_pool not configured",
    )
    return

3. ReactHandler / ForwardHandler / DeleteMessageHandler

These already called context.record_error(code="no_client_pool", ...) before this PR, so formally "the error is recorded". But there is a nuance:

Idea: if a handler cannot perform a single operation because of a missing dependency, it makes sense to raise an exception so that the whole run is marked status=failed rather than completed. The current behavior masks misconfiguration as "successfully performed 0 operations".

4. Medium-risk spots from #473

Besides the HIGH items, issue #473 listed other places that needed review:

  • src/services/agent_provider_service.py:~168: replaced ✅
  • src/services/image_provider_service.py:~122: replaced ✅
  • src/agent/manager.py:271, 290, 301, 742: replaced ✅

Everything from the audit is covered by this PR. Still, it would be worth adding a ruff rule or pre-commit hook that flags new json.dumps(...) calls without default= in the sensitive modules (src/database/repositories/, src/services/, src/web/routes/), so we don't regress.

5. Tests for the new behavior

The existing tests were updated to expect FAILED instead of COMPLETED, which is good. But positive tests for the new contract are worth adding:

  • ReactHandler without client_pool: context.get_errors() contains an entry with code="no_client_pool" (already covered?)
  • RetrieveContextHandler without search_engine: context.get_errors() contains code="missing_dependency" (new behavior in this PR)
  • Likewise for ImageGenerate, Notify, SearchQueryTrigger

Summary: this PR closes roughly 80% of issue #472 and 100% of issue #473. The remaining 20% (PublishService, _handle_content_publish, and the react/forward/delete behavior when client_pool=None) can be finished here or split into a follow-up, at your discretion.
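A self-contained sketch of the kind of positive test suggested in item 5. FakeContext and the handler stub below are illustrative stand-ins for the real PipelineContext and RetrieveContextHandler, whose constructor signatures may differ:

```python
# Minimal stand-ins so the error-recording contract can be shown in isolation.
class FakeContext:
    def __init__(self):
        self._errors = []

    def record_error(self, *, node_id, code, detail=""):
        self._errors.append({"node_id": node_id, "code": code, "detail": detail})

    def get_errors(self):
        return list(self._errors)


class RetrieveContextHandlerStub:
    """Mimics the new contract: no search_engine -> record missing_dependency."""

    def __init__(self, search_engine=None):
        self._search_engine = search_engine

    def execute(self, context, node_id="retrieve-1"):
        if self._search_engine is None:
            context.record_error(
                node_id=node_id,
                code="missing_dependency",
                detail="search_engine not configured",
            )
            return
        ...  # real retrieval logic would go here


def test_retrieve_records_missing_dependency():
    ctx = FakeContext()
    RetrieveContextHandlerStub(search_engine=None).execute(ctx)
    assert any(e["code"] == "missing_dependency" for e in ctx.get_errors())
```

The same shape applies to the ImageGenerate, Notify, and SearchQueryTrigger cases: construct the handler without its dependency, execute, and assert on context.get_errors().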

- PublishService._publish_to_target: early return when client_pool is None
- UnifiedDispatcher._handle_content_publish: fail task when client_pool missing
- React/Forward/Delete handlers: raise RuntimeError after record_error
- Tests updated: pytest.raises for RuntimeError, assert missing_dependency errors

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@axisrow (Owner, Author) commented Apr 19, 2026

@claude review. Focus on critical issues: bugs, security vulnerabilities, logical errors, data loss risks, performance problems. Do NOT nitpick style, naming conventions, minor formatting, or subjective preferences — only flag issues that could break functionality or cause real harm in production.

@claude bot commented Apr 19, 2026

Claude encountered an error.


I'll analyze this and get back to you.

@axisrow (Owner, Author) commented Apr 19, 2026

@claude review. Focus on critical issues: bugs, security vulnerabilities, logical errors, data loss risks, performance problems. Do NOT nitpick style, naming conventions, minor formatting, or subjective preferences — only flag issues that could break functionality or cause real harm in production.

@claude bot commented Apr 19, 2026

Claude encountered an error.


I'll analyze this and get back to you.

@axisrow (Owner, Author) commented Apr 19, 2026

@claude please review this PR. Focus only on critical issues.

@claude bot commented Apr 19, 2026

Claude finished @axisrow's task in 3m 43s.


PR Review

  • Read changed files
  • Review safe_json_dumps utility
  • Review missing dependency handling in handlers
  • Review dispatcher changes
  • Check for critical bugs / data loss risks

Critical Issues

1. context.record_error() before raise RuntimeError() is wasted — node_errors never persisted

src/services/pipeline_nodes/handlers.py — ReactHandler, ForwardHandler, DeleteMessageHandler

context.record_error(node_id=node_id, code="no_client_pool", ...)
logger.warning("ReactHandler[%s]: no client_pool, skipping", node_id)
raise RuntimeError("ReactHandler: client_pool not available")   # ← NEW in this PR

When RuntimeError is raised, execution unwinds through pipeline_executor.py (line 123: re-raises after logging), then through content_generation_service.py (line 183: marks run failed, re-raises). The node_errors = context.get_errors() at pipeline_executor.py:135 is never reached, so the structured error recorded by context.record_error() is never written to generation_runs.metadata.

The net result: the run has status=failed and error="ReactHandler: client_pool not available" in the task, but the metadata.node_errors that the UI reads is absent — worse than the previous behavior where node_errors was present. The context.record_error() call before the raise is a no-op and should either be removed or the executor needs to persist context.get_errors() in the exception path.



2. repr() fallback silently corrupts DB records for unknown types

src/utils/json.py:27

return repr(o)   # last-resort fallback

When an unexpected non-serializable type reaches collection_tasks.py or telegram_commands.py, the row is written with a Python repr() string like <TLMessage object at 0x7f...>. The original json.dumps would raise TypeError here (catching the bug early); the new code silently stores a value that cannot be deserialized, corrupting the record and potentially breaking task replay or recovery.

The fallback is safe for SSE/streaming paths (agent manager) where the string is just displayed. But for DB write paths it changes fail-fast into silent data corruption. Consider either raising in those call sites or adding a sentinel so corrupt payloads are detectable:

raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

versus keeping repr() only for display paths (SSE, logging).
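One way to express the split the review suggests (fail-fast for DB write paths, best-effort for display paths) is a strict flag. This is a sketch of that alternative, not the shape the PR ultimately adopted (the final code simply raises TypeError, as the next review round confirms):

```python
import json


def _default_factory(strict: bool):
    def _default(o: object) -> object:
        if strict:
            # DB write paths: surface the bug instead of storing repr() garbage.
            raise TypeError(
                f"Object of type {type(o).__name__} is not JSON serializable"
            )
        return repr(o)  # display paths (SSE, logging): best effort
    return _default


def safe_json_dumps(obj: object, *, strict: bool = True) -> str:
    return json.dumps(obj, default=_default_factory(strict))
```

Callers in repositories would use the default strict mode, while the SSE/streaming paths would opt into strict=False.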



Non-Critical Observations

  • safe_json_dumps in hash computation (agent_provider_service.py:463,474): In practice secret_payload and payload only contain strings/simple types so repr() will never trigger — no functional risk. However, if it ever did trigger, the hash would be non-deterministic across Python versions, causing spurious cache invalidation.

  • Behavioral asymmetry: RetrieveContextHandler, ImageGenerateHandler, NotifyHandler, SearchQueryTriggerHandler — record error and return (pipeline continues). ReactHandler, ForwardHandler, DeleteMessageHandler — record error and raise (pipeline aborts). This is intentional per the PR but worth documenting: React/Forward/Delete on a misconfigured worker will now fail the entire run, not just those steps.

  • PublishService None guard (publish_service.py:84): Correct fix, PublishResult(success=False, error=...) is the right return for that method signature.

  • _handle_content_publish None guard (unified_dispatcher.py:677): Correct placement, properly before pool = self._client_pool is dereferenced.


Summary

Two issues warrant attention before merge:

  1. The context.record_error() / raise RuntimeError combination in React/Forward/Delete handlers produces worse observability than intended — the structured error is silently dropped.
  2. The repr() fallback in safe_json_dumps is appropriate for streaming display but risky for DB write paths where it could silently corrupt records.

- pipeline_executor: wrap node loop in try/finally so context.get_errors()
  is always captured, even when a handler raises RuntimeError
- safe_json_dumps: replace repr() fallback with TypeError to prevent
  silent DB corruption from non-serializable objects

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@axisrow (Owner, Author) commented Apr 19, 2026

@claude review. Focus on critical issues: bugs, security vulnerabilities, logical errors, data loss risks, performance problems. Do NOT nitpick style, naming conventions, minor formatting, or subjective preferences — only flag issues that could break functionality or cause real harm in production.

@claude bot commented Apr 19, 2026

Claude finished @axisrow's task in 2m 42s.


PR Review

  • Read changed files
  • Review safe_json_dumps utility and its callers
  • Review missing dependency handling in handlers
  • Review pipeline executor exception path
  • Check dispatcher changes
  • Check for critical bugs / data loss risks

Critical Issue

node_errors is silently dropped when React/Forward/Delete handlers raise

src/services/pipeline_executor.py:133-134 + src/services/pipeline_nodes/handlers.py:313-320

When ReactHandler, ForwardHandler, or DeleteMessageHandler encounters client_pool is None, the flow is:

# handlers.py — record error, then raise
context.record_error(node_id=node_id, code="no_client_pool", ...)
raise RuntimeError("ReactHandler: client_pool not available")
# pipeline_executor.py — outer try/finally
try:
    for node in ordered:
        try:
            await handler.execute(...)   # ← raises here
        except Exception:
            logger.exception(...)
            raise
        finally:
            services["_current_node_id"] = prev   # ← restores node id
finally:
    node_errors = context.get_errors()   # ← runs, but only a local assignment!

# ← exception has already propagated; this return is never reached:
return {..., "node_errors": node_errors, ...}

The outer finally does capture the errors into the local node_errors variable, but since the exception propagates past the return statement, the errors are never included in any return value. Back in content_generation_service.generate(), the except Exception at line 183 only calls set_status(run_id, "failed") with no metadata — so generation_runs.metadata.node_errors is never persisted.

The unit tests in test_pipeline_nodes_handlers.py (test_react_records_error_when_no_client_pool, etc.) verify the handler-level behavior correctly, but don't catch the integration-level loss.

Net result: the run gets status=failed with a bare error string, but the structured node_errors (with code, detail) that the UI reads from metadata is silently lost. This is functionally worse than recording nothing, because it creates a false expectation that node_errors is populated.

Fix option A — persist errors in the exception path inside pipeline_executor.py:

except Exception as exc:
    logger.exception(...)
    raise RuntimeError(str(exc)) from exc  # carry node_errors separately

Or better: catch and re-raise with node_errors stored in a custom exception attribute.

Fix option B — remove context.record_error() before raise in these three handlers (the RuntimeError message already surfaces as the run's error string; there's no point recording it as structured node error if it can't be persisted).
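Option A could also be sketched as a dedicated exception type that carries the structured errors across the stack unwind. Names here are illustrative; the PR later settled on attaching a plain exc.node_errors attribute instead of a custom class:

```python
class PipelineNodeError(RuntimeError):
    """Carries structured node errors across the stack unwind."""

    def __init__(self, message: str, node_errors: list[dict]):
        super().__init__(message)
        self.node_errors = node_errors


def run_nodes(context, ordered, execute):
    """Illustrative executor loop: on failure, re-raise with errors attached."""
    try:
        for node in ordered:
            execute(node, context)
    except Exception as exc:
        errors = context.get_errors()
        if errors:
            # The caller can now persist the structured errors when it marks
            # the run as failed, instead of losing them.
            raise PipelineNodeError(str(exc), errors) from exc
        raise
```

The caller then reads exc.node_errors in its except block and writes it into the run's metadata alongside the failed status.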



Previously Flagged Issue — Now Resolved

The prior review raised concern about repr() silently corrupting DB records for unknown types. The current code at src/utils/json.py:27 correctly raises TypeError (fail-fast) instead:

raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

This is the right behavior. ✅


Everything Else Looks Correct

  • PublishService._publish_to_target (publish_service.py:84-86): if pool is None: return PublishResult(success=False, ...) — correct placement and return type.
  • _handle_content_publish None guard (unified_dispatcher.py:677-683): mirrors the _handle_photo_due pattern from this PR. Correct.
  • _handle_photo_due / _handle_photo_auto returning FAILED instead of COMPLETED on missing services: correct.
  • RetrieveContextHandler, ImageGenerateHandler, NotifyHandler, SearchQueryTriggerHandler: record error + return (pipeline continues). This is intentional and tested.
  • safe_json_dumps in hash paths (agent_provider_service.py:463,474): secret_payload and payload only contain strings/dicts in practice; TypeError will never trigger here, so no functional risk.

Summary: One critical issue remains — the node_errors captured in pipeline_executor.py's finally block is discarded when a handler raises, so structured errors from React/Forward/Delete misconfiguration are never persisted to generation_runs.metadata. Everything in the #473 JSON serialization fix is correct as-is.

…wn types

- pipeline_executor: attach node_errors to exception via exc.node_errors
  so callers can persist structured errors even on RuntimeError
- content_generation_service: extract exc.node_errors and save to DB
  metadata when marking run as failed
- generation_runs repo: set_status now accepts optional metadata param
- safe_json_dumps: replace repr() fallback with TypeError (fail-fast)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@axisrow (Owner, Author) commented Apr 19, 2026

@claude review. Focus on critical issues: bugs, security vulnerabilities, logical errors, data loss risks, performance problems. Do NOT nitpick style, naming conventions, minor formatting, or subjective preferences — only flag issues that could break functionality or cause real harm in production.

@claude bot commented Apr 19, 2026

Claude finished @axisrow's task in 2m 15s.


PR Review

  • Read changed files
  • Review safe_json_dumps utility and callers
  • Review pipeline executor exception path
  • Review handler missing-dep changes
  • Review dispatcher changes
  • Check for critical bugs / data loss risks

All Previously Flagged Critical Issues Are Now Fixed

node_errors propagation via exc.node_errors

Prior reviews flagged that context.record_error() before raise RuntimeError in ReactHandler/ForwardHandler/DeleteMessageHandler would lose the structured errors. This is now fully resolved:

pipeline_executor.py:124-129

except Exception as exc:
    logger.exception("Node %s (%s) failed during pipeline execution", ...)
    errors = context.get_errors()      # picks up the no_client_pool error
    if errors:
        exc.node_errors = errors       # attached to the exception
    raise

content_generation_service.py:189-191

node_errors = getattr(exc, "node_errors", None)
meta = {"node_errors": node_errors} if node_errors else None
await self._db.repos.generation_runs.set_status(run_id, "failed", metadata=meta)

The context.record_error() call in the handlers is not wasted — errors are captured before the stack unwinds, attached to the exception object, and persisted to generation_runs.metadata. ✅

repr() fallback replaced with TypeError

src/utils/json.py:27

raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")

Fail-fast for unknown types — no silent DB corruption. ✅

PublishService / _handle_content_publish None guards ✅

  • publish_service.py:84-86: if pool is None: return PublishResult(success=False, ...)
  • unified_dispatcher.py:677-683: guard before pool = self._client_pool dereference ✅

No New Critical Issues Found

The complete serialization chain for node_errors is safe: record_error() stores only str/int values → context.get_errors() returns list[dict] with those types → json.dumps(metadata) in generation_runs.set_status serializes cleanly with no safe_json_dumps needed.

_handle_photo_due / _handle_photo_auto returning FAILED instead of COMPLETED on missing services is correct behavior and properly tested.


Summary

All three critical issues identified in prior review cycles have been addressed. The PR is correct as-is and can be merged.

- FakeGenerationRunsRepo.set_status: accept optional metadata param
- content_publish tests: add client_pool mock to dispatcher fixtures

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
axisrow merged commit 42d2ae0 into main on Apr 19, 2026
5 checks passed
axisrow deleted the fix/472-473-missing-dependencies-json-serialization branch on April 19, 2026 at 15:15