fix(web_search): drop misnamed X-Return-Format header on Jina providers #1799
Jina's ``s.jina.ai`` interprets ``X-Return-Format: json`` as a
markdown-format hint when an API key is present (per huangheng's
empirical curl matrix): the response body comes back as
``text/plain`` instead of JSON, ``await response.json()`` then
raises in ``JinaSearchProvider.search``, the provider silently
returns ``[]``, and ``_search_with_jina_fallback`` cascades to
DuckDuckGo. On networks where DuckDuckGo's anomaly check is
active (most non-residential / cloud IPs), the fallback also
returns ``[]``, and the agent surfaces "web search / web page
reading failed repeatedly" to the end user even with a valid
Jina key configured.
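The cascade described above can be reproduced in miniature. This is a simplified sketch, not the project's code: ``parse_search_body`` and ``search_with_fallback`` are hypothetical stand-ins, the ``data`` field is an assumed payload shape, and plain ``json.loads`` stands in for ``await response.json()``.

```python
import json
from typing import Callable, List, Tuple


def parse_search_body(body: str) -> List[dict]:
    """Hypothetical stand-in for the pre-fix provider: decode errors become []."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A text/plain (markdown) body lands here; the caller only sees [].
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("data", [])  # "data" is an assumed field name


def search_with_fallback(
    primary: Callable[[], List[dict]],
    fallback: Callable[[], List[dict]],
) -> Tuple[List[dict], bool]:
    """Return (results, fallback_used); the fallback runs only if primary is empty."""
    results = primary()
    if results:
        return results, False
    return fallback(), True


# A markdown-style body, as returned when the header is misread.
markdown_body = "Title: Example\nURL Source: https://example.com\n"

results, fallback_used = search_with_fallback(
    lambda: parse_search_body(markdown_body),
    lambda: [],  # stands in for a DuckDuckGo call blocked by the anomaly check
)
```

Both layers return an empty list without raising, so the only visible symptom is an empty result with the fallback flag set.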
Drop the misnamed header. ``Accept: application/json`` (set in
``__init__``) is the canonical signal Jina honours; with it alone,
``s.jina.ai`` returns proper ``application/json`` responses and the
provider parses them into ``WebSearchResultItem`` entries as
expected.
Mirror the same cleanup on ``JinaReaderProvider`` so the two
providers stay consistent.
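A minimal sketch of the resulting header contract, for illustration only. ``build_jina_search_headers`` is a hypothetical helper (the real providers set ``Accept`` in ``__init__``), and the ``Bearer`` scheme and the use of ``source`` as the ``X-Target-Domain`` value are assumptions.

```python
from typing import Dict, Optional


def build_jina_search_headers(
    api_key: Optional[str],
    source: Optional[str] = None,
) -> Dict[str, str]:
    """Build outbound headers without the misnamed X-Return-Format entry."""
    # Accept alone is the signal that requests a JSON response.
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the key is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    if source:
        # Assumption: the source domain is passed through unchanged.
        headers["X-Target-Domain"] = source
    return headers


headers = build_jina_search_headers("jina_example_key", source="example.com")
```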
New regression test: ``tests/unit_test/web_access/test_jina_search_provider_headers.py``
pins three invariants —
* outbound headers contain ``Accept: application/json`` and
``Authorization``, but never ``X-Return-Format``;
* a valid JSON response is parsed into ranked results;
* ``X-Target-Domain`` is still set when ``source`` is provided.
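The second invariant (a valid JSON response parsed into ranked results) might look roughly like the sketch below. ``ResultItem`` is a hypothetical stand-in for ``WebSearchResultItem``, whose real fields are not shown here, and the payload shape (a ``data`` list with ``title``, ``url``, ``description``) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ResultItem:
    """Hypothetical stand-in for WebSearchResultItem."""
    rank: int
    title: str
    url: str
    snippet: str


def parse_ranked_results(payload: Dict[str, Any], max_results: int = 5) -> List[ResultItem]:
    """Turn an assumed {"data": [...]} payload into 1-based ranked items."""
    items: List[ResultItem] = []
    for entry in payload.get("data") or []:
        url = entry.get("url")
        if not url:
            continue  # skip entries that cannot be linked
        items.append(
            ResultItem(
                rank=len(items) + 1,
                title=entry.get("title", ""),
                url=url,
                snippet=entry.get("description", ""),
            )
        )
        if len(items) >= max_results:
            break
    return items


sample = {
    "data": [
        {"title": "A", "url": "https://a.example", "description": "first"},
        {"title": "No link"},
        {"title": "B", "url": "https://b.example", "description": "second"},
    ]
}
ranked = parse_ranked_results(sample)
```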
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
a48660c to cddd4e7
CR by @huangheng: 🟢 LGTM
✅ CI all green (per own-up #10 explicit verify)
10/10 lanes green.
Verification (per own-up #11 SOP)
Empirical confirmation (this session)
User-reported web search failure root-caused via direct API curl: `s.jina.ai` ignores the `X-Return-Format` header entirely; only `Accept: application/json` returns JSON. In the pre-fix code, `response.json()` threw `JSONDecodeError`, which was silently caught, producing empty results and a fallback to DuckDuckGo (also unreliable from China). The user saw "search returned no results" with `fallback_used:true` on every Jina-routed query. Post-fix verified locally via PR API:
simple-stable + 12-invariant
Wave 10/11 conflict check
✅ Doesn't touch any of #1786 / #1789 / #1790 / #1792 / #1793 / #1796 / #1797 territory (Jina provider files are own surface).
Verdict
🟢 LGTM ✅: clean diagnosis-driven hot-fix, regression-pinned via 3 unit tests citing empirical evidence, 10/10 CI green. Per earayu2 directive (msg=d3ece0eb "proceed with CR and merge"), proceeding with squash merge per own-up #10 SOP.
Architect ratify ✅: 10/10 CI green, explicitly verified (lint + smoke + provider + preflight). 12-invariant + simple-stable 4-guardrail all pass. The Jina API header contract fix is scoped to 4 files (+185/-7) with 3 regression tests. The mini-pattern 13 diagnostic chain (caller / runtime input / reproduce) confirmed the root cause is Jina 4xx response handling, not wave timing. Proceeding with squash merge per own-up #10 SOP + earayu2 directive "proceed with merge".
…_PROXY (#1802)
* fix(web_access): aiohttp ClientSession needs trust_env=True for HTTPS_PROXY
aiohttp's ClientSession does NOT honor HTTPS_PROXY / HTTP_PROXY env vars by default; that requires explicit ``trust_env=True``. libcurl (and ``curl(1)``) honor those env vars by default, hence the asymmetry users hit on CN deploys: ``curl https://s.jina.ai/?q=test`` returns HTTP 401 in 0.7s while the same host's aiohttp client gets ``ClientConnectorError: Cannot connect to host s.jina.ai:443 ssl:default [Connection reset by peer]``.
The fix is a 1-field add to the 3 user-facing web_access providers:
- aperag/domains/web_access/search/providers/jina_search_provider.py
- aperag/domains/web_access/reader/providers/jina_read_provider.py
- aperag/domains/web_access/reader/providers/trafilatura_read_provider.py
The 2 internal weixin clients (``aperag/utils/weixin/client.py``) are intentionally left untouched; those run on internal corporate networks and don't need a proxy.
Behaviour:
- Production / public-internet deploys without HTTPS_PROXY set: no change (``trust_env=True`` is a no-op when no env var exists).
- Local dev or private-cloud deploys behind a corporate / regional proxy: ``export HTTPS_PROXY=...`` is now sufficient; no code patch needed.
Adds a regression test pinning ``trust_env=True`` on the search provider's ClientSession kwargs, mirroring PR #1799's header contract tests.
Closes #7 (web search always fails; the root cause had 2 layers: PR #1799 fixed the misnamed ``X-Return-Format`` header, and this PR fixes the aiohttp env-proxy default). Long-term CN provider work tracked in issue #1800 (SearxNG/Tavily/Bing multi-provider).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(web_access): apply ruff format on trust_env hot-fix
ruff format collapsed the multi-line ClientSession invocation in jina_search_provider.py + jina_read_provider.py back to single-line since the resulting line stays under the configured line-length budget. Pure formatting cleanup, no behaviour change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
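The ``trust_env`` change can be sketched as follows. This is an illustration rather than the providers' actual code: ``client_session_kwargs`` and ``https_proxy_from_env`` are hypothetical helpers, and the block stays stdlib-only, so the ``aiohttp.ClientSession`` call appears only in a comment.

```python
import os
from typing import Any, Dict, Optional


def client_session_kwargs() -> Dict[str, Any]:
    """Keyword arguments intended for aiohttp.ClientSession(**kwargs)."""
    return {
        # Opt in to HTTPS_PROXY / HTTP_PROXY from the environment.
        "trust_env": True,
        "headers": {"Accept": "application/json"},
    }


def https_proxy_from_env() -> Optional[str]:
    """Return the HTTPS proxy visible to this process, checking both spellings."""
    return os.environ.get("https_proxy") or os.environ.get("HTTPS_PROXY")


# Intended usage (requires aiohttp, not imported here):
#     async with aiohttp.ClientSession(**client_session_kwargs()) as session:
#         ...
kwargs = client_session_kwargs()
```

Logging ``https_proxy_from_env()`` at startup is one way to confirm whether a deploy actually has a proxy configured before suspecting the client.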
Summary
Web search through the agent fails with empty results ("Web search: multiple attempts all returned empty results") even with a valid Jina API key configured. Root cause, empirically located by @huangheng: Jina's `s.jina.ai` interprets `X-Return-Format: json` (with a valid key) as a markdown hint and returns `text/plain`; the provider's `await response.json()` then fails, the provider silently returns `[]`, and `_search_with_jina_fallback` cascades to DuckDuckGo (which is anomaly-blocked from most non-residential IPs).
Fix
Test plan
🤖 Generated with Claude Code