feat: search citations (source passthrough + structured fields + url_citation annotations) #468
Merged
chenyme merged 3 commits into chenyme:main on Apr 19, 2026
… Sources
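Pass the webSearchResults and xSearchResults carried in the Grok SSE stream through to downstream consumers.
Collection layer (StreamAdapter):
- webSearchResults: use the original url + title directly
- xSearchResults: build the URL from postId + username, build the title from the first 50 characters of text, normalize whitespace, and deduplicate across both types with a shared set
- references_suffix() escapes Markdown special characters uniformly before output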
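The collection rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's StreamAdapter: the class name `SourceCollector`, the helper names, the frame shape, the set of escaped characters, and the order of whitespace normalization versus truncation are all assumptions made for the example.

```python
import re

# Characters escaped before a title is placed inside Markdown link text.
# The exact set the PR escapes is not stated; this set is an assumption.
_MD_SPECIAL = re.compile(r"([\\`*_\[\]<>])")


def normalize_ws(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def escape_md(text):
    """Backslash-escape Markdown special characters."""
    return _MD_SPECIAL.sub(r"\\\1", text)


class SourceCollector:
    """Hypothetical stand-in for the source-collecting part of StreamAdapter."""

    def __init__(self):
        self._seen = set()    # one set shared by web results and X posts
        self._sources = []

    def _add(self, url, title):
        if not url or url in self._seen:
            return
        self._seen.add(url)
        # Falling back to the URL for an empty title is an assumption.
        self._sources.append({"url": url, "title": title or url})

    def feed(self, frame):
        """Collect sources from one parsed frame (assumed to be a dict)."""
        for item in frame.get("webSearchResults") or []:
            self._add(item.get("url", ""), normalize_ws(item.get("title", "")))
        for post in frame.get("xSearchResults") or []:
            post_id, username = post.get("postId"), post.get("username")
            if not post_id or not username:
                continue
            url = f"https://x.com/{username}/status/{post_id}"
            # Normalize first, then keep the first 50 characters.
            title = normalize_ws(post.get("text", ""))[:50]
            self._add(url, title)

    def references_suffix(self):
        """Render the collected sources as a Markdown section, or ''."""
        if not self._sources:
            return ""
        lines = ["## Sources"]
        for src in self._sources:
            lines.append(f"- [{escape_md(src['title'])}]({src['url']})")
        return "\n\n" + "\n".join(lines)
```

Feeding the same URL in a later frame is a no-op here, which is one way to get the multi-frame accumulation with deduplication that the description mentions.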
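Multi-turn stripping (_extract_message):
- Marker line [grok2api-sources]: # (a CommonMark link reference definition, which renderers do not display)
- The regex covers both string content and block-list content and is CRLF-compatible
- Only the section containing the marker line is matched, so a user-written ## Sources is left untouched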
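A sketch of how such marker-based stripping might look. The exact layout of the appended section is not shown in this PR text, so the example assumes a `## Sources` heading followed immediately by the marker line and then non-blank list lines; the function name `strip_sources` and the regex are illustrative, not the PR's code.

```python
import re

# Assumed layout of the appended section:
#   ## Sources
#   [grok2api-sources]: #
#   - [title](url)
#   ...
# The section is taken to end at the first blank line or at end of text.
_SOURCES_RE = re.compile(
    r"(?:\r?\n)*^## Sources[ \t]*\r?\n"   # heading, plus preceding newlines
    r"\[grok2api-sources\]: #[^\r\n]*"    # marker line must follow directly
    r"(?:\r?\n[^\r\n]+)*",                # following non-blank lines
    re.MULTILINE,
)


def strip_sources(content):
    """Remove the generated Sources section from message content.

    Handles plain string content and a list of content blocks; blocks
    without a string "text" field are passed through unchanged.
    """
    if isinstance(content, str):
        return _SOURCES_RE.sub("", content)
    if isinstance(content, list):
        cleaned = []
        for block in content:
            if isinstance(block, dict) and isinstance(block.get("text"), str):
                block = {**block, "text": _SOURCES_RE.sub("", block["text"])}
            cleaned.append(block)
        return cleaned
    return content
```

Because the pattern requires the marker line directly under the heading, a `## Sources` section written by the user (without the marker) does not match and survives the round trip.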
Configuration:
- features.show_search_sources (default false), toggleable from the admin panel
- Admin panel + i18n for 6 languages (zh/en/de/es/fr/ja)
…ype}]
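Search sources are always emitted as the structured search_sources field.
The show_search_sources boolean only controls whether a ## Sources section is also appended to the body text.
Collection layer (StreamAdapter):
- feed() tags each source at collection time with type: "web" / "x_post"
- New search_sources_list() returns [{url, title, type}] or None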
Injection layer (3 APIs × streaming/non-streaming + the tool_calls path):
- Chat Completions: response root object (avoids the Vercel AI SDK strict schema)
- Responses API: message item level
- Anthropic Messages: response root object / message_delta.delta
- The tool_calls/tool_use paths are covered as well
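To make the placement rules concrete, here is a small sketch for the non-streaming case only. The function name `attach_search_sources`, the `api` labels, and the choice to leave the response untouched when there are no sources are assumptions for illustration; the response dicts are simplified shapes, not complete payloads.

```python
def attach_search_sources(api, response, sources):
    """Return a copy of a non-streaming response with search_sources attached.

    api: "chat_completions", "responses", or "anthropic_messages"
         (hypothetical labels used only in this sketch).
    sources: a list of {url, title, type} dicts, or None.
    """
    if not sources:
        # What the real code emits when nothing was collected is not stated
        # here; this sketch leaves the response unchanged.
        return response
    if api in ("chat_completions", "anthropic_messages"):
        # Both place the field on the response root object.
        return {**response, "search_sources": sources}
    if api == "responses":
        # Responses API: attach at the message item level.
        output = [
            {**item, "search_sources": sources}
            if item.get("type") == "message" else item
            for item in response.get("output", [])
        ]
        return {**response, "output": output}
    raise ValueError(f"unknown api: {api!r}")
```

For Anthropic streaming, the description places the same field inside message_delta.delta instead of on the root object; that path is not shown here.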
Configuration/frontend:
- Updated comments in config.defaults.toml
- Updated descriptions in config.html + i18n for 6 languages
(cherry picked from commit ced0fe1)
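Inline search citations [[N]](url) are now also emitted as OpenAI-standard url_citation annotations (URL, title, text position). CherryStudio, which goes through the Responses API, can render citation cards at the bottom.
Collection layer (StreamAdapter):
- _clean_token() now returns (cleaned_text, local_annotations)
- New state: _pending_citations / _annotations / _text_offset
- _render_replace records metadata as it generates each citation, with per-token positioning
- Three-level title fallback: card → webSearchResults → the URL itself
- FrameEvent gains an annotation_data field; annotations_list() returns a flat list
Injection layer (three endpoints × streaming/non-streaming):
- Responses API: streaming annotation.added real-time events + the annotations arrays on content_part.done / output_item.done
- Chat Completions: message.annotations for non-streaming; delta.annotations on the final chunk for streaming (nested url_citation format)
- Anthropic Messages: TextBlock.annotations for non-streaming; message_delta.delta.annotations for streaming (custom extension)
Design notes:
- Anthropic's standard citations require encrypted_index (a proprietary encrypted index that cannot be generated) + cited_text (the original text of the source page, which Grok does not provide), so the OpenAI flat format is used as a custom extension instead
- Chat Completions streaming puts annotations in delta.annotations rather than choice.annotations: the Vercel AI SDK schema defines delta.annotations precisely, so they must go in the standard location
- Per-token positioning (vs. scanning the full text at end of stream): exact string matching avoids truncation when a URL contains ), keeps the impact of UTF-16 offset deviation manageable, takes the title directly from the card, and lets annotation.added events be emitted naturally
(cherry picked from commit 0fbb89c)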
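The per-token positioning idea can be sketched as a running offset that shifts token-local spans into positions in the full output text. `AnnotationTracker`, `resolve_title`, `to_chat_completions`, and the `start`/`end` keys of a local annotation are hypothetical names for this example; the PR's own `_clean_token()` and `_render_replace` are not reproduced here.

```python
def resolve_title(url, card_title, web_titles):
    """Three-level title fallback: card, then webSearchResults, then the URL."""
    return card_title or web_titles.get(url) or url


class AnnotationTracker:
    """Accumulates url_citation annotations while cleaned tokens stream by."""

    def __init__(self):
        self._text_offset = 0     # length of cleaned text emitted so far
        self._annotations = []    # flat url_citation dicts

    def feed(self, cleaned_text, local_annotations):
        """Shift token-local spans by the running offset.

        Each local annotation is assumed to carry url, title, and start/end
        positions relative to cleaned_text. Offsets here count Python code
        points, not UTF-16 units, so text outside the Basic Multilingual
        Plane would need extra handling for clients that count UTF-16.
        """
        added = []
        for ann in local_annotations:
            flat = {
                "type": "url_citation",
                "url": ann["url"],
                "title": ann["title"],
                "start_index": self._text_offset + ann["start"],
                "end_index": self._text_offset + ann["end"],
            }
            self._annotations.append(flat)
            added.append(flat)
        self._text_offset += len(cleaned_text)
        # The newly added entries are what a streaming endpoint could emit
        # right away, for example as annotation.added events.
        return added

    def annotations_list(self):
        return list(self._annotations)


def to_chat_completions(flat):
    """Wrap a flat annotation in the nested Chat Completions shape."""
    keys = ("url", "title", "start_index", "end_index")
    return {"type": "url_citation", "url_citation": {k: flat[k] for k in keys}}
```

The flat dictionaries match the shape the Responses API uses for url_citation annotations, while Chat Completions nests the same four fields under a `url_citation` key, which is why a small conversion helper is enough to serve both.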
Why
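When Grok performs a search, its SSE stream returns 44 to 400 sources (web pages + X posts), but grok2api previously discarded all of them, so downstream consumers such as GrokSearch MCP saw a sources_count of 0. Inline citations in the model's body text, [[N]](url), were also bare Markdown with no standardized citation metadata. This PR builds a complete search-citation pipeline: source passthrough → structured fields → citation annotations, with the three layers covering all 3 API endpoints.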
Summary
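- ## Sources section (web + X post URL list), controlled by show_search_sources (off by default)
- search_sources: [{url, title, type}]
- url_citation annotations (URL, title, text offsets)
Changes
Collection layer: xai_chat.py StreamAdapter
Source collection:
- webSearchResults: original url + title, accumulated across frames and deduplicated
- xSearchResults: URL built from postId + username; title built from the first 50 characters of text, with whitespace normalized
- _web_search_urls_seen set deduplicates across both types
- references_suffix() escapes Markdown special characters uniformly
- search_sources_list() returns [{url, title, type}] (type: "web" / "x_post")
Citation annotations:
- _clean_token() now returns (cleaned_text, local_annotations)
- _pending_citations / _annotations / _text_offset track three pieces of state
- _render_replace records metadata as it generates each citation, with exact per-token positioning
- FrameEvent gains an annotation_data field; annotations_list() returns a flat list
Injection layer: three endpoints × streaming/non-streaming
Chat Completions (chat.py + _format.py):
- Non-streaming: message.annotations (url_citation format)
- Streaming: delta.annotations on the final chunk
- search_sources placed on the response root object
Responses API (responses.py):
- Streaming: annotation.added real-time events + the annotations arrays on content_part.done / output_item.done
- search_sources placed at the message item level
Anthropic Messages (messages.py):
- Non-streaming: TextBlock.annotations
- Streaming: message_delta.delta.annotations (custom extension, OpenAI flat format)
- search_sources placed on the response root object / message_delta.delta
Multi-turn stripping: chat.py _extract_message
- Marker line [grok2api-sources]: # (a CommonMark link reference definition, which renderers do not display)
- A user-written ## Sources is left untouched
Configuration & UI
- config.defaults.toml adds show_search_sources = false (controls body passthrough only)
Test plan
- With show_search_sources enabled, send a search-type question → a ## Sources section appears at the end of the response
- With it disabled, no ## Sources body is appended, but the search_sources field is still present
- search_sources field structure: [{url, title, type}], where type is "web" or "x_post"
- url_citation annotations: contain url, title, start_index, end_index
- X post URLs follow the format https://x.com/{username}/status/{postId}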
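The structural checks in the test plan could be automated with something like the following. This is a sketch with hypothetical helper names (`check_search_sources`, `check_annotations`); it assumes annotations arrive in the flat format and only validates shape, not whether the offsets point at the right text.

```python
import re

# Expected shape of an X post URL, taken from the test plan above.
X_POST_URL = re.compile(r"^https://x\.com/[^/\s]+/status/[^/\s]+$")
ALLOWED_TYPES = {"web", "x_post"}


def check_search_sources(sources):
    """Return a list of problems found in a search_sources list."""
    problems = []
    for i, src in enumerate(sources):
        for key in ("url", "title", "type"):
            if not isinstance(src.get(key), str):
                problems.append(f"sources[{i}].{key} missing or not a string")
        if src.get("type") not in ALLOWED_TYPES:
            problems.append(f"sources[{i}].type is {src.get('type')!r}")
        elif src["type"] == "x_post" and not X_POST_URL.match(str(src.get("url"))):
            problems.append(f"sources[{i}].url is not an X post URL")
    return problems


def check_annotations(annotations):
    """Return a list of problems found in flat url_citation annotations."""
    problems = []
    for i, ann in enumerate(annotations):
        if ann.get("type") != "url_citation":
            problems.append(f"annotations[{i}].type is {ann.get('type')!r}")
        for key in ("url", "title"):
            if not isinstance(ann.get(key), str):
                problems.append(f"annotations[{i}].{key} missing or not a string")
        start, end = ann.get("start_index"), ann.get("end_index")
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append(f"annotations[{i}] offsets missing or not integers")
        elif not 0 <= start <= end:
            problems.append(f"annotations[{i}] has an invalid span {start}..{end}")
    return problems
```

An empty list from both helpers means the payload has the documented shape; each string in a non-empty list names the offending entry.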