Skip to content

chnlqsray/local_rag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

PDF 知识库问答助手 · Private Knowledge Base Q&A Assistant

一套本地优先的私有文档问答系统,将 PDF / Markdown 文件向量化后持久化到本地知识库,支持跨会话的多轮问答与引用可追溯,彻底解决网页 AI 上下文溢出和语境丢失的问题。约 4,800 行代码,由本人主导需求与架构,通过与 AI 协作完成开发,未直接编写一行代码。

A local-first private knowledge base Q&A system that vectorises PDF / Markdown files into a persistent local knowledge base, enabling cross-session multi-turn Q&A with traceable citations — fundamentally solving the context-overflow and memory-loss problems of web-based AI tools. ~4,800 lines of code, designed and directed by me through structured LLM collaboration, without writing any code directly.

YouTube 演示视频 · Bilibili 演示视频


核心特性 · Key Features

文档处理 · Document Processing

PDF 文本提取实现四级 fallback,每页独立判断最优引擎:pdfplumber → PyMuPDF → pypdf → RapidOCR(扫描件)。同时支持 Markdown 文件导入,自动清洗 HTML 表格标签、LaTeX 公式格式命令、CDN 图片链接等噪音。商业 PDF 水印行(ISO Store 购买水印、"Licensed to" 行、版权声明等)在解析阶段自动剔除,不污染向量空间。

PDF text extraction uses a four-level fallback with per-page engine selection: pdfplumber → PyMuPDF → pypdf → RapidOCR (for scanned documents). Markdown import is also supported, with automatic cleaning of HTML table tags, LaTeX formula commands, CDN image links, and other noise. Commercial PDF watermark lines (ISO Store purchase watermarks, "Licensed to" lines, copyright footers) are automatically removed during parsing to avoid polluting the vector space.

持久化知识库 · Persistent Knowledge Base

向量索引按文件 MD5 哈希分目录存储在本地 FAISS 库,跨会话直接加载,无需重新处理文档。安全替换策略:版本更新时先在临时路径建索引,成功后才覆盖旧文件,失败时旧版本完整保留。重复内容检测:不同文件名但内容相同的文档不重复导入,索引自动复用。

Vector indexes are stored per-file in local FAISS directories keyed by MD5 hash, reloaded directly across sessions without reprocessing. Safe-replace strategy: on version update, the new index is built in a temp path first; only on success is the old file replaced — otherwise the old version is fully preserved. Duplicate-content detection: files with identical content but different names are not re-indexed; the existing index is reused.

多层检索增强 · Multi-Layer Retrieval Enhancement

检索流水线按顺序应用以下增强层:

The retrieval pipeline applies these enhancement layers in sequence:

  1. 全局 L2 排序 + 来源多样性控制 — 各向量库分别召回,全局升序排序,单文档最多贡献 N 条,避免一份文档独占参考集 / Global L2 ranking across all stores with per-source diversity capping
  2. Query-chunk 关键词对齐评分 — 提取问题关键词,按 chunk 覆盖度调整分数(高覆盖奖励 / 低覆盖降权)/ Keyword coverage scoring: rewards chunks with high question-keyword overlap
  3. 元数据噪声降权 — 利用建索引时标注的结构化元数据(噪声标记 / 章节角色),对图注页、版权残留、购买水印、URL 密集等内容降权 / Metadata-based noise downweighting using section_role and noise_flags annotated at index-build time
  4. 问题类型特判 — 定义/比较类触发证据类型优先排序(定义条款 > 要求条款);参数/工艺类触发数值密度评分(含具体数字+单位的 chunk 前移)+ 扩大候选池(k=30)+ 补捞 rescue pass / Question-type-specific reranking: definition questions prioritise definition clauses; parameter questions boost chunks with numerical density and run a rescue pass
  5. 低价值 chunk 质量过滤 — 过滤目录页、封面页、模板页、版权反馈页等无证据价值内容 / Low-value chunk filtering: removes TOC pages, cover pages, template pages, copyright/feedback pages
  6. 可选 bge-reranker 精排 — 配置 SiliconFlow API Key 后可用;定义/比较/参数类问题自动触发,其他问题可手动开启 / Optional SiliconFlow bge-reranker-v2-m3 second-pass reranking; auto-triggered for definition and parameter questions

引用可追溯 · Traceable Citations

每条回答附带参考段落列表(折叠展示),每条标注:来源文件名、页码、相关度分数、提取引擎标签(OCR/PyMuPDF 等)、内容类型标签([定义][要求][范围][说明][正文])、智能摘要。定义/比较类问题自动触发三段式结构化响应模板(①直接条款引用 / ②合理补充 / ③建议回查),区分直接证据与推断性内容。

Every answer includes a collapsible reference panel with per-chunk annotations: source filename, page number, relevance score, extraction engine tag (OCR/PyMuPDF, etc.), content type tag ([Definition] [Requirement] [Scope] [Note] [Body]), and an intelligent excerpt. Definition and comparison questions auto-trigger a three-section structured response template (① direct clause citation / ② reasoned supplement / ③ suggested follow-up), explicitly separating direct evidence from inferences.

LLM 层与可靠性工程 · LLM Layer & Reliability Engineering

统一 OpenAI 兼容接口,支持多个 Provider:

Unified OpenAI-compatible interface supporting multiple providers:

Provider 访问方式 说明
🔵 魔塔社区 大陆直连 GLM-5 / DeepSeek-V3.2,每日2000次,推荐
🟣 智谱 AI 大陆直连 glm-4.7-flash 免费
🌐 OpenRouter 需代理 qwen3.6-plus:free 等免费模型
⚡ Groq 需代理 速度极快,qwen3-32b
🦙 Ollama Cloud 需代理 DeepSeek V3.2 / Gemma4 云端版

可靠性工程措施:流式输出检测截断(finish_reason=length 或启发式),自动发起续写请求拼接完整回答;推理模型(reasoning_content)特殊处理;空回答分类诊断(stream_empty / reasoning_only / cleaned_empty);rerank 失败静默退化为 L2 排序;会话 JSONL 持久化,支持历史切换与命名;对话可导出为 Markdown(含可选的参考段落原文)。

Reliability engineering: streaming truncation detection (finish_reason=length or heuristic) triggers automatic continuation requests to complete cut-off answers; reasoning model output (reasoning_content) handled separately; empty response classified and diagnosed; rerank failure silently falls back to L2 ordering; session JSONL persistence with history switching and custom naming; conversations exportable as Markdown (optionally including reference chunk text).

Embedding 四路 fallback · Embedding Four-Level Fallback

硅基流动 API(大陆直连,BAAI/bge-m3)→ HuggingFace Inference API → 本地 Ollama(bge-m3) → HF 镜像本地下载(约570MB,仅首次),四路同模型(BAAI/bge-m3),向量空间完全兼容,可安全切换而无需重建索引。

SiliconFlow API (mainland-accessible, BAAI/bge-m3) → HuggingFace Inference API → local Ollama (bge-m3) → HF mirror local download (~570 MB, first time only). All four paths use the same model (BAAI/bge-m3), so the vector space is fully compatible — switching providers never requires rebuilding indexes.


快速开始 · Quick Start

# 1. 安装依赖 / Install dependencies
pip install -r requirements.txt

# 可选:扫描件 OCR 支持 / Optional: OCR support for scanned PDFs
pip install rapidocr-onnxruntime numpy

# 可选:备用 PDF 解析引擎 / Optional: fallback PDF parsers
pip install PyMuPDF pypdf

# 2. 配置 API Key / Configure API keys
# 创建 .streamlit/secrets.toml / Create .streamlit/secrets.toml:
# MODELSCOPE_API_KEY = "your_modelscope_key"    # 必填,大陆直连 / Required, mainland-accessible
# SILICONFLOW_API_KEY = "your_sf_key"           # 可选,用于 embedding + rerank / Optional, for embedding & rerank
# GROQ_API_KEY = "your_groq_key"                # 可选 / Optional

# 3. 导入 PDF / Import PDFs
# 将 PDF 放入 knowledge_base/source_pdfs/ 后重启应用
# Place PDFs in knowledge_base/source_pdfs/ then restart the app

# 4. 运行 / Run
streamlit run local_RAG.py

或使用界面上传 / Or upload via UI:左侧边栏「导入新文档」,支持 PDF 和 Markdown 文件。


知识库目录结构 · Knowledge Base Directory

knowledge_base/
├── source_pdfs/        原始文件(PDF / Markdown)
├── indexes/            FAISS 向量索引(按 MD5 哈希分目录)
├── manifest.json       文档注册表(哈希 / 路径 / 页数 / 块数等)
├── failed_imports.json 导入失败记录
├── sessions/           会话历史(每次启动独立 JSONL 文件)
└── cache/ocr/          OCR 识别结果缓存(避免重复识别)

关于本项目 · About This Project

本工具由本人主导需求定义、系统架构与功能设计,通过与 Claude、Gemini 等大模型持续协作完成开发,未直接编写任何代码。这是"工程师思维 + AI 工具放大产出"方法论在专业工具场景下的实践案例。

This tool was designed and directed by me — covering requirements, system architecture, and feature specification — and implemented entirely through structured collaboration with Claude and Gemini. No code was written by hand. It demonstrates the methodology of engineering thinking amplified by AI tooling, applied to a professional-use-case tool.


Independently designed and delivered · 2026

About

Local-first private knowledge base Q&A system

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages