
Add a PDF_OCR_THRESHOLD config option: only run OCR on images whose size exceeds a given fraction of the page (image width / page width, image height / page height). #2525

Merged
merged 1 commit into chatchat-space:dev on Jan 2, 2024

Conversation

liunux4odoo
Collaborator

This avoids interference from small images in PDFs and speeds up processing of non-scanned PDFs.
Taking test_files/langchain.pdf as an example, loading time drops from 38s to under 5s.
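For reference, the new option is a pair of ratios as described in the title (image width / page width, image height / page height). Below is a minimal sketch of the config entry; the exact file it lives in (e.g. configs/kb_config.py) and the default values are assumptions, so check your own checkout:

```python
# Hypothetical config entry; file location and defaults are assumptions.
# An image is only sent to OCR when BOTH its width/page-width and
# height/page-height exceed these ratios, so icons and logos are skipped.
PDF_OCR_THRESHOLD = (0.6, 0.6)
```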

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jan 2, 2024
@liunux4odoo liunux4odoo merged commit aeb7a7e into chatchat-space:dev Jan 2, 2024
@aoteman-z

Hi, I applied the changes from the Files changed tab, but when I upload a PDF without any images it still seems to use RapidOCRPDFLoader. Why is that?
Log output: RapidOCRPDFLoader context page index: 0: 100%|██████████| 1/1 [00:00<00:00, 23.60it/s]
We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
文档切分示例:page_content='6304p\n\nProtocol 118\n\n04/06/2016' metadata={'source': '//zoc_chatchat/knowledge_base/zjm/content/123.pdf'}
Batches: 100%|██████████| 1/1 [00:18<00:00, 18.03s/it]

@liunux4odoo
Collaborator Author

The loader is still the same loader, but the OCR is no longer the same OCR.
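In other words, RapidOCRPDFLoader still handles the file (hence the log line), but OCR is now only run on images that are large relative to the page, so a PDF without big images finishes much faster. A rough sketch of that filtering, assuming PyMuPDF (fitz); this is an illustration of the idea, not the exact code from this PR:

```python
import fitz  # PyMuPDF

PDF_OCR_THRESHOLD = (0.6, 0.6)  # assumed (width ratio, height ratio) default


def iter_ocr_candidates(doc: fitz.Document, page: fitz.Page):
    """Yield pixmaps of images that cover enough of the page to be worth OCR."""
    for img in page.get_image_info(xrefs=True):
        xref = img.get("xref")
        if not xref:
            continue
        x0, y0, x1, y1 = img["bbox"]  # image placement in page coordinates
        # Skip small decorations such as icons, logos and horizontal rules.
        if ((x1 - x0) / page.rect.width < PDF_OCR_THRESHOLD[0]
                or (y1 - y0) / page.rect.height < PDF_OCR_THRESHOLD[1]):
            continue
        yield fitz.Pixmap(doc, xref)
```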

@liunux4odoo liunux4odoo mentioned this pull request Jan 25, 2024
liunux4odoo added a commit that referenced this pull request Jan 25, 2024
New features:
- Optimize OCR for PDF files by filtering out meaningless small images, by @liunux4odoo #2525
- Support the Gemini online model, by @yhfgyyf #2630
- Support the GLM4 online model, by @zRzRzRzRzRzRzR
- Update Elasticsearch to support HTTPS connections, by @xldistance #2390
- Improve OCR recognition for PPT and DOC knowledge-base files, by @596192804 #2013
- Update the Agent dialogue feature, by @zRzRzRzRzRzRzR
- Fetch a connection from the connection pool when the object is created, instead of opening a new connection on every method call, by @Lijia0 #2480
- Make ChatOpenAI check whether the token count exceeds the model's context length, by @glide-the
- Update database runtime error messages and project milestones, by @zRzRzRzRzRzRzR #2659
- Update config files / documentation / dependencies, by @imClumsyPanda @zRzRzRzRzRzRzR
- Add a Japanese README, by @eltociear #2787

Fixes:
- PGVector vector store connection error after the langchain update, by @HALIndex #2591
- Minimax model worker error, by @xyhshen
- Elasticsearch store unable to do vector retrieval; add mappings to create the vector index, by MSZheng20 #2688
@302658980

> The loader is still the same loader, but the OCR is no longer the same OCR.

I've run into this problem as well.
2024-03-19 11:07:34,183 - utils.py[line:286] - INFO: RapidOCRPDFLoader used for /root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf
RapidOCRPDFLoader context page index: 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [05:21<00:00, 32.14s/it]
We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
文档切分示例:page_content='《百度推广服务协议》\n\n0447763\n\n合同编号:\n\n深圳大拿智能设备有限公司\n\n方:\n\n方:百度国际科技(深圳)有限公司\n\n法定代表人:\n\n法定代表人:崔珊珊\n\n联系地址:\n\n联系地址:深圳市南山区粤海街道滨海社区海天一\n\n路8号百度国际大厦西塔楼3层\n\n联系人:\n\n联系人:\n\n电话:\n\n电话:售前:4008060018、售后:4009200000\n\n电子邮件:\n\n电子邮箱:zxhelp@sh.baidu.com\n\n开户行:\n\n开户行:招商银行深圳分行创维大厦支行\n\n账号:' metadata={'source': '/root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf'}

It looks like the upload actually succeeded, and knowledge-base Q&A also works fine, but what is this message in the middle telling me: "We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json."
It makes me a bit nervous.
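For what it's worth, that warning is emitted by the Hugging Face transformers library when it cannot reach huggingface.co to download a model or tokenizer config; since your documents were still split and embedded, it apparently fell back to files already available locally. If the models are on disk, a sketch of how to avoid the network lookup entirely, based on the Transformers offline-mode documentation (the local path below is hypothetical):

```python
import os

# Force offline mode so transformers/huggingface_hub only use the local cache.
# Setting these in the shell environment before starting the service
# (export TRANSFORMERS_OFFLINE=1, export HF_HUB_OFFLINE=1) is the safest way.
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("HF_HUB_OFFLINE", "1")

# Alternatively, point the model setting at a fully downloaded local directory
# that contains config.json instead of a hub id (path is hypothetical):
EMBEDDING_MODEL_PATH = "/root/models/bge-large-zh-v1.5"
```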
