
Add a PDF_OCR_THRESHOLD config option: only run OCR on images whose size exceeds a given fraction of the page (image width / page width, image height / page height). #2525

Merged
merged 1 commit into chatchat-space:dev on Jan 2, 2024

Conversation

liunux4odoo
Collaborator

This avoids interference from small images in PDFs and speeds up processing of non-scanned PDFs.
Taking test_files/langchain.pdf as an example, loading time drops from 38s to under 5s.
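For reference, the new option is a pair of ratios as described in the title (image width / page width, image height / page height). Below is a minimal sketch of the config entry; the exact file it lives in (e.g. configs/kb_config.py) and the default values are assumptions, so check your own checkout:

```python
# Hypothetical config entry; file location and defaults are assumptions.
# An image is only sent to OCR when BOTH its width/page-width and
# height/page-height exceed these ratios, so icons and logos are skipped.
PDF_OCR_THRESHOLD = (0.6, 0.6)
```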

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jan 2, 2024
@liunux4odoo liunux4odoo merged commit aeb7a7e into chatchat-space:dev Jan 2, 2024
@aoteman-z

Hi, I applied the changes from the Files changed tab, but when I upload a PDF without any images it still seems to use RapidOCRPDFLoader. Why is that?
Log output: RapidOCRPDFLoader context page index: 0: 100%|██████████| 1/1 [00:00<00:00, 23.60it/s]
We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
文档切分示例:page_content='6304p\n\nProtocol 118\n\n04/06/2016' metadata={'source': '//zoc_chatchat/knowledge_base/zjm/content/123.pdf'}
Batches: 100%|██████████| 1/1 [00:18<00:00, 18.03s/it]

@liunux4odoo
Collaborator Author

The loader is still the same loader, but the OCR is no longer the same OCR.
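In other words, RapidOCRPDFLoader still handles the file (hence the log line), but OCR is now only run on images that are large relative to the page, so a PDF without big images finishes much faster. A rough sketch of that filtering, assuming PyMuPDF (fitz); this is an illustration of the idea, not the exact code from this PR:

```python
import fitz  # PyMuPDF

PDF_OCR_THRESHOLD = (0.6, 0.6)  # assumed (width ratio, height ratio) default


def iter_ocr_candidates(doc: fitz.Document, page: fitz.Page):
    """Yield pixmaps of images that cover enough of the page to be worth OCR."""
    for img in page.get_image_info(xrefs=True):
        xref = img.get("xref")
        if not xref:
            continue
        x0, y0, x1, y1 = img["bbox"]  # image placement in page coordinates
        # Skip small decorations such as icons, logos and horizontal rules.
        if ((x1 - x0) / page.rect.width < PDF_OCR_THRESHOLD[0]
                or (y1 - y0) / page.rect.height < PDF_OCR_THRESHOLD[1]):
            continue
        yield fitz.Pixmap(doc, xref)
```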

@liunux4odoo liunux4odoo mentioned this pull request Jan 25, 2024
liunux4odoo added a commit that referenced this pull request Jan 25, 2024
New features:
- Optimize OCR for PDF files by filtering out meaningless small images, by @liunux4odoo #2525
- Support the Gemini online model, by @yhfgyyf #2630
- Support the GLM4 online model, by @zRzRzRzRzRzRzR
- Update Elasticsearch to support HTTPS connections, by @xldistance #2390
- Improve OCR recognition for PPT and DOC knowledge-base files, by @596192804 #2013
- Update the Agent dialogue feature, by @zRzRzRzRzRzRzR
- Fetch a connection from the connection pool when the object is created, instead of opening a new connection on every method call, by @Lijia0 #2480
- Make ChatOpenAI check whether the token count exceeds the model's context length, by @glide-the
- Update database runtime error messages and project milestones, by @zRzRzRzRzRzRzR #2659
- Update config files / documentation / dependencies, by @imClumsyPanda @zRzRzRzRzRzRzR
- Add a Japanese README, by @eltociear #2787

Fixes:
- PGVector vector store connection error after the langchain update, by @HALIndex #2591
- Minimax model worker error, by @xyhshen
- Elasticsearch store unable to do vector retrieval; add mappings to create the vector index, by MSZheng20 #2688
@302658980

> The loader is still the same loader, but the OCR is no longer the same OCR.

I've run into this problem as well.
2024-03-19 11:07:34,183 - utils.py[line:286] - INFO: RapidOCRPDFLoader used for /root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf
RapidOCRPDFLoader context page index: 9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [05:21<00:00, 32.14s/it]
We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
文档切分示例:page_content='《百度推广服务协议》\n\n0447763\n\n合同编号:\n\n深圳大拿智能设备有限公司\n\n方:\n\n方:百度国际科技(深圳)有限公司\n\n法定代表人:\n\n法定代表人:崔珊珊\n\n联系地址:\n\n联系地址:深圳市南山区粤海街道滨海社区海天一\n\n路8号百度国际大厦西塔楼3层\n\n联系人:\n\n联系人:\n\n电话:\n\n电话:售前:4008060018、售后:4009200000\n\n电子邮件:\n\n电子邮箱:zxhelp@sh.baidu.com\n\n开户行:\n\n开户行:招商银行深圳分行创维大厦支行\n\n账号:' metadata={'source': '/root/autodl-tmp/Langchain-Chatchat/knowledge_base/doc-1701875989560164354/content/526062814771613696-百度推广签约合同(2).pdf'}

It looks like the upload actually succeeded, and knowledge-base Q&A also works fine, but what is this message in the middle telling me: "We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like None is not the path to a directory containing a file named config.json."
It makes me a bit nervous.
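For what it's worth, that warning is emitted by the Hugging Face transformers library when it cannot reach huggingface.co to download a model or tokenizer config; since your documents were still split and embedded, it apparently fell back to files already available locally. If the models are on disk, a sketch of how to avoid the network lookup entirely, based on the Transformers offline-mode documentation (the local path below is hypothetical):

```python
import os

# Force offline mode so transformers/huggingface_hub only use the local cache.
# Setting these in the shell environment before starting the service
# (export TRANSFORMERS_OFFLINE=1, export HF_HUB_OFFLINE=1) is the safest way.
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("HF_HUB_OFFLINE", "1")

# Alternatively, point the model setting at a fully downloaded local directory
# that contains config.json instead of a hub id (path is hypothetical):
EMBEDDING_MODEL_PATH = "/root/models/bge-large-zh-v1.5"
```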
