[BUG] Initializing the document store with init_database.py fails #2086

Closed
zhengwanbo opened this issue Nov 16, 2023 · 3 comments
Labels
bug Something isn't working

@zhengwanbo

Problem Description
Initializing the document store with init_database.py fails: only one file is loaded successfully, and all the others fail.

(faiss) [opc@llm-test Langchain-Chatchat]$ python copy_config_example.py
(faiss) [opc@llm-test Langchain-Chatchat]$ python init_database.py --recreate-vs
recreating all vector stores
2023-11-16 09:27:00,152 - faiss_cache.py[line:80] - INFO: loading vector store in 'samples/vector_store/m3e-base' from disk.
2023-11-16 09:27:00,564 - SentenceTransformer.py[line:66] - INFO: Load pretrained SentenceTransformer: moka-ai/m3e-base
Batches: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.38it/s]
2023-11-16 09:27:01,893 - loader.py[line:54] - INFO: Loading faiss with AVX2 support.
2023-11-16 09:27:01,908 - loader.py[line:56] - INFO: Successfully loaded faiss with AVX2 support.
2023-11-16 09:27:01,922 - faiss_cache.py[line:80] - INFO: loading vector store in 'samples/vector_store/m3e-base' from disk.
Batches: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.18it/s]
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: file format .jsonl is not yet supported, skipped
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: file format .xlsx is not yet supported, skipped
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: file format .jsonl is not yet supported, skipped
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: file format .xlsx is not yet supported, skipped
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: CSVLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.csv
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: CSVLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_open.csv
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: UnstructuredFileLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/test.txt
Document split example: page_content=': 0\ntitle: 效果如何优化\nfile: 2023-04-04.00\nurl: https://github.com/imClumsyPanda/langchain-ChatGLM/issues/14\ndetail: 如图所示,将该项目的README.md和该项目结合后,回答效果并不理想,请问可以从哪些方面进行优化\nid: 0' metadata={'source': '/home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_open.csv', 'row': 0}
Adding samples/test_files/langchain-ChatGLM_open.csv to the vector store, 323 documents in total
2023-11-16 09:27:02,255 - utils.py[line:373] - ERROR: RuntimeError: error loading documents from file samples/test_files/langchain-ChatGLM_closed.csv: Error loading /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.csv
Batches: 0%| | 0/11 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-11-16 09:27:03,455 - utils.py[line:160] - INFO: NumExpr defaulting to 8 threads.
2023-11-16 09:27:03,735 - utils.py[line:373] - ERROR: ImportError: error loading documents from file samples/test_files/test.txt: libGL.so.1: cannot open shared object file: No such file or directory
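
Editor's note: the repeated huggingface/tokenizers warning in the log names its own remedy, which is to set the TOKENIZERS_PARALLELISM environment variable explicitly. It is a warning, not one of the load errors, so silencing it is unlikely to fix the failed files. A minimal sketch follows. It assumes the variable is set before the embedding model is loaded. The helper name `disable_tokenizers_parallelism` is hypothetical and not part of Langchain-Chatchat.

```python
import os


def disable_tokenizers_parallelism(env=None):
    """Set TOKENIZERS_PARALLELISM to "false" unless the user already chose a value.

    `env` defaults to os.environ; a plain dict can be passed instead.
    The function name is hypothetical, used only for this illustration.
    """
    env = os.environ if env is None else env
    # setdefault keeps an explicit user setting (for example "true") untouched.
    env.setdefault("TOKENIZERS_PARALLELISM", "false")
    return env["TOKENIZERS_PARALLELISM"]


if __name__ == "__main__":
    # Assumption: call this before any code that loads the tokenizer or embedding model.
    print("TOKENIZERS_PARALLELISM =", disable_tokenizers_parallelism())
```

Setting the variable in the shell before running init_database.py should have the same effect.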

@zhengwanbo zhengwanbo added the bug Something isn't working label Nov 16, 2023
@liunux4odoo
Collaborator

Duplicate of #1783

@liunux4odoo liunux4odoo marked this as a duplicate of #1783 Nov 16, 2023
@plancktree

Same problem here. Has the original poster solved it?

@irisrain

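Let me help answer this.
From the wiki:
Q: On Linux, vectorizing a PDF file fails with: ImportError: error loading documents from file *.pdf: libGL.so.1: cannot open shared object file: No such file or directory
A: The system is missing required shared libraries. You can install them manually: libgl1-mesa-glx and libglib2.0-0
I installed them myself and that fixed it:
yum update -y
yum install mesa-libGL glib2 -y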
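
Editor's note: to confirm that the libraries are visible to the dynamic loader before rerunning init_database.py, a small check along these lines may help. It is only a sketch. `libGL.so.1` comes from the error message above, `libglib-2.0.so.0` is assumed to be the shared object provided by the glib2 package, and the helper name `missing_shared_libs` is hypothetical.

```python
import ctypes


def missing_shared_libs(names):
    """Return the entries of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            # CDLL raises OSError when the shared object cannot be found or loaded.
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # libGL.so.1 is taken from the log; libglib-2.0.so.0 is an assumed soname.
    required = ["libGL.so.1", "libglib-2.0.so.0"]
    print("missing:", missing_shared_libs(required))
```

An empty list suggests both libraries can be loaded. Any name that is printed still needs to be installed with the system package manager.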
