[BUG] Initializing the document library with init_database.py fails #2086

zhengwanbo · 2023-11-16T09:33:42Z

Problem Description
Initializing the document library with init_database.py fails: only one file is loaded successfully, and all the others fail.

(faiss) [opc@llm-test Langchain-Chatchat]$ python copy_config_example.py
(faiss) [opc@llm-test Langchain-Chatchat]$ python init_database.py --recreate-vs
recreating all vector stores
2023-11-16 09:27:00,152 - faiss_cache.py[line:80] - INFO: loading vector store in 'samples/vector_store/m3e-base' from disk.
2023-11-16 09:27:00,564 - SentenceTransformer.py[line:66] - INFO: Load pretrained SentenceTransformer: moka-ai/m3e-base
Batches: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 37.38it/s]
2023-11-16 09:27:01,893 - loader.py[line:54] - INFO: Loading faiss with AVX2 support.
2023-11-16 09:27:01,908 - loader.py[line:56] - INFO: Successfully loaded faiss with AVX2 support.
2023-11-16 09:27:01,922 - faiss_cache.py[line:80] - INFO: loading vector store in 'samples/vector_store/m3e-base' from disk.
Batches: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.18it/s]
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: 暂未支持的文件格式 .jsonl，已跳过
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: 暂未支持的文件格式 .xlsx，已跳过
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: 暂未支持的文件格式 .jsonl，已跳过
2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: 暂未支持的文件格式 .xlsx，已跳过
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: CSVLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.csv
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: CSVLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_open.csv
2023-11-16 09:27:01,954 - utils.py[line:292] - INFO: UnstructuredFileLoader used for /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/test.txt
文档切分示例：page_content=': 0\ntitle: 效果如何优化\nfile: 2023-04-04.00\nurl: https://github.com/imClumsyPanda/langchain-ChatGLM/issues/14\ndetail: 如图所示，将该项目的README.md和该项目结合后，回答效果并不理想，请问可以从哪些方面进行优化\nid: 0' metadata={'source': '/home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_open.csv', 'row': 0}
正在将 samples/test_files/langchain-ChatGLM_open.csv 添加到向量库，共包含323条文档
2023-11-16 09:27:02,255 - utils.py[line:373] - ERROR: RuntimeError: 从文件 samples/test_files/langchain-ChatGLM_closed.csv 加载文档时出错：Error loading /home/opc/LLM/Langchain-Chatchat/knowledge_base/samples/content/test_files/langchain-ChatGLM_closed.csv
Batches: 0%| | 0/11 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2023-11-16 09:27:03,455 - utils.py[line:160] - INFO: NumExpr defaulting to 8 threads.
2023-11-16 09:27:03,735 - utils.py[line:373] - ERROR: ImportError: 从文件 samples/test_files/test.txt 加载文档时出错：libGL.so.1: cannot open shared object file: No such file or directory
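
The log above mixes three different failures: four ValueError lines for .jsonl and .xlsx files (the message says the file format is not yet supported and the file was skipped), one RuntimeError while loading langchain-ChatGLM_closed.csv, and one ImportError for test.txt caused by the missing libGL.so.1. To see such a breakdown at a glance, a minimal sketch like the following could group the ERROR lines by exception type. It assumes the line layout shown in this log (`<timestamp> - <file>[line:N] - ERROR: <Exception>: <message>`); the names `ERROR_LINE` and `summarize_errors` are hypothetical and not part of Langchain-Chatchat.

```python
import re
from collections import Counter

# Matches lines such as:
#   2023-11-16 09:27:01,953 - migrate.py[line:77] - ERROR: ValueError: ...
# The layout is taken from the log in this issue; other logging
# configurations may format lines differently.
ERROR_LINE = re.compile(
    r" - (?P<module>[\w.]+)\[line:\d+\] - ERROR: (?P<exc>\w+): (?P<msg>.*)"
)


def summarize_errors(log_text: str) -> Counter:
    """Count ERROR lines in log_text, keyed by exception type."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("exc")] += 1
    return counts
```

Passing the captured output of `python init_database.py --recreate-vs` to `summarize_errors` should separate the skipped file formats from the real load failures, which are the ones worth fixing first.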


liunux4odoo · 2023-11-16T09:55:07Z

Duplicate of #1783

plancktree · 2023-11-26T17:05:52Z

I have the same problem. Has the original poster solved it?

irisrain · 2023-12-28T07:53:53Z

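Let me help answer this.
I found the following in the wiki:
Q: On Linux, vectorizing PDF files fails with: ImportError: 从文件 *.pdf 加载文档时出错：libGL.so.1: cannot open shared object file: No such file or directory
A: The system is missing required shared libraries. Install them manually: libgl1-mesa-glx and libglib2.0-0
I installed them myself and that fixed it:
yum update -y
yum install mesa-libGL glib2 -y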
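
Before re-running init_database.py, it may help to confirm that the dynamic loader can now find the libraries. The following is a minimal sketch using Python's standard ctypes module. `libGL.so.1` is the name from the error message; `libglib-2.0.so.0` is an assumption about the file the glib2 package provides, and `shared_library_loads` is a hypothetical helper, not part of Langchain-Chatchat.

```python
import ctypes


def shared_library_loads(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library cannot be found or loaded.
        return False
    return True


# libGL.so.1 comes from the ImportError above; libglib-2.0.so.0 is an
# assumed file name for the library installed by the glib2 package.
for lib in ("libGL.so.1", "libglib-2.0.so.0"):
    status = "ok" if shared_library_loads(lib) else "MISSING"
    print(f"{lib}: {status}")
```

If either line reports MISSING in the same environment that runs init_database.py, the ImportError is likely to recur. Note that libgl1-mesa-glx and libglib2.0-0 are the Debian/Ubuntu package names, while mesa-libGL and glib2 are the equivalents on yum-based systems.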

zhengwanbo added the bug label Nov 16, 2023

liunux4odoo marked this as a duplicate of #1783 Nov 16, 2023

liunux4odoo closed this as completed Nov 16, 2023
