[Reproduction issue] Text extracted from the knowledge base is garbled when constructing the prompt #5

Closed
jerrylususu opened this issue Apr 1, 2023 · 4 comments

@jerrylususu

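hi, I am trying to reproduce the results shown in the README, and I also used the ChatGLM-6B README as the input text. However, the text extracted from the knowledge base is garbled, which makes the constructed prompt unusable. How can I fix this?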

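System: Based on the following content, answer the user's question concisely and professionally.
    If the answer cannot be derived from it, say "I don't know" or "there is not enough relevant information"; do not try to make up an answer. Please answer in Chinese.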
    ----------------
    # ChatGLM-6B

[GLM-130B@ICLR 23]

[GLM@ACL 22]

Blog ¢ ð

[GitHub]

[GitHub] ¢ ð

HF Repo ¢ ð

Twitter ¢ ð
    ----------------
@imClumsyPanda
Collaborator

Did you download the README locally and then load it with UnstructuredFileLoader?

@jerrylususu
Author

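> Did you download the README locally and then load it with UnstructuredFileLoader?

Yes. Because my local GPU does not have enough VRAM, I am working in a cloud VM on the AutoDL platform. I have confirmed that the downloaded file is UTF8-encoded. Relevant library versions: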

langchain 0.0.128
transformers 4.26.1
unstructured 0.5.8
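
For anyone who wants to repeat the encoding check, here is a minimal sketch using only the Python standard library. The helper name `is_valid_utf8` is hypothetical and is not part of this project or of langchain; it only confirms that the file's bytes decode cleanly as UTF-8, which is one way (not necessarily the way used above) to verify the encoding.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8.

    Hypothetical helper for illustration; it reads the whole file into memory.
    """
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A file that passes this check can still come out garbled later if some downstream step decodes the same bytes with a different codec.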

@imClumsyPanda
Collaborator

imClumsyPanda commented Apr 1, 2023 via email

@jerrylususu
Author

jerrylususu commented Apr 1, 2023

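Thanks for the pointer! Loading other texts works fine. After some digging, the cause appears to be the HTML links at the top of the ChatGLM README; deleting the following text fixes it: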

<p align="center">
   🌐 <a href="https://chatglm.cn/blog" target="_blank">Blog</a> • 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
</p>
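
If deleting the block by hand is inconvenient, the same cleanup could be sketched with a regular expression applied to the README text before it is loaded. This is only a rough illustration: `strip_html_paragraphs` and `P_BLOCK` are hypothetical names, the sample string is a shortened stand-in for the real README, and a regex is a crude tool that would also remove any other `<p>...</p>` block in the file.

```python
import re

# Matches a <p ...>...</p> block, including one that spans several lines,
# plus any whitespace that follows it. \b keeps tags such as <pre> untouched.
P_BLOCK = re.compile(r"<p\b[^>]*>.*?</p>\s*", flags=re.DOTALL | re.IGNORECASE)


def strip_html_paragraphs(markdown_text):
    """Remove raw HTML <p> blocks from Markdown text (hypothetical helper)."""
    return P_BLOCK.sub("", markdown_text)


# Shortened stand-in for the top of the README.
sample = (
    "# ChatGLM-6B\n\n"
    '<p align="center">\n'
    '   🌐 <a href="https://chatglm.cn/blog" target="_blank">Blog</a>\n'
    "</p>\n\n"
    "## Introduction\n"
)
cleaned = strip_html_paragraphs(sample)
```

The cleaned text can then be written back to a file with `encoding="utf8"` and loaded as usual.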

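In addition, opening the file directly in rb mode also causes problems (the text is still garbled), but first opening it with an explicit encoding to get a file IO object and then passing that object to the loader seems to work: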

from langchain.document_loaders import UnstructuredFileIOLoader

with open(filepath, "r", encoding="utf8") as f:
    loader = UnstructuredFileIOLoader(file=f, mode="elements")
    docs = loader.load()
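
One possible explanation for the garbling, offered as an assumption rather than something confirmed in this thread: the `¢` and `ð` in the broken prompt are consistent with UTF-8 bytes being decoded as Latin-1 somewhere in the pipeline. The last UTF-8 byte of `•` is 0xA2 (`¢` in Latin-1), and the first UTF-8 byte of the emoji is 0xF0 (`ð` in Latin-1). The sketch below reproduces that effect with the standard library only; it does not call langchain or unstructured.

```python
# Text similar to what sits between the links at the top of the README.
original = "Blog \u2022 \U0001F917"  # "Blog • 🤗"

# The bytes as stored in a UTF-8 file.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields mojibake that
# contains "¢" and "ð", as seen in the garbled prompt.
garbled = raw.decode("latin-1")

# Decoding with the correct codec round-trips cleanly.
restored = raw.decode("utf-8")
```

This would also fit the observation that passing a file object opened with `encoding="utf8"` avoids the problem, since the text is already decoded before the loader sees it.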
