[Reproduction issue] Text extracted from the knowledge base is garbled when constructing the prompt #5
Comments
Did you download the README to your local machine and then load it with UnstructuredFileLoader?

Yes. My local GPU does not have enough VRAM, so I am working in a cloud virtual machine on the AutoDL platform. I have confirmed that the downloaded file is UTF-8 encoded. Relevant library versions:
langchain 0.0.128
transformers 4.26.1
unstructured 0.5.8

Have you tested files in other formats, or other .md files? I suggest first checking the loader's output to see whether the problem already appears there. If it does, you will need to look at the specific functions inside Unstructured.
(Sent by email in reply to Neko Null's comment above, written on Sunday, April 2, 2023, at 00:14.)
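The two checks discussed above (confirming the file is UTF-8 and inspecting the loader's output for garbling) could be sketched roughly as follows. This is only a heuristic illustration using the Python standard library: the helper names `is_valid_utf8` and `looks_garbled` are hypothetical, and the sample strings stand in for the `page_content` of the documents that `loader.load()` would return in the real setup.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_garbled(text: str) -> bool:
    """Heuristic (assumption, not a complete detector): flag two common
    signs of a decoding problem in already-decoded text."""
    # U+FFFD is what decoders insert when bytes could not be decoded.
    if "\ufffd" in text:
        return True
    # Classic mojibake: UTF-8 bytes that were decoded as Latin-1.
    # If re-encoding as Latin-1 and decoding as UTF-8 succeeds and
    # changes the text, the text was probably mis-decoded.
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


# Hypothetical stand-ins for chunks of loader output.
good = "ChatGLM-6B 是一个开源的对话语言模型"
bad = good.encode("utf-8").decode("latin-1")  # simulate a mis-decoded chunk

for label, chunk in (("good", good), ("bad", bad)):
    print(label, "garbled" if looks_garbled(chunk) else "ok")
```

These heuristics cover only two failure patterns, so printing a few chunks and reading them by eye is still worthwhile when the check reports nothing.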
Thanks for the pointer! Loading other texts works fine. After some investigation, the problem appears to be caused by the HTML links at the top of the ChatGLM README. Deleting the following text fixes it:
<p align="center">
🌐 <a href="https://chatglm.cn/blog" target="_blank">Blog</a> • 🤗 <a href="https://huggingface.co/THUDM/chatglm-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
</p>
In addition, loading the file directly with:
from langchain.document_loaders import UnstructuredFileIOLoader

# Open the file explicitly as UTF-8 and pass the file object to the loader.
with open(filepath, "r", encoding="utf8") as f:
    loader = UnstructuredFileIOLoader(file=f, mode="elements")
    docs = loader.load()  # load while the file is still open
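Rather than deleting the HTML block by hand, the same cleanup could be done in a small preprocessing step before the file is handed to the loader. The following is a minimal sketch under stated assumptions: the function name `strip_centered_html_blocks` is hypothetical, the pattern only targets `<p align="center">…</p>` blocks like the one quoted above, and the sample text is a shortened stand-in for the real README.

```python
import re

# Matches a <p align="center"> ... </p> block spanning several lines,
# plus the line break that follows it. Non-greedy, so each block is
# matched up to its own closing tag.
CENTERED_P_BLOCK = re.compile(
    r'<p\s+align="center"\s*>.*?</p>[ \t]*\n?',
    re.DOTALL | re.IGNORECASE,
)


def strip_centered_html_blocks(markdown_text: str) -> str:
    """Remove centered HTML paragraph blocks from Markdown text."""
    return CENTERED_P_BLOCK.sub("", markdown_text)


# Shortened, illustrative stand-in for the top of the README.
sample = (
    "# ChatGLM-6B\n"
    '<p align="center">\n'
    '🌐 <a href="https://chatglm.cn/blog" target="_blank">Blog</a> <br>\n'
    "</p>\n"
    "## Intro\n"
)

cleaned = strip_centered_html_blocks(sample)
print(cleaned)
```

The cleaned text could then be written to a temporary file and loaded as before. A regular expression is enough for this one known block, but it is not a general HTML parser and would need adjusting for other markup.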
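Hi, I am trying to reproduce the results shown in the README, and I also used the ChatGLM-6B README as the input text. However, the text extracted from the knowledge base is garbled, which makes the constructed prompt unusable. How can I solve this problem?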