
[BUG] With context association enabled, each embedding search returns one more passage than the previous one #613

Closed
guangyuanyu opened this issue Jun 13, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@guangyuanyu

Problem Description
With context association (chunk_conent) enabled, each embedding search returns one more passage than the previous search.
(screenshot: search results growing by one extra passage on each query)

Steps to Reproduce

  1. The knowledge base contains this text: "Q: How many accounts can be created at most? A: At most 5 accounts can be created"
  2. Using the GanymedeNil_text2vec-large-chinese embedding model
  3. Number of knowledge base entries to retrieve set to 2
  4. Maximum length per chunk set to 100
  5. "Enable context association" checked

Expected Result
The content returned by each search should be identical.

Actual Result
Each search returns one more passage than the previous one, as shown in the screenshot above.

Environment Information

  • langchain-ChatGLM version / commit: fef22e3133d8de8f06382149f4303c66afd637cb
  • Docker deployment (yes/no): No, running on a MacBook CPU
  • Model used (ChatGLM-6B / ClueAI/ChatYuan-large-v2, etc.): ChatGLM-6B
  • Embedding model used (GanymedeNil/text2vec-large-chinese, etc.): GanymedeNil/text2vec-large-chinese
  • Operating system and version: macOS 13.3.1
  • Python version: 3.10.9
@guangyuanyu guangyuanyu added the bug Something isn't working label Jun 13, 2023
@jkmchinese

The root cause is that MyFAISS.py mutates the doc object cached in the docstore after retrieving the context-associated documents.

A minimal fix is to take a deepcopy: doc = copy.deepcopy(self.docstore.search(_id))

        # note: the deepcopy below requires `import copy` at the top of MyFAISS.py
        for id_seq in id_lists:
            for id in id_seq:
                if id == id_seq[0]:
                    _id = self.index_to_docstore_id[id]
                    # deepcopy so that appending the neighboring chunks below
                    # does not mutate the Document cached in the docstore
                    doc = copy.deepcopy(self.docstore.search(_id))
                else:
                    _id0 = self.index_to_docstore_id[id]
                    doc0 = self.docstore.search(_id0)
                    doc.page_content += " " + doc0.page_content
            if not isinstance(doc, Document):
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            # score the merged doc with the best (lowest) distance among its hits
            doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])
            doc.metadata["score"] = int(doc_score)
            docs.append(doc)
        return docs
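
To see why the deepcopy matters: `self.docstore.search(_id)` returns the cached Document object itself, not a copy, so appending neighboring chunks to `doc.page_content` permanently grows the cached document, and the next search starts from the already-extended text. Below is a minimal, self-contained sketch of that aliasing bug; the `FakeDocstore` class and `Document` stand-in here are illustrative, not the project's actual classes:

    import copy
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

    class FakeDocstore:
        """Hypothetical stand-in: search() returns the cached object itself."""
        def __init__(self, docs):
            self._docs = docs
        def search(self, _id):
            return self._docs[_id]

    store = FakeDocstore({"q": Document("Q: How many accounts can be created?"),
                          "a": Document("A: At most 5.")})

    # Buggy pattern: mutating the shared object grows the cache on every call.
    for _ in range(2):
        doc = store.search("q")                      # same object every time
        doc.page_content += " " + store.search("a").page_content
        print(doc.page_content)
    # run 1: "Q: ... A: At most 5."
    # run 2: "Q: ... A: At most 5. A: At most 5."   <- one extra passage

    # Fixed pattern: deepcopy first, so the cached Document is never touched.
    store = FakeDocstore({"q": Document("Q: How many accounts can be created?"),
                          "a": Document("A: At most 5.")})
    for _ in range(2):
        doc = copy.deepcopy(store.search("q"))
        doc.page_content += " " + store.search("a").page_content
        print(doc.page_content)                      # identical on both runs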

@imClumsyPanda FYI

@guangyuanyu
Author

Thanks, that solved it.

@imClumsyPanda
Collaborator


This has been fixed on the master branch following the method in the comment above. Thanks for the feedback.
