
[FEATURE] Can the vector-matching engine and distance metric used by Faiss be changed? #3115

Closed
liuchuan01 opened this issue Feb 29, 2024 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@liuchuan01
In the configuration file I have already found:

kbs_config = {
    "faiss": {
    },

I also checked the logs and traced the source of:

2024-02-29 14:34:03,340 - loader.py[line:54] - INFO: Loading faiss with AVX2 support.
2024-02-29 14:34:03,358 - loader.py[line:56] - INFO: Successfully loaded faiss with AVX2 support.

but I still cannot find the parameters used when Faiss is initialized. I'd appreciate an explanation.

@liuchuan01 liuchuan01 added the enhancement New feature or request label Feb 29, 2024

dosubot bot commented Feb 29, 2024

🤖

Hello @liuchuan01! Nice to meet you. I'm Dosu, a bot that can help you solve problems, answer questions, and become a contributor. I'm here to help while you wait for a human maintainer. I'll take a look at your issue right away and get back to you with a complete answer as soon as possible.

@liuchuan01
Author

I've now dug into the source code and can see that faiss.py is initialized with a flat index and inner-product distance. How can I change that?


dosubot bot commented Feb 29, 2024

🤖

Based on your question, it seems you're looking to modify the initialization parameters for Faiss in the Langchain-Chatchat source code. These parameters can be found and modified in the load_vector_store method of the FaissKBService class. This method is responsible for loading the Faiss vector store with specific parameters such as kb_name, vector_name, and embed_model.

Here's a snippet of the load_vector_store method:

def load_vector_store(self) -> ThreadSafeFaiss:
    return kb_faiss_pool.load_vector_store(kb_name=self.kb_name,
                                           vector_name=self.vector_name,
                                           embed_model=self.embed_model)

To modify the initialization parameters for Faiss, you would adjust the values of self.kb_name, self.vector_name, and self.embed_model before the load_vector_store method is called. These attributes are set in the do_init method and potentially elsewhere in the class or by the class's consumer.

Here's a snippet of the do_init method:

def do_init(self):
    self.vector_name = self.vector_name or self.embed_model
    self.kb_path = self.get_kb_path()
    self.vs_path = self.get_vs_path()

To modify the initialization parameters, you could directly set these attributes (self.kb_name, self.vector_name, self.embed_model) to the desired values before the Faiss index is loaded or created. This could be done in the class's constructor (__init__ method) or before calling methods that load or manipulate the vector store, such as do_create_kb, do_add_doc, do_search, etc.
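As a small illustration of the fallback described above, here is a hypothetical stand-in class (not the real FaissKBService; the model name "bge-large-zh" is just a placeholder) showing how vector_name falls back to embed_model in do_init:

```python
# Hypothetical stand-in for FaissKBService, showing only the attribute
# fallback from the do_init snippet: vector_name defaults to embed_model.
class FakeFaissKBService:
    def __init__(self, kb_name, vector_name=None, embed_model="bge-large-zh"):
        self.kb_name = kb_name
        self.vector_name = vector_name
        self.embed_model = embed_model
        self.do_init()

    def do_init(self):
        # Mirrors do_init above: fall back to the embedding model name.
        self.vector_name = self.vector_name or self.embed_model

svc = FakeFaissKBService("my_kb")
print(svc.vector_name)  # falls back to the embed_model value
```

Setting vector_name explicitly before the vector store is loaded would therefore change which store is used.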

You can find more details in the source code at server/knowledge_base/kb_service/faiss_kb_service.py.

Regarding the "Loading faiss with AVX2 support" and "Successfully loaded faiss with AVX2 support" messages, these are likely coming from the underlying Faiss library itself when it is being imported or initialized, and not directly from the Langchain-Chatchat codebase. The exact source of these messages would depend on the version and configuration of the Faiss library you are using.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@liuchuan01
Author

@dosu-bot After your mention, I found this in faiss_cache.py:

vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True, distance_strategy="METRIC_INNER_PRODUCT")

Now I can change distance_strategy, but I still want to change the index type, e.g. Flat, IVF, PQ...
Can you help me?


dosubot bot commented Feb 29, 2024

🤖

Based on the information you provided, it seems you are trying to modify the type of index used by Faiss in the Langchain-Chatchat source code. The load_vector_store method in the FaissKBService class is indeed where the Faiss index is initialized. However, the actual creation and configuration of the Faiss index are not shown in the provided context.

To change the type of Faiss index, you would likely need to modify the implementation of the kb_faiss_pool.load_vector_store method or wherever the Faiss index is initialized within the kb_faiss_pool object. This could involve specifying a different index type when creating the Faiss index.

Here's a general approach on how you might modify the index type:

  1. Locate the Index Creation: Find where the Faiss index is created in the kb_faiss_pool or related utility. This is where you'll specify the type of index Faiss should use.

  2. Modify the Index Type: Change the index type to the desired one. For example, if you're switching to an IVF index for faster searches, you might change the index creation code to something like:

import faiss

# Example of creating an IVF index
dimension = 128  # Example dimension, adjust based on your actual data
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(dimension)  # The quantizer for assigning vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)
  3. Adjust Index Parameters: Depending on the index type, you may need to adjust additional parameters or perform extra steps, such as training the index with a sample of vectors before adding all vectors if you're using an IVF index.

  4. Ensure Compatibility: Make sure the rest of your code that interacts with the Faiss index is compatible with the new index type. Some index types may have specific requirements or limitations.

Without access to the implementation details of kb_faiss_pool.load_vector_store or how the Faiss index is initialized and managed within your codebase, this is a general guide. You would need to adapt these steps based on the specific details of your implementation.

In addition, I found a similar solved issue in the Langchain-Chatchat repository: [FEATURE] langchain-chatchat Faiss向量库的索引配置在哪看呢. The suggested solution indicates that the Faiss vector library's index configuration can be found in the kbs_config dictionary in the configs/kb_config.py.example file. The default index type for Faiss is not IndexFlatIP, but IndexFlatL2 based on Euclidean distance. If you want to use cosine similarity, you need to normalize your vectors before adding them to the index and before querying the index.

I hope this helps! If you have any other questions, feel free to ask.

Sources

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Mar 2, 2024
@liuchuan01
Author

@zRzRzRzRzRzRzR Hello maintainer. After reading the source code in depth, I now understand the mechanism for loading a kb and have drawn this flow diagram (loading at the outermost layer of the knowledge-base QA function differs slightly, because the version I read had been modified by others on my team).

(attached flow diagram: image-20240301124635271)

I noticed that the new_vector_store function in faiss_cache.py specifies the distance metric when loading. Source:

vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True, distance_strategy="METRIC_INNER_PRODUCT")

This corresponds to the scenario of creating a new knowledge base.
But as I kept digging into the source, I found that this eventually calls the __from function in langchain's faiss.py, whose logic is:

if distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
    index = faiss.IndexFlatIP(len(embeddings[0]))
else:
    # Default to L2, currently other metric types not initialized.
    index = faiss.IndexFlatL2(len(embeddings[0]))

where the constant is defined as:

MAX_INNER_PRODUCT = "MAX_INNER_PRODUCT"

This means the argument passed by new_vector_store is not actually valid, and the distance metric that ends up in effect is still L2. Is my understanding wrong?
May I submit a PR changing the passed argument to EUCLIDEAN_DISTANCE? That would reduce confusion without changing actual behavior.

My team is continuing to study and optimize this project's use of the vector store (Faiss), and we will also start testing recall precision with inner-product and cosine-similarity matching. We hope to contribute more to this project going forward. Thanks!
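The silent fall-through to L2 can be reproduced with a minimal stand-in (a hypothetical mini DistanceStrategy mirroring the langchain snippet above, not the real class):

```python
from enum import Enum

class DistanceStrategy(str, Enum):
    # Stand-in mirroring the constants quoted above.
    EUCLIDEAN_DISTANCE = "EUCLIDEAN_DISTANCE"
    MAX_INNER_PRODUCT = "MAX_INNER_PRODUCT"

def pick_index(distance_strategy):
    # Mirrors the branch in langchain's __from quoted above.
    if distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
        return "IndexFlatIP"
    # Default to L2; unrecognized metric strings fall through here.
    return "IndexFlatL2"

print(pick_index("METRIC_INNER_PRODUCT"))              # IndexFlatL2 (silent fallback)
print(pick_index("MAX_INNER_PRODUCT"))                 # IndexFlatIP
print(pick_index(DistanceStrategy.MAX_INNER_PRODUCT))  # IndexFlatIP
```

Note that the exact string "MAX_INNER_PRODUCT" would match, because langchain's DistanceStrategy is a str-valued enum; only the misspelled METRIC_INNER_PRODUCT falls through to L2.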


dosubot bot commented Mar 2, 2024

This is a tough nut to crack, could really use your help on this, @imClumsyPanda.

@liangpn

liangpn commented Mar 8, 2024

> (quoting @liuchuan01's comment above)

How did this work out in the end? I'm also puzzled by this code when I run it; the interpreter reports:

2024-03-08 15:49:35,289 - loader.py[line:64] - INFO: Loading faiss.
2024-03-08 15:49:35,320 - loader.py[line:66] - INFO: Successfully loaded faiss.
/home/hvuser/.local/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py:121: UserWarning: Normalizing L2 is not applicable for metric type: METRIC_INNER_PRODUCT
warnings.warn(

I'm not sure whether this has any impact.

@liuchuan01
Author

This warning appears because, when a knowledge base is loaded locally, the distance metric is specified as METRIC_INNER_PRODUCT, but that loading path does not go through __from, so the metric type is never corrected. When faiss.py's __init__ method runs, it executes:

if (
    self.distance_strategy != DistanceStrategy.EUCLIDEAN_DISTANCE
    and self._normalize_L2
):
    warnings.warn(
        "Normalizing L2 is not applicable for metric type: {strategy}".format(
            strategy=self.distance_strategy
        )
    )

So the warning is raised whenever the distance metric is not L2 while L2 normalization is enabled.
This has been fixed on the latest dev branch. The fix, in faiss_cache.py, is

vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True)

which removes the manually specified distance metric, so loading falls back to the default L2 distance.

Finally, the knowledge-base creation flow was also corrected; see PullRequest#3206.

@liangpn

liangpn commented Mar 15, 2024

> (quoting @liuchuan01's explanation above)

Around line 61 of faiss_cache.py, in _FaissPool, isn't there also a call like this? Doesn't that part need to change as well?
vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True, distance_strategy="METRIC_INNER_PRODUCT")

@liuchuan01
Author

> (quoting the exchange above)

Line 61 needs to change too; that is exactly what my open PR changes. Line 61 covers creating a new knowledge base, while load_local around line 111 covers loading from local disk.

@liangpn

liangpn commented Mar 15, 2024

> (quoting the exchange above)

Got it.

@zixiaotan21

Hello! I'm also studying inner-product and cosine-similarity optimization for the Faiss vector store. Thanks for the detailed analysis! Two more questions:

  1. I don't quite understand how the two distance metrics MAX_INNER_PRODUCT and METRIC_INNER_PRODUCT differ.
  2. Is the default distance metric of the Faiss vector store L2, i.e. for the score_threshold set later, do lower values mean higher similarity?

@liuchuan01
Author

> (quoting the two questions above)

  1. MAX_INNER_PRODUCT is a constant defined in the langchain source, used to initialize Faiss; METRIC_INNER_PRODUCT is probably a typo by the Langchain-Chatchat developers and has no effect.
  2. Yes, it is L2. Because METRIC_INNER_PRODUCT is not one of the values enumerated in the langchain source, initialization actually falls back to Euclidean distance, i.e. L2, where lower values mean higher similarity.
     If you later get results comparing inner product against Euclidean distance, let's exchange notes.
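A small self-contained check (plain Python, no faiss; the vectors are made up) of the relationship behind point 2: on L2-normalized vectors the inner product equals cosine similarity, and the squared L2 distance is 2 − 2·cosine, so the two metrics rank neighbors in opposite directions:

```python
import math

def normalize(v):
    # Scale v to unit L2 norm.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 4.0, 7.0])

cos_ab = inner(a, b)  # cosine similarity, since a and b are unit vectors
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert abs(l2_sq(a, b) - (2 - 2 * cos_ab)) < 1e-9
# Higher cosine -> smaller L2 distance, so score_threshold semantics flip.
```

This is why L2 thresholds treat lower scores as more similar while inner-product thresholds treat higher scores as more similar.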

@zixiaotan21

Thanks, I get it now. METRIC_INNER_PRODUCT is most likely a typo, so as you said above, deleting distance_strategy="METRIC_INNER_PRODUCT" at the end of line 61 removes the ambiguity, right?

@zixiaotan21

I think I've roughly figured it out:

  1. With vector_store = FAISS.load_local(vs_path, embeddings, normalize_L2=True), what gets loaded is the default L2 distance; lower values mean higher similarity.
  2. If changed to vector_store = FAISS.from_documents([doc], embeddings, normalize_L2=True, distance_strategy="MAX_INNER_PRODUCT"), what should get loaded is maximum inner-product distance, i.e. cosine similarity on normalized vectors; higher values mean higher similarity.

Using approach 1 in langchain-chatchat, knowledge-base QA always seems to retrieve a lot of barely relevant passages. Have you tried approach 2, retrieving similar vectors by inner product? I'm about to test it and compare the results.

@zixiaotan21

After trying approach 2: if the distance metric is set to "MAX_INNER_PRODUCT" (the inner-product/cosine metric), the same warning as above appears: UserWarning: Normalizing L2 is not applicable for metric type: METRIC_INNER_PRODUCT. Have you run into this?

@1hfq

1hfq commented Apr 21, 2024

After changing the metric to "MAX_INNER_PRODUCT", knowledge-base QA no longer retrieves any documents and the LLM just answers on its own. I don't know why. Is there a fix? Any help appreciated.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR closed this as not planned Won't fix, can't repro, duplicate, stale May 7, 2024