In [None]:
# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('BAAI/bge-m3', cache_dir='D:\\AInewModels')


In [None]:
!pip install FlagEmbedding


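This Python code shows how to use the **BGE-M3** model (a multi-functional embedding model) to compute the semantic similarity between texts. A step-by-step walkthrough follows: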

---

### **1. Model initialization**
```python
model = BGEM3FlagModel('D:\\AInewModels\\BAAI\\bge-m3', use_fp16=True)
```
- **Purpose**: load the pretrained BGE-M3 model.
  - **Model path**: `D:\AInewModels\BAAI\bge-m3`, the local directory that holds the downloaded model files.
  - **`use_fp16=True`**: enables half-precision floating point (FP16), trading a small amount of precision for faster inference.
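
To give a feel for how small the FP16 precision loss is, the sketch below rounds a single number to half precision using only the standard library. It illustrates FP16 rounding in general, not what FlagEmbedding does internally; `to_fp16` is a hypothetical helper, and the sample value is one of the scores printed later in this notebook.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs a float into 2 bytes (binary16).
    return struct.unpack("<e", struct.pack("<e", x))[0]


score = 0.6259036          # a similarity score from the notebook output
rounded = to_fp16(score)   # the closest value FP16 can represent
error = abs(rounded - score)

print(f"original: {score!r}, as fp16: {rounded!r}, abs error: {error:.2e}")
```

For values between 0.5 and 1.0, neighbouring FP16 numbers are 2^-11 apart, so the rounding error of a single value stays below 2^-12 (about 0.00024).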

---

### **2. Preparing the input text**
```python
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model...", "BM25 is a bag-of-words..."]
```
- **Two groups of sentences**:
  - `sentences_1`: two query questions.
  - `sentences_2`: the corresponding answers or definitions.

---

### **3. Generating the embedding vectors**
```python
embeddings_1 = model.encode(sentences_1, batch_size=12, max_length=1024)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
```
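- **`model.encode()`**: converts text into dense vectors (dense embeddings).
  - **Parameters**:
    - `batch_size=12`: encodes up to 12 sentences per batch, which helps GPU utilization.
    - `max_length=1024`: caps each input at 1024 tokens (the model supports up to 8192; a smaller value speeds up encoding).
  - **Output**: `dense_vecs` holds the dense vectors used to compute semantic similarity.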
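
The effect of `batch_size` and `max_length` can be sketched without loading the model. The helpers `iter_batches` and `truncate` below are hypothetical and only mimic the idea; in particular, whitespace splitting stands in for the model's real subword tokenizer, so token counts will differ in practice.

```python
from typing import Iterator, List


def iter_batches(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` sentences."""
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def truncate(sentence: str, max_length: int) -> List[str]:
    """Keep at most `max_length` tokens (whitespace tokens, as a stand-in)."""
    return sentence.split()[:max_length]


# 30 sentences with batch_size=12 are processed as three batches.
sentences = [f"sentence number {i}" for i in range(30)]
batch_sizes = [len(batch) for batch in iter_batches(sentences, batch_size=12)]
print(batch_sizes)  # [12, 12, 6]

# Anything beyond max_length tokens is dropped before encoding.
truncated = truncate("one two three four five", max_length=3)
print(truncated)  # ['one', 'two', 'three']
```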

---

### **4. Computing the similarity**
```python
similarity = embeddings_1 @ embeddings_2.T
```
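- **Operation**: matrix multiplication (`@`) computes the similarity matrix between the two sets of embeddings.
  - **Result**: `similarity[i][j]` is the similarity score between `sentences_1[i]` and `sentences_2[j]` (higher means more similar).
  - **Example**: the more closely `sentences_1[0]` ("What is BGE M3?") matches `sentences_2[0]` ("BGE M3 is an embedding model..."), the closer the corresponding score gets to 1.0.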
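
Why a plain matrix product works as a similarity measure: if the row vectors are L2-normalized (this sketch assumes so; to my understanding FlagEmbedding normalizes dense vectors by default), each dot product equals the cosine similarity. The pure-Python version below reproduces `a @ b.T` on toy vectors that are made up for illustration; `l2_normalize` and `similarity_matrix` are hypothetical helpers.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(a: List[Vector], b: List[Vector]) -> List[List[float]]:
    """Pure-Python equivalent of `a @ b.T` for lists of row vectors."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


# Toy 3-dimensional vectors (illustrative values, not real embeddings).
queries = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 2.0, 0.0])]
passages = [l2_normalize(v) for v in ([3.0, 0.0, 0.0], [0.0, 1.0, 1.0])]

sim = similarity_matrix(queries, passages)
for row in sim:
    print([round(score, 4) for score in row])
```

Vectors pointing the same way score 1.0 regardless of their original length, orthogonal vectors score 0.0, and partial overlap lands in between.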

---

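### **Key takeaways**
- **BGE-M3 features**: supports dense retrieval, sparse retrieval, and multi-vector interaction, and is suited to multilingual and long-text scenarios.
- **Performance tuning**: `use_fp16` and `max_length` balance speed against precision.
- **Use cases**: commonly used for retrieval-augmented generation (RAG), semantic search, and similar tasks.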
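
Sparse retrieval differs from the dense approach used in this notebook: instead of one vector per text, each text gets a weight per token, and only tokens shared by both texts contribute to the score. The sketch below shows that idea under stated assumptions: `lexical_matching_score` is a hypothetical helper rather than the library's API, and the token weights are invented for illustration.

```python
from typing import Dict


def lexical_matching_score(weights_1: Dict[str, float],
                           weights_2: Dict[str, float]) -> float:
    """Sum the products of weights for tokens present in both texts."""
    return sum(w * weights_2[token]
               for token, w in weights_1.items()
               if token in weights_2)


# Invented token weights; a real model would produce these per text.
query_weights = {"bm25": 0.5, "definition": 0.25}
passage_weights = {"bm25": 0.5, "retrieval": 0.25, "function": 0.125}

score = lexical_matching_score(query_weights, passage_weights)
print(score)  # 0.25, contributed entirely by the shared token "bm25"
```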

For more on BGE-M3's architecture or training details, see the [official tutorials](https://github.com/FlagOpen/FlagEmbedding/tree/master/Tutorials).



In [7]:
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('D:\\AInewModels\\BAAI\\bge-m3',  
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1, 
                            batch_size=12, 
                            #max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            max_length=1024,
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[[0.6259036  0.34749582]
 [0.34986773 0.6782464 ]]


- "What is BGE M3?" vs. "BGE M3 is an embedding model...": similarity 0.6259 (relatively high)
- "What is BGE M3?" vs. "BM25 is a bag-of-words retrieval function...": similarity 0.3475 (relatively low)
- "Defination of BM25" vs. "BGE M3 is an embedding model...": similarity 0.3499 (relatively low)
- "Defination of BM25" vs. "BM25 is a bag-of-words retrieval function...": similarity 0.6782 (relatively high)
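
In a retrieval setting, the usual next step is to pick the highest-scoring passage for each query. A minimal sketch, using the matrix printed above copied in as plain Python lists so that it runs without the model:

```python
# Similarity matrix copied from the notebook output above.
similarity = [
    [0.6259036, 0.34749582],
    [0.34986773, 0.6782464],
]
sentences_1 = ["What is BGE M3?", "Defination of BM25"]

# For each query (row), find the column index with the highest score.
best_matches = [max(range(len(row)), key=row.__getitem__) for row in similarity]

for query, row, j in zip(sentences_1, similarity, best_matches):
    print(f"{query!r} -> sentences_2[{j}] (score {row[j]:.4f})")
```

Each query is matched to the passage at the same index, which is the expected pairing for this example.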