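# 1. Using BCEmbedding
## 1.1 Computing sentence embeddings with EmbeddingModel

1. About the BCEmbedding project
- Project name: BCEmbedding (Bilingual and Crosslingual Embedding)
- Developer: NetEase Youdao
- Components:
  - EmbeddingModel: generates semantic vectors, used mainly for semantic search and question answering.
  - RerankerModel: refines the ordering of search results and handles ranking tasks.
- Strengths: strong bilingual and crosslingual capability, particularly between Chinese and English, which helps bridge the gap between the two languages.
- Results: the project reports strong performance on the MTEB semantic representation evaluation and a new benchmark result in LlamaIndex's RAG evaluation.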
  
```shell
GitHub: https://github.com/netease-youdao/BCEmbedding
```

## Environment setup
```
conda create -n bce python=3.10
conda init bash && source /root/.bashrc
conda activate bce
conda install ipykernel
ipython kernel install --user --name=bce
```

## Installing dependencies
```shell
pip install BCEmbedding
pip install unstructured-inference
pip install opencv-python
pip install pdfminer.six
pip show pdfminer.six
pip install pdf2image
pip install unstructured
pip install langchain-community
pip install langchain
pip install langchain-core
pip install pikepdf
pip install llama-index-embeddings-huggingface
pip install llama-index
pip install llama-index-llms-huggingface
pip install accelerate
```
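
Before running the cells, it can help to confirm that the `bce` kernel actually sees the installed packages. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the `REQUIRED` list are hypothetical, and the entries are the usual top-level import names (for example `cv2` for `opencv-python` and `pdfminer` for `pdfminer.six`), not the pip distribution names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip distribution names) assumed for this walkthrough.
REQUIRED = [
    "BCEmbedding", "torch", "transformers", "sentence_transformers",
    "langchain", "llama_index", "unstructured", "pdfminer",
    "pdf2image", "pikepdf", "cv2", "accelerate",
]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```

An empty result suggests the environment is ready. Any name that is listed points at a package to reinstall inside the activated `bce` environment.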

In [1]:
from BCEmbedding import EmbeddingModel



In [2]:
sentences = ['今天天气不错哟', '明天一起去徒步']

In [3]:
model = EmbeddingModel(model_name_or_path="/slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1")

07/25/2024 09:56:29 - [INFO] -BCEmbedding.models.EmbeddingModel->>>    Loading from `/slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1`.
07/25/2024 09:56:30 - [INFO] -BCEmbedding.models.EmbeddingModel->>>    Execute device: cuda;	 gpu num: 1;	 use fp16: False;	 embedding pooling type: cls;	 trust remote code: False


In [4]:
embeddings = model.encode(sentences)

Extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.38it/s]


In [5]:
embeddings.shape

(2, 768)
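
Each row of the `(2, 768)` array is one sentence vector, and such vectors are typically compared with cosine similarity. The sketch below illustrates the computation with the standard library only. The function name `cosine_similarity` and the 3-dimensional toy vectors are illustrative stand-ins, not output of the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for rows of the embedding array (real rows have 768 values).
vec_a = [1.0, 2.0, 2.0]
vec_b = [2.0, 4.0, 4.0]   # same direction as vec_a
vec_c = [2.0, -1.0, 0.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # same direction, similarity 1.0
print(cosine_similarity(vec_a, vec_c))  # orthogonal, similarity 0.0
```

With the real model, the same function should accept two rows of the array, for example `cosine_similarity(embeddings[0], embeddings[1])`.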

## 1.2 Computing relevance scores for sentence pairs with RerankerModel

In [6]:
from BCEmbedding import RerankerModel

query = "一个女人站在高崖上单腿站立，俯瞰一条河流。"

passages = ["一个女人站在悬崖上。",
            "一个孩子在她的卧室里读书。"]

# Build the [query, passage] pairs
sentence_pairs = [[query, passage] for passage in passages]

# Initialize the reranker model
rerank_model = RerankerModel(model_name_or_path="/slurm/resources/weights/huggingface/maidalun1020/bce-reranker-base_v1/")

# (1) Compute a relevance score for each pair
scores = rerank_model.compute_score(sentence_pairs)

# (2) Rank the passages against the query
rerank_results = rerank_model.rerank(query, passages)

print(rerank_results)

07/25/2024 09:56:35 - [INFO] -BCEmbedding.models.RerankerModel->>>    Loading from `/slurm/resources/weights/huggingface/maidalun1020/bce-reranker-base_v1/`.
07/25/2024 09:56:35 - [INFO] -BCEmbedding.models.RerankerModel->>>    Execute device: cuda;	 gpu num: 1;	 use fp16: False
Calculate scores: 100%|██████████| 1/1 [00:00<00:00, 106.65it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'rerank_passages': ['一个女人站在悬崖上。', '一个孩子在她的卧室里读书。'], 'rerank_scores': [0.7433750033378601, 0.38531556725502014], 'rerank_ids': [0, 1]}
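
The returned dictionary holds three parallel lists. One way to consume it is to zip them together and keep only the passages above a relevance cutoff, as sketched below. The dictionary is copied from the printed output above. The helper `select_passages` and the `0.5` cutoff are hypothetical, and the sketch assumes that `rerank_ids` are indices into the original `passages` list, which is consistent with this output but not verified beyond it.

```python
# Copied from the printed output above.
rerank_results = {
    "rerank_passages": ["一个女人站在悬崖上。", "一个孩子在她的卧室里读书。"],
    "rerank_scores": [0.7433750033378601, 0.38531556725502014],
    "rerank_ids": [0, 1],
}


def select_passages(results, min_score=0.5):
    """Return (original_index, passage, score) tuples with score >= min_score, best first."""
    rows = zip(results["rerank_ids"], results["rerank_passages"], results["rerank_scores"])
    kept = [row for row in rows if row[2] >= min_score]
    return sorted(kept, key=lambda row: row[2], reverse=True)


for idx, passage, score in select_passages(rerank_results):
    print(f"{idx}\t{score:.4f}\t{passage}")
```

The right cutoff depends on the application, so `0.5` here is only a placeholder.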


## Using transformers
### Computing sentence embeddings with EmbeddingModel

In [8]:
from transformers import AutoModel, AutoTokenizer
import torch

passages = ["一个女人站在悬崖上。",
            "一个孩子在她的卧室里读书。"]

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("/slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1")
model = AutoModel.from_pretrained("/slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

# Tokenize the passages
encoded_inputs = tokenizer(passages, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)

# Compute the embeddings
outputs = model(**encoded_inputs, return_dict=True)

embeddings = outputs.last_hidden_state[:, 0]  # CLS pooling: keep the first token's hidden state

embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # L2-normalize each row

print(embeddings.shape)

torch.Size([2, 768])
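
The two tensor operations in this cell, CLS pooling and L2 normalization, can be traced on a tiny example. The sketch below uses plain Python lists in place of tensors. The helpers `cls_pool` and `l2_normalize` and the toy values are illustrative only and are not part of `transformers`.

```python
import math


def cls_pool(last_hidden_state):
    """Keep the first token's vector of every sequence, like last_hidden_state[:, 0]."""
    return [sequence[0] for sequence in last_hidden_state]


def l2_normalize(vectors):
    """Divide each vector by its Euclidean norm, like v / v.norm(dim=1, keepdim=True)."""
    normalized = []
    for vector in vectors:
        norm = math.sqrt(sum(x * x for x in vector))
        normalized.append([x / norm for x in vector])
    return normalized


# Toy batch with shape (batch=2, tokens=3, hidden=2); the real model uses hidden=768.
toy_hidden_states = [
    [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]],
    [[0.0, 5.0], [2.0, 2.0], [1.0, 0.0]],
]

pooled = cls_pool(toy_hidden_states)   # one vector per sequence: [[3.0, 4.0], [0.0, 5.0]]
unit_vectors = l2_normalize(pooled)    # every row now has length 1
print(unit_vectors)
```

Once every row has unit length, the dot product of two rows equals their cosine similarity, which is why the normalization step is convenient for retrieval.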


### Computing relevance scores for sentence pairs with RerankerModel

In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("/slurm/resources/weights/huggingface/maidalun1020/bce-reranker-base_v1/")
model = AutoModelForSequenceClassification.from_pretrained("/slurm/resources/weights/huggingface/maidalun1020/bce-reranker-base_v1/")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

query = "一个女人站在高崖上单腿站立，俯瞰一条河流。"
passages = ["一个女人站在悬崖上。",
            "一个孩子在她的卧室里读书。"]
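# Build the [query, passage] pairs
sentence_pairs = [[query, passage] for passage in passages]

# Tokenize the pairs
encoded_inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt").to(device)

# (1) Run the model on the tokenized pairs. return_dict=True makes it return a dict-like output object instead of a plain tuple.
# (2) .logits holds the raw classification scores. For sequence classification this is a tensor of shape
#     (batch_size, num_labels), where each entry is the score for the corresponding label.
# (3) .view(-1,) reshapes the logits into a 1-D tensor; -1 means the size of that dimension is inferred automatically.
#     Here that yields one score per sentence pair, as the printed output shows.
scores = model(**encoded_inputs, return_dict=True).logits.view(-1,).float()
print(scores)
# (4) Apply the sigmoid function to map each score to a value between 0 and 1, which can be read as a probability.
#     Sigmoid suits binary classification because it maps any real number into (0, 1), expressing the model's confidence.
scores = torch.sigmoid(scores)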

print(scores)

tensor([ 1.1699, -0.4503], device='cuda:0', grad_fn=<ViewBackward0>)
tensor([0.7631, 0.3893], device='cuda:0', grad_fn=<SigmoidBackward0>)
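
The relationship between the two printed tensors can be checked by hand, since the second is the sigmoid of the first. A minimal standard-library sketch follows. The logit values are copied from the first tensor above, where they are already rounded to four decimals, so the result should agree with the second tensor only up to rounding.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [1.1699, -0.4503]  # raw scores from the first printed tensor
probabilities = [sigmoid(x) for x in logits]
print([round(p, 4) for p in probabilities])
```

A positive logit maps above 0.5 and a negative one below it, so the sign of the raw score already tells which side of the midpoint a pair falls on.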


## Using sentence_transformers
### Computing sentence embeddings with EmbeddingModel

In [10]:
from sentence_transformers import SentenceTransformer

passages = ["一个女人站在悬崖上。",
            "一个孩子在她的卧室里读书。"]

model = SentenceTransformer("/slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1/")

embeddings = model.encode(passages, normalize_embeddings=True)

print(embeddings.shape)

07/25/2024 10:10:40 - [INFO] -datasets->>>    PyTorch version 2.2.0 available.
07/25/2024 10:10:41 - [INFO] -sentence_transformers.SentenceTransformer->>>    Use pytorch device_name: cuda
07/25/2024 10:10:41 - [INFO] -sentence_transformers.SentenceTransformer->>>    Load pretrained SentenceTransformer: /slurm/resources/weights/huggingface/maidalun1020/bce-embedding-base_v1/
Batches: 100%|██████████| 1/1 [00:00<00:00, 71.57it/s]

(2, 768)





### Computing relevance scores for sentence pairs with RerankerModel

In [11]:
from sentence_transformers import CrossEncoder

model = CrossEncoder("/slurm/resources/weights/huggingface/maidalun1020/bce-reranker-base_v1/", max_length=512)


query = "一个女人站在高崖上单腿站立，俯瞰一条河流。"
passages = ["一个女人站在悬崖上。",
            "一个孩子在她的卧室里读书。"]
# Build the [query, passage] pairs
sentence_pairs = [[query, passage] for passage in passages]

scores = model.predict(sentence_pairs)

print(scores)

07/25/2024 10:13:44 - [INFO] -sentence_transformers.cross_encoder.CrossEncoder->>>    Use pytorch device: cuda
Batches: 100%|██████████| 1/1 [00:00<00:00,  4.13it/s]

[0.76313466 0.3892789 ]
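
The embedding model and the reranker are commonly combined into two stages: a cheap first pass recalls candidates, and the more expensive pairwise scorer reorders only those candidates. The sketch below shows that control flow under stated assumptions. `retrieve_then_rerank` is a hypothetical helper, `char_overlap` is a toy stand-in for embedding similarity, and `recorded_rerank_score` simply looks up the two scores printed above rather than calling a model.

```python
def char_overlap(query, passage):
    """Toy first-stage score: Jaccard overlap of character sets (stand-in for embedding similarity)."""
    q, p = set(query), set(passage)
    return len(q & p) / len(q | p)


# Second-stage stand-in: the CrossEncoder scores printed above, keyed by passage.
RECORDED_SCORES = {
    "一个女人站在悬崖上。": 0.76313466,
    "一个孩子在她的卧室里读书。": 0.3892789,
}


def recorded_rerank_score(query, passage):
    return RECORDED_SCORES[passage]


def retrieve_then_rerank(query, passages, first_stage, second_stage, recall_k=2, final_k=1):
    """Recall recall_k passages with first_stage, then reorder them with second_stage."""
    recalled = sorted(passages, key=lambda p: first_stage(query, p), reverse=True)[:recall_k]
    scored = [(p, second_stage(query, p)) for p in recalled]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:final_k]


query = "一个女人站在高崖上单腿站立，俯瞰一条河流。"
passages = ["一个女人站在悬崖上。", "一个孩子在她的卧室里读书。"]
print(retrieve_then_rerank(query, passages, char_overlap, recorded_rerank_score))
```

In a real pipeline the two stand-ins would be replaced by calls such as `model.encode` plus a similarity function for the first stage and `model.predict` for the second, with `recall_k` chosen much larger than `final_k`.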



