## Sentence-BERT

Sentence-BERT was developed by the Ubiquitous Knowledge Processing Lab (UKP-TUDA). As the name suggests, it is used to obtain fixed-length sentence representations.

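### Key points
- Vanilla BERT uses the [CLS] token to represent a sentence, whereas Sentence-BERT pools over the token representations, so the result reflects the sentence as a whole.
- Training is done by fine-tuning BERT. Sentence-BERT uses **Siamese (pair) and triplet network architectures**, which makes fine-tuning faster and helps produce accurate sentence representations.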
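To make the triplet idea concrete, the sketch below evaluates a triplet objective on toy vectors with NumPy. It assumes Euclidean distance and a margin of 1; the function name `triplet_loss` and the vectors are hypothetical and chosen only for illustration. This is not the library's training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style objective: the positive should be closer to the anchor
    than the negative by at least `margin` (Euclidean distance assumed)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 1.0])

# A negative that is already far away contributes no loss.
far_loss = triplet_loss(anchor, positive, np.array([4.0, 4.0]))
# A negative that is almost as close as the positive is penalized.
near_loss = triplet_loss(anchor, positive, np.array([1.0, 1.5]))
print(far_loss, near_loss)
```

In training, the three vectors would be the pooled sentence representations of an anchor, a positive, and a negative sentence produced by the same shared encoder.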

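Two pooling strategies are available: mean pooling, which captures the overall meaning of the sentence, and max pooling, which captures its most salient words.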
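The difference between the two strategies can be seen on a toy example. The following NumPy sketch pools a made-up matrix of token representations (three tokens, four dimensions; real BERT-base vectors have 768 dimensions). The values and variable names are illustrative assumptions.

```python
import numpy as np

# Toy token representations: 3 tokens x 4 dimensions (values are made up).
token_embeddings = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
])

# Mean pooling: average each dimension over all tokens.
mean_pooled = token_embeddings.mean(axis=0)
# Max pooling: keep the largest value of each dimension over all tokens.
max_pooled = token_embeddings.max(axis=0)

print(mean_pooled)
print(max_pooled)
```

Both results have the same fixed length (the embedding dimension) regardless of how many tokens the sentence contains.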

![](1.png)
![](2.png)
![](3.png)

### Using Sentence-BERT

In [1]:
! pip install sentence_transformers==0.4.0

Collecting sentence_transformers==0.4.0
  Downloading sentence-transformers-0.4.0.tar.gz (65 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-0.4.0-py3-none-any.whl size=102642 sha256=cbfc46646be7a804cef70384326ceab0382fc522998aebd93ebed21087ac640e
  Stored in directory: /root/.cache/pip/wheels/a7/06/95/45a3f09c8c2438afbae6048d98b1baf957461d710e8e8a8d3e
Successfully built sentence_transformers
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-0.4.0


In [2]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

Download a pretrained model that uses mean pooling.

In [3]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [5]:
print(model)

SentenceTransformer(
  (0): Transformer(
    (auto_model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=T

In [6]:
sentence = 'paris is a beautiful city'
sentence_representation = model.encode(sentence)
print(sentence_representation.shape)

(768,)


As the output shows, the sentence representation has 768 dimensions.

#### Computing sentence similarity

In [7]:
from sentence_transformers import SentenceTransformer, util

In [8]:
model = SentenceTransformer('bert-base-nli-mean-tokens')



In [9]:
sentence1 = 'It was a great day'
sentence2 = 'Today was awesome'

Compute the representations of the two sentences.

In [10]:
sentence1_representation = model.encode(sentence1)
sentence2_representation = model.encode(sentence2)
print(sentence1_representation)
print(sentence2_representation)

[ 3.37693751e-01 -4.82370049e-01  1.58553016e+00  2.00142577e-01
 -1.02100894e-01 -7.23554790e-01 -6.04689419e-01  7.14780033e-01
  7.78009474e-01 -5.74720860e-01  2.86103319e-02  3.59513581e-01
 -1.57165490e-02  5.62267043e-02  9.48175550e-01  4.62555230e-01
 -3.51502836e-01 -9.32625592e-01  4.53409612e-01 -9.73472297e-01
 -9.28709149e-01 -7.79353499e-01 -5.63355505e-01 -8.51361394e-01
  1.37451261e-01 -8.15451622e-01 -3.07235587e-02 -1.05437052e+00
  3.44980806e-01  2.68196911e-01  3.34290266e-02  1.02815425e+00
  3.44589144e-01 -6.34696126e-01  3.46428575e-03  1.70063436e+00
  8.24051350e-02 -4.01637137e-01 -6.57009780e-01 -2.69110411e-01
  1.82982123e+00 -5.56168377e-01  8.90601158e-01 -2.33114928e-01
 -5.54292262e-01  5.29730180e-03 -5.53807318e-01  2.41963118e-01
  7.74686813e-01 -5.55196345e-01  2.71669179e-02 -1.30144283e-01
 -1.34238076e+00  1.38793036e-01 -9.25881207e-01  8.05456638e-01
 -2.46724084e-01 -4.02356893e-01  3.18338305e-01  5.14791548e-01
  3.10212195e-01  7.93040

Compute the cosine similarity.

In [11]:
cosine_sim = util.pytorch_cos_sim(sentence1_representation,sentence2_representation)
print(cosine_sim)

tensor([[0.9313]])
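Cosine similarity is the dot product of two vectors divided by the product of their norms. As a minimal sketch of what such a score measures, the snippet below computes it by hand with NumPy on toy vectors. The helper name `cosine_similarity` and the inputs are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])

# Vectors pointing in the same direction score (approximately) 1,
# regardless of their length.
same_direction = cosine_similarity(a, 2 * a)
# Perpendicular vectors score 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_direction, orthogonal)
```

Because the score ignores vector length, it depends only on the direction of the sentence representations.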


#### Loading a custom model

In [12]:
from sentence_transformers import models, SentenceTransformer

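Now define the word embedding model, which returns a representation for every token in a given sentence. We use pretrained ALBERT as the word embedding model.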

In [13]:
word_embedding_model = models.Transformer('albert-base-v2')

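**Define the pooling model, which pools the token representations**

`pooling_mode_mean_tokens = True` means that mean pooling is used to compute the fixed-length sentence representation.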

In [14]:
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens = True,
    pooling_mode_cls_token = False,
    pooling_mode_max_tokens = False
    )
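When sentences in a batch are padded to a common length, mean pooling should average only over the real tokens and ignore the padding. The following is a minimal NumPy sketch of that idea under stated assumptions: the helper `masked_mean_pool` and the toy data are hypothetical, and the code is an illustration rather than the source of `models.Pooling`.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sentence, three positions, two dimensions; the last position is padding.
embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(embeddings, mask)
print(pooled)
```

The padded position holds large values on purpose: with the mask applied, they do not affect the pooled vector.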

Assemble the Sentence-BERT model from the word embedding model and the pooling model, as shown in the following code.

In [15]:
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

In [16]:
print(model)

SentenceTransformer(
  (0): Transformer(
    (auto_model): AlbertModel(
      (embeddings): AlbertEmbeddings(
        (word_embeddings): Embedding(30000, 128, padding_idx=0)
        (position_embeddings): Embedding(512, 128)
        (token_type_embeddings): Embedding(2, 128)
        (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0, inplace=False)
      )
      (encoder): AlbertTransformer(
        (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
        (albert_layer_groups): ModuleList(
          (0): AlbertLayerGroup(
            (albert_layers): ModuleList(
              (0): AlbertLayer(
                (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (attention): AlbertAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  

In [17]:
model.encode('Transformers are awesome')

array([ 1.53638512e-01,  4.68161181e-02,  3.16998005e-01, -5.63997403e-02,
       -6.21851146e-01, -1.46899849e-01,  1.31276235e-01, -1.22991996e-02,
        2.74347842e-01,  8.78009975e-01,  4.76432979e-01,  6.25795103e-04,
        1.77760720e-01,  1.85318306e-01,  6.20732307e-01, -3.31606209e-01,
        4.84567732e-01, -4.17456090e-01,  1.54896170e-01, -2.39129499e-01,
        5.23909748e-01, -4.82498556e-01,  5.72179437e-01, -2.29993775e-01,
        4.05046225e-01,  4.73296732e-01, -5.42280912e-01, -1.57834336e-01,
        3.09568703e-01, -6.59590364e-02, -1.49889678e-01,  2.92286247e-01,
        9.28213820e-02,  5.09658679e-02, -9.26298499e-01, -5.06899416e-01,
        4.96786423e-02,  2.08063245e-01,  1.30772933e-01,  4.12238855e-03,
        4.75722969e-01, -3.43318492e-01,  2.34481722e-01, -2.52656966e-01,
        1.32224068e-01,  1.15861036e-01, -1.02592465e-02, -8.93762887e-01,
        8.51906657e-01,  2.56798685e-01,  7.89300144e-01,  4.54610676e-01,
       -2.64778763e-01, -

#### Finding similar sentences with Sentence-BERT

In [18]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

In [19]:
model = SentenceTransformer('bert-base-nli-mean-tokens')



Define the master dictionary of sentences (stored here as a Python list).

In [20]:
master_dict = [
                'How to cancel my order?',
                'Please let me know about the cancellation policy?',
                'Do you provide refund?',
                'what is the estimated delivery date of the product?',
                'why my order is missing?',
                'how do i report the delivery of the incorrect items?'
              ]

Define the input question.

In [21]:
inp_question = 'When is my product getting delivered?'

In [22]:
inp_question_representation = model.encode(inp_question, convert_to_tensor=True)

In [23]:
master_dict_representation = model.encode(master_dict, convert_to_tensor=True)

In [24]:
print(master_dict_representation.shape)

torch.Size([6, 768])


In [25]:
similarity = util.pytorch_cos_sim(inp_question_representation, master_dict_representation )
print(similarity)

tensor([[0.5038, 0.5928, 0.5860, 0.8190, 0.5336, 0.6035]])


Print the most similar question.

In [26]:
print('The most similar question in the master dictionary to given input question is:',master_dict[np.argmax(similarity)])

The most similar question in the master dictionary to given input question is: what is the estimated delivery date of the product?
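`np.argmax` returns only the single best match. In practice it is often useful to rank several candidates. The sketch below does this with `np.argsort`, reusing the similarity scores printed earlier in this section so that it runs without downloading a model. The variable names `scores`, `top_k`, and `top_indices` are illustrative assumptions.

```python
import numpy as np

master_dict = [
    'How to cancel my order?',
    'Please let me know about the cancellation policy?',
    'Do you provide refund?',
    'what is the estimated delivery date of the product?',
    'why my order is missing?',
    'how do i report the delivery of the incorrect items?',
]

# Similarity scores copied from the tensor printed earlier in this section.
scores = np.array([0.5038, 0.5928, 0.5860, 0.8190, 0.5336, 0.6035])

top_k = 3
# argsort sorts ascending, so reverse it and keep the first top_k indices.
top_indices = np.argsort(scores)[::-1][:top_k]

for idx in top_indices:
    print(f"{scores[idx]:.4f}  {master_dict[idx]}")
```

The first row of this ranking is the same question that `np.argmax` selects.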
