In [3]:
from IPython.display import Image

- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
    - https://arxiv.org/abs/1908.10084
- reference
    - https://www.pinecone.io/learn/series/nlp/train-sentence-transformers-softmax/

In [69]:
# !pip install sentence_transformers
# sbert

In [66]:
from sentence_transformers import losses

## sentence level tasks

- NLI (natural language inference): the relationship between two sentences
    - This task receives two input sentences and outputs “entailment”, “contradiction”, or “neutral”.
    - entailment: sentence1 entails sentence2
    - contradiction: sentence1 contradicts sentence2
    - neutral: sentence1 neither entails nor contradicts sentence2.
- STS (semantic textual similarity):
    - This task receives two sentences and scores how similar they are. The similarity is often computed as the cosine similarity of the two sentence embeddings.
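
As a rough illustration of the cosine similarity used for STS, here is a minimal pure-Python sketch. The function name `cosine_similarity` and the toy vectors are made up for illustration; they are not model output or library code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (made-up values, not model output)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.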

## demos

- `paraphrase-MiniLM-L6-v2`
    - embedding dimension: 384 = 32*12

```
from sentence_transformers import SentenceTransformer, util
embed_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# sentence1 and sentence2 are placeholders for any two input strings
embed_1 = embed_model.encode(sentence1, convert_to_tensor=True)
embed_2 = embed_model.encode(sentence2, convert_to_tensor=True)
cos_sim = util.pytorch_cos_sim(embed_1, embed_2).item()
```

In [2]:
from sentence_transformers import SentenceTransformer, util
embed_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embed_1 = embed_model.encode('the movie is great!', convert_to_tensor=True)
embed_2 = embed_model.encode('positive', convert_to_tensor=True)
cos_sim = util.pytorch_cos_sim(embed_1, embed_2).item()
print(cos_sim)
embed_3 = embed_model.encode('negative', convert_to_tensor=True)
cos_sim = util.pytorch_cos_sim(embed_1, embed_3).item()
print(cos_sim)

0.1806926727294922
-0.12082745879888535


## models

### bert

In [5]:
Image(url='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*fhd9gsQGBcPWduThINIIUQ.png', width=600)

- BERT is very good at learning the meaning of words/tokens.
    - But it is not good at learning the meaning of sentences.
    - This hurts sentence-level tasks such as sentence classification and sentence-pair similarity.
- BERT produces token embeddings; one way to get a sentence embedding out of BERT is to average the embeddings of all tokens.
    - The Sentence-BERT paper showed that this produces low-quality sentence embeddings, often worse than averaging GloVe embeddings. These embeddings do not capture the meaning of sentences well.
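
A minimal sketch of the token-averaging idea, assuming token embeddings are plain lists and padding is marked by an attention mask of 0/1. `mean_pool` is a hypothetical helper written for this note, not a library function.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumes at least one position is unmasked.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    return [total / count for total in summed]

# Three 2-d token vectors; the last position is padding (toy values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

Without the mask, padding vectors would leak into the average, so the mask is worth keeping even in a toy version.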

### Training BERT on NLI (classification objective)

The model is a Siamese network. "Siamese" means twins: it consists of two networks with exactly the same architecture that also share their weights.

In [6]:
Image(url='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*XB85tOf1kWmpZxoTC3ab5g.png', width=600)

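- sentence u => `emb(u)` (768d)
- sentence v => `emb(v)` (768d)
- `|emb(u)-emb(v)|` (768d, element-wise absolute difference)

$$
o=\text{softmax}(W_t(u,v,|u-v|))
$$

- $W_t \in \mathbb{R}^{3n \times k}$, with $n=768$ the embedding dimension and $k=3$ the number of NLI labels.
- cross entropy loss.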
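
The classification head can be sketched in a few lines of pure Python. This is only an illustration under toy assumptions: 2-d embeddings instead of 768-d, and a made-up weight matrix `W_t`. The helper names are hypothetical.

```python
import math

def classification_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the gold label."""
    return -math.log(probs[label])

# Toy 2-d embeddings and a toy weight matrix W_t of shape (3n, k) = (6, 3)
u, v = [1.0, 0.0], [0.0, 1.0]
W_t = [[0.1, 0.0, -0.1]] * 6
features = classification_features(u, v)  # length 3n = 6
logits = [sum(f * row[j] for f, row in zip(features, W_t)) for j in range(3)]
probs = softmax(logits)
print(features, probs, cross_entropy(probs, label=0))
```

In the real setup the gradient of this loss flows back through both towers, which share one set of BERT weights.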

### Training BERT on STS (regression objective)


In [68]:
losses.ContrastiveLoss??

In [7]:
Image(url='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*BQ4H_KErGUroYQ-59WhARA.png', width=600)

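- The semantic textual similarity task receives two sentences and computes their similarity.
- The network architecture for fine-tuning BERT on STS is shown above. It is again a Siamese network with mean pooling on top: the cosine similarity of the two sentence embeddings is regressed onto the gold similarity score with a mean squared error loss.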
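
A small sketch of the regression objective as described in the Sentence-BERT paper: cosine similarity of the two embeddings, compared to a gold score with mean squared error. It assumes gold scores are already rescaled to the cosine range; the function names and toy vectors are illustrative only.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def regression_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [(cos_sim(u, v) - y) ** 2 for (u, v), y in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)

pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction, cosine = 1
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal, cosine = 0
print(regression_loss(pairs, gold_scores=[1.0, 0.5]))
```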

### Training BERT on Triplet dataset (triplet objective)

In [8]:
Image(url='https://miro.medium.com/v2/resize:fit:1400/format:webp/1*KPhp8A6pFsue7F8z8sF6-A.png', width=600)


- In the triplet objective, the model receives an anchor data point,
- a positive data point that is related or close to the anchor,
- and a negative data point that is unrelated to the anchor.

To collect such data in the text domain, we can pick a random sentence from a document as the anchor, its following sentence as the positive, and a random sentence from a different passage as the negative.

$$
|a-p|\lt |a-n|\\
L := \max (0, |a-p|-|a-n|+\epsilon)\\
|a-p| \leq |a-n|-\epsilon\\
\Downarrow \\
L=0
$$
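
The triplet loss above can be sketched directly. This version assumes Euclidean distance and a margin of 1.0 (the setting I recall from the Sentence-BERT paper); `triplet_loss` here is a hand-written illustration, not the library class.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, |a - p| - |a - n| + margin)"""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

a = [0.0, 0.0]
print(triplet_loss(a, [0.0, 1.0], [3.0, 4.0]))  # negative is far away: loss is zero
print(triplet_loss(a, [3.0, 4.0], [0.0, 1.0]))  # negative is closer than positive: loss is positive
```

Once the negative is at least `margin` farther from the anchor than the positive, the triplet contributes no gradient.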

## pretrain

### models

In [12]:
from sentence_transformers import SentenceTransformer, models

In [19]:
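# Token-level encoder: BERT produces one 768-d vector per token
word_embed_model = models.Transformer('bert-base-uncased')
# Pooling layer: collapses the token embeddings into one sentence vector.
# Here the [CLS] token embedding is used instead of the mean of all tokens
# (pooling_mode='cls' already implies the two boolean flags below).
pooling_model = models.Pooling(word_embed_model.get_word_embedding_dimension(), 
                               pooling_mode = 'cls',
                               pooling_mode_cls_token=True, 
                               pooling_mode_mean_tokens = False)
model = SentenceTransformer(modules=[word_embed_model, pooling_model])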

In [20]:
word_embed_model.get_word_embedding_dimension()

768

In [144]:
model.modules

<bound method Module.modules of SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)>

### Data and task

In [121]:
import os
# Route downloads through a local proxy (machine-specific; skip with direct network access)
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

In [122]:
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")

In [28]:
dataset['train']

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [29]:
dataset['train'][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [43]:
# InputExample fields: texts / label / guid
from sentence_transformers import InputExample

In [151]:
# Wrap each MRPC sentence pair as an InputExample (label 1 = paraphrase, 0 = not a paraphrase)
training_ds = []
for example in dataset['train']:
    training_ds.append(InputExample(texts=[example['sentence1'], example['sentence2']], 
                                    label=float(example['label'])))

In [152]:
from torch.utils.data import DataLoader

In [153]:
train_loader = DataLoader(training_ds, shuffle=True, batch_size=8)

In [154]:
import math
math.ceil(len(training_ds)/8)

459

In [155]:
len(train_loader)

459

In [157]:
# batch = next(iter(train_loader))
# batch

In [159]:
# batch[0][0]['input_ids'].shape
# batch[0][1]['input_ids'].shape

### training loss

In [160]:
train_examples = [
    InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1),
    InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0)]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

In [148]:
next(iter(train_dataloader))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>
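
The error comes from PyTorch's default collate function, which does not know how to stack `InputExample` objects. As far as I know, `model.fit` swaps in the model's own `smart_batching_collate` before iterating, so the loader works during training but not when iterated by hand. Below is a rough sketch of what a collate step has to do, using a hypothetical stand-in class `Example` (not the real `InputExample`) and a made-up helper `collate_examples`; the real collate function additionally tokenizes the texts.

```python
class Example:
    """Hypothetical stand-in for InputExample: a list of texts plus a label."""
    def __init__(self, texts, label):
        self.texts = texts
        self.label = label

def collate_examples(batch):
    """Turn a list of examples into (one list of texts per column, list of labels)."""
    num_columns = len(batch[0].texts)
    columns = [[example.texts[i] for example in batch] for i in range(num_columns)]
    labels = [example.label for example in batch]
    return columns, labels

batch = [Example(['This is a positive pair', 'Where the distance will be minimized'], 1),
         Example(['This is a negative pair', 'Their distance will be increased'], 0)]
columns, labels = collate_examples(batch)
print(columns[0])  # first sentence of every pair
print(labels)
```

A function with this shape could be passed as `collate_fn` to a `DataLoader` to inspect batches manually.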

In [63]:
from sentence_transformers import losses

In [65]:
losses.ContrastiveLoss??

In [64]:
train_loss = losses.ContrastiveLoss(model=model)
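
For intuition, here is a per-pair sketch of the contrastive loss. It assumes what I understand to be the library defaults (cosine distance and a margin of 0.5); treat those as assumptions and check `losses.ContrastiveLoss??` for the actual values. The function names are illustrative.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(u, v, label, margin=0.5):
    """label 1: pull the pair together; label 0: push it apart until `margin`."""
    d = cosine_distance(u, v)
    return 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)

print(contrastive_loss([1.0, 0.0], [1.0, 0.0], label=1))  # identical positive pair
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], label=0))  # identical negative pair
print(contrastive_loss([1.0, 0.0], [0.0, 1.0], label=0))  # distant negative pair
```

Positive pairs are penalized by their squared distance, while negative pairs are penalized only when they sit closer than the margin.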

In [142]:
model.device

device(type='cuda', index=0)

In [145]:
model.encode??

### training

In [112]:
from sentence_transformers import evaluation

In [111]:
s1s = []
s2s = []
scores = []
for example in dataset['validation']:
    s1s.append(example['sentence1'])
    s2s.append(example['sentence2'])
    scores.append(float(example['label']))
evaluator = evaluation.BinaryClassificationEvaluator(s1s, s2s, scores)

In [78]:
# Start training
model.fit(
    train_objectives=[(train_loader, train_loss)], 
    evaluator=evaluator,
    evaluation_steps=200,
    epochs=5, 
    warmup_steps=0,
    output_path='./sentence_transformer/',
    weight_decay=0.01,
    optimizer_params={'lr': 0.00004},
    save_best_model=True,
    show_progress_bar=True,

)

Configuration saved in ./sentence_transformer//config.json
Model weights saved in ./sentence_transformer//pytorch_model.bin
tokenizer config file saved in ./sentence_transformer//tokenizer_config.json
Special tokens file saved in ./sentence_transformer//special_tokens_map.json

### test

In [79]:
sentences = ['This is just a random sentence on a friday evenning', 'to test model ability.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

print(embeddings)

[[ 0.07039643 -0.15299119  0.21200444 ... -0.0314035   0.10178623
   0.23059756]
 [-0.08689684 -0.02244206 -0.02044529 ... -0.5961807  -0.04331313
   0.21836686]]


In [81]:
from sentence_transformers import util

# Predict "paraphrase" when the cosine similarity exceeds a fixed threshold of 0.5
correct = 0
for row in dataset['test']:
    u = model.encode(row['sentence1'])
    v = model.encode(row['sentence2'])
    cos_score = util.cos_sim(u, v).item()
    predicted = 1 if cos_score > 0.5 else 0
    if predicted == row['label']:
        correct += 1

12.28


In [82]:
print(correct/len(dataset['test']))

0.7118840579710145
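
The 0.5 cut-off above is arbitrary. One possible refinement, sketched here with made-up scores, is to pick the threshold that maximizes accuracy on held-out data (ideally the validation split, not the test split). Both helpers are hypothetical and written only for this note.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Fraction of pairs classified correctly when score > threshold means label 1."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def best_threshold(scores, labels):
    """Try each observed score as the cut-off and keep the most accurate one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at_threshold(scores, labels, t))

# Toy cosine scores and gold labels (made-up values)
scores = [0.9, 0.7, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(accuracy_at_threshold(scores, labels, 0.5))
threshold = best_threshold(scores, labels)
print(threshold, accuracy_at_threshold(scores, labels, threshold))
```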
