## Sentence Transformers

Sentence Transformers requires Python 3.6 or higher and PyTorch 1.6.0 or higher. PyTorch does not require a GPU, but running it on one makes encoding much faster. Colab comes with PyTorch and TensorFlow/Keras preinstalled, and you can use the Colab GPU, although Colab meters GPU use and may charge you a small fee. If you are using a local machine, install PyTorch and torchvision as follows. PyTorch is supported on macOS 10.15 (Catalina) or higher. See https://pytorch.org/get-started/locally/ for installation details.

conda install pytorch torchvision torchaudio -c pytorch<br>
pip3 install torch torchvision

In [None]:
# !pip3 install torch torchvision

In [None]:
# Verify if torch installation was successful; it should display a 5 x 3 matrix of random numbers

import torch
x = torch.rand(5, 3)
print(x)

In [1]:
# Check for GPU (using PyTorch)
# If available, tell PyTorch to use the GPU, otherwise use the CPU

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [3]:
# !pip install transformers sentence_transformers

In [4]:
# conda install -c conda-forge ipywidgets
# jupyter nbextension enable --py widgetsnbextension
# Restart Jupyter notebook

import numpy as np
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('stsb-roberta-large')


### Semantic Text Similarity (STS)
##### Pretrained sentence embedding models for Semantic Textual Similarity (STS):

The following models are optimized for Semantic Textual Similarity (STS).

- ***stsb-roberta-large***: STSb performance: 86.39
- ***stsb-roberta-base***: STSb performance: 85.44
- ***stsb-bert-large***: STSb performance: 85.29
- ***stsb-distilbert-base***: STSb performance: 85.16

The following models are recommended for a range of applications, including similarity and retrieval tasks, because they were trained on millions of paraphrase examples. They are currently under development, but they outperform the NLI/STSb models on many tasks.

- ***paraphrase-distilroberta-base-v1***: Trained on large scale paraphrase data.
- ***paraphrase-xlm-r-multilingual-v1***: Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages.

In [5]:
# Calculate semantic similarity between two sentences
sentence1 = "I like Python because I can build AI applications"
sentence2 = "I like Python because I can do data analytics"

# Encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute similarity scores of two embeddings
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_score.item())

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity score: 0.8015277981758118
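
The score printed above is a cosine similarity: the dot product of the two embedding vectors divided by the product of their lengths. As a minimal sketch of that formula, the pure-Python function below computes the same quantity without PyTorch. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; they are not part of sentence_transformers, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings (hypothetical values)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # points in the same direction as v1
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1 (dot product is 0)

print("same direction:", cosine_similarity(v1, v2))
print("orthogonal:", cosine_similarity(v1, v3))
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, which is why unrelated sentences can receive slightly negative scores.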


In [6]:
# Print sentence embeddings
sentences = ["I like Python because I can build AI applications",
             "I like Python because I can do data analytics",
             "The cat sits on the ground"]

sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: I like Python because I can build AI applications
Embedding: [-0.4627137   0.7406838  -0.26615486 ...  1.6758355  -2.6872845
 -0.21768892]

Sentence: I like Python because I can do data analytics
Embedding: [-0.3860078   0.6501616  -0.30140662 ...  1.5000778  -2.2584777
  0.7605823 ]

Sentence: The cat sits on the ground
Embedding: [-0.23815408  0.5204212  -0.2830657  ...  0.09840191 -0.55245036
  0.40428603]



In [7]:
# Calculate semantic similarity between two lists of sentences
sentences1 = ["I like Python because I can build AI applications",
              "The cat sits on the ground"]
sentences2 = ["I like Python because I can do data analytics",
              "The cat walks on the sidewalk"]

embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.8015277981758118

Sentence 1: I like Python because I can build AI applications
Sentence 2: The cat walks on the sidewalk
Similarity Score: -0.031110037118196487

Sentence 1: The cat sits on the ground
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.11328636854887009

Sentence 1: The cat sits on the ground
Sentence 2: The cat walks on the sidewalk
Similarity Score: 0.40381476283073425



In [11]:
# Retrieve the top-k most similar sentence pairs from a corpus of sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.']

embeddings = model.encode(corpus)                                  # 9 sentence embeddings

# Compute cosine similarity between all pairs
cos_sim = util.pytorch_cos_sim(embeddings, embeddings)             # 9x9 matrix

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])    # Flattening the cos_sim matrix

# Sort and print list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(corpus[i], corpus[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.6789
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.6144
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.5249
Two men pushed carts through the woods. 	 A man is riding a white horse on an enclosed ground. 	 0.2684
A man is riding a horse. 	 Two men pushed carts through the woods. 	 0.2275
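
The loop above walks the upper triangle of the similarity matrix, collects every pair once, and sorts the full list. The same idea can be sketched in pure Python with `itertools.combinations` and `heapq.nlargest`, which keeps only the best k pairs instead of sorting all of them. The helper names `cosine` and `top_k_pairs` and the 2-dimensional toy vectors are assumptions made for illustration; the sketch also assumes no vector is all zeros.

```python
import heapq
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_pairs(vectors, k):
    """Return the k highest-scoring (score, i, j) triples with i < j."""
    scored = ((cosine(vectors[i], vectors[j]), i, j)
              for i, j in combinations(range(len(vectors)), 2))
    return heapq.nlargest(k, scored)

# Toy vectors standing in for sentence embeddings (hypothetical values)
toy_vectors = [
    [1.0, 0.0],   # index 0
    [0.9, 0.1],   # index 1: close to index 0
    [0.0, 1.0],   # index 2
    [0.2, 0.8],   # index 3: close to index 2
]

for score, i, j in top_k_pairs(toy_vectors, k=2):
    print(i, j, round(score, 4))
```

For a corpus of n sentences there are n(n-1)/2 pairs, so this exhaustive comparison is only practical for small corpora.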


### Information Retrieval (Q&A)

##### Embedding models for search queries (information retrieval):

The following models are optimized for question-answer retrieval in search queries.

- ***msmarco-distilbert-base-v3***: Trained on MSMARCO Passage Ranking, a dataset with 500k real queries from Microsoft Bing search.
- ***msmarco-roberta-base-ance-firstp***
- ***nq-distilbert-base-v1***: MRR@10 of 72.36 on the NQ dev set (small). Trained on Google’s Natural Questions dataset, a dataset with 100k real queries from Google search, together with the relevant passages from Wikipedia.
- ***facebook-dpr-ctx_encoder-single-nq-base***: Karpukhin et al. trained these models for Dense Passage Retrieval (DPR) for Open-Domain Question Answering, using Google’s Natural Questions dataset.
- ***facebook-dpr-question_encoder-single-nq-base***
- ***facebook-dpr-ctx_encoder-multiset-base***: Karpukhin et al. also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC.
- ***facebook-dpr-question_encoder-multiset-base***
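
The MRR10 (MRR@10) figure quoted in the list above is a mean reciprocal rank cut off at rank 10: for each query, take 1/rank of the first relevant passage among the top 10 results (or 0 if none appears), then average over all queries. The sketch below shows that calculation under those assumptions. The function name `mrr_at_k`, the passage ids, and the rankings are made up for illustration and do not reproduce the reported score.

```python
def mrr_at_k(ranked_results, relevant, k=10):
    """Mean reciprocal rank of the first relevant item within the top k.

    ranked_results: one ranked list of passage ids per query
    relevant: one set of relevant passage ids per query
    """
    if len(ranked_results) != len(relevant):
        raise ValueError("need one relevance set per query")
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranking, gold in zip(ranked_results, relevant):
        for rank, passage_id in enumerate(ranking[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank   # only the first relevant hit counts
                break
    return total / len(ranked_results)

# Made-up rankings for three queries (illustrative only)
rankings = [["p3", "p1", "p7"], ["p2", "p9"], ["p5", "p6"]]
gold = [{"p1"}, {"p2"}, {"p8"}]

print(mrr_at_k(rankings, gold))  # (1/2 + 1 + 0) / 3 = 0.5
```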

In [12]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-distilbert-base-v3')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode('London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))


Similarity: tensor([[0.6082]])


In [13]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('nq-distilbert-base-v1')

query_embedding = model.encode('How many people live in London?')

# The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))


Similarity: tensor([[0.6503]])


### Multi-Lingual Models

Models for semantic similarity find semantically similar sentences within one language or across languages. The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close together in vector space. You do not need to specify the input language.

- ***distiluse-base-multilingual-cased-v1***: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
- ***distiluse-base-multilingual-cased-v2***: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 50+ languages. However, performance on the 15 languages mentioned above is a bit lower.
- ***paraphrase-xlm-r-multilingual-v1***: Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages.
- ***stsb-xlm-r-multilingual***: Produces embeddings similar to those of the stsb-bert-base model. Trained on parallel data for 50+ languages.
- ***quora-distilbert-multilingual***: Multilingual version of quora-distilbert-base. Fine-tuned with parallel data for 50+ languages.

Bitext mining is the process of finding translated sentence pairs in two languages. The best model for this use case is ***LaBSE***, which finds translation pairs across 109 languages. It works less well for assessing the similarity of sentence pairs that are not translations of each other.
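
Because translations land close together in an aligned vector space, bitext mining can be approximated by a nearest-neighbor search between the two languages. The sketch below is a deliberately simplified version of that idea: it keeps a pair only when the two sentences are each other's nearest neighbor and their cosine similarity clears a threshold. The function name `mine_translation_pairs`, the threshold of 0.8, and the toy vectors are all assumptions for illustration; this is not LaBSE's own procedure, and practical mining pipelines tend to use more elaborate scoring.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mine_translation_pairs(src_vectors, tgt_vectors, threshold=0.8):
    """Return (i, j, score) for mutual nearest neighbors scoring >= threshold."""
    pairs = []
    if not src_vectors or not tgt_vectors:
        return pairs
    # sims[i][j] = similarity of source sentence i and target sentence j
    sims = [[cosine(s, t) for t in tgt_vectors] for s in src_vectors]
    for i, row in enumerate(sims):
        j = max(range(len(row)), key=row.__getitem__)             # best target for source i
        back = max(range(len(sims)), key=lambda r: sims[r][j])    # best source for target j
        if back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs

# Toy embeddings (hypothetical): source 0 matches target 1, source 1 matches target 0,
# and source 2 has no translation in the target set
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
tgt = [[0.1, 0.9, 0.2], [0.9, 0.2, 0.0]]

for i, j, score in mine_translation_pairs(src, tgt):
    print(f"source {i} <-> target {j}  score={score:.4f}")
```

The mutual-nearest-neighbor check is what filters out source sentence 2 here: its closest target already has a better match.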

For details, see https://www.sbert.net/docs/pretrained_models.html

### Machine Translation

For translation between two languages, say English to German, we need a language model pretrained for this task. T5 (Text-to-Text Transfer Transformer) is a transformer pretrained on the massive C4 dataset for translation, question-answering, and classification tasks, including English-to-German translation. Use the transformers pipeline to simplify the code (the pipeline takes care of input tokenization and encoding, output decoding, etc.).

In [14]:
from transformers import pipeline
translation = pipeline(task="translation_en_to_de", model="t5-base", tokenizer="t5-base")


In [15]:
text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)

Ich studiere gerne Datenwissenschaft und maschinelles Lernen
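
T5 treats every task as text-to-text, so the task is signaled by a short textual prefix placed in front of the input; the translation pipeline adds that prefix for you. The sketch below only illustrates how such an input string is assembled and does not load a model. The dictionary `T5_PREFIXES` and the helper `build_t5_input` are hypothetical names, and the prefixes are written in the style used by the original T5 checkpoints, which should be checked against the configuration of the checkpoint you actually load.

```python
# Task prefixes in the style used by the original T5 checkpoints
# (assumption: verify against the task-specific settings of your checkpoint)
T5_PREFIXES = {
    "translation_en_to_de": "translate English to German: ",
    "translation_en_to_fr": "translate English to French: ",
    "summarization": "summarize: ",
}

def build_t5_input(task, text):
    """Prepend the task prefix so the model knows which task to perform."""
    if task not in T5_PREFIXES:
        raise ValueError(f"no prefix known for task {task!r}")
    return T5_PREFIXES[task] + text.strip()

print(build_t5_input("translation_en_to_de",
                     "I like to study Data Science and Machine Learning"))
```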


The Hugging Face community (https://huggingface.co/models?filter=translation) has created a set of pretrained language models for machine translation between different language pairs.

For example, English-to-Chinese translation using Helsinki-NLP's pretrained model on Hugging Face (https://huggingface.co/Helsinki-NLP/opus-mt-en-zh):

In [16]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is its replacement for encoder-decoder models
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)

text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)




我喜欢学习数据科学和机器学习
