<a href="https://colab.research.google.com/github/dk-wei/super-duper-transformer/blob/main/Sentence_Transformers_(Siamese_Network_%2B_Bert)_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [48]:
!pip install -U sentence-transformers

Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.7/dist-packages (0.4.1.2)


# Generate sentence embedding

The result is a sentence embedding, produced by a model that combines a Siamese network with BERT. Because training used a cosine-similarity loss function, the resulting embeddings can be compared directly with cosine similarity to measure **semantic similarity**.
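
As a rough illustration of the training objective mentioned above, here is a minimal sketch of one common form of cosine-similarity loss: the squared error between the cosine similarity of a sentence pair and a gold similarity score. The toy 2-d vectors and the function names are assumptions for illustration; this is not the library's training code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [(cosine_similarity(u, v) - gold) ** 2
              for (u, v), gold in zip(pairs, gold_scores)]
    return sum(errors) / len(errors)

# Toy embeddings standing in for the two outputs of the shared (Siamese) encoder.
pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction, gold score 1.0
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal, gold score 0.0
gold_scores = [1.0, 0.0]

loss = cosine_similarity_loss(pairs, gold_scores)
print("loss:", loss)
```

Both toy pairs already match their gold scores, so the loss is zero; a pair whose cosine similarity disagrees with its label would contribute a positive error.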

In [None]:
from sentence_transformers import SentenceTransformer, util
# paraphrase-distilroberta-base-v1 is a DistilRoBERTa-base model fine-tuned on a large dataset of paraphrase sentences.
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

100%|██████████| 306M/306M [00:17<00:00, 17.5MB/s]


In [None]:
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.',
    '230u9nnJNKJNJK!@#@$']
    
sentence_embeddings = model.encode(sentences, show_progress_bar=True)






BERT (and other transformer networks) output an embedding for each token in the input text. To create a fixed-size sentence embedding from these, the model applies mean pooling: the output embeddings of all tokens are averaged to yield a 768-dimensional vector.
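
The mean pooling step can be sketched in plain Python. This is a minimal illustration, assuming toy 3-dimensional token vectors instead of 768-dimensional ones and an attention mask that marks padding positions to be skipped; the function name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens (mask == 1), skipping padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_tokens = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_tokens += 1
            for k in range(dim):
                summed[k] += vector[k]
    # Guard against an all-padding input to avoid dividing by zero.
    return [s / max(n_tokens, 1) for s in summed]

token_embeddings = [[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0],
                    [9.0, 9.0, 9.0]]   # padding position, ignored below
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

Whatever the number of tokens, the result always has the same length as a single token embedding, which is why every sentence ends up with a vector of the same size.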

In [None]:
# for sentence, embedding in zip(sentences, sentence_embeddings):
#     print("Sentence:", sentence)
#     print("Embedding:", embedding)
#     print("")

In [None]:
sentence_embeddings.shape

(4, 768)

# Choose the Right Model

## Paraphrase Identification

The following models are recommended for various applications, as they were trained on millions of paraphrase examples. They produce very good results on various similarity and retrieval tasks. They are still under development; better versions and more details will be released in the future. On many tasks, however, they already work better than the NLI / STSb models.

- `paraphrase-distilroberta-base-v1` - Trained on large scale paraphrase data.
- `paraphrase-xlm-r-multilingual-v1` - Multilingual version of `paraphrase-distilroberta-base-v1`, trained on parallel data for 50+ languages.

## Semantic Textual Similarity
The following models were optimized for Semantic Textual Similarity (STS). They were trained on SNLI+MultiNLI and then fine-tuned on the STS benchmark train set.

The best available models for STS are:

- `stsb-roberta-large` - STSb performance: 86.39
- `stsb-roberta-base` - STSb performance: 85.44
- `stsb-bert-large` - STSb performance: 85.29
- `stsb-distilbert-base` - STSb performance: 85.16

## Duplicate Questions Detection
The following models were trained for duplicate questions mining and duplicate questions retrieval. You can use them to detect duplicate questions in a large corpus (see paraphrase mining) or to search for similar questions (see semantic search).

Available models:

- `quora-distilbert-base` - Model first tuned on NLI+STSb data, then fine-tuned for Quora duplicate question detection and retrieval.
- `quora-distilbert-multilingual` - Multilingual version of `quora-distilbert-base`, fine-tuned with parallel data for 50+ languages.

## Question-Answer Retrieval - MSMARCO
The following models were trained on MSMARCO Passage Ranking, a dataset with 500k real queries from Bing search. Given a search query, find the relevant passages.

- `msmarco-distilroberta-base-v2`: MRR@10: 28.55 on MS MARCO dev set
- `msmarco-roberta-base-v2`: MRR@10: 29.17 on MS MARCO dev set
- `msmarco-distilbert-base-v2`: MRR@10: 30.77 on MS MARCO dev set
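
For readers unfamiliar with the metric, MRR@10 (mean reciprocal rank at 10) averages, over all queries, the reciprocal of the rank at which the first relevant passage appears in the top 10 results, counting 0 when none appears. The sketch below uses made-up passage ids and a hypothetical helper `mrr_at_k`; the figures listed above appear to be this value scaled to 0-100.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k results."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Made-up rankings for three queries.
ranked = [["p3", "p1", "p7"],   # first relevant hit at rank 2 -> 1/2
          ["p2", "p9", "p4"],   # first relevant hit at rank 1 -> 1
          ["p5", "p6", "p8"]]   # no relevant hit            -> 0
relevant = [{"p1"}, {"p2"}, {"p0"}]

score = mrr_at_k(ranked, relevant)
print(score)  # 0.5
```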

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('msmarco-distilbert-base-v2')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode('London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))


100%|██████████| 245M/245M [00:16<00:00, 14.8MB/s]


Similarity: tensor([[0.6136]])


## Question-Answer Retrieval - Natural Questions

The following models were trained on Google’s Natural Questions dataset, a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia.

- `nq-distilbert-base-v1`: MRR@10: 72.36 on NQ dev set (small)

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('nq-distilbert-base-v1')

query_embedding = model.encode('How many people live in London?')

#The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

100%|██████████| 245M/245M [00:16<00:00, 14.8MB/s]


Similarity: tensor([[0.6503]])


# Comparing Sentence Similarities

The sentences (texts) are mapped such that sentences with similar meanings are close in vector space. One common method to measure the similarity in vector space is to use cosine similarity. For two sentences, this can be done like this:

In [None]:
# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome',
              'he is so smart',
              'i work at a bank']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so amazing',
              'he is a wise people',
              'i like sitting and drinking at bank of river']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.4579
A man is playing guitar 		 A woman watches TV 		 Score: 0.1759
The new movie is awesome 		 The new movie is so amazing 		 Score: 0.9159
he is so smart 		 he is a wise people 		 Score: 0.6925
i work at a bank 		 i like sitting and drinking at bank of river 		 Score: 0.2215
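
To make the score matrix in the cell above less of a black box, here is a plain-Python sketch of a pairwise cosine-similarity matrix. It assumes toy 2-d vectors and hypothetical helper names; entry `[i][j]` compares item `i` of the first list with item `j` of the second, so reading only the diagonal `[i][i]` compares the lists position by position, as the loop above does.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos_sim_matrix(a, b):
    """Return matrix m where m[i][j] is the cosine similarity of a[i] and b[j]."""
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    # After normalization, cosine similarity is just the dot product.
    return [[sum(x * y for x, y in zip(u, v)) for v in b_n] for u in a_n]

emb1 = [[1.0, 0.0], [0.0, 2.0]]
emb2 = [[3.0, 0.0], [1.0, 1.0]]
scores = cos_sim_matrix(emb1, emb2)

for i in range(len(emb1)):
    print("pair", i, "score: {:.4f}".format(scores[i][i]))
```

Vector length does not affect the score: `[1.0, 0.0]` and `[3.0, 0.0]` point the same way, so their similarity is 1.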


A brute-force way to find the most similar sentences for each sentence:

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

The new movie is awesome 		 The new movie is so great 		 Score: 0.9283
The cat sits outside 		 The cat plays in the garden 		 Score: 0.6855
I love pasta 		 Do you like pizza? 		 Score: 0.5420
I love pasta 		 The new movie is awesome 		 Score: 0.2629
I love pasta 		 The new movie is so great 		 Score: 0.2268
The new movie is awesome 		 Do you like pizza? 		 Score: 0.1885
A man is playing guitar 		 A woman watches TV 		 Score: 0.1759
The new movie is so great 		 Do you like pizza? 		 Score: 0.1615
The cat plays in the garden 		 A woman watches TV 		 Score: 0.1521
The cat sits outside 		 The new movie is awesome 		 Score: 0.1475


# Paraphrase Mining

Use case: given a collection of sentences, quickly find the similar/duplicate sentences for each one.

Paraphrase mining is the task of finding paraphrases (texts with identical or similar meaning) in a large corpus of sentences. The Comparing Sentence Similarities section showed a simplified version of this: a brute-force approach that scores and ranks all pairs in a list of sentences.

However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences.
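
A quick calculation shows how fast the number of pairs grows: n sentences give n(n-1)/2 distinct pairs. The helper name `n_pairs` is just for illustration.

```python
def n_pairs(n):
    """Number of distinct unordered pairs among n sentences."""
    return n * (n - 1) // 2

for n in (8, 10_000, 1_000_000):
    print("{:>9,} sentences -> {:,} pairs".format(n, n_pairs(n)))
```

The 8-sentence brute-force example scores only 28 pairs, while 10,000 sentences already require about 50 million.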

For larger collections, `util` offers the `paraphrase_mining` function. Its parameters and return value are summarized here, followed by a usage example:

**Parameters**:
- `model` – SentenceTransformer model for embedding computation
- `sentences` – A list of strings (texts or sentences)
- `show_progress_bar` – Whether to display a progress bar
- `batch_size` – Number of texts that are encoded simultaneously by the model
- `query_chunk_size` – Number of sentences for which the most similar pairs are searched at the same time. Decrease to lower the memory footprint (increases run-time).
- `corpus_chunk_size` – Number of other sentences a sentence is compared against simultaneously. Decrease to lower the memory footprint (increases run-time).
- `max_pairs` – Maximum number of text pairs returned.
- `top_k` – For each sentence, up to `top_k` other sentences are retrieved.

**Returns**: A list of triplets with the format [score, id1, id2]


In [None]:
# Single list of sentences - possibly tens of thousands of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'I love chinese food',
             'I love italian food',
             'I love kung pao chicken',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

paraphrases = util.paraphrase_mining(model, sentences, top_k=1)

for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

The new movie is awesome 		 The new movie is so great 		 Score: 0.7063
I love chinese food 		 I love kung pao chicken 		 Score: 0.6930
I love pasta 		 I love italian food 		 Score: 0.6776
The cat sits outside 		 The cat plays in the garden 		 Score: 0.5356
Do you like pizza? 		 I love italian food 		 Score: 0.4335
A man is playing guitar 		 A woman watches TV 		 Score: 0.3726
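
The role of the chunk-size and `top_k` parameters can be illustrated with a plain-Python sketch. This is not the library's implementation, only an approximation of the idea under stated assumptions: toy 2-d embeddings, a hypothetical function `mine_pairs`, and scores computed one (query chunk x corpus chunk) block at a time so that the full n x n matrix is never held in memory.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mine_pairs(embeddings, top_k=1, query_chunk_size=2, corpus_chunk_size=2, max_pairs=100):
    """Return [score, i, j] triplets (i < j), best first, scoring chunk by chunk."""
    vecs = [normalize(v) for v in embeddings]
    n = len(vecs)
    best = {}  # (i, j) -> score; the key order removes (i, j)/(j, i) duplicates
    for q_start in range(0, n, query_chunk_size):
        q_ids = range(q_start, min(q_start + query_chunk_size, n))
        candidates = {i: [] for i in q_ids}
        for c_start in range(0, n, corpus_chunk_size):
            c_ids = range(c_start, min(c_start + corpus_chunk_size, n))
            for i in q_ids:
                for j in c_ids:
                    if i != j:  # never pair a sentence with itself
                        score = sum(a * b for a, b in zip(vecs[i], vecs[j]))
                        candidates[i].append((score, j))
                # Keep only the running top_k per sentence after each block.
                candidates[i] = heapq.nlargest(top_k, candidates[i])
        for i, scored in candidates.items():
            for score, j in scored:
                best[(min(i, j), max(i, j))] = score
    triplets = [[score, i, j] for (i, j), score in best.items()]
    triplets.sort(key=lambda t: t[0], reverse=True)
    return triplets[:max_pairs]

# Toy embeddings: items 0 and 1 are close, items 2 and 3 are close.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = mine_pairs(embeddings, top_k=1)
for score, i, j in result:
    print("{} {} Score: {:.4f}".format(i, j, score))
```

Smaller chunks mean smaller score blocks in memory but more loop iterations, which matches the trade-off described in the parameter list.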


# Semantic Search

Use case: given a new query/question, find the relevant documents. This sounds quite similar to paraphrase mining.

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.

## Background
The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.

![](https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png)
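
The two phases described above (embed the corpus once, then embed each query and look up its nearest neighbours) can be sketched in plain Python. The 2-d vectors are hypothetical stand-ins for model output, and `search` is an illustrative helper, not a library function.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_vec, index, top_k=2):
    """Return the top_k (score, corpus_index) pairs, highest score first."""
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), idx)
              for idx, vec in enumerate(index))
    return heapq.nlargest(top_k, scored)

# Indexing phase: done once, ahead of time, for the whole corpus.
corpus_vecs = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
index = [normalize(v) for v in corpus_vecs]

# Search phase: only the query is embedded and compared at request time.
hits = search([1.0, 0.0], index)
for score, idx in hits:
    print("corpus entry", idx, "(Score: {:.4f})".format(score))
```

Because the corpus vectors are normalized once during indexing, each query costs only one dot product per corpus entry.
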
## Symmetric vs. Asymmetric Semantic Search
A critical distinction for your setup is symmetric vs. asymmetric semantic search:

- For **symmetric semantic search** your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.

- For **asymmetric semantic search**, you usually have a **short query** (like a question or some keywords) and you want to find a **longer paragraph** answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.

It is critical that you choose the right model for your type of task.

Suitable models for symmetric semantic search:

- `paraphrase-distilroberta-base-v1` / `paraphrase-xlm-r-multilingual-v1`
- `quora-distilbert-base` / `quora-distilbert-multilingual`
- `distiluse-base-multilingual-cased-v2`

Suitable models for asymmetric semantic search:

- `msmarco-distilbert-base-v2`

In [None]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating italian food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7096)
A man is eating italian food. (Score: 0.6677)
A man is eating a piece of bread. (Score: 0.6074)
A man is riding a horse. (Score: 0.3360)
A man is riding a white horse on an enclosed ground. (Score: 0.3069)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6842)
A woman is playing violin. (Score: 0.3762)
A man is riding a horse. (Score: 0.3079)
A cheetah is running behind its prey. (Score: 0.2760)
A man is eating italian food. (Score: 0.2749)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.7814)
A monkey is playing drums. (Score: 0.2824)
A man is riding a white horse on an enclosed ground. (Score: 0.2208)
A man is riding a horse. (Score: 0.2017)
A man is eating food. (Score: 0.1886)
