# Intro

## Modules

In [1]:
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader

In [2]:
from sklearn.metrics.pairwise import cosine_similarity

**The models can be found on https://www.sbert.net/ and on Hugging Face.**

## Sentences

In [3]:
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# all-MiniLM-L6-v2

The standard.

## Load

In [4]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [5]:
sentence_embeddings = model.encode(sentences)

## Calculations

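**`cosine_similarity` expects 2-D inputs of shape (n_samples, n_features), so each vector is reshaped with (1, -1) to get (1, 384). Reshaping with (-1, 1) gives a column of shape (384, 1), which is only used here to check the embedding size.**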

In [6]:
sentence_embeddings[0].reshape(-1,1).shape

(384, 1)
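
A small sketch of the shape question, under stated assumptions: random vectors stand in for the real embeddings (so nothing has to be downloaded), and the names `fake_embeddings`, `row`, `column` and `sim_matrix` are only illustrative. It shows the two reshapes side by side and that passing the whole matrix to `cosine_similarity` should give all pairwise values in one call.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for model.encode(sentences): 3 random vectors of size 384 (assumption).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 384)).astype(np.float32)

row = fake_embeddings[0].reshape(1, -1)     # shape (1, 384): one sample, 384 features
column = fake_embeddings[0].reshape(-1, 1)  # shape (384, 1): 384 samples, 1 feature

# Pairwise use, as in the cells of this notebook.
sim_01 = cosine_similarity(row, fake_embeddings[1].reshape(1, -1))[0][0]

# Whole matrix at once: entry [i, j] is the similarity between vectors i and j.
sim_matrix = cosine_similarity(fake_embeddings)

print(row.shape, column.shape, sim_matrix.shape)
```

With the real `sentence_embeddings`, `cosine_similarity(sentence_embeddings)[0, 1]` would replace the two explicit reshapes used in the next cells.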

In [7]:
simlarity = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[1].reshape(1,-1))[0][0]

In [8]:
simlarity2 = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[2].reshape(1,-1))[0][0]

## Final result

In [9]:
print(sentences[0], "::::", sentences[1], "\nSIM_LEVEL" , simlarity)

This framework generates embeddings for each input sentence :::: Sentences are passed as a list of string. 
SIM_LEVEL 0.53807926


In [10]:
print(sentences[0], "::::", sentences[2], "\nSIM_LEVEL" , simlarity2)

This framework generates embeddings for each input sentence :::: The quick brown fox jumps over the lazy dog. 
SIM_LEVEL 0.11805622


# distilbert-multilingual-nli-stsb-quora-ranking

A different model. It is probably not the right one for this task, since it is:  
1. multilingual;
2. based on question/answer data, i.e. intended more for predicting the answer to a question.

## Load

In [11]:
model = SentenceTransformer("distilbert-multilingual-nli-stsb-quora-ranking")

In [12]:
sentence_embeddings = model.encode(sentences)

## Calculations

In [13]:
sentence_embeddings[0].reshape(-1,1).shape

(768, 1)

In [14]:
simlarity = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[1].reshape(1,-1))[0][0]

In [15]:
simlarity2 = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[2].reshape(1,-1))[0][0]

## Final result

In [16]:
print(sentences[0], "::::", sentences[1], "\nSIM_LEVEL" , simlarity)

This framework generates embeddings for each input sentence :::: Sentences are passed as a list of string. 
SIM_LEVEL 0.7918707


In [17]:
print(sentences[0], "::::", sentences[2], "\nSIM_LEVEL" , simlarity2)

This framework generates embeddings for each input sentence :::: The quick brown fox jumps over the lazy dog. 
SIM_LEVEL 0.5909195


The model matters!

# all-mpnet-base-v2

According to the SentenceTransformers documentation, this should be the best-performing BERT-derived model. However, it is slower and uses more memory than "all-MiniLM-L6-v2".

## Load

In [18]:
model = SentenceTransformer("all-mpnet-base-v2")

In [19]:
sentence_embeddings = model.encode(sentences)

## Calculations

In [20]:
sentence_embeddings[0].reshape(-1,1).shape

(768, 1)

In [21]:
simlarity = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[1].reshape(1,-1))[0][0]

In [22]:
simlarity2 = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[2].reshape(1,-1))[0][0]

## Final result

In [23]:
print(sentences[0], "::::", sentences[1], "\nSIM_LEVEL" , simlarity)

This framework generates embeddings for each input sentence :::: Sentences are passed as a list of string. 
SIM_LEVEL 0.51263994


In [24]:
print(sentences[0], "::::", sentences[2], "\nSIM_LEVEL" , simlarity2)

This framework generates embeddings for each input sentence :::: The quick brown fox jumps over the lazy dog. 
SIM_LEVEL 0.09748442


# all-MiniLM-L12-v2

The 12-layer variant of the standard: better, but a little slower.

## Load

In [25]:
model = SentenceTransformer("all-MiniLM-L12-v2")

In [26]:
sentence_embeddings = model.encode(sentences)

## Calculations

In [27]:
sentence_embeddings[0].reshape(-1,1).shape

(384, 1)

In [28]:
simlarity = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[1].reshape(1,-1))[0][0]

In [29]:
simlarity2 = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[2].reshape(1,-1))[0][0]

## Final result

In [30]:
print(sentences[0], "::::", sentences[1], "\nSIM_LEVEL" , simlarity)

This framework generates embeddings for each input sentence :::: Sentences are passed as a list of string. 
SIM_LEVEL 0.4799481


In [31]:
print(sentences[0], "::::", sentences[2], "\nSIM_LEVEL" , simlarity2)

This framework generates embeddings for each input sentence :::: The quick brown fox jumps over the lazy dog. 
SIM_LEVEL -0.01075776


# all-mpnet-base-v1

The model The Man does not want us to see o.O

## Load

In [32]:
model = SentenceTransformer("all-mpnet-base-v1")

In [33]:
sentence_embeddings = model.encode(sentences)

## Calculations

In [34]:
sentence_embeddings[0].reshape(-1,1).shape

(768, 1)

In [35]:
simlarity = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[1].reshape(1,-1))[0][0]

In [36]:
simlarity2 = cosine_similarity(sentence_embeddings[0].reshape(1,-1),sentence_embeddings[2].reshape(1,-1))[0][0]

## Final result

In [37]:
print(sentences[0], "::::", sentences[1], "\nSIM_LEVEL" , simlarity)

This framework generates embeddings for each input sentence :::: Sentences are passed as a list of string. 
SIM_LEVEL 0.3705439


In [38]:
print(sentences[0], "::::", sentences[2], "\nSIM_LEVEL" , simlarity2)

This framework generates embeddings for each input sentence :::: The quick brown fox jumps over the lazy dog. 
SIM_LEVEL 0.12739469


# So ...?

So... nothing conclusive. Which one should be better? Why? What is the ground truth? How can we decide which performs best? The values [here](https://www.sbert.net/docs/pretrained_models.html) are almost meaningless and, I would humbly say, even computed incorrectly (you cannot "easily" average averages taken on samples of different sizes). So what? How do we choose? Is Puliga's choice an added value _per se_, or vice versa?
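
The point about averaging averages can be made concrete with a tiny worked example. The benchmark names and numbers below are invented purely for illustration (they are not taken from the sbert.net table): when the test sets differ in size, the plain mean of the per-set scores differs from the size-weighted mean.

```python
# Hypothetical per-benchmark scores and sample counts (illustrative values only).
benchmarks = {
    "small_set": {"score": 0.90, "n_samples": 100},
    "large_set": {"score": 0.50, "n_samples": 9900},
}

scores = [b["score"] for b in benchmarks.values()]
sizes = [b["n_samples"] for b in benchmarks.values()]

# Plain mean of the two averages: every benchmark counts the same.
unweighted = sum(scores) / len(scores)

# Size-weighted mean: every sample counts the same.
weighted = sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

print(f"unweighted mean: {unweighted:.3f}")  # 0.700
print(f"weighted mean:   {weighted:.3f}")    # 0.504
```

Neither number is "the" right one; the example only shows that the choice of weighting changes the headline figure.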

| Model  | 0 vs 1 | 0 vs 2 |
|:--------|--------|--------|
|all-MiniLM-L6-v2|0.538|0.118|
|distilbert-multilingual-nli-stsb-quora-ranking|0.792|0.591|
|all-mpnet-base-v2|0.513|0.097|
|all-MiniLM-L12-v2|0.480|-0.011|
|all-mpnet-base-v1|0.370|0.127|
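
The values in the table above were copied by hand from the cells. As a rough sketch of how the same table could be produced in one loop, the helper below encodes the sentences with each model and prints the rows. `compare_models`, `print_table`, `cosine` and the `load_model` hook are hypothetical names introduced here, not part of sentence-transformers; the hook only exists so the helper can be tried with a stub encoder.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def compare_models(model_names, sentences, load_model=None):
    """Return {model_name: [sim(0, 1), sim(0, 2)]} for the given sentences.

    `load_model` is a hypothetical hook: any callable mapping a model name
    to an object with an `encode(list_of_str)` method.
    """
    if load_model is None:
        # Imported lazily so the helper can also be used with a stub encoder.
        from sentence_transformers import SentenceTransformer
        load_model = SentenceTransformer

    results = {}
    for name in model_names:
        emb = load_model(name).encode(sentences)
        results[name] = [cosine(emb[0], emb[1]), cosine(emb[0], emb[2])]
    return results


def print_table(results):
    """Print the results as a Markdown table with the same columns as above."""
    print("| Model | 0 vs 1 | 0 vs 2 |")
    print("|:------|--------|--------|")
    for name, (s01, s02) in results.items():
        print(f"|{name}|{s01:.3f}|{s02:.3f}|")
```

A call such as `print_table(compare_models(["all-MiniLM-L6-v2", "all-mpnet-base-v2"], sentences))` would download the models on first use.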


In [44]:
perf_models = {
    'minil6': [0.538, 0.118],
    'dist': [0.792, 0.591],
    'mpnetv2': [0.513, 0.097],
    'minil12': [0.480, -0.011],
    'mpnetv1': [0.370, 0.127],
}

If we take (why?) *all-mpnet-base-v2* as the ground truth, then the relative errors made by the other models are:

In [40]:
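def rel_err(value, realv):
    """Relative error of `value` with respect to the reference `realv`."""
    # abs() on the denominator keeps the result non-negative for a negative reference.
    return abs(value - realv) / abs(realv)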

In [55]:
for i in range(2):
    print('---------------------------')
    for key in perf_models.keys():
        if key != 'mpnetv2':
            re = rel_err(perf_models[key][i], perf_models['mpnetv2'][i])
            # Column i holds the similarity of sentence 0 with sentence i + 1.
            print('{:7}) 0 vs {:} RE={:.3f}'.format(key, i + 1, re))

---------------------------
minil6 ) 0 vs 1 RE=0.049
dist   ) 0 vs 1 RE=0.544
minil12) 0 vs 1 RE=0.064
mpnetv1) 0 vs 1 RE=0.279
---------------------------
minil6 ) 0 vs 2 RE=0.216
dist   ) 0 vs 2 RE=5.093
minil12) 0 vs 2 RE=1.113
mpnetv1) 0 vs 2 RE=0.309
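
Relative errors grow quickly when the reference value is small: in the second block the reference is 0.097, which is why `dist` reaches RE=5.093. As a hedged cross-check (an addition, not part of the original analysis), the sketch below ranks the same models by the sum of absolute differences from *all-mpnet-base-v2*, reusing the rounded values of `perf_models`. The names `abs_err` and `ranking` are illustrative.

```python
# Same rounded similarities as in perf_models above.
perf_models = {
    'minil6': [0.538, 0.118],
    'dist': [0.792, 0.591],
    'mpnetv2': [0.513, 0.097],
    'minil12': [0.480, -0.011],
    'mpnetv1': [0.370, 0.127],
}

reference = perf_models['mpnetv2']

# Sum of absolute differences from the reference over both sentence pairs.
abs_err = {
    key: sum(abs(v - r) for v, r in zip(values, reference))
    for key, values in perf_models.items()
    if key != 'mpnetv2'
}

# Smallest total deviation first.
ranking = sorted(abs_err, key=abs_err.get)
for key in ranking:
    print('{:7}) total abs. error={:.3f}'.format(key, abs_err[key]))
```

With these numbers the ordering is minil6, minil12, mpnetv1, dist, so the model closest to the reference is the same under both measures.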


all-MiniLM-L6-v2 appears to be the closest to the benchmark. It is worth considering, since it is also particularly fast and lightweight.