# 7CCMFM18 Machine Learning
King's College London <br>
Academic year 2022-2023 <br>
Lecturer: Mario Martone

## NLP: sentence transformer
First version: <i>24th March 2023</i>

You will need to install: 

1. SentenceTransformer.
2. NLTK
3. Sklearn
4. english-words

First let's load all our libraries:

In [1]:
from sentence_transformers import SentenceTransformer
import scipy.sparse.linalg
from english_words import english_words_alpha_set
import sklearn
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

Then load our model (you can try comparing different models!):

In [2]:
model_name = 'all-distilroberta-v1'
model = SentenceTransformer(model_name)

Now let's embed two simple sentences:

In [3]:
sent1=model.encode('Mario studies hard')
sent2=model.encode('Mario goes to school')

Let's compute the similarity score:

In [4]:
sklearn.metrics.pairwise.cosine_similarity([sent1,sent2])

array([[1.        , 0.76597095],
       [0.76597095, 0.99999994]], dtype=float32)

The off-diagonal terms are the one that matter and they are indeed close to 1! Now let's try to see what happens with a sentence which has a different meaning:

In [5]:
sent3=model.encode('Tomorrow the sky will be blue')
sklearn.metrics.pairwise.cosine_similarity([sent1,sent2,sent3])

array([[1.        , 0.76597095, 0.0752472 ],
       [0.76597095, 0.99999994, 0.14101554],
       [0.0752472 , 0.14101554, 0.99999976]], dtype=float32)

And you can see that the similarity of sentence 3 with sentence 1 and 2 is far smaller.

### Look for similarities:

Now let's play with the large English vocabulary 

In [6]:
len(list(english_words_alpha_set))

25474

In [7]:
list(english_words_alpha_set)[:10]

['gneiss',
 'Dobbs',
 'boil',
 'dogwood',
 'Ainu',
 'parsimonious',
 'Coriolanus',
 'jenny',
 'Sicily',
 'gravid']

And let's check the words that have the highest similarity with the word football:

In [9]:
similarity={}
encode_football=model.encode('football')

for word in list(english_words_alpha_set):
    word_encode=model.encode(word)
    similarity[word]=sklearn.metrics.pairwise.cosine_similarity([encode_football,word_encode])[0,1]

And now let's check the words that are the most similar with football:

In [10]:
sorted(similarity.items(), key=lambda item: item[1],reverse=True)

[('football', 0.99999976),
 ('soccer', 0.86900795),
 ('basketball', 0.81611115),
 ('sport', 0.70091105),
 ('hockey', 0.6746446),
 ('cricket', 0.67358565),
 ('baseball', 0.670014),
 ('volleyball', 0.6460309),
 ('softball', 0.6318909),
 ('sportswriting', 0.6087425),
 ('ball', 0.60338104),
 ('volley', 0.58906734),
 ('tennis', 0.5827517),
 ('lacrosse', 0.5714959),
 ('athlete', 0.57062495),
 ('golf', 0.5705202),
 ('chess', 0.5621939),
 ('athletic', 0.55184615),
 ('snowball', 0.5504932),
 ('sportsmen', 0.54786897),
 ('badminton', 0.54423267),
 ('ballet', 0.5431351),
 ('pong', 0.5393827),
 ('madden', 0.5365356),
 ('knuckleball', 0.53595364),
 ('polo', 0.52711654),
 ('jockey', 0.5201056),
 ('sporty', 0.5108114),
 ('television', 0.50035995),
 ('karate', 0.499893),
 ('sportswriter', 0.4973939),
 ('skate', 0.4969545),
 ('quarterback', 0.49339512),
 ('sportsman', 0.49299866),
 ('touchdown', 0.4866844),
 ('Olympic', 0.4849062),
 ('sadden', 0.48392475),
 ('hooligan', 0.48306835),
 ('wrestle', 0.4815

### Now fine-tune the model:

We can notice that these two words have unusually high/low similarity score:

In [13]:
stadium_emb=model.encode('stadium')
ballet_emb=model.encode('ballet')

print(sklearn.metrics.pairwise.cosine_similarity([encode_football,stadium_emb])[0,1])
print(sklearn.metrics.pairwise.cosine_similarity([encode_football,ballet_emb])[0,1])

0.4538225
0.5431351


So let's fine tune the model to increase the similarity of stadium and decrease that of ballet:

In [14]:
#Define the model. Either from scratch of by loading a pre-trained model
model = SentenceTransformer('all-distilroberta-v1')

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['football', 'ballet'], label=0.1),
    InputExample(texts=['football', 'stadium'], label=0.7)]

#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10, warmup_steps=100)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Now let's check how the similarity of these words changed:

In [15]:
stadium_emb=model.encode('stadium')
ballet_emb=model.encode('ballet')

print(sklearn.metrics.pairwise.cosine_similarity([uni_encode,stadium_emb])[0,1])
print(sklearn.metrics.pairwise.cosine_similarity([uni_encode,ballet_emb])[0,1])

0.48246533
0.45706216
