## Introduction to Sentence Embeddings
### from https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/

In [35]:
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

# The sentence-embedding model is loaded in a later cell, after the tokenizers are compared.
# model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenizer
tokenizer_bpe = AutoTokenizer.from_pretrained("openai-gpt")  # BPE tokenizer


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [24]:
text = "this is about tokenization!"
tokenizer_bert.tokenize(text, add_special_tokens=True)


['[CLS]', 'this', 'is', 'about', 'token', '##ization', '!', '[SEP]']

In [25]:
text = "this is about tokenization!"
tokenizer_bpe.tokenize(text, add_special_tokens=True)

['this</w>', 'is</w>', 'about</w>', 'to', 'ken', 'ization</w>', '!</w>']
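
The `##` pieces in the BERT output come from WordPiece, which splits each word by looking up the longest matching piece in its vocabulary, then continuing with the rest of the word. The sketch below illustrates that greedy lookup. The helper `wordpiece_split` and the tiny `toy_vocab` are hypothetical, chosen only to reproduce splits seen in this notebook; the real tokenizer also normalizes text, splits punctuation, and uses a far larger vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece-style pieces (greedy longest match)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real bert-base-uncased vocabulary).
toy_vocab = {"this", "is", "about", "token", "##ization", "im", "##t"}

print(wordpiece_split("tokenization", toy_vocab))
print(wordpiece_split("imt", toy_vocab))
print(wordpiece_split("cats", toy_vocab))
```

With this toy vocabulary, `tokenization` splits into `token` and `##ization`, while a word with no matching pieces falls back to the unknown token.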

In [23]:
text = "IMT Atlantique is undergoing a reform of the first year teaching program"
tokenizer_bpe.tokenize(text, add_special_tokens=False)


['im',
 't</w>',
 'atlan',
 'ti',
 'que</w>',
 'is</w>',
 'undergoing</w>',
 'a</w>',
 'reform</w>',
 'of</w>',
 'the</w>',
 'first</w>',
 'year</w>',
 'teaching</w>',
 'program</w>']
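
In the `openai-gpt` output, `</w>` marks the last piece of a word, so a piece without it continues into the next piece. As a minimal sketch of how to read such output, the hypothetical helper `merge_bpe_pieces` below regroups pieces into whole words. It is an illustration only, not part of the `transformers` API.

```python
def merge_bpe_pieces(pieces, end_marker="</w>"):
    """Regroup BPE pieces into words, using the end-of-word marker as the boundary."""
    words = []
    current = ""
    for piece in pieces:
        if piece.endswith(end_marker):
            # This piece closes the current word: strip the marker and emit the word.
            words.append(current + piece[: -len(end_marker)])
            current = ""
        else:
            # No marker: the word continues in the next piece.
            current += piece
    if current:
        words.append(current)  # trailing pieces without a closing marker
    return words


# Pieces copied from the output above (first seven tokens).
pieces = ["im", "t</w>", "atlan", "ti", "que</w>", "is</w>", "undergoing</w>"]
print(merge_bpe_pieces(pieces))
# ['imt', 'atlantique', 'is', 'undergoing']
```

The recovered words are lowercase because the tokenizer output above is lowercase.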

In [19]:
text = "The IMT Atlantique engineering school is undergoing a reform of the first year teaching"
tokenizer_bert.tokenize(text, add_special_tokens=True)

['[CLS]',
 'the',
 'im',
 '##t',
 'at',
 '##lan',
 '##tique',
 'engineering',
 'school',
 'is',
 'undergoing',
 'a',
 'reform',
 'of',
 'the',
 'first',
 'year',
 'teaching',
 '[SEP]']

In [36]:
from sentence_transformers import util
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


In [43]:
sentences = ["IMT is an engineering school", "First year students are many", "Cats are cute"]
embeddings = model.encode(sentences)
embeddings.shape

(3, 384)

In [44]:
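# Embed a query sentence and compare it with each of the three sentences above.
query_embedding = model.encode("The python programming course is very good")
for embedding, sentence in zip(embeddings, sentences):
    # Cosine similarity is returned as a 1x1 tensor; higher means more similar.
    similarity = util.pytorch_cos_sim(query_embedding, embedding)
    print(similarity, sentence)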

tensor([[0.1887]]) IMT is an engineering school
tensor([[0.2232]]) First year students are many
tensor([[0.1648]]) Cats are cute
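
The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths, so only direction matters and the result lies between -1 and 1. The plain-Python sketch below shows the formula on made-up 2-dimensional vectors. `cosine_similarity` is a hypothetical helper written for illustration, not the library implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen only to show the three typical cases.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))    # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))    # orthogonal, close to 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction, close to -1
```

Scaling a vector does not change the score, which is why embeddings of different magnitudes can still be compared.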


In [75]:
model_bert = AutoModel.from_pretrained("bert-base-uncased")
text = "The IMT director and the professors are happy about the first year teaching reform."
tokenizer_bert.tokenize(text, add_special_tokens=True)

['[CLS]',
 'the',
 'im',
 '##t',
 'director',
 'and',
 'the',
 'professors',
 'are',
 'happy',
 'about',
 'the',
 'first',
 'year',
 'teaching',
 'reform',
 '.',
 '[SEP]']

In [90]:
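text = "The IMT director and the professors are happy about the first year teaching reform."

# model_bert was already loaded in an earlier cell; reloading it here is redundant but harmless.
model_bert = AutoModel.from_pretrained("bert-base-uncased")
encoded_input = tokenizer_bert(text, return_tensors="pt")
output = model_bert(**encoded_input)
# (batch size, number of tokens, hidden size): one vector per token
output["last_hidden_state"].shape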

Shape of embeddings:

torch.Size([1, 18, 768])
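
The BERT output holds one 768-dimensional vector per token (18 for this sentence), whereas `model.encode` earlier returned a single vector per sentence. A common way to go from the first to the second is mean pooling: averaging the token vectors while skipping padding positions. The sketch below shows the idea on made-up numbers with plain lists. `mean_pool` is a hypothetical helper, not the pooling code used by `sentence_transformers`.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Made-up 2-dimensional "token embeddings"; the third position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))
# [2.0, 3.0]
```

The padded position is ignored, so its large values do not distort the sentence vector.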

In [91]:
prof_embedding = output["last_hidden_state"][0][7]  # index 7 is the token 'professors'
teaching_embedding = output["last_hidden_state"][0][14]  # index 14 is the token 'teaching'
happy_embedding = output["last_hidden_state"][0][9]  # index 9 is the token 'happy'

print(f"Cos Similarity between PROFESSORS and TEACHING {util.pytorch_cos_sim(prof_embedding, teaching_embedding)[0][0]}")

print(f"Cos Similarity between PROFESSORS and HAPPY  {util.pytorch_cos_sim(prof_embedding, happy_embedding)[0][0]}")

Cos Similarity between PROFESSORS and TEACHING 0.6519165635108948
Cos Similarity between PROFESSORS and HAPPY  0.36290210485458374
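
The indices 7, 14 and 9 used above were read off the token list by hand, counting `[CLS]` as position 0. As a small sketch, they can also be looked up from the tokens themselves. The list below is copied from the tokenizer output earlier in this notebook, and `token_positions` is a hypothetical helper.

```python
# Token list copied from the tokenizer_bert output for this sentence.
tokens = ['[CLS]', 'the', 'im', '##t', 'director', 'and', 'the', 'professors',
          'are', 'happy', 'about', 'the', 'first', 'year', 'teaching', 'reform',
          '.', '[SEP]']


def token_positions(tokens, wanted):
    """Map each wanted token to the index of its first occurrence."""
    return {word: tokens.index(word) for word in wanted}


print(token_positions(tokens, ["professors", "teaching", "happy"]))
# {'professors': 7, 'teaching': 14, 'happy': 9}
```

`list.index` raises `ValueError` when a word is missing, which happens if the tokenizer split it into several pieces (as with `im` and `##t` here).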


In [92]:
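text = "The angry and unhappy professors"
encoded_input = tokenizer_bert(text, return_tensors="pt")
output = model_bert(**encoded_input)
# Index 5 is expected to be 'professors', assuming each word maps to a single token:
# [CLS] the angry and unhappy professors [SEP]
prof_embedding_2 = output["last_hidden_state"][0][5]
# The same word gets a different vector in a different sentence, so the similarity is below 1.
print(f"Cos Similarity between two PROFESSORS embeddings {util.pytorch_cos_sim(prof_embedding_2,prof_embedding)[0][0]}")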


Cos Similarity between two PROFESSORS embeddings 0.4150097370147705
