**Semantic Similarity Calculator**

goal: build a tool that measures how similar two sentences are in meaning

tech: Sentence-transformers, Cosine Similarity

date: 1st Feb, 2026

what I learned:
- embeddings convert text into vectors
- similar meanings -> vectors point in similar directions
- cosine similarity measures the angle between vectors (1 = same direction, 0 = unrelated, negative = opposite directions)
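
To make the last point concrete, here is a minimal sketch of cosine similarity written by hand. It uses toy 2-D vectors rather than real embeddings, and the helper name `cosine` is made up for illustration. It also shows that the raw formula can go negative when vectors point in opposite directions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1.0, 2.0], [2.0, 4.0])         # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # 90 degrees apart
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # pointing opposite ways

print(round(same, 3), round(orthogonal, 3), round(opposite, 3))
```

Vector length does not matter here: only the direction affects the score.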

In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [2]:
# load the pretrained model (downloaded on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')


## What are Embeddings?

Embeddings are numerical representations of text. Each word/sentence becomes a vector (array of numbers).

This is what an embedding looks like:

In [3]:
sentence = 'I like cats'
embedding = model.encode(sentence)

In [4]:
print(f'original sentence: {sentence}')
print(f'embedding shape: {embedding.shape}')
print(f"First 10 values: {embedding[:10]}")
print(f'this means 1 sentence ->  {len(embedding)} numbers')

original sentence: I like cats
embedding shape: (384,)
First 10 values: [ 0.00601428 -0.01928949  0.0329517   0.01040479 -0.07798284 -0.01927789
  0.10223327  0.0097876   0.04977399  0.06472164]
this means 1 sentence ->  384 numbers


In [6]:
def semantic_similarity(sentence1, sentence2, model):
    """
    Calculate semantic similarity between two sentences.

    Args:
        sentence1 (str): First sentence
        sentence2 (str): Second sentence
        model: SentenceTransformer model

    Returns:
        float: Cosine similarity in [-1, 1] (1 = same meaning, near 0 = unrelated)
    """
    # Encode both sentences in one batch; the result has shape (2, embedding_dim)
    embeddings = model.encode([sentence1, sentence2])
    # cosine_similarity expects 2-D inputs and returns a 1x1 matrix here
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return float(similarity)
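
When more than two sentences need comparing, the same computation can be done for every pair at once. The sketch below assumes the embeddings are already stacked in a 2-D array, one row per sentence. The tiny 3-D vectors are invented stand-ins for real `model.encode(...)` output, and `pairwise_cosine` is a hypothetical helper, not part of either library.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return an (n, n) matrix of cosine similarities between rows."""
    embeddings = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

# Invented vectors standing in for sentence embeddings.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])

sim_matrix = pairwise_cosine(toy_embeddings)
print(sim_matrix.round(3))
```

The diagonal is always 1 because each row is compared with itself. Passing the full array to `sklearn.metrics.pairwise.cosine_similarity` gives the same matrix.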

## Testing the Calculator

Let's test with different types of sentence pairs:
1. Paraphrases (should be HIGH ~0.7-0.9)
2. Related topics (should be MODERATE ~0.4-0.6)
3. Unrelated topics (should be LOW ~0.1-0.3)

In [9]:
# Test 1: Paraphrases
sent1 = "The cat sits on the mat"
sent2 = "A feline rests on a rug"
score1 = semantic_similarity(sent1, sent2, model)

print("Test 1: Paraphrases")
print(f"  '{sent1}'")
print(f"  '{sent2}'")
print(f"  Similarity: {score1:.3f}\n")

Test 1: Paraphrases
  'The cat sits on the mat'
  'A feline rests on a rug'
  Similarity: 0.557



In [10]:
# Test 2: Related but different
sent3 = "I love programming"
sent4 = "Coding is fun"
score2 = semantic_similarity(sent3, sent4, model)

print("Test 2: Related topics")
print(f"  '{sent3}'")
print(f"  '{sent4}'")
print(f"  Similarity: {score2:.3f}\n")


Test 2: Related topics
  'I love programming'
  'Coding is fun'
  Similarity: 0.664



In [11]:
# Test 3: Unrelated
sent5 = "The weather is sunny"
sent6 = "I ate pizza for lunch"
score3 = semantic_similarity(sent5, sent6, model)

print("Test 3: Unrelated topics")
print(f"  '{sent5}'")
print(f"  '{sent6}'")
print(f"  Similarity: {score3:.3f} \n")

Test 3: Unrelated topics
  'The weather is sunny'
  'I ate pizza for lunch'
  Similarity: 0.103 



In [12]:
# Test 4: Identical
sent7 = "Machine learning is awesome"
score4 = semantic_similarity(sent7, sent7, model)

print("Test 4: Identical sentences")
print(f"  '{sent7}'")
print(f"  Similarity: {score4:.3f}")

Test 4: Identical sentences
  'Machine learning is awesome'
  Similarity: 1.000
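
The scores recorded above can be lined up against the rough bands from the test plan. This is a small bookkeeping sketch: the scores are copied from the outputs of Tests 1-3, and the dictionary layout and variable names are just one way to hold them.

```python
# Scores copied from the recorded outputs of Tests 1-3.
recorded = {
    "paraphrases": 0.557,
    "related topics": 0.664,
    "unrelated topics": 0.103,
}

# Rough expected bands from the test plan: (low, high).
expected = {
    "paraphrases": (0.7, 0.9),
    "related topics": (0.4, 0.6),
    "unrelated topics": (0.1, 0.3),
}

in_band = {}
for name, score in recorded.items():
    low, high = expected[name]
    in_band[name] = low <= score <= high
    verdict = "inside" if in_band[name] else "outside"
    print(f"{name}: {score:.3f} is {verdict} the expected {low}-{high} band")
```

Only the unrelated pair lands in its band. The paraphrase pair scores lower than the related-topic pair, so the bands are better read as loose guides than as thresholds.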


## Key Takeaways

1. **Embeddings** turn text into vectors (numbers)
2. **Cosine similarity** measures the angle between vectors
3. Cosine similarity can range from -1 to 1; the scores here all fell between 0.1 and 1.0, and only identical sentences reached 1
4. This is the foundation for RAG, semantic search, and recommendation systems
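
As a rough illustration of the last point, semantic search boils down to ranking stored vectors by cosine similarity to a query vector. The sketch below uses invented 3-D vectors in place of real embeddings (in practice they would come from `model.encode`), and `rank_by_similarity` is a hypothetical helper name.

```python
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, plus scores."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of the query against every document row.
    scores = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-scores)  # descending by score
    return order, scores

# Invented vectors; real ones would be 384-dimensional model outputs.
docs = np.array([
    [0.0, 1.0, 0.0],   # doc 0: unrelated direction
    [0.9, 0.1, 0.0],   # doc 1: close to the query
    [0.5, 0.5, 0.0],   # doc 2: partly related
])
query = np.array([1.0, 0.0, 0.0])

order, scores = rank_by_similarity(query, docs)
print(order.tolist())
```

A retrieval step in a RAG pipeline would typically keep only the first few indices in `order` and pass the matching documents on.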

---

**Next Steps:** Use this to build RAG applications (retrieving relevant documents).