Steps:


1. The Hugging Face AutoTokenizer tokenizes the input texts (an English query and an Armenian passage).
2. The AutoModel produces token embeddings for each text. The average_pool function then averages them, ignoring padding tokens, to get a single sentence-level embedding.
3. The embeddings are normalized to unit vectors (length 1), so that the similarity score reflects only the direction of the embeddings and not their magnitude.
4. The similarity score is the cosine similarity between the English and Armenian embeddings, multiplied by 100. Cosine similarity ranges from -1 to 1, so the scaled score can in principle fall between -100 and 100. Higher values mean the two sentences are closer in meaning.
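
Steps 3 and 4 rely on the fact that the dot product of two unit vectors equals their cosine similarity. The following is a minimal pure-Python sketch of that idea using made-up 2-D vectors. The helper names `l2_normalize` and `dot` are illustrative assumptions and are not part of the notebook, which uses `F.normalize` and `@` on tensors instead.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors, not real model embeddings
a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])    # same direction as a, different magnitude
c = l2_normalize([-3.0, -4.0])  # opposite direction to a

same_direction = dot(a, b)      # close to 1.0: magnitude no longer matters
opposite_direction = dot(a, c)  # close to -1.0: cosine similarity can be negative
print(same_direction, opposite_direction)
```

After normalization, `a` and `b` are effectively the same vector even though their original lengths differ, which is why the score depends on direction only.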

In [1]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model for Armenian embeddings
tokenizer = AutoTokenizer.from_pretrained('Metric-AI/armenian-text-embeddings-1')
model = AutoModel.from_pretrained('Metric-AI/armenian-text-embeddings-1')

# Function to average pool the last hidden states to get sentence-level embeddings
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
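
To see what the masked average does, here is a small pure-Python sketch of the same idea on a hand-made sequence. The function name `masked_mean` and the toy values are assumptions for illustration only; the notebook's `average_pool` does the equivalent on batched tensors.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
    count = sum(attention_mask)  # number of real (non-padding) tokens
    return [t / count for t in total]

# Hypothetical sequence: two real tokens followed by one padding token
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, because it is skipped in the sum and the divisor counts only the unmasked tokens.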

In [2]:
# Example input sentences (English query and Armenian translation as passage)
input_texts = [
    'query: Can I come to your house?',  # Example query in English
    'passage: Կարո՞ղ եմ գալ ձեր տուն։'  # Translation in Armenian
]

In [3]:
# Tokenize the input texts for the model
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the model outputs for the input texts
outputs = model(**batch_dict)

# Extract sentence embeddings by averaging the last hidden states
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize the embeddings (to unit vectors) before calculating similarity
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate cosine similarity between the English and Armenian embeddings
# embeddings[0] = English query, embeddings[1] = Armenian passage
similarity_score = (embeddings[0] @ embeddings[1]) * 100  # both are 1-D, so no transpose is needed


* The program prints the cosine similarity multiplied by 100 and formatted as a percentage.
* A higher score indicates that the Armenian passage is semantically closer to the English query, which suggests a more faithful translation.

In [4]:
# Output the similarity score (cosine similarity scaled by 100)
print(f"Cosine Similarity Score: {similarity_score.item():.2f}%")

Cosine Similarity Score: 91.08%
