# Sentence Similarity
Sentence similarity is computed by converting sentences into high-dimensional vector representations and then measuring how close those vectors are.

This is typically done with pre-trained models that generate sentence embeddings. The similarity between two sentences is then the cosine similarity, or another distance metric, between their embeddings.
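
As a minimal sketch of the metric itself: cosine similarity is the dot product of two vectors divided by the product of their lengths. The toy 3-dimensional vectors below are made up for illustration and were not produced by any model; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by any model
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so similarity is close to 1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1, so similarity is close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Because only the angle between the vectors matters, scaling a vector (as with `v2`) does not change the score.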

## Single Sentence Example

In [1]:
from sentence_transformers import SentenceTransformer, util



In [2]:
# Load a pre-trained sentence-transformer model
model_miniLM = SentenceTransformer('paraphrase-MiniLM-L6-v2')



In [3]:
# Intent questions for each category
intents = {
    1: ["Why is action A not used in the plan?"],
    2: ["Why is action A used in the plan?"],
    3: ["Why is action A used rather than action B?"]
}

# Encode example questions
intent_embeddings_miniLM = {k: model_miniLM.encode(v) for k, v in intents.items()}

In [4]:
def categorize_question_similarity(question, model, intent_embeddings):
    question_embedding = model.encode(question)  # Encode the input question

    # Compute similarity scores
    scores = {k: util.pytorch_cos_sim(question_embedding, v).item() for k, v in intent_embeddings.items()}
    
    # Find the category with the highest similarity score
    return max(scores, key=scores.get)

In [5]:
question = "Why is action A used rather than action B?"
category = categorize_question_similarity(question, model_miniLM, intent_embeddings_miniLM)
print(f"The question belongs to category: {category}")

The question belongs to category: 3
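
The `.item()` call in `categorize_question_similarity` assumes each intent has exactly one example question, since it only works on a single-element tensor. If an intent held several examples, one possible approach (an assumption, not something the notebook does) is to score each intent by its best-matching example. The sketch below shows only that aggregation step, using hypothetical 2-D unit vectors in place of real embeddings; for unit vectors the dot product equals the cosine similarity.

```python
def best_intent(question_vec, intent_vecs):
    """Return (label, scores), scoring each intent by its closest example.

    Assumes every vector is unit length, so dot product == cosine similarity.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {
        intent: max(dot(question_vec, example) for example in examples)
        for intent, examples in intent_vecs.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical unit vectors standing in for sentence embeddings
toy_intents = {
    1: [(1.0, 0.0)],
    2: [(0.0, 1.0), (0.6, 0.8)],  # two examples for the same intent
}
label, scores = best_intent((0.8, 0.6), toy_intents)
print(label, scores)
```

Here the question vector is closer to the second example of intent 2 than to the single example of intent 1, so intent 2 is selected.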



## Evaluate Different Models on the Dataset

In [6]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv('./data/intent_classification_dataset.csv')
print(f"Number of rows in the dataset: {df.shape[0]}")
df.head()

Number of rows in the dataset: 107


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Why is action A not included in the project ro... | 1 |
| 1 | What are the reasons for excluding action A fr... | 1 |
| 2 | Why was action A omitted from the strategy? | 1 |
| 3 | Why didn't we consider action A for the projec... | 1 |
| 4 | Why was action A left out of the final plan? | 1 |


In [9]:
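def text_similarity_from_model(model_name, df):
    """Classify every row of df with the given model; return (report, accuracy)."""
    model = SentenceTransformer(model_name)
    # Re-encode the example questions from the global `intents` dict with this model
    intent_embeddings = {k: model.encode(v) for k, v in intents.items()}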

    # Apply the classification function to the dataset
    df['predicted_label'] = df['text'].apply(categorize_question_similarity, model=model, intent_embeddings=intent_embeddings)
    
    # Evaluate the results
    y_true = df['label']
    y_pred = df['predicted_label']
    
    return classification_report(y_true, y_pred), accuracy_score(y_true, y_pred)

In [10]:
class_report, accuracy = text_similarity_from_model("paraphrase-MiniLM-L6-v2", df)
print(class_report)
print(f"Accuracy: {accuracy:.2f}")



              precision    recall  f1-score   support

           1       0.89      0.69      0.78        36
           2       0.55      1.00      0.71        35
           3       1.00      0.42      0.59        36

    accuracy                           0.70       107
   macro avg       0.81      0.70      0.69       107
weighted avg       0.82      0.70      0.69       107

Accuracy: 0.70
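
The f1-score column is the harmonic mean of precision and recall. This explains why category 2 reaches only 0.71 despite perfect recall: its low precision (0.55) means many questions from other categories were assigned to it. A small sketch that recomputes the column from the rounded precision and recall values printed above:

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the paraphrase-MiniLM-L6-v2 report
report = {1: (0.89, 0.69), 2: (0.55, 1.00), 3: (1.00, 0.42)}
f1_by_label = {label: round(f1_score_from(p, r), 2) for label, (p, r) in report.items()}
print(f1_by_label)  # {1: 0.78, 2: 0.71, 3: 0.59}
```
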


In [12]:
class_report, accuracy = text_similarity_from_model("all-MiniLM-L12-v2", df)
print(class_report)
print(f"Accuracy: {accuracy:.2f}")



              precision    recall  f1-score   support

           1       0.68      0.94      0.79        36
           2       0.76      0.71      0.74        35
           3       0.96      0.64      0.77        36

    accuracy                           0.77       107
   macro avg       0.80      0.77      0.76       107
weighted avg       0.80      0.77      0.76       107

Accuracy: 0.77
