# Sentence Similarity
Sentence similarity is measured by converting sentences into high-dimensional vector representations (embeddings) and then computing the similarity between those vectors.

A pre-trained model typically generates the sentence embeddings, which are then compared using cosine similarity or another distance metric.
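
As a minimal sketch of the metric itself: cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_similarity` and the toy three-dimensional vectors below are illustrative assumptions, not part of this notebook; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings (hypothetical values).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Scores close to 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and scores near -1 mean they point in opposite directions. The measure ignores vector length, so only direction matters.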

In [2]:
from sentence_transformers import SentenceTransformer, util

In [3]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')



In [4]:
# Intent questions for each category
intents = {
    1: ["Why is action A not used in the plan?"],
    2: ["Why is action A used in the plan?"],
    3: ["Why is action A used rather than action B?"]
}

# Encode example questions
intent_embeddings = {k: model.encode(v) for k, v in intents.items()}

In [5]:
def categorize_question_similarity(question):
    """Return the intent category whose example questions are most similar to `question`."""
    question_embedding = model.encode(question)  # Encode the input question

    # Cosine similarity against each category's examples; keep the best match per category
    scores = {
        k: util.pytorch_cos_sim(question_embedding, v).max().item()
        for k, v in intent_embeddings.items()
    }

    # Find the category with the highest similarity score
    return max(scores, key=scores.get)
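
The function above scores each intent against its example embeddings and returns the key with the highest score. The same selection logic can be sketched on toy vectors so that it runs without downloading a model. The names `nearest_intent` and `toy_intents`, and all vector values, are hypothetical; here an intent with several examples is scored by its best-matching example, which is one possible choice among others (such as averaging).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_intent(query_vec, intent_vecs):
    """Return (best intent key, per-intent scores) for a query vector."""
    scores = {
        # Score an intent by its most similar example vector.
        key: max(cosine_similarity(query_vec, example) for example in examples)
        for key, examples in intent_vecs.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 2-D "embeddings"; intent 1 has two examples, the others one each.
toy_intents = {
    1: [[1.0, 0.0], [0.9, 0.1]],
    2: [[0.0, 1.0]],
    3: [[0.7, 0.7]],
}
best, scores = nearest_intent([0.6, 0.8], toy_intents)
print(best, scores)
```

Because the result is always the highest-scoring intent, every question is assigned to some category, even when none of the scores is high.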

In [6]:
question = "Why is action A used rather than action B?"
category = categorize_question_similarity(question)
print(f"The question belongs to category: {category}")

The question belongs to category: 3



## Evaluate the Model on the Dataset

In [10]:
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv('./data/intent_classification_dataset.csv')
print(f"Number of rows in the dataset: {df.shape[0]}")
df.head()

Number of rows in the dataset: 107


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Why is action A not included in the project ro... | 1 |
| 1 | What are the reasons for excluding action A fr... | 1 |
| 2 | Why was action A omitted from the strategy? | 1 |
| 3 | Why didn't we consider action A for the projec... | 1 |
| 4 | Why was action A left out of the final plan? | 1 |


In [8]:
# Apply the classification function to the dataset
df['predicted_label'] = df['text'].apply(categorize_question_similarity)

# Evaluate the results
y_true = df['label']
y_pred = df['predicted_label']

In [9]:
# Print classification report
print(classification_report(y_true, y_pred))
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")

              precision    recall  f1-score   support

           1       0.89      0.69      0.78        36
           2       0.55      1.00      0.71        35
           3       1.00      0.42      0.59        36

    accuracy                           0.70       107
   macro avg       0.81      0.70      0.69       107
weighted avg       0.82      0.70      0.69       107

Accuracy: 0.70
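
In the report above, class 2 has a recall of 1.00 but a precision of 0.55, which suggests that questions from the other classes are being assigned to class 2. A confusion table makes this kind of pattern visible by counting each (true label, predicted label) pair. The sketch below uses hypothetical toy labels, not this notebook's data, and the helper name `confusion_table` is an assumption; in the notebook itself, `confusion_matrix` from `sklearn.metrics` applied to `y_true` and `y_pred` should give the equivalent counts.

```python
from collections import Counter

def confusion_table(y_true, y_pred, labels):
    """Return rows of counts: rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

# Hypothetical toy labels for illustration only; not the notebook's data.
toy_true = [1, 1, 1, 2, 2, 3, 3, 3]
toy_pred = [1, 2, 1, 2, 2, 2, 3, 2]
labels = [1, 2, 3]
table = confusion_table(toy_true, toy_pred, labels)

# Print one row per true label, one column per predicted label.
print("true/pred", *labels)
for label, row in zip(labels, table):
    print(f"{label:>9}", *row)
```

Off-diagonal counts in a column show which true classes are being mistaken for that column's predicted class.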
