1. **Relevance** - Does the chatbot’s response align with the conversation context?
    - Metric: Semantic similarity between the user prompt and the chatbot response, computed as the cosine similarity of their embeddings. Higher similarity = Higher relevance score.
    - Implementation: Use sentence embeddings from sentence-transformers/all-mpnet-base-v2 (backup: sentence-transformers/all-MiniLM-L6-v2) and compute the cosine similarity between the query and response embeddings.

2. **Coherence** - Is the conversation logically structured?
    - Metric: Sentence Embedding Similarity (Transformer-Based) measuring semantic coherence between consecutive bot responses in a conversation.
    - Implementation: Use sentence-transformers/all-mpnet-base-v2.

3. **Factual Accuracy** - Are the chatbot’s statements correct and verifiable?
    - Metric: Either Question Answering (QA) models that fact-check the response against a knowledge base, or Natural Language Inference (NLI) models that determine whether the response contradicts factual statements. Higher entailment probability = Higher factual accuracy score.
    - Implementation: Employ an NLI model such as facebook/bart-large-mnli to verify facts.

4. **Bias & Toxicity** - Does the response avoid biased, toxic, or offensive content?
    - Metric: Toxicity classification: Score toxic and biased phrases in chatbot output. Lower toxicity = Higher score.
    - Implementation: Use models like unitary/toxic-bert (also available through the detoxify package) to detect toxic language.

5. **Fluency** - Are responses grammatically correct and readable?
    - Metric: Grammaticality score using a fluency-checking model.
    - Implementation: Utilize models like textattack/roberta-base-CoLA to check grammatical correctness directly.

6. **Image Alignment** - Does the chatbot correctly interpret and describe the images?
    - Metric: Vision-language similarity: Measures how well the response text aligns with the given image. Higher alignment score = Better rating.
    - Implementation: Use models like openai/clip-vit-base-patch32 (Multimodal model for vision-text matching).

7. **Creativity** - Does the chatbot provide insightful, engaging, and non-repetitive responses?
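    - Metric: Lexical diversity (how many unique words appear relative to total words) and novelty relative to earlier responses. Higher diversity or novelty = Higher creativity score.
    - Implementation: sentence-transformers/all-mpnet-base-v2 embeddings. The evaluation code scores novelty as 1 minus the cosine similarity between the current responses and previously seen responses.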
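
The lexical diversity mentioned under Creativity can be illustrated with a type-token ratio. This is only a minimal sketch: the function names `lexical_diversity` and `diversity_to_score` are hypothetical, the tokenizer is a simple regular expression, and the mapping onto a 1-10 scale is an assumption chosen to resemble the scaling used by the other metrics.

```python
import re


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique words divided by total words (0.0 for empty text)."""
    # Lowercase and keep runs of letters, digits, and apostrophes as tokens
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def diversity_to_score(ratio: float) -> int:
    """Map a ratio in [0, 1] onto a 1-10 scale (assumed scaling, clamped)."""
    return max(1, min(10, int(1 + ratio * 9)))
```

A response that repeats the same word scores near the bottom of the scale, while a response with no repeated words scores 10. The ratio tends to fall as responses get longer, so comparing scores across responses of very different lengths may be misleading.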

In [1]:
# !pip install transformers sentence-transformers detoxify torch torchvision

In [2]:
import json
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline, CLIPProcessor, CLIPModel
from detoxify import Detoxify
from PIL import Image
import io
import base64



In [3]:
# Load Transformer Models
similarity_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # Relevance, Coherence
fact_checker = pipeline("text-classification", model="facebook/bart-large-mnli")  # Factual Accuracy
toxicity_model = pipeline("text-classification", model="unitary/toxic-bert")  # Bias & Toxicity
fluency_model = pipeline("text-classification", model="textattack/roberta-base-CoLA")  # Fluency
vision_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # Image Alignment
vision_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")  # CLIP Processor        

Device set to use mps:0
Device set to use mps:0
Some weights of the model checkpoint at textattack/roberta-base-CoLA were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [15]:
def evaluate_relevance(user_prompts, bot_responses):
    """
    Measures how well the chatbot's responses align with the overall conversation context.
    """
    if not bot_responses or not user_prompts:
        return 5, "No responses provided."

    # Compute embeddings for all user prompts and bot responses
    embeddings_prompt = similarity_model.encode(user_prompts, convert_to_tensor=True)
    embeddings_response = similarity_model.encode(bot_responses, convert_to_tensor=True)
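    # Cosine similarity between the mean prompt embedding and the mean response embedding
    cosine = util.pytorch_cos_sim(embeddings_prompt.mean(dim=0), embeddings_response.mean(dim=0)).item()
    similarity_score = max(1.0, min(10.0, cosine * 10))  # Clamp: cosine similarity can be negative

    return round(similarity_score), f"Semantic relevance score: {similarity_score:.2f}"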

def evaluate_coherence(bot_responses):
    """
    Measures coherence by computing **average sentence similarity** across chatbot responses.
    """
    if len(bot_responses) < 2:
        return 5, "Not enough chatbot responses to measure coherence."

    # Compute coherence based on similarity between consecutive bot responses
    similarities = []
    for i in range(len(bot_responses) - 1):
        emb1 = similarity_model.encode(bot_responses[i], convert_to_tensor=True)
        emb2 = similarity_model.encode(bot_responses[i + 1], convert_to_tensor=True)
        similarity = util.pytorch_cos_sim(emb1, emb2).item()
        similarities.append(similarity)
    avg_similarity = sum(similarities) / len(similarities) if similarities else 0.5
    # Map similarity onto the 1-10 scale; clamp because cosine similarity can be negative
    scaled_score = max(1, min(10, int(1 + (avg_similarity * 9))))

    return scaled_score, f"Average bot response coherence: {avg_similarity:.2f}"

def evaluate_factual_accuracy(bot_responses):
    """
    Uses NLI to check factual consistency across the entire conversation.
    """
    if not bot_responses:
        return 5, "No responses provided."
    
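    # top_k=None returns a score for every NLI label; truncation guards against over-long input
    result = fact_checker(" ".join(bot_responses), top_k=None, truncation=True)
    if result and isinstance(result[0], list):  # Some transformers versions nest the output
        result = result[0]
    # Use the entailment probability itself, not the score of whichever label ranks first
    entailment_score = next((r["score"] for r in result if r["label"].lower() == "entailment"), 0.0) * 10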

    return round(entailment_score), f"Factual consistency score: {entailment_score:.2f}"

def evaluate_bias_toxicity(bot_responses):
    """
    Detects bias or toxic language in chatbot responses.
    """
    if not bot_responses:
        return 10, "No responses provided."
    
    combined_text = " ".join(bot_responses)
    toxicity_prob = float(Detoxify('original').predict(combined_text)['toxicity'])  # Probability in [0, 1]
    final_score = max(1, 10 - int(toxicity_prob * 10))  # Lower toxicity = higher score

    return final_score, f"Toxicity probability: {toxicity_prob:.2f}"

def evaluate_fluency(bot_responses):
    """
    Uses CoLA model to check grammatical fluency across chatbot responses.
    """
    if not bot_responses:
        return 5, "No responses provided."

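    # Score each response by the probability of the "acceptable" class rather than the top label.
    # Assumption: LABEL_1 is the acceptable class, following the CoLA label convention.
    fluency_scores = []
    for response in bot_responses:
        labels = fluency_model(response, top_k=None, truncation=True)
        if labels and isinstance(labels[0], list):  # Some transformers versions nest the output
            labels = labels[0]
        fluency_scores.append(next((item["score"] for item in labels if item["label"] == "LABEL_1"), 0.0) * 10)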
    avg_fluency = sum(fluency_scores) / len(fluency_scores)

    return round(avg_fluency), f"Average fluency score: {avg_fluency:.2f}"

def evaluate_image_alignment(bot_responses, image_data, image_tag_mapping):
    """
    Uses CLIP to check if chatbot responses align with encoded images.
    """
    if not bot_responses or not image_data:
        return 5, "No images or responses provided."

    scores = []
    processed_texts = []
    processed_images = []

    # Process each image referenced in the conversation
    for tag, img_name in image_tag_mapping.items():
        for img in image_data:
            if img["name"] == img_name:
                # Decode base64 image
                image = Image.open(io.BytesIO(base64.b64decode(img["base64"])))
                
                processed_images.append(image)
                processed_texts.append(" ".join(bot_responses))  # Combine bot responses

    if not processed_images:
        return 5, "No valid images found for evaluation."

    # Process text and images using CLIP with truncation/padding
    inputs = vision_processor(
        text=processed_texts, 
        images=processed_images, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    )
    
    # Move tensors to the correct device
    inputs = {key: val.to(vision_model.device) for key, val in inputs.items()}

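    with torch.no_grad():
        outputs = vision_model(**inputs)
        # logits_per_image is scaled by CLIP's learned temperature, so use the
        # L2-normalized embeddings to get a plain cosine similarity per image-text pair
        similarity_scores = (outputs.image_embeds * outputs.text_embeds).sum(dim=-1).cpu().numpy()

    # Min-max normalization of the cosine similarity onto the 1-10 scale
    min_clip_score, max_clip_score = 0.1, 0.9  # Adjust based on dataset range
    avg_score = float(similarity_scores.mean()) if similarity_scores.size > 0 else 0.5
    normalized_score = int(1 + ((avg_score - min_clip_score) / (max_clip_score - min_clip_score)) * 9)
    normalized_score = max(1, min(10, normalized_score))  # Clamp to the 1-10 range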

    return normalized_score, f"CLIP text-image similarity score: {avg_score:.2f}"

def evaluate_creativity(bot_responses, past_responses):
    """
    Measures creativity by comparing bot responses against past responses.
    """
    if not bot_responses:
        return 5, "No responses provided."
    
    embedding_current = similarity_model.encode(bot_responses, convert_to_tensor=True)
    embedding_past = similarity_model.encode(past_responses, convert_to_tensor=True) if past_responses else None

    max_similarity = 0 if embedding_past is None else util.pytorch_cos_sim(embedding_current.mean(dim=0), embedding_past.mean(dim=0)).item()
    creativity_score = max(1, min(10, int((1 - max_similarity) * 10)))  # Inverse similarity

    return creativity_score, f"Novelty score: {1 - max_similarity:.2f} (Lower similarity = more creative)"
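
The evaluation functions above each convert a raw value (a cosine similarity or a probability) into a score on a 1-10 scale, with slightly different formulas. One way to keep that conversion consistent is a single clamped min-max helper. The sketch below is an assumption about how such a helper could look, not part of the original notebook; the name `to_ten_point` and its default bounds are hypothetical.

```python
def to_ten_point(value: float, low: float = 0.0, high: float = 1.0) -> int:
    """Min-max scale `value` from [low, high] onto integers 1..10, clamping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Position of the value inside the expected range, limited to [0, 1]
    fraction = (value - low) / (high - low)
    fraction = max(0.0, min(1.0, fraction))
    return int(1 + fraction * 9)
```

With the default bounds, a negative cosine similarity maps to 1 instead of producing an out-of-range score. For CLIP similarities, the bounds would be passed explicitly, for example `to_ten_point(avg_score, 0.1, 0.9)`.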

In [None]:
# Load the JSON dataset
with open("human_bot_conversation.json", "r") as file:
    conversations = json.load(file)

results = []
past_responses = []  # Stores previous bot responses for creativity evaluation

for idx, conversation_data in enumerate(conversations, 1):
    conversation_text = conversation_data.get("conversation", "")
    turns = conversation_text.split("\n\n")  # Split turns by double newlines

    # Extract all user prompts and bot responses
    user_prompts = [
        t.replace("HUMAN:", "").replace("[HUMAN]:", "").strip() 
        for t in turns if "HUMAN:" in t or "[HUMAN]:" in t
    ]
    bot_responses = [
        t.replace("BOT:", "").replace("[BOT]:", "").strip() 
        for t in turns if "BOT:" in t or "[BOT]:" in t
    ]

    # Extract images from JSON (if present)
    image_tag_mapping = conversation_data.get("image_tag_mapping", {})
    encoded_images = conversation_data.get("images", [])

    # Apply evaluations
    relevance_score, relevance_explanation = evaluate_relevance(user_prompts, bot_responses)
    coherence_score, coherence_explanation = evaluate_coherence(bot_responses)
    factual_accuracy_score, factual_accuracy_explanation = evaluate_factual_accuracy(bot_responses)
    bias_toxicity_score, bias_toxicity_explanation = evaluate_bias_toxicity(bot_responses)
    fluency_score, fluency_explanation = evaluate_fluency(bot_responses)
    image_alignment_score, image_alignment_explanation = evaluate_image_alignment(bot_responses, encoded_images, image_tag_mapping)
    creativity_score, creativity_explanation = evaluate_creativity(bot_responses, past_responses)

    # Store past responses for creativity comparison
    past_responses.extend(bot_responses)

    # Store results
    results.append({
        "conversation_id": idx,
        "evaluation_scores": {
            "Relevance": {"score": relevance_score, "explanation": relevance_explanation},
            "Coherence": {"score": coherence_score, "explanation": coherence_explanation},
            "Factual Accuracy": {"score": factual_accuracy_score, "explanation": factual_accuracy_explanation},
            "Bias & Toxicity": {"score": bias_toxicity_score, "explanation": bias_toxicity_explanation},
            "Fluency": {"score": fluency_score, "explanation": fluency_explanation},
            "Image Alignment": {"score": image_alignment_score, "explanation": image_alignment_explanation},
            "Creativity": {"score": creativity_score, "explanation": creativity_explanation}
        }
    })

# Save JSON results
with open("Final_Conversation_Evaluation.json", "w") as output_file:
    json.dump(results, output_file, indent=4)

print("Evaluation completed. Results saved to 'Final_Conversation_Evaluation.json'.")
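
The turn extraction in the loop above checks whether a speaker label appears anywhere in a turn, so a human turn that happens to mention "BOT:" would also be counted as a bot response. A stricter variant, sketched here under the assumption that every turn begins with its label, matches the label only at the start of the turn. The function name `split_turns` and the pattern `_TURN_LABEL` are hypothetical.

```python
import re

# Matches a speaker label at the start of a turn: "HUMAN:", "[HUMAN]:", "BOT:" or "[BOT]:"
_TURN_LABEL = re.compile(r"^\s*\[?(HUMAN|BOT)\]?:\s*")


def split_turns(conversation_text: str):
    """Split a conversation into (user_prompts, bot_responses) by leading speaker label."""
    user_prompts, bot_responses = [], []
    for turn in conversation_text.split("\n\n"):  # Turns are separated by blank lines
        match = _TURN_LABEL.match(turn)
        if match is None:
            continue  # Skip text without a recognized leading label
        content = turn[match.end():].strip()
        if match.group(1) == "HUMAN":
            user_prompts.append(content)
        else:
            bot_responses.append(content)
    return user_prompts, bot_responses
```

For example, `split_turns("HUMAN: Hi\n\n[BOT]: Hello!")` returns `(["Hi"], ["Hello!"])`. Turns without a leading label are dropped, so multi-paragraph responses would need extra handling.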