# Prompt-Response Aggregation Similarity Score Example

This notebook demonstrates how to use the Prompt-Response Aggregation Similarity Metric (`AggregatedSimilarityMetric`) to score a pair of responses against each other and against the prompt they share.

In [1]:
from examples.llm_aware_metrics.code.aggregated_similarity_score import AggregatedSimilarityMetric
from llm_metrics.semantic_similarity_metrics import BERTScore

## Example Data

In [2]:
prompt1 = "What are the main causes of climate change?"

# Well-aligned response
response1 = "The main causes of climate change include greenhouse gas emissions from burning fossil fuels, deforestation, and industrial processes."

# Partially aligned response
response2 = "Climate change is a serious issue. We need to reduce pollution and plant more trees."

# Poorly aligned response
response3 = "The weather has been quite unusual lately. Yesterday it rained all day."

## Using the Aggregation Similarity Metric

In [3]:
# Initialize the metric
base_metric = BERTScore(model_type="microsoft/deberta-xlarge-mnli")
agg_metric = AggregatedSimilarityMetric(base_metric)

In [4]:
# Compare the well-aligned response with the partially aligned one
score1 = agg_metric.calculate_with_prompt(
    response1,
    response2,
    prompt1
)

print(f"Aggregated Similarity score for well-aligned vs. partially aligned response:\n{score1}")

Aggregated Similarity score for well-aligned vs. partially aligned response:
{'precision': 0.6514176527659098, 'recall': 0.5639405051867167, 'f1': 0.600588838259379}


In [5]:
# Compare with poorly aligned response
score2 = agg_metric.calculate_with_prompt(
    response1,
    response3,
    prompt1
)

print(f"Aggregated Similarity score with poorly aligned response:\n{score2}")

Aggregated Similarity score with poorly aligned response:
{'precision': 0.5123771925767263, 'recall': 0.4340182642141978, 'f1': 0.4655380845069885}
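
To make the gap between the two runs easier to read, the printed dictionaries can be compared key by key. The following is a minimal sketch that uses only the values printed above, copied in as literals so that it runs without loading the model. `score_gap` is a hypothetical helper name, not part of the library.

```python
# Values copied from the two outputs above (In [4] and In [5])
score_vs_partial = {
    "precision": 0.6514176527659098,
    "recall": 0.5639405051867167,
    "f1": 0.600588838259379,
}
score_vs_poor = {
    "precision": 0.5123771925767263,
    "recall": 0.4340182642141978,
    "f1": 0.4655380845069885,
}


def score_gap(higher, lower):
    """Return the per-key difference between two score dictionaries."""
    return {key: higher[key] - lower[key] for key in higher}


gap = score_gap(score_vs_partial, score_vs_poor)
for key, value in gap.items():
    print(f"{key}: {value:.3f}")
```

All three values drop by roughly 0.13 to 0.14 when the second response is swapped for the poorly aligned one.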


## Analyzing Components of the Score

In [6]:
# Break the aggregated score down into its three component scores
def analyze_aggregation(first_response, second_response, prompt):
    """Print the two prompt-response scores and the response-response score."""
    # Score each response against the prompt
    agg1 = agg_metric.calculate_prompt_aggregation(prompt, first_response)
    agg2 = agg_metric.calculate_prompt_aggregation(prompt, second_response)
    # Score the two responses against each other
    response_similarity = base_metric.calculate(first_response, second_response)

    print(f"Response 1 Prompt Aggregation:\n{agg1}")
    print(f"Response 2 Prompt Aggregation:\n{agg2}")
    print(f"Response Similarity:\n{response_similarity}")

print("Analysis of well-aligned vs. partially aligned response:\n")
analyze_aggregation(response1, response2, prompt1)

print("\nAnalysis with poorly aligned response:\n")
analyze_aggregation(response1, response3, prompt1)

Analysis of well-aligned vs. partially aligned response:

Response 1 Prompt Aggregation:
{'precision': 0.7984908819198608, 'recall': 0.5745624899864197, 'f1': 0.6682667136192322}
Response 2 Prompt Aggregation:
{'precision': 0.6100817918777466, 'recall': 0.5354424715042114, 'f1': 0.5703305006027222}
Response Similarity:
{'precision': 0.5456802845001221, 'recall': 0.581816554069519, 'f1': 0.5631693005561829}

Analysis with poorly aligned response:

Response 1 Prompt Aggregation:
{'precision': 0.7984908819198608, 'recall': 0.5745624899864197, 'f1': 0.6682667136192322}
Response 2 Prompt Aggregation:
{'precision': 0.4078367352485657, 'recall': 0.34359011054039, 'f1': 0.3729669153690338}
Response Similarity:
{'precision': 0.3308039605617523, 'recall': 0.3839021921157837, 'f1': 0.3553806245326996}
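
The printed numbers suggest how the components combine. In both runs, the aggregated score reported earlier matches the unweighted mean of the three component scores (the two prompt aggregations and the response similarity). This is an observation drawn from the outputs in this notebook, not a statement about how `AggregatedSimilarityMetric` is implemented. The sketch below checks it using values copied from the outputs; `mean_of_components` is a hypothetical helper.

```python
import math

KEYS = ("precision", "recall", "f1")

# Component scores copied from the analysis output above
prompt_agg_r1 = {"precision": 0.7984908819198608, "recall": 0.5745624899864197, "f1": 0.6682667136192322}
prompt_agg_r2 = {"precision": 0.6100817918777466, "recall": 0.5354424715042114, "f1": 0.5703305006027222}
similarity_r1_r2 = {"precision": 0.5456802845001221, "recall": 0.581816554069519, "f1": 0.5631693005561829}
prompt_agg_r3 = {"precision": 0.4078367352485657, "recall": 0.34359011054039, "f1": 0.3729669153690338}
similarity_r1_r3 = {"precision": 0.3308039605617523, "recall": 0.3839021921157837, "f1": 0.3553806245326996}

# Aggregated scores copied from the outputs of In [4] and In [5]
reported_1 = {"precision": 0.6514176527659098, "recall": 0.5639405051867167, "f1": 0.600588838259379}
reported_2 = {"precision": 0.5123771925767263, "recall": 0.4340182642141978, "f1": 0.4655380845069885}


def mean_of_components(*components):
    """Average each of precision, recall, and f1 across the given score dicts."""
    return {key: sum(c[key] for c in components) / len(components) for key in KEYS}


def matches(computed, reported, tol=1e-9):
    """Check that two score dicts agree on every key within an absolute tolerance."""
    return all(math.isclose(computed[k], reported[k], abs_tol=tol) for k in KEYS)


mean_1 = mean_of_components(prompt_agg_r1, prompt_agg_r2, similarity_r1_r2)
mean_2 = mean_of_components(prompt_agg_r1, prompt_agg_r3, similarity_r1_r3)

print(matches(mean_1, reported_1))
print(matches(mean_2, reported_2))
```

Note that the aggregated `f1` is therefore the mean of the component F1 values, not the harmonic mean of the aggregated precision and recall.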
