LLM Test Mate 🤝

A simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based (LLM as a judge) evaluation.

🚀 Features

  • πŸ“ String similarity testing using Damerau-Levenshtein distance and other methods
  • πŸ“Š Semantic similarity testing using sentence transformers
  • πŸ€– LLM-based evaluation of content quality and correctness
  • πŸ”§ Easy integration with pytest
  • πŸ“ Comprehensive test reports
  • 🎯 Sensible defaults with flexible overrides

📚 Overview

Initialization

Using default models:

tester = LLMTestMate(
    similarity_threshold=0.8,
    temperature=0.7
)
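
You can also pass explicit model choices. A minimal sketch based on parameters used elsewhere in this README (the embedding_model parameter name is an assumption inferred from the result fields shown later and may differ):

from llm_test_mate import LLMTestMate

# Explicitly choose the judge model and (assumed) embedding model
tester = LLMTestMate(
    llm_model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",  # same model ID as in the quickstart below
    embedding_model="all-MiniLM-L6-v2",  # assumed parameter name; the value appears in the results section
    similarity_threshold=0.8,
    temperature=0.7
)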

Semantic Similarity Testing

tester.semantic_similarity(text: str, reference_text: str, threshold: Optional[float] = None)

Calculate semantic similarity between two texts using sentence transformers. Returns a similarity score and pass/fail status.

tester.semantic_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None)

Compare text against multiple references using semantic similarity. Returns results sorted by similarity score.

String Similarity Testing

tester.string_similarity(text: str, reference_text: str, threshold: Optional[float] = None, 
    normalize_case: bool = True, normalize_whitespace: bool = True,
    remove_punctuation: bool = True, method: str = "damerau-levenshtein")

Calculate string similarity using various distance metrics (damerau-levenshtein, levenshtein, hamming, jaro, jaro-winkler, indel).

tester.string_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None,
    normalize_case: bool = True, normalize_whitespace: bool = True,
    remove_punctuation: bool = True, method: str = "damerau-levenshtein")

Compare text against multiple references using string similarity. Returns results sorted by similarity score.

LLM-Based Evaluation

tester.llm_evaluate(text: str, reference_text: str, criteria: Optional[str] = None,
    model: Optional[str] = None, temperature: Optional[float] = None,
    max_tokens: Optional[int] = None)

Evaluate text quality and correctness using an LLM as a judge. Returns a detailed analysis in JSON format.

tester.llm_evaluate_list(text: str, reference_texts: list[str], criteria: Optional[str] = None,
    model: Optional[str] = None, temperature: Optional[float] = None,
    max_tokens: Optional[int] = None)

Evaluate text against multiple references using an LLM. Returns results sorted by similarity score when available.
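
All of the *_list methods return results sorted by similarity (highest first), so the best match is the first entry. A small sketch using the result fields shown in the sample output later in this README:

results = tester.semantic_similarity_list(
    "A swift brown fox leaps above a sleepy canine.",
    [
        "The quick brown fox jumps over the lazy dog.",
        "A fast brown fox leaps over a sleeping dog.",
    ],
)

best = results[0]  # sorted by similarity, highest first
print(best["reference_text"], best["similarity"], best["passed"])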

πŸƒβ€β™‚οΈ Quick Start

Installation

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# To run the examples
python examples.py

# To run the tests
pytest                      # Run all tests
pytest test_examples.py     # Run all tests in file
pytest test_examples.py -v  # Run with verbose output
pytest test_examples.py::test_semantic_similarity  # Run a specific test

The test examples (test_examples.py) include:

  • Semantic similarity testing
  • LLM-based evaluation
  • Custom evaluation criteria with Llama
  • Model comparison tests
  • Parameterized threshold testing

Here's how to get started using this tool (see quickstart.py):

import json

from llm_test_mate import LLMTestMate

# Initialize the test mate with your preferences
tester = LLMTestMate(
    llm_model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    similarity_threshold=0.8,
    temperature=0.7
)

# Example 1a: String similarity test (Single Reference)
print("\n=== Example 1a: String Similarity (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "The quikc brown fox jumps over the lasy dog."

result = tester.string_similarity(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print(f"String similarity score: {result['similarity']:.2f}")
print(f"Edit distance: {result['distance']:.2f}")
print(f"Passed threshold: {result['passed']}")

# Example 1b: String similarity test (Multiple References)
print("\n=== Example 1b: String Similarity (Multiple References) ===")
test_text = "The quick brown fox jumps over the lazy dog."
reference_texts = [
    "The quikc brown fox jumps over the lasy dog.",
    "The quick brwon fox jumps over the layz dog.",
    "The quick brown fox jumps over the lazy dog."
]

print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")

results = tester.string_similarity_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(f"Similarity score: {result['similarity']:.2f}")
    print(f"Edit distance: {result['distance']:.2f}")
    print(f"Passed threshold: {result['passed']}")

# Example 2a: Semantic similarity test (Single Reference)
print("\n=== Example 2a: Semantic Similarity (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A swift brown fox leaps above a sleepy canine."

result = tester.semantic_similarity(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print(f"Semantic similarity score: {result['similarity']:.2f}")
print(f"Passed threshold: {result['passed']}")

# Example 2b: Semantic similarity test (Multiple References)
print("\n=== Example 2b: Semantic Similarity (Multiple References) ===")
test_text = "A swift brown fox leaps above a sleepy canine."
reference_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleeping dog.",
    "The agile brown fox bounds over the tired dog."
]

print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")

results = tester.semantic_similarity_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(f"Similarity score: {result['similarity']:.2f}")
    print(f"Passed threshold: {result['passed']}")

# Example 3a: LLM-based evaluation (Single Reference)
print("\n=== Example 3a: LLM Evaluation (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A swift brown fox leaps above a sleepy canine."

result = tester.llm_evaluate(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print("Evaluation result:")
print(json.dumps(result, indent=2))

# Example 3b: LLM-based evaluation (Multiple References)
print("\n=== Example 3b: LLM Evaluation (Multiple References) ===")
test_text = "A swift brown fox leaps above a sleepy canine."
reference_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleeping dog.",
    "The agile brown fox bounds over the tired dog."
]

print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")

results = tester.llm_evaluate_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(json.dumps(result, indent=2))

Sample output:

=== Example 1a: String Similarity (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: The quikc brown fox jumps over the lasy dog.
String similarity score: 0.95
Edit distance: 0.05
Passed threshold: True

=== Example 1b: String Similarity (Multiple References) ===
Test text: The quick brown fox jumps over the lazy dog.
Reference texts:
1. The quikc brown fox jumps over the lasy dog.
2. The quick brwon fox jumps over the layz dog.
3. The quick brown fox jumps over the lazy dog.

Results sorted by similarity (highest first):

Reference: The quick brown fox jumps over the lazy dog.
Similarity score: 1.00
Edit distance: 0.00
Passed threshold: True

Reference: The quikc brown fox jumps over the lasy dog.
Similarity score: 0.95
Edit distance: 0.05
Passed threshold: True

Reference: The quick brwon fox jumps over the layz dog.
Similarity score: 0.95
Edit distance: 0.05
Passed threshold: True

=== Example 2a: Semantic Similarity (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: A swift brown fox leaps above a sleepy canine.
Semantic similarity score: 0.79
Passed threshold: False

=== Example 2b: Semantic Similarity (Multiple References) ===
Test text: A swift brown fox leaps above a sleepy canine.
Reference texts:
1. The quick brown fox jumps over the lazy dog.
2. A fast brown fox leaps over a sleeping dog.
3. The agile brown fox bounds over the tired dog.

Results sorted by similarity (highest first):

Reference: A fast brown fox leaps over a sleeping dog.
Similarity score: 0.88
Passed threshold: True

Reference: The quick brown fox jumps over the lazy dog.
Similarity score: 0.79
Passed threshold: False

Reference: The agile brown fox bounds over the tired dog.
Similarity score: 0.72
Passed threshold: False

=== Example 3a: LLM Evaluation (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: A swift brown fox leaps above a sleepy canine.
Evaluation result:
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox jumping over a dog, with only minor variations in word choice.",
    "content_match": "The key elements (fox, brown, jumping, dog) are present in both texts, with slight differences in adjectives and verbs used.",
    "key_differences": [
      "Use of 'quick' vs 'swift'",
      "Use of 'jumps' vs 'leaps'",
      "Use of 'lazy' vs 'sleepy'",
      "Use of 'dog' vs 'canine'"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0"
}

=== Example 3b: LLM Evaluation (Multiple References) ===
Test text: A swift brown fox leaps above a sleepy canine.
Reference texts:
1. The quick brown fox jumps over the lazy dog.
2. A fast brown fox leaps over a sleeping dog.
3. The agile brown fox bounds over the tired dog.

Results sorted by similarity (highest first):

Reference: A fast brown fox leaps over a sleeping dog.
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both texts convey the same core meaning of a fox quickly moving over a resting dog.",
    "content_match": "The key elements (fox, brown, leaping, dog) are present in both texts with minor variations in descriptors.",
    "key_differences": [
      "Use of 'swift' vs 'fast' to describe the fox",
      "Use of 'above' vs 'over' for the fox's action",
      "Description of the dog as 'sleepy' vs 'sleeping'",
      "Use of 'canine' instead of 'dog' in the generated text"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "A fast brown fox leaps over a sleeping dog."
}

Reference: The agile brown fox bounds over the tired dog.
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox moving quickly over a dog.",
    "content_match": "The key elements (fox, brown, jumping over, dog) are present in both texts with slight variations in descriptors.",
    "key_differences": [
      "The generated text uses 'swift' instead of 'agile'",
      "The generated text uses 'leaps above' instead of 'bounds over'",
      "The generated text describes the dog as 'sleepy' instead of 'tired'",
      "'A' is used instead of 'The' at the beginning of the generated text"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "The agile brown fox bounds over the tired dog."
}

Reference: The quick brown fox jumps over the lazy dog.
{
  "passed": true,
  "similarity": 0.85,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox moving quickly over a dog.",
    "content_match": "The main elements (fox, dog, action of moving over) are present in both sentences, with slight variations in adjectives and verbs used.",
    "key_differences": [
      "Use of 'swift' instead of 'quick'",
      "Use of 'leaps above' instead of 'jumps over'",
      "Use of 'sleepy' instead of 'lazy'",
      "Absence of articles 'The' and 'the' in the generated text",
      "Use of 'canine' instead of 'dog'"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "The quick brown fox jumps over the lazy dog."
}

2. Custom Evaluation Criteria

# Initialize with custom criteria
tester = LLMTestMate(
    evaluation_criteria="""
    Evaluate the marketing effectiveness of the generated text compared to the reference.
    Consider:
    1. Feature Coverage: Are all key features mentioned?
    2. Tone: Is it engaging and professional?
    3. Clarity: Is the message clear and concise?

    Return JSON with:
    {
        "passed": boolean,
        "effectiveness_score": float (0-1),
        "analysis": {
            "feature_coverage": string,
            "tone_analysis": string,
            "suggestions": list[string]
        }
    }
    """
)

product_description = "Our new smartphone features a 6.1-inch OLED display, 12MP camera, and all-day battery life."
generated_description = generate_text("Write a short description of a smartphone's key features")

eval_result = tester.llm_evaluate(
    generated_description,
    product_description
)

Sample result:

{
  "passed": true,
  "effectiveness_score": 0.8,
  "analysis": {
    "feature_coverage": "The generated text provides a much more comprehensive coverage of smartphone features compared to the reference. It includes details on display, camera, performance, storage, battery, connectivity, operating system, and additional features, while the reference only mentions display, camera, and battery.",
    "tone_analysis": "The generated text maintains a professional and informative tone throughout, providing technical details and specifications. It is more detailed and technical compared to the concise, marketing-oriented tone of the reference.",
    "suggestions": [
      "Consider condensing some of the technical details for a more concise marketing message",
      "Add more engaging language or unique selling points to make the features stand out",
      "Include specific model comparisons or standout features to differentiate from competitors",
      "Consider adding a brief overview or summary statement at the beginning to capture attention quickly"
    ]
  },
  "model_used": "..."
}
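
Note that generate_text above is not part of LLM Test Mate; it stands for whatever code produces the content under test. A minimal sketch of such a helper using litellm (one of the listed dependencies) might look like this, with the model ID purely illustrative:

from litellm import completion

def generate_text(prompt: str) -> str:
    # Ask the model under test for content; any litellm-supported model ID works here
    response = completion(
        model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return response.choices[0].message.content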

3. Using with Pytest

import pytest
from llm_test_mate import LLMTestMate

@pytest.fixture
def tester():
    return LLMTestMate(
        similarity_threshold=0.8,
        temperature=0.7
    )

def test_generated_content(tester):
    generated = generate_text("Explain what is Python")
    expected = "Python is a high-level programming language..."
    
    # Check semantic similarity
    sem_result = tester.semantic_similarity(
        generated,
        expected
    )
    
    # Evaluate with LLM
    llm_result = tester.llm_evaluate(
        generated,
        expected
    )
    
    assert sem_result["passed"], "Failed similarity check"
    assert llm_result["passed"], f"Failed requirements: {llm_result['analysis']}"

🛠️ Advanced Usage

String Similarity Testing

LLM Test Mate provides comprehensive string similarity testing with multiple methods and configuration options:

  1. Basic Usage:
result = tester.string_similarity(
    "The quick brown fox jumps over the lazy dog!",
    "The quikc brown fox jumps over the lasy dog",  # Different punctuation and typos
    threshold=0.9
)
  2. Available Methods:

Method               Best For              Description
damerau-levenshtein  General text          Handles transposed letters, good default choice
levenshtein          Simple comparisons    Basic edit distance
hamming              Equal length strings  Counts position differences
jaro                 Short strings         Good for typos in short text
jaro-winkler         Names                 Optimized for name comparisons
indel                Subsequence matching  Based on longest common subsequence
  3. Configuration Options:
  • normalize_case: Convert to lowercase (default: True)
  • normalize_whitespace: Standardize spaces (default: True)
  • remove_punctuation: Ignore punctuation marks (default: True)
  • processor: Custom function for text preprocessing
  • threshold: Similarity threshold for pass/fail (0-1)
  • method: Choice of similarity metric
  4. Example Usage:
# Name comparison with Jaro-Winkler
result = tester.string_similarity(
    "John Smith",
    "Jon Smyth",
    method="jaro-winkler",
    threshold=0.8
)

# Text with custom preprocessing
def remove_special_chars(text: str) -> str:
    return ''.join(c for c in text if c.isalnum() or c.isspace())

result = tester.string_similarity(
    "Hello! @#$ World",
    "Hello World",
    processor=remove_special_chars,
    threshold=0.9
)

# Combined options
result = tester.string_similarity(
    "Hello,  WORLD!",
    "hello world",
    method="damerau-levenshtein",
    normalize_case=True,
    normalize_whitespace=True,
    remove_punctuation=True,
    processor=remove_special_chars,
    threshold=0.9
)
  5. Result Dictionary:
{
    "similarity": 0.95,        # Similarity score (0-1)
    "distance": 0.05,         # Distance score (0-1)
    "method": "jaro-winkler", # Method used
    "normalized": {           # Applied normalizations
        "case": True,
        "whitespace": True,
        "punctuation": True
    },
    "options": {              # Additional options
        "processor": "remove_special_chars"
    },
    "passed": True,           # If threshold was met
    "threshold": 0.9         # Threshold used
}
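
In a test, these fields can be asserted on directly, for example (field names taken from the dictionary above):

result = tester.string_similarity(
    "John Smith",
    "Jon Smyth",
    method="jaro-winkler",
    threshold=0.8
)

assert result["passed"], (
    f"{result['method']} similarity {result['similarity']:.2f} "
    f"is below the {result['threshold']} threshold"
)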

Combined Testing Approach 🔄

def test_comprehensive_check(tester):
    generated = generate_text("Write a recipe")
    expected = """
    Recipe must include:
    - Ingredients list
    - Instructions
    - Cooking time
    """
    
    # Check similarity
    sem_result = tester.semantic_similarity(
        generated,
        expected
    )
    
    # Detailed evaluation
    llm_result = tester.llm_evaluate(
        generated,
        expected
    )
    
    assert sem_result["passed"], "Failed similarity check"
    assert llm_result["passed"], f"Failed requirements: {llm_result['analysis']}"

📊 Comprehensive Test Results

When running tests with LLM Test Mate, you get comprehensive results from two types of evaluations:

Semantic Similarity Results

{
    "similarity": 0.85,        # Similarity score between 0-1
    "embedding_model": "all-MiniLM-L6-v2",  # Model used for embeddings
    "passed": True,           # Whether it passed the threshold
    "threshold": 0.8          # The threshold used for this test
}

LLM Evaluation Results

{
    "passed": True,           # Overall pass/fail assessment
    "similarity_score": 0.9,  # Semantic similarity assessment by LLM
    "analysis": {
        "semantic_match": "The texts convey very similar meanings...",
        "content_match": "Both texts cover the same key points...",
        "key_differences": [
            "Minor variation in word choice",
            "Slightly different emphasis on..."
        ]
    },
    "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0"  # Model used for evaluation
}

For custom evaluation criteria, the results will match your specified JSON structure. For example, with marketing evaluation:

{
    "passed": True,
    "effectiveness_score": 0.85,
    "analysis": {
        "feature_coverage": "All key features mentioned...",
        "tone_analysis": "Professional and engaging...",
        "suggestions": [
            "Consider emphasizing battery life more",
            "Add specific camera capabilities"
        ]
    },
    "model_used": "meta.llama3-2-90b-instruct-v1:0"
}

Benefits of Combined Testing

When using both approaches together, you get:

  • Quantitative similarity metrics from embedding comparison
  • Qualitative content evaluation from LLM analysis
  • Model-specific insights (can compare different LLM evaluations)
  • Clear pass/fail indicators for automated testing
  • Detailed feedback for manual review

This comprehensive approach helps ensure both semantic closeness to reference content and qualitative correctness of the generated output.
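
For example, model-specific insights can be gathered by running the same evaluation with different judge models via the model parameter of llm_evaluate. A brief sketch (the second model ID is illustrative):

generated = generate_text("Write a recipe")  # same helper as in the combined test above
expected = "Recipe must include: ingredients list, instructions, cooking time"

claude_eval = tester.llm_evaluate(
    generated,
    expected,
    model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
)
llama_eval = tester.llm_evaluate(
    generated,
    expected,
    model="bedrock/meta.llama3-2-90b-instruct-v1:0",  # illustrative model ID
)

print(claude_eval["model_used"], claude_eval["passed"])
print(llama_eval["model_used"], llama_eval["passed"])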

🔧 Adding to Your Project

The simplest way to add LLM Test Mate to your project is to copy the llm_test_mate.py file:

  1. Copy llm_test_mate.py to your project's test directory
  2. Add the required dependencies to your requirements.txt file:
  • litellm
  • sentence-transformers
  • boto3
  • pytest
  • rapidfuzz
  1. Install the dependencies:
pip install -r requirements.txt

Project Structure

Typical integration into an existing project:

your_project/
├── src/
│   └── your_code.py
├── tests/
│   ├── llm_test_mate.py    # Copy the file here
│   ├── your_test_file.py   # Your LLM tests
│   └── conftest.py         # Pytest fixtures
├── requirements.txt        # Add dependencies here
└── pytest.ini              # Optional pytest configuration

Example conftest.py:

import pytest
from llm_test_mate import LLMTestMate

@pytest.fixture
def llm_tester():
    return LLMTestMate(
        similarity_threshold=0.8,
        temperature=0.7
    )

@pytest.fixture
def strict_llm_tester():
    return LLMTestMate(
        similarity_threshold=0.9,
        temperature=0.5
    )

Example test file:

def test_product_description(llm_tester):
    expected = "Our product helps you test LLM outputs effectively."
    generated = your_llm_function("Describe our product")
    
    result = llm_tester.semantic_similarity(generated, expected)
    assert result['passed'], f"Generated text not similar enough: {result['similarity']}"

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

Distributed under the MIT License. See LICENSE for more information.

🙏 Acknowledgments
