A simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based (LLM as a judge) evaluation.
- String similarity testing using Damerau-Levenshtein distance and other methods
- Semantic similarity testing using sentence transformers
- LLM-based evaluation of content quality and correctness
- Easy integration with pytest
- Comprehensive test reports
- Sensible defaults with flexible overrides
Using default models:
tester = LLMTestMate(
    similarity_threshold=0.8,
    temperature=0.7
)
tester.semantic_similarity(text: str, reference_text: str, threshold: Optional[float] = None)
Calculate semantic similarity between two texts using sentence transformers. Returns a similarity score and pass/fail status.
tester.semantic_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None)
Compare text against multiple references using semantic similarity. Returns results sorted by similarity score.
tester.string_similarity(text: str, reference_text: str, threshold: Optional[float] = None,
normalize_case: bool = True, normalize_whitespace: bool = True,
remove_punctuation: bool = True, method: str = "damerau-levenshtein")
Calculate string similarity using various distance metrics (damerau-levenshtein, levenshtein, hamming, jaro, jaro-winkler, indel).
tester.string_similarity_list(text: str, reference_texts: list[str], threshold: Optional[float] = None,
normalize_case: bool = True, normalize_whitespace: bool = True,
remove_punctuation: bool = True, method: str = "damerau-levenshtein")
Compare text against multiple references using string similarity. Returns results sorted by similarity score.
tester.llm_evaluate(text: str, reference_text: str, criteria: Optional[str] = None,
model: Optional[str] = None, temperature: Optional[float] = None,
max_tokens: Optional[int] = None)
Evaluate text quality and correctness using an LLM as judge. Returns detailed analysis in JSON format.
tester.llm_evaluate_list(text: str, reference_texts: list[str], criteria: Optional[str] = None,
model: Optional[str] = None, temperature: Optional[float] = None,
max_tokens: Optional[int] = None)
Evaluate text against multiple references using LLM. Returns results sorted by similarity if available.
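The optional parameters listed above (criteria, model, temperature, max_tokens) can also be overridden per call. A minimal sketch, reusing the Bedrock model ID shown elsewhere in this README; swap in whatever model your LiteLLM setup can reach:

from llm_test_mate import LLMTestMate

tester = LLMTestMate(similarity_threshold=0.8)

# Per-call overrides; anything not passed here falls back to the
# defaults set in the constructor.
result = tester.llm_evaluate(
    "Python is a high-level programming language.",
    "Python is a popular, high-level language used for general-purpose programming.",
    model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    temperature=0.2,
    max_tokens=512,
)
print(result["passed"], result.get("similarity"))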
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# To run the examples
python examples.py
# To run the tests
pytest # Run all tests
pytest test_examples.py # Run all tests in file
pytest test_examples.py -v # Run with verbose output
pytest test_examples.py::test_semantic_similarity # Run a specific test
The test examples (test_examples.py) include:
- Semantic similarity testing
- LLM-based evaluation
- Custom evaluation criteria with Llama
- Model comparison tests
- Parameterized threshold testing (see the sketch below)
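For instance, a parameterized threshold test might look like the following sketch. The texts and thresholds are illustrative; with the default embedding model, the sample run later in this README reports a score of about 0.79 for this pair:

import pytest
from llm_test_mate import LLMTestMate

@pytest.mark.parametrize("threshold,expected_pass", [
    (0.5, True),    # loose threshold should pass for close paraphrases
    (0.99, False),  # near-exact threshold should fail for paraphrases
])
def test_threshold_sensitivity(threshold, expected_pass):
    tester = LLMTestMate()
    result = tester.semantic_similarity(
        "A swift brown fox leaps above a sleepy canine.",
        "The quick brown fox jumps over the lazy dog.",
        threshold=threshold,
    )
    assert result["passed"] == expected_pass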
Here's how to get started using this tool (see quickstart.py):
import json
from llm_test_mate import LLMTestMate
# Initialize the test mate with your preferences
tester = LLMTestMate(
    llm_model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    similarity_threshold=0.8,
    temperature=0.7
)
# Example 1a: String similarity test (Single Reference)
print("\n=== Example 1a: String Similarity (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "The quikc brown fox jumps over the lasy dog."
result = tester.string_similarity(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print(f"String similarity score: {result['similarity']:.2f}")
print(f"Edit distance: {result['distance']:.2f}")
print(f"Passed threshold: {result['passed']}")
# Example 1b: String similarity test (Multiple References)
print("\n=== Example 1b: String Similarity (Multiple References) ===")
test_text = "The quick brown fox jumps over the lazy dog."
reference_texts = [
    "The quikc brown fox jumps over the lasy dog.",
    "The quick brwon fox jumps over the layz dog.",
    "The quick brown fox jumps over the lazy dog."
]
print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")
results = tester.string_similarity_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(f"Similarity score: {result['similarity']:.2f}")
    print(f"Edit distance: {result['distance']:.2f}")
    print(f"Passed threshold: {result['passed']}")
# Example 2a: Semantic similarity test (Single Reference)
print("\n=== Example 2a: Semantic Similarity (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A swift brown fox leaps above a sleepy canine."
result = tester.semantic_similarity(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print(f"Semantic similarity score: {result['similarity']:.2f}")
print(f"Passed threshold: {result['passed']}")
# Example 2b: Semantic similarity test (Multiple References)
print("\n=== Example 2b: Semantic Similarity (Multiple References) ===")
test_text = "A swift brown fox leaps above a sleepy canine."
reference_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleeping dog.",
    "The agile brown fox bounds over the tired dog."
]
print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")
results = tester.semantic_similarity_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(f"Similarity score: {result['similarity']:.2f}")
    print(f"Passed threshold: {result['passed']}")
# Example 3a: LLM-based evaluation (Single Reference)
print("\n=== Example 3a: LLM Evaluation (Single Reference) ===")
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A swift brown fox leaps above a sleepy canine."
result = tester.llm_evaluate(text1, text2)
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print("Evaluation result:")
print(json.dumps(result, indent=2))
# Example 3b: LLM-based evaluation (Multiple References)
print("\n=== Example 3b: LLM Evaluation (Multiple References) ===")
test_text = "A swift brown fox leaps above a sleepy canine."
reference_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleeping dog.",
    "The agile brown fox bounds over the tired dog."
]
print(f"Test text: {test_text}")
print("Reference texts:")
for i, ref in enumerate(reference_texts, 1):
    print(f"{i}. {ref}")
results = tester.llm_evaluate_list(test_text, reference_texts)
print("\nResults sorted by similarity (highest first):")
for result in results:
    print(f"\nReference: {result['reference_text']}")
    print(json.dumps(result, indent=2))
Sample output:
=== Example 1a: String Similarity (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: The quikc brown fox jumps over the lasy dog.
String similarity score: 0.95
Edit distance: 0.05
Passed threshold: True
=== Example 1b: String Similarity (Multiple References) ===
Test text: The quick brown fox jumps over the lazy dog.
Reference texts:
1. The quikc brown fox jumps over the lasy dog.
2. The quick brwon fox jumps over the layz dog.
3. The quick brown fox jumps over the lazy dog.
Results sorted by similarity (highest first):
Reference: The quick brown fox jumps over the lazy dog.
Similarity score: 1.00
Edit distance: 0.00
Passed threshold: True
Reference: The quikc brown fox jumps over the lasy dog.
Similarity score: 0.95
Edit distance: 0.05
Passed threshold: True
Reference: The quick brwon fox jumps over the layz dog.
Similarity score: 0.95
Edit distance: 0.05
Passed threshold: True
=== Example 2a: Semantic Similarity (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: A swift brown fox leaps above a sleepy canine.
Semantic similarity score: 0.79
Passed threshold: False
=== Example 2b: Semantic Similarity (Multiple References) ===
Test text: A swift brown fox leaps above a sleepy canine.
Reference texts:
1. The quick brown fox jumps over the lazy dog.
2. A fast brown fox leaps over a sleeping dog.
3. The agile brown fox bounds over the tired dog.
Results sorted by similarity (highest first):
Reference: A fast brown fox leaps over a sleeping dog.
Similarity score: 0.88
Passed threshold: True
Reference: The quick brown fox jumps over the lazy dog.
Similarity score: 0.79
Passed threshold: False
Reference: The agile brown fox bounds over the tired dog.
Similarity score: 0.72
Passed threshold: False
=== Example 3a: LLM Evaluation (Single Reference) ===
Text 1: The quick brown fox jumps over the lazy dog.
Text 2: A swift brown fox leaps above a sleepy canine.
Evaluation result:
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox jumping over a dog, with only minor variations in word choice.",
    "content_match": "The key elements (fox, brown, jumping, dog) are present in both texts, with slight differences in adjectives and verbs used.",
    "key_differences": [
      "Use of 'quick' vs 'swift'",
      "Use of 'jumps' vs 'leaps'",
      "Use of 'lazy' vs 'sleepy'",
      "Use of 'dog' vs 'canine'"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0"
}
=== Example 3b: LLM Evaluation (Multiple References) ===
Test text: A swift brown fox leaps above a sleepy canine.
Reference texts:
1. The quick brown fox jumps over the lazy dog.
2. A fast brown fox leaps over a sleeping dog.
3. The agile brown fox bounds over the tired dog.
Results sorted by similarity (highest first):
Reference: A fast brown fox leaps over a sleeping dog.
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both texts convey the same core meaning of a fox quickly moving over a resting dog.",
    "content_match": "The key elements (fox, brown, leaping, dog) are present in both texts with minor variations in descriptors.",
    "key_differences": [
      "Use of 'swift' vs 'fast' to describe the fox",
      "Use of 'above' vs 'over' for the fox's action",
      "Description of the dog as 'sleepy' vs 'sleeping'",
      "Use of 'canine' instead of 'dog' in the generated text"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "A fast brown fox leaps over a sleeping dog."
}
Reference: The agile brown fox bounds over the tired dog.
{
  "passed": true,
  "similarity": 0.9,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox moving quickly over a dog.",
    "content_match": "The key elements (fox, brown, jumping over, dog) are present in both texts with slight variations in descriptors.",
    "key_differences": [
      "The generated text uses 'swift' instead of 'agile'",
      "The generated text uses 'leaps above' instead of 'bounds over'",
      "The generated text describes the dog as 'sleepy' instead of 'tired'",
      "'A' is used instead of 'The' at the beginning of the generated text"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "The agile brown fox bounds over the tired dog."
}
Reference: The quick brown fox jumps over the lazy dog.
{
  "passed": true,
  "similarity": 0.85,
  "analysis": {
    "semantic_match": "Both sentences convey the same core meaning of a fox moving quickly over a dog.",
    "content_match": "The main elements (fox, dog, action of moving over) are present in both sentences, with slight variations in adjectives and verbs used.",
    "key_differences": [
      "Use of 'swift' instead of 'quick'",
      "Use of 'leaps above' instead of 'jumps over'",
      "Use of 'sleepy' instead of 'lazy'",
      "Absence of articles 'The' and 'the' in the generated text",
      "Use of 'canine' instead of 'dog'"
    ]
  },
  "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "reference_text": "The quick brown fox jumps over the lazy dog."
}
# Initialize with custom criteria
tester = LLMTestMate(
    evaluation_criteria="""
    Evaluate the marketing effectiveness of the generated text compared to the reference.
    Consider:
    1. Feature Coverage: Are all key features mentioned?
    2. Tone: Is it engaging and professional?
    3. Clarity: Is the message clear and concise?
    Return JSON with:
    {
        "passed": boolean,
        "effectiveness_score": float (0-1),
        "analysis": {
            "feature_coverage": string,
            "tone_analysis": string,
            "suggestions": list[string]
        }
    }
    """
)
product_description = "Our new smartphone features a 6.1-inch OLED display, 12MP camera, and all-day battery life."
generated_description = generate_text("Write a short description of a smartphone's key features")
eval_result = tester.llm_evaluate(
    generated_description,
    product_description
)
Sample result:
{
  "passed": true,
  "effectiveness_score": 0.8,
  "analysis": {
    "feature_coverage": "The generated text provides a much more comprehensive coverage of smartphone features compared to the reference. It includes details on display, camera, performance, storage, battery, connectivity, operating system, and additional features, while the reference only mentions display, camera, and battery.",
    "tone_analysis": "The generated text maintains a professional and informative tone throughout, providing technical details and specifications. It is more detailed and technical compared to the concise, marketing-oriented tone of the reference.",
    "suggestions": [
      "Consider condensing some of the technical details for a more concise marketing message",
      "Add more engaging language or unique selling points to make the features stand out",
      "Include specific model comparisons or standout features to differentiate from competitors",
      "Consider adding a brief overview or summary statement at the beginning to capture attention quickly"
    ]
  },
  "model_used": "..."
}
import pytest
from llm_test_mate import LLMTestMate

@pytest.fixture
def tester():
    return LLMTestMate(
        similarity_threshold=0.8,
        temperature=0.7
    )

def test_generated_content(tester):
    generated = generate_text("Explain what is Python")
    expected = "Python is a high-level programming language..."

    # Check semantic similarity
    sem_result = tester.semantic_similarity(
        generated,
        expected
    )

    # Evaluate with LLM
    llm_result = tester.llm_evaluate(
        generated,
        expected
    )

    assert sem_result["passed"], "Failed similarity check"
    assert llm_result["passed"], f"Failed LLM evaluation: {llm_result['analysis']}"
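In the snippets above, generate_text stands in for your own generation code; it is not part of LLM Test Mate. To make the example self-contained you could stub it, for instance:

def generate_text(prompt: str) -> str:
    # Placeholder for a call into your application or model.
    return "Python is a high-level, general-purpose programming language."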
LLM Test Mate provides comprehensive string similarity testing with multiple methods and configuration options:
- Basic Usage:
result = tester.string_similarity(
    "The quick brown fox jumps over the lazy dog!",
    "The quikc brown fox jumps over the lasy dog",  # Different punctuation and typos
    threshold=0.9
)
- Available Methods:
| Method | Best For | Description |
|---|---|---|
| damerau-levenshtein | General text | Handles transposed letters, good default choice |
| levenshtein | Simple comparisons | Basic edit distance |
| hamming | Equal-length strings | Counts position differences |
| jaro | Short strings | Good for typos in short text |
| jaro-winkler | Names | Optimized for name comparisons |
| indel | Subsequence matching | Based on longest common subsequence |
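To get a feel for how these metrics differ on the same input, you can loop over them with the tester instance from the basic usage example above. Scores are illustrative and depend on the texts; hamming is left out because it targets equal-length strings:

# Compare the available methods on one typo-laden pair.
for method in ["damerau-levenshtein", "levenshtein", "jaro", "jaro-winkler", "indel"]:
    result = tester.string_similarity("John Smith", "Jon Smyth", method=method)
    print(f"{method}: {result['similarity']:.2f}")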
- Configuration Options:
  - normalize_case: Convert to lowercase (default: True)
  - normalize_whitespace: Standardize spaces (default: True)
  - remove_punctuation: Ignore punctuation marks (default: True)
  - processor: Custom function for text preprocessing
  - threshold: Similarity threshold for pass/fail (0-1)
  - method: Choice of similarity metric
- Example Usage:
# Name comparison with Jaro-Winkler
result = tester.string_similarity(
    "John Smith",
    "Jon Smyth",
    method="jaro-winkler",
    threshold=0.8
)

# Text with custom preprocessing
def remove_special_chars(text: str) -> str:
    return ''.join(c for c in text if c.isalnum() or c.isspace())

result = tester.string_similarity(
    "Hello! @#$ World",
    "Hello World",
    processor=remove_special_chars,
    threshold=0.9
)

# Combined options
result = tester.string_similarity(
    "Hello, WORLD!",
    "hello world",
    method="damerau-levenshtein",
    normalize_case=True,
    normalize_whitespace=True,
    remove_punctuation=True,
    processor=remove_special_chars,
    threshold=0.9
)
- Result Dictionary:
{
    "similarity": 0.95,        # Similarity score (0-1)
    "distance": 0.05,          # Distance score (0-1)
    "method": "jaro-winkler",  # Method used
    "normalized": {            # Applied normalizations
        "case": True,
        "whitespace": True,
        "punctuation": True
    },
    "options": {               # Additional options
        "processor": "remove_special_chars"
    },
    "passed": True,            # If threshold was met
    "threshold": 0.9           # Threshold used
}
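In a test you would typically assert on passed and, on failure, surface the score, method, and threshold from this dictionary. A small sketch using the fields documented above:

result = tester.string_similarity("Hello, WORLD!", "hello world", threshold=0.9)
assert result["passed"], (
    f"{result['method']} similarity {result['similarity']:.2f} "
    f"is below the threshold of {result['threshold']}"
)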
def test_comprehensive_check(tester):
    generated = generate_text("Write a recipe")
    expected = """
    Recipe must include:
    - Ingredients list
    - Instructions
    - Cooking time
    """

    # Check similarity
    sem_result = tester.semantic_similarity(
        generated,
        expected
    )

    # Detailed evaluation
    llm_result = tester.llm_evaluate(
        generated,
        expected
    )

    assert sem_result["passed"], "Failed similarity check"
    assert llm_result["passed"], f"Failed LLM evaluation: {llm_result['analysis']}"
When running tests with LLM Test Mate, you get comprehensive results from two types of evaluations:
Semantic similarity result:
{
    "similarity": 0.85,                     # Similarity score between 0-1
    "embedding_model": "all-MiniLM-L6-v2",  # Model used for embeddings
    "passed": True,                         # Whether it passed the threshold
    "threshold": 0.8                        # The threshold used for this test
}
LLM evaluation result:
{
    "passed": True,       # Overall pass/fail assessment
    "similarity": 0.9,    # Semantic similarity assessment by LLM
    "analysis": {
        "semantic_match": "The texts convey very similar meanings...",
        "content_match": "Both texts cover the same key points...",
        "key_differences": [
            "Minor variation in word choice",
            "Slightly different emphasis on..."
        ]
    },
    "model_used": "anthropic.claude-3-5-sonnet-20240620-v1:0"  # Model used for evaluation
}
For custom evaluation criteria, the results will match your specified JSON structure. For example, with marketing evaluation:
{
    "passed": True,
    "effectiveness_score": 0.85,
    "analysis": {
        "feature_coverage": "All key features mentioned...",
        "tone_analysis": "Professional and engaging...",
        "suggestions": [
            "Consider emphasizing battery life more",
            "Add specific camera capabilities"
        ]
    },
    "model_used": "meta.llama3-2-90b-instruct-v1:0"
}
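Since you define the JSON structure in the criteria, your assertions reference your own keys. For the marketing example above, that might look like this (the 0.7 cutoff is arbitrary):

# Keys come from the custom criteria supplied earlier, not a fixed schema.
assert eval_result["passed"]
assert eval_result["effectiveness_score"] >= 0.7, eval_result["analysis"]["suggestions"]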
When using both approaches together, you get:
- Quantitative similarity metrics from embedding comparison
- Qualitative content evaluation from LLM analysis
- Model-specific insights (can compare different LLM evaluations)
- Clear pass/fail indicators for automated testing
- Detailed feedback for manual review
This comprehensive approach helps ensure both semantic closeness to reference content and qualitative correctness of the generated output.
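One way to wire the two checks together is a small helper built on the APIs above. This is a sketch; generate_text again stands in for your own generation code, and the tester fixture is the one shown in the pytest example earlier:

def check_llm_output(tester, generated: str, reference: str) -> dict:
    """Run both checks and collect quantitative and qualitative results."""
    sem = tester.semantic_similarity(generated, reference)
    judge = tester.llm_evaluate(generated, reference)
    return {
        "semantic_passed": sem["passed"],
        "semantic_similarity": sem["similarity"],
        "judge_passed": judge["passed"],
        "judge_analysis": judge["analysis"],
    }

def test_product_summary(tester):
    report = check_llm_output(
        tester,
        generate_text("Summarize our product in one sentence"),
        "Our product helps you test LLM outputs effectively.",
    )
    assert report["semantic_passed"] and report["judge_passed"], report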
The simplest way to add LLM Test Mate to your project is to copy the llm_test_mate.py file:
- Copy llm_test_mate.py to your project's test directory
- Add the required dependencies to your requirements.txt file:
- litellm
- sentence-transformers
- boto3
- pytest
- rapidfuzz
- Install the dependencies:
pip install -r requirements.txt
Typical integration into an existing project:
your_project/
├── src/
│   └── your_code.py
├── tests/
│   ├── llm_test_mate.py    # Copy the file here
│   ├── your_test_file.py   # Your LLM tests
│   └── conftest.py         # Pytest fixtures
├── requirements.txt        # Add dependencies here
└── pytest.ini              # Optional pytest configuration
Example conftest.py:
import pytest
from llm_test_mate import LLMTestMate

@pytest.fixture
def llm_tester():
    return LLMTestMate(
        similarity_threshold=0.8,
        temperature=0.7
    )

@pytest.fixture
def strict_llm_tester():
    return LLMTestMate(
        similarity_threshold=0.9,
        temperature=0.5
    )
Example test file:
def test_product_description(llm_tester):
    expected = "Our product helps you test LLM outputs effectively."
    generated = your_llm_function("Describe our product")

    result = llm_tester.semantic_similarity(generated, expected)
    assert result['passed'], f"Generated text not similar enough: {result['similarity']}"
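The strict_llm_tester fixture from the conftest.py above works the same way when a tighter threshold is warranted; a brief sketch (the prompt and expected text are illustrative):

def test_critical_copy(strict_llm_tester):
    expected = "Refunds are processed within 5 business days."
    generated = your_llm_function("State our refund processing time")

    result = strict_llm_tester.semantic_similarity(generated, expected)
    assert result['passed'], f"Similarity {result['similarity']:.2f} is below the 0.9 threshold"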
Contributions are welcome! Please feel free to submit a Pull Request.
Distributed under the MIT License. See LICENSE for more information.
- Built with LiteLLM
- Uses sentence-transformers
- String similarity powered by RapidFuzz