Description
Problem
Several "qualitative" tests in the suite are flaky because they validate LLM outputs with rigid comparisons (string matching, exact patterns). Generative models can rephrase or slightly shift tone between runs, causing tests to fail even when the response is functionally correct.
Related issues:
- #398 – test_kv is flaky due to model safety refusals instead of producing the expected output
- #384 – the langchain_messages.py test is flaky because its qualitative assertion is too strict
Both seem to share the same root cause: applying traditional, exact-match testing approaches to non-deterministic outputs.
(Following up on a brief Slack discussion with @psschwei, who suggested opening this issue 😄)
Proposal
One possible direction, from my perspective, could be a lightweight pytest-native semantic assertion layer that validates by meaning instead of exact matching. This would stay aligned with Mellea's existing LLM-as-a-judge approach, integrating it into pytest in a more "Pythonic" way, without introducing heavier evaluation frameworks.
The idea separates two validation dimensions:
- Content(response): whether the model generated the correct information
- Behaviour(response): whether the response behaves as expected (e.g. avoiding unexpected safety refusals)
Example 1

```python
def test_kv_store_operations():
    response = m.instruct("Explain key-value store operations")

    # Content validation
    assert "key-value operations like get, set, delete" in Content(response)
    assert "unrelated topics" not in Content(response)

    # Behaviour validation
    assert "safety refusal or content rejection" not in Behaviour(response)
    assert "informative and direct" in Behaviour(response)
```

Example 2

```python
def test_langchain_messages():
    response = m.instruct("Format this as a langchain message")

    assert "a properly structured message" in Content(response)
    assert "clear and well-formatted" in Behaviour(response)
```

Proof of Concept
I put together a small PoC to explore this idea using ibm-granite/granite-embedding-30m-english for the semantic similarity layer, integrated as a pytest plugin.
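To make the mechanism concrete, the core of a `Content` wrapper can be sketched in plain Python: overriding `__contains__` lets the `in` operator run a similarity check instead of substring matching, so the assertions read naturally in pytest. Everything below is illustrative, not the actual PoC code: the PoC embeds with granite-embedding-30m-english, while this sketch substitutes a toy bag-of-words embedder (and a correspondingly looser threshold) so it runs with the standard library alone.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Callable

# Toy embedder: bag-of-words token counts. The real PoC would swap in
# dense embeddings from ibm-granite/granite-embedding-30m-english.
def bow_embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Content:
    """Wraps a response so `expected in Content(response)` checks meaning."""
    response: str
    threshold: float = 0.35  # bag-of-words needs a looser cutoff than embeddings
    embed: Callable[[str], Counter] = bow_embed

    def __contains__(self, expected: str) -> bool:
        return cosine(self.embed(expected), self.embed(self.response)) >= self.threshold

response = "Redis is an in-memory key-value store used for caching."
assert "key-value store" in Content(response)       # semantically present
assert "Alan Turing" not in Content(response)       # semantically absent
```

A `Behaviour` wrapper could follow the same shape, differing only in what it embeds or which judge it consults; keeping both behind `__contains__` is what makes the assertions look like ordinary pytest code.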
The screenshots below illustrate how the same test behaves when the expected intent is semantically present versus when it is not:
```python
import pytest
from pytest_semantic_assert import Content

def test_exact_phrase_passes():
    response = "Redis is an in-memory key-value store used for caching."
    assert "key-value store" in Content(response)  # ✅ passes
```

```python
import pytest
from pytest_semantic_assert import Content

def test_semantic_mismatch_fails():
    response = "Redis is an in-memory key-value store used for caching."
    assert "Alan Turing" in Content(response)  # ❌ fails – similarity 0.51, below threshold 0.75
```

Semantic match (passes):
No semantic match (fails):
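On the "integrated as a pytest plugin" point: pytest discovers third-party plugins through the `pytest11` entry-point group, so the integration is mostly packaging. A hedged sketch of the metadata follows; the package and module names here are assumptions chosen to match the `pytest_semantic_assert` import above, not the PoC's actual layout.

```toml
# pyproject.toml sketch – names are assumptions, not the actual PoC layout
[project]
name = "pytest-semantic-assert"
version = "0.1.0"
dependencies = ["sentence-transformers"]

# pytest auto-loads any module registered under the `pytest11` entry point
[project.entry-points.pytest11]
semantic_assert = "pytest_semantic_assert.plugin"
```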
Context
I looked at evaluation frameworks like DeepEval and similar tools, but they can introduce additional complexity for what could be a more lightweight, testing-oriented helper. This approach aims to stay minimal and aligned with Mellea's existing philosophy.
Open questions
- From your perspective, does this direction make sense for how qualitative testing is currently approached in Mellea?
- Are you already exploring similar approaches internally for handling flaky LLM-based tests?
- Do you think there's a broader need in the Python/GenAI ecosystem for lightweight semantic testing helpers?
Happy to refine the approach or try applying it to one of the existing flaky tests if that's useful. Thanks in advance for your time!