Description
Problem
Several "qualitative" tests in the suite are flaky because they validate LLM outputs with rigid comparisons (string matching, exact patterns). Generative models can rephrase or slightly shift tone between runs, causing tests to fail even when the response is functionally correct.
Related issues:
- #398 – test_kv is flaky due to model safety refusals instead of producing the expected output
- #384 – the langchain_messages.py test is flaky because its qualitative assertion is too strict
Both seem to share the same root cause: applying traditional, exact-match testing approaches to non-deterministic outputs.
(Following up on a brief Slack discussion with @psschwei, who suggested opening this issue 😄)
Proposal
One possible direction, from my perspective, could be a lightweight pytest-native semantic assertion layer that validates by meaning instead of exact matching. This would stay aligned with Mellea's existing LLM-as-a-judge approach, integrating it into pytest in a more "Pythonic" way, without introducing heavier evaluation frameworks.
The idea separates two validation dimensions:
- Content(response): whether the model generated the correct information
- Behaviour(response): whether the response behaves as expected (e.g. avoiding unexpected safety refusals)
Example 1

```python
def test_kv_store_operations():
    response = m.instruct("Explain key-value store operations")

    # Content validation
    assert "key-value operations like get, set, delete" in Content(response)
    assert "unrelated topics" not in Content(response)

    # Behaviour validation
    assert "safety refusal or content rejection" not in Behaviour(response)
    assert "informative and direct" in Behaviour(response)
```

Example 2

```python
def test_langchain_messages():
    response = m.instruct("Format this as a langchain message")

    assert "a properly structured message" in Content(response)
    assert "clear and well-formatted" in Behaviour(response)
```

Proof of Concept
I put together a small PoC to explore this idea using ibm-granite/granite-embedding-30m-english for the semantic similarity layer, integrated as a pytest plugin.
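To make the mechanism concrete, the core of a `Content` wrapper can be sketched in plain Python: overriding `__contains__` lets the `in` operator run a similarity check instead of substring matching, so the assertions read naturally in pytest. Everything below is illustrative, not the actual PoC code: the PoC embeds with granite-embedding-30m-english, while this sketch substitutes a toy bag-of-words embedder (and a correspondingly looser threshold) so it runs with the standard library alone.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Callable

# Toy embedder: bag-of-words token counts. The real PoC would swap in
# dense embeddings from ibm-granite/granite-embedding-30m-english.
def bow_embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Content:
    """Wraps a response so `expected in Content(response)` checks meaning."""
    response: str
    threshold: float = 0.35  # bag-of-words needs a looser cutoff than embeddings
    embed: Callable[[str], Counter] = bow_embed

    def __contains__(self, expected: str) -> bool:
        return cosine(self.embed(expected), self.embed(self.response)) >= self.threshold

response = "Redis is an in-memory key-value store used for caching."
assert "key-value store" in Content(response)       # semantically present
assert "Alan Turing" not in Content(response)       # semantically absent
```

A `Behaviour` wrapper could follow the same shape, differing only in what it embeds or which judge it consults; keeping both behind `__contains__` is what makes the assertions look like ordinary pytest code.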
The screenshots below illustrate how the same test behaves when the expected intent is semantically present versus when it is not:
```python
import pytest
from pytest_semantic_assert import Content

def test_exact_phrase_passes():
    response = "Redis is an in-memory key-value store used for caching."
    assert "key-value store" in Content(response)  # ✅ passes
```

```python
import pytest
from pytest_semantic_assert import Content

def test_semantic_mismatch_fails():
    response = "Redis is an in-memory key-value store used for caching."
    assert "Alan Turing" in Content(response)  # ❌ fails – similarity 0.51, below threshold 0.75
```

Semantic match (passes):
No semantic match (fails):
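On the "integrated as a pytest plugin" point: pytest discovers third-party plugins through the `pytest11` entry-point group, so the integration is mostly packaging. A hedged sketch of the metadata follows; the package and module names here are assumptions chosen to match the `pytest_semantic_assert` import above, not the PoC's actual layout.

```toml
# pyproject.toml sketch – names are assumptions, not the actual PoC layout
[project]
name = "pytest-semantic-assert"
version = "0.1.0"
dependencies = ["sentence-transformers"]

# pytest auto-loads any module registered under the `pytest11` entry point
[project.entry-points.pytest11]
semantic_assert = "pytest_semantic_assert.plugin"
```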
Context
I looked at evaluation frameworks like DeepEval and similar tools, but they can introduce additional complexity for what could be a more lightweight, testing-oriented helper. This approach aims to stay minimal and aligned with Mellea's existing philosophy.
Open questions
- From your perspective, does this direction make sense for how qualitative testing is currently approached in Mellea?
- Are you already exploring similar approaches internally for handling flaky LLM-based tests?
- Do you think there's a broader need in the Python/GenAI ecosystem for lightweight semantic testing helpers?
Happy to refine the approach or try applying it to one of the existing flaky tests if that's useful. Thanks in advance for your time!