A lightweight, local evaluation framework for biodiversity data AI tasks. This is an experimental codelab project for learning and inviting collaboration.
Evals are like tests, but for AI-powered apps, particularly LLM-powered applications. They help verify that your app is working as expected. While conventional tests typically return a simple pass or fail, evals produce a performance score that reflects how well your app is doing. The three-component structure (data, task, scorers) used in WildEval is directly inspired by Braintrust's evaluation framework.
- Simple Test Definition: Use Python decorators to define evaluation cases
- Flexible Scoring: Leverage autoevals' comprehensive evaluation metrics or create custom scorers. The autoevals suite (part of Braintrust) offers feature-rich scorers for string similarity, numeric comparison, JSON structure, and even LLM-based evaluation.
- Extensible: Easy to add custom scorers and evaluation tasks
- Local First: Designed to run locally without external dependencies
```bash
git clone <your-repo-url>
cd wild-eval
```

This project uses uv for dependency management:

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# If uv sync doesn't work, you may need to install dependencies manually:
uv pip install autoevals requests openai
```

Download the spaCy model used by the example evals:

```bash
python -m spacy download en_core_web_sm
```

Then run the evaluations:

```bash
python run_eval.py
```

Scorers are functions that evaluate how well your AI task's output matches the expected result. They return a score (typically between 0 and 1, or 0 and 100) and optional metadata.
Good evaluation metrics are crucial for:
- Comparing different AI models on the same task
- Tracking improvements as you iterate on your models
- Understanding failure modes and edge cases
- Building confidence in your AI systems
WildEval has very simple example evals in the evals/ directory. Their purpose is just to demonstrate the framework and help you get started. Two interesting examples:
@eval_case("Redact Places with Spacy")- this eval evaluates a task that uses spaCy for detecting and redacting place names in biodiversity data (it's very basic though!) It's in evals/eval_simple_examples.py.@eval_case("Redact Places with OpenAI")- this eval evaluates a task that uses OpenAI to redact place names in biodiversity data. It also uses an LLM to evaluate the output. It's in evals/eval_openai_simple_example.py.
Create a new Python file in the evals/ directory with the naming pattern eval_*.py:
```python
# evals/eval_my_evaluation.py
import requests

from eval_framework import eval_case
from autoevals import Levenshtein


def nbn_name_match(name: str) -> str:
    """Resolve a name via the NBN Atlas name-matching API."""
    response = requests.get(
        "https://namematching.nbnatlas.org/api/search", params={"q": name}
    )
    data = response.json()
    if data.get("matchType") == "EXACT" or "scientificName" in data:
        # Fall back to "" if the response has no scientificName field
        return data.get("scientificName", "")
    return ""


@eval_case("NBN Atlas Name Resolution")
def test_name_resolution():
    return {
        "data": lambda: [
            {"input": "Bumblebee", "expected": "Bombus"},
            {"input": "red fox", "expected": "Vulpes vulpes"},
        ],
        "task": nbn_name_match,
        "scorers": [Levenshtein()],
    }
```

Each eval has three components:
- data: A function that returns a list of test cases with `input` and `expected` values
- task: A function that processes the input and returns the output
- scorers: A list of scoring functions that compare output to expected results
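To make the contract between these three components concrete, here is a rough sketch of what the framework does with an eval definition. This is illustrative only, assuming a simple synchronous loop; the actual logic in run_eval.py may differ (reporting, error handling, etc.).

```python
# Illustrative sketch only: roughly how data, task and scorers fit together.
# The real loop in run_eval.py may differ in its details.
def run_single_eval(eval_definition):
    cases = eval_definition["data"]()   # list of {"input": ..., "expected": ...}
    task = eval_definition["task"]
    scorers = eval_definition["scorers"]

    results = []
    for case in cases:
        output = task(case["input"])    # run the AI task on this input
        for scorer in scorers:
            # each scorer compares the task output to the expected value
            results.append(scorer(output, case["expected"]))
    return results
```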
The framework is designed to work seamlessly with autoevals, a comprehensive library of evaluation metrics. Autoevals provides many sophisticated scorers out of the box:
- String Similarity: `autoevals.Levenshtein`, `autoevals.StringSimilarity`
- JSON Comparison: JSON validation and structural comparison
- Semantic Similarity: `autoevals.SemanticSimilarity`, `autoevals.EmbeddingSimilarity`
- LLM-based Evaluation: `autoevals.LLMScorer`, `autoevals.Faithfulness`
- Check the autoevals documentation for the full list
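autoevals scorers can also be called directly, which is handy for trying a metric out before wiring it into an eval. A minimal sketch using Levenshtein (string edit distance); the exact fields on the returned object may vary slightly between autoevals versions:

```python
from autoevals import Levenshtein

scorer = Levenshtein()
result = scorer(output="Bombus terrestris", expected="Bombus")

# autoevals scorers return a Score object with a numeric score (0-1)
# and, for some scorers, extra metadata.
print(result.score)
print(result.metadata)
```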
The framework could eventually include domain-specific scorers tailored to biodiversity tasks. For now, we’ve implemented a very simple example focused on location redaction, primarily to demonstrate the concept.
- `GazetteerMatchScorer`: Checks whether specific location names are present or absent in text (useful for location redaction tasks)
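The idea is simple enough to sketch. The class below is illustrative only and is not the implementation shipped in this repo; the name `SimpleGazetteerScorer` and its return format are just examples following the custom-scorer convention shown below.

```python
class SimpleGazetteerScorer:
    """Illustrative sketch of a gazetteer-style scorer: 1.0 if none of the
    given place names survive in the output, 0.0 if any are still present."""

    def __init__(self, place_names):
        self.place_names = [p.lower() for p in place_names]

    def __call__(self, output, expected=None):
        leaked = [p for p in self.place_names if p in output.lower()]
        return {
            "score": 0.0 if leaked else 1.0,
            "metadata": {"leaked_place_names": leaked},
        }
```

For example, `SimpleGazetteerScorer(["Dartmoor", "Devon"])` would give a redacted sentence a score of 1.0 only if neither place name still appears in it.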
You can create your own scorers by implementing a callable that returns a score:
```python
class MyCustomScorer:
    def __call__(self, output, expected):
        # Your scoring logic here, e.g. a simple exact-match check
        score = 1.0 if output == expected else 0.0
        return {
            "score": score,
            "metadata": {"custom_info": "additional data"},
        }
```

For evaluations that use external APIs (like OpenAI), set your API keys:
```python
import openai

openai.api_key = "your-api-key-here"
```

Or use environment variables (the openai library picks this up automatically):

```bash
export OPENAI_API_KEY="your-api-key-here"
```

Run all evaluations:

```bash
python run_eval.py
```

List the available evaluations:

```bash
python run_eval.py --list
```

Run a single evaluation by name:

```bash
python run_eval.py --name "Simple Taxon Name Fix"
```

Run evaluations whose names match a filter:

```bash
python run_eval.py --filter "redact"  # Run all redaction-related evaluations
python run_eval.py --filter "name"    # Run all name-related evaluations
```

See all options:

```bash
python run_eval.py --help
```

Requirements:

- Python 3.11+
- uv (package manager)
- autoevals (comprehensive evaluation metrics library)
- spaCy (for NLP features)
- requests (for API calls)
- openai (for OpenAI API integration)
autoevals is a comprehensive library of evaluation metrics developed by Braintrust, used by companies like Anthropic, OpenAI, and others in production environments. It provides standardized, well-tested evaluation functions that make it easy to assess AI model performance:
- String Similarity: Levenshtein distance, string similarity, exact matching
- JSON Comparison: JSON validation and structural comparison
- Semantic Similarity: Embedding-based similarity, semantic comparison
- LLM-based Evaluation: Using other LLMs to evaluate outputs (faithfulness, relevance, etc.)
- Custom Metrics: Framework for building domain-specific evaluation functions
- Production Ready: Used in real-world AI applications and thoroughly tested
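As a hedged example of the LLM-based category, autoevals ships a Factuality scorer that uses an LLM as a judge. It needs an OpenAI API key and network access, so it sits outside the local-first happy path; treat this as a sketch rather than a recommended default:

```python
from autoevals import Factuality

# LLM-as-judge scorer: asks a model whether the output is factually
# consistent with the expected answer, given the original input.
# Requires OPENAI_API_KEY to be set in the environment.
scorer = Factuality()
result = scorer(
    input="What is the scientific name of the red fox?",
    output="The red fox is Vulpes vulpes.",
    expected="Vulpes vulpes",
)
print(result.score)
```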
spaCy is an open-source Python library for fast, industrial-strength Natural Language Processing (NLP). WildEval uses spaCy in a very simple example to detect and redact place names in biodiversity text; it's just enough to demonstrate what's possible. While spaCy is great for general-purpose NLP, many biodiversity tasks will benefit more from domain-specific models, such as those available on Hugging Face.
spaCy provides tools like:
- Named Entity Recognition (NER): Detect people, organizations, locations, dates, etc.
- Tokenization: Split text into words, sentences, etc.
- Dependency parsing: Understand grammatical relationships
- Text classification: Categorize text content
- Similarity analysis: Compare text meaning using word vectors
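As a small illustration of the NER piece (a sketch in the spirit of the place-redaction example, not the exact code in evals/):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Bombus terrestris observed near Dartmoor National Park, Devon.")
for ent in doc.ents:
    # GPE/LOC/FAC labels are the usual candidates for place names to redact
    if ent.label_ in {"GPE", "LOC", "FAC"}:
        print(ent.text, ent.label_)
```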
In biodiversity data contexts, spaCy could help with tasks like:
- Extracting species names from text descriptions
- Identifying geographic locations in observation records
- Parsing taxonomic hierarchies
- Analyzing field notes and descriptions