A comprehensive evaluation framework for assessing Large Language Models (LLMs) on human rights scenarios. This tool evaluates LLMs from multiple providers (OpenAI, Anthropic, Google, and Qwen) on their understanding and reasoning about human rights violations, obligations, and remedies.
The evaluation tool processes scenarios from the Human Rights Benchmark dataset and evaluates LLM responses across different question types:
- Multiple Choice Questions (MCQ): Single-choice questions with letter answers (A, B, C, etc.)
- Ranking Questions (R): Questions requiring ordered responses (e.g., "A,B,C")
- Short Answer Questions (P): Open-ended questions requiring detailed responses
- ✅ Multi-provider support (OpenAI, Anthropic, Google, Qwen)
- ✅ Structured output using Pydantic models
- ✅ Automatic scoring for MCQ and ranking questions
- ✅ JSON response parsing
- ✅ Comprehensive evaluation metrics and reporting
- ✅ CSV export of results
- ✅ Dry-run mode for testing prompts
- ✅ Configurable temperature and token limits
- ✅ Progress tracking and error handling
- Python 3.9 or higher
- pip package manager
```bash
pip install -r requirements.txt
```

Key files:

- `scenarios.json` - The benchmark dataset containing human rights scenarios
- `model_choices.py` - LLM client wrapper for different providers
- `eval.py` - Main evaluation script
Set up your API keys for the providers you want to use:
```bash
# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Google
export GOOGLE_API_KEY="..."
```

Alternatively, pass API keys directly on the command line with `--api_key`.
```bash
python eval.py --provider <provider> --model <model_name>
```

| Argument | Required | Default | Description |
|---|---|---|---|
| `--provider` | Yes | - | LLM provider: `openai`, `anthropic`, `google`, or `qwen` |
| `--model` | Yes | - | Model name (e.g., `gpt-4o`, `claude-3-5-sonnet-20241022`) |
| `--temperature` | No | 0.0 | Sampling temperature (0.0 = deterministic, 1.0 = creative) |
| `--max-tokens` | No | 1000 | Maximum tokens in the response |
| `--api_key` | No | None | API key (overrides environment variables) |
| `--dry-run` | No | False | Print prompts without calling the API (for testing) |
| `--limit` | No | None | Limit the number of evaluations to run |
```bash
# Dry run on the first 5 questions (no API calls)
python eval.py --provider openai --model gpt-4o --dry-run --limit 5

# Pass the API key on the command line
python eval.py --provider openai --model gpt-4o --api_key "sk-..."

# Anthropic with a higher temperature
python eval.py --provider anthropic --model claude-3-5-sonnet-20241022 --temperature 0.5

# Google, limited to 10 evaluations
python eval.py --provider google --model gemini-pro --limit 10

# Custom temperature and token limit
python eval.py --provider openai --model gpt-4o --temperature 0.7 --max-tokens 2000
```

The script displays real-time progress, including:
- Current evaluation progress (e.g., "Progress: 5/184")
- Question ID and type
- Full prompt sent to the LLM
- LLM response (raw JSON or text)
- Correct answer
- Score (1.0 for correct, 0.0 for incorrect, -1 for manual evaluation)
Results are saved to a CSV file named:
`eval_results_{provider}_{model}.csv`
Each row contains the following columns:

- `scenario_id` - Unique scenario identifier
- `scenario_text` - Full scenario description
- `tags` - Scenario categorization tags
- `difficulty` - Difficulty level (easy, medium, hard)
- `subscenario_id` - Sub-scenario identifier
- `subscenario_text` - Specific sub-scenario description
- `question_id` - Question type (MCQ letters, R for ranking, P for short answer)
- `question_text` - The actual question
- `answer_choices` - Available answer options
- `correct_answer` - Ground truth answer
- `llm_response` - LLM's response (JSON or text)
- `score` - Evaluation score
At the end of evaluation, the script prints:
- Average score (percentage)
- Total questions evaluated
- Number of correct answers
- Number of incorrect answers
Example:
```
================================================================================
Results saved to: eval_results_openai_gpt-4o.csv
Average Score: 76.32%
Total Questions: 152
Correct: 116
Incorrect: 36
================================================================================
```
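If you want to re-derive these numbers from a results CSV after a run, here is a minimal sketch using pandas (the filename is illustrative; the column names follow the list above, and rows scored -1 are excluded because they require manual evaluation):

```python
# Recompute the summary statistics from a results CSV (filename is illustrative).
import pandas as pd

df = pd.read_csv("eval_results_openai_gpt-4o.csv")

scored = df[df["score"] >= 0]                  # drop short-answer rows (score == -1)
correct = int((scored["score"] == 1.0).sum())
total = len(scored)

print(f"Average Score: {100 * scored['score'].mean():.2f}%")
print(f"Total Questions: {total}")
print(f"Correct: {correct}")
print(f"Incorrect: {total - correct}")
```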
Questions with one correct answer from options A-H.
Example:
```
Question: Did the state violate any human rights obligations?
Options: A. Obligation to respect
         B. Obligation to protect
         C. No obligation violated
Correct Answer: B
```
Questions requiring a specific order of answers.
Example:
```
Question: Rank these obligations from most to least violated
Correct Answer: A,B,C,D
```
Open-ended questions requiring detailed responses (manual evaluation required).
Example:
```
Question: List up to 10 possible actions the state could take...
Score: -1 (requires manual evaluation)
```
- 1.0: Correct answer
- 0.0: Incorrect answer
- -1: Requires manual evaluation (short answer questions)
- JSON Parsing: Extracts the `answer_choice` or `answer` field from the structured output
- Single Choice: Exact match with the correct answer
- Multiple Choice/Ranking: Set comparison; all choices must match
- Fallback: Plain-text parsing if JSON parsing fails
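As a rough illustration of those rules (not the actual implementation in `eval.py`, which may differ in details):

```python
# Illustrative scoring sketch following the rules above; eval.py may differ.
import json

def score_response(llm_response: str, correct_answer: str, question_id: str) -> float:
    if question_id == "P":                       # short answer: manual evaluation
        return -1.0
    try:
        payload = json.loads(llm_response)       # structured output
        answer = str(payload.get("answer_choice") or payload.get("answer") or "")
    except json.JSONDecodeError:
        answer = llm_response                    # fallback: plain-text parsing
    answer, correct = answer.strip().upper(), correct_answer.strip().upper()
    if "," in correct:                           # multiple choice / ranking: set comparison
        given = {a.strip() for a in answer.split(",")}
        expected = {c.strip() for c in correct.split(",")}
        return 1.0 if given == expected else 0.0
    return 1.0 if answer == correct else 0.0     # single choice: exact match
```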
The evaluation uses Pydantic models for structured responses:
MCQ response:

```json
{
  "answer_choice": "B",
  "explanation": "The state violated..."
}
```

Ranking response:

```json
{
  "answer_choice": "A,B,C",
  "explanation": "Multiple violations..."
}
```

Short answer response:

```json
{
  "answer": "Detailed response..."
}
```
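For reference, a minimal sketch of what such models might look like (class names are illustrative; see the top of `eval.py` for the actual definitions):

```python
# Illustrative Pydantic models matching the JSON shapes above; class names are assumptions.
from pydantic import BaseModel

class MCQResponse(BaseModel):
    answer_choice: str   # single letter, e.g. "B"
    explanation: str

class RankingResponse(BaseModel):
    answer_choice: str   # comma-separated order, e.g. "A,B,C"
    explanation: str

class ShortAnswerResponse(BaseModel):
    answer: str          # free-text response
```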
OpenAI:
- Uses `beta.chat.completions.parse()` for native structured outputs
- Supports all Pydantic response formats
- Falls back to JSON mode if the beta API is unavailable
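A minimal sketch of that pattern with the OpenAI Python SDK (model name, prompt, and Pydantic class are illustrative; the real call lives in `model_choices.py`):

```python
# Native structured output via the OpenAI SDK's parse helper.
from openai import OpenAI
from pydantic import BaseModel

class MCQResponse(BaseModel):
    answer_choice: str
    explanation: str

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Answer the question ..."}],
    response_format=MCQResponse,
)
parsed = completion.choices[0].message.parsed  # an MCQResponse instance
```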
Anthropic:
- Uses the tool-calling mechanism for structured outputs
- Tool spec generated from the Pydantic schema
- Extracts the answer from tool-use blocks
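A rough sketch of the same idea with the Anthropic SDK (the tool name, model, and prompt are illustrative; the exact wiring in `model_choices.py` may differ):

```python
# Structured output via Anthropic tool calling; the tool spec comes from the Pydantic schema.
from anthropic import Anthropic
from pydantic import BaseModel

class MCQResponse(BaseModel):
    answer_choice: str
    explanation: str

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    tools=[{
        "name": "record_answer",
        "description": "Record the structured answer.",
        "input_schema": MCQResponse.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "record_answer"},
    messages=[{"role": "user", "content": "Answer the question ..."}],
)
answer = next(b.input for b in response.content if b.type == "tool_use")  # dict of fields
```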
Google and Qwen:
- JSON schema instructions added to the prompt
- Manual parsing of JSON responses
- May require prompt engineering for consistency
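For Google, one way this can look with the google-generativeai SDK (the prompt wording and model name are illustrative; `model_choices.py` may phrase the schema instructions differently):

```python
# JSON-in-prompt approach: append the schema to the prompt and parse the reply manually.
import json
import os

import google.generativeai as genai
from pydantic import BaseModel

class MCQResponse(BaseModel):
    answer_choice: str
    explanation: str

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")

schema = json.dumps(MCQResponse.model_json_schema())
prompt = f"Answer the question ...\n\nRespond only with JSON matching this schema:\n{schema}"
raw = model.generate_content(prompt).text
parsed = MCQResponse.model_validate_json(raw)  # may need extra cleanup if the model adds code fences
```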
Problem: Error calling anthropic: Connection error
Solutions:
- Check API key is set correctly
- Verify internet connection
- Check firewall/proxy settings
- Ensure API provider services are operational
- Verify account has sufficient credits
Problem: Scoring fails due to unparseable responses
Solutions:
- Check the `llm_response` column in the CSV for the actual response format
- Lower the temperature for more consistent outputs
- Update the scoring logic to handle provider-specific formats
Problem: ModuleNotFoundError
Solution:
```bash
pip install -r requirements.txt
```

The repository is organized as follows:

```
HumanRightsBench/
├── eval.py              # Main evaluation script
├── model_choices.py     # LLM client wrapper
├── scenarios.json       # Benchmark dataset
├── requirements.txt     # Python dependencies
├── README.md            # This file
└── eval_results_*.csv   # Evaluation results (generated)
```
To add support for a new LLM provider:
- Update the `ModelProvider` enum in `model_choices.py`
- Add initialization logic in `_initialize_client()`
- Implement a provider-specific call method (e.g., `_call_newprovider()`)
- Update the `call_llm()` method to route to the new provider
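A rough skeleton of those steps (only the names listed above come from the codebase; the surrounding class structure is assumed and will not match `model_choices.py` exactly):

```python
# Hypothetical skeleton for wiring in a new provider; see model_choices.py for the real class.
from enum import Enum

class ModelProvider(str, Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GOOGLE = "google"
    QWEN = "qwen"
    NEWPROVIDER = "newprovider"           # step 1: extend the enum

class LLMClient:
    def __init__(self, provider: ModelProvider, model: str, api_key: str):
        self.provider, self.model, self.api_key = provider, model, api_key
        self._initialize_client()

    def _initialize_client(self) -> None:
        if self.provider == ModelProvider.NEWPROVIDER:
            self.client = None             # step 2: construct the provider's SDK client here

    def _call_newprovider(self, prompt: str) -> str:
        raise NotImplementedError          # step 3: provider-specific request/response handling

    def call_llm(self, prompt: str) -> str:
        if self.provider == ModelProvider.NEWPROVIDER:
            return self._call_newprovider(prompt)   # step 4: route to the new provider
        raise ValueError(f"Unsupported provider: {self.provider}")
```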
Prompts are defined in eval.py around lines 169-214:
- `prompt_mcq` - For multiple choice questions
- `prompt_short_answer` - For short answer questions
- `prompt_ranking` - For ranking questions
Add new Pydantic models at the top of eval.py and update the response format assignment logic around lines 220-234.
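As a hypothetical illustration of that assignment logic (the real code around lines 220-234 of `eval.py` may be organized differently; the model classes here are the illustrative ones sketched earlier):

```python
# Hypothetical mapping from question type to response model; names are assumptions.
from pydantic import BaseModel

class MCQResponse(BaseModel):
    answer_choice: str
    explanation: str

class RankingResponse(BaseModel):
    answer_choice: str
    explanation: str

class ShortAnswerResponse(BaseModel):
    answer: str

def response_format_for(question_id: str) -> type[BaseModel]:
    if question_id == "R":                # ranking questions
        return RankingResponse
    if question_id == "P":                # short-answer questions
        return ShortAnswerResponse
    return MCQResponse                    # MCQ question IDs are single letters
```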
Contributions are welcome! Areas for improvement:
- Additional LLM provider support
- Enhanced scoring algorithms
- Better error handling and retry logic
- Batch processing optimization
- Multi-language support
[Add your license information here]
If you use this evaluation tool in your research, please cite:
[Add citation information here]

[Add contact information here]
This evaluation framework is part of the Human Rights Benchmark project, designed to assess LLM capabilities in understanding and reasoning about human rights issues.