A framework for evaluating confidence-augmented systems, built on PyTorch Lightning. It supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.

| Benchmark | Task type | Semantic‑equivalence evaluation needed? |
|---|---|---|
| TriviaQA | Open‑ended QA | Yes — uses an evaluator LLM |
| MMLU | Multiple‑choice | No — exact match on answer letter |
| Classification | Image/Audio/Text Classification | No — exact match on class label |

| Metric | Description |
|---|---|
| ECE | Expected Calibration Error (L1, 10 bins) |
| AUC | Area Under the ROC Curve |
| BS | Brier Score (MSE between confidence and correctness) |
| CE | Binary Cross‑Entropy |
| n‑ECUAS | Expected Cost for Uncertainty-Augmented Systems (parameterised by n = 0, 1, 2, …) |
| γ‑ECUAS | Gamma‑ECUAS — selective prediction at operating point γ |
| AURC | Area Under the Risk‑Coverage curve |
| FPR@95 | False Positive Rate at 95% recall |
| Error Rate | Overall prediction error rate |
| LogLog | LogLog Score (Classification) |
| NER / NBS / NCE | Normalized versions of Error Rate, Brier Score, and Cross-Entropy |
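To make two of the table entries concrete, here is a minimal pure-Python sketch of the Brier Score and an L1 ECE with 10 equal-width bins. The function names, the binning convention (confidence 1.0 goes into the last bin), and the toy data are assumptions made for illustration; they are not taken from `callm/metrics/confidences.py`, which may differ in detail.

```python
def brier_score(confidences, correct):
    """Mean squared error between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """L1 ECE: weighted mean of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Assumed convention: a confidence of exactly 1.0 falls into the last bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data: four predictions with a confidence and a 0/1 correctness flag.
confidences = [0.95, 0.85, 0.65, 0.35]
correct = [1, 1, 0, 0]

bs = brier_score(confidences, correct)
ece = expected_calibration_error(confidences, correct)
print(f"BS={bs:.4f} ECE={ece:.4f}")
```

Both scores are lower-is-better: a perfectly calibrated and perfectly confident system would score 0 on each.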
uv sync
cp .env.example .env

Then edit .env:

HF_TOKEN=your_huggingface_token_here # needed for gated models (e.g. Llama, Mistral)
GOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json # needed for GCP / Gemini models

Run the test suite:

uv run pytest callm/tests/ -v

The CLI is built on top of LightningCLI and exposes three subcommands:
# TriviaQA with a local HuggingFace model (default config)
uv run python main.py validate \
--model.init_args.model_name=google/flan-t5-small \
--data.init_args.batch_size=8
# MMLU with a local model
uv run python main.py validate \
-c configs/config_mmlu_base_validation.yaml \
--model.init_args.model_name=mistralai/Ministral-3-8B-Instruct-2512
# TriviaQA with a GCP Gemini model
uv run python main.py validate \
-c configs/config_gcp_validation.yaml

Outputs are saved to lightning_logs/<run>/llm_outputs.csv.
For benchmarks that require semantic-equivalence checking (TriviaQA):
uv run python main.py evaluation \
--llm_outputs_path=lightning_logs/<run>/llm_outputs.csv
# Or recalculate metrics from an existing evaluation CSV:
uv run python main.py evaluation \
--use_existing_csv \
--llm_outputs_path=lightning_logs/<run>/llm_outputs.csv

uv run python main.py evaluate_csv \
--csv_path=lightning_logs/<run>_evaluation/version_0/evaluation_results.csv

All runs are configured via YAML. Pre-built configs live in configs/:
| Config | Backend | Benchmark |
|---|---|---|
| config_base_validation.yaml | HuggingFace | TriviaQA |
| config_gcp_validation.yaml | GCP (Gemini) | TriviaQA |
| config_base_evaluation.yaml | HuggingFace | TriviaQA (evaluator) |
| config_gcp_evaluation.yaml | GCP (Gemini) | TriviaQA (evaluator) |
| config_mmlu_base_validation.yaml | HuggingFace | MMLU |
| config_mmlu_gcp_validation.yaml | GCP (Gemini) | MMLU |
Any config value can be overridden from the CLI — see the LightningCLI docs.
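Conceptually, a dotted flag such as `--data.init_args.batch_size=8` addresses a nested key in the YAML config. The sketch below only illustrates that mapping on a plain dictionary; `apply_override` is a hypothetical helper and the starting values are made up. It is not how LightningCLI implements overrides internally.

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set config[a][b][c] = value for dotted_key 'a.b.c', creating levels as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Hypothetical nested config, shaped like the keys used in the commands above.
config = {
    "model": {"init_args": {"model_name": "google/flan-t5-small"}},
    "data": {"init_args": {"batch_size": 4}},
}

apply_override(config, "data.init_args.batch_size", 8)
print(config["data"]["init_args"]["batch_size"])
```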
callm/
├── models/
│ ├── base.py # Shared Lightning module base
│ ├── llm.py # HuggingFace LLM (local GPU)
│ ├── gcp_llm.py # GCP Vertex AI / Gemini LLM
│ ├── evaluator.py # Semantic-equivalence evaluator (HF)
│ └── gcp_evaluator.py # Semantic-equivalence evaluator (GCP)
├── data/
│ ├── triviaqa/ # TriviaQA data modules
│ ├── mmlu/ # MMLU data modules
│ ├── answers_data.py # Shared answer-loading utilities
│ ├── classification.py # Classification data module
│ └── simulation.py # Simulated confidence data module
├── extractors/
│ ├── base.py # Base + posterior extractors
│ ├── triviaqa.py # TriviaQA verbalized-confidence extractor
│ └── mmlu.py # MMLU answer/confidence extractors
├── prompts/
│ ├── base.py # Prompt / ChatPrompt base classes
│ ├── triviaqa.py # TriviaQA prompt templates
│ └── mmlu.py # MMLU prompt templates
├── metrics/
│ ├── confidences.py # Calibration metrics (ECE, AUC, BS, CE, n-ECUAS, …)
│ ├── classification.py # Classification-specific metric variants
│ ├── constants.py # Metric constants and registry
│ └── utils.py # Metric lookup helpers
├── tests/ # Unit & integration tests
├── config.py # Shared config utilities
└── utils.py # Model loading & tokenizer helpers
configs/ # YAML run configurations
scripts/ # Analysis & paper-figure scripts
cli.py # CalibrationCLI (extends LightningCLI)
main.py # Entrypoint
| Extractor | How confidence is obtained |
|---|---|
| SequencePosteriorExtractor | Product of token probabilities (equivalently, the exponentiated sum of token log‑probabilities) of the generated answer |
| IsTruePosteriorExtractor | Log‑prob of the "True" token after an "Is this true?" follow‑up |
| VerbalizedConfidenceExtractor | Parsed from the model's own verbalized confidence value |
MMLU variants (MMLUSequencePosteriorExtractor, MMLUVerbalizedExtractor, etc.) adapt these strategies to multiple‑choice format.
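As an illustration of the sequence-posterior idea, a sequence-level confidence is commonly obtained by summing the per-token log-probabilities and exponentiating, which equals the product of the token probabilities. The helper name `sequence_posterior` and the token probabilities below are assumptions for this sketch; the actual `SequencePosteriorExtractor` interface may look different.

```python
import math


def sequence_posterior(token_logprobs):
    """Probability of the whole sequence: exp of the summed token log-probabilities."""
    return math.exp(sum(token_logprobs))


# Hypothetical per-token probabilities for a three-token generated answer.
token_probs = [0.9, 0.8, 0.5]
token_logprobs = [math.log(p) for p in token_probs]

confidence = sequence_posterior(token_logprobs)
print(f"sequence confidence = {confidence:.2f}")
```

Working in log space avoids numerical underflow for long answers, since many small probabilities are added rather than multiplied.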
See LICENSE.