callm — Confidence Calibration for LLMs

A framework for evaluating confidence-augmented systems, built on PyTorch Lightning. It supports both local HuggingFace models and GCP Vertex AI (Gemini) backends across multiple benchmarks.

Supported Benchmarks

| Benchmark | Task type | Semantic‑equivalence evaluation needed? |
| --- | --- | --- |
| TriviaQA | Open‑ended QA | Yes (uses an evaluator LLM) |
| MMLU | Multiple‑choice | No (exact match on answer letter) |
| Classification | Image/Audio/Text classification | No (exact match on class label) |

Calibration Metrics

| Metric | Description |
| --- | --- |
| ECE | Expected Calibration Error (L1, 10 bins) |
| AUC | Area Under the ROC Curve |
| BS | Brier Score (MSE between confidence and correctness) |
| CE | Binary Cross‑Entropy |
| n‑ECUAS | Expected Cost for Uncertainty‑Augmented Systems (parameterised by n = 0, 1, 2, …) |
| γ‑ECUAS | Gamma‑ECUAS: selective prediction at operating point γ |
| AURC | Area Under the Risk‑Coverage curve |
| FPR@95 | False Positive Rate at 95% recall |
| Error Rate | Overall prediction error rate |
| LogLog | LogLog Score (Classification) |
| NER / NBS / NCE | Normalized versions of Error Rate, Brier Score, and Cross‑Entropy |
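
To make three of the metrics above concrete, here is a minimal, dependency-free sketch of ECE, Brier Score, and binary cross-entropy. It is an illustration only, not the code in `callm/metrics/`: the function names are hypothetical, and the equal-width binning is an assumption (the table only states "L1, 10 bins").

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """L1 ECE: bin-size-weighted mean of |accuracy - mean confidence|.

    Assumes equal-width bins over [0, 1]; other binning schemes exist.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def binary_cross_entropy(confidences, correct, eps=1e-12):
    """Mean negative log-likelihood of the 0/1 correctness labels."""
    total = 0.0
    for c, y in zip(confidences, correct):
        c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total / len(confidences)


# Toy data: two confident correct answers, two low-confidence wrong ones.
conf = [0.9, 0.9, 0.1, 0.1]
hit = [1, 1, 0, 0]
print(expected_calibration_error(conf, hit), brier_score(conf, hit))
```

All three functions take per-example confidences in [0, 1] and 0/1 correctness labels. Lower is better for each.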

Quick Start

1. Install dependencies

uv sync

2. Configure environment (optional)

cp .env.example .env

Then edit .env:

HF_TOKEN=your_huggingface_token_here               # needed for gated models (e.g. Llama, Mistral)
GOOGLE_APPLICATION_CREDENTIALS=path/to/creds.json   # needed for GCP / Gemini models

3. Run unit tests

uv run pytest callm/tests/ -v

Usage

The CLI is built on top of LightningCLI and exposes three subcommands:

`validate` — Run LLM inference and extract answers + confidences

# TriviaQA with a local HuggingFace model (default config)
uv run python main.py validate \
  --model.init_args.model_name=google/flan-t5-small \
  --data.init_args.batch_size=8

# MMLU with a local model
uv run python main.py validate \
  -c configs/config_mmlu_base_validation.yaml \
  --model.init_args.model_name=mistralai/Ministral-3-8B-Instruct-2512

# TriviaQA with a GCP Gemini model
uv run python main.py validate \
  -c configs/config_gcp_validation.yaml

Outputs are saved to lightning_logs/<run>/llm_outputs.csv.

`evaluation` — Evaluate correctness of LLM outputs via a judge model

For benchmarks that require semantic-equivalence checking (TriviaQA):

uv run python main.py evaluation \
  --llm_outputs_path=lightning_logs/<run>/llm_outputs.csv

# Or recalculate metrics from an existing evaluation CSV:
uv run python main.py evaluation \
  --use_existing_csv \
  --llm_outputs_path=lightning_logs/<run>/llm_outputs.csv

`evaluate_csv` — Compute metrics from a saved evaluation CSV

uv run python main.py evaluate_csv \
  --csv_path=lightning_logs/<run>_evaluation/version_0/evaluation_results.csv

Configuration

All runs are configured via YAML. Pre-built configs live in configs/:

| Config | Backend | Benchmark |
| --- | --- | --- |
| `config_base_validation.yaml` | HuggingFace | TriviaQA |
| `config_gcp_validation.yaml` | GCP (Gemini) | TriviaQA |
| `config_base_evaluation.yaml` | HuggingFace | TriviaQA (evaluator) |
| `config_gcp_evaluation.yaml` | GCP (Gemini) | TriviaQA (evaluator) |
| `config_mmlu_base_validation.yaml` | HuggingFace | MMLU |
| `config_mmlu_gcp_validation.yaml` | GCP (Gemini) | MMLU |

Any config value can be overridden from the CLI — see the LightningCLI docs.
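
As a rough mental model for these overrides, a dotted flag such as `--data.init_args.batch_size=8` addresses a nested key in the YAML config. The sketch below only illustrates that mapping. It is not how LightningCLI parses arguments internally (that is handled by jsonargparse), and both the `apply_override` helper and the starting values in `config` are hypothetical.

```python
def apply_override(config, dotted_key, value):
    """Set config[a][b][c] = value for dotted_key 'a.b.c'.

    Intermediate dictionaries are created when they are missing.
    """
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Hypothetical config contents, shaped like the flags used in the Usage section.
config = {
    "model": {"init_args": {"model_name": "google/flan-t5-small"}},
    "data": {"init_args": {"batch_size": 4}},
}

apply_override(config, "data.init_args.batch_size", 8)
print(config["data"]["init_args"]["batch_size"])
```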

Project Structure

callm/
├── models/
│   ├── base.py              # Shared Lightning module base
│   ├── llm.py               # HuggingFace LLM (local GPU)
│   ├── gcp_llm.py           # GCP Vertex AI / Gemini LLM
│   ├── evaluator.py         # Semantic-equivalence evaluator (HF)
│   └── gcp_evaluator.py     # Semantic-equivalence evaluator (GCP)
├── data/
│   ├── triviaqa/            # TriviaQA data modules
│   ├── mmlu/                # MMLU data modules
│   ├── answers_data.py      # Shared answer-loading utilities
│   ├── classification.py    # Classification data module
│   └── simulation.py        # Simulated confidence data module
├── extractors/
│   ├── base.py              # Base + posterior extractors
│   ├── triviaqa.py          # TriviaQA verbalized-confidence extractor
│   └── mmlu.py              # MMLU answer/confidence extractors
├── prompts/
│   ├── base.py              # Prompt / ChatPrompt base classes
│   ├── triviaqa.py          # TriviaQA prompt templates
│   └── mmlu.py              # MMLU prompt templates
├── metrics/
│   ├── confidences.py       # Calibration metrics (ECE, AUC, BS, CE, n-ECUAS, …)
│   ├── classification.py    # Classification-specific metric variants
│   ├── constants.py         # Metric constants and registry
│   └── utils.py             # Metric lookup helpers
├── tests/                   # Unit & integration tests
├── config.py                # Shared config utilities
└── utils.py                 # Model loading & tokenizer helpers
configs/                     # YAML run configurations
scripts/                     # Analysis & paper-figure scripts
cli.py                       # CalibrationCLI (extends LightningCLI)
main.py                      # Entrypoint

Confidence Extraction Methods

| Extractor | How confidence is obtained |
| --- | --- |
| `SequencePosteriorExtractor` | Product of the token probabilities of the generated answer (equivalently, the exponentiated sum of token log‑probabilities) |
| `IsTruePosteriorExtractor` | Log‑prob of the "True" token after an "Is this true?" follow‑up |
| `VerbalizedConfidenceExtractor` | Parsed from the model's own verbalized confidence value |

MMLU variants (MMLUSequencePosteriorExtractor, MMLUVerbalizedExtractor, etc.) adapt these strategies to multiple‑choice format.
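
The sketch below illustrates two of these strategies in isolation: turning per-token log-probabilities into a sequence-level confidence, and parsing a verbalized confidence from generated text. It is not the code in `callm/extractors/`. The function names and the `Confidence: 85%` output format are assumptions made for the example, since the actual prompt templates define their own format.

```python
import math
import re


def sequence_posterior(token_logprobs):
    """Sequence probability: exp(sum of log-probs) == product of token probabilities."""
    return math.exp(sum(token_logprobs))


def parse_verbalized_confidence(text):
    """Parse a confidence in [0, 1] from text such as 'Confidence: 85%'.

    Returns None when no confidence value is found. The expected text
    format is an assumption for this sketch.
    """
    match = re.search(
        r"confidence\s*[:=]\s*(\d+(?:\.\d+)?)\s*(%?)", text, flags=re.IGNORECASE
    )
    if match is None:
        return None
    value = float(match.group(1))
    # Treat explicit percentages, and bare values above 1, as percentages.
    if match.group(2) == "%" or value > 1.0:
        value /= 100.0
    return min(max(value, 0.0), 1.0)


# Two generated tokens with probabilities 0.5 and 0.8.
print(sequence_posterior([math.log(0.5), math.log(0.8)]))
print(parse_verbalized_confidence("Answer: Paris\nConfidence: 85%"))
```

Summing log-probabilities before exponentiating avoids the numerical underflow that multiplying many small probabilities directly can cause on long answers.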

License

See LICENSE.
