CBLRE — Canadian Bilingual Legal & Regulatory Evaluation

Status: In active development. The scoring harness and item schema are published here. The item bank is gated — see Requesting the evaluation set.

CBLRE is a benchmark for evaluating large language models on Canadian legal and regulatory tasks, in both official languages. It is developed and maintained by Alpine Pacific Trading Inc. (operating as SimpleDirect®).

Why public harness, gated items?

The item bank, gold answers, and held-out scoring split are not published here. Evaluators can inspect the full scoring logic, run the harness against their own items, and understand the methodology — while the private split prevents models from being trained on test content before evaluation. This is the standard contamination-resistance design used by credible benchmarks.

Tracks

#	Track	What it measures
1	`bilingual_parity`	Accuracy drop between matched EN and FR items
2	`quebec_civil_law`	Québec civil law doctrine (C.c.Q., extracontractual liability)
3	`common_law`	Canadian common law doctrine (negligence, contract)
4	`constitutional_charter`	Charter rights, Oakes test, division of powers
5	`privacy_compliance`	PIPEDA, Québec Law 25, provincial privacy statutes
6	`citation_integrity`	Legal citation hallucination detection
7	`safety_calibration`	Appropriate refusal vs. compliance
8	`grounded_rag`	Answer faithfulness to supplied legal context
9	`function_calling`	Structured tool use for legal search tasks
10	`capability_retention`	Instruction-following under legal framing

Scoring methods

Method	Used by	Programmatic?
`mcq_exact`	Multiple-choice items	Yes — final-committed-answer extraction
`language_adherence`	Bilingual parity track	Yes — heuristic, fastText recommended for release
`citation_validity`	Citation integrity track	Yes — optional CanLII verifier
`keyword_coverage`	Compliance tracks (gate)	Yes — gate only; judge sets quality score
`rubric`	Open legal reasoning	No — requires an LLM judge
`refusal`	Safety calibration	Yes — binary correctness
`tool_call`	Function calling	Partial — judge scores argument quality

mcq_exact implements a final-committed-answer strategy: for reasoning models that produce chain-of-thought before committing, it scans for the LAST commitment pattern rather than the first letter. See docs/methodology.md for detail.

Running the harness on your own items

You supply a JSONL file where each line is an item conforming to schema/eval_item.schema.json. The harness calls your model, scores each item, and writes a per-item JSONL and a summary.json.

pip install -r requirements.txt

Programmatic tracks only (no judge required):

python -m harness.run_eval \
  --items your_items.jsonl \
  --model '{"kind":"openai_compat","model_name":"your-model-name","base_url":"http://localhost:8000/v1"}' \
  --run-id your-model-v1 \
  --out-dir ./results

With an LLM judge for rubric/open items:

python -m harness.run_eval \
  --items your_items.jsonl \
  --model '{"kind":"openai_compat","model_name":"your-model-name","base_url":"http://localhost:8000/v1"}' \
  --judge '{"kind":"openai_compat","model_name":"gpt-4o","base_url":"https://api.openai.com/v1","api_key_env":"OPENAI_API_KEY"}' \
  --run-id your-model-v1 \
  --out-dir ./results

With the canonical judge (Claude Sonnet 4.6 — recommended):

export ANTHROPIC_API_KEY=sk-ant-...

python -m harness.run_eval \
  --items your_items.jsonl \
  --model '{"kind":"openai_compat","model_name":"your-model-name","base_url":"http://localhost:8000/v1"}' \
  --judge '{"kind":"anthropic","model_name":"claude-sonnet-4-6","api_key_env":"ANTHROPIC_API_KEY"}' \
  --run-id your-model-v1 \
  --out-dir ./results

Teams already on GCP can reach the same judge model through Vertex AI instead — scores remain comparable because the model is identical:

  --judge '{"kind":"vertex_anthropic","model_name":"claude-sonnet-4-6","region":"us-east5","project":"your-gcp-project"}'

Supported --model / --judge client kinds:

`kind`	When to use
`openai_compat`	Any `/v1/chat/completions` server (vLLM, Together, OpenAI, Groq, etc.)
`hf_local`	Local HuggingFace checkpoint (requires `torch`, `transformers`)
`anthropic`	Claude via native Anthropic API — canonical judge; set `ANTHROPIC_API_KEY`
`vertex_anthropic`	Same Claude models via Google Vertex AI — for teams on GCP (ADC auth)

Run python -m harness.run_eval --help for all CLI options.

Output format

Results are written to --out-dir/<run-id>/:

items.jsonl — one row per item: response, latency, programmatic score, judge score, final score
summary.json — per-track means with 95% bootstrap CIs, bilingual parity ratios, difficulty breakdowns

Results from seed runs carry the note: "Items require SME validation before these numbers are publishable." Do not cite seed-run numbers as finalized benchmark results.

Synthetic sample

data/sample/ contains a small set of illustrative synthetic items demonstrating the item format. These are not part of the scoring set and have no bearing on model evaluation. See data/sample/README.md.

Requesting the evaluation set

The full item bank (~129 expert-reviewed items in the current validation set; SME validation ongoing) is available to qualified evaluators on request. Official scoring against the held-out split is coordinated with SimpleDirect® to prevent contamination.

To request access: open an Evaluation Set Access Request on this repository's issue tracker.

See docs/data_access.md for the full access and official scoring protocol.

Vendor neutrality

CBLRE is a vendor-neutral instrument designed to evaluate any instruction-following LLM regardless of provider or architecture. SimpleDirect® publishes the flash-1-mini model; CBLRE is not designed to favour it or any other specific model. The scoring harness applies identical prompts and decoding conditions to every evaluated model.

The canonical LLM judge (Claude Sonnet 4.6) shares no lineage with flash-1-mini, which is a Qwen fine-tune — so the benchmark author's own model receives no self-preference advantage from the judge. When Claude-family models are evaluated as competitors, a judge ensemble including a non-Claude judge is required to control for self-preference bias (see docs/methodology.md §5.2).

Limitations & current status

CBLRE is at an early development stage. We state its current limitations plainly so that no result is over-interpreted:

Sample size. The current validation set is ~129 items across 10 tracks (~13 per track). Per-track confidence intervals are correspondingly wide, and most per-track differences between models will not be statistically distinguishable at this size. Track-level numbers should be read as directional, not definitive.
Validity evidence is in progress. SME (subject-matter expert) validation is ongoing. Inter-annotator agreement (Cohen's / weighted κ) and judge-vs-human calibration are planned but not yet published. No baseline model leaderboard is published; the instrument's discrimination has not yet been demonstrated empirically.
Two scorers are not yet release-grade. language_adherence uses a lightweight heuristic (fastText lid.176 is the intended replacement), and citation_validity cannot confirm hallucinated citations without an external verifier (e.g. a CanLII lookup) wired in; without one it flags citation-shaped text for review rather than confirming.
Contamination defense. Items carry canary strings for leak detection, but corpus-overlap analysis (n-gram / embedding) against training data is not yet part of the protocol.
Single-entity authorship. CBLRE is built by the entity that also ships a model. We mitigate this with vendor-neutral conditions, a third-party judge model, gated items, and by not publishing self-run leaderboards — but independent governance is the stronger long-term fix, and external SME authorship is a goal.

Roadmap

Active development is focused on:

Expanding the item bank — substantially increasing the number of expert-reviewed items per track to narrow confidence intervals and support stable per-track comparisons.
Broadening domain coverage — adding further Canadian legal and regulatory domains beyond the current 10 tracks.
Publishing SME inter-annotator agreement (κ) and judge-vs-human calibration figures.
Upgrading language_adherence to a published language-ID model and integrating a citation verifier.
Independent, third-party-run baselines before any comparative claim is published.

Until these are in place, CBLRE should be described as a development-stage instrument, not a finalized standard.

License

Code (this repository): Apache License 2.0. See LICENSE.

Evaluation data (item bank, gold answers): separately licensed and gated. See DATA_LICENSE.md.

Citation

Citation entry will be added once a technical report is published.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CBLRE — Canadian Bilingual Legal & Regulatory Evaluation

Why public harness, gated items?

Tracks

Scoring methods

Running the harness on your own items

Output format

Synthetic sample

Requesting the evaluation set

Vendor neutrality

Limitations & current status

Roadmap

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
data/sample		data/sample
docs		docs
harness		harness
schema		schema
.gitattributes		.gitattributes
.gitignore		.gitignore
DATA_LICENSE.md		DATA_LICENSE.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

CBLRE — Canadian Bilingual Legal & Regulatory Evaluation

Why public harness, gated items?

Tracks

Scoring methods

Running the harness on your own items

Output format

Synthetic sample

Requesting the evaluation set

Vendor neutrality

Limitations & current status

Roadmap

License

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages