API polish: recommended defaults, structured notes, and trust metadata for policy comparisons by Yurashku · Pull Request #41 · Yurashku/OffPolicyLab

Yurashku · 2026-04-13T15:53:22Z

Motivation

Make high-level comparison outputs more opinionated and easier to use safely by surfacing recommended defaults and clearer trust signals.
Reduce noisy unstructured notes by separating informational notes, diagnostic warnings, inference warnings, and trust guidance.
Ensure reporting reflects the actual inference settings (a confidence level derived from alpha, including levels other than 95%) and keep backward compatibility for existing consumers.

Description

Added exported recommended-default constants and metadata: RECOMMENDED_ESTIMATOR, RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED, RECOMMENDED_PROPENSITY_SOURCE_FALLBACK, and RECOMMENDED_CROSSFIT_ESTIMATORS, and included recommended_defaults in PolicyComparisonSummary (src/policyscope/comparison.py).
Introduced structured result fields on PolicyComparisonSummary: info_notes, diagnostic_warnings, inference_warnings, trust_notes, trust_level, and optional recommendation, while preserving legacy notes as an additive, backward-compatible combined view (src/policyscope/comparison.py).
Added a lightweight rule-based trust rollup generator _build_trust_metadata(...) that summarizes diagnostics + inference warnings into trust_level/recommendation and nudges cross-fit guidance for estimators where helpful (src/policyscope/comparison.py).
Updated decision_summary to render CI level from alpha rather than hard-coded "95%" and to surface trust_level/recommendation when present (src/policyscope/report.py).
Threaded trust_level into the validation harness outputs so experiment aggregates can include trust metadata (src/policyscope/validation.py).
Documentation updates describing the recommended defaults, structured notes, and migration compatibility (README.md, docs/architecture.md, docs/validation_harness.md).
Tests added/updated to cover recommended-default metadata, structured notes compatibility, and reporting with non-95% alpha (tests/test_comparison.py, tests/test_bootstrap_report.py).
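
As a rough illustration of the alpha-driven CI label mentioned for decision_summary, here is a minimal sketch. The helper name ci_label is hypothetical and is not the function in src/policyscope/report.py; it only assumes that alpha is a two-sided significance level, so the confidence level is 100 * (1 - alpha).

```python
def ci_label(alpha: float) -> str:
    """Return a readable CI label for a two-sided significance level alpha.

    Hypothetical helper for illustration; not part of the Policyscope API.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # Round away floating-point noise (e.g. 94.99999999999999) before formatting.
    level = round(100.0 * (1.0 - alpha), 6)
    return f"{level:g}% CI"


if __name__ == "__main__":
    for a in (0.05, 0.10, 0.025):
        print(a, "->", ci_label(a))
```

Deriving the label from alpha keeps the report text and the interval that was actually computed from drifting apart.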

Testing

Ran the focused test suite with the package import path set: PYTHONPATH=src pytest -q tests/test_comparison.py tests/test_bootstrap_report.py tests/test_docs_consistency.py tests/test_validation.py.
Result: all tests passed (29 passed).
No estimator math changed, no new estimators were added, and the legacy notes field remains available for backward compatibility.

Codex Task

Yurashku · 2026-04-13T17:42:35Z

Ready-made Codex prompt for a full cleanup of the tutorial.

Perform a focused cleanup and redesign pass for examples/tutorial.ipynb so it becomes a clear, compact, user-facing tutorial for the current Policyscope workflow.

Important context:

The library architecture and methodology are already mostly complete.
The notebook should teach the current official workflow, not document the history of repository changes.
Keep estimator mathematics unchanged.
Do not add new estimators or new architecture layers.
This task is about tutorial clarity, noise reduction, and consistency with the current repository.

Main goals:

Remove changelog-style narrative
Clean the notebook from wording that sounds like development history, for example phrases like:

"новый интерфейс" ("new interface")
"новый единый слой" ("new unified layer")
"новый универсальный API" ("new universal API")
references that read like migration notes rather than tutorial guidance

The notebook should read like a stable product tutorial, not a release note.

Simplify and shorten the structure
Redesign the notebook into a clean, minimal flow.
Recommended structure:

short intro: what problem the library solves
generate or load logs
validate data contract
show oracle only for synthetic data
compute one compact comparison table of estimators
show the official high-level path (compare_policies(...) and/or OPEEvaluator)
show CI / significance / diagnostics interpretation
short "how to adapt to your own data" section

Try to reduce repeated sections and repeated explanations.

Remove noisy outputs
The current notebook contains too much stderr/logging noise from estimators and bootstrap loops.
Make notebook execution outputs clean and readable.

Acceptable approaches:

suppress verbose estimator logging during tutorial execution
wrap noisy calls in a logging/redirect context
or otherwise keep outputs compact and user-friendly

The final notebook should not spam repeated lines like [IPS] ..., [Replay] ... hundreds of times.
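
One way the "logging/redirect context" approach could look in a notebook cell is sketched below. It uses only the standard library; the context-manager name quiet is an assumption for illustration, and whether Policyscope estimators log through the logging module or write to stderr directly is not confirmed here, so the sketch covers both.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def quiet(max_suppressed: int = logging.INFO):
    """Temporarily drop log records at or below `max_suppressed` and
    swallow direct writes to sys.stderr.

    Hypothetical notebook helper; not part of the Policyscope API.
    """
    previous = logging.root.manager.disable  # remember the global disable level
    logging.disable(max_suppressed)          # WARNING and above still pass through
    try:
        # Handlers created earlier may hold their own stream reference,
        # which is why logging.disable is used in addition to the redirect.
        with contextlib.redirect_stderr(io.StringIO()):
            yield
    finally:
        logging.disable(previous)            # restore the previous global state
```

A noisy call would then be wrapped as `with quiet(): ...`, leaving warnings visible while hiding per-iteration progress lines.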

Use the current official API consistently
Prefer the current official user-facing workflow.
The tutorial should emphasize:

BanditSchema / LoggedBanditDataset
compare_policies(...) as the official orchestration path
optionally compare_policies_multi_target(...) if it adds clear value
OPEEvaluator only if it adds clarity, not if it duplicates too much

Reduce emphasis on low-level manual estimator plumbing unless it is pedagogically necessary.

Keep only the most useful manual section
It is okay to keep one short section showing low-level estimator intuition (Replay / IPS / DM / DR table), but keep it concise.
Do not let the notebook become a wall of duplicated low-level calls plus a second duplicated high-level path.

Fix consistency with current docs and reporting
Make sure the notebook reflects the current repository state:

logged vs estimated propensity modes
diagnostics / nuisance diagnostics / trust interpretation where relevant
current report.py behavior and current wording
no outdated statements about DR applicability

Improve final interpretation section
Add a short, practical interpretation block explaining:

point estimate is not enough
CI / p-value are not enough alone
diagnostics must also be checked
low overlap / low ESS / heavy weights reduce trust

Keep this concise and practical.
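
To make the "low ESS / heavy weights" point concrete, the following sketch computes the Kish effective sample size as a fraction of n from importance weights. The function name ess_ratio is hypothetical; whether the library's weight_ess_ratio diagnostic uses exactly this formula is an assumption.

```python
from typing import Sequence


def ess_ratio(weights: Sequence[float]) -> float:
    """Kish effective sample size divided by n: (sum w)^2 / (n * sum w^2).

    Equals 1.0 for uniform weights and approaches 1/n when a single
    weight dominates. Hypothetical helper for illustration.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        return 0.0  # all weights are zero: no effective samples
    return (total * total) / (n * total_sq)


if __name__ == "__main__":
    print(ess_ratio([1.0, 1.0, 1.0, 1.0]))  # uniform weights
    print(ess_ratio([4.0, 0.0, 0.0, 0.0]))  # one dominant weight
```

A ratio near 1 means the weighted estimate draws on most of the log; a ratio near 1/n means a handful of rows decide the result, which is why a narrow CI alone is not reassuring.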

Re-execute the notebook and commit clean outputs
Re-run the notebook end-to-end and commit the updated outputs.
The saved notebook should be readable and not cluttered.

Update README only if needed
Only update README if the tutorial entry or wording clearly needs synchronization.
Keep this minimal.

Tests / validation
At minimum:

ensure the notebook executes successfully
keep the rest of the repository tests green if touched

Non-goals:

no new estimators
no new architecture work
no big API redesign
no tutorial expansion into a full theory course

Deliverables:

a cleaned and shorter examples/tutorial.ipynb
less noisy execution outputs
stable product-style wording instead of changelog-style wording
current official workflow reflected clearly
minimal doc sync if necessary

Nice-to-have direction:
If reasonable, prefer one strong polished tutorial over showing every possible path. The notebook should optimize for clarity and user confidence, not for exhaustive API coverage.

To narrow the task even further, you can start with just:

removing the changelog-style text,
removing the logging noise,
reducing duplication between the low-level and high-level sections,
re-executing the notebook.

Yurashku · 2026-04-13T18:32:32Z

New prompt for a proper rework of the tutorial. The emphasis is not on cosmetics but on teaching clarity: method-vs-oracle, validity of applying each method, trust in the estimates, and the high-level API as the main path.

Redesign examples/tutorial.ipynb into a truly instructive, user-facing tutorial for the current Policyscope methodology.

Important context:

The library architecture is already strong.
The tutorial is currently weaker than the library itself.
The goal is not just to clean wording, but to make the tutorial actually explain:
1. what each estimator produced on the synthetic dataset,
2. how close each estimator is to oracle truth,
3. when each method is valid / fragile,
4. how to interpret confidence and trust in the result.
Keep estimator mathematics unchanged.
Do not add new estimators or new architecture layers.
This task is about tutorial pedagogy, clarity, and consistency with the current official API.

Main goals:

Rebuild the notebook around one clear narrative
The notebook should answer this practical question:
"I have logs from policy A and a candidate policy B. How do I compare them offline, which estimators say what, when can I trust them, and how do I read the result?"

Use a stable product-style tone.
Remove release-note/changelog wording.
Do not frame sections as repository migration history.

Make the official high-level API the main path
The tutorial should primarily teach the current official orchestration path.
Prefer:

BanditSchema
LoggedBanditDataset
compare_policies(...)
optionally compare_policies_multi_target(...) if it clearly helps

OPEEvaluator may be mentioned as an alternative convenience path, but it should not dominate the tutorial.

The notebook should not feel like a low-level estimator plumbing demo first and an official API demo second.

Add one central “method vs oracle” comparison table
This is the most important new requirement.

Because the notebook uses synthetic data, compute oracle truth and then show a single clear table comparing all implemented estimators on that same dataset.

At minimum include rows for:

Replay
IPS
SNIPS
DM
DR
SNDR
Switch-DR
(and optionally On-policy A as baseline)

At minimum include columns like:

estimator
V_A
V_B
Delta
V_A_oracle
V_B_oracle
Delta_oracle
abs_error_V_B
abs_error_Delta
V_B_CI
Delta_CI
p_value
is_significant
replay_overlap
weight_ess_ratio
trust_level
compact warning summary / key warnings

The goal is that a reader can immediately see:

which methods were close to oracle,
which were unstable,
which looked misleading,
and why.
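
The error columns of such a table can be derived mechanically once oracle values are known. The sketch below is only an assumption about how that might be done: add_oracle_errors is a hypothetical helper, Delta is assumed to mean V_B - V_A, and the numbers in the usage part are made-up placeholders, not Policyscope output.

```python
from typing import Any, Dict, List


def add_oracle_errors(
    rows: List[Dict[str, Any]], v_a_oracle: float, v_b_oracle: float
) -> List[Dict[str, Any]]:
    """Extend per-estimator rows with oracle values and absolute errors.

    Each input row needs the keys "estimator", "V_A" and "V_B".
    Hypothetical helper; Delta is assumed to be V_B - V_A.
    """
    delta_oracle = v_b_oracle - v_a_oracle
    out = []
    for row in rows:
        delta = row["V_B"] - row["V_A"]
        out.append(
            {
                **row,
                "Delta": delta,
                "V_A_oracle": v_a_oracle,
                "V_B_oracle": v_b_oracle,
                "Delta_oracle": delta_oracle,
                "abs_error_V_B": abs(row["V_B"] - v_b_oracle),
                "abs_error_Delta": abs(delta - delta_oracle),
            }
        )
    return out


if __name__ == "__main__":
    # Placeholder numbers purely to show the shape of the table.
    toy_rows = [
        {"estimator": "estimator_1", "V_A": 1.0, "V_B": 1.5},
        {"estimator": "estimator_2", "V_A": 1.0, "V_B": 2.25},
    ]
    for r in add_oracle_errors(toy_rows, v_a_oracle=1.0, v_b_oracle=2.0):
        print(r["estimator"], r["abs_error_V_B"], r["abs_error_Delta"])
```

The remaining columns (CI, p_value, overlap, trust_level) would come from the library's own outputs and are deliberately not reconstructed here.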

Add a short Russian guide on method validity / applicability inside the tutorial
Do not rely only on external docs.
Inside the notebook itself, include a concise Russian-language practical guide that explains when each family of methods is appropriate.

For example:

Replay: overlap-dependent diagnostic baseline; not universally reliable under arbitrary contextual logging.
IPS: valid with correct propensities and support, but fragile under heavy weights.
SNIPS: often more stable than IPS, but biased.
DM: depends strongly on outcome-model quality.
DR: usually the main practical default when nuisance quality is acceptable.
SNDR / Switch-DR: useful when DR suffers from unstable weights.

Keep this practical and compact.
It should help the reader decide when the method is trustworthy, not just define formulas.
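
The IPS versus SNIPS contrast in the list above can be seen in a few lines. This is a textbook-style sketch under stated assumptions, not the library's implementation: target_probs is the probability the candidate policy assigns to the logged action, logging_probs is the logged propensity of that action, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def ips_and_snips(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    logging_probs: Sequence[float],
) -> Tuple[float, float]:
    """Return (IPS, SNIPS) value estimates for the target policy.

    IPS divides the weighted reward sum by n; SNIPS divides by the sum of
    weights, which trades a small bias for lower variance under heavy weights.
    """
    if not (len(rewards) == len(target_probs) == len(logging_probs)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("inputs must be non-empty")
    if any(p <= 0.0 for p in logging_probs):
        raise ValueError("logging propensities must be strictly positive")
    weights = [t / b for t, b in zip(target_probs, logging_probs)]
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    weight_total = sum(weights)
    if weight_total == 0.0:
        raise ValueError("target policy has no support on the logged actions")
    return weighted_sum / len(rewards), weighted_sum / weight_total
```

When the weights happen to average exactly 1, the two estimates coincide; the gap between them grows as the weight distribution becomes more skewed.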

Add a clear Russian section about confidence / trust interpretation
The reader should understand that:

point estimate alone is not enough,
CI / p-value alone are not enough,
diagnostics matter,
low overlap / low ESS / heavy weights reduce trust,
trust_level is a summary, not magic truth.

Use the current structured outputs of the library where possible:

diagnostics
nuisance diagnostics if relevant
trust_level
recommendation
warnings / notes

This section should be short but very clear.
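
To illustrate why trust_level is "a summary, not magic truth", here is a deliberately simple rule-based rollup. Everything in it is an assumption made for illustration: the function name, the thresholds, and the level names are hypothetical and do not describe the library's _build_trust_metadata(...).

```python
from typing import List, Tuple


def rollup_trust(
    weight_ess_ratio: float,
    replay_overlap: float,
    n_inference_warnings: int,
    ess_floor: float = 0.1,       # hypothetical threshold
    overlap_floor: float = 0.05,  # hypothetical threshold
) -> Tuple[str, List[str]]:
    """Collapse a few diagnostics into a coarse level plus the reasons behind it."""
    reasons = []
    if weight_ess_ratio < ess_floor:
        reasons.append("low effective sample size (heavy importance weights)")
    if replay_overlap < overlap_floor:
        reasons.append("low overlap between logged and target actions")
    if n_inference_warnings > 0:
        reasons.append("inference warnings were raised")
    if not reasons:
        level = "high"
    elif len(reasons) == 1:
        level = "medium"
    else:
        level = "low"
    return level, reasons
```

A rollup of this kind can only reflect the diagnostics it is given, so a "high" level says nothing about problems those diagnostics cannot see, such as unlogged confounding.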

Keep only one short low-level intuition section
It is okay to keep one compact low-level section showing manual estimator intuition.
But it must be brief.
Do not duplicate the whole workflow twice.

The main educational value should come from:

official high-level API
method-vs-oracle comparison
validity/trust interpretation

Remove noisy logging from notebook outputs
The current notebook outputs are too noisy because estimator/bootstrap logs flood stderr.
Suppress or redirect verbose logging so the saved notebook is readable.
The final notebook should not contain long repeated blocks like [IPS] ..., [Replay] ... dozens of times.

Improve final structure
Recommended structure:

intro: what Policyscope solves
generate/load synthetic logs
validate data contract
show oracle (synthetic only)
one short manual intuition section
official compare path across estimators
central method-vs-oracle table
Russian guide: validity/applicability of methods
Russian guide: how to interpret confidence and trust
how to adapt to your own data

Re-execute notebook and save clean outputs
Re-run the notebook end-to-end.
Commit clean, readable outputs.
The notebook should be pleasant to read in GitHub, not just executable.

Minimal doc sync only if necessary
Update README only if the tutorial description clearly needs synchronization.
Keep this minimal.

Validation expectations:

notebook executes successfully
changed tests (if any) stay green
tutorial content is visibly clearer and more informative than before

Non-goals:

no new estimators
no new methodology features
no large architecture refactor
no theory-course expansion

Deliverables:

redesigned examples/tutorial.ipynb
central estimator-vs-oracle comparison table
Russian method-validity guidance inside the tutorial
Russian confidence/trust interpretation section
reduced logging noise
clean executed outputs

Important quality bar:
After the rewrite, a reader should be able to open the notebook and answer these four questions without confusion:

What did each estimator say on this dataset?
Which ones were close to oracle?
When is each method valid or fragile?
How much trust should I place in the reported result?

To do the task in stages, you can first do only:

the structural rebuild,
the method-vs-oracle table,
the Russian guide to valid/fragile cases,
the suppression of noisy logging,
and only then polish the wording.

Yurashku · 2026-04-14T14:02:10Z

Operational prompt and execution plan for the next Codex pass after merge to main.

Can Codex do this?

Yes — Codex can implement almost all of the next plan.

Best use of Codex here:

structural cleanup of tutorial/docs/examples
moving files between examples/, scripts/, and docs/
fixing the clear tutorial bug(s)
removing package-level logging side effects
creating new notebooks / docs with consistent wording
updating README navigation and entrypoints

What still benefits from human review after Codex:

whether the new tutorial structure matches the intended pedagogy
whether the synthetic scenarios are actually convincing
whether the decision/trust language is too strong or too weak
final readability of notebooks on GitHub

Recommended way to use Codex here

Do not ask for everything in one giant pass.
Use 2-3 focused Codex passes.

Recommended sequence:

Pass 1 — fix obvious defects + restructure examples
- fix tutorial replay bug
- remove logging.basicConfig(...) from package import path
- separate user-facing example materials from script-like experiment runners
- improve README navigation
Pass 2 — build the educational materials you actually want
- main quickstart notebook for own data
- synthetic estimator comparison notebook
- RU interpretation guide for outputs / trust / when not to trust OPE
Pass 3 — polish / execute / final consistency
- re-run notebooks
- tighten wording
- make cross-links clean
- ensure tests still pass

Pre-work before running Codex

Minimal pre-work only:

use a fresh branch from current main
tell Codex to keep scope tight and avoid estimator-math changes
tell Codex explicitly that user-facing pedagogy matters more than showing every internal API path
tell Codex to prefer one strong happy-path over exhaustive duplication

No special manual prep is required beyond that.

Prompt for Codex — Pass 1 + 2 combined but still focused

Rework the user-facing learning materials of the Policyscope repository so that a user can actually become productive and informed.

Important context:

The core library architecture is already in good shape.
The main remaining problems are user-facing: tutorial structure, examples organization, interpretation guidance, and clarity about when OPE results are trustworthy.
Keep estimator mathematics unchanged.
Do not add new estimators or new architecture layers.
The goal is not to expand theory endlessly, but to make the repo teach the user how to use the library and how to interpret outputs responsibly.

Main goals:

Fix obvious defects first

Fix the tutorial bug in the current low-level Replay section: replay_value(...) must receive policy-B actions, not piB_taken probabilities.
Fix any related overlap computation in the tutorial if it is currently using the wrong quantity.
Remove package-level logging side effects from src/policyscope/__init__.py (do not call logging.basicConfig(...) on import).
Keep backward compatibility where practical.
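
The Replay defect above is easiest to see in a stripped-down version of the estimator. The sketch below is not the library's replay_value(...), whose signature is not shown here; replay_value_sketch is a hypothetical stand-in that averages rewards over rows where the logged action equals the action policy B would have chosen.

```python
import math
from typing import Hashable, Sequence, Tuple


def replay_value_sketch(
    logged_actions: Sequence[Hashable],
    rewards: Sequence[float],
    target_actions: Sequence[Hashable],
) -> Tuple[float, float]:
    """Return (replay value, overlap share).

    `target_actions` must hold the actions chosen by policy B, not the
    probabilities policy B assigns to the logged actions.
    """
    if not (len(logged_actions) == len(rewards) == len(target_actions)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("inputs must be non-empty")
    matched = [
        r for a, r, b in zip(logged_actions, rewards, target_actions) if a == b
    ]
    overlap = len(matched) / len(rewards)
    # With no matching rows the replay value is undefined.
    value = sum(matched) / len(matched) if matched else math.nan
    return value, overlap
```

Passing probabilities in place of actions makes the equality check compare labels with floats, so almost no rows match and both the value and the overlap become meaningless, which is consistent with the related overlap issue mentioned in the list.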

Reorganize user-facing materials by user intent
Make the repository easier to navigate by separating:

beginner quickstart / own-data usage
synthetic estimator comparison
script-like experiment runners
interpretation/trust guidance

Recommended target structure:

examples/quickstart_own_data_ru.ipynb — main user tutorial
examples/compare_estimators_vs_oracle_ru.ipynb — synthetic estimator comparison notebook
docs/how_to_interpret_ope_outputs_ru.md — practical RU guide for interpreting outputs and trust
keep script-like runnable experiment files only if they have a clear role; if needed, move them to a more suitable place such as scripts/ or clearly document them as experiment runners rather than tutorials

You may keep examples/tutorial.ipynb only if it remains useful; otherwise replace it with better named notebooks. Prefer clarity over attachment to the old file name.

Create a true quickstart notebook for applying the library to user data
Build a notebook whose primary purpose is:
"How do I run Policyscope on my own dataset, with the official high-level API, and how do I read the result?"

This notebook should emphasize:

BanditSchema
LoggedBanditDataset
compare_policies(...) as the main orchestration path
minimal required columns
what to do when logged propensities are available
what to do when logged propensities are unavailable
what Delta, CI, p_value, diagnostics, and trust_level mean

The notebook must include a minimal copy-adapt-run code path for a user’s own DataFrame.
Do not make it overly long.
Optimize for usability.

Create a separate synthetic comparison notebook
Build a second notebook whose primary purpose is:
"How do the estimators behave relative to oracle truth under controlled synthetic settings?"

This notebook should:

use synthetic data with an intentionally visible non-trivial delta_V / policy-value difference
show oracle values
compare Replay / IPS / SNIPS / DM / DR / SNDR / Switch-DR on the same dataset
include one central method-vs-oracle table
help the reader see which estimators were accurate, which were fragile, and why

If useful, include more than one synthetic scenario, for example:

a reasonably healthy overlap scenario
a poor-overlap scenario
a logged-vs-estimated propensity contrast

Keep this notebook pedagogical, not benchmark-heavy.

Add a practical RU interpretation guide
Create docs/how_to_interpret_ope_outputs_ru.md (or a very similar name).

This guide should answer questions like:

What do V_A, V_B, and Delta mean?
What do CI and p_value mean, and what do they NOT mean?
Why are diagnostics necessary in addition to CI?
What does low overlap / low ESS / heavy weights imply?
What does trust_level mean and what does it NOT guarantee?
When can OPE be used as directional evidence / screening, and when is it not enough to replace an A/B test?
How should a user think about choosing among Replay / IPS / SNIPS / DM / DR / SNDR / Switch-DR?

Keep the tone practical, responsible, and clear.
Do not overclaim that any single statistic can certify “safe to skip experimentation”.

Clarify estimator-selection guidance
Across the quickstart notebook and interpretation guide, clearly explain a practical rule-of-thumb such as:

Replay as a support/overlap-dependent baseline
IPS/SNIPS as weighted estimators sensitive to support and weight tails
DM as model-dependent
DR as a common practical default when nuisance quality is acceptable
SNDR / Switch-DR as useful robustness variants when weights are unstable

The goal is not perfect theoretical completeness, but actionable user guidance.

Improve repository navigation
Update README so it becomes a clean navigator rather than trying to be every document at once.
At minimum, README should clearly point to:

quickstart notebook for own data
synthetic estimator comparison notebook
interpretation guide
architecture doc
validation harness doc

If script-like files remain in examples/, explain their purpose explicitly.

Keep notebooks readable

suppress noisy estimator/bootstrap logging in saved outputs
re-run notebooks end-to-end
commit clean outputs that are pleasant to read on GitHub

Tests / validation
At minimum:

keep existing tests green if touched
ensure notebooks execute successfully
update lightweight tests if reporting/output field names or paths changed nearby

Non-goals:

no new estimators
no new inference methods
no major architecture refactor
no massive benchmark framework expansion

Deliverables:

fixed tutorial bug(s)
no package-level logging side effects on import
reorganized user-facing materials
one quickstart notebook for own data
one synthetic estimator-comparison notebook
one RU interpretation guide
improved README navigation
clean executed notebook outputs

Quality bar:
After this pass, a user should be able to answer these questions from the repo materials:

How do I run the library on my own data?
How do I interpret the outputs?
When should I distrust the result?
Which estimator should I start from, and why?
How do the estimators compare to oracle truth on synthetic data?

Operational advice for the Codex run

Ask Codex to:

first inspect the current files before rewriting them
preserve estimator math
prioritize pedagogy and user clarity
avoid duplicating the same workflow across multiple notebooks
make one notebook the clear happy-path for real usage
make one notebook the clear synthetic comparison notebook

After Codex finishes:

manually review notebook structure and titles
manually skim the saved outputs on GitHub
then run one short final cleanup pass if wording still feels off

Yurashku · 2026-04-14T14:07:28Z

Updated plan after reviewing additional external feedback.

Short verdict:

some points from the external review are highly relevant and should be addressed now;
some are correct but belong to a later stage;
we should avoid turning the repo into an overbuilt research platform in the first pass.

What is truly relevant now (must-fix / near-term)

Target-type inference bug
Current outcome-model fitting still keys binary-vs-continuous behavior off the target name ("accept" vs everything else) instead of the target's semantics or data type.
This hurts the promised “bring your own data” story and should be fixed early.

Action-label diagnostics bug
Behavior-model top-1 diagnostics should not compare np.argmax(...) column indices directly to raw action labels when labels may be non-0..k-1 or strings.
This is a real correctness issue and should be fixed early.

Package import logging side effect
logging.basicConfig(...) on package import is not good library behavior and should be removed.

Canonical user path is still too blurry
We should explicitly make compare_policies(...) the canonical orchestration entrypoint for docs/tutorials.
OPEEvaluator can remain as a convenience wrapper, but should not be the main pedagogical path unless it reaches full feature parity.

Learning materials must be split by user intent
One notebook should not try to be quickstart, own-data cookbook, estimator benchmark, and decision guide all at once.
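
For the action-label diagnostics bug, the usual remedy is to map each argmax column index back to a label before comparing. The sketch below assumes a scikit-learn-style convention where the columns of the probability matrix follow the order of a `classes` sequence (as with a fitted classifier's classes_ attribute); whether Policyscope's behavior model exposes exactly that is an assumption, and the function name is hypothetical.

```python
from typing import Hashable, Sequence


def top1_accuracy(
    proba_rows: Sequence[Sequence[float]],
    classes: Sequence[Hashable],
    logged_actions: Sequence[Hashable],
) -> float:
    """Share of rows where the most probable action label equals the logged one."""
    if len(proba_rows) != len(logged_actions):
        raise ValueError("proba_rows and logged_actions must have the same length")
    if len(logged_actions) == 0:
        raise ValueError("inputs must be non-empty")
    hits = 0
    for row, action in zip(proba_rows, logged_actions):
        # Column index of the largest probability (same role as np.argmax on one row).
        best_col = max(range(len(row)), key=row.__getitem__)
        # Translate the column index into a label before comparing.
        if classes[best_col] == action:
            hits += 1
    return hits / len(logged_actions)


if __name__ == "__main__":
    proba = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
    print(top1_accuracy(proba, ["email", "push", "sms"], ["email", "sms"]))
```

Comparing the raw indices 0 and 2 to the strings "email" and "sms" would report zero accuracy on the same data, which is the failure mode described above.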

What is relevant but should be delayed (later stage)

Full decision framework / “can I skip A/B?” rubric
Important, but not first-pass material.
For now we should give responsible interpretation guidance, not a pseudo-certification gate.

Major validation-harness hardening
More scenarios, stronger regression expectations, less MC noise in oracle, etc. are valuable, but this is second-wave work.

Large-scale repo/tooling overhaul
requirements-dev, deeper CI hardening, and a bigger benchmark suite are useful, but not necessary before fixing user-facing correctness and pedagogy.

Updated implementation plan

Phase 1 — correctness and portability fixes

fix target-type inference for outcome-model fitting
fix action-label bug in nuisance diagnostics
remove import-time logging configuration
add focused tests for these fixes
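
For the target-type fix, one possible direction is to infer the outcome kind from the observed values instead of the column name. The heuristic below is only a sketch under that assumption; infer_target_kind is a hypothetical helper, and the actual fix in the library may differ (for example, by letting the user declare the type explicitly).

```python
import math
from typing import Iterable


def infer_target_kind(values: Iterable[object]) -> str:
    """Return "binary" if every non-missing value is 0 or 1, else "continuous".

    Hypothetical heuristic; it ignores the column name entirely.
    """
    observed = set()
    for v in values:
        if v is None:
            continue  # treat None as missing
        x = float(v)  # bools and ints become 0.0 / 1.0 etc.
        if math.isnan(x):
            continue  # treat NaN as missing
        observed.add(x)
    if not observed:
        raise ValueError("target has no non-missing values")
    return "binary" if observed <= {0.0, 1.0} else "continuous"


if __name__ == "__main__":
    print(infer_target_kind([0, 1, 1, 0]))
    print(infer_target_kind([0.0, 12.5, 3.2]))
```

A value-based rule can still misclassify a continuous target that happens to contain only 0 and 1 in a small sample, so an explicit override remains a sensible complement to any such heuristic.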

Phase 2 — define the canonical user-facing path

make compare_policies(...) the official tutorial/docs path
treat OPEEvaluator as optional convenience wrapper unless feature parity is explicitly expanded
update wording so the repo does not overpromise universality

Phase 3 — restructure teaching materials

Create / reorganize materials by user intent:

examples/quickstart_own_data_ru.ipynb — main happy path for applying to user data
examples/compare_estimators_vs_oracle_ru.ipynb — synthetic estimator comparison
docs/how_to_interpret_ope_outputs_ru.md — practical interpretation/trust guide
script-like runners kept only if their role is explicit; otherwise move to scripts/ or document clearly

Phase 4 — optional second wave

stronger validation harness scenarios
more decision-oriented guidance
tooling / dev-environment cleanup

Practical prioritization

For the next Codex pass, the best order is:

fix correctness/portability bugs
clean the canonical user path
rebuild learning materials around that path

This keeps the repo from becoming overloaded while still addressing the most user-visible and credibility-critical issues.

Polish comparison defaults, trust notes, and reporting metadata

e28981d

Yurashku added the codex label Apr 13, 2026 — with ChatGPT Codex Connector

Polish report compatibility and replay guidance wording

ae3dac0

Refocus tutorial notebook on official compact OPE workflow

9128401

Redesign tutorial around method-vs-oracle and trust interpretation

850a742

Yurashku merged commit af7e97c into main Apr 14, 2026
0 of 2 checks passed

Yurashku deleted the codex/finalize-api-polish-for-ope-library branch April 14, 2026 13:55

Yurashku merged 4 commits into main from codex/finalize-api-polish-for-ope-library
