API polish: recommended defaults, structured notes, and trust metadata for policy comparisons #41

Merged
Yurashku merged 4 commits into main from codex/finalize-api-polish-for-ope-library on Apr 14, 2026

@Yurashku
Owner

Motivation

  • Make high-level comparison outputs more opinionated and easier to use safely by surfacing recommended defaults and clearer trust signals.
  • Reduce noisy unstructured notes by separating informational notes, diagnostic warnings, inference warnings, and trust guidance.
  • Ensure reporting reflects inference settings (non-95% alpha) and keep backwards compatibility for existing consumers.

Description

  • Added exported recommended-default constants and metadata: RECOMMENDED_ESTIMATOR, RECOMMENDED_PROPENSITY_SOURCE_WITH_LOGGED, RECOMMENDED_PROPENSITY_SOURCE_FALLBACK, and RECOMMENDED_CROSSFIT_ESTIMATORS, and included recommended_defaults in PolicyComparisonSummary (src/policyscope/comparison.py).
  • Introduced structured result fields on PolicyComparisonSummary: info_notes, diagnostic_warnings, inference_warnings, trust_notes, trust_level, and optional recommendation, while preserving legacy notes as an additive, backward-compatible combined view (src/policyscope/comparison.py).
  • Added a lightweight rule-based trust rollup generator _build_trust_metadata(...) that summarizes diagnostics + inference warnings into trust_level/recommendation and nudges cross-fit guidance for estimators where helpful (src/policyscope/comparison.py).
  • Updated decision_summary to render CI level from alpha rather than hard-coded "95%" and to surface trust_level/recommendation when present (src/policyscope/report.py).
  • Threaded trust_level into the validation harness outputs so experiment aggregates can include trust metadata (src/policyscope/validation.py).
  • Documentation updates describing the recommended defaults, structured notes, and migration compatibility (README.md, docs/architecture.md, docs/validation_harness.md).
  • Tests added/updated to cover recommended-default metadata, structured notes compatibility, and reporting with non-95% alpha (tests/test_comparison.py, tests/test_bootstrap_report.py).
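
The structured-notes and CI-label changes described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/policyscope/comparison.py or src/policyscope/report.py: the `SummarySketch` class and the `ci_label` helper are hypothetical names, and only the field names (info_notes, diagnostic_warnings, inference_warnings, trust_notes, notes) come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SummarySketch:
    """Hypothetical stand-in for the structured note fields; not the real class."""

    info_notes: List[str] = field(default_factory=list)
    diagnostic_warnings: List[str] = field(default_factory=list)
    inference_warnings: List[str] = field(default_factory=list)
    trust_notes: List[str] = field(default_factory=list)

    @property
    def notes(self) -> List[str]:
        # Legacy combined view: all structured notes concatenated in a fixed order,
        # so older consumers that only read `notes` keep working.
        return (
            self.info_notes
            + self.diagnostic_warnings
            + self.inference_warnings
            + self.trust_notes
        )


def ci_label(alpha: float) -> str:
    """Render the CI level from alpha instead of hard-coding "95%"."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # ":g" drops trailing zeros, e.g. alpha=0.1 gives "90% CI".
    return f"{(1.0 - alpha) * 100:g}% CI"


summary = SummarySketch(info_notes=["used logged propensities"],
                        trust_notes=["low ESS"])
```

Keeping `notes` as a derived property is one way to stay additive; the real implementation may instead store the combined list directly.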

Testing

  • Ran the focused test suite with the package import path set: PYTHONPATH=src pytest -q tests/test_comparison.py tests/test_bootstrap_report.py tests/test_docs_consistency.py tests/test_validation.py.
  • Result: all tests passed (29 passed).
  • No estimator math changed, no new estimators were added, and the legacy notes field remains available for backward compatibility.

Codex Task

Owner Author

A ready-made Codex prompt for a full cleanup pass over the tutorial.


Perform a focused cleanup and redesign pass for examples/tutorial.ipynb so it becomes a clear, compact, user-facing tutorial for the current Policyscope workflow.

Important context:

  • The library architecture and methodology are already mostly complete.
  • The notebook should teach the current official workflow, not document the history of repository changes.
  • Keep estimator mathematics unchanged.
  • Do not add new estimators or new architecture layers.
  • This task is about tutorial clarity, noise reduction, and consistency with the current repository.

Main goals:

  1. Remove changelog-style narrative
    Clean the notebook from wording that sounds like development history, for example phrases like:
  • "новый интерфейс" ("new interface")
  • "новый единый слой" ("new unified layer")
  • "новый универсальный API" ("new universal API")
  • references that read like migration notes rather than tutorial guidance

The notebook should read like a stable product tutorial, not a release note.

  2. Simplify and shorten the structure
    Redesign the notebook into a clean, minimal flow.
    Recommended structure:
  • short intro: what problem the library solves
  • generate or load logs
  • validate data contract
  • show oracle only for synthetic data
  • compute one compact comparison table of estimators
  • show the official high-level path (compare_policies(...) and/or OPEEvaluator)
  • show CI / significance / diagnostics interpretation
  • short "how to adapt to your own data" section

Try to reduce repeated sections and repeated explanations.

  3. Remove noisy outputs
    The current notebook contains too much stderr/logging noise from estimators and bootstrap loops.
    Make notebook execution outputs clean and readable.

Acceptable approaches:

  • suppress verbose estimator logging during tutorial execution
  • wrap noisy calls in a logging/redirect context
  • or otherwise keep outputs compact and user-friendly

The final notebook should not spam repeated lines like [IPS] ..., [Replay] ... hundreds of times.
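
One possible shape for the "logging/redirect context" mentioned above, using only the standard library. The helper name `quiet` is hypothetical and not part of Policyscope; the `print` call is a stand-in for a noisy estimator or bootstrap call.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def quiet(level: int = logging.CRITICAL):
    """Drop log records up to `level` and capture stdout/stderr writes inside the block."""
    buffer = io.StringIO()
    logging.disable(level)  # silences records emitted through the logging module
    try:
        # Catches plain print(...) output that bypasses the logging module.
        with contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
            yield buffer
    finally:
        # Note: this resets any disable level that was set before entering.
        logging.disable(logging.NOTSET)


with quiet() as captured:
    print("[IPS] step 1")  # stand-in for a noisy estimator call
noise = captured.getvalue()
```

Capturing into a buffer rather than discarding keeps the output available if a tutorial cell needs to show the last few lines after a failure.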

  4. Use the current official API consistently
    Prefer the current official user-facing workflow.
    The tutorial should emphasize:
  • BanditSchema / LoggedBanditDataset
  • compare_policies(...) as the official orchestration path
  • optionally compare_policies_multi_target(...) if it adds clear value
  • OPEEvaluator only if it adds clarity, not if it duplicates too much

Reduce emphasis on low-level manual estimator plumbing unless it is pedagogically necessary.

  5. Keep only the most useful manual section
    It is okay to keep one short section showing low-level estimator intuition (Replay / IPS / DM / DR table), but keep it concise.
    Do not let the notebook become a wall of duplicated low-level calls plus a second duplicated high-level path.

  6. Fix consistency with current docs and reporting
    Make sure the notebook reflects the current repository state:

  • logged vs estimated propensity modes
  • diagnostics / nuisance diagnostics / trust interpretation where relevant
  • current report.py behavior and current wording
  • no outdated statements about DR applicability
  7. Improve final interpretation section
    Add a short, practical interpretation block explaining:
  • point estimate is not enough
  • CI / p-value are not enough alone
  • diagnostics must also be checked
  • low overlap / low ESS / heavy weights reduce trust

Keep this concise and practical.
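
To make the "low ESS / heavy weights" point concrete, here is a small sketch of the Kish effective-sample-size ratio. It is an assumption that the library's weight diagnostics use this exact definition; the function name below is illustrative only.

```python
from typing import Sequence


def ess_ratio(weights: Sequence[float]) -> float:
    """Kish effective sample size divided by n: (sum w)^2 / (n * sum w^2)."""
    n = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    if n == 0 or sum_w2 == 0.0:
        return 0.0
    return (sum_w * sum_w) / (n * sum_w2)


# Equal weights: every logged row contributes, ratio is 1.
uniform = ess_ratio([1.0, 1.0, 1.0, 1.0])
# One dominant weight: the estimate effectively rests on a single row.
heavy = ess_ratio([0.0, 0.0, 0.0, 4.0])
```

Both weight vectors have the same mean, yet the second one leaves only a quarter of the nominal sample size, which is why a narrow CI alone does not establish trust.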

  8. Re-execute the notebook and commit clean outputs
    Re-run the notebook end-to-end and commit the updated outputs.
    The saved notebook should be readable and not cluttered.

  9. Update README only if needed
    Only update README if the tutorial entry or wording clearly needs synchronization.
    Keep this minimal.

  10. Tests / validation
    At minimum:

  • ensure the notebook executes successfully
  • keep the rest of the repository tests green if touched

Non-goals:

  • no new estimators
  • no new architecture work
  • no big API redesign
  • no tutorial expansion into a full theory course

Deliverables:

  • a cleaned and shorter examples/tutorial.ipynb
  • less noisy execution outputs
  • stable product-style wording instead of changelog-style wording
  • current official workflow reflected clearly
  • minimal doc sync if necessary

Nice-to-have direction:
If reasonable, prefer one strong polished tutorial over showing every possible path. The notebook should optimize for clarity and user confidence, not for exhaustive API coverage.


If you want to narrow the task even further, you can start with only:

  • removing the changelog-style text,
  • removing the logging noise,
  • reducing the duplication between low-level and high-level sections,
  • re-executing the notebook.

Owner Author

A new prompt for a proper rework of the tutorial. The emphasis is not on cosmetics but on instructional clarity: method-vs-oracle, validity of applying the methods, trust in the estimates, and the high-level API as the main path.


Redesign examples/tutorial.ipynb into a truly instructive, user-facing tutorial for the current Policyscope methodology.

Important context:

  • The library architecture is already strong.
  • The tutorial is currently weaker than the library itself.
  • The goal is not just to clean wording, but to make the tutorial actually explain:
    1. what each estimator produced on the synthetic dataset,
    2. how close each estimator is to oracle truth,
    3. when each method is valid / fragile,
    4. how to interpret confidence and trust in the result.
  • Keep estimator mathematics unchanged.
  • Do not add new estimators or new architecture layers.
  • This task is about tutorial pedagogy, clarity, and consistency with the current official API.

Main goals:

  1. Rebuild the notebook around one clear narrative
    The notebook should answer this practical question:
    "I have logs from policy A and a candidate policy B. How do I compare them offline, which estimators say what, when can I trust them, and how do I read the result?"

Use a stable product-style tone.
Remove release-note/changelog wording.
Do not frame sections as repository migration history.

  2. Make the official high-level API the main path
    The tutorial should primarily teach the current official orchestration path.
    Prefer:
  • BanditSchema
  • LoggedBanditDataset
  • compare_policies(...)
  • optionally compare_policies_multi_target(...) if it clearly helps

OPEEvaluator may be mentioned as an alternative convenience path, but it should not dominate the tutorial.

The notebook should not feel like a low-level estimator plumbing demo first and an official API demo second.

  3. Add one central “method vs oracle” comparison table
    This is the most important new requirement.

Because the notebook uses synthetic data, compute oracle truth and then show a single clear table comparing all implemented estimators on that same dataset.

At minimum include rows for:

  • Replay
  • IPS
  • SNIPS
  • DM
  • DR
  • SNDR
  • Switch-DR
    (and optionally On-policy A as baseline)

At minimum include columns like:

  • estimator
  • V_A
  • V_B
  • Delta
  • V_A_oracle
  • V_B_oracle
  • Delta_oracle
  • abs_error_V_B
  • abs_error_Delta
  • V_B_CI
  • Delta_CI
  • p_value
  • is_significant
  • replay_overlap
  • weight_ess_ratio
  • trust_level
  • compact warning summary / key warnings

The goal is that a reader can immediately see:

  • which methods were close to oracle,
  • which were unstable,
  • which looked misleading,
  • and why.
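
A sketch of how the error columns of such a table could be derived once per-estimator results and oracle values are available. The row format, the helper name, and the numbers are placeholders chosen for illustration; they are not outputs of Policyscope or results from the tutorial.

```python
from typing import Dict, List


def add_oracle_errors(rows: List[Dict], v_b_oracle: float, delta_oracle: float) -> List[Dict]:
    """Attach abs_error_V_B / abs_error_Delta to each row and sort by Delta error."""
    enriched_rows = []
    for row in rows:
        enriched = dict(row)  # copy so the input rows are left untouched
        enriched["V_B_oracle"] = v_b_oracle
        enriched["Delta_oracle"] = delta_oracle
        enriched["abs_error_V_B"] = abs(row["V_B"] - v_b_oracle)
        enriched["abs_error_Delta"] = abs(row["Delta"] - delta_oracle)
        enriched_rows.append(enriched)
    return sorted(enriched_rows, key=lambda r: r["abs_error_Delta"])


# Placeholder numbers, not real estimator output.
rows = [
    {"estimator": "IPS", "V_B": 0.5, "Delta": 0.25},
    {"estimator": "DR", "V_B": 0.375, "Delta": 0.125},
]
table = add_oracle_errors(rows, v_b_oracle=0.375, delta_oracle=0.125)
```

Sorting by `abs_error_Delta` puts the estimators closest to oracle at the top, which matches the reading order the prompt asks for.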
  4. Add a short Russian guide on method validity / applicability inside the tutorial
    Do not rely only on external docs.
    Inside the notebook itself, include a concise Russian-language practical guide that explains when each family of methods is appropriate.

For example:

  • Replay: overlap-dependent diagnostic baseline; not universally reliable under arbitrary contextual logging.
  • IPS: valid with correct propensities and support, but fragile under heavy weights.
  • SNIPS: often more stable than IPS, but biased.
  • DM: depends strongly on outcome-model quality.
  • DR: usually the main practical default when nuisance quality is acceptable.
  • SNDR / Switch-DR: useful when DR suffers from unstable weights.

Keep this practical and compact.
It should help the reader decide when the method is trustworthy, not just define formulas.
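
The IPS versus SNIPS contrast in the guide above can be illustrated with the textbook definitions. This is a plain-Python sketch under the assumption that per-row propensities of the logged action are available for both policies; it is not the library's implementation, and the numbers are made up.

```python
from typing import List, Sequence


def importance_weights(pi_b_taken: Sequence[float], pi_a_taken: Sequence[float]) -> List[float]:
    """w_i = pi_B(a_i | x_i) / pi_A(a_i | x_i) for the action that was actually logged."""
    return [b / a for b, a in zip(pi_b_taken, pi_a_taken)]


def ips(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Inverse propensity scoring: mean of w_i * r_i (unbiased with correct propensities)."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Self-normalized IPS: divides by sum of weights (lower variance, small bias)."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all importance weights are zero: no support for policy B")
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Illustrative values only.
weights = importance_weights([1.0, 0.25, 0.25, 0.0], [0.5, 0.5, 0.5, 0.5])
rewards = [1.0, 0.0, 1.0, 0.0]
v_ips = ips(rewards, weights)      # (2*1 + 0.5*1) / 4
v_snips = snips(rewards, weights)  # (2*1 + 0.5*1) / 3
```

The two estimates differ because the weights do not average to 1 in this small sample; that gap is a quick signal of how much the weights are driving the result.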

  5. Add a clear Russian section about confidence / trust interpretation
    The reader should understand that:
  • point estimate alone is not enough,
  • CI / p-value alone are not enough,
  • diagnostics matter,
  • low overlap / low ESS / heavy weights reduce trust,
  • trust_level is a summary, not magic truth.

Use the current structured outputs of the library where possible:

  • diagnostics
  • nuisance diagnostics if relevant
  • trust_level
  • recommendation
  • warnings / notes

This section should be short but very clear.

  6. Keep only one short low-level intuition section
    It is okay to keep one compact low-level section showing manual estimator intuition.
    But it must be brief.
    Do not duplicate the whole workflow twice.

The main educational value should come from:

  • official high-level API
  • method-vs-oracle comparison
  • validity/trust interpretation
  7. Remove noisy logging from notebook outputs
    The current notebook outputs are too noisy because estimator/bootstrap logs flood stderr.
    Suppress or redirect verbose logging so the saved notebook is readable.
    The final notebook should not contain long repeated blocks like [IPS] ..., [Replay] ... dozens of times.

  8. Improve final structure
    Recommended structure:

  • intro: what Policyscope solves
  • generate/load synthetic logs
  • validate data contract
  • show oracle (synthetic only)
  • one short manual intuition section
  • official compare path across estimators
  • central method-vs-oracle table
  • Russian guide: validity/applicability of methods
  • Russian guide: how to interpret confidence and trust
  • how to adapt to your own data
  9. Re-execute notebook and save clean outputs
    Re-run the notebook end-to-end.
    Commit clean, readable outputs.
    The notebook should be pleasant to read in GitHub, not just executable.

  10. Minimal doc sync only if necessary
    Update README only if the tutorial description clearly needs synchronization.
    Keep this minimal.

Validation expectations:

  • notebook executes successfully
  • changed tests (if any) stay green
  • tutorial content is visibly clearer and more informative than before

Non-goals:

  • no new estimators
  • no new methodology features
  • no large architecture refactor
  • no theory-course expansion

Deliverables:

  • redesigned examples/tutorial.ipynb
  • central estimator-vs-oracle comparison table
  • Russian method-validity guidance inside the tutorial
  • Russian confidence/trust interpretation section
  • reduced logging noise
  • clean executed outputs

Important quality bar:
After the rewrite, a reader should be able to open the notebook and answer these four questions without confusion:

  1. What did each estimator say on this dataset?
  2. Which ones were close to oracle?
  3. When is each method valid or fragile?
  4. How much trust should I place in the reported result?

If you want to do the task in stages, you can first do only:

  • the restructuring,
  • the method-vs-oracle table,
  • the Russian guide to valid/fragile cases,
  • suppression of noisy logging,
    and only then polish the wording.

@Yurashku Yurashku merged commit af7e97c into main Apr 14, 2026
0 of 2 checks passed
@Yurashku Yurashku deleted the codex/finalize-api-polish-for-ope-library branch April 14, 2026 13:55
Owner Author

Operational prompt and execution plan for the next Codex pass after merge to main.


Can Codex do this?

Yes — Codex can implement almost all of the next plan.

Best use of Codex here:

  • structural cleanup of tutorial/docs/examples
  • moving files between examples/, scripts/, and docs/
  • fixing the clear tutorial bug(s)
  • removing package-level logging side effects
  • creating new notebooks / docs with consistent wording
  • updating README navigation and entrypoints

What still benefits from human review after Codex:

  • whether the new tutorial structure matches the intended pedagogy
  • whether the synthetic scenarios are actually convincing
  • whether the decision/trust language is too strong or too weak
  • final readability of notebooks on GitHub

Recommended way to use Codex here

Do not ask for everything in one giant pass.
Use 2-3 focused Codex passes.

Recommended sequence:

  1. Pass 1 — fix obvious defects + restructure examples

    • fix tutorial replay bug
    • remove logging.basicConfig(...) from package import path
    • separate user-facing example materials from script-like experiment runners
    • improve README navigation
  2. Pass 2 — build the educational materials you actually want

    • main quickstart notebook for own data
    • synthetic estimator comparison notebook
    • RU interpretation guide for outputs / trust / when not to trust OPE
  3. Pass 3 — polish / execute / final consistency

    • re-run notebooks
    • tighten wording
    • make cross-links clean
    • ensure tests still pass
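
For the `logging.basicConfig(...)` item in Pass 1, the usual standard-library convention for libraries is to attach a `NullHandler` and leave configuration to the application. The sketch below assumes nothing about the actual contents of the package's `__init__.py`; the logger name is hypothetical.

```python
import logging

# Library side: no basicConfig(...) at import time, only a NullHandler so that
# records are silently dropped unless the application configures logging itself.
logger = logging.getLogger("policyscope_sketch")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


def noisy_step() -> int:
    """Stand-in for an estimator call that emits progress logs."""
    logger.info("[IPS] fitting ...")
    return 1


result = noisy_step()  # emits nothing unless the caller opted in to logging
```

An application or notebook that does want the progress lines can still opt in with its own `logging.basicConfig(level=logging.INFO)`.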

Pre-work before running Codex

Minimal pre-work only:

  • use a fresh branch from current main
  • tell Codex to keep scope tight and avoid estimator-math changes
  • tell Codex explicitly that user-facing pedagogy matters more than showing every internal API path
  • tell Codex to prefer one strong happy-path over exhaustive duplication

No special manual prep is required beyond that.


Prompt for Codex — Pass 1 + 2 combined but still focused

Rework the user-facing learning materials of the Policyscope repository so that a user can actually become productive and informed.

Important context:

  • The core library architecture is already in good shape.
  • The main remaining problems are user-facing: tutorial structure, examples organization, interpretation guidance, and clarity about when OPE results are trustworthy.
  • Keep estimator mathematics unchanged.
  • Do not add new estimators or new architecture layers.
  • The goal is not to expand theory endlessly, but to make the repo teach the user how to use the library and how to interpret outputs responsibly.

Main goals:

  1. Fix obvious defects first
  • Fix the tutorial bug in the current low-level Replay section: replay_value(...) must receive policy-B actions, not piB_taken probabilities.
  • Fix any related overlap computation in the tutorial if it is currently using the wrong quantity.
  • Remove package-level logging side effects from src/policyscope/__init__.py (do not call logging.basicConfig(...) on import).
  • Keep backward compatibility where practical.
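
The Replay bug described above comes down to which quantity is compared with the logged action. The following is a minimal sketch of the replay idea, not the library's `replay_value(...)` (whose real signature is not shown here): rows are kept only where the action policy B would choose equals the logged action, so the third argument must be actions, not `piB_taken` probabilities.

```python
from typing import Hashable, Sequence, Tuple


def replay_value_sketch(
    logged_actions: Sequence[Hashable],
    rewards: Sequence[float],
    policy_b_actions: Sequence[Hashable],
) -> Tuple[float, float]:
    """Return (mean reward on matching rows, share of matching rows)."""
    matched = [
        r
        for a, r, b in zip(logged_actions, rewards, policy_b_actions)
        if a == b  # compare actions with actions; a probability would never match a label
    ]
    overlap = len(matched) / len(rewards) if rewards else 0.0
    value = sum(matched) / len(matched) if matched else float("nan")
    return value, overlap


# Illustrative values only.
value, overlap = replay_value_sketch(
    logged_actions=["x", "y", "x", "z"],
    rewards=[1.0, 0.0, 0.0, 1.0],
    policy_b_actions=["x", "x", "x", "z"],
)
```

Returning the overlap share alongside the value keeps the two quantities computed from the same match mask, which addresses the related overlap bullet as well.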
  2. Reorganize user-facing materials by user intent
    Make the repository easier to navigate by separating:
  • beginner quickstart / own-data usage
  • synthetic estimator comparison
  • script-like experiment runners
  • interpretation/trust guidance

Recommended target structure:

  • examples/quickstart_own_data_ru.ipynb — main user tutorial
  • examples/compare_estimators_vs_oracle_ru.ipynb — synthetic estimator comparison notebook
  • docs/how_to_interpret_ope_outputs_ru.md — practical RU guide for interpreting outputs and trust
  • keep script-like runnable experiment files only if they have a clear role; if needed, move them to a more suitable place such as scripts/ or clearly document them as experiment runners rather than tutorials

You may keep examples/tutorial.ipynb only if it remains useful; otherwise replace it with better named notebooks. Prefer clarity over attachment to the old file name.

  3. Create a true quickstart notebook for applying the library to user data
    Build a notebook whose primary purpose is:
    "How do I run Policyscope on my own dataset, with the official high-level API, and how do I read the result?"

This notebook should emphasize:

  • BanditSchema
  • LoggedBanditDataset
  • compare_policies(...) as the main orchestration path
  • minimal required columns
  • what to do when logged propensities are available
  • what to do when logged propensities are unavailable
  • what Delta, CI, p_value, diagnostics, and trust_level mean

The notebook must include a minimal copy-adapt-run code path for a user’s own DataFrame.
Do not make it overly long.
Optimize for usability.

  4. Create a separate synthetic comparison notebook
    Build a second notebook whose primary purpose is:
    "How do the estimators behave relative to oracle truth under controlled synthetic settings?"

This notebook should:

  • use synthetic data with an intentionally visible non-trivial delta_V / policy-value difference
  • show oracle values
  • compare Replay / IPS / SNIPS / DM / DR / SNDR / Switch-DR on the same dataset
  • include one central method-vs-oracle table
  • help the reader see which estimators were accurate, which were fragile, and why

If useful, include more than one synthetic scenario, for example:

  • a reasonably healthy overlap scenario
  • a poor-overlap scenario
  • a logged-vs-estimated propensity contrast

Keep this notebook pedagogical, not benchmark-heavy.

  5. Add a practical RU interpretation guide
    Create docs/how_to_interpret_ope_outputs_ru.md (or a very similar name).

This guide should answer questions like:

  • What do V_A, V_B, and Delta mean?
  • What do CI and p_value mean, and what do they NOT mean?
  • Why are diagnostics necessary in addition to CI?
  • What does low overlap / low ESS / heavy weights imply?
  • What does trust_level mean and what does it NOT guarantee?
  • When can OPE be used as directional evidence / screening, and when is it not enough to replace an A/B test?
  • How should a user think about choosing among Replay / IPS / SNIPS / DM / DR / SNDR / Switch-DR?

Keep the tone practical, responsible, and clear.
Do not overclaim that any single statistic can certify “safe to skip experimentation”.

  6. Clarify estimator-selection guidance
    Across the quickstart notebook and interpretation guide, clearly explain a practical rule-of-thumb such as:
  • Replay as a support/overlap-dependent baseline
  • IPS/SNIPS as weighted estimators sensitive to support and weight tails
  • DM as model-dependent
  • DR as a common practical default when nuisance quality is acceptable
  • SNDR / Switch-DR as useful robustness variants when weights are unstable

The goal is not perfect theoretical completeness, but actionable user guidance.

  7. Improve repository navigation
    Update README so it becomes a clean navigator rather than trying to be every document at once.
    At minimum, README should clearly point to:
  • quickstart notebook for own data
  • synthetic estimator comparison notebook
  • interpretation guide
  • architecture doc
  • validation harness doc

If script-like files remain in examples/, explain their purpose explicitly.

  8. Keep notebooks readable
  • suppress noisy estimator/bootstrap logging in saved outputs
  • re-run notebooks end-to-end
  • commit clean outputs that are pleasant to read on GitHub
  9. Tests / validation
    At minimum:
  • keep existing tests green if touched
  • ensure notebooks execute successfully
  • update lightweight tests if reporting/output field names or paths changed nearby

Non-goals:

  • no new estimators
  • no new inference methods
  • no major architecture refactor
  • no massive benchmark framework expansion

Deliverables:

  • fixed tutorial bug(s)
  • no package-level logging side effects on import
  • reorganized user-facing materials
  • one quickstart notebook for own data
  • one synthetic estimator-comparison notebook
  • one RU interpretation guide
  • improved README navigation
  • clean executed notebook outputs

Quality bar:
After this pass, a user should be able to answer these questions from the repo materials:

  1. How do I run the library on my own data?
  2. How do I interpret the outputs?
  3. When should I distrust the result?
  4. Which estimator should I start from, and why?
  5. How do the estimators compare to oracle truth on synthetic data?

Operational advice for the Codex run

Ask Codex to:

  • first inspect the current files before rewriting them
  • preserve estimator math
  • prioritize pedagogy and user clarity
  • avoid duplicating the same workflow across multiple notebooks
  • make one notebook the clear happy-path for real usage
  • make one notebook the clear synthetic comparison notebook

After Codex finishes:

  • manually review notebook structure and titles
  • manually skim the saved outputs on GitHub
  • then run one short final cleanup pass if wording still feels off

Owner Author

Updated plan after reviewing additional external feedback.

Short verdict:

  • some points from the external review are highly relevant and should be addressed now;
  • some are correct but belong to a later stage;
  • we should avoid turning the repo into an overbuilt research platform in the first pass.

What is truly relevant now (must-fix / near-term)

  1. Target-type inference bug
    Current outcome-model fitting still keys binary-vs-continuous behavior off the target name ("accept" vs everything else) instead of the semantics/type of the target.
    This hurts the promised “bring your own data” story and should be fixed early.

  2. Action-label diagnostics bug
    Behavior-model top-1 diagnostics should not compare np.argmax(...) column indices directly to raw action labels when labels may be non-0..k-1 or strings.
    This is a real correctness issue and should be fixed early.

  3. Package import logging side effect
    logging.basicConfig(...) on package import is not good library behavior and should be removed.

  4. Canonical user path is still too blurry
    We should explicitly make compare_policies(...) the canonical orchestration entrypoint for docs/tutorials.
    OPEEvaluator can remain as a convenience wrapper, but should not be the main pedagogical path unless it reaches full feature parity.

  5. Learning materials must be split by user intent
    One notebook should not try to be quickstart, own-data cookbook, estimator benchmark, and decision guide all at once.
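
For the action-label diagnostics item (2) above, the fix amounts to mapping each argmax column index back to its label before comparing. The sketch below assumes the behavior model exposes its label order in a `classes` sequence, as scikit-learn classifiers do via `classes_`; the function name and data are illustrative.

```python
from typing import Hashable, Sequence


def top1_accuracy(
    proba_rows: Sequence[Sequence[float]],
    classes: Sequence[Hashable],
    logged_actions: Sequence[Hashable],
) -> float:
    """Share of rows where the most probable label equals the logged action label."""
    if not logged_actions:
        return float("nan")
    hits = 0
    for row, action in zip(proba_rows, logged_actions):
        best_column = max(range(len(row)), key=row.__getitem__)  # argmax column index
        if classes[best_column] == action:  # compare labels, not the raw index
            hits += 1
    return hits / len(logged_actions)


# String labels: comparing the raw index (0, 1, 2) to them would always fail.
accuracy = top1_accuracy(
    proba_rows=[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]],
    classes=["email", "push", "sms"],
    logged_actions=["email", "sms", "email"],
)
```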

What is relevant but should be delayed (later stage)

  1. Full decision framework / “can I skip A/B?” rubric
    Important, but not first-pass material.
    For now we should give responsible interpretation guidance, not a pseudo-certification gate.

  2. Major validation-harness hardening
    More scenarios, stronger regression expectations, less MC noise in oracle, etc. are valuable, but this is second-wave work.

  3. Large-scale repo/tooling overhaul
    requirements-dev, deeper CI hardening, bigger benchmark suite — useful, but not necessary before fixing user-facing correctness and pedagogy.

Updated implementation plan

Phase 1 — correctness and portability fixes

  • fix target-type inference for outcome-model fitting
  • fix action-label bug in nuisance diagnostics
  • remove import-time logging configuration
  • add focused tests for these fixes
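
One way the target-type fix in the first Phase 1 bullet could look: decide binary versus continuous from the observed values rather than from the column name. This is a hedged sketch with a hypothetical helper name; the actual rule chosen for the library may differ (for example, it may accept an explicit override).

```python
import math
from typing import Iterable


def infer_target_kind(values: Iterable) -> str:
    """Return "binary" if all non-missing values are in {0, 1}, else "continuous"."""
    observed = {
        v for v in values
        if not (isinstance(v, float) and math.isnan(v))  # ignore missing values
    }
    if not observed:
        raise ValueError("target has no non-missing values")
    # True/False and 0.0/1.0 compare and hash equal to 0/1, so they count as binary too.
    if observed <= {0, 1}:
        return "binary"
    return "continuous"


kind_accept = infer_target_kind([0, 1, 1, 0])      # an "accept"-style target
kind_revenue = infer_target_kind([0.5, 12.0, 0.0])  # a continuous target
```

A focused test for this fix would use a binary target whose column is not named "accept", which is exactly the case the name-based check gets wrong.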

Phase 2 — define the canonical user-facing path

  • make compare_policies(...) the official tutorial/docs path
  • treat OPEEvaluator as optional convenience wrapper unless feature parity is explicitly expanded
  • update wording so the repo does not overpromise universality

Phase 3 — restructure teaching materials

Create / reorganize materials by user intent:

  • examples/quickstart_own_data_ru.ipynb — main happy path for applying to user data
  • examples/compare_estimators_vs_oracle_ru.ipynb — synthetic estimator comparison
  • docs/how_to_interpret_ope_outputs_ru.md — practical interpretation/trust guide
  • script-like runners kept only if their role is explicit; otherwise move to scripts/ or document clearly

Phase 4 — optional second wave

  • stronger validation harness scenarios
  • more decision-oriented guidance
  • tooling / dev-environment cleanup

Practical prioritization

For the next Codex pass, the best order is:

  1. fix correctness/portability bugs
  2. clean the canonical user path
  3. rebuild learning materials around that path

This keeps the repo from becoming overloaded while still addressing the most user-visible and credibility-critical issues.
