Conversation
Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](actions/upload-pages-artifact@v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
anulum added a commit that referenced this pull request on Mar 21, 2026
Update domain benchmark section with calibrated measurements:
- PubMedQA: score range [0.01, 0.77], best F1=62.1% at t=0.50
- FinanceBench: score range [0.007, 0.63], 80%+ FPR without KB
- Key finding: NLI-only scoring needs KB grounding for discrimination
- Competitive positioning: every claim sourced with measurement date
- Honest Limitations: NLI-only domain scoring weak without KB as #1

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
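The "best F1 at t=0.50" figures above come from sweeping a decision threshold over the score range. A minimal sketch of that sweep, assuming binary labels and scores in [0, 1] (the function name and grid are illustrative, not the project's API):

```python
import numpy as np

def best_f1_threshold(scores, labels, thresholds=None):
    """Sweep decision thresholds; return (best_f1, best_threshold).

    scores: per-claim support scores in [0, 1]
    labels: binary ground truth (1 = supported)
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if thresholds is None:
        # uniform grid over the observed score range
        thresholds = np.linspace(scores.min(), scores.max(), 101)
    best_f1, best_t = 0.0, float(thresholds[0])
    for t in thresholds:
        preds = (scores >= t).astype(int)
        tp = int(((preds == 1) & (labels == 1)).sum())
        fp = int(((preds == 1) & (labels == 0)).sum())
        fn = int(((preds == 0) & (labels == 1)).sum())
        if tp == 0:
            continue  # F1 undefined/zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_t = f1, float(t)
    return best_f1, best_t
```

Note that a threshold tuned this way on the test set is an optimistic upper bound, which is one reason the commit insists on reporting the score range alongside the best-F1 operating point.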
anulum added a commit that referenced this pull request on Apr 12, 2026
…n score_and_save
This fixes the F1 metric mislabel that has confused the entire
FPR-reduction campaign. ``AggreFactMetrics.avg_balanced_acc`` computes the
**per-dataset mean** of balanced accuracies (an unweighted average
across the 11 AggreFact datasets), yet ``score_and_save()`` stored
this value under the field name ``global_balanced_accuracy`` with
a docstring claiming "sample-pooled BA computed once across all
29,320 samples", which it was not.
Consequence: FactCG's stored ``global_balanced_accuracy: 0.7558``
is the per-dataset mean at the global threshold, NOT sample-pooled.
Direct computation shows FactCG's TRUE sample-pooled BA is 0.8142
at the same threshold. We had been comparing the champion's
sample-pooled 82.11 % against FactCG's per-dataset mean 75.58 %
as if they were the same metric, overstating the gap by ~6 pp.
This commit:
* Adds ``_compute_sample_pooled_ba(predictions, labels) -> float``
helper that computes true sample-pooled balanced accuracy on the
flat (preds, labels) pool.
* ``score_and_save()`` now writes FOUR explicit metric fields in a
2×2 matrix of {per-dataset-mean, sample-pooled} × {global
threshold, per-dataset thresholds}:
- ``per_dataset_mean_balanced_accuracy_at_global_threshold``
(= AggreFact leaderboard convention, verified verbatim from
https://llm-aggrefact.github.io/ on 2026-04-12)
- ``per_dataset_mean_balanced_accuracy_at_per_dataset_thresholds``
(post-hoc tuned — our FactCG "77.76 % potential #1" number)
- ``sample_pooled_balanced_accuracy_at_global_threshold``
(true sample-pooled, new)
- ``sample_pooled_balanced_accuracy_at_per_dataset_thresholds``
(true sample-pooled with per-ds tuning, new)
* Legacy aliases ``global_balanced_accuracy`` and
``per_dataset_avg_balanced_accuracy`` are kept for back-compat
and map to the per-dataset-mean variants. A deprecation comment
documents the migration path.
* ``AggreFactMetrics.avg_balanced_acc`` is renamed to
  ``per_dataset_mean_balanced_acc`` (canonical), with ``avg_balanced_acc``
  kept as a deprecated alias. The docstring explains both metrics
  and cites the leaderboard verification.
Full audit trail: docs/internal/experiments_log_2026-04-12.md
Entries 18 and 19.
Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
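The distinction the commit above fixes can be made concrete. A minimal sketch, with illustrative names (not the project's actual helpers), showing how the per-dataset mean and the sample-pooled balanced accuracy diverge on the very same predictions when dataset sizes differ:

```python
import numpy as np

def balanced_accuracy(preds, labels):
    """(TPR + TNR) / 2 on one flat pool of predictions."""
    preds = np.asarray(preds, dtype=int)
    labels = np.asarray(labels, dtype=int)
    tpr = ((preds == 1) & (labels == 1)).sum() / max((labels == 1).sum(), 1)
    tnr = ((preds == 0) & (labels == 0)).sum() / max((labels == 0).sum(), 1)
    return (tpr + tnr) / 2

def per_dataset_mean_ba(per_dataset):
    # leaderboard convention: BA per dataset, then an unweighted mean
    return float(np.mean([balanced_accuracy(p, l) for p, l in per_dataset]))

def sample_pooled_ba(per_dataset):
    # flatten every dataset into one pool, then compute BA once
    preds = np.concatenate([p for p, _ in per_dataset])
    labels = np.concatenate([l for _, l in per_dataset])
    return float(balanced_accuracy(preds, labels))
```

Because the pooled metric weights each sample equally while the per-dataset mean weights each dataset equally, the two only coincide in special cases, which is exactly why storing one under the other's name overstated the gap.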
anulum added a commit that referenced this pull request on Apr 12, 2026
- 75.8% → 75.6% per-dataset mean (leaderboard #6, verified 2026-04-12)
- Add FactCG-tuned 77.76% (potential #1, ahead of Bespoke-MiniCheck 77.4%)
- Add leaderboard rank column
- Remove MiniCheck-DeBERTa-L row (not on published leaderboard)
- Simplify Gemma routed callout (remove sample-pooled per-family breakdown, which was mixing metrics)

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
anulum added a commit that referenced this pull request on Apr 18, 2026
director_ai.core.trajectory ships the foundation for the 2026-04-21 roadmap Tier 1 #1 feature: pre-execution Monte-Carlo halt based on N simulated draws from an injected actor.

TrajectorySimulator runs n_simulations independent draws (default 8) with deterministic per-draw seeds (base_seed + i), feeds each trajectory's joined text to a CoherenceScorer-shaped verdict producer, and aggregates the results into a PreflightVerdict:
- halt_rate / mean_coherence / std_coherence
- 95% empirical credible interval over the per-trajectory scores
- recommended action (``proceed`` / ``warn`` / ``halt``) based on two halt-rate thresholds (warn 0.25, halt 0.50 by default)
- the raw TrajectoryResult list so operators can inspect which draws failed

Seeded determinism means two preflight calls with the same prompt produce byte-identical verdicts: reproducibility for forensic incident review and for regression tests on preflight decisions. An optional on_trajectory callback runs per draw; exceptions from the callback are swallowed so a broken sink cannot abort the loop.

Follow-ups are tracked separately (distilled-actor integration, CoherenceAgent wiring, conformal calibration against historical traces, Rust-accelerated Monte-Carlo loop). Foundation scope matches the roadmap memo.

Coverage: 17 tests covering construction validation, proceed / warn / halt bands, deterministic replay, seed variation, per-trajectory callback, callback failure isolation, verdict shape, and min/max/std aggregation. mypy clean on 194 source files.

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
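The seed-and-aggregate flow described above can be sketched as follows. This is a hedged approximation: the class and parameter names mirror the commit description, but the real director_ai API may differ, and the scorer here is reduced to a plain callable for illustration.

```python
import statistics
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TrajectoryResult:
    seed: int
    coherence: float
    halted: bool

@dataclass
class PreflightVerdict:
    halt_rate: float
    mean_coherence: float
    std_coherence: float
    action: str  # "proceed" / "warn" / "halt"
    trajectories: List[TrajectoryResult] = field(default_factory=list)

def preflight(score: Callable[[int], float],
              n_simulations: int = 8,
              base_seed: int = 0,
              halt_below: float = 0.5,
              warn_rate: float = 0.25,
              halt_rate_threshold: float = 0.50) -> PreflightVerdict:
    """Run N seeded draws through an injected scorer and aggregate."""
    results = []
    for i in range(n_simulations):
        seed = base_seed + i          # deterministic per-draw seed
        c = score(seed)               # same seed -> same score -> replayable
        results.append(TrajectoryResult(seed, c, c < halt_below))
    halt_rate = sum(r.halted for r in results) / n_simulations
    scores = [r.coherence for r in results]
    if halt_rate >= halt_rate_threshold:
        action = "halt"
    elif halt_rate >= warn_rate:
        action = "warn"
    else:
        action = "proceed"
    return PreflightVerdict(halt_rate, statistics.mean(scores),
                            statistics.pstdev(scores), action, results)
```

Because every draw's seed is a pure function of base_seed, calling preflight twice with the same inputs yields identical verdicts, which is the property the regression tests on preflight decisions rely on.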
Bumps actions/upload-pages-artifact from 3 to 4.
Release notes
Sourced from actions/upload-pages-artifact's releases.
Commits
- 7b1f4a7 Merge pull request #127 from heavymachinery/pin-sha
- 4cc19c7 Pin actions/upload-artifact to SHA
- 2d163be Merge pull request #107 from KittyChiu/main
- c704843 fix: linted README
- 9605915 Merge pull request #106 from KittyChiu/kittychiu/update-readme-1
- e59cdfe Update README.md
- a2d6704 doc: updated usage section in readme
- 984864e Merge pull request #105 from actions/Jcambass-patch-1
- 45dc788 Add workflow file for publishing releases to immutable action package
- efaad07 Merge pull request #102 from actions/hidden-files

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)