
chore(deps): Bump actions/upload-pages-artifact from 3 to 4 #1

Merged
anulum merged 1 commit into main from
dependabot/github_actions/actions/upload-pages-artifact-4 on Mar 1, 2026

Conversation

Contributor

dependabot[bot] commented on behalf of github on Mar 1, 2026

Bumps actions/upload-pages-artifact from 3 to 4.

Release notes

Sourced from actions/upload-pages-artifact's releases.

v4.0.0

What's Changed

Full Changelog: actions/upload-pages-artifact@v3.0.1...v4.0.0

v3.0.1

Changelog

See details of all code changes since previous release.

Commits
  • 7b1f4a7 Merge pull request #127 from heavymachinery/pin-sha
  • 4cc19c7 Pin actions/upload-artifact to SHA
  • 2d163be Merge pull request #107 from KittyChiu/main
  • c704843 fix: linted README
  • 9605915 Merge pull request #106 from KittyChiu/kittychiu/update-readme-1
  • e59cdfe Update README.md
  • a2d6704 doc: updated usage section in readme
  • 984864e Merge pull request #105 from actions/Jcambass-patch-1
  • 45dc788 Add workflow file for publishing releases to immutable action package
  • efaad07 Merge pull request #102 from actions/hidden-files
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 3 to 4.
- [Release notes](https://github.com/actions/upload-pages-artifact/releases)
- [Commits](actions/upload-pages-artifact@v3...v4)

---
updated-dependencies:
- dependency-name: actions/upload-pages-artifact
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Contributor Author

dependabot[bot] commented on behalf of github on Mar 1, 2026

Labels

The following labels could not be found: ci, dependencies. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.

@anulum anulum merged commit 11b5c65 into main Mar 1, 2026
8 checks passed
@dependabot dependabot Bot deleted the dependabot/github_actions/actions/upload-pages-artifact-4 branch March 1, 2026 23:34
anulum added a commit that referenced this pull request Mar 21, 2026
Update domain benchmark section with calibrated measurements:
- PubMedQA: score range [0.01, 0.77], best F1=62.1% at t=0.50
- FinanceBench: score range [0.007, 0.63], 80%+ FPR without KB
- Key finding: NLI-only scoring needs KB grounding for discrimination
- Competitive positioning: every claim sourced with measurement date
- Honest Limitations: "NLI-only domain scoring is weak without a KB" listed as the #1 limitation

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
anulum added a commit that referenced this pull request Apr 12, 2026
…n score_and_save

This fixes the F1 metric mislabel that has confused the entire
FPR-reduction campaign. ``AggreFactMetrics.avg_balanced_acc`` computes
the **per-dataset mean** of balanced accuracies (an unweighted average
across the 11 AggreFact datasets). ``score_and_save()`` stored this
value under the field name ``global_balanced_accuracy`` with a
docstring that said "sample-pooled BA computed once across all
29,320 samples" — which it was not.

Consequence: FactCG's stored ``global_balanced_accuracy: 0.7558``
is the per-dataset mean at the global threshold, NOT sample-pooled.
Direct computation shows FactCG's TRUE sample-pooled BA is 0.8142
at the same threshold. We had been comparing the champion's
sample-pooled 82.11 % against FactCG's per-dataset mean 75.58 %
as if they were the same metric, overstating the gap by ~6 pp.
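
To make the distinction concrete, here is a minimal sketch of the two
quantities. It uses scikit-learn's ``balanced_accuracy_score`` and
illustrative helper names and data layout; it is not the implementation
added by this commit.

```python
# Minimal sketch of the two quantities being conflated.  Uses scikit-learn;
# the helper names and data layout are illustrative, not the project's code.
from sklearn.metrics import balanced_accuracy_score


def per_dataset_mean_ba(datasets):
    """Unweighted mean of balanced accuracy computed separately per dataset.

    ``datasets`` is a list of (labels, predictions) pairs, one per dataset.
    """
    scores = [balanced_accuracy_score(y, p) for y, p in datasets]
    return sum(scores) / len(scores)


def sample_pooled_ba(datasets):
    """Balanced accuracy computed once on the flattened (labels, preds) pool."""
    labels = [y for ys, _ in datasets for y in ys]
    preds = [p for _, ps in datasets for p in ps]
    return balanced_accuracy_score(labels, preds)


# The two numbers coincide only in special cases; with 11 datasets of very
# different sizes and class balances they can differ by several points,
# which is the ~6 pp gap described above.
```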

This commit:

* Adds ``_compute_sample_pooled_ba(predictions, labels) -> float``
  helper that computes true sample-pooled balanced accuracy on the
  flat (preds, labels) pool.
* ``score_and_save()`` now writes FOUR explicit metric fields in a
  2×2 matrix of {per-dataset-mean, sample-pooled} × {global
  threshold, per-dataset thresholds} (an illustrative shape is
  sketched after this list):
  - ``per_dataset_mean_balanced_accuracy_at_global_threshold``
    (= AggreFact leaderboard convention, verified verbatim from
    https://llm-aggrefact.github.io/ on 2026-04-12)
  - ``per_dataset_mean_balanced_accuracy_at_per_dataset_thresholds``
    (post-hoc tuned — our FactCG "77.76 % potential #1" number)
  - ``sample_pooled_balanced_accuracy_at_global_threshold``
    (true sample-pooled, new)
  - ``sample_pooled_balanced_accuracy_at_per_dataset_thresholds``
    (true sample-pooled with per-ds tuning, new)
* Legacy aliases ``global_balanced_accuracy`` and
  ``per_dataset_avg_balanced_accuracy`` are kept for back-compat
  and map to the per-dataset-mean variants. A deprecation comment
  documents the migration path.
* ``AggreFactMetrics.avg_balanced_acc`` is renamed to
  ``per_dataset_mean_balanced_acc`` (canonical) with ``avg_balanced_acc``
  kept as a deprecated alias. The docstring explains both metrics
  and cites the leaderboard verification.
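
The block below sketches the shape of the new FactCG metrics using only the
numbers quoted in this message. The fourth (sample-pooled, per-dataset
thresholds) value is not quoted here, so it is left unset; the dict is
illustrative, not the actual ``score_and_save()`` output format.

```python
# Illustrative shape only: field names are verbatim from the list above,
# values are the FactCG numbers quoted in this message, and the dict is not
# the actual score_and_save() output format.
factcg_metrics = {
    "per_dataset_mean_balanced_accuracy_at_global_threshold": 0.7558,
    "per_dataset_mean_balanced_accuracy_at_per_dataset_thresholds": 0.7776,
    "sample_pooled_balanced_accuracy_at_global_threshold": 0.8142,
    "sample_pooled_balanced_accuracy_at_per_dataset_thresholds": None,  # not quoted above
    # Legacy aliases (e.g. global_balanced_accuracy) are kept for back-compat
    # and map to the per-dataset-mean variants, per the bullet above.
    "global_balanced_accuracy": 0.7558,
}
```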

Full audit trail: docs/internal/experiments_log_2026-04-12.md
Entries 18 and 19.

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
anulum added a commit that referenced this pull request Apr 12, 2026
- 75.8% → 75.6% per-dataset mean (leaderboard #6, verified 2026-04-12)
- Add FactCG-tuned 77.76% (potential #1, ahead of Bespoke-MiniCheck 77.4%)
- Add leaderboard rank column
- Remove MiniCheck-DeBERTa-L row (not on published leaderboard)
- Simplify Gemma routed callout (remove sample-pooled per-family breakdown
  which was mixing metrics)

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
anulum added a commit that referenced this pull request Apr 18, 2026
director_ai.core.trajectory ships the foundation for the 2026-04-21
roadmap Tier 1 #1 feature: pre-execution Monte-Carlo halt based on
N simulated draws from an injected actor.

TrajectorySimulator runs n_simulations independent draws (default
8) with deterministic per-draw seeds (base_seed + i), feeds each
trajectory's joined text to a CoherenceScorer-shaped verdict
producer, and aggregates the results into a PreflightVerdict:

- halt_rate / mean_coherence / std_coherence
- 95% empirical credible interval over the per-trajectory scores
- recommended action (``proceed`` / ``warn`` / ``halt``) based on
  two halt-rate thresholds (warn 0.25, halt 0.50 by default)
- the raw TrajectoryResult list so operators can inspect which
  draws failed

Seeded determinism means two preflight calls with the same prompt
produce byte-identical verdicts — reproducibility for forensic
incident review and for regression tests on preflight decisions.

Optional on_trajectory callback per draw; exceptions from the
callback are swallowed so a broken sink cannot abort the loop.
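
For orientation, a minimal, self-contained sketch of this preflight loop
follows. The class and field names mirror the message above, but every
signature here is an assumption for illustration rather than the shipped
director_ai.core.trajectory API.

```python
# Minimal sketch of the pre-execution Monte-Carlo halt described above.
# All signatures are assumptions for illustration; the real API may differ.
import statistics
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class TrajectoryResult:
    seed: int
    text: str
    coherence: float
    halted: bool


@dataclass
class PreflightVerdict:
    halt_rate: float
    mean_coherence: float
    std_coherence: float
    credible_interval_95: Tuple[float, float]
    recommended_action: str  # "proceed" / "warn" / "halt"
    trajectories: List[TrajectoryResult] = field(default_factory=list)


class TrajectorySimulator:
    def __init__(self, actor, scorer, n_simulations: int = 8, base_seed: int = 0,
                 warn_threshold: float = 0.25, halt_threshold: float = 0.50,
                 on_trajectory: Optional[Callable[[TrajectoryResult], None]] = None):
        if n_simulations < 1:
            raise ValueError("n_simulations must be >= 1")
        self.actor = actor        # injected actor: (prompt, seed) -> text
        self.scorer = scorer      # CoherenceScorer-shaped: text -> (score, halted)
        self.n_simulations = n_simulations
        self.base_seed = base_seed
        self.warn_threshold = warn_threshold
        self.halt_threshold = halt_threshold
        self.on_trajectory = on_trajectory

    def preflight(self, prompt: str) -> PreflightVerdict:
        results: List[TrajectoryResult] = []
        for i in range(self.n_simulations):
            seed = self.base_seed + i                 # deterministic per-draw seed
            text = self.actor(prompt, seed=seed)      # one simulated trajectory
            score, halted = self.scorer(text)
            result = TrajectoryResult(seed, text, score, halted)
            if self.on_trajectory is not None:
                try:
                    self.on_trajectory(result)        # a broken sink must not abort the loop
                except Exception:
                    pass
            results.append(result)

        scores = sorted(r.coherence for r in results)
        halt_rate = sum(r.halted for r in results) / len(results)
        # Crude empirical 95% interval over the per-trajectory scores.
        lo = scores[round(0.025 * (len(scores) - 1))]
        hi = scores[round(0.975 * (len(scores) - 1))]
        action = ("halt" if halt_rate >= self.halt_threshold
                  else "warn" if halt_rate >= self.warn_threshold
                  else "proceed")
        return PreflightVerdict(
            halt_rate=halt_rate,
            mean_coherence=statistics.mean(scores),
            std_coherence=statistics.pstdev(scores),
            credible_interval_95=(lo, hi),
            recommended_action=action,
            trajectories=results,
        )
```

Two ``preflight()`` calls with the same prompt, actor, and base_seed replay
the same per-draw seeds and therefore reproduce the same verdict, which is
the determinism property described above.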

Follow-ups tracked separately (distilled-actor integration,
CoherenceAgent wiring, conformal calibration against historical
traces, Rust-accelerated Monte-Carlo loop). Foundation scope
matches the roadmap memo.

Coverage: 17 tests covering construction validation, proceed /
warn / halt bands, deterministic replay, seed variation,
per-trajectory callback, callback failure isolation, verdict
shape, min/max/std aggregation.

mypy clean on 194 source files.

Co-Authored-By: Arcane Sapience <protoscience@anulum.li>
