Update README.md #2

Merged
hrdkbhatnagar merged 2 commits into main from readme-new-1 on Dec 17, 2025

Conversation

@hrdkbhatnagar
Collaborator

No description provided.

Comment thread README.md Outdated
Benchmark scores are computed after post-training, for all but the "base model" score.

- All scores are averages over 4 models (Qwen-3-1.7B, Qwen-3-4B, SmolLM3-3B and Gemma-3-4B).
+ All scores are averages over 4 models (Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, and Gemma-3-4B-IT).
Collaborator

@rank-and-file Dec 17, 2025

Nitpick: Here we can leave Gemma-3-4B (or Gemma3-4B), because "IT" means instruction-tuned, and we use the base model for the agents and the instruction-tuned model for the human baseline only.

Collaborator Author

fixed

Comment thread README.md Outdated
Comment on lines 95 to 97
Add your code to `agents/<agent_name>/` with:
1. `solve.sh` - Script that calls the agent

Collaborator

This renders a bit oddly (the `1.` seems off).

Collaborator Author

fixed

@hrdkbhatnagar merged commit 6c0bfb7 into main on Dec 17, 2025
@hrdkbhatnagar deleted the readme-new-1 branch on January 11, 2026 11:55
JackPayne123 added a commit to JackPayne123/PostTrainBench that referenced this pull request May 11, 2026
Re-ran the full 22-eval baseline against base Qwen/Qwen3-1.7B IT on
image :18 (was :16). _index__limit100.json now records git_sha=5352aca,
computed_at=2026-05-11T03:39Z. Same four evals still fail under :18:
bfcl, spiralbench_mini, political_bias_openai, moru (see CHANGELOG
2026-05-11 entry + design-todos aisa-group#2 / aisa-group#6 for each).

Source run: 2026-05-11_11-59_baseline_qwen_qwen3-1.7b_limit100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants