RYS is a small reproducibility repo for relayering experiments on decoder LLMs.
The core idea is simple: duplicate part of a model's existing layer path without changing any weights.
A standard single-block configuration is written as (i, j):
- run layers `0 .. j-1`
- then jump back and run layers `i .. N-1`
- so layers `i .. j-1` are traversed twice
Baseline is (0,0), which means no duplication.
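The traversal rule above can be sketched as a simple layer-list expansion. This is a minimal illustration only: `expand_config` is a hypothetical name, and the repo's actual expansion logic lives in `src/core/`.

```python
# Minimal sketch of the (i, j) traversal rule described above.
# expand_config is a hypothetical name; the repo's own layer-list
# expansion lives in src/core/.
def expand_config(i, j, num_layers):
    if (i, j) == (0, 0):
        return list(range(num_layers))  # baseline: no duplication
    # Run layers 0..j-1, then jump back and run layers i..N-1,
    # so layers i..j-1 are traversed twice.
    return list(range(0, j)) + list(range(i, num_layers))

# Block (2, 4) on a 6-layer model: layers 2 and 3 run twice.
print(expand_config(2, 4, 6))  # [0, 1, 2, 3, 2, 3, 4, 5]
```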
This repo contains the pieces needed to reproduce the main experimental workflows:
- scanner for full `(i, j)` sweeps
- fixed Math and EQ probe sets
- multi-block beam search
- XGBoost surrogate pipeline
- model exporter for writing relayered Hugging Face checkpoints
- heatmap and balanced Math+EQ analysis code
It does not include the private historical runs, blog drafts, ad hoc notebooks, or dataset generation/calibration code.
`datasets/`
- `math_16.json`
- `math_120.json`
- `eq_16.json`
- `eq_140.json`
- `manifest.json`

`src/core/`
- config parsing
- layer-list expansion
- relayer wrappers for dense and MoE-style stacks

`src/workers/`
- Math and EQ benchmark workers
- queue handling
- model loading helpers

`src/utils/`
- balanced Math+EQ analysis
- heatmap helpers
- surrogate utilities

`scripts/`
- sweep setup
- ExLlama workers
- beam search
- surrogate pipeline
- repeat-sweep helpers

`hf_export/`
- checkpoint export
- HF upload helper
- Colab notebook
This repo ships the fixed benchmark subsets used by the public workflow:
- `datasets/math_16.json`
- `datasets/math_120.json`
- `datasets/eq_16.json`
- `datasets/eq_140.json`
Important notes:
- `eq_16` and `eq_140` are first-pass-only EQ subsets.
- `datasets/manifest.json` records provenance and checksums.
- The file named `eq_140.json` currently contains 139 records; this is documented in the manifest and preserved for continuity with the original naming.
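Since `datasets/manifest.json` records checksums, a spot check along these lines may be useful. This is a hedged sketch: the manifest's actual schema is not documented here, and the `sha256` field name is an assumption.

```python
import hashlib
import json
import pathlib

def sha256_of(path):
    # Hash the raw bytes of a dataset file.
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def matches_manifest(manifest_path, dataset_name, dataset_path):
    # Assumed manifest shape: {"eq_16.json": {"sha256": "..."}, ...};
    # adjust the lookup to whatever datasets/manifest.json actually stores.
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    return manifest[dataset_name]["sha256"] == sha256_of(dataset_path)
```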
Python dependencies are managed with uv:
```
uv sync
```

For ExLlama scanning you also need:

- a local `exllamav3` checkout
- an EXL3-compatible model directory
- CUDA-capable hardware if you want real scan throughput
Set:
```
export EXLLAMAV3_PATH=/path/to/exllamav3
```

Example for a 64-layer model:
```
uv run python scripts/init_queue.py \
  --num-layers 64 \
  --queue-file results/demo/queue.json \
  --results-file results/demo/combined_results.pkl
```

This writes canonical layer-list configs, including the baseline `(0, 0)`.
This is the main fast scan path in the public repo. It loads the EXL3 model once and scores Math and EQ in one mixed pass per config.
```
uv run python scripts/run_exllama_math_eq_combined_worker.py \
  --queue-file results/demo/queue.json \
  --combined-results-file results/demo/combined_results.pkl \
  --math-results-file results/demo/math_results.pkl \
  --eq-results-file results/demo/eq_results.pkl \
  --model-dir /path/to/model.exl3 \
  --math-dataset-path datasets/math_16.json \
  --eq-dataset-path datasets/eq_16.json \
  --math-max-new 64 \
  --eq-max-new 64 \
  --auto-cache
```

Then analyze the results:

```
uv run python scripts/analyze_results.py \
  --math-scores results/demo/math_results.pkl \
  --eq-scores results/demo/eq_results.pkl \
  --out-dir results/demo/analysis \
  --num-layers 64
```

This produces:
- top-ranked balanced configs
- scatter plots
- balanced Math+EQ heatmap artifacts
If you want a simple Docker entrypoint:
```
MODEL_DIR=/path/to/model.exl3 \
EXLLAMAV3_PATH=/path/to/exllamav3 \
./scripts/run_exllama_docker.sh
```

Useful overrides: `QUEUE_FILE`, `COMBINED_RESULTS_FILE`, `MATH_RESULTS_FILE`, `EQ_RESULTS_FILE`, `MATH_DATASET`, `EQ_DATASET`, `DEVICE`, `RESERVE_PER_DEVICE`, `USE_PER_DEVICE`.
Beam search composes multiple repeated blocks and benchmarks only unseen configs.
Example:
```
uv run python scripts/beam_search.py \
  --model-path /path/to/hf-model \
  --num-layers 64 \
  --seed-math-results results/demo/math_results.pkl \
  --seed-eq-results results/demo/eq_results.pkl \
  --math-dataset-path datasets/math_16.json \
  --eq-dataset-path datasets/eq_16.json \
  --work-dir results/demo/beam-search
```

Notes:

- beam search uses the Hugging Face worker path in `src/workers/`
- seed pickles should come from an already measured single-block scan
- state under `--work-dir` is resume-friendly
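As a rough sketch of what composing multiple repeated blocks can mean, here is one plausible generalization of the single-block rule. The helper name is illustrative, and the repo's actual block-composition semantics may differ.

```python
# Hypothetical multi-block expansion: apply the single-block rule
# (run forward to j-1, then jump back to i) once per block, assuming
# blocks are sorted and non-overlapping. Illustrative only.
def expand_blocks(blocks, num_layers):
    layers, start = [], 0
    for i, j in blocks:
        layers.extend(range(start, j))  # run forward to layer j-1
        start = i                       # jump back to layer i
    layers.extend(range(start, num_layers))
    return layers

# One block reproduces the single-block (i, j) behavior:
print(expand_blocks([(2, 4)], 6))  # [0, 1, 2, 3, 2, 3, 4, 5]
# Two blocks, e.g. (2, 4) and (5, 6) on an 8-layer model:
print(expand_blocks([(2, 4), (5, 6)], 8))  # [0, 1, 2, 3, 2, 3, 4, 5, 5, 6, 7]
```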
The surrogate uses per-layer repeat counts as features. Predicted scores are only for ranking candidates; the final benchmark scores must be measured.
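A minimal sketch of what a per-layer repeat-count feature vector looks like (the function name is illustrative, not the repo's API):

```python
# Turn an expanded layer list into a fixed-length feature vector:
# entry k is how many times physical layer k appears in the path.
def repeat_features(layer_list, num_layers):
    counts = [0] * num_layers
    for idx in layer_list:
        counts[idx] += 1
    return counts

# A 6-layer model with block (2, 4): layers 2 and 3 counted twice.
print(repeat_features([0, 1, 2, 3, 2, 3, 4, 5], 6))  # [1, 1, 2, 2, 1, 1]
```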
```
uv run python scripts/train_surrogate.py \
  --single-block-math-results results/demo/math_results.pkl \
  --single-block-eq-results results/demo/eq_results.pkl \
  --beam-math-results results/demo/beam-search/beam_math_results.pkl \
  --beam-eq-results results/demo/beam-search/beam_eq_results.pkl \
  --out-dir results/demo/surrogate \
  --num-layers 64
```

```
uv run python scripts/generate_candidates.py \
  --num-layers 64 \
  --max-extra-layers 12 \
  --count 2000000 \
  --output-csv results/demo/surrogate/candidates.csv
```

```
uv run python scripts/score_candidates.py \
  --model-dir results/demo/surrogate \
  --candidates-file results/demo/surrogate/candidates.csv \
  --output-csv results/demo/surrogate/top_scored.csv \
  --top-k 100
```

```
uv run python scripts/build_topk_config.py \
  --top-candidates-csv results/demo/surrogate/top_scored.csv \
  --num-layers 64 \
  --output-config results/demo/surrogate/top100.config
```

Then benchmark those configs with the same Math/EQ harness used elsewhere.
hf_export writes a relayered Hugging Face checkpoint with the duplicated layers physically materialized into the safetensor shards.
Single block:
```
uv run python -m hf_export.export_model \
  --source /path/to/base-model \
  --source-repo-id some/model \
  --output exports/model-block-30-34 \
  --blocks "30,34"
```

Multi-block:
```
uv run python -m hf_export.export_model \
  --source /path/to/base-model \
  --source-repo-id some/model \
  --output exports/model-31_34__43_45 \
  --blocks "31,34;43,45"
```

Upload:
```
export HF_TOKEN=...
uv run python -m hf_export.upload_to_hf \
  --folder exports/model-block-30-34 \
  --repo-id your-name/model-block-30-34
```

The export manifest is written to `rys_export_manifest.json`.
A minimal Colab notebook is available at:
`hf_export/colab/export_upload_minimal.ipynb`
The reusable plotting helpers live in `src/utils/heatmaps.py`.
For standard (i, j) scans, the normal entrypoint is:
```
uv run python scripts/analyze_results.py \
  --math-scores results/demo/math_results.pkl \
  --eq-scores results/demo/eq_results.pkl \
  --out-dir results/demo/analysis \
  --num-layers 64
```

For per-layer repeat sweeps:
```
uv run python scripts/plot_repeat_heatmaps.py \
  --results-file results/demo/repeatx8_math_results.pkl \
  --manifest-file results/demo/repeatx8_manifest.json \
  --out-dir results/demo/repeatx8_heatmaps
```

- The public scan path is ExLlama-first.
- Beam search uses the Hugging Face worker path rather than ExLlama.
- This repo assumes decoder-layer architectures; unsupported architectures should fail explicitly.
- The HF exporter currently detects decoder stacks under `model.language_model.layers`, `model.layers`, and `language_model.layers`.
- Dataset generation and recalibration are intentionally out of scope here.
If you are starting from scratch, do this in order:
1. run a small `(i, j)` scan with `math_16` + `eq_16`
2. inspect the heatmaps and top balanced configs
3. run beam search seeded from the single-block scan
4. train the surrogate on measured configs and benchmark its top candidates
5. rerun the strongest candidates on `math_120` + `eq_140`
6. export any final variants you want to share