EMO is a new Mixture-of-Experts model trained so that modular structure emerges during pretraining without requiring human-defined priors. EMO enables selective expert use, down to 12.5% of total experts, with minimal performance degradation. We find that its expert groups specialize to higher-level topics and capabilities rather than low-level lexical patterns.
- Installation
- Released Models
- Inference
- Training scripts
- Evaluation scripts
- Contact and Contributing
- Citing
git clone https://github.com/allenai/EMO.git
cd EMO
conda create -n emo python==3.12
conda activate emo
uv pip install -e ".[all]"
uv pip install --upgrade 'chardet>=7'
All checkpoints are available in the EMO collection on the Hugging Face Hub.
| Model | Active | Total | Pretraining (1T) | Annealing (50B) | Description |
|---|---|---|---|---|---|
| allenai/Emo_1b14b_1T | 1B | 14B | EMO [train_script] | EMO [train_script] | Main EMO release |
| allenai/StdMoE_1b14b_1T | 1B | 14B | standard [train_script] | standard [train_script] | Architecture-matched standard MoE baseline |
Smaller-scale checkpoints used for memory-matched comparisons. These models were not midtrained.
| Model | Active | Total | Pretraining (130B) | Description |
|---|---|---|---|---|
| allenai/Emo_1b14b_130B | 1B | 14B | EMO [train_script] | EMO at the 130B-token ablation scale |
| allenai/StdMoE_1b14b_130B | 1B | 14B | standard [train_script] | Standard MoE baseline at the 130B-token scale |
| allenai/StdMoE_1b4b_130B | 1B | 4B | standard [train_script] | Memory-matched standard MoE with 32 experts ("Reg. MoE @ 32" in Figure 1), used as a memory-matched baseline for EMO's 32-expert subsets |
| allenai/Dense_1b_130B | 1B | 1B | dense LM [train_script] | Dense baseline matched to active parameters ("Dense @ 8" in Figure 1), used as a memory-matched baseline for EMO's 8-expert subsets |
Checkpoints used in Appendix B.4 to test whether modularity can be induced after pretraining via annealing alone, rather than during pretraining.
| Model | Active | Total | Pretraining (1T) | Annealing (50B) | Description |
|---|---|---|---|---|---|
| allenai/StdMoE_1b14b_1T_Preanneal | 1B | 14B | standard [train_script] | — | Standard MoE checkpoint after 1T-token pretraining, before any annealing. Starting point for the EMO-anneal experiment |
| allenai/StdMoE_1b14b_1T_EmoAnnealed | 1B | 14B | standard [train_script] | EMO [train_script] | EMO-anneal: a standard MoE annealed under the document-level expert pool constraint for 50B tokens |
See Released Models for the available checkpoints. All inference snippets below require trust_remote_code=True since the models use custom modeling code from the ryanyxw/transformers fork. (Note: you do not need to clone this fork yourself; the Hugging Face Hub pulls the necessary code when you load the model with trust_remote_code=True.)
You can use our Hugging Face transformers integration to run inference on the released checkpoints:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "allenai/Emo_1b14b_1T"
olmo = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
# Optional: move the inputs and model to GPU
# inputs = {k: v.to('cuda') for k, v in inputs.items()}
# olmo = olmo.to('cuda')
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0, top_p=0.7)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
Alternatively, with the Hugging Face pipeline abstraction:
from transformers import pipeline
olmo_pipe = pipeline("text-generation", model="allenai/Emo_1b14b_1T", trust_remote_code=True)
print(olmo_pipe("Language modeling is"))
vLLM provides high-throughput inference. We ship a small out-of-tree plugin at src/vllm_plugin/ that registers EmoForCausalLM with vLLM's native model registry.
pip install 'vllm>=0.11.0'
pip install -e src/vllm_plugin  # optional; only needed for the native path
You can run offline batched inference:
from vllm import LLM, SamplingParams
llm = LLM(model="allenai/Emo_1b14b_1T", trust_remote_code=True)
sampling_params = SamplingParams(temperature=1.0, top_p=0.7)
prompts = ["Language modeling is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
For more details, see the vLLM documentation.
Project-specific pretraining recipes live in scripts. Please refer to Released Models for the training scripts corresponding to each released checkpoint.
Note: these scripts train on the exact same data as OLMoE, which is publicly accessible here. The current pretraining script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
Run a script locally:
bash scripts/models/dense_1b_lr-4e-3_0213.sh
Submit it as a Beaker job:
MODE=beaker bash scripts/models/dense_1b_lr-4e-3_0213.sh
Override paths via env vars before launching:
- `PREFIX` — output root
- `MODELS_DIR` — derived from `PREFIX` (`${PREFIX}/models`)
- `DATASET_CACHE` — tokenizer-mapped dataset cache
Override Beaker cluster sizing per script with BEAKER_GPUS=8 BEAKER_NODES=4 ....
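For example, a hypothetical launch that combines the overrides above (the paths here are placeholders, not defaults shipped with the repo; MODELS_DIR is derived from PREFIX, so it does not need to be set):

```bash
# Placeholder paths -- substitute your own storage locations.
PREFIX=/data/emo_runs \
DATASET_CACHE=/data/olmoe_tokenized_cache \
BEAKER_GPUS=8 BEAKER_NODES=4 \
MODE=beaker bash scripts/models/dense_1b_lr-4e-3_0213.sh
```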
After training, OLMo-core checkpoints can be converted to the HuggingFace format (suitable for inference and the evaluation pipelines below) with scripts/convert_olmo_to_hf.sh.
The launch scripts in scripts/selective_hf/ exercise the full router-activation → expert selection → finetuning → eval pipeline on the released checkpoints. Each (model × keep-k × task × method) combination lands in its own subdirectory under selective_evals_final/<model>/..., with the pruned-expert model, finetuned checkpoint, and per-checkpoint metrics all colocated. Three scripts target different questions:
| Script | Investigates | Sweep |
|---|---|---|
| launch_selective_hf.sh | Main selective-expert evaluation — how each released model performs when only a subset of experts is retained for a given task (Figure 3 of the paper). | All released models × keep-k ∈ {8, 16, 32, 64, 128} × MC9 / Gen5 / MMLU / MMLU-Pro / GSM8K task groups. |
| launch_selective_method_hf.sh | Robustness to the choice of expert-selection method (Figure 4 of the paper). | {layerwise, easy_ep, random} selection methods × main 1T models × keep-k × tasks. |
| launch_selective_validation_hf.sh | Calibration-data ablation — how much validation data and how many few-shot examples are needed to identify the right experts (Appendix B.2 of the paper). | Validation-set sizes ∈ {1, 5, 10, 100, All} × 3 shot-count configurations × Emo_1b14b_1T × keep-k ∈ {8, 16, 32, 128} × tasks. |
Every (model × keep-k × task × method) combination produces one self-contained subdirectory under selective_evals_final/:
selective_evals_final/
└── <sanitized_model>/                  # e.g. allenaiEmo_1b14b_1T
    └── <task>_keepk_<K>_bs-<B>_lr-<LR>_epoch-<E>_selectivemode-{layerwise,easy_ep,random}[_nselective-<N>][_pseed-<S>][_pshots-<X>][_eshots-<Y>]/
        ├── selected_model/             # pruned-expert HF checkpoint + pruning_metadata.json
        ├── finetuned_model/
        │   └── checkpoint-<N>/         # HF Trainer-format finetuned weights
        └── results/
            └── checkpoint-<N>/
                ├── task-<name>-metrics.json       # aggregate metrics for the task
                ├── task-<name>-predictions.jsonl  # per-instance predictions
                └── per_subject/                   # only for MMLU category tasks
                    └── <subject>/
                        └── task-<name>-metrics.json
The optional _nselective-, _pseed-, _pshots-, _eshots- suffixes only appear when the corresponding override is set (e.g. you'll only see _nselective-100 when running with a sub-sampled calibration set).
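For a quick look at individual runs without the plotting scripts below, here is a minimal sketch that globs the aggregate metrics files under this layout and prints their raw contents. The glob pattern is derived from the tree above; the keys inside each metrics JSON are not assumed.

```python
import json
from pathlib import Path

root = Path("selective_evals_final")

# Walk every results/checkpoint-*/task-*-metrics.json under the layout above.
for metrics_path in sorted(root.glob("*/*/results/checkpoint-*/task-*-metrics.json")):
    run_dir = metrics_path.parents[2].name    # <task>_keepk_<K>_... directory
    model_dir = metrics_path.parents[3].name  # sanitized model name
    with open(metrics_path) as f:
        metrics = json.load(f)
    print(model_dir, run_dir, metrics_path.name, metrics)
```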
Each script defines its configuration (MODELS, SELECTIVE_KEEP_K_VALUES, TASK_GROUPS_LIST, etc.) at the top — comment lines out to skip combinations. Override the output root with OUTPUT_DIR=… and the per-worker GPU count with NUM_GPUS=…, for example as shown below.
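```bash
# Send results to a scratch directory (placeholder path) and give each worker two GPUs.
OUTPUT_DIR=/scratch/selective_evals NUM_GPUS=2 \
bash scripts/selective_hf/launch_selective_hf.sh
```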
We recommend running these under Slurm or another job scheduler, since each script launches many sequential worker invocations.
Once one of the launchers above has populated selective_evals_final/, two scripts in scripts/plotting/ walk the per-run subdirectories and produce flat CSV/TSV/markdown tables suitable for downstream analysis:
| Script | Source launcher | What it produces |
|---|---|---|
| get_table_scores_selective_evals_final.py | launch_selective_hf.sh and launch_selective_method_hf.sh | Per-metric tables with rows = (model × keep-k variant) and paired columns "task (lw) / task (ep) / task (rd)" — one column per selection method. Group averages (mc9_avg, gen5_avg, mmlu_merged_avg_no_other, mmlu_pro_merged_avg_no_other) are prepended automatically. Both last-checkpoint (post-finetune) and first-checkpoint (pre-finetune) variants are emitted by default. |
| get_table_scores_nselective_ablation.py | launch_selective_validation_hf.sh | Validation-data-ablation tables: rows = (model, selection-method, task group), columns = keepk_K (1) / keepk_K (5) / keepk_K (10) / keepk_K (100) / keepk_K (All) / keepk_K (Random). Includes optional 0-shot variants when _pshots-0/_eshots-0 runs are present. |
Both scripts default to reading from <repo>/selective_evals_final/ and writing to <repo>/plots/. Overrides:
# Main + method-comparison tables
python -m scripts.plotting.get_table_scores_selective_evals_final \
--selective-evals-root selective_evals_final \
--output-dir plots
# Validation-size ablation tables
python -m scripts.plotting.get_table_scores_nselective_ablation \
--selective-evals-root selective_evals_final \
    --output-dir plots
The model registries (MODEL_SPECS at the top of each file) currently list the released HF Hub checkpoints — add new entries there if you point either script at a directory built from a different model.
scripts/clustering/run_pretraining_compare.sh reproduces the side-by-side router-activation clustering used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 5 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:
- Streams ~1M tokens of the OLMoE pretraining mix from S3
- Runs a forward pass and saves token-level router logits
- Derives softmax probs, runs PCA + spherical k-means at k=32
- Renders an interactive side-by-side HTML explorer of both models' clusters
bash scripts/clustering/run_pretraining_compare.sh
# → cluster_eval_final/pretraining/compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html
cluster_eval_final/
├── pretraining_mix.json                # generated once, then reused
└── pretraining/
    ├── Emo_1b14b_1T/
    │   ├── embeddings_logits.npy + ... # extract outputs (tokens, doc boundaries, metadata)
    │   ├── embeddings_probs.npy        # transform output
    │   └── probs_mean_pca_l2_spherical_kmeans_k32/
    │       ├── assignments.npy, run_info.json, summary.json
    │       └── cluster_explorer.html
    ├── StdMoE_1b14b_1T/
    │   └── (same structure)
    └── compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html
The underlying primitives (extract / transform / cluster / visualize) live in scripts/clustering/ — see its README for the modular pipeline.
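For intuition, here is a rough sketch of the transform + cluster step on the saved router logits: softmax, PCA, L2-normalization, then k-means at k=32 on the unit vectors (a common stand-in for spherical k-means). This is illustrative only — the array shape, the PCA dimensionality, and the pooling details are assumptions, not the repo's exact implementation.

```python
import numpy as np
from scipy.special import softmax
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Token-level router logits saved by the extract step; shape assumed to be
# [num_tokens, num_router_dims] after flattening layers x experts.
logits = np.load("cluster_eval_final/pretraining/Emo_1b14b_1T/embeddings_logits.npy")

probs = softmax(logits, axis=-1)                     # router probabilities
reduced = PCA(n_components=50).fit_transform(probs)  # assumed PCA dimensionality
unit = normalize(reduced)                            # L2-normalize each row
kmeans = KMeans(n_clusters=32, n_init=10)            # k-means on unit vectors ≈ spherical k-means
assignments = kmeans.fit_predict(unit)
np.save("assignments.npy", assignments)
```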
- `CLUSTER_ROOT=…` overrides the output root (default `cluster_eval_final/`).
- `TARGET_TOKENS=…` and `MAX_TOKENS_PER_DOC=…` change the extraction budget and per-doc truncation.
- `CUDA_VISIBLE_DEVICES=…` restricts which GPUs the model is sharded across.
Note: this script uses the exact same data as OLMoE, which is publicly accessible here. The current script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
scripts/clustering/run_weborganizer_compare.sh reproduces the per-domain expert-activation heatmaps used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 6 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:
- Streams ~20M tokens of the cc_all_dressed weborganizer mix from S3, sampled uniformly across the 24 topics
- Runs a single forward pass and aggregates router activations into per-document expert vectors (top-k frequency + softmax probs)
- Renders 5 expert-coverage heatmaps per embedding type (10 PNGs total per model)
Both models share a single topic_order.json (stratified row/column ordering) so the resulting heatmaps are directly comparable side-by-side.
bash scripts/clustering/run_weborganizer_compare.sh
# → cluster_eval_final/weborganizer/{Emo_1b14b_1T,StdMoE_1b14b_1T}/*.png
cluster_eval_final/
└── weborganizer/
    ├── mix_composition.json  # auto-generated on first run by extract_document.py
    ├── topic_order.json      # shared row/column ordering for cross-model comparison
    ├── Emo_1b14b_1T/
    │   ├── embeddings_doc_topk_freq.npy
    │   ├── embeddings_doc_probs.npy
    │   └── *.png             # 5 heatmaps × 2 embedding types = 10 PNGs
    └── StdMoE_1b14b_1T/
        └── (same structure)
The underlying primitives (extract_document / plot_doc_expert_coverage) live in scripts/clustering/weborganizer/.
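A minimal sketch of the per-document aggregation described above — counting how often each expert lands in a token's top-k and averaging router probabilities per document. Variable names, shapes, and the top-k value are assumptions for illustration; the repo's extract_document.py is the reference implementation.

```python
import numpy as np

def doc_expert_vectors(router_logits, doc_ids, top_k=8):
    """Aggregate token-level router logits into per-document expert vectors.

    router_logits: [num_tokens, num_experts] logits for one MoE layer (assumed shape)
    doc_ids:       [num_tokens] integer document id for each token
    Returns (topk_freq, mean_probs), each of shape [num_docs, num_experts].
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Indices of the top-k routed experts for each token.
    topk = np.argsort(-router_logits, axis=-1)[:, :top_k]

    num_docs = int(doc_ids.max()) + 1
    topk_freq = np.zeros((num_docs, num_experts))
    mean_probs = np.zeros((num_docs, num_experts))
    counts = np.zeros(num_docs)
    for t in range(num_tokens):
        d = doc_ids[t]
        topk_freq[d, topk[t]] += 1   # how often each expert appears in the top-k
        mean_probs[d] += probs[t]
        counts[d] += 1
    topk_freq /= counts[:, None]     # normalize by document length
    mean_probs /= counts[:, None]
    return topk_freq, mean_probs
```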
- `CLUSTER_ROOT=…` overrides the output root (default `cluster_eval_final/`).
- `TARGET_TOKENS=…` changes the extraction budget (default 20M).
- `CUDA_VISIBLE_DEVICES=…` restricts which GPUs the model is sharded across.
Note: this script uses the WebOrganizer dataset, which is publicly accessible here. The current script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
If you have a fix, improvement, or extension you'd like to share, please open a pull request — direct contributions are the best way to help the project, and we're happy to review them.
For other interactions:
- Public questions, bug reports, or feature suggestions: please file a GitHub issue. This keeps the conversation visible to everyone and lets others benefit from the answer.
- Private or sensitive inquiries (e.g. anything you'd rather not discuss in public): email ryanyxw@berkeley.edu.
TODO
