EMO is a new Mixture-of-Experts model trained so that modular structure emerges during pretraining without requiring human-defined priors. EMO enables selective expert use, down to 12.5% of total experts, with minimal performance degradation. We find that its expert groups specialize to higher-level topics and capabilities rather than low-level lexical patterns.
- Installation
- Released Models
- Inference
- Training scripts
- Evaluation scripts
- Contact and Contributing
- Citing
git clone https://github.com/allenai/EMO.git
cd EMO
conda create -n emo python==3.12
conda activate emo
uv pip install -e ".[all]"
uv pip install --upgrade 'chardet>=7'
All checkpoints are available in the EMO collection on the Hugging Face Hub.
| Model | Active | Total | Pretraining (1T) | Annealing (50B) | Description |
|---|---|---|---|---|---|
| allenai/Emo_1b14b_1T | 1B | 14B | EMO [train_script] | EMO [train_script] | Main EMO release |
| allenai/StdMoE_1b14b_1T | 1B | 14B | standard [train_script] | standard [train_script] | Architecture-matched standard MoE baseline |
Smaller-scale checkpoints used for memory-matched comparisons. These models were not midtrained.
| Model | Active | Total | Pretraining (130B) | Description |
|---|---|---|---|---|
| allenai/Emo_1b14b_130B | 1B | 14B | EMO [train_script] | EMO at the 130B-token ablation scale |
| allenai/StdMoE_1b14b_130B | 1B | 14B | standard [train_script] | Standard MoE baseline at the 130B-token scale |
| allenai/StdMoE_1b4b_130B | 1B | 4B | standard [train_script] | Memory-matched standard MoE with 32 experts ("Reg. MoE @ 32" in Figure 1), used as a memory-matched baseline for EMO's 32-expert subsets |
| allenai/Dense_1b_130B | 1B | 1B | dense LM [train_script] | Dense baseline matched to active parameters ("Dense @ 8" in Figure 1), used as a memory-matched baseline for EMO's 8-expert subsets |
Checkpoints used in Appendix B.4 to test whether modularity can be induced after pretraining via annealing alone, rather than during pretraining.
| Model | Active | Total | Pretraining (1T) | Annealing (50B) | Description |
|---|---|---|---|---|---|
| allenai/StdMoE_1b14b_1T_Preanneal | 1B | 14B | standard [train_script] | — | Standard MoE checkpoint after 1T-token pretraining, before any annealing. Starting point for the EMO-anneal experiment |
| allenai/StdMoE_1b14b_1T_EmoAnnealed | 1B | 14B | standard [train_script] | EMO [train_script] | EMO-anneal: a standard MoE annealed under the document-level expert pool constraint for 50B tokens |
See Released Models for the available checkpoints. All inference snippets below require trust_remote_code=True since the models use custom modeling code from the ryanyxw/transformers fork. (Note: you do not need to clone this fork yourself; the Hugging Face Hub pulls the necessary code when you load the model with trust_remote_code=True.)
You can use our Hugging Face transformers integration to run inference on the released checkpoints:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "allenai/Emo_1b14b_1T"
olmo = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
# Optional: move the inputs and model to GPU
# inputs = {k: v.to('cuda') for k, v in inputs.items()}
# olmo = olmo.to('cuda')
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0, top_p=0.7)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
Alternatively, with the Hugging Face pipeline abstraction:
from transformers import pipeline
olmo_pipe = pipeline("text-generation", model="allenai/Emo_1b14b_1T", trust_remote_code=True)
print(olmo_pipe("Language modeling is"))
vLLM provides high-throughput inference. We ship a small out-of-tree plugin at src/vllm_plugin/ that registers EmoForCausalLM with vLLM's native model registry.
pip install 'vllm>=0.11.0'
pip install -e src/vllm_plugin  # optional; only needed for the native path
You can run offline batched inference:
from vllm import LLM, SamplingParams
llm = LLM(model="allenai/Emo_1b14b_1T", trust_remote_code=True)
sampling_params = SamplingParams(temperature=1.0, top_p=0.7)
prompts = ["Language modeling is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
For more details, see the vLLM documentation.
Project-specific pretraining recipes live in scripts. Please refer to Released Models for the training scripts corresponding to each released checkpoint.
Note: these scripts train on the exact same data as OLMoE, which is publicly accessible here. The current pretraining script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
Run a script locally:
bash scripts/models/dense_1b_lr-4e-3_0213.sh
Submit it as a Beaker job:
MODE=beaker bash scripts/models/dense_1b_lr-4e-3_0213.sh
Override paths via env vars before launching:
- `PREFIX` — output root
- `MODELS_DIR` — derived from `PREFIX` (`${PREFIX}/models`)
- `DATASET_CACHE` — tokenizer-mapped dataset cache
Override Beaker cluster sizing per script with BEAKER_GPUS=8 BEAKER_NODES=4 ....
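For example, a hypothetical launch that combines the overrides above (the paths here are placeholders, not defaults shipped with the repo; MODELS_DIR is derived from PREFIX, so it does not need to be set):

```bash
# Placeholder paths -- substitute your own storage locations.
PREFIX=/data/emo_runs \
DATASET_CACHE=/data/olmoe_tokenized_cache \
BEAKER_GPUS=8 BEAKER_NODES=4 \
MODE=beaker bash scripts/models/dense_1b_lr-4e-3_0213.sh
```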
After training, OLMo-core checkpoints can be converted to the HuggingFace format (suitable for inference and the evaluation pipelines below) with scripts/convert_olmo_to_hf.sh.
The launch scripts in scripts/selective_hf/ exercise the full router-activation → expert selection → finetuning → eval pipeline on the released checkpoints. Each (model × keep-k × task × method) combination lands in its own subdirectory under selective_evals_final/<model>/..., with the pruned-expert model, finetuned checkpoint, and per-checkpoint metrics all colocated. Three scripts target different questions:
| Script | Investigates | Sweep |
|---|---|---|
| launch_selective_hf.sh | Main selective-expert evaluation — how each released model performs when only a subset of experts is retained for a given task (Figure 3 of the paper). | All released models × keep-k ∈ {8, 16, 32, 64, 128} × MC9 / Gen5 / MMLU / MMLU-Pro / GSM8K task groups. |
| launch_selective_method_hf.sh | Robustness to the choice of expert-selection method (Figure 4 of the paper). | {layerwise, easy_ep, random} selection methods × main 1T models × keep-k × tasks. |
| launch_selective_validation_hf.sh | Calibration-data ablation — how much validation data and how many few-shot examples are needed to identify the right experts (Appendix B.2 of the paper). | Validation-set sizes ∈ {1, 5, 10, 100, All} × 3 shot-count configurations × Emo_1b14b_1T × keep-k ∈ {8, 16, 32, 128} × tasks. |
Every (model × keep-k × task × method) combination produces one self-contained subdirectory under selective_evals_final/:
selective_evals_final/
└── <sanitized_model>/                  # e.g. allenaiEmo_1b14b_1T
    └── <task>_keepk_<K>_bs-<B>_lr-<LR>_epoch-<E>_selectivemode-{layerwise,easy_ep,random}[_nselective-<N>][_pseed-<S>][_pshots-<X>][_eshots-<Y>]/
        ├── selected_model/             # pruned-expert HF checkpoint + pruning_metadata.json
        ├── finetuned_model/
        │   └── checkpoint-<N>/         # HF Trainer-format finetuned weights
        └── results/
            └── checkpoint-<N>/
                ├── task-<name>-metrics.json       # aggregate metrics for the task
                ├── task-<name>-predictions.jsonl  # per-instance predictions
                └── per_subject/                   # only for MMLU category tasks
                    └── <subject>/
                        └── task-<name>-metrics.json
The optional _nselective-, _pseed-, _pshots-, _eshots- suffixes only appear when the corresponding override is set (e.g. you'll only see _nselective-100 when running with a sub-sampled calibration set).
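For a quick look at individual runs without the plotting scripts below, here is a minimal sketch that globs the aggregate metrics files under this layout and prints their raw contents. The glob pattern is derived from the tree above; the keys inside each metrics JSON are not assumed.

```python
import json
from pathlib import Path

root = Path("selective_evals_final")

# Walk every results/checkpoint-*/task-*-metrics.json under the layout above.
for metrics_path in sorted(root.glob("*/*/results/checkpoint-*/task-*-metrics.json")):
    run_dir = metrics_path.parents[2].name    # <task>_keepk_<K>_... directory
    model_dir = metrics_path.parents[3].name  # sanitized model name
    with open(metrics_path) as f:
        metrics = json.load(f)
    print(model_dir, run_dir, metrics_path.name, metrics)
```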
Each script defines its configuration (MODELS, SELECTIVE_KEEP_K_VALUES, TASK_GROUPS_LIST, etc.) at the top — comment lines out to skip combinations. Override the output root with OUTPUT_DIR=… and the per-worker GPU count with NUM_GPUS=…, for example as shown below.
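```bash
# Send results to a scratch directory (placeholder path) and give each worker two GPUs.
OUTPUT_DIR=/scratch/selective_evals NUM_GPUS=2 \
bash scripts/selective_hf/launch_selective_hf.sh
```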
We recommend running these under Slurm or another job scheduler, since each script launches many sequential worker invocations.
Once one of the launchers above has populated selective_evals_final/, two scripts in scripts/plotting/ walk the per-run subdirectories and produce flat CSV/TSV/markdown tables suitable for downstream analysis:
| Script | Source launcher | What it produces |
|---|---|---|
| get_table_scores_selective_evals_final.py | launch_selective_hf.sh and launch_selective_method_hf.sh | Per-metric tables with rows = (model × keep-k variant) and paired columns "task (lw) / task (ep) / task (rd)" — one column per selection method. Group averages (mc9_avg, gen5_avg, mmlu_merged_avg_no_other, mmlu_pro_merged_avg_no_other) are prepended automatically. Both last-checkpoint (post-finetune) and first-checkpoint (pre-finetune) variants are emitted by default. |
| get_table_scores_nselective_ablation.py | launch_selective_validation_hf.sh | Validation-data-ablation tables: rows = (model, selection-method, task group), columns = keepk_K (1) / keepk_K (5) / keepk_K (10) / keepk_K (100) / keepk_K (All) / keepk_K (Random). Includes optional 0-shot variants when _pshots-0/_eshots-0 runs are present. |
Both scripts default to reading from <repo>/selective_evals_final/ and writing to <repo>/plots/. Overrides:
# Main + method-comparison tables
python -m scripts.plotting.get_table_scores_selective_evals_final \
--selective-evals-root selective_evals_final \
--output-dir plots
# Validation-size ablation tables
python -m scripts.plotting.get_table_scores_nselective_ablation \
--selective-evals-root selective_evals_final \
    --output-dir plots
The model registries (MODEL_SPECS at the top of each file) currently list the released HF Hub checkpoints — add new entries there if you point either script at a directory built from a different model.
scripts/clustering/run_pretraining_compare.sh reproduces the side-by-side router-activation clustering used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 5 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:
- Streams ~1M tokens of the OLMoE pretraining mix from S3
- Runs a forward pass and saves token-level router logits
- Derives softmax probs, runs PCA + spherical k-means at k=32
- Renders an interactive side-by-side HTML explorer of both models' clusters
bash scripts/clustering/run_pretraining_compare.sh
# → cluster_eval_final/pretraining/compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html
cluster_eval_final/
├── pretraining_mix.json                # generated once, then reused
└── pretraining/
    ├── Emo_1b14b_1T/
    │   ├── embeddings_logits.npy + ... # extract outputs (tokens, doc boundaries, metadata)
    │   ├── embeddings_probs.npy        # transform output
    │   └── probs_mean_pca_l2_spherical_kmeans_k32/
    │       ├── assignments.npy, run_info.json, summary.json
    │       └── cluster_explorer.html
    ├── StdMoE_1b14b_1T/
    │   └── (same structure)
    └── compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html
The underlying primitives (extract / transform / cluster / visualize) live in scripts/clustering/ — see its README for the modular pipeline.
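For intuition, here is a rough sketch of the transform + cluster step on the saved router logits: softmax, PCA, L2-normalization, then k-means at k=32 on the unit vectors (a common stand-in for spherical k-means). This is illustrative only — the array shape, the PCA dimensionality, and the pooling details are assumptions, not the repo's exact implementation.

```python
import numpy as np
from scipy.special import softmax
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Token-level router logits saved by the extract step; shape assumed to be
# [num_tokens, num_router_dims] after flattening layers x experts.
logits = np.load("cluster_eval_final/pretraining/Emo_1b14b_1T/embeddings_logits.npy")

probs = softmax(logits, axis=-1)                     # router probabilities
reduced = PCA(n_components=50).fit_transform(probs)  # assumed PCA dimensionality
unit = normalize(reduced)                            # L2-normalize each row
kmeans = KMeans(n_clusters=32, n_init=10)            # k-means on unit vectors ≈ spherical k-means
assignments = kmeans.fit_predict(unit)
np.save("assignments.npy", assignments)
```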
- `CLUSTER_ROOT=…` overrides the output root (default `cluster_eval_final/`).
- `TARGET_TOKENS=…` and `MAX_TOKENS_PER_DOC=…` change the extraction budget and per-doc truncation.
- `CUDA_VISIBLE_DEVICES=…` restricts which GPUs the model is sharded across.
Note: this script uses the exact same data as OLMoE, which is publicly accessible here. The current script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
scripts/clustering/run_weborganizer_compare.sh reproduces the per-domain expert-activation heatmaps used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 6 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:
- Streams ~20M tokens of the cc_all_dressed weborganizer mix from S3, sampled uniformly across the 24 topics
- Runs a single forward pass and aggregates router activations into per-document expert vectors (top-k frequency + softmax probs)
- Renders 5 expert-coverage heatmaps per embedding type (10 PNGs total per model)
Both models share a single topic_order.json (stratified row/column ordering) so the resulting heatmaps are directly comparable side-by-side.
bash scripts/clustering/run_weborganizer_compare.sh
# → cluster_eval_final/weborganizer/{Emo_1b14b_1T,StdMoE_1b14b_1T}/*.png
cluster_eval_final/
└── weborganizer/
    ├── mix_composition.json  # auto-generated on first run by extract_document.py
    ├── topic_order.json      # shared row/column ordering for cross-model comparison
    ├── Emo_1b14b_1T/
    │   ├── embeddings_doc_topk_freq.npy
    │   ├── embeddings_doc_probs.npy
    │   └── *.png             # 5 heatmaps × 2 embedding types = 10 PNGs
    └── StdMoE_1b14b_1T/
        └── (same structure)
The underlying primitives (extract_document / plot_doc_expert_coverage) live in scripts/clustering/weborganizer/.
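A minimal sketch of the per-document aggregation described above — counting how often each expert lands in a token's top-k and averaging router probabilities per document. Variable names, shapes, and the top-k value are assumptions for illustration; the repo's extract_document.py is the reference implementation.

```python
import numpy as np

def doc_expert_vectors(router_logits, doc_ids, top_k=8):
    """Aggregate token-level router logits into per-document expert vectors.

    router_logits: [num_tokens, num_experts] logits for one MoE layer (assumed shape)
    doc_ids:       [num_tokens] integer document id for each token
    Returns (topk_freq, mean_probs), each of shape [num_docs, num_experts].
    """
    num_tokens, num_experts = router_logits.shape
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Indices of the top-k routed experts for each token.
    topk = np.argsort(-router_logits, axis=-1)[:, :top_k]

    num_docs = int(doc_ids.max()) + 1
    topk_freq = np.zeros((num_docs, num_experts))
    mean_probs = np.zeros((num_docs, num_experts))
    counts = np.zeros(num_docs)
    for t in range(num_tokens):
        d = doc_ids[t]
        topk_freq[d, topk[t]] += 1   # how often each expert appears in the top-k
        mean_probs[d] += probs[t]
        counts[d] += 1
    topk_freq /= counts[:, None]     # normalize by document length
    mean_probs /= counts[:, None]
    return topk_freq, mean_probs
```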
- `CLUSTER_ROOT=…` overrides the output root (default `cluster_eval_final/`).
- `TARGET_TOKENS=…` changes the extraction budget (default 20M).
- `CUDA_VISIBLE_DEVICES=…` restricts which GPUs the model is sharded across.
Note: this script uses the WebOrganizer dataset, which is publicly accessible here. The current script draws data from a tokenized version of this dataset hosted internally. You can tokenize the dataset yourself following the instructions here. We will also release an endpoint for direct access to this data soon.
If you have a fix, improvement, or extension you'd like to share, please open a pull request — direct contributions are the best way to help the project, and we're happy to review them.
For other interactions:
- Public questions, bug reports, or feature suggestions: please file a GitHub issue. This keeps the conversation visible to everyone and lets others benefit from the answer.
- Private or sensitive inquiries (e.g. anything you'd rather not discuss in public): email ryanyxw@berkeley.edu.
TODO
