Apoorv Khandelwal & Ellie Pavlick
Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as `g(f(x))`.
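To make the notation concrete (an illustrative example of ours, not necessarily one from the paper): in the two-hop query "the capital of the state containing Dallas", the first hop `f` maps Dallas to Texas and the second hop `g` maps Texas to its capital, so the answer is `g(f(Dallas)) = Austin`.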
To set up the environment, run:

```bash
# Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync --frozen

# Create your local environment file, then update it with
# user-specific variables
cp .env.local.example .env.local

# Activate the virtual environment and load environment variables
source .venv/bin/activate
source .env
source .env.local
```
This codebase offers tools to prompt, and perform mechanistic analyses on, a selection of models and compositional tasks.

Beyond this, we implement our experiments as modular pipelines, run using the AI2 Tango library. Each `Experiment` implements a `step_dict` method that defines that pipeline's steps (steps can depend on other steps, forming a graph). Our scripts automatically cache step outputs in `./tango_workspace`.
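As a rough illustration of this pattern, here is a minimal sketch built on Tango's `Step` API. The `GenerateData` and `Evaluate` steps and the `Experiment` class below are hypothetical stand-ins, not this repo's actual classes:

```python
from tango import Step


@Step.register("generate_data")
class GenerateData(Step):
    def run(self, task: str) -> list[str]:
        # Build prompts for a given two-hop task (details omitted).
        return [f"{task}: example prompt"]


@Step.register("evaluate")
class Evaluate(Step):
    def run(self, prompts: list[str]) -> dict:
        # Score a model on the prompts (details omitted).
        return {"accuracy": 0.0}


class Experiment:
    def step_dict(self) -> dict[str, Step]:
        # Passing one step as an argument to another declares a
        # dependency; Tango resolves the resulting graph and caches
        # each step's output in the workspace.
        data = GenerateData(task="two_hop")
        return {"data": data, "evaluate": Evaluate(prompts=data)}
```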
To replicate the experiments and plots (output to `artifacts/`) in our paper, one can run:
- Generate datasets for every task:

  ```bash
  python configs/generate_data.py
  python plotting/tasks_table.py  # Tables 1-2
  ```

- Evaluate all models on a few tasks:

  ```bash
  python configs/evaluation_by_model.py
  python plotting/compositionality_gap_by_model.py  # Fig. 2
  python plotting/compositionality_gap_by_size.py  # App. C
  ```

- Evaluate Llama 3 (3B) on all tasks:

  ```bash
  python configs/llama_3_3b/evaluation.py
  python plotting/compositionality_gap.py  # Fig. 1
  ```

- Logit lens analyses on all tasks (a generic sketch of the technique follows this item):

  ```bash
  python configs/llama_3_3b/lens.py
  python plotting/logit_lens_overall.py  # Fig. 3(a-b)
  python plotting/logit_lens_per_task.py  # Fig. 3(c-f), App. D-E
  python plotting/intermediate_var_distribution.py  # Fig. 4(b)
  ```
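For readers unfamiliar with the logit lens, the sketch below shows the general technique, not this repo's pipeline: each layer's hidden state at the final position is passed through the model's final norm and unembedding matrix to see which token it would currently predict. The model name, prompt, and Llama-style attribute paths (`model.model.norm`, `model.lm_head`) are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # assumed model; any Llama-style LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of the state containing Dallas is"  # illustrative prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [i] is the output of block i.
for i, h in enumerate(out.hidden_states):
    # Apply the final RMSNorm, then the unembedding (lm_head).
    logits = model.lm_head(model.model.norm(h[0, -1]))
    print(f"layer {i:2d}: {tok.decode(logits.argmax())!r}")
```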
- Correlation between task linearity and intermediate variables:

  ```bash
  python configs/llama_3_3b/linear_task_embedding.py
  python configs/llama_3_3b/lens.py
  python plotting/task_logit_lens_corr.py  # Fig. 4(a)
  python plotting/hop_logit_lens_corr.py  # App. H
  ```

- Token identity patchscope analyses (a sketch of the technique follows this item):

  ```bash
  python configs/llama_3_3b/lens_token_identity.py
  python plotting/token_identity_per_task.py  # App. F, Fig. 10-11
  python configs/llama_3_3b/linear_task_embedding.py
  python plotting/task_token_identity_correlation.py  # App. F, Fig. 12
  ```
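A token identity patchscope, roughly, takes a hidden state from a source prompt and patches it into a generic few-shot identity prompt, then reads off which token the model decodes it as. Below is a minimal generic sketch, not this repo's implementation; the model name, layer index, prompts, and hook details are all illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # assumed model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 15  # layer to read from and patch into (illustrative choice)

# 1) Hidden state of the last source-prompt token, after block `layer`.
src = tok("The capital of the state containing Dallas is", return_tensors="pt")
with torch.no_grad():
    h = model(**src, output_hidden_states=True).hidden_states[layer + 1][0, -1]

# 2) Identity prompt: a few-shot "x -> x" pattern; the final token is the
#    position we overwrite with the source hidden state.
ident = tok("cat -> cat; 1135 -> 1135; hello -> hello; ?", return_tensors="pt")
pos = ident.input_ids.shape[1] - 1

def patch(module, args, output):
    # Depending on the transformers version, a decoder block returns a
    # tuple whose first element is the hidden states, or the tensor itself.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[0, pos] = h  # overwrite the target position in place
    return output

handle = model.model.layers[layer].register_forward_hook(patch)
with torch.no_grad():
    logits = model(**ident).logits
handle.remove()

# The decoded token indicates what the patched state "means" to the model.
print(tok.decode(logits[0, -1].argmax()))
```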
- Causality of intermediate variable representations:

  ```bash
  python configs/llama_3_3b/patching_across_tasks_comp.py
  python configs/llama_3_3b/patching_across_tasks_direct.py
  python plotting/patching_across_tasks.py  # App. G
  ```
If you find this work useful, please cite:

```bibtex
@misc{khandelwal2025:compose,
  title={How Do Language Models Compose Functions?},
  author={Apoorv Khandelwal and Ellie Pavlick},
  year={2025},
  eprint={2510.01685},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.01685},
}
```