Apoorv Khandelwal & Ellie Pavlick
Abstract: While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as `g(f(x))`.
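To make the notation concrete (an illustrative example of ours, not necessarily one from the paper): in the two-hop query "the capital of the state containing Dallas", the first hop `f` maps Dallas to Texas and the second hop `g` maps Texas to its capital, so the answer is `g(f(Dallas)) = Austin`.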
To set up the environment, run:

```bash
# Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync --frozen

# Create your local environment file, then update it with
# user-specific variables
cp .env.local.example .env.local

# Activate the virtual environment and load environment variables
source .venv/bin/activate
source .env
source .env.local
```
This codebase offers tools to prompt, and perform mechanistic analyses on, a selection of models and compositional tasks.

Beyond this, we implement our experiments as modular pipelines, run using the AI2 Tango library. Each `Experiment` implements a `step_dict` method that defines that pipeline's steps (steps can depend on other steps, forming a graph). Our scripts automatically cache step outputs in `./tango_workspace`.
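As a rough illustration of this pattern, here is a minimal sketch built on Tango's `Step` API. The `GenerateData` and `Evaluate` steps and the `Experiment` class below are hypothetical stand-ins, not this repo's actual classes:

```python
from tango import Step


@Step.register("generate_data")
class GenerateData(Step):
    def run(self, task: str) -> list[str]:
        # Build prompts for a given two-hop task (details omitted).
        return [f"{task}: example prompt"]


@Step.register("evaluate")
class Evaluate(Step):
    def run(self, prompts: list[str]) -> dict:
        # Score a model on the prompts (details omitted).
        return {"accuracy": 0.0}


class Experiment:
    def step_dict(self) -> dict[str, Step]:
        # Passing one step as an argument to another declares a
        # dependency; Tango resolves the resulting graph and caches
        # each step's output in the workspace.
        data = GenerateData(task="two_hop")
        return {"data": data, "evaluate": Evaluate(prompts=data)}
```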
To replicate the experiments and plots (output to `artifacts/`) in our paper, one can run:
- Generate datasets for every task:

  ```bash
  python configs/generate_data.py
  python plotting/tasks_table.py  # Tables 1-2
  ```

- Evaluate all models on a few tasks:

  ```bash
  python configs/evaluation_by_model.py
  python plotting/compositionality_gap_by_model.py  # Fig. 2
  python plotting/compositionality_gap_by_size.py  # App. C
  ```

- Evaluate Llama 3 (3B) on all tasks:

  ```bash
  python configs/llama_3_3b/evaluation.py
  python plotting/compositionality_gap.py  # Fig. 1
  ```

- Logit lens analyses on all tasks (a generic sketch of the technique follows this item):

  ```bash
  python configs/llama_3_3b/lens.py
  python plotting/logit_lens_overall.py  # Fig. 3(a-b)
  python plotting/logit_lens_per_task.py  # Fig. 3(c-f), App. D-E
  python plotting/intermediate_var_distribution.py  # Fig. 4(b)
  ```
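For readers unfamiliar with the logit lens, the sketch below shows the general technique, not this repo's pipeline: each layer's hidden state at the final position is passed through the model's final norm and unembedding matrix to see which token it would currently predict. The model name, prompt, and Llama-style attribute paths (`model.model.norm`, `model.lm_head`) are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # assumed model; any Llama-style LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of the state containing Dallas is"  # illustrative prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [i] is the output of block i.
for i, h in enumerate(out.hidden_states):
    # Apply the final RMSNorm, then the unembedding (lm_head).
    logits = model.lm_head(model.model.norm(h[0, -1]))
    print(f"layer {i:2d}: {tok.decode(logits.argmax())!r}")
```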
- Correlation between task linearity and intermediate variables:

  ```bash
  python configs/llama_3_3b/linear_task_embedding.py
  python configs/llama_3_3b/lens.py
  python plotting/task_logit_lens_corr.py  # Fig. 4(a)
  python plotting/hop_logit_lens_corr.py  # App. H
  ```

- Token identity patchscope analyses (a sketch of the technique follows this item):

  ```bash
  python configs/llama_3_3b/lens_token_identity.py
  python plotting/token_identity_per_task.py  # App. F, Fig. 10-11
  python configs/llama_3_3b/linear_task_embedding.py
  python plotting/task_token_identity_correlation.py  # App. F, Fig. 12
  ```
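A token identity patchscope, roughly, takes a hidden state from a source prompt and patches it into a generic few-shot identity prompt, then reads off which token the model decodes it as. Below is a minimal generic sketch, not this repo's implementation; the model name, layer index, prompts, and hook details are all illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # assumed model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 15  # layer to read from and patch into (illustrative choice)

# 1) Hidden state of the last source-prompt token, after block `layer`.
src = tok("The capital of the state containing Dallas is", return_tensors="pt")
with torch.no_grad():
    h = model(**src, output_hidden_states=True).hidden_states[layer + 1][0, -1]

# 2) Identity prompt: a few-shot "x -> x" pattern; the final token is the
#    position we overwrite with the source hidden state.
ident = tok("cat -> cat; 1135 -> 1135; hello -> hello; ?", return_tensors="pt")
pos = ident.input_ids.shape[1] - 1

def patch(module, args, output):
    # Depending on the transformers version, a decoder block returns a
    # tuple whose first element is the hidden states, or the tensor itself.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[0, pos] = h  # overwrite the target position in place
    return output

handle = model.model.layers[layer].register_forward_hook(patch)
with torch.no_grad():
    logits = model(**ident).logits
handle.remove()

# The decoded token indicates what the patched state "means" to the model.
print(tok.decode(logits[0, -1].argmax()))
```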
- Causality of intermediate variable representations:

  ```bash
  python configs/llama_3_3b/patching_across_tasks_comp.py
  python configs/llama_3_3b/patching_across_tasks_direct.py
  python plotting/patching_across_tasks.py  # App. G
  ```
If you find this work useful, please cite:

```bibtex
@misc{khandelwal2025:compose,
  title={How Do Language Models Compose Functions?},
  author={Apoorv Khandelwal and Ellie Pavlick},
  year={2025},
  eprint={2510.01685},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.01685},
}
```