Overview | Set Up Environment | Prepare Dataset | Run Experiments | Citation
This is the official PyTorch implementation of the paper "MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling". The MUSE framework targets industrial-scale recommendation scenarios that require modeling lifelong user interests over ultra-long behavior sequences and rich multimodal item content. MUSE has been deployed in the Taobao display advertising system since mid-2025 and has demonstrated significant improvements in online serving.
Overview of MUSE. (a) Multimodal item embeddings are pre-trained via Semantic-aware Contrastive Learning (SCL). In the recommendation phase, (b) the GSU stage efficiently retrieves the top-K behaviors most relevant to the target item from the user's lifelong history using lightweight multimodal cosine similarity, drastically reducing the sequence length for downstream processing. (c) The ESU stage models fine-grained user interests through two components: the SimTier module compresses multimodal similarity sequences into histograms, while the Semantic-Aware Target Attention (SA-TA) module enriches ID-based attention with semantic guidance to produce the final lifelong user interest representation.
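To make the GSU stage concrete, here is a minimal sketch (not the repo's code) of cosine-similarity top-K retrieval over SCL embeddings; the embedding dimension and sequence length below are illustrative:

```python
import torch
import torch.nn.functional as F

def gsu_top_k(target_emb, behavior_embs, k=100):
    """Retrieve the k behaviors most similar to the target item.

    target_emb:    (dim,) SCL embedding of the target item (ad).
    behavior_embs: (seq_len, dim) SCL embeddings of the lifelong behavior sequence.
    Returns the top-k cosine similarities and the corresponding behavior indices.
    """
    sims = F.cosine_similarity(behavior_embs, target_emb.unsqueeze(0), dim=-1)
    top_sims, top_idx = torch.topk(sims, k=min(k, sims.numel()))
    return top_sims, top_idx

# Example: a 1000-step lifelong sequence with 32-dim embeddings (sizes illustrative).
target = torch.randn(32)
history = torch.randn(1000, 32)
sims, idx = gsu_top_k(target, history, k=100)
```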
Our project relies on Python 3.10.19. First, ensure that a PyTorch version comparable to 2.6.0 is installed (it may be backward compatible). Then install the other required dependencies from requirements.txt:
```bash
pip install -r requirements.txt
```

To reproduce the results presented in Table 7 ("Open-source-1k") of the paper, the TAOBAO-MM dataset is required. TAOBAO-MM is now publicly available on 🤗HuggingFace. After installing the huggingface_hub package via `pip install huggingface_hub`, you can download the dataset directly with the following command:

```bash
huggingface-cli download --repo-type=dataset TaoBao-MM/Taobao-MM --local-dir your/local/path
```
The complete dataset occupies 139 GB. For more details about the dataset, please refer to the official dataset website or the Hugging Face repository.
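Alternatively, the same download can be performed from Python via huggingface_hub's `snapshot_download` (the path below is a placeholder):

```python
from huggingface_hub import snapshot_download

# Downloads the full 139 GB dataset; replace local_dir with your own path.
snapshot_download(
    repo_id="TaoBao-MM/Taobao-MM",
    repo_type="dataset",
    local_dir="your/local/path",
)
```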
In `./utils/muse_dataset`, we implement the dataset class that loads the training and test samples of TAOBAO-MM. We primarily use the following feature fields:
```python
# In line 10 of `trainer.py`
FEATURE_BLOCKS = {
    # non-sequential attributes of the target item (ad)
    # "205" is the item ID
    # "205_c" is the item SCL embedding
    "ad": ["205", "206", "213", "214", "205_c"],
    # non-sequential attributes of the user
    # "129_1" is the user ID
    "user": ["129_1", "130_1", "130_2", "130_3", "130_4", "130_5"],
    # sequential features of the user's lifelong behavior, each a list of length 1000
    # "150_2_180" is the item ID list
    # "151_2_180" is the item category list
    # "150_2_180_c" is the item SCL embedding list
    "uni_seq_fn": ["150_2_180", "151_2_180", "150_2_180_c"],
    # sequential features of the user's recent behavior, each a list of length 50,
    # obtained by truncating the long sequence from the newer end
    "short_seq_fn": ["150_1_180", "151_1_180", "150_1_180_c"],
}
```

Apart from the multimodal embeddings, all features are represented as discrete IDs, such as -58759430334327705. For each feature field, we first compute the number of unique IDs to determine the vocabulary size (ID_SIZE), then map the original IDs to consecutive integers in the range [0, ID_SIZE). We implement static embeddings using torch.nn.Embedding, with the remapped integer IDs serving as input indices. Note that this static-embedding implementation differs from the one used in production, which may involve dynamic embeddings.
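A minimal sketch of this remap-then-embed scheme (toy IDs and an illustrative embedding dimension; not the repo's exact code):

```python
import torch
import torch.nn as nn

# Raw discrete IDs as they appear in the data (toy example).
raw_ids = [-58759430334327705, 1234567890123, -58759430334327705, 42]

# Map each unique original ID to a consecutive integer in [0, ID_SIZE).
id_map = {orig: idx for idx, orig in enumerate(dict.fromkeys(raw_ids))}
ID_SIZE = len(id_map)  # vocabulary size for this feature field

# Static embedding table; the remapped integers serve as lookup indices.
emb = nn.Embedding(num_embeddings=ID_SIZE, embedding_dim=16)  # dim 16 is illustrative
indices = torch.tensor([id_map[i] for i in raw_ids])
vectors = emb(indices)  # shape: (4, 16)
```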
To reproduce the experiments, replace the configuration file in the bash script and run it:

```bash
bash script/run_exp.sh
```

The code we currently provide includes implementations of MUSE, SIM-hard, and SIM-soft.
The configuration files for the experiments are located in the config directory. Below, we describe the key configuration parameters.
| Configuration | Type | Description |
|---|---|---|
| train_data_path | Path | Path to the training data. |
| test_data_path | Path | Path to the test data. |
| feature_map_path | Path | Directory containing feature maps and SCL embedding tables. |
| dense_lr | Float | Learning rate for the dense components of the model (e.g., DNN layers). |
| sparse_lr | Float | Learning rate for the sparse components of the model (embeddings). |
| keep_top | Int | Top-K value used by the GSU retrieval stage. |
| item_id_p90 | Bool | If true, uses a simplified item ID vocabulary of size 35M, covering 90% of user historical interactions and 100% of target items; otherwise, uses the full vocabulary of 243M entries. |
| scl_emb_p90 | Bool | If true, uses a simplified SCL embedding table of size 35M (same coverage as above); currently, only the simplified version is supported. |
| feature_map_on_cuda | Bool | If true, ID remapping is performed on GPU; otherwise, it is executed on CPU. |
| scl_emb_on_cuda | Bool | If true, SCL embedding lookups are performed on GPU; otherwise, they are carried out on CPU. |
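For orientation, the following is a hypothetical configuration sketch using the keys from the table above, written as a Python dict; all values are illustrative, and the actual file format and settings are those in the config directory:

```python
# Hypothetical values only; consult the files under `config/` for real settings.
config = {
    "train_data_path": "your/local/path/train",
    "test_data_path": "your/local/path/test",
    "feature_map_path": "your/local/path/feature_map",
    "dense_lr": 1e-3,          # illustrative
    "sparse_lr": 1e-2,         # illustrative
    "keep_top": 100,           # illustrative GSU top-K
    "item_id_p90": True,       # 35M simplified vocabulary
    "scl_emb_p90": True,       # only the simplified table is currently supported
    "feature_map_on_cuda": True,
    "scl_emb_on_cuda": False,  # CPU lookups to save GPU memory
}
```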
Our training code supports distributed data-parallel training using PyTorch’s DistributedDataParallel (DDP). The training and test datasets are partitioned into 160 and 48 shards, respectively, to facilitate efficient data loading. To ensure balanced workload distribution across devices, we strongly recommend using 1, 2, 4, or 8 GPUs. With 8 GPUs, a full training and evaluation cycle can be completed within one hour.
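The shard counts are chosen so that they divide evenly across the recommended GPU counts. A minimal sketch of the kind of per-rank shard assignment this enables (not the repo's exact loader code):

```python
def shards_for_rank(num_shards: int, rank: int, world_size: int) -> list[int]:
    """Assign a contiguous block of shards to each DDP rank.

    160 training shards (and 48 test shards) split evenly when
    world_size is 1, 2, 4, or 8, keeping the per-GPU workload balanced.
    """
    assert num_shards % world_size == 0, "prefer 1, 2, 4, or 8 GPUs"
    per_rank = num_shards // world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))

# Example: rank 3 of 8 GPUs handles training shards 60..79.
print(shards_for_rank(160, rank=3, world_size=8))
```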
The CUDA memory overhead during training primarily arises from three sources:
- Memory consumed by the embedding layers and their associated gradients during backpropagation;
- Memory used to store the feature map;
- Memory required for the SCL embedding table.
When modeling users’ long-term historical interaction sequences, the number of unique items can be very large, up to 243 million. Allocating an embedding table for all 243 million items would alone require over 30 GB of CUDA memory. However, many of these item IDs appear only once or twice in the training data and contribute minimally to model performance. To reduce memory consumption without significant degradation in accuracy, we recommend using a simplified ID vocabulary that filters out low-frequency items (set item_id_p90 to true). With this simplification, the per-GPU memory cost is reduced to approximately 16 GB.
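As a back-of-the-envelope check (assuming 32-dimensional float32 embeddings; the actual dimension is set in the configuration), a full table would need roughly 243M × 32 × 4 bytes ≈ 31 GB, consistent with the figure above, whereas the 35M-entry simplified vocabulary needs only about 4.5 GB.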
For further GPU memory savings, you may set scl_emb_on_cuda to false, which moves SCL embedding lookups to CPU memory. Note that this incurs a moderate training-speed overhead.
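A minimal sketch of this trade-off (sizes downscaled for illustration; not the repo's exact code): keep the large table in host memory, gather only the rows a batch needs, then move that small result to the GPU:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# Large SCL embedding table kept on the CPU (downscaled from 35M rows for illustration).
scl_table = nn.Embedding(num_embeddings=35_000, embedding_dim=32)

def lookup_scl(indices_gpu: torch.Tensor) -> torch.Tensor:
    """Gather rows in host memory, then ship only the gathered batch to the GPU."""
    rows = scl_table(indices_gpu.cpu())  # lookup on CPU
    return rows.to(device)               # transfer a small batch, not the whole table

batch_indices = torch.randint(0, 35_000, (1024,), device=device)
scl_embs = lookup_scl(batch_indices)  # (1024, 32) on the GPU
```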
If you find our work useful for your research, please consider citing the paper:
```bibtex
@misc{wu2025musesimpleeffectivemultimodal,
  title={MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling},
  author={Bin Wu and Feifan Yang and Zhangming Chan and Yu-Ran Gu and Jiawei Feng and Chao Yi and Xiang-Rong Sheng and Han Zhu and Jian Xu and Mang Ye and Bo Zheng},
  year={2025},
  eprint={2512.07216},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.07216},
}
```
