
MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

arXiv | Project Page

Overview | Set Up Environment | Prepare Dataset | Run Experiments | Citation

This is the official PyTorch implementation of the paper "MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling". The MUSE framework targets industrial-scale recommendation scenarios that require modeling lifelong user interests over ultra-long behavior sequences and rich multimodal item content. MUSE has been deployed in the Taobao display advertising system since mid-2025 and has demonstrated significant improvements in online serving.

Overview

Framework Overview

Overview of MUSE. (a) Multimodal item embeddings are pre-trained via Semantic-aware Contrastive Learning (SCL). In the recommendation phase, (b) the GSU stage efficiently retrieves the top-𝐾 behaviors most relevant to the target item from the user’s lifelong history using lightweight multimodal cosine similarity, drastically reducing the sequence length for downstream processing. (c) The ESU stage models fine-grained user interests through two components: the SimTier module compresses multimodal similarity sequences into histograms, while the Semantic-Aware Target Attention (SA-TA) module enriches ID-based attention with semantic guidance to produce the final lifelong user interest representation.
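To make the GSU retrieval step concrete, below is a minimal PyTorch sketch of top-K selection by multimodal cosine similarity. The tensor names, batch size, and embedding dimension are illustrative assumptions, not the repository's actual interfaces.

```python
import torch
import torch.nn.functional as F

def gsu_topk(seq_emb: torch.Tensor, target_emb: torch.Tensor, keep_top: int):
    """Illustrative GSU: keep the K behaviors most similar to the target item.

    seq_emb:    [B, L, D] multimodal (SCL) embeddings of the lifelong behavior sequence
    target_emb: [B, D]    multimodal (SCL) embedding of the target item (ad)
    """
    # Cosine similarity between every historical behavior and the target item.
    sims = F.cosine_similarity(seq_emb, target_emb.unsqueeze(1), dim=-1)  # [B, L]
    top_sims, top_idx = sims.topk(keep_top, dim=-1)                       # [B, K]
    # Gather the retained behaviors for downstream (ESU) processing.
    top_beh = torch.gather(seq_emb, 1, top_idx.unsqueeze(-1).expand(-1, -1, seq_emb.size(-1)))
    return top_sims, top_beh

# Example: a lifelong sequence of 1000 behaviors reduced to keep_top = 100.
seq_emb = torch.randn(4, 1000, 128)
target_emb = torch.randn(4, 128)
top_sims, top_beh = gsu_topk(seq_emb, target_emb, keep_top=100)
```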

Set Up Environment

Our project relies on Python 3.10.19. First, make sure that a PyTorch version comparable to 2.6.0 is installed (older versions may also be compatible). Then install the remaining dependencies from requirements.txt:

```bash
pip install -r requirements.txt
```

Prepare Dataset

To reproduce the results presented in Table 7 ("Open-source-1k") of the paper, the TAOBAO-MM dataset is required. TAOBAO-MM is now publicly available on 🤗 Hugging Face. After installing the huggingface_hub package via pip install huggingface_hub, you can download the dataset directly with the following command:

```bash
huggingface-cli download --repo-type=dataset TaoBao-MM/Taobao-MM --local-dir your/local/path
```
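Alternatively, the same download can be scripted through the huggingface_hub Python API; the local directory below is a placeholder.

```python
from huggingface_hub import snapshot_download

# Download the TAOBAO-MM dataset (about 139 GB) to a local directory.
snapshot_download(
    repo_id="TaoBao-MM/Taobao-MM",
    repo_type="dataset",
    local_dir="your/local/path",
)
```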

The complete dataset occupies 139 GB. For more details about the dataset, please refer to the official dataset website or the Hugging Face repository.

In ./utils/muse_dataset, we implement the dataset class that loads the training and test samples of TAOBAO-MM. We primarily use the following feature fields:

```python
# In line 10 of `trainer.py`

FEATURE_BLOCKS = {
    # non-sequential attributes of target item (ad)
    # "205" is the item ID
    # "205_c" is the item scl embedding
    "ad": ["205", "206", "213", "214", "205_c"],

    # non-sequential attributes of user
    # "129_1" is the user ID
    "user": ["129_1", "130_1", "130_2", "130_3", "130_4", "130_5"],

    # sequential features of user lifelong behavior, each is a list of length 1000
    # "150_2_180" is item ID list
    # "151_2_180" is item category list
    # "150_2_180_c" is item scl embedding list
    "uni_seq_fn": ["150_2_180", "151_2_180", "150_2_180_c"],

    # sequential features of user recent behavior, each is a list of length 50
    # obtained by truncating the long sequence from the newer end
    "short_seq_fn": ["150_1_180", "151_1_180", "150_1_180_c"]
}
```

Apart from the multimodal embeddings, all features are represented as discrete IDs, such as -58759430334327705. For each feature field, we first compute the number of unique IDs to determine the vocabulary size (ID_SIZE), then map the original IDs to consecutive integers in the range [0, ID_SIZE). We implement static embeddings using torch.nn.Embedding, with the remapped integer IDs serving as input indices. It is important to note that this static embedding implementation differs from the one used in production, which may involve dynamic embeddings.
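A minimal sketch of this remap-then-lookup scheme is shown below; the raw IDs, vocabulary size, and embedding dimension are made up for illustration and do not correspond to the actual feature maps.

```python
import torch
import torch.nn as nn

# Hypothetical raw IDs observed for one feature field.
raw_ids = [-58759430334327705, 982340982340, -58759430334327705, 123]

# Map each unique raw ID to a consecutive integer in [0, ID_SIZE).
id_map = {rid: idx for idx, rid in enumerate(dict.fromkeys(raw_ids))}
ID_SIZE = len(id_map)  # vocabulary size for this field

# Static embedding table indexed by the remapped integer IDs.
embedding = nn.Embedding(num_embeddings=ID_SIZE, embedding_dim=16)

remapped = torch.tensor([id_map[rid] for rid in raw_ids])  # [4]
vectors = embedding(remapped)                              # [4, 16]
```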

Run Experiments

To reproduce an experiment, point the bash script at the corresponding configuration file and run it:

```bash
bash script/run_exp.sh
```

The code we currently provide includes implementations of MUSE, SIM-hard, and SIM-soft.

The configuration files for the experiments are located in the config directory. Below, we describe the key configuration parameters.

| Configuration | Type | Description |
| --- | --- | --- |
| train_data_path | Path | Path to the training data. |
| test_data_path | Path | Path to the test data. |
| feature_map_path | Path | Directory containing feature maps and SCL embedding tables. |
| dense_lr | Float | Learning rate for the dense components of the model (e.g., DNN layers). |
| sparse_lr | Float | Learning rate for the sparse part of the model (embeddings). |
| keep_top | Int | Top-K of the GSU retrieval stage. |
| item_id_p90 | Bool | If true, uses a simplified item ID vocabulary of size 35M, covering 90% of user historical interactions and 100% of target items; otherwise, uses the full vocabulary of 243M entries. |
| scl_emb_p90 | Bool | If true, uses a simplified SCL embedding table of size 35M (same coverage as above); currently, only the simplified version is supported. |
| feature_map_on_cuda | Bool | If true, ID remapping is performed on GPU; otherwise, it is executed on CPU. |
| scl_emb_on_cuda | Bool | If true, SCL embedding lookups are performed on GPU; otherwise, they are carried out on CPU. |

Our training code supports distributed data-parallel training using PyTorch’s DistributedDataParallel (DDP). The training and test datasets are partitioned into 160 and 48 shards, respectively, to facilitate efficient data loading. To ensure balanced workload distribution across devices, we strongly recommend using 1, 2, 4, or 8 GPUs. With 8 GPUs, a full training and evaluation cycle can be completed within one hour.
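For reference, the DDP pattern described above typically looks like the following generic sketch (assuming a torchrun launch); the function names and round-robin shard assignment are illustrative, not the repository's actual training loop.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Generic DDP setup; assumes a torchrun launch, which sets LOCAL_RANK."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])

def shards_for_rank(num_shards: int) -> list[int]:
    """Assign data shards round-robin to ranks, e.g. 160 training shards on 8 GPUs -> 20 shards each."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    return [s for s in range(num_shards) if s % world_size == rank]
```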

The CUDA memory overhead during training primarily arises from three sources:

  1. Memory consumed by the embedding layers and their associated gradients during backpropagation;
  2. Memory used to store the feature map;
  3. Memory required for the SCL embedding table.

When modeling users’ long-term historical interaction sequences, the number of unique items can be very large, up to 243 million. Allocating an embedding table for all 243 million items would alone require over 30 GB of CUDA memory. However, many of these item IDs appear only once or twice in the training data and contribute minimally to model performance. To reduce memory consumption without significant degradation in accuracy, we recommend using a simplified ID vocabulary that filters out low-frequency items (set item_id_p90 to true). With this simplification, the per-GPU memory cost is reduced to approximately 16 GB.
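As a rough back-of-the-envelope check of that figure (the embedding dimension and fp32 precision are our assumptions, not the repository's exact settings):

```python
# Rough estimate of the full item-ID embedding table (parameters only, no optimizer state).
num_items  = 243_000_000  # full vocabulary
emb_dim    = 32           # illustrative embedding dimension
bytes_fp32 = 4

table_gb = num_items * emb_dim * bytes_fp32 / 1e9
print(f"{table_gb:.1f} GB")  # ~31.1 GB, before gradients and optimizer buffers
```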

For further GPU memory savings, you may set scl_emb_on_cuda to false, which moves SCL embedding lookups to CPU memory. Note that this incurs a moderate training-speed overhead.
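Conceptually, that switch amounts to keeping the frozen SCL table in host memory and transferring only the rows needed for each batch to the GPU; the sketch below uses a small stand-in table and placeholder names rather than the actual implementation.

```python
import torch

# In practice the frozen SCL table has ~35M rows; a small stand-in is used here.
NUM_ITEMS, SCL_DIM = 10_000, 128
scl_table = torch.randn(NUM_ITEMS, SCL_DIM)  # stays in CPU (host) memory

def lookup_scl(ids: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Gather SCL embeddings on CPU, then move only the selected rows to the GPU."""
    rows = scl_table[ids.cpu()]                # CPU gather over the full table
    return rows.to(device, non_blocking=True)  # transfer just this batch's rows

# Example usage (device string is illustrative):
# batch_ids = torch.randint(0, NUM_ITEMS, (8, 1000))
# emb = lookup_scl(batch_ids, torch.device("cuda:0"))
```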

Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@misc{wu2025musesimpleeffectivemultimodal,
      title={MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling}, 
      author={Bin Wu and Feifan Yang and Zhangming Chan and Yu-Ran Gu and Jiawei Feng and Chao Yi and Xiang-Rong Sheng and Han Zhu and Jian Xu and Mang Ye and Bo Zheng},
      year={2025},
      eprint={2512.07216},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.07216}, 
}
```
