InteractAvatar is a novel dual-stream DiT framework that enables talking avatars to perform Grounded Human-Object Interaction (GHOI). Unlike previous methods restricted to simple gestures, our model can perceive the environment from a static reference image and generate complex, text-guided interactions with objects while maintaining high-fidelity lip synchronization.
- Jan 20, 2026: 👋 We release the InteractAvatar paper and project page.
- Jan 30, 2026: 👋 We release the inference code.
- Paper and Project Page
- Inference code
- Pretrained Checkpoints (Initialised from Wan2.2-5B)
- GroundedInter Benchmark Data
- 🔥🔥🔥 News!!
- 📖 Abstract
- 🏗️ Model Architecture
- 📊 Performance
- 🎬 Case Show
- 📜 Requirements
- 🛠️ Installation
- 🧱 Download Models
- 🚀 Inference
- 📝 Citation
- 🙏 Acknowledgements
Generating talking avatars that can interact with their environment remains an open challenge. Existing methods often struggle with the Control-Quality Dilemma, failing to ground actions in the scene or losing video fidelity when complex motions are required.
InteractAvatar introduces a dual-stream framework that explicitly decouples perception planning from video synthesis:
- Perception and Interaction Module (PIM): Handles environmental perception and motion planning (detection & motion generation) based on the reference image and text prompts.
- Audio-Interaction Aware Generation Module (AIM): Synthesizes vivid talking avatars performing object interactions, guided by PIM via a novel Motion-to-Video (M2V) aligner.
Key advantages:
- ✅ Grounded Interaction: Perceives static scenes and interacts with specific objects (e.g., "Pick up the apple on the table").
- ✅ Multimodal Control: Supports any combination of Text, Audio, and Motion inputs.
- ✅ Multi-Scene Control: Generates avatars that can talk, act, and interact with objects across diverse scenes.
- ✅ High Fidelity: Parallel co-generation ensures plausible video quality and precise lip-sync.
Our framework consists of two parallel DiT streams:
- PIM (Planning Brain): Takes the reference image and text prompt to generate a structural motion sequence (skeletal poses + object bounding boxes). It uses a "Perception as Generation" training strategy.
- AIM (Rendering Engine): Takes the audio, reference image, and the motion features from PIM (injected via the M2V Aligner) to generate the final video frames.
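The released code defines the exact layer design, but the data flow between the two streams can be summarized with a minimal sketch. The class names, tensor shapes, and the cross-attention form of the M2V aligner below are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch of the dual-stream data flow (not the released code).
# PIM plans a structural motion sequence; an assumed M2V aligner projects its
# features into AIM's token space, and AIM renders the final video latents.

class M2VAligner(nn.Module):
    """Hypothetical aligner: injects PIM motion features into AIM video tokens."""
    def __init__(self, motion_dim: int, video_dim: int):
        super().__init__()
        self.proj = nn.Linear(motion_dim, video_dim)
        self.attn = nn.MultiheadAttention(video_dim, num_heads=8, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, motion_feats: torch.Tensor) -> torch.Tensor:
        motion_kv = self.proj(motion_feats)                       # (B, T_m, D_v)
        fused, _ = self.attn(video_tokens, motion_kv, motion_kv)  # video tokens cross-attend to the plan
        return video_tokens + fused                               # residual injection


def generate(ref_image, text_prompt, audio, pim, aim, aligner):
    # 1) Planning stream: perceive the scene and generate poses + object boxes.
    motion_feats, motion_seq = pim(ref_image, text_prompt)
    # 2) Rendering stream: synthesize video conditioned on audio and the plan.
    video_tokens = aim.encode(ref_image, audio)
    video_tokens = aligner(video_tokens, motion_feats)
    return aim.decode(video_tokens), motion_seq
```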
We evaluate on our proposed GroundedInter benchmark (400 images, 100+ object types). InteractAvatar significantly outperforms SOTA methods (HuMo, VACE, Wan-S2V, etc.) in both interaction quality and video consistency.
- Clone the repository:
```bash
git clone https://github.com/angzong/InteractAvatar.git
cd InteractAvatar
```
- Create a Conda environment:
```bash
conda create -n interact_avatar python=3.10
conda activate interact_avatar
```
- Install dependencies:
```bash
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install ninja psutil packaging
pip install flash_attn==2.7.4.post1 --no-build-isolation
conda install -c conda-forge librosa
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
```
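After installation, a quick environment check (an optional convenience, not part of the repository) can confirm that PyTorch sees the GPU and that FlashAttention imports cleanly:

```python
# Optional sanity check for the environment set up above.
import torch

print("torch:", torch.__version__)               # expected 2.6.0+cu124
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)  # expected 2.7.4.post1
except ImportError as e:
    print("flash_attn not importable:", e)
```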
| Model Component | Description | Download Link |
|---|---|---|
| InteractAvatar | InteractAvatar model | 🤗 Huggingface |
| InteractAvatar-long | InteractAvatar support for long video generation | 🤗 Huggingface |
| Wav2Vec 2.0 | Audio Feature Extractor | 🤗 Huggingface |
| Wan2.2-TI2V-5B | Pretrained Video Model | 🤗 Huggingface |
| GroundedInter | GHOI benchmark data | 🤗 Huggingface |
Place the weights in the ./ckpt directory.
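The exact folder names under `./ckpt` depend on how you download the weights; the snippet below is only a hedged convenience for verifying that the expected components are present, with placeholder directory names that you should adapt to your actual layout:

```python
# Placeholder layout check; adjust the names to match your actual download paths.
from pathlib import Path

ckpt_root = Path("./ckpt")
expected = [
    "InteractAvatar",       # main model weights
    "InteractAvatar-long",  # long-video variant (optional)
    "wav2vec2",             # audio feature extractor
    "Wan2.2-TI2V-5B",       # pretrained video backbone
]

for name in expected:
    path = ckpt_root / name
    status = "found" if path.exists() else "MISSING"
    print(f"{status:7s} {path}")
```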
You can generate videos using a reference image, an audio file, and a text prompt.
```bash
. test_inter_tia2mv_GPu_hoi.sh
```
Key Parameters:
- `mode`: Choose from 'a2mv', 'ap2v', 'mv', 'a2v', 'p2v' for audio-driven video-motion co-generation, audio-pose-driven generation, text-driven video-motion co-generation, audio-driven video generation, and pose-driven video generation, respectively.
- `transformer_path`: Directory of the `.safetensors` weights. For long video generation, choose the long-video-gen version, which has better identity preservation.
- `test_data_path`: Path to the test-case JSON, which provides the first-frame reference image, the action and interaction text prompt, and an optional audio or motion signal (an illustrative entry is sketched below).
- `audio_guide_scale`: CFG scale for audio sync, 7.5 by default.
- `text_guide_scale`: CFG scale for prompt following, 5 by default.
- `sample_steps`: Number of denoising steps at inference, 40 by default.
- `bad_cfg`: Trick for visual quality improvement, True by default.
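The schema of the test-case JSON is defined by the repository's own example data; the keys below (`ref_image`, `prompt`, `audio`, `motion`) are illustrative guesses based on the parameter description above, not a documented format:

```python
# Hypothetical test-case entry; key names are assumptions based on the
# description of test_data_path, not the repository's documented schema.
import json

test_case = {
    "ref_image": "examples/ref_frame.png",   # first-frame reference image
    "prompt": "Pick up the apple on the table while talking to the camera.",
    "audio": "examples/speech.wav",          # optional, for a2mv / ap2v / a2v modes
    "motion": None,                          # optional motion signal for p2v / ap2v
}

with open("examples/test_case.json", "w") as f:
    json.dump([test_case], f, indent=2)
```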
If you find InteractAvatar useful for your research, please cite our paper:
```bibtex
@article{zhang2026making,
  title={Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars},
  author={Zhang, Youliang and Zhou, Zhengguang and Yu, Zhentao and Huang, Ziyao and Hu, Teng and Liang, Sen and Zhang, Guozhen and Peng, Ziqiao and Li, Shunkai and Chen, Yi and Zhou, Zixiang and Zhou, Yuan and Lu, Qinglin and Li, Xiu},
  journal={arXiv preprint arXiv:2602.01538},
  year={2026}
}
```
We sincerely thank the contributors to the following projects:
Star ⭐ this repo if you find it helpful!