🐱 MaineCoon

Pursuing a Real-Time Audio-Visual Social World Model

MaineCoon is the first real-time audio-visual autoregressive model, a 22B-parameter generative core that streams synchronized audio and video chunk-by-chunk on a single GPU, with sub-second interaction and a record-breaking frame rate of up to 47.5 FPS. It is our first step toward social world models — generative systems that actively observe users, internally simulate social dynamics, and react to people in real time.

🔗 Official website & demos: https://mainecoon.tech/

🕹️ Experience platform (try it live): https://mainecoon.tech/experience-platform

📝 Blog: https://mainecoon.tech/blogs

🤗 Hugging Face: https://huggingface.co/catnip-ai-tech/MaineCoon

📄 Paper (arXiv): https://arxiv.org/abs/2606.17800

Overview

As an ever-growing majority of the world's video is watched on social platforms and created for interactive social purposes, video generation built for social worlds is important yet largely overlooked. Prior world models simulate physical environments or game-world exploration, but remain detached from human-centric social dynamics — they often omit audio, or fail to capture the high-engagement pacing, emotional resonance, and rapid conversational flow that define social media.

We define the position of social world models and build MaineCoon as a prototype generative core toward this goal: a real-time audio-visual generator optimized from the ground up for social-interactive applications. Unlike offline bidirectional video diffusion models, MaineCoon is designed around deployment-time streaming end-to-end — its data infrastructure, training framework, attention pattern, context distribution, KV-cache usage, and agentic streaming inference are all optimized for real-time social audio-visual generation.
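To make "deployment-time streaming" concrete, here is a toy sketch of chunk-by-chunk autoregressive generation over a bounded context window. It is not MaineCoon's implementation: the README does not describe the actual attention pattern or cache policy, so `Chunk`, `toy_generate`, `stream_chunks`, and the `max_context` eviction rule are all hypothetical names and choices made for illustration only.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Chunk:
    index: int        # position of the chunk in the stream
    context_len: int  # number of earlier chunks the generator could see


def toy_generate(index: int, context: List[Chunk]) -> Chunk:
    # Hypothetical stand-in for a model call; a real generator would
    # attend over cached keys/values of the chunks in `context`.
    return Chunk(index=index, context_len=len(context))


def stream_chunks(
    generate: Callable[[int, List[Chunk]], Chunk],
    num_chunks: int,
    max_context: int,
) -> Iterator[Chunk]:
    # Bounded context: once full, the oldest chunk is evicted, so the
    # per-chunk cost stays constant however long the stream runs.
    context = deque(maxlen=max_context)
    for index in range(num_chunks):
        chunk = generate(index, list(context))
        context.append(chunk)  # commit the chunk before emitting it
        yield chunk            # the caller can play it back immediately


chunks = list(stream_chunks(toy_generate, num_chunks=6, max_context=3))
print([c.context_len for c in chunks])  # [0, 1, 2, 3, 3, 3]
```

The point of the sketch is only the control flow: each chunk is emitted as soon as it is produced, and the context the generator sees never grows beyond a fixed size.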

Highlights

⚡ Real-time on a single GPU. A 22B interactive audio-visual autoregressive model capable of streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS on a single H100. Generation cost drops well below $0.001 per second — and keeps falling.
🌍 A new paradigm: social world models. MaineCoon defines the position of social world models and serves as their first generative core, a technical foundation for next-generation AI-native social platforms.
🎓 Forcing-free streaming training. A multi-stage training paradigm — self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD) — that enables native, efficient streaming audio-visual training at 22B scale.
🧠 Agentic streaming inference. An agentic inference framework that supports thousand-second-scale generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.
📊 SocialVideo-Bench. A new benchmark focused on audio-visual social-video generation, with 9 representative metrics covering visual quality, motion, audio quality, audio-visual alignment, and social-video harmony. MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed — a new state of the art for real-time social video generation.
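
As a rough sanity check on the cost claim above, the following sketch converts a GPU rental price and a generation frame rate into a cost per second of generated video. Only the 47.5 FPS figure comes from this README; the $3.00 per hour H100 price and the 24 FPS playback rate are assumptions chosen for illustration, and `cost_per_video_second` is a hypothetical helper.

```python
def cost_per_video_second(gpu_usd_per_hour: float, gen_fps: float, playback_fps: float) -> float:
    """Cost in USD of one second of generated video on one GPU."""
    gpu_usd_per_second = gpu_usd_per_hour / 3600.0
    # Seconds of video produced per wall-clock second of GPU time.
    realtime_factor = gen_fps / playback_fps
    return gpu_usd_per_second / realtime_factor


# Assumed inputs: $3.00/hour rental price and 24 FPS playback (neither is stated in the README).
cost = cost_per_video_second(gpu_usd_per_hour=3.00, gen_fps=47.5, playback_fps=24.0)
print(f"${cost:.5f} per generated second")
```

Under these assumptions the result is about $0.0004 per generated second, which is consistent with the "well below $0.001" figure; a different price or playback rate shifts the number proportionally.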

Showcase

A few hand-picked MaineCoon generations, generated chunk-by-chunk under the streaming regime — audio-visual, with sound.

Demo1.mp4	Demo2.mp4
Demo5.mp4	Demo11.mp4
Demo21.mp4	Demo16.mp4

🎬 Minute-scale, long-form demos are best viewed on our blog. 🕹️ Try MaineCoon live at the experience platform.

Benchmark — SocialVideo-Bench

Table 2. Main quantitative results on SocialVideo-Bench. 🐱 MaineCoon (Ours) achieves the best average score and the top score on 6 of the 9 individual metrics, including the two most comprehensive ones, Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS), over both streaming and bidirectional baselines.

Type	Model	Vis↑	Mot↑	Aud↑	IB-TV↑	IB-TA↑	IB-AV↑	AV-Al↑	AVH↑	JAVIS↑	Average↑
Bidirectional T2AV	JavisDiT++	4.39	2.22	4.06	0.134	0.070	0.151	0.312	0.136	0.112	0.711
	Ovi	4.44	1.89	3.76	0.138	0.079	0.191	0.412	0.188	0.162	0.779
	JoyAI-Echo	4.61	1.17	3.47	0.147	0.088	0.226	0.319	0.196	0.173	0.749
	MoVA	4.66	1.68	3.69	0.133	0.105	0.258	0.359	0.245	0.216	0.842
	LTX-2.3	4.10	0.99	4.06	0.132	0.111	0.311	0.334	0.287	0.247	0.848
Streaming TA2V	LiveAvatar	4.60	1.46	4.13	0.131	0.120	0.316	0.326	0.291	0.246	0.892
	SoulX-FlashTalk	4.65	1.99	4.07	0.128	0.120	0.307	0.279	0.283	0.238	0.895
Streaming T2AV	🐱 MaineCoon (Ours)	4.71	1.62	4.35	0.127	0.130	0.318	0.334	0.308	0.272	0.934 🥇

🐱 = our method · bold = best, italic = second best. Metrics: Vis: visual quality · Mot: motion · Aud: audio quality · IB-TV / IB-TA / IB-AV: ImageBind Text–Video / Text–Audio / Audio–Video alignment · AV-Al: audio–visual alignment · AVH: Audio-Visual Harmony · JAVIS: Joint Audio-Visual Integrated Score. See the technical report for the full benchmark and metric definitions.
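
The per-metric winners can be read directly off the table. The short script below transcribes the Table 2 scores and finds the best model for each metric; the names `METRICS`, `SCORES`, and `best_per_metric` are introduced here for illustration, and the Average column is left out because its formula is defined in the technical report rather than in this README.

```python
# Scores transcribed from Table 2, in the table's column order.
METRICS = ["Vis", "Mot", "Aud", "IB-TV", "IB-TA", "IB-AV", "AV-Al", "AVH", "JAVIS"]
SCORES = {
    "JavisDiT++":      [4.39, 2.22, 4.06, 0.134, 0.070, 0.151, 0.312, 0.136, 0.112],
    "Ovi":             [4.44, 1.89, 3.76, 0.138, 0.079, 0.191, 0.412, 0.188, 0.162],
    "JoyAI-Echo":      [4.61, 1.17, 3.47, 0.147, 0.088, 0.226, 0.319, 0.196, 0.173],
    "MoVA":            [4.66, 1.68, 3.69, 0.133, 0.105, 0.258, 0.359, 0.245, 0.216],
    "LTX-2.3":         [4.10, 0.99, 4.06, 0.132, 0.111, 0.311, 0.334, 0.287, 0.247],
    "LiveAvatar":      [4.60, 1.46, 4.13, 0.131, 0.120, 0.316, 0.326, 0.291, 0.246],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.128, 0.120, 0.307, 0.279, 0.283, 0.238],
    "MaineCoon":       [4.71, 1.62, 4.35, 0.127, 0.130, 0.318, 0.334, 0.308, 0.272],
}


def best_per_metric(scores, metrics):
    """Map each metric to the model with the highest score (all metrics are higher-is-better)."""
    return {
        metric: max(scores, key=lambda name: scores[name][i])
        for i, metric in enumerate(metrics)
    }


winners = best_per_metric(SCORES, METRICS)
mainecoon_wins = [m for m, model in winners.items() if model == "MaineCoon"]
print(mainecoon_wins)       # ['Vis', 'Aud', 'IB-TA', 'IB-AV', 'AVH', 'JAVIS']
print(len(mainecoon_wins))  # 6
```

The three metrics led by other models are Mot (JavisDiT++), IB-TV (JoyAI-Echo), and AV-Al (Ovi).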

Table 3. Latency and model size comparison. Sampling throughput (FPS) is measured for 480P 20-second generation on a single H100 GPU. 🐱 MaineCoon (Ours) is the largest streaming model in the comparison, yet by far the fastest overall: about 7× faster than the other streaming audio-visual generators, and roughly 2.5× faster than even a 1.3B streaming video model.

Type	Model	Params	FPS↑
Bidirectional T2AV	JavisDiT++	1.8B	0.87
	Ovi	11B	0.58
	JoyAI-Echo	23B	18.0
	MoVA	32B	0.26
	LTX-2.3	22B	1.40
	LTX-2.3-Distilled	22B	20.7
Streaming T2V	Causal-Forcing	1.3B	19.1
	Helios-Distilled	14B	18.2
	Krea	14B	6.1
Streaming TA2V	LiveAvatar	14B	6.7
	SoulX-FlashTalk	14B	6.6
Streaming T2AV	🐱 MaineCoon (Ours)	22B	47.5 🥇

🐱 = our method · bold = best, italic = second best. FPS for 480P-20s on a single H100.
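
The speedup figures follow from simple ratios of the FPS column. The snippet below recomputes them from the Table 3 values; `FPS` and `speedup_over` are names introduced here for illustration.

```python
# Throughput (FPS) transcribed from Table 3.
FPS = {
    "JavisDiT++": 0.87,
    "Ovi": 0.58,
    "JoyAI-Echo": 18.0,
    "MoVA": 0.26,
    "LTX-2.3": 1.40,
    "LTX-2.3-Distilled": 20.7,
    "Causal-Forcing": 19.1,
    "Helios-Distilled": 18.2,
    "Krea": 6.1,
    "LiveAvatar": 6.7,
    "SoulX-FlashTalk": 6.6,
    "MaineCoon": 47.5,
}


def speedup_over(baseline: str, target: str = "MaineCoon") -> float:
    """How many times faster `target` is than `baseline`."""
    return FPS[target] / FPS[baseline]


speedups = {name: speedup_over(name) for name in FPS if name != "MaineCoon"}
for name in ("LiveAvatar", "SoulX-FlashTalk", "Causal-Forcing", "LTX-2.3-Distilled"):
    print(f"{name}: {speedups[name]:.1f}x")
```

The two streaming TA2V baselines come out at roughly 7.1× and 7.2×, and the 1.3B Causal-Forcing model at roughly 2.5×.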

Paper

The full paper is available on arXiv:

📄 arXiv:2606.17800

It covers the social-video data infrastructure, the native streaming autoregressive training recipe, the agentic streaming inference framework, SocialVideo-Bench, and a position/outlook on social world models.

Links

Website & live demos: https://mainecoon.tech/
Experience platform (try it live): https://mainecoon.tech/experience-platform
Blog: https://mainecoon.tech/blogs
Hugging Face: https://huggingface.co/catnip-ai-tech/MaineCoon
Paper (arXiv): https://arxiv.org/abs/2606.17800

This repository hosts the project README, demo showcase, and links. It does not contain model weights or source code. For demos and the latest updates, please visit mainecoon.tech.

Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

🎬 LTX-2.3 & the LTX series — Lightricks. MaineCoon's audio-visual backbone builds on the excellent open LTX-2.3 model. Huge credit to the LTX team and the broader LTX-Video series.
- LTX-2 (incl. LTX-2.3): https://github.com/Lightricks/LTX-2
- LTX-Video: https://github.com/Lightricks/LTX-Video
⚡ DMD series & the distribution-matching distillation community. Our reinforced online-policy distillation (ROPD) builds on the Distribution Matching Distillation (DMD / DMD2) line of work and the wider few-step / real-time distillation community.
- DMD2: https://github.com/tianweiy/DMD2
- DMD (project page): https://tianweiy.github.io/dmd/

We thank these projects and their communities for advancing real-time, few-step, and streaming video generation.

Citation

If you find MaineCoon useful in your research, please consider citing:

@article{catnip2026mainecoon,
  title        = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author       = {Lichen Bai and Tianhao Zhang and Shitong Shao and Dingwei Tan and Qiyu Zhong and Zhengpeng Xie and Haopeng Li and Qinghao Huang and Dandan Shen and Tengjiao Ji and Wei Wang and Peicheng Wu and Yuxuan Zhao and Xiangyu Zhu and Welly Luo and Shurui Yang and Zeke Xie},
  year         = {2026},
  journal      = {arXiv preprint arXiv:2606.17800},
  url          = {https://arxiv.org/abs/2606.17800}
}

About

MaineCoon is developed by the Catnip AI Team. Learn more at https://mainecoon.tech/.
