Here we track the latest AI multimodal models, including multimodal foundation models, LLMs, agents, audio, image, video, music, and 3D content.
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | SEED-Story | SEED-Story: Multimodal Long Story Generation with Large Language Model. | arXiv | Hugging Face |
2024-07 | VTA-LDM | Video-to-Audio Generation with Hidden Alignment. | arXiv | Hugging Face |
2024-07 | Qwen2-Audio | Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. | arXiv | |
2024-07 | Moshi | Moshi is an experimental conversational AI. | Website | |
2024-07 | Anole | Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation. | Hugging Face | |
2024-06 | Cambrian-1 | A Fully Open, Vision-Centric Exploration of Multimodal LLMs. | arXiv | Hugging Face |
2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | Hugging Face | |
2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
2024-05 | MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. | Hugging Face | |
2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | API | |
2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | API | |
2024-04 | MyGO | Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. | arXiv | |
2024-04 | InternLM-XComposer2 | InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. | arXiv | Hugging Face |
2024-01 | MMVP | Exploring the Visual Shortcomings of Multimodal LLMs. | arXiv | |
2023-12 | V* | Guided Visual Search as a Core Mechanism in Multimodal LLMs. | arXiv | |
2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
2023-11 | Video-LLaVA | Learning United Visual Representation by Alignment Before Projection. | arXiv | Hugging Face |
2023-10 | LanguageBind | Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. | arXiv | Hugging Face |
2023-07 | Emu | Emu: Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
2022-11 | EVA | EVA: Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | Index-1.9B | A SOTA lightweight multilingual LLM | Hugging Face | |
2024-06 | Claude 3.5 Sonnet | Claude 3.5 Sonnet | API | |
2024-06 | Nemotron-4 | Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. | arXiv | Hugging Face |
2024-06 | Qwen2 | Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud. | Hugging Face | |
2024-04 | Llama 3 | Meta Llama 3 is the next generation of our state-of-the-art open source large language model. | Hugging Face | |
2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | API | |
2024-03 | Grok-1 | The weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1. | Hugging Face | |
2023-11 | Mixtral | Open and portable generative AI for devs and businesses. | arXiv | Hugging Face |
2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | Hugging Face | |
2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | API | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | OmAgent | A multimodal agent framework for solving complex tasks. | arXiv | |
2024-06 | GraphRAG | A modular graph-based Retrieval-Augmented Generation (RAG) system. | Website | |
2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities. | arXiv | |
2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
2024-06 | Translation Agent | Agentic translation using reflection workflow. | ||
2024-06 | Atomic Agents | The Atomic Agents framework is designed to be modular, extensible, and easy to use. | ||
2024-05 | Pipecat | Open Source framework for voice and multimodal conversational AI. | ||
2024-02 | V-IRL | Grounding Virtual Intelligence in Real Life. | arXiv |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | CosyVoice | Multilingual large voice generation model with full-stack support for inference, training, and deployment. | ||
2024-06 | DEX-TTS | Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. | arXiv | Website |
2024-05 | ChatTTS | ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | ||
2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | SenseVoice | SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). | Hugging Face | |
2024-05 | TeleSpeech-ASR | Large speech model for multi-dialect ASR. | Hugging Face | |
2022-12 | Whisper | Whisper is a general-purpose speech recognition model. | arXiv | API |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | FoleyCrafter | FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. | arXiv | Hugging Face |
2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | UltraEdit | UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. | arXiv | Hugging Face |
2024-07 | UltraPixel | UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks. | arXiv | |
2024-07 | PaintsUndo | PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings. | ||
2024-07 | Kolors | Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis. | Hugging Face | |
2024-06 | Depth Anything V2 | Depth Anything V2. | arXiv | Hugging Face |
2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
2024-05 | Omost | Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability. | Hugging Face | |
2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. | arXiv | Hugging Face |
2024-02 | MIGC | MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis. | arXiv | |
2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a description in natural language. | API | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | Diffutoon | High-Resolution Editable Toon Shading via Diffusion Models. | arXiv | Website |
2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | ||
2024-05 | Video-of-Thought | Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. | Website | |
2024-05 | MOFA-Video | MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model. | arXiv | Hugging Face |
2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
2024-05 | Vidu | Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
2024-02 | Sora | Sora is an AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
2023-11 | Pika | Pika is the idea-to-video platform that sets your creativity in motion. | ||
2023-03 | Runway | Runway is an applied AI research company shaping the next era of art, entertainment and human creativity. |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-05 | Diff-BGM | A Diffusion Model for Video Background Music Generation. | arXiv | |
2024-04 | Udio | Udio - AI Music Generator | Website | |
2023-12 | Suno | Suno is building a future where anyone can make great music. | Website | |
2023-12 | Soundry AI | Generative AI tools including text-to-sound and infinite sample packs. | Website | |
2023-12 | Sonauto | Sonauto is an AI music editor that turns prompts, lyrics, or melodies into full songs in any style. | Website |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | CharacterGen | CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization. | arXiv | Website |
2024-07 | GALA3D | GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting. | arXiv | Website |
2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
2024-03 | GaussCtrl | GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing. | arXiv | |
2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |