AI Multimodal Timeline

This page tracks the latest AI multimodal models, including multimodal foundation models, LLMs, agents, and audio, image, video, music, and 3D content. 🔥

Project List

Multimodal Model

Date	Source	Description	Paper	Model
2024-07	SEED-Story	SEED-Story: Multimodal Long Story Generation with Large Language Model.	arXiv	Hugging Face
2024-07	VTA-LDM	Video-to-Audio Generation with Hidden Alignment.	arXiv	Hugging Face
2024-07	Qwen2-Audio	Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.	arXiv
2024-07	Moshi	Moshi is an experimental conversational AI.		Website
2024-07	Anole	Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation.		Hugging Face
2024-06	Cambrian-1	A Fully Open, Vision-Centric Exploration of Multimodal LLMs.	arXiv	Hugging Face
2024-06	MINT-1T	Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens.	arXiv
2024-06	OmniTokenizer	A Joint Image-Video Tokenizer for Visual Generation.	arXiv	Website
2024-06	ml-4m	A framework for training any-to-any multimodal foundation models.	arXiv	Website
2024-06	VideoLLaMA 2	Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs.	arXiv	Hugging Face
2024-05	ManyICL	Many-Shot In-Context Learning in Multimodal Foundation Models.	arXiv
2024-05	Contrastive ALignment (CAL)	Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment.	arXiv
2024-05	Groma	Grounded Multimodal Large Language Model with Localized Visual Tokenization.	arXiv	Hugging Face
2024-05	CogVLM2	GPT4V-level open-source multi-modal model based on Llama3-8B.		Hugging Face
2024-05	Chameleon	Mixed-Modal Early-Fusion Foundation Models.	arXiv
2024-05	Lumina-T2X	Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.	arXiv	Hugging Face
2024-05	MiniCPM-Llama3-V 2.5	MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters.		Hugging Face
2024-05	Gemini	Build with state-of-the-art generative models and tools to make AI helpful for everyone.		API
2024-05	GPT-4o	GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.		API
2024-04	MyGO	Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion.	arXiv
2024-04	InternLM-XComposer2	InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.	arXiv	Hugging Face
2024-01	MMVP	Exploring the Visual Shortcomings of Multimodal LLMs.	arXiv
2023-12	V*	Guided Visual Search as a Core Mechanism in Multimodal LLMs.	arXiv
2023-12	Tokenize Anything	Tokenize Anything via Prompting.	arXiv	Hugging Face
2023-11	ShareGPT4V	Improving Large Multi-Modal Models with Better Captions.	arXiv	Hugging Face
2023-11	Video-LLaVA	Learning United Visual Representation by Alignment Before Projection.	arXiv	Hugging Face
2023-10	LanguageBind	Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment.	arXiv	Hugging Face
2023-07	Emu	Emu: Generative Multimodal Models from BAAI.	arXiv	Hugging Face
2023-05	ImageBind	One Embedding Space To Bind Them All.	arXiv	Website
2022-11	EVA	EVA: Visual Representation Fantasies from BAAI.	arXiv	Hugging Face

LLM

Date	Source	Description	Paper	Model
2024-07	Index-1.9B	A SOTA lightweight multilingual LLM		Hugging Face
2024-06	Claude 3.5 Sonnet	Claude 3.5 Sonnet		API
2024-06	Nemotron-4	Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs.	arXiv	Hugging Face
2024-06	Qwen2	Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.		Hugging Face
2024-04	Llama 3	Meta Llama 3 is the next generation of our state-of-the-art open source large language model.		Hugging Face
2024-03	Claude 3	Talk with Claude, an AI assistant from Anthropic.		API
2024-03	Grok-1	The weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1.		Hugging Face
2023-11	Mixtral	Open and portable generative AI for devs and businesses.	arXiv	Hugging Face
2023-09	Baichuan 2	A series of large language models developed by Baichuan Intelligent Technology.		Hugging Face
2023-07	GPT-4	GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses.		API

Agent

Date	Source	Description	Paper	Model
2024-07	OmAgent	A multimodal agent framework for solving complex tasks.	arXiv
2024-06	GraphRAG	A modular graph-based Retrieval-Augmented Generation (RAG) system.		Website
2024-06	Mixture of Agents (MoA)	Mixture-of-Agents Enhances Large Language Model Capabilities.	arXiv
2024-06	Buffer of Thoughts	Thought-Augmented Reasoning with Large Language Models.	arXiv
2024-06	Translation Agent	Agentic translation using reflection workflow.
2024-06	Atomic Agents	The Atomic Agents framework is designed to be modular, extensible, and easy to use.
2024-05	Pipecat	Open Source framework for voice and multimodal conversational AI.
2024-02	V-IRL	Grounding Virtual Intelligence in Real Life.	arXiv

Audio

Audio/Text-to-Speech

Date	Source	Description	Paper	Model
2024-07	CosyVoice	Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
2024-06	DEX-TTS	Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability.	arXiv	Website
2024-05	ChatTTS	ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.
2023-06	StyleTTS 2	Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.	arXiv	Hugging Face

Audio/Automatic Speech Recognition

Date	Source	Description	Paper	Model
2024-07	SenseVoice	SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).		Hugging Face
2024-05	TeleSpeech-ASR	A large speech model for multi-dialect ASR.		Hugging Face
2022-12	Whisper	Whisper is a general-purpose speech recognition model.	arXiv	API

Audio/Audio Generation

Date	Source	Description	Paper	Model
2024-07	FoleyCrafter	FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds.	arXiv	Hugging Face
2024-06	SEE-2-SOUND	Zero-Shot Spatial Environment-to-Spatial Sound.	arXiv
2024-05	Make-An-Audio 3	Transforming Text into Audio via Flow-based Large Diffusion Transformers.	arXiv	Hugging Face

Image

Date	Source	Description	Paper	Model
2024-07	UltraEdit	UltraEdit: Instruction-based Fine-Grained Image Editing at Scale.	arXiv	Hugging Face
2024-07	UltraPixel	UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks.	arXiv
2024-07	PaintsUndo	PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings.
2024-07	Kolors	Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis.		Hugging Face
2024-06	Depth Anything V2	Depth Anything V2.	arXiv	Hugging Face
2024-06	AutoStudio	Crafting Consistent Subjects in Multi-turn Interactive Image Generation.	arXiv
2024-06	MimicBrush	Zero-shot Image Editing with Reference Imitation.	arXiv	Hugging Face
2024-06	LlamaGen	Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.	arXiv	Hugging Face
2024-05	Omost	Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability.		Hugging Face
2024-05	Hunyuan-DiT	A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding.	arXiv	Hugging Face
2024-02	MIGC	MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis.	arXiv
2023-10	DALL·E 3	DALL·E is an AI system that can create realistic images and art from a description in natural language.		API

Video

Date	Source	Description	Paper	Model
2024-06	Diffutoon	High-Resolution Editable Toon Shading via Diffusion Models.	arXiv	Website
2024-05	Video-MME	The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
2024-05	Video-of-Thought	Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition.		Website
2024-05	MOFA-Video	MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.	arXiv	Hugging Face
2024-05	MotionLLM	Understanding Human Behaviors from Human Motions and Videos.	arXiv
2024-05	Vidu	Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models.	arXiv
2024-02	Sora	Sora is an AI model that can create realistic and imaginative scenes from text instructions.	Technical Report
2023-11	Pika	Pika is the idea-to-video platform that sets your creativity in motion.
2023-03	Runway	Runway is an applied AI research company shaping the next era of art, entertainment and human creativity.

Music

Date	Source	Description	Paper	Model
2024-05	Diff-BGM	A Diffusion Model for Video Background Music Generation.	arXiv
2024-04	Udio	Udio - AI Music Generator		Website
2023-12	Suno	Suno is building a future where anyone can make great music.		Website
2023-12	Soundry AI	Generative AI tools including text-to-sound and infinite sample packs.		Website
2023-12	Sonauto	Sonauto is an AI music editor that turns prompts, lyrics, or melodies into full songs in any style.		Website

3D

Date	Source	Description	Paper	Model
2024-07	CharacterGen	CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization.	arXiv	Website
2024-07	GALA3D	GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting.	arXiv	Website
2024-06	Unique3D	High-Quality and Efficient 3D Mesh Generation from a Single Image.	arXiv	Hugging Face
2024-06	DreamGaussian4D	Generative 4D Gaussian Splatting.	arXiv	Hugging Face
2024-03	GaussCtrl	GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing.	arXiv
2024-03	GaussianCube	A Structured and Explicit Radiance Representation for 3D Generative Modeling.	arXiv	Hugging Face
2024-03	TripoSR	Fast 3D Object Reconstruction from a Single Image.	arXiv	Hugging Face
