- Clone this repository and navigate to the XmodelVLM folder

  ```bash
  git clone https://github.com/XiaoduoAILab/XmodelVLM.git
  cd xmodelvlm
  ```
- Install Package

  ```bash
  conda create -n xmodelvlm python=3.10 -y
  conda activate xmodelvlm
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
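A quick, optional sanity check after installation, assuming requirements.txt pulls in PyTorch (this snippet is not part of the repository):

```python
# Optional post-install check (assumes PyTorch is installed via requirements.txt).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```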
- Run the inference example

  ```bash
  python inference.py
  ```
The overall architecture of our network closely mirrors that of LLaVA-1.5. It consists of three key components (a minimal sketch of how they compose follows the list below):
- a vision encoder (CLIP ViT-L/14)
- a lightweight language model (LLM)
- a projector responsible for aligning the visual and textual spaces (XDP)
Refer to our paper for more details!
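The sketch below is purely illustrative: the class and argument names are assumptions rather than the repository's actual code, and it only shows how the three components are composed (visual features from the frozen CLIP encoder are projected by XDP into the LLM's token space and concatenated with the text embeddings).

```python
# Illustrative sketch only; module names are hypothetical, not the repo's classes.
import torch
import torch.nn as nn


class VLMSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP ViT-L/14
        self.projector = projector            # XDP: visual features -> LLM token space
        self.llm = llm                        # lightweight language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vis_feats = self.vision_encoder(pixel_values)   # (B, num_patches, d_vision)
        vis_tokens = self.projector(vis_feats)          # (B, num_visual_tokens, d_llm)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs)                         # language-model outputs
```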
The training process of Xmodel_VLM is divided into two stages:
- stage I: pre-training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + ❄️ frozen LLM (see the freezing sketch after this list)
- stage II: multi-task training
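To make the stage-I setting concrete, here is a hedged sketch of the freezing scheme. It assumes the hypothetical module layout from the architecture sketch above and is not the repository's actual training code:

```python
# Hedged sketch: stage-I parameter freezing for a hypothetical `model`
# with vision_encoder / projector / llm sub-modules.
def freeze_for_stage1(model):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False   # frozen vision encoder
    for p in model.llm.parameters():
        p.requires_grad = False   # frozen LLM
    for p in model.projector.parameters():
        p.requires_grad = True    # learnable XDP projector

    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage I trainable parameters: {n_trainable}")
```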
First, download the MobileLLaMA chatbot checkpoints from the Hugging Face website.
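If you prefer to script the download, the sketch below uses the huggingface_hub library; the `repo_id` and `local_dir` values are assumptions and should be replaced with the checkpoint you actually need.

```python
# Hedged sketch: fetch checkpoints with huggingface_hub.
# The repo_id below is an assumption; substitute the MobileLLaMA checkpoint you use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mtgv/MobileLLaMA-1.4B-Chat",           # assumed checkpoint name
    local_dir="checkpoints/MobileLLaMA-1.4B-Chat",  # assumed local path
)
```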
- prepare benchmark data
We evaluate models on a diverse set of 9 benchmarks: GQA, MMBench, MMBench-CN, MME, POPE, SQA, TextVQA, VizWiz, and MM-Vet. Follow the instructions below to prepare the datasets:
Data Download Instructions
- download some useful data/scripts pre-collected by us.

  ```bash
  unzip benchmark_data.zip && cd benchmark_data
  bmk_dir=${work_dir}/data/benchmark_data
  ```
- gqa
  - download its image data following the official instructions here
  - `cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images`
- mme
  - download the data following the official instructions here.
  - `cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images`
- pope
  - download coco from POPE following the official instructions here.
  - `cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014`
- sqa
  - download images from the `data/scienceqa` folder of the ScienceQA repo.
  - `cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images`
- textvqa
  - download images following the instructions here.
  - `cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images`
- mmbench
  - no action is needed.
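As a convenience (not part of the repository), a small script like the sketch below can verify that the expected image directories or symlinks are in place. The `work_dir` environment variable and the sub-directory names are assumptions based on the commands above.

```python
# Hedged sanity check: confirm each benchmark's image path exists under
# ${work_dir}/data/benchmark_data. Adjust work_dir if the variable is not set.
import os

bmk_dir = os.path.expandvars("${work_dir}/data/benchmark_data")
expected = {
    "gqa": "images",
    "mme": "images",
    "pope": "val2014",
    "sqa": "images",
    "textvqa": "train_images",
}
for bench, sub in expected.items():
    path = os.path.join(bmk_dir, bench, sub)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{bench:8s} {path} -> {status}")
```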
We provide detailed pre-training, fine-tuning, and testing shell scripts. For example:

```bash
bash scripts/pretrain.sh 0,1,2,3  # run on GPUs 0,1,2,3
```
If you find Xmodel_VLM useful in your research or applications, please consider giving it a star ⭐ and citing it with the following BibTeX:
```bibtex
@misc{xu2024xmodelvlm,
      title={Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model},
      author={Wanting Xu and Yang Liu and Langping He and Xucheng Huang and Ling Jiang},
      year={2024},
      eprint={2405.09215},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```