- Clone this repository and navigate to the XmodelVLM folder

  ```bash
  git clone https://github.com/XiaoduoAILab/XmodelVLM.git
  cd xmodelvlm
  ```
- Install Package

  ```bash
  conda create -n xmodelvlm python=3.10 -y
  conda activate xmodelvlm
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
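A quick, optional sanity check after installation, assuming requirements.txt pulls in PyTorch (this snippet is not part of the repository):

```python
# Optional post-install check (assumes PyTorch is installed via requirements.txt).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```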
- Run the inference example

  ```bash
  python inference.py
  ```
The overall architecture of our network closely mirrors that of LLaVA-1.5. It consists of three key components (a minimal sketch of how they compose follows the list below):
- a vision encoder (CLIP ViT-L/14)
- a lightweight language model (LLM)
- a projector responsible for aligning the visual and textual spaces (XDP)
Refer to our paper for more details!
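The sketch below is purely illustrative: the class and argument names are assumptions rather than the repository's actual code, and it only shows how the three components are composed (visual features from the frozen CLIP encoder are projected by XDP into the LLM's token space and concatenated with the text embeddings).

```python
# Illustrative sketch only; module names are hypothetical, not the repo's classes.
import torch
import torch.nn as nn


class VLMSketch(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP ViT-L/14
        self.projector = projector            # XDP: visual features -> LLM token space
        self.llm = llm                        # lightweight language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vis_feats = self.vision_encoder(pixel_values)   # (B, num_patches, d_vision)
        vis_tokens = self.projector(vis_feats)          # (B, num_visual_tokens, d_llm)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs)                         # language-model outputs
```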
The training process of Xmodel_VLM is divided into two stages:
- stage I: pre-training
  - ❄️ frozen vision encoder + 🔥 learnable XDP projector + ❄️ frozen LLM (see the freezing sketch after this list)
- stage II: multi-task training
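To make the stage-I setting concrete, here is a hedged sketch of the freezing scheme. It assumes the hypothetical module layout from the architecture sketch above and is not the repository's actual training code:

```python
# Hedged sketch: stage-I parameter freezing for a hypothetical `model`
# with vision_encoder / projector / llm sub-modules.
def freeze_for_stage1(model):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False   # frozen vision encoder
    for p in model.llm.parameters():
        p.requires_grad = False   # frozen LLM
    for p in model.projector.parameters():
        p.requires_grad = True    # learnable XDP projector

    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"stage I trainable parameters: {n_trainable}")
```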
First, download the MobileLLaMA chatbot checkpoints from the Hugging Face website.
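If you prefer to script the download, the sketch below uses the huggingface_hub library; the `repo_id` and `local_dir` values are assumptions and should be replaced with the checkpoint you actually need.

```python
# Hedged sketch: fetch checkpoints with huggingface_hub.
# The repo_id below is an assumption; substitute the MobileLLaMA checkpoint you use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mtgv/MobileLLaMA-1.4B-Chat",           # assumed checkpoint name
    local_dir="checkpoints/MobileLLaMA-1.4B-Chat",  # assumed local path
)
```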
- prepare benchmark data
We evaluate models on a diverse set of 9 benchmarks: GQA, MMBench, MMBench-CN, MME, POPE, SQA, TextVQA, VizWiz, and MM-Vet. Follow the instructions below to prepare the datasets:
Data Download Instructions
- download some useful data/scripts pre-collected by us.

  ```bash
  unzip benchmark_data.zip && cd benchmark_data
  bmk_dir=${work_dir}/data/benchmark_data
  ```
- gqa
  - download its image data following the official instructions here
  - `cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images`
- mme
  - download the data following the official instructions here.
  - `cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images`
- pope
  - download coco from POPE following the official instructions here.
  - `cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014`
- sqa
  - download images from the `data/scienceqa` folder of the ScienceQA repo.
  - `cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images`
- textvqa
  - download images following the instructions here.
  - `cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images`
- mmbench
  - no action is needed.
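As a convenience (not part of the repository), a small script like the sketch below can verify that the expected image directories or symlinks are in place. The `work_dir` environment variable and the sub-directory names are assumptions based on the commands above.

```python
# Hedged sanity check: confirm each benchmark's image path exists under
# ${work_dir}/data/benchmark_data. Adjust work_dir if the variable is not set.
import os

bmk_dir = os.path.expandvars("${work_dir}/data/benchmark_data")
expected = {
    "gqa": "images",
    "mme": "images",
    "pope": "val2014",
    "sqa": "images",
    "textvqa": "train_images",
}
for bench, sub in expected.items():
    path = os.path.join(bmk_dir, bench, sub)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{bench:8s} {path} -> {status}")
```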
We provide detailed pre-training, fine-tuning, and testing shell scripts. For example:

```bash
bash scripts/pretrain.sh 0,1,2,3  # run on GPUs 0,1,2,3
```
If you find Xmodel_VLM useful in your research or applications, please consider giving it a star ⭐ and citing it with the following BibTeX:
```bibtex
@misc{xu2024xmodelvlm,
      title={Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model},
      author={Wanting Xu and Yang Liu and Langping He and Xucheng Huang and Ling Jiang},
      year={2024},
      eprint={2405.09215},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```