Xmodel_VLM: A Simple Baseline for Multimodal Vision Language Model


🛠️ Install

  1. Clone this repository and navigate to the XmodelVLM folder

    git clone https://github.com/XiaoduoAILab/XmodelVLM.git
    cd XmodelVLM
  2. Install Package

    conda create -n xmodelvlm python=3.10 -y
    conda activate xmodelvlm
    pip install --upgrade pip
    pip install -r requirements.txt
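
After installation, you can sanity-check the environment with a short Python snippet (a minimal sketch; it only assumes that requirements.txt pulls in torch and transformers):

    # sanity-check the core dependencies (assumed to come from requirements.txt)
    import torch
    import transformers

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("transformers:", transformers.__version__)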

🗝️ Quick Start

Example for Xmodel_VLM model inference

python inference.py

🪜 Step-by-step Tutorial

Xmodel_VLM

The overall architecture of our network closely mirrors that of LLaVA-1.5. It consists of three key components:

  • a vision encoder (CLIP ViT-L/14)
  • a lightweight language model (LLM)
  • a projector responsible for aligning the visual and textual spaces (XDP)

Refer to our paper for more details! An illustrative sketch of this three-part layout follows the figure references below.

Figures: assets/model archtecture.jpeg (overall architecture), assets/XDP.jpeg (the XDP projector).
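
The snippet below is a rough, hypothetical sketch of the three components, not the repository's actual implementation: the CLIP ViT-L/14 encoder is loaded through Hugging Face transformers, while the projector here is a plain two-layer MLP standing in for XDP and the LLM is omitted; all class and attribute names are illustrative.

    # Illustrative sketch of the three components (hypothetical names, not the repo's code).
    import torch.nn as nn
    from transformers import CLIPVisionModel

    class ToyVLM(nn.Module):
        def __init__(self, llm_hidden_size=2048):
            super().__init__()
            # 1) vision encoder: CLIP ViT-L/14, kept frozen during training
            self.vision_encoder = CLIPVisionModel.from_pretrained(
                "openai/clip-vit-large-patch14")
            vis_dim = self.vision_encoder.config.hidden_size  # 1024 for ViT-L/14
            # 2) projector mapping visual features into the LLM embedding space
            #    (a plain MLP here; the real model uses XDP, see the paper)
            self.projector = nn.Sequential(
                nn.Linear(vis_dim, llm_hidden_size),
                nn.GELU(),
                nn.Linear(llm_hidden_size, llm_hidden_size),
            )
            # 3) the lightweight LLM would consume these projected visual tokens
            #    together with the text embeddings (omitted in this sketch)

        def encode_image(self, pixel_values):
            feats = self.vision_encoder(pixel_values).last_hidden_state  # (B, N, vis_dim)
            return self.projector(feats)                                 # (B, N, llm_hidden_size)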

The training process of Xmodel_VLM is divided into two stages:

  • stage I: pre-training
    • ❄️ frozen vision encoder + 🔥 learnable XDP projector + ❄️ frozen LLM
  • stage II: multi-task training
    • ❄️ frozen vision encoder + 🔥 learnable XDP projector + 🔥 learnable LLM

Figure: assets/training strategy.jpeg (training strategy).
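
A rough sketch of this freezing schedule in generic PyTorch terms (module names such as vision_encoder, projector, and llm are illustrative placeholders, not the repository's attribute names):

    # toggle requires_grad per component according to the training stage (illustrative)
    def set_trainable(module, trainable):
        for p in module.parameters():
            p.requires_grad_(trainable)

    def configure_stage(model, stage):
        set_trainable(model.vision_encoder, False)   # frozen in both stages
        set_trainable(model.projector, True)         # XDP trained in both stages
        set_trainable(model.llm, stage == 2)         # LLM unfrozen only in stage II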

1️⃣ Prepare Xmodel_VLM checkpoints

Please first download the MobileLLaMA chatbot checkpoints from the Hugging Face website.

2️⃣ Prepare data

  • prepare benchmark data
    • We evaluate models on a diverse set of 9 benchmarks: GQA, MMBench, MMBench-CN, MME, POPE, SQA, TextVQA, VizWiz, and MM-Vet. Follow the instructions below to manage the datasets:

    • Data Download Instructions
      • download some useful data/scripts pre-collected by us.
        • unzip benchmark_data.zip && cd benchmark_data
        • bmk_dir=${work_dir}/data/benchmark_data
      • gqa
        • download its image data following the official instructions here
        • cd ${bmk_dir}/gqa && ln -s /path/to/gqa/images images
      • mme
        • download the data following the official instructions here.
        • cd ${bmk_dir}/mme && ln -s /path/to/MME/MME_Benchmark_release_version images
      • pope
        • download coco from POPE following the official instructions here.
        • cd ${bmk_dir}/pope && ln -s /path/to/pope/coco coco && ln -s /path/to/coco/val2014 val2014
      • sqa
        • download images from the data/scienceqa folder of the ScienceQA repo.
        • cd ${bmk_dir}/sqa && ln -s /path/to/sqa/images images
      • textvqa
        • download images following the instructions here.
        • cd ${bmk_dir}/textvqa && ln -s /path/to/textvqa/train_images train_images
      • mmbench
        • no action is needed.
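
Before running the evaluations, a short Python check that the symlinked folders above are in place (a sketch; it assumes ${work_dir} is exported and that the layout matches the steps above):

    # verify the benchmark image folders/symlinks created above
    import os

    bmk_dir = os.path.join(os.environ.get("work_dir", "."), "data", "benchmark_data")
    for sub in ("gqa/images", "mme/images", "pope/coco", "pope/val2014",
                "sqa/images", "textvqa/train_images"):
        path = os.path.join(bmk_dir, sub)
        print(("OK      " if os.path.isdir(path) else "MISSING ") + path)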

3️⃣ Run everything with one click!

We provide detailed pre-training, fine-tuning, and testing shell scripts. For example:

bash scripts/pretrain.sh 0,1,2,3  # GPUs 0,1,2,3
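
The comma-separated argument selects the GPUs to use. For completeness, the same launch from Python (a sketch equivalent to the shell command above):

    # launch the pre-training script on GPUs 0-3 (equivalent to the shell command above)
    import subprocess

    subprocess.run(["bash", "scripts/pretrain.sh", "0,1,2,3"], check=True)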

🤝 Acknowledgments

  • LLaVA: Thanks for their wonderful work! 👏
  • MobileVLM: Thanks for their wonderful work! 👏

✏️ Reference

If you find Xmodel_VLM useful in your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry:

@misc{xu2024xmodelvlm,
      title={Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model}, 
      author={Wanting Xu and Yang Liu and Langping He and Xucheng Huang and Ling Jiang},
      year={2024},
      eprint={2405.09215},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
