This project contains the key code (such as the Instruction Alignment Score, IAS) for the paper "VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization".
The IAS is the most critical component of VisLingInstruct and is used to assess the quality of instructions. Our experiments are built on the LAVIS library, which we extended. For example, we added the calculate_ias function to lavis/models/blip2_models/blip2_vicuna_instruct.py. For further details, please refer to mmlm_vicuna.py.
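The exact IAS definition is given in the paper and implemented in calculate_ias; the sketch below does not reproduce it. As a minimal, hypothetical illustration of how a per-instruction score can be used to compare candidate instructions, it swaps in plain perplexity computed from per-token log-probabilities, treating lower perplexity as better. The helper names perplexity and rank_instructions, and the example log-probabilities, are made up for illustration and are not part of this repository.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_instructions(candidates: Dict[str, List[float]]) -> List[str]:
    """Sort candidate instructions from best to worst.

    Hypothetical stand-in for an IAS-style score: lower perplexity is
    treated as better alignment. This is NOT the paper's IAS formula.
    """
    return sorted(candidates, key=lambda text: perplexity(candidates[text]))


# Made-up per-token log-probabilities for two candidate instructions.
candidates = {
    "What is in this picture?": [-1.0, -0.8, -1.2, -0.9],
    "Describe the image.": [-0.2, -0.3, -0.1],
}
print(rank_instructions(candidates))
```

In a real setup the log-probabilities would come from the language model's logits for the instruction tokens; how calculate_ias obtains and combines them should be checked in blip2_vicuna_instruct.py.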
The entry point for training is train.py, and training can be started by editing and running the train.sh script. The related configuration files are located in the train_configs folder. Parameters prefixed with "need:" must be adjusted before training; these are typically paths.
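Before launching a run, it can help to list every parameter that still carries the "need:" marker. The following is a minimal sketch under two assumptions: the marker appears as a prefix of a string value, and the config has already been parsed into nested dicts and lists (for example with PyYAML's yaml.safe_load). The function name find_required_params and the example keys are hypothetical.

```python
from typing import Any, List


def find_required_params(config: Any, prefix: str = "") -> List[str]:
    """Collect dotted key paths whose string value starts with 'need:'.

    Assumes the marker is a string-value prefix in a config that has
    already been parsed into nested dicts/lists.
    """
    found: List[str] = []
    if isinstance(config, dict):
        for key, value in config.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            found.extend(find_required_params(value, path))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            found.extend(find_required_params(value, f"{prefix}[{index}]"))
    elif isinstance(config, str) and config.startswith("need:"):
        found.append(prefix)
    return found


# Hypothetical parsed config; key names are illustrative only.
cfg = {
    "model": {"llm_model": "need: path to the LLM weights", "max_len": 256},
    "run": {"output_dir": "need: output path", "seed": 42},
    "datasets": ["need: data path"],
}
print(find_required_params(cfg))
```

Walking the structure recursively keeps the check independent of how deeply a given config nests its paths.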
Some inference examples are demonstrated in test.py, and the related configuration files are located in the eval_configs folder.
- BLIP2: The model architecture of BLIVA follows BLIP-2. Don't forget to check out this great open-source work if you haven't already.
- Lavis: The codebase we built upon.
- Vicuna: Vicuna-13B demonstrates fantastic language ability, and it's open source.
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions.
@article{zhu2024vislinginstruct,
title={VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization},
author={Zhu, Dongsheng and Tang, Xunzhu and Han, Weidong and Lu, Jinghui and Zhao, Yukun and Xing, Guoliang and Wang, Junfeng and Yin, Dawei},
journal={arXiv preprint arXiv:2402.07398},
year={2024}
}