Streaming Continual Visual Instruction Tuning (StrCVIT) is a new and realistic continual learning setting for MLLMs that models data as a single-pass stream of interleaved and dynamically evolving tasks.
- Clone this repository and navigate to the StrCVIT folder.
git clone https://github.com/chanceche/StrCVIT.git
cd StrCVIT
- Create the environment.
conda create -n strcvit python=3.10 -y
conda activate strcvit
pip install --upgrade pip
pip install -e ".[all]"
- Set PYTHONPATH.
export PYTHONPATH="$PWD:${PYTHONPATH:-}"
The released StrCVIT instruction files are hosted on Hugging Face: chanceche/StrCVIT_dataset.
https://huggingface.co/datasets/chanceche/StrCVIT_dataset
Download them into the repository root:
huggingface-cli download chanceche/StrCVIT_dataset \
--repo-type dataset \
--local-dir StrCVIT_dataset
export STRCVIT_DATASET_DIR="$PWD/StrCVIT_dataset"
The Hugging Face dataset repository contains:
StrCVIT_dataset/
train/
data_001.json ... data_025.json
record.json
test/
AD/
ChartQA/
Fin/
GQA/
Grounding/
ImageNet/
OCRVQA/
Places365/
RS/
TextCaps/
VQAv2/
manifest.json
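
After the download, a quick check can confirm that all 25 training shards arrived. The following is a minimal sketch, assuming the shards are named data_001.json through data_025.json directly under train/ as the listing above suggests; the helper name `missing_train_shards` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_train_shards(dataset_dir, first=1, last=25):
    """Return the expected train shard file names that are absent on disk.

    Assumes shards live at <dataset_dir>/train/data_NNN.json (an assumption
    drawn from the directory listing, not from the repository code).
    """
    train_dir = Path(dataset_dir) / "train"
    expected = [f"data_{i:03d}.json" for i in range(first, last + 1)]
    return [name for name in expected if not (train_dir / name).is_file()]
```

Calling `missing_train_shards(os.environ["STRCVIT_DATASET_DIR"])` returns an empty list when every shard is present.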
The instruction files do not include raw images. Download the raw image datasets listed below, organize them under a single image root, and set:
export STRCVIT_IMAGE_ROOT=/path/to/raw_images
All released JSON files use <STRCVIT_IMAGE_ROOT> as a placeholder for image paths.
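
To illustrate what the placeholder means in practice, here is a minimal sketch that substitutes a real image root into a loaded JSON structure. The record schema is not documented here, so the function walks every string value instead of assuming a particular key; `resolve_placeholders` is a hypothetical helper, and the repository's own scripts/configure_paths.py may already cover this step.

```python
PLACEHOLDER = "<STRCVIT_IMAGE_ROOT>"


def resolve_placeholders(obj, image_root):
    """Recursively replace the image-root placeholder in a JSON-like object.

    Strings are rewritten; lists and dicts are traversed; every other
    value (numbers, booleans, None) is returned unchanged.
    """
    if isinstance(obj, str):
        return obj.replace(PLACEHOLDER, image_root)
    if isinstance(obj, list):
        return [resolve_placeholders(item, image_root) for item in obj]
    if isinstance(obj, dict):
        return {key: resolve_placeholders(value, image_root)
                for key, value in obj.items()}
    return obj
```

For example, a record loaded with `json.load` can be passed through `resolve_placeholders(record, os.environ["STRCVIT_IMAGE_ROOT"])` before any image is opened.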

| Dataset | Expected image path under <STRCVIT_IMAGE_ROOT> | Download links |
|---|---|---|
| AD / DriveLM | AD/images/drivelm/stitch/ | DriveLM Hugging Face, DriveLM GitHub |
| COCO 2014 | COCO2014/train2014/, COCO2014/val2014/ | train2014, val2014, test2015 |
| VQAv2 | COCO2014/val2014/ | VQAv2 download page, COCO val2014 |
| Grounding / RefCOCO-style | COCO2014/train2014/ | COCO train2014, RefCOCO, RefCOCO+, RefCOCOg |
| GQA | GQA/images/ | GQA images.zip, GQA download page |
| ImageNet / ILSVRC2012 | ImageNet_withlabel/val/, ImageNet_withlabel/train/ | ImageNet ILSVRC2012 download page |
| Places365 | Places/val_256/, Places/train_256/ | train_256_places365standard.tar, val_256.tar, filelist |
| RSVQA-HR / Remote Sensing VQA | RS/images/ | RSVQA website |
| FinVis / financial charts | Fin/images/train/, Fin/images/test/ | FinVis Hugging Face, FinVis-GPT GitHub |
| OCR-VQA | OCR-VQA/images/ | OCR-VQA website, OCR-VQA images |
| TextCaps | Textcaps/ | train/val images, test images, TextCaps page |
| ChartQA | ChartQA/train/, ChartQA/test/ | HuggingFaceM4/ChartQA, ChartQA GitHub |

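
Before training, it can help to confirm that the image root matches the layout in the table. The sketch below copies the expected directories from the table and reports the ones that are missing; `missing_image_dirs` is a hypothetical helper, and it checks only that each directory exists, not that its contents are complete.

```python
from pathlib import Path

# Expected directories, relative to STRCVIT_IMAGE_ROOT, taken from the table.
EXPECTED_DIRS = [
    "AD/images/drivelm/stitch",
    "COCO2014/train2014",
    "COCO2014/val2014",
    "GQA/images",
    "ImageNet_withlabel/val",
    "ImageNet_withlabel/train",
    "Places/val_256",
    "Places/train_256",
    "RS/images",
    "Fin/images/train",
    "Fin/images/test",
    "OCR-VQA/images",
    "Textcaps",
    "ChartQA/train",
    "ChartQA/test",
]


def missing_image_dirs(image_root):
    """Return the expected image directories that do not exist under image_root."""
    root = Path(image_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

An empty result from `missing_image_dirs(os.environ["STRCVIT_IMAGE_ROOT"])` means every directory in the table is in place.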
Download the base models before running training:
export MODEL_ROOT=/path/to/models
Configure local placeholders before running scripts:
- <MODEL_ROOT>: base model directory
- <STRCVIT_DATASET_DIR>: released StrCVIT JSON directory
- <STRCVIT_IMAGE_ROOT>: image root for the original datasets
- <STRCVIT_METHODS_DIR>: this repository directory
- <CUDA_HOME>: CUDA toolkit directory used by DeepSpeed; scripts/configure_paths.py detects it automatically from the installed nvidia-cuda-nvcc package

python scripts/configure_paths.py
Run LoRA fine-tuning baselines:
bash scripts/Train_intervl/internvl3_5_8b_LoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_LoRA/train_all.sh
bash scripts/Train_Gemma/LoRA/train_all.sh
Run MoELoRA baselines:
bash scripts/Train_intervl/internvl3_5_8b_MoELoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_MoELoRA/train_all.sh
bash scripts/Train_Gemma/MoELoRA/train_all.sh
Run EWC baselines:
bash scripts/Train_intervl/internvl3_5_8b_EWC/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_EWC/train_all.sh
bash scripts/Train_Gemma/EWC/train_all.sh
Run the SMoLoRA baseline:
bash scripts/Train_intervl/internvl3_5_8b_SMoLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_SMoLoRA/train_all.sh
bash scripts/Train_Gemma/SMoLoRA/train_all.sh
Run the main continual StrLoRA experiments:
bash scripts/Train_intervl/internvl3_5_8b_StrLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_StrLoRA/train_all.sh
bash scripts/Train_Gemma/StrLoRA/train_all.sh
The aggregate scripts iterate from data_001 to data_025. The first task uses train_start.sh; later tasks use train_strcvit.sh and load the previous task checkpoint.
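
The chaining described above can be sketched as a plan that pairs each task with its script and the checkpoint it resumes from. This is only an illustration of the control flow, not the contents of train_all.sh: the checkpoint layout follows the checkpoints/StrCVIT/data_NNN paths used by this repository, while the dictionary keys and the `build_training_plan` name are hypothetical, and how the real scripts receive the checkpoint path is not shown here.

```python
def build_training_plan(methods_dir, first=1, last=25):
    """Build the ordered task list for one single-pass run over the stream.

    The first task starts from the base model (train_start.sh); each later
    task uses train_strcvit.sh and resumes from the previous task's checkpoint.
    """
    plan = []
    prev_ckpt = None
    for i in range(first, last + 1):
        task = f"data_{i:03d}"
        script = "train_start.sh" if prev_ckpt is None else "train_strcvit.sh"
        output_dir = f"{methods_dir}/checkpoints/StrCVIT/{task}"
        plan.append({
            "task": task,
            "script": script,
            "resume_from": prev_ckpt,  # None for the first task
            "output_dir": output_dir,
        })
        prev_ckpt = output_dir
    return plan
```

Because each entry depends on the previous task's output directory, the tasks must run strictly in order; a failed task has to be rerun before the later ones can proceed.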
The aggregate training scripts run evaluation automatically after each task. Standalone evaluation scripts are provided under:
scripts/Eval_internvl_proxy/
scripts/Eval_Gemma_proxy/
Training checkpoints are written under:
<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_001/
...
<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_025/
Evaluation outputs are written under:
<STRCVIT_METHODS_DIR>/results/StrCVIT/
- Code: Apache License 2.0, see LICENSE.
- Released StrCVIT instruction files: CC BY-NC 4.0, see the Hugging Face dataset repository.
- Raw images, original annotations, and base model weights are not included and remain governed by their original dataset/model licenses and access terms.
This repository adapts the training framework code from ms-swift and builds on Hugging Face PEFT and the public model/dataset resources listed above. We thank the authors and maintainers for their contributions.
