StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Streaming Continual Visual Instruction Tuning (StrCVIT) is a new and realistic continual learning setting for MLLMs that models data as a single-pass stream of interleaved and dynamically evolving tasks.

[Figure: StrCVIT overview]

Install

  1. Clone this repository and navigate to the StrCVIT folder.
git clone https://github.com/chanceche/StrCVIT.git
cd StrCVIT
  2. Create the environment.
conda create -n strcvit python=3.10 -y
conda activate strcvit

pip install --upgrade pip
pip install -e ".[all]"
  3. Set PYTHONPATH.
export PYTHONPATH="$PWD:${PYTHONPATH:-}"

Dataset

The released StrCVIT instruction files are hosted on Hugging Face: chanceche/StrCVIT_dataset.

https://huggingface.co/datasets/chanceche/StrCVIT_dataset

Download them into the repository root:

huggingface-cli download chanceche/StrCVIT_dataset \
  --repo-type dataset \
  --local-dir StrCVIT_dataset

export STRCVIT_DATASET_DIR="$PWD/StrCVIT_dataset"

The Hugging Face dataset repository contains:

StrCVIT_dataset/
  train/
    data_001.json ... data_025.json
    record.json
  test/
    AD/
    ChartQA/
    Fin/
    GQA/
    Grounding/
    ImageNet/
    OCRVQA/
    Places365/
    RS/
    TextCaps/
    VQAv2/
  manifest.json
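
After the download finishes, a quick layout check can catch an incomplete transfer. The sketch below is not part of the repository; `expected_entries` and `missing_entries` are hypothetical helper names, and it assumes the training files are numbered contiguously from data_001.json to data_025.json as the listing suggests.

```python
from pathlib import Path

# Test-set folder names copied from the directory listing above.
TEST_TASKS = ["AD", "ChartQA", "Fin", "GQA", "Grounding", "ImageNet",
              "OCRVQA", "Places365", "RS", "TextCaps", "VQAv2"]


def expected_entries():
    """Return the relative paths that the listing says should exist."""
    # Assumption: files are zero-padded and contiguous, data_001 ... data_025.
    entries = [f"train/data_{i:03d}.json" for i in range(1, 26)]
    entries.append("train/record.json")
    entries.extend(f"test/{name}" for name in TEST_TASKS)
    entries.append("manifest.json")
    return entries


def missing_entries(dataset_dir):
    """List the expected relative paths that are absent under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in expected_entries() if not (root / rel).exists()]
```

Calling `missing_entries` on the directory stored in STRCVIT_DATASET_DIR should return an empty list when the download is complete.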

The instruction files do not include raw images. Download the raw image datasets listed below, organize them under a single image root, and set:

export STRCVIT_IMAGE_ROOT=/path/to/raw_images

All released JSON files use <STRCVIT_IMAGE_ROOT> as a placeholder for image paths.
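
For inspecting the released files outside the training scripts, the placeholder can be substituted at load time. This is a minimal sketch, not the repository's loader: `resolve_image_root` and `load_resolved` are hypothetical names, and because the record schema is not described here, the code simply walks every string value and assumes the placeholder appears literally as `<STRCVIT_IMAGE_ROOT>`.

```python
import json
import os

PLACEHOLDER = "<STRCVIT_IMAGE_ROOT>"


def resolve_image_root(obj, image_root):
    """Return a copy of obj with the placeholder replaced in every string value."""
    if isinstance(obj, str):
        return obj.replace(PLACEHOLDER, image_root)
    if isinstance(obj, list):
        return [resolve_image_root(item, image_root) for item in obj]
    if isinstance(obj, dict):
        return {key: resolve_image_root(value, image_root)
                for key, value in obj.items()}
    return obj  # numbers, booleans, and None pass through unchanged


def load_resolved(json_path, image_root=None):
    """Load a JSON file and substitute the image root (defaults to the env var)."""
    if image_root is None:
        image_root = os.environ["STRCVIT_IMAGE_ROOT"]
    with open(json_path, encoding="utf-8") as handle:
        return resolve_image_root(json.load(handle), image_root)
```

The function builds new containers rather than editing in place, so the loaded data can be resolved against different image roots without reloading the file.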

Raw Image Downloads

| Dataset | Expected image path under <STRCVIT_IMAGE_ROOT> | Download links |
| --- | --- | --- |
| AD / DriveLM | AD/images/drivelm/stitch/ | DriveLM Hugging Face, DriveLM GitHub |
| COCO 2014 | COCO2014/train2014/, COCO2014/val2014/ | train2014, val2014, test2015 |
| VQAv2 | COCO2014/val2014/ | VQAv2 download page, COCO val2014 |
| Grounding / RefCOCO-style | COCO2014/train2014/ | COCO train2014, RefCOCO, RefCOCO+, RefCOCOg |
| GQA | GQA/images/ | GQA images.zip, GQA download page |
| ImageNet / ILSVRC2012 | ImageNet_withlabel/val/, ImageNet_withlabel/train/ | ImageNet ILSVRC2012 download page |
| Places365 | Places/val_256/, Places/train_256/ | train_256_places365standard.tar, val_256.tar, filelist |
| RSVQA-HR / Remote Sensing VQA | RS/images/ | RSVQA website |
| FinVis / financial charts | Fin/images/train/, Fin/images/test/ | FinVis Hugging Face, FinVis-GPT GitHub |
| OCR-VQA | OCR-VQA/images/ | OCR-VQA website, OCR-VQA images |
| TextCaps | Textcaps/ | train/val images, test images, TextCaps page |
| ChartQA | ChartQA/train/, ChartQA/test/ | HuggingFaceM4/ChartQA, ChartQA GitHub |
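
Before training, it may help to confirm that the image root matches the table. The following sketch is an illustration rather than a repository utility; `EXPECTED_IMAGE_DIRS` and `missing_image_dirs` are hypothetical names, and the relative paths are copied from the table above. It only checks that the directories exist, not that they contain every image.

```python
from pathlib import Path

# Relative directories copied from the table above. VQAv2 and Grounding reuse
# the COCO 2014 folders, so they have no separate entries here.
EXPECTED_IMAGE_DIRS = {
    "AD / DriveLM": ["AD/images/drivelm/stitch"],
    "COCO 2014": ["COCO2014/train2014", "COCO2014/val2014"],
    "GQA": ["GQA/images"],
    "ImageNet": ["ImageNet_withlabel/val", "ImageNet_withlabel/train"],
    "Places365": ["Places/val_256", "Places/train_256"],
    "RSVQA-HR": ["RS/images"],
    "FinVis": ["Fin/images/train", "Fin/images/test"],
    "OCR-VQA": ["OCR-VQA/images"],
    "TextCaps": ["Textcaps"],
    "ChartQA": ["ChartQA/train", "ChartQA/test"],
}


def missing_image_dirs(image_root):
    """Map each dataset to the expected directories not found under image_root."""
    root = Path(image_root)
    report = {}
    for dataset, rel_dirs in EXPECTED_IMAGE_DIRS.items():
        absent = [rel for rel in rel_dirs if not (root / rel).is_dir()]
        if absent:
            report[dataset] = absent
    return report
```

An empty dictionary from `missing_image_dirs` means every directory named in the table is present under the given root.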

Model Preparation

Download the base models before running training, then point MODEL_ROOT at the directory that contains them:

export MODEL_ROOT=/path/to/models

Configure the local placeholders below before running any scripts, using scripts/configure_paths.py:

  • <MODEL_ROOT>: base model directory
  • <STRCVIT_DATASET_DIR>: released StrCVIT JSON directory
  • <STRCVIT_IMAGE_ROOT>: image root for the original datasets
  • <STRCVIT_METHODS_DIR>: this repository directory
  • <CUDA_HOME>: CUDA toolkit directory used by DeepSpeed; scripts/configure_paths.py detects it automatically from the installed nvidia-cuda-nvcc package
python scripts/configure_paths.py
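
To double-check the configuration step, one option is to scan the shell scripts for placeholders that are still present. This sketch assumes that scripts/configure_paths.py rewrites the literal `<NAME>` placeholders inside the script files, which this README does not state explicitly; `unresolved_placeholders` is a hypothetical helper and not part of the repository.

```python
import re
from pathlib import Path

# Placeholder names taken from the list above.
PLACEHOLDER_RE = re.compile(
    r"<(MODEL_ROOT|STRCVIT_DATASET_DIR|STRCVIT_IMAGE_ROOT|"
    r"STRCVIT_METHODS_DIR|CUDA_HOME)>"
)


def unresolved_placeholders(scripts_dir, pattern="*.sh"):
    """Return {file path: sorted placeholder names} for files that still contain them."""
    leftovers = {}
    for path in sorted(Path(scripts_dir).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        names = set(PLACEHOLDER_RE.findall(text))
        if names:
            leftovers[str(path)] = sorted(names)
    return leftovers
```

If the assumption holds, calling `unresolved_placeholders("scripts")` from the repository root should return an empty dictionary once configuration has succeeded.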

Training

Run LoRA fine-tuning baselines:

bash scripts/Train_intervl/internvl3_5_8b_LoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_LoRA/train_all.sh
bash scripts/Train_Gemma/LoRA/train_all.sh

Run MoELoRA baselines:

bash scripts/Train_intervl/internvl3_5_8b_MoELoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_MoELoRA/train_all.sh
bash scripts/Train_Gemma/MoELoRA/train_all.sh

Run EWC baselines:

bash scripts/Train_intervl/internvl3_5_8b_EWC/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_EWC/train_all.sh
bash scripts/Train_Gemma/EWC/train_all.sh

Run the SMoLoRA baseline:

bash scripts/Train_intervl/internvl3_5_8b_SMoLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_SMoLoRA/train_all.sh
bash scripts/Train_Gemma/SMoLoRA/train_all.sh

Run the main continual StrLoRA experiments:

bash scripts/Train_intervl/internvl3_5_8b_StrLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_StrLoRA/train_all.sh
bash scripts/Train_Gemma/StrLoRA/train_all.sh

The aggregate scripts iterate over the tasks data_001 through data_025 in order. The first task is trained with train_start.sh; each later task is trained with train_strcvit.sh and loads the checkpoint from the previous task.
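
The control flow described above can be sketched as a simple plan. This is an illustration of the ordering only, not the contents of train_all.sh: `plan_stream` is a hypothetical name, the per-task checkpoint directory layout is an assumption, and how each script actually receives the previous checkpoint is not specified here.

```python
from pathlib import Path

NUM_TASKS = 25  # data_001 ... data_025


def plan_stream(checkpoint_root, num_tasks=NUM_TASKS):
    """Return one (task, script, previous_checkpoint) tuple per task, in stream order."""
    root = Path(checkpoint_root)
    plan = []
    previous = None  # the first task has no earlier checkpoint to load
    for index in range(1, num_tasks + 1):
        task = f"data_{index:03d}"
        script = "train_start.sh" if previous is None else "train_strcvit.sh"
        plan.append((task, script, previous))
        # Assumption: each task writes its checkpoint to <root>/<task>/.
        previous = root / task
    return plan
```

For example, `plan_stream("checkpoints/StrCVIT")[1]` pairs data_002 with train_strcvit.sh and the data_001 checkpoint directory, reflecting the single-pass, strictly sequential nature of the stream.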

Evaluation

The aggregate training scripts run evaluation automatically after each task. Standalone evaluation scripts are provided under:

scripts/Eval_internvl_proxy/
scripts/Eval_Gemma_proxy/

Training checkpoints are written under:

<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_001/
...
<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_025/
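
Given this layout, a small helper can report how far a run has progressed. The sketch below uses a hypothetical function, `next_pending_task`, and treats the presence of a task directory as a sign that the task was reached; a directory can exist for a task that did not finish, so this is a rough indicator rather than a completion check.

```python
from pathlib import Path


def next_pending_task(checkpoint_root, num_tasks=25):
    """Return the first task name without a checkpoint directory, or None if all exist."""
    root = Path(checkpoint_root)
    for index in range(1, num_tasks + 1):
        task = f"data_{index:03d}"
        if not (root / task).is_dir():
            return task
    return None
```

Pass the checkpoints/StrCVIT directory under the repository root as `checkpoint_root`.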

Evaluation outputs are written under:

<STRCVIT_METHODS_DIR>/results/StrCVIT/

License

  • Code: Apache License 2.0, see LICENSE.
  • Released StrCVIT instruction files: CC BY-NC 4.0, see the Hugging Face dataset repository.
  • Raw images, original annotations, and base model weights are not included and remain governed by their original dataset/model licenses and access terms.

Acknowledgement

This repository adapts the training framework code from ms-swift and builds on Hugging Face PEFT and the public model/dataset resources listed above. We thank the authors and maintainers for their contributions.
