StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Streaming Continual Visual Instruction Tuning (StrCVIT) is a realistic continual learning setting for multimodal large language models (MLLMs) in which data arrives as a single-pass stream of interleaved, dynamically evolving tasks.

Install

Clone this repository and navigate to the StrCVIT folder.

git clone https://github.com/chanceche/StrCVIT.git
cd StrCVIT

Create the environment.

conda create -n strcvit python=3.10 -y
conda activate strcvit

pip install --upgrade pip
pip install -e ".[all]"

Set PYTHONPATH.

export PYTHONPATH="$PWD:${PYTHONPATH:-}"

Dataset

The released StrCVIT instruction files are hosted on Hugging Face: chanceche/StrCVIT_dataset.

https://huggingface.co/datasets/chanceche/StrCVIT_dataset

Download them into the repository root:

huggingface-cli download chanceche/StrCVIT_dataset \
  --repo-type dataset \
  --local-dir StrCVIT_dataset

export STRCVIT_DATASET_DIR="$PWD/StrCVIT_dataset"

The Hugging Face dataset repository contains:

StrCVIT_dataset/
  train/
    data_001.json ... data_025.json
    record.json
  test/
    AD/
    ChartQA/
    Fin/
    GQA/
    Grounding/
    ImageNet/
    OCRVQA/
    Places365/
    RS/
    TextCaps/
    VQAv2/
  manifest.json
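After downloading, it can be useful to confirm that the local copy matches the layout above. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the training files are numbered continuously with three digits from data_001.json to data_025.json.

```python
import os
from pathlib import Path

# Test subset folder names, copied from the directory listing above.
TEST_SUBSETS = [
    "AD", "ChartQA", "Fin", "GQA", "Grounding", "ImageNet",
    "OCRVQA", "Places365", "RS", "TextCaps", "VQAv2",
]


def expected_entries(num_tasks: int = 25) -> list[str]:
    """Return the relative paths that the listing above says should exist.

    Assumption: training files are zero-padded and contiguous
    (data_001.json, data_002.json, ..., data_025.json).
    """
    entries = [f"train/data_{i:03d}.json" for i in range(1, num_tasks + 1)]
    entries.append("train/record.json")
    entries.extend(f"test/{name}" for name in TEST_SUBSETS)
    entries.append("manifest.json")
    return entries


def missing_entries(dataset_dir: str) -> list[str]:
    """Return the expected relative paths that are absent under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in expected_entries() if not (root / rel).exists()]


if __name__ == "__main__":
    # Falls back to the download directory used in the command above.
    dataset_dir = os.environ.get("STRCVIT_DATASET_DIR", "StrCVIT_dataset")
    for rel in missing_entries(dataset_dir):
        print(f"missing: {rel}")
```

An empty output means every file and folder named in the listing was found. The check looks only at names, not at file contents.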

The instruction files do not include raw images. Download the raw image datasets listed below, organize them under a single image root, and set:

export STRCVIT_IMAGE_ROOT=/path/to/raw_images

All released JSON files use <STRCVIT_IMAGE_ROOT> as a placeholder for image paths.
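To illustrate what the placeholder means, the sketch below substitutes a concrete image root into a loaded JSON structure. It is an assumption-laden example rather than the repository's own mechanism: the JSON schema is not described here, so the function simply rewrites every string value it finds, and the sample record with its "image" and "id" keys is hypothetical.

```python
import json
import os
from typing import Any

PLACEHOLDER = "<STRCVIT_IMAGE_ROOT>"


def resolve_placeholders(node: Any, image_root: str) -> Any:
    """Recursively replace the placeholder in every string of a JSON value."""
    if isinstance(node, str):
        return node.replace(PLACEHOLDER, image_root)
    if isinstance(node, list):
        return [resolve_placeholders(item, image_root) for item in node]
    if isinstance(node, dict):
        return {key: resolve_placeholders(value, image_root)
                for key, value in node.items()}
    # Numbers, booleans, and null are returned unchanged.
    return node


if __name__ == "__main__":
    # Hypothetical record; the real field names may differ.
    sample = json.loads(
        '{"image": "<STRCVIT_IMAGE_ROOT>/GQA/images/example.jpg", "id": 3}'
    )
    image_root = os.environ.get("STRCVIT_IMAGE_ROOT", "/path/to/raw_images")
    print(resolve_placeholders(sample, image_root))
```

The function returns a new structure and leaves the input untouched, so the released files can stay as downloaded.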

Raw Image Downloads

| Dataset | Expected image path under `<STRCVIT_IMAGE_ROOT>` | Download links |
| --- | --- | --- |
| AD / DriveLM | `AD/images/drivelm/stitch/` | DriveLM Hugging Face, DriveLM GitHub |
| COCO 2014 | `COCO2014/train2014/`, `COCO2014/val2014/` | train2014, val2014, test2015 |
| VQAv2 | `COCO2014/val2014/` | VQAv2 download page, COCO val2014 |
| Grounding / RefCOCO-style | `COCO2014/train2014/` | COCO train2014, RefCOCO, RefCOCO+, RefCOCOg |
| GQA | `GQA/images/` | GQA images.zip, GQA download page |
| ImageNet / ILSVRC2012 | `ImageNet_withlabel/val/`, `ImageNet_withlabel/train/` | ImageNet ILSVRC2012 download page |
| Places365 | `Places/val_256/`, `Places/train_256/` | train_256_places365standard.tar, val_256.tar, filelist |
| RSVQA-HR / Remote Sensing VQA | `RS/images/` | RSVQA website |
| FinVis / financial charts | `Fin/images/train/`, `Fin/images/test/` | FinVis Hugging Face, FinVis-GPT GitHub |
| OCR-VQA | `OCR-VQA/images/` | OCR-VQA website, OCR-VQA images |
| TextCaps | `Textcaps/` | train/val images, test images, TextCaps page |
| ChartQA | `ChartQA/train/`, `ChartQA/test/` | HuggingFaceM4/ChartQA, ChartQA GitHub |
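Once the images are in place, a quick directory check can catch a misnamed folder before training starts. The sketch below is illustrative only: the helper name is hypothetical, the paths are copied from the table above, and it checks that each directory exists without inspecting the images inside.

```python
import os
from pathlib import Path

# Relative directories copied from the table above.
# VQAv2 and Grounding reuse the COCO 2014 folders, so they have no own entry.
EXPECTED_IMAGE_DIRS = {
    "AD / DriveLM": ["AD/images/drivelm/stitch"],
    "COCO 2014": ["COCO2014/train2014", "COCO2014/val2014"],
    "GQA": ["GQA/images"],
    "ImageNet / ILSVRC2012": ["ImageNet_withlabel/val", "ImageNet_withlabel/train"],
    "Places365": ["Places/val_256", "Places/train_256"],
    "RSVQA-HR": ["RS/images"],
    "FinVis": ["Fin/images/train", "Fin/images/test"],
    "OCR-VQA": ["OCR-VQA/images"],
    "TextCaps": ["Textcaps"],
    "ChartQA": ["ChartQA/train", "ChartQA/test"],
}


def missing_image_dirs(image_root: str) -> dict[str, list[str]]:
    """Map each dataset to the expected directories not found under image_root."""
    root = Path(image_root)
    report: dict[str, list[str]] = {}
    for dataset, rel_dirs in EXPECTED_IMAGE_DIRS.items():
        missing = [rel for rel in rel_dirs if not (root / rel).is_dir()]
        if missing:
            report[dataset] = missing
    return report


if __name__ == "__main__":
    image_root = os.environ.get("STRCVIT_IMAGE_ROOT", "/path/to/raw_images")
    for dataset, dirs in missing_image_dirs(image_root).items():
        print(f"{dataset}: missing {', '.join(dirs)}")
```

Directory names are matched exactly as written in the table, including case (for example `Textcaps/` rather than `TextCaps/`).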

Model Preparation

Download the base models before running training, then point MODEL_ROOT at the directory that contains them:

export MODEL_ROOT=/path/to/models

Configure local placeholders before running scripts:

<MODEL_ROOT>: base model directory
<STRCVIT_DATASET_DIR>: released StrCVIT JSON directory
<STRCVIT_IMAGE_ROOT>: image root for the original datasets
<STRCVIT_METHODS_DIR>: this repository directory
<CUDA_HOME>: CUDA toolkit directory used by DeepSpeed; scripts/configure_paths.py detects it automatically from the installed nvidia-cuda-nvcc package

python scripts/configure_paths.py

Training

Run LoRA fine-tuning baselines:

bash scripts/Train_intervl/internvl3_5_8b_LoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_LoRA/train_all.sh
bash scripts/Train_Gemma/LoRA/train_all.sh

Run MoELoRA baselines:

bash scripts/Train_intervl/internvl3_5_8b_MoELoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_MoELoRA/train_all.sh
bash scripts/Train_Gemma/MoELoRA/train_all.sh

Run EWC baselines:

bash scripts/Train_intervl/internvl3_5_8b_EWC/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_EWC/train_all.sh
bash scripts/Train_Gemma/EWC/train_all.sh

Run SMoLoRA baselines:

bash scripts/Train_intervl/internvl3_5_8b_SMoLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_SMoLoRA/train_all.sh
bash scripts/Train_Gemma/SMoLoRA/train_all.sh

Run the main continual StrLoRA experiments:

bash scripts/Train_intervl/internvl3_5_8b_StrLoRA/train_all.sh
bash scripts/Train_intervl/internvl3_5_4b_StrLoRA/train_all.sh
bash scripts/Train_Gemma/StrLoRA/train_all.sh

Each aggregate train_all.sh script iterates over the tasks in order, from data_001 to data_025. The first task uses train_start.sh; each later task uses train_strcvit.sh and loads the checkpoint from the previous task.
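The task ordering described above can be sketched as a small planning function. This does not reproduce train_all.sh, and how the real scripts receive their arguments is not shown here. The `TaskStep` and `build_plan` names are hypothetical, and the checkpoint layout (one directory per task under checkpoints/StrCVIT) is taken from the Evaluation section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskStep:
    task: str                           # e.g. "data_001"
    script: str                         # per-task script named in this README
    previous_checkpoint: Optional[str]  # None for the first task


def build_plan(checkpoint_root: str = "checkpoints/StrCVIT",
               num_tasks: int = 25) -> list[TaskStep]:
    """List the tasks in stream order with the script each one would use."""
    plan: list[TaskStep] = []
    previous: Optional[str] = None
    for i in range(1, num_tasks + 1):
        task = f"data_{i:03d}"
        # Only the first task starts from the base model.
        script = "train_start.sh" if i == 1 else "train_strcvit.sh"
        plan.append(TaskStep(task, script, previous))
        # Assumption: each task writes its checkpoint to <root>/<task>.
        previous = f"{checkpoint_root}/{task}"
    return plan


if __name__ == "__main__":
    for step in build_plan()[:3]:
        print(step.task, step.script, step.previous_checkpoint)
```

Because every task after the first depends on the checkpoint before it, the tasks cannot be reordered or run in parallel within one experiment.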

Evaluation

The aggregate training scripts run evaluation automatically after each task. Standalone evaluation scripts are provided under:

scripts/Eval_internvl_proxy/
scripts/Eval_Gemma_proxy/

Training checkpoints are written under:

<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_001/
...
<STRCVIT_METHODS_DIR>/checkpoints/StrCVIT/data_025/

Evaluation outputs are written under:

<STRCVIT_METHODS_DIR>/results/StrCVIT/

License

Code: Apache License 2.0, see LICENSE.
Released StrCVIT instruction files: CC BY-NC 4.0, see the Hugging Face dataset repository.
Raw images, original annotations, and base model weights are not included and remain governed by their original dataset/model licenses and access terms.

Acknowledgement

This repository adapts the training framework code from ms-swift and builds on Hugging Face PEFT and the public model/dataset resources listed above. We thank the authors and maintainers for their contributions.
