TinyAlign is a retrieval-augmented lightweight vision-language modeling project built on top of TinyLLaVA Factory.
This repository contains the official code for TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks, accepted to Findings of ACL 2026.
Compared with the original TinyLLaVA codebase, this project mainly adds:
- memory bank construction for retrieval-augmented alignment
- memory retrieval during training and inference
- an extra retrieval connector (`connector2`) to inject retrieved latent features into the multimodal sequence
- `<rag>`-aware prompt/token handling in the data template layer
The goal is to improve lightweight VLM alignment by retrieving relevant latent context from an external memory bank instead of relying only on the frozen vision encoder and language model.
- `tinyllava/model/modeling_tinyllava.py`: core retrieval-augmented multimodal forward path
- `tinyllava/model/configuration_tinyllava.py`: model config, including retrieval parameters
- `tinyllava/data/template/base.py`: `<image>` and `<rag>` token processing
- `tinyllava/train/train.py`: training entry
- `tinyllava/training_recipe/base.py`: checkpoint save/load, including `connector2`
- `docs/CODEBASE_OVERVIEW.md`: concise code walkthrough for this refactored release
```bash
conda create -n tinyalign python=3.10 -y
conda activate tinyalign
pip install --upgrade pip
pip install -e .
```

If you use FlashAttention in your environment:

```bash
pip install flash-attn --no-build-isolation
```

The retrieval path expects a memory directory containing:

- a FAISS index file, default: `Merged_faiss.index`
- a serialized value store, default: `Merged_LLaVA_Dataset_Memory.pt`
You can now configure these paths through arguments instead of editing source code:
- `--retrieval_memory_dir`
- `--retrieval_index_file`
- `--retrieval_value_file`
- `--top_rag`
- `--retrieval_text_start`
- `--retrieval_alpha`
This is the core addition of TinyAlign.
The memory bank is not a plain text knowledge base. Each memory item is a paired (key, value) built from image-caption supervision data:
- key: a compressed multimodal query vector
- value: a compact latent representation produced by a Perceiver-style encoder
In the current codebase, the reference implementation is scattered across `demo.py`, `demo1.py`, and `build_memory.ipynb`.
For each image-text pair in the pretraining set:
- Encode the image with the TinyLLaVA vision tower and the original multimodal connector.
- Tokenize the paired caption text and obtain text token embeddings from the language model embedding table.
- Normalize image features and text features, then fuse them with a weighted mixture: `multimodal_features = concat(alpha * image_features, (1 - alpha) * text_features)`
- Compute self-similarity over the fused multimodal sequence.
- Average the attention map and use it to compress the multimodal sequence into a single query vector.
- Feed the original image and caption into a Perceiver multimodal encoder.
- Use the Perceiver latent output as the retrieval value.
- Store the pair:
- key: compressed multimodal vector
- value: Perceiver latent tensor
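The key-construction steps above can be sketched in NumPy. This is a minimal illustration, not the exact implementation: the feature shapes, `alpha=0.5`, and the softmax weighting over the averaged attention map are assumptions made for the sketch.

```python
import numpy as np

def build_key(image_feats, text_feats, alpha=0.5):
    """Compress a fused multimodal sequence into a single query vector."""
    # L2-normalize each token embedding.
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    # Weighted mixture, concatenated along the sequence dimension.
    fused = np.concatenate([alpha * img, (1 - alpha) * txt], axis=0)  # (L, d)
    # Self-similarity over the fused multimodal sequence.
    attn = fused @ fused.T                                            # (L, L)
    # Average the attention map into per-token weights (softmax assumed here)...
    weights = attn.mean(axis=0)
    weights = np.exp(weights - weights.max())
    weights /= weights.sum()
    # ...and compress the sequence into one query vector.
    key = weights @ fused                                             # (d,)
    return key / np.linalg.norm(key)

key = build_key(np.random.randn(16, 96), np.random.randn(8, 96))
```

The returned key is unit-norm, which makes inner-product search in FAISS equivalent to cosine similarity.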
After collecting all pairs:
- all keys are added into a FAISS index for nearest-neighbor search
- all values are stored in a tensor file and aligned with FAISS ids
At runtime, the model:
- rebuilds a query vector from the current input image and text
- retrieves the top-`k` nearest memory items from FAISS
- concatenates the retrieved values
- projects them through `connector2`
- injects the projected retrieval features into the multimodal token sequence
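The retrieval step can be sketched as a NumPy stand-in for the FAISS lookup. The shapes, the helper name `retrieve_values`, and the `axis=1` concatenation are illustrative assumptions:

```python
import numpy as np

def retrieve_values(query, keys, values, top_k=4):
    """NumPy stand-in for the FAISS top-k lookup over the memory bank."""
    q = query / np.linalg.norm(query)
    scores = keys @ q                       # keys assumed pre-normalized
    ids = np.argsort(-scores)[:top_k]       # top-k memory ids
    # Each value is a (32, 96) Perceiver latent; concatenate retrieved
    # items along the latent width dimension.
    return np.concatenate([values[i] for i in ids], axis=1)

keys = np.random.randn(100, 96)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = np.random.randn(100, 32, 96)
retrieved = retrieve_values(keys[7], keys, values, top_k=4)   # (32, 384)
```

The concatenated block is what gets handed to `connector2` for projection into the language model hidden size.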
In the current code:

- the Perceiver value is typically a latent tensor shaped like `32 x 96`
- multiple retrieved items are concatenated along the latent width dimension
- `connector2` maps the retrieved latent features into the language model hidden size
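A projection in the spirit of `connector2` can be sketched as a two-layer GELU MLP (suggested by the `rag2x_gelu` connector type). The class name, weight initialization, and hidden size of 2048 are hypothetical, not taken from the repository:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class RagConnector:
    """Hypothetical two-layer GELU MLP mapping retrieved latents
    (tokens, 96) into the language model hidden size."""
    def __init__(self, in_dim=96, hidden_size=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden_size)) * 0.02
        self.b1 = np.zeros(hidden_size)
        self.w2 = rng.standard_normal((hidden_size, hidden_size)) * 0.02
        self.b2 = np.zeros(hidden_size)

    def __call__(self, x):                  # x: (num_tokens, in_dim)
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

out = RagConnector()(np.random.randn(32, 96))   # (32, 2048)
```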
- `demo.py`: end-to-end prototype for constructing compressed keys and latent values
- `demo1.py`: smaller-scale memory construction example
- `build_memory.ipynb`: notebook experiments for building, saving, merging, and querying the memory bank
The repository now includes a standalone script:
`scripts/build_memory_bank.py`
It does two jobs:
- build shard files containing `(key, value)` pairs
- merge the shards into `Merged_faiss.index` and `Merged_LLaVA_Dataset_Memory.pt`
By default, the script assumes a LLaVA-style JSON list where each sample contains:
```json
{
  "image": "relative/path/to/image.jpg",
  "conversations": [
    {"from": "human", "value": "..."},
    {"from": "gpt", "value": "caption or target text"}
  ]
}
```

If your caption is stored in another field, pass `--caption-field`.
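The default caption lookup can be sketched as follows; `extract_caption` is an illustrative helper, and the real script's internals may differ:

```python
def extract_caption(sample, caption_field=None):
    """Pull the target text from a LLaVA-style sample."""
    if caption_field is not None:
        # Honor an explicit override, mirroring --caption-field.
        return sample[caption_field]
    # Default: the first "gpt" turn of the conversation is the target text.
    for turn in sample.get("conversations", []):
        if turn["from"] == "gpt":
            return turn["value"]
    raise KeyError("no caption found in sample")

sample = {
    "image": "relative/path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "..."},
        {"from": "gpt", "value": "caption or target text"},
    ],
}
caption = extract_caption(sample)
```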
Build shards and merge immediately:
```bash
python scripts/build_memory_bank.py \
  --model-path /path/to/tinyllava_base_checkpoint \
  --dataset-json /path/to/pretrain.json \
  --image-root /path/to/images \
  --perceiver-tokenizer /path/to/perceiver_tokenizer \
  --output-dir /path/to/memory_bank \
  --save-every 5000 \
  --merge-after-build
```

If you already built shards and only want to merge them:

```bash
python scripts/build_memory_bank.py \
  --output-dir /path/to/memory_bank \
  --merge-only
```

After a successful run, `output-dir` contains:
```
memory_bank/
  shards/
    memory_shard_00000.pt
    memory_shard_00001.pt
    ...
  Merged_faiss.index
  Merged_LLaVA_Dataset_Memory.pt
```
The final merged tensor file stores:
- keys: normalized compressed multimodal query vectors
- values: Perceiver latent tensors aligned with FAISS ids
Once the memory bank is built, point the model to it with:
```bash
--retrieval_memory_dir /path/to/memory_bank
```

Optionally override filenames if you changed them:

```bash
--retrieval_index_file Merged_faiss.index
--retrieval_value_file Merged_LLaVA_Dataset_Memory.pt
```

Example command structure:
```bash
python -m tinyllava.train.train \
  --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --vision_tower google/siglip-so400m-patch14-384 \
  --connector_type mlp2x_gelu \
  --connector2_type rag2x_gelu \
  --data_path /path/to/data.json \
  --image_folder /path/to/images \
  --retrieval_memory_dir /path/to/memory_bank \
  --output_dir /path/to/output
```

If you initialize from a TinyLLaVA checkpoint and keep `connector2` separately, use:

```bash
--pretrained_model_path /path/to/base_checkpoint \
--pretrained_connector2_path /path/to/connector2
```

The main inference entry is:
```bash
python -m tinyllava.eval.run_tiny_llava \
  --model-path /path/to/model \
  --image-file /path/to/image.jpg \
  --query "Describe the image."
```

Make sure the released model config points to the correct retrieval memory directory before inference.
- This repository keeps the original package name `tinyllava` for code compatibility.
- The refactor removes local debug hooks and hard-coded absolute paths that were tied to the author's machine.
This codebase is built on top of TinyLLaVA Factory. Credit goes to the original authors for the base multimodal training framework.
```bibtex
@article{hu2025tinyalign,
  title={TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks},
  author={Hu, Yuanze and Fan, Zhaoxin and Wang, Xinyu and Li, Gen and Qiu, Ye and Yang, Zhichao and Wu, Wenjun and Wu, Kejian and Sun, Yifan and Deng, Xiaotie and Dong, Jin},
  journal={arXiv preprint arXiv:2505.12884},
  year={2025}
}
```