Official implementation of our paper "Visually-Augmented Language Modeling". Please cite our paper if you find this repository helpful in your research:
```bibtex
@article{valm,
  title={Visually-augmented language modeling},
  author={Wang, Weizhi and Dong, Li and Cheng, Hao and Song, Haoyu and Liu, Xiaodong and Yan, Xifeng and Gao, Jianfeng and Wei, Furu},
  journal={arXiv preprint arXiv:2205.10178},
  year={2022}
}
```
Create a virtual environment and run
```bash
bash setup.sh
```
The revised fairseq and other required packages will then be installed. We strongly recommend using a Python version >= 3.6 and <= 3.8 for stability.
- Preprocessing text training data:
```bash
bash myscripts/preprocess_valm_text.sh
```
The CC100 English corpus is available at CC100-EN. The sharding script is available at ./data/roberta-cc100-ori/sharded_data.py.
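The sharding step simply splits the large corpus into fixed-size pieces so that preprocessing and training can work on one shard at a time. Below is a minimal, illustrative sketch of that idea; the actual logic lives in sharded_data.py, and the file names and shard count here are placeholder assumptions.
```python
# Illustrative sketch only: split a large plain-text corpus into NUM_SHARDS files,
# distributing lines round-robin. See ./data/roberta-cc100-ori/sharded_data.py for
# the real implementation; paths and NUM_SHARDS below are placeholders.
NUM_SHARDS = 16
corpus_path = "cc100_en.txt"        # placeholder input corpus
out_template = "shard_{:02d}.txt"   # placeholder output naming

shards = [open(out_template.format(i), "w", encoding="utf-8") for i in range(NUM_SHARDS)]
with open(corpus_path, "r", encoding="utf-8") as fin:
    for line_id, line in enumerate(fin):
        shards[line_id % NUM_SHARDS].write(line)
for f in shards:
    f.close()
```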
- Preprocessing image data:
Please refer to LAION for downloading the image dataset used to build the image visual knowledge base.
- Extract image features, running the command once for each portion in 0-15:
```bash
for portion in {0..15}; do
    python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
        --image_data_path /multimodal/data/image/laion_all \
        --tar_id_start 0 --tar_id_end 20000 \
        --n_gpus 16 \
        --portion $portion
done
```
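Under the hood, this step encodes every LAION image with a CLIP image encoder and saves the embeddings shard by shard. A minimal sketch of that step is below; it assumes the openai CLIP package and a ViT-L/14 backbone (consistent with the 768-dimensional features used for the datastore later), and the image file names are placeholders.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: ViT-L/14, whose 768-d image embeddings match --dimension 768 below.
model, preprocess = clip.load("ViT-L/14", device=device)

image_paths = ["000048808.jpg", "000032573.jpg"]  # placeholder images from one LAION tar
batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths]).to(device)

with torch.no_grad():
    feats = model.encode_image(batch)                  # (num_images, 768)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize for cosine retrieval

torch.save(feats.half().cpu(), "img_features_12345.pt")  # one .pt file per tar id
```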
Once the image features are extracted, we can run a sanity check:
```bash
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
    --image_data_path /multimodal/data/image/laion_all \
    --verify
```
This retrieves from the first five shards, and the output should be:
```
imageRetriever.retrieve("A cute cat") # 000048808.jpg √
imageRetriever.retrieve("A cute dog") # 000032573.jpg √
```
The processed image features are stored at /mnt/multimodal/VaLM/image_features_raw. Each file is named like img_features_12345.pt, where 12345 is the id of the LAION tar file.
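To spot-check a single shard of features, you can load one of these files directly. The shape and dtype in the comment are assumptions based on the 768-dimensional CLIP features used throughout.
```python
import torch

feats = torch.load("/mnt/multimodal/VaLM/image_features_raw/img_features_12345.pt",
                   map_location="cpu")
print(feats.shape, feats.dtype)  # expected roughly (num_images_in_tar, 768), fp16 if saved in half precision
```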
- Constructing cached datastore of image features:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
    --image_data_path /multimodal/data/image/laion_all \
    --save_image_datastore --dstore_mmap $DSTORE_PATH --dstore_fp16 \
    --dstore_size 191504487
```
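The cached datastore is a flat, disk-backed array holding all image features, which is what --dstore_mmap, --dstore_fp16, and --dstore_size describe. A minimal sketch of reading such a memmap back is below; the "_keys.npy" suffix follows a knn-lm-style convention and is an assumption, not something the command above guarantees.
```python
import numpy as np

DSTORE_PATH = "./data/image_feature_datastore_200M"
DSTORE_SIZE = 191504487   # number of stored image features (--dstore_size above)
DIM = 768                 # CLIP feature dimension

# Assumption: keys are stored as "<dstore_mmap>_keys.npy" in fp16, knn-lm style.
keys = np.memmap(f"{DSTORE_PATH}_keys.npy", dtype=np.float16, mode="r",
                 shape=(DSTORE_SIZE, DIM))
print(keys[0][:8])  # first few dimensions of the first stored image feature
```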
- Training faiss index of cached datastore:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/train_datastore_gpu.py --dstore_size 191504486 \
    --dstore_mmap $DSTORE_PATH \
    --dstore_fp16 --dimension 768 --ncentroids 131072
```
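Training the index learns an inverted-file structure over the 768-dimensional features (here with 131072 centroids) so that retrieval does not have to scan the full datastore. The sketch below shows the general shape of such a step with a plain faiss IVF+PQ index; the index type, code size, sample size, and file names are assumptions rather than the exact settings of train_datastore_gpu.py.
```python
import faiss
import numpy as np

DSTORE_PATH = "./data/image_feature_datastore_200M"
DSTORE_SIZE, DIM, NCENTROIDS = 191504486, 768, 131072

# Assumption: same "_keys.npy" naming as in the memmap sketch above.
keys = np.memmap(f"{DSTORE_PATH}_keys.npy", dtype=np.float16, mode="r",
                 shape=(DSTORE_SIZE, DIM))

quantizer = faiss.IndexFlatL2(DIM)
index = faiss.IndexIVFPQ(quantizer, DIM, NCENTROIDS, 64, 8)  # 64-byte PQ codes (assumption)

# Train on a random sample of keys (faiss expects float32); with this many centroids
# faiss may warn that a larger training sample is preferable.
rng = np.random.default_rng(0)
sample = keys[rng.integers(0, DSTORE_SIZE, size=1_000_000)].astype(np.float32)
index.train(sample)

# Add all keys in chunks, then persist the index next to the datastore.
for start in range(0, DSTORE_SIZE, 500_000):
    index.add(keys[start:start + 500_000].astype(np.float32))
faiss.write_index(index, f"{DSTORE_PATH}_knn_index")
```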
- Verify retrieval with samples:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features \
    --image_data_path /multimodal/data/image/laion_all \
    --verify_retriever --dstore_mmap $DSTORE_PATH \
    --dstore_filename $DSTORE_PATH --dstore_fp16 \
    --dstore_size 191504486
```
- The demo retrieval results will be written to ./html/reports.html. Download the html folder to see the results.
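Retrieval is text-to-image: the query text is encoded with the CLIP text encoder and searched against the faiss index over image features. A minimal sketch of that query path, reusing the assumed index filename and normalization from the sketches above:
```python
import clip
import faiss
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

DSTORE_PATH = "./data/image_feature_datastore_200M"
index = faiss.read_index(f"{DSTORE_PATH}_knn_index")  # assumption: filename from the training sketch
index.nprobe = 32  # number of inverted lists to visit per query (tunable)

with torch.no_grad():
    q = model.encode_text(clip.tokenize(["A cute cat"]).to(device))
    q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()

distances, ids = index.search(q, 8)  # ids index into the cached image-feature datastore
print(ids[0])
```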
- Example training command on multiple data shards with 16 Tesla V100 GPUs:
```bash
bash myscripts/train_valm.sh
```
To train the text-only baseline GPT-BLIND, run:
```bash
bash myscripts/train_gpt_blind.sh
```
- Evaluate the trained checkpoint on object color reasoning:
```bash
python evaluation_scripts/verify_color_prediction.py --path /path/to/ckpt --model-overrides
```
- Evaluate the trained checkpoint on object size reasoning:
```bash
python evaluation_scripts/verify_size_reason.py --path /path/to/ckpt --model-overrides
```
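Both probes are zero-shot: the model scores candidate completions of a natural-language prompt and the highest-scoring candidate is taken as its prediction. The sketch below is a hypothetical illustration of that idea for color prediction; the prompt template, candidate list, and `sequence_logprob` scoring function are placeholders, not the exact logic of verify_color_prediction.py.
```python
# Hypothetical illustration of zero-shot color probing: rank candidate colors by the
# language model's log-probability of the completed prompt. `sequence_logprob` stands
# in for scoring a full sentence with the trained checkpoint.
CANDIDATE_COLORS = ["red", "orange", "yellow", "green", "blue",
                    "purple", "white", "black", "brown", "gray"]

def predict_color(obj: str, sequence_logprob) -> str:
    prompt = f"The color of the {obj} is"   # placeholder prompt template
    scores = {c: sequence_logprob(f"{prompt} {c}") for c in CANDIDATE_COLORS}
    return max(scores, key=scores.get)
```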
- Evaluate the trained checkpoint on language modeling (a sketch of the LAMBADA last-word scoring follows below):
```bash
fairseq-eval-lm ./data/wikitext-103/ --batch-size 4 --sample-break-mode eos --path /path/to/ckpt
fairseq-eval-lm ./data/lambada/ --batch-size 4 --sample-break-mode eos --path /path/to/ckpt
python evaluation_scripts/eval_lambada.py --data-path ./data/lambada/lambada_test.jsonl --preprocess --path /path/to/ckpt
```
The script for selecting the best checkpoint on the validation set is available at ./evaluation_scripts/ckpt_selection_valid.py.
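For reference, LAMBADA accuracy is computed over the final word of each passage. The sketch below shows that scoring under two assumptions: each line of lambada_test.jsonl is a JSON object with a "text" field, and `predict_last_word` stands in for the trained model's greedy prediction (the real logic lives in evaluation_scripts/eval_lambada.py).
```python
import json

def last_word_accuracy(jsonl_path, predict_last_word):
    """predict_last_word(context) -> str is a stand-in for the LM's greedy prediction."""
    correct = total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]        # assumption: one {"text": ...} object per line
            context, target = text.rsplit(" ", 1)  # LAMBADA scores only the final word
            correct += int(predict_last_word(context).strip() == target)
            total += 1
    return correct / total
```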
We are currently working on the following directions to push VaLM to a higher level:
- Adapt VaLM to vision-language tasks, especially image captioning and visual question-answering
- Train larger-size VaLM models, i.e., VaLM-Medium, VaLM-Large, and VaLM-XL
- Adapt VaLM to an encoder-decoder architecture for NLG tasks
If you are interested in cooperation or have fantastic ideas based on VaLM, please contact weizhiwang AT ucsb.edu or open a GitHub issue.