Official implementation of our paper "Visually-Augmented Language Modeling". Please cite our paper if you find this repository helpful in your research:
```bibtex
@article{valm,
  title={Visually-augmented language modeling},
  author={Wang, Weizhi and Dong, Li and Cheng, Hao and Song, Haoyu and Liu, Xiaodong and Yan, Xifeng and Gao, Jianfeng and Wei, Furu},
  journal={arXiv preprint arXiv:2205.10178},
  year={2022}
}
```
Create a virtual environment and run
```bash
bash setup.sh
```
The revised fairseq and other required packages will then be installed. We strongly recommend using a Python version >= 3.6 and <= 3.8 for stability.
- Preprocessing text training data:
```bash
bash myscripts/preprocess_valm_text.sh
```
The CC100 English corpus is available at CC100-EN. The sharding script is available at ./data/roberta-cc100-ori/sharded_data.py.
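The sharding step simply splits the large corpus into fixed-size pieces so that preprocessing and training can work on one shard at a time. Below is a minimal, illustrative sketch of that idea; the actual logic lives in sharded_data.py, and the file names and shard count here are placeholder assumptions.
```python
# Illustrative sketch only: split a large plain-text corpus into NUM_SHARDS files,
# distributing lines round-robin. See ./data/roberta-cc100-ori/sharded_data.py for
# the real implementation; paths and NUM_SHARDS below are placeholders.
NUM_SHARDS = 16
corpus_path = "cc100_en.txt"        # placeholder input corpus
out_template = "shard_{:02d}.txt"   # placeholder output naming

shards = [open(out_template.format(i), "w", encoding="utf-8") for i in range(NUM_SHARDS)]
with open(corpus_path, "r", encoding="utf-8") as fin:
    for line_id, line in enumerate(fin):
        shards[line_id % NUM_SHARDS].write(line)
for f in shards:
    f.close()
```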
- Preprocessing image data:
Please refer to LAION for downloading the image dataset used to build the image visual knowledge base.
- Extract image features, running the command once for each portion in 0-15:
```bash
for portion in {0..15}; do
    python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
        --image_data_path /multimodal/data/image/laion_all \
        --tar_id_start 0 --tar_id_end 20000 \
        --n_gpus 16 \
        --portion $portion
done
```
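Under the hood, this step encodes every LAION image with a CLIP image encoder and saves the embeddings shard by shard. A minimal sketch of that step is below; it assumes the openai CLIP package and a ViT-L/14 backbone (consistent with the 768-dimensional features used for the datastore later), and the image file names are placeholders.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: ViT-L/14, whose 768-d image embeddings match --dimension 768 below.
model, preprocess = clip.load("ViT-L/14", device=device)

image_paths = ["000048808.jpg", "000032573.jpg"]  # placeholder images from one LAION tar
batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths]).to(device)

with torch.no_grad():
    feats = model.encode_image(batch)                  # (num_images, 768)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize for cosine retrieval

torch.save(feats.half().cpu(), "img_features_12345.pt")  # one .pt file per tar id
```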
Once the image features are extracted, we can run a sanity check:
```bash
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
    --image_data_path /multimodal/data/image/laion_all \
    --verify
```
This retrieves from the first five shards, and the output should be:
```
imageRetriever.retrieve("A cute cat") # 000048808.jpg √
imageRetriever.retrieve("A cute dog") # 000032573.jpg √
```
The processed image features are stored at /mnt/multimodal/VaLM/image_features_raw. Each file is named like img_features_12345.pt, where 12345 is the id of the LAION tar file.
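To spot-check a single shard of features, you can load one of these files directly. The shape and dtype in the comment are assumptions based on the 768-dimensional CLIP features used throughout.
```python
import torch

feats = torch.load("/mnt/multimodal/VaLM/image_features_raw/img_features_12345.pt",
                   map_location="cpu")
print(feats.shape, feats.dtype)  # expected roughly (num_images_in_tar, 768), fp16 if saved in half precision
```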
- Constructing cached datastore of image features:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features_raw \
    --image_data_path /multimodal/data/image/laion_all \
    --save_image_datastore --dstore_mmap $DSTORE_PATH --dstore_fp16 \
    --dstore_size 191504487
```
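The cached datastore is a flat, disk-backed array holding all image features, which is what --dstore_mmap, --dstore_fp16, and --dstore_size describe. A minimal sketch of reading such a memmap back is below; the "_keys.npy" suffix follows a knn-lm-style convention and is an assumption, not something the command above guarantees.
```python
import numpy as np

DSTORE_PATH = "./data/image_feature_datastore_200M"
DSTORE_SIZE = 191504487   # number of stored image features (--dstore_size above)
DIM = 768                 # CLIP feature dimension

# Assumption: keys are stored as "<dstore_mmap>_keys.npy" in fp16, knn-lm style.
keys = np.memmap(f"{DSTORE_PATH}_keys.npy", dtype=np.float16, mode="r",
                 shape=(DSTORE_SIZE, DIM))
print(keys[0][:8])  # first few dimensions of the first stored image feature
```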
- Training faiss index of cached datastore:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/train_datastore_gpu.py --dstore_size 191504486 \
    --dstore_mmap $DSTORE_PATH \
    --dstore_fp16 --dimension 768 --ncentroids 131072
```
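Training the index learns an inverted-file structure over the 768-dimensional features (here with 131072 centroids) so that retrieval does not have to scan the full datastore. The sketch below shows the general shape of such a step with a plain faiss IVF+PQ index; the index type, code size, sample size, and file names are assumptions rather than the exact settings of train_datastore_gpu.py.
```python
import faiss
import numpy as np

DSTORE_PATH = "./data/image_feature_datastore_200M"
DSTORE_SIZE, DIM, NCENTROIDS = 191504486, 768, 131072

# Assumption: same "_keys.npy" naming as in the memmap sketch above.
keys = np.memmap(f"{DSTORE_PATH}_keys.npy", dtype=np.float16, mode="r",
                 shape=(DSTORE_SIZE, DIM))

quantizer = faiss.IndexFlatL2(DIM)
index = faiss.IndexIVFPQ(quantizer, DIM, NCENTROIDS, 64, 8)  # 64-byte PQ codes (assumption)

# Train on a random sample of keys (faiss expects float32); with this many centroids
# faiss may warn that a larger training sample is preferable.
rng = np.random.default_rng(0)
sample = keys[rng.integers(0, DSTORE_SIZE, size=1_000_000)].astype(np.float32)
index.train(sample)

# Add all keys in chunks, then persist the index next to the datastore.
for start in range(0, DSTORE_SIZE, 500_000):
    index.add(keys[start:start + 500_000].astype(np.float32))
faiss.write_index(index, f"{DSTORE_PATH}_knn_index")
```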
- Verify retrieval with samples:
```bash
DSTORE_PATH=./data/image_feature_datastore_200M
python ImageRetrieval/clip_image_retrieval.py --mount /mnt --ifp /multimodal/VaLM/image_features \
    --image_data_path /multimodal/data/image/laion_all \
    --verify_retriever --dstore_mmap $DSTORE_PATH \
    --dstore_filename $DSTORE_PATH --dstore_fp16 \
    --dstore_size 191504486
```
- The demo retrieval results will be written to ./html/reports.html. Download the html folder to see the results.
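Retrieval is text-to-image: the query text is encoded with the CLIP text encoder and searched against the faiss index over image features. A minimal sketch of that query path, reusing the assumed index filename and normalization from the sketches above:
```python
import clip
import faiss
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

DSTORE_PATH = "./data/image_feature_datastore_200M"
index = faiss.read_index(f"{DSTORE_PATH}_knn_index")  # assumption: filename from the training sketch
index.nprobe = 32  # number of inverted lists to visit per query (tunable)

with torch.no_grad():
    q = model.encode_text(clip.tokenize(["A cute cat"]).to(device))
    q = (q / q.norm(dim=-1, keepdim=True)).float().cpu().numpy()

distances, ids = index.search(q, 8)  # ids index into the cached image-feature datastore
print(ids[0])
```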
- Example training command on multiple data shards with 16 Tesla V100 GPUs:
```bash
bash myscripts/train_valm.sh
```
To train the text-only baseline GPT-BLIND, run:
```bash
bash myscripts/train_gpt_blind.sh
```
- Evaluate the trained checkpoint on object color reasoning:
```bash
python evaluation_scripts/verify_color_prediction.py --path /path/to/ckpt --model-overrides
```
- Evaluate the trained checkpoint on object size reasoning:
```bash
python evaluation_scripts/verify_size_reason.py --path /path/to/ckpt --model-overrides
```
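Both probes are zero-shot: the model scores candidate completions of a natural-language prompt and the highest-scoring candidate is taken as its prediction. The sketch below is a hypothetical illustration of that idea for color prediction; the prompt template, candidate list, and `sequence_logprob` scoring function are placeholders, not the exact logic of verify_color_prediction.py.
```python
# Hypothetical illustration of zero-shot color probing: rank candidate colors by the
# language model's log-probability of the completed prompt. `sequence_logprob` stands
# in for scoring a full sentence with the trained checkpoint.
CANDIDATE_COLORS = ["red", "orange", "yellow", "green", "blue",
                    "purple", "white", "black", "brown", "gray"]

def predict_color(obj: str, sequence_logprob) -> str:
    prompt = f"The color of the {obj} is"   # placeholder prompt template
    scores = {c: sequence_logprob(f"{prompt} {c}") for c in CANDIDATE_COLORS}
    return max(scores, key=scores.get)
```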
- Evaluate the trained checkpoint on language modeling (a sketch of the LAMBADA last-word scoring follows below):
```bash
fairseq-eval-lm ./data/wikitext-103/ --batch-size 4 --sample-break-mode eos --path /path/to/ckpt
fairseq-eval-lm ./data/lambada/ --batch-size 4 --sample-break-mode eos --path /path/to/ckpt
python evaluation_scripts/eval_lambada.py --data-path ./data/lambada/lambada_test.jsonl --preprocess --path /path/to/ckpt
```
The script for selecting the best checkpoint on the validation set is available at ./evaluation_scripts/ckpt_selection_valid.py.
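For reference, LAMBADA accuracy is computed over the final word of each passage. The sketch below shows that scoring under two assumptions: each line of lambada_test.jsonl is a JSON object with a "text" field, and `predict_last_word` stands in for the trained model's greedy prediction (the real logic lives in evaluation_scripts/eval_lambada.py).
```python
import json

def last_word_accuracy(jsonl_path, predict_last_word):
    """predict_last_word(context) -> str is a stand-in for the LM's greedy prediction."""
    correct = total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]        # assumption: one {"text": ...} object per line
            context, target = text.rsplit(" ", 1)  # LAMBADA scores only the final word
            correct += int(predict_last_word(context).strip() == target)
            total += 1
    return correct / total
```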
We are currently working on the following directions to push VaLM to a higher level:
- Adapt VaLM to vision-language tasks, especially image captioning and visual question-answering
- Train larger-size VaLM models, i.e., VaLM-Medium, VaLM-Large, and VaLM-XL
- Adapt VaLM to an encoder-decoder architecture for NLG tasks
If you are interested in cooperation or have fantastic ideas based on VaLM, please contact weizhiwang AT ucsb.edu or open a GitHub issue.