
PyTorch code for the EMNLP 2020 paper "Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision" (Hao Tan and Mohit Bansal).


Note: I recommend focusing on "Wiki103" first and ignoring the code blocks related to "English Wikipedia". "Eng Wiki" might take too long to complete.


pip install -r requirements.txt

Requires Python 3.6+ (to support huggingface transformers).

Contextualized Cross-Modal Matching (xmatching)

In this module (corresponding to Sec. 3.2 of the paper), we learn a token-image matching model from sentence-image aligned data (i.e., image captioning data). The model "contextually" measures the relevance between tokens (i.e., words) and images. The term "contextual" emphasizes that the sentence (the context) is taken into account when measuring the token-image relevance score.
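As a rough illustration of what "contextual" relevance means, the sketch below scores one image against two occurrences of the same word in different sentences. It assumes the relevance score is an inner product between a context-dependent token vector and an image vector; the vectors are made-up toy values and `relevance_scores` is a hypothetical helper, not a function from this repository.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relevance_scores(token_vecs: List[Sequence[float]],
                     image_vec: Sequence[float]) -> List[float]:
    """Score one image against every (contextual) token vector."""
    return [dot(token_vec, image_vec) for token_vec in token_vecs]


# Toy values: the word "bank" receives a different vector in each sentence,
# because a contextual encoder looks at the whole sentence.
bank_in_river_sentence = [0.9, 0.1]
bank_in_money_sentence = [0.1, 0.9]
river_image = [1.0, 0.0]

scores = relevance_scores(
    [bank_in_river_sentence, bank_in_money_sentence], river_image
)
print(scores)
```

With these toy values the river image scores higher for "bank" in the river sentence than in the money sentence, which is the behavior a context-free word embedding could not produce.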

Download Image and Captioning Data

  1. Download MS COCO images:

    # MS COCO (Train 13G, Valid 6G)
    mkdir -p data/mscoco
    wget -P data/mscoco
    wget -P data/mscoco
    unzip data/mscoco/ -d data/mscoco/images/ && rm data/mscoco/
    unzip data/mscoco/ -d data/mscoco/images/ && rm data/mscoco/

    If you already have the COCO images on disk, save them as

      |-- mscoco
            |-- images
                 |-- train2014
                         |-- COCO_train2014_000000000009.jpg
                         |-- COCO_train2014_000000000025.jpg
                         |-- ......
                 |-- val2014
                         |-- COCO_val2014_000000000042.jpg
                         |-- ......
  2. Download captions (split following the LXMERT project):

    mkdir -p data/lxmert
    wget -P data/lxmert/
    wget -P data/lxmert/
    wget -P data/lxmert/
    wget -P data/lxmert/

Training the Cross-Modal Matching Model

The model is trained on MS COCO with pairwise hinge loss (details in Sec. 3.2 of the paper).
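For readers unfamiliar with the term, a pairwise hinge loss can be sketched as follows. This is a generic margin-based formulation and not necessarily the exact loss used in the training code; the function name and the margin value 0.5 are assumptions made for illustration.

```python
def pairwise_hinge_loss(pos_score: float, neg_score: float,
                        margin: float = 0.5) -> float:
    """Penalize the model unless the matched (positive) image outscores
    the mismatched (negative) image by at least `margin`.

    The margin of 0.5 is an illustrative assumption.
    """
    return max(0.0, margin - pos_score + neg_score)


# Positive pair wins by more than the margin: no loss.
print(pairwise_hinge_loss(pos_score=0.9, neg_score=0.2))
# Positive pair wins by less than the margin: positive loss.
print(pairwise_hinge_loss(pos_score=0.4, neg_score=0.3))
```

The loss is zero whenever the positive score exceeds the negative score by the margin, so training effort concentrates on pairs the model still confuses.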

Running Commands:

# Run the cross-modal matching model with single-machine multi-processing distributed training
# "0,1" indicates using the GPUs 0 and 1.
# "bert_resnext" is the name of this snapshot, which will be saved at snap/xmatching/bert_resnext
# "--visn resnext101_32x8d" is the vision backbone
# "--lang bert" is the language backbone
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert

The options --visn and --lang specify the architectures of the encoders. Tested options:

--visn $VISN_MODEL
VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152, 
            wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...} 
--lang $LANG_MODEL
LANG_MODEL={bert, roberta, xlnet, bert-large, ...}

For visual backbones, the models in torchvision are mostly supported. You might need to handle the last FC layer, because it is written differently in different backbones. The language backbones are initialized from huggingface transformers.

We found that the results with XLNet are quite low but have not identified the reason. Results of the other backbones are similar.

Vokenization (vokenization)

Vokenization is the bridge between the cross-modal (words-and-image) matching models (xmatching) and the visually-supervised language models (vlm). The final goal is to convert the language tokens to related images (which we call vokens). These vokens enable visual supervision of the language model. Here we mainly provide pre-processing tools (i.e., feature extraction, tokenization, and vokenization) and evaluation tools for the previously trained cross-modal matching models. Below is a diagram of these processes, which we discuss one by one:

Extracting Image Features-----> Benchmarking the Matching Models (Optional) --> Vokenization
Downloading Language Data --> Tokenization -->-->--/

Downloading and Pre-Processing Pure-Language Data

We provide scripts to get the datasets "wiki103" and "wiki". We denote them as "XX-cased" or "XX-uncased", where the suffix "cased" / "uncased" only indicates the property of the raw text.

  1. Wiki103. The wiki103 dataset is a selected subset of English Wikipedia, containing around 100M tokens.
    bash data/wiki103/
  2. English Wikipedia. The scripts to download and process the wiki data are modified from XLM. They download a 17G file. The download speed depends on the network, and filtering the data usually takes several hours. The process ends with around 2.8B tokens.
    bash data/wiki/get_data_cased.bash en
    Note: RoBERTa requires an untokenized version of wiki (otherwise the results would be much lower), so please use the following command:
    bash data/wiki/get_data_cased_untokenized.bash en

Note: I recommend focusing on "Wiki103" first and ignoring the code blocks related to "English Wikipedia". "Eng Wiki" might take too long to complete.

Tokenization of Language Data

We next tokenize the language corpus. This saves three files locally: "$dataset_name.$tokenizer_name", "$dataset_name.$tokenizer_name.hdf5", and "$dataset_name.$tokenizer_name.line". Taking the wiki103 dataset and the BERT tokenizer as an example, we convert the training file into

 |-- wiki103-cased 
        |-- wiki.train.raw.bert-base-uncased
        |-- wiki.train.raw.bert-base-uncased.hdf5
        |-- wiki.train.raw.bert-base-uncased.line

The text file wiki.train.raw.bert-base-uncased stores the tokens; each line in this file contains the tokens of the corresponding line in the original file. The hdf5 file wiki.train.raw.bert-base-uncased.hdf5 stores all the tokens contiguously and uses wiki.train.raw.bert-base-uncased.line to index the starting token of each line. The ".line" file has L+1 lines, where L is the number of lines in the original file. Line i of the original file occupies the range from "line[i]" to "line[i+1]" in the hdf5 file.
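The indexing scheme can be sketched with plain Python lists standing in for the hdf5 token array and the ".line" file. The sketch assumes the range is half-open, i.e. line i covers positions line[i] up to but not including line[i+1]; the token ids and the helper name `tokens_of_line` are made up for illustration.

```python
from typing import List, Sequence


def tokens_of_line(all_tokens: Sequence[int],
                   line_starts: Sequence[int],
                   i: int) -> List[int]:
    """Return the tokens of original line i from the flat token array.

    Assumes a half-open range [line_starts[i], line_starts[i + 1]).
    """
    return list(all_tokens[line_starts[i]:line_starts[i + 1]])


# Stand-in for the contiguous token array stored in the .hdf5 file.
all_tokens = [11, 12, 13, 14, 15, 16, 17]
# Stand-in for the .line file: L + 1 = 3 entries for L = 2 original lines.
line_starts = [0, 4, 7]

print(tokens_of_line(all_tokens, line_starts, 0))  # first original line
print(tokens_of_line(all_tokens, line_starts, 1))  # second original line
```

Storing L+1 offsets means the last line needs no special case: its end is simply the final entry, which equals the total number of tokens.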


  1. Wiki103 (around 10 min)
    bash tokenization/tokenize_wiki103_bert.bash 
  2. English Wikipedia (around 3 hours)
    bash tokenization/tokenize_wiki_bert.bash 

Extracting Image Features

The image pre-processing extracts the image features to build the keys in the vokenization retrieval process.

Download the Visual Genome (VG) images

Since MS COCO images are used to train the cross-modal matching model in xmatching, we use the Visual Genome images as candidate vokens for retrieval. We first download the images.

wget -P data/vg/
wget -P data/vg/
unzip data/vg/ -d data/vg/images && rm data/vg/
unzip data/vg/ -d data/vg/images && rm data/vg/
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../

If you already have the Visual Genome images on disk, save them as

|-- vg
    |-- images
         |-- 1000.jpg
         |-- 1001.jpg
         |-- ......

Build Universal Image Ids

We first build a list of universal image indexes with the script under vokenization/ It unifies the image ids across experiments so that the feature arrays stored in hdf5 can be indexed universally. The image ids are saved under a shared path LOCAL_DIR (defaulting to data/vokenization) defined in vokenization/ Specifically, they are saved under data/vokenization/images with the format {IMAGE_SET}_ids.txt. All experiments should agree with this meta info so that different retrieval experiments do not end up with different indexing.

Note: The ids created in this step only define the order of the images. The actual images in the dictionary are provided by extract_keys.bash and therefore correspond to the _paths.txt file, because extract_keys filters out all broken and non-existing images.
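The effect of this filtering on indexing can be sketched as follows. The helper `keep_existing` is hypothetical and only checks that a file exists; detecting broken (undecodable) images would need an image library and is left out. The sketch is not the logic of extract_keys.bash itself.

```python
import os
import tempfile
from typing import List


def keep_existing(image_paths: List[str]) -> List[str]:
    """Drop paths that do not point to a file on disk, preserving order."""
    return [path for path in image_paths if os.path.isfile(path)]


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "1000.jpg")
    missing = os.path.join(tmp, "1001.jpg")  # listed in the ids, not on disk
    also_present = os.path.join(tmp, "1002.jpg")
    for path in (present, also_present):
        open(path, "wb").close()  # create empty placeholder files

    kept = [os.path.basename(p)
            for p in keep_existing([present, missing, also_present])]

print(kept)
```

After filtering, "1002.jpg" moves from position 2 to position 1, which is why positions in the feature array should be looked up against the filtered path list rather than the original id order.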


# Step 1, Build image orders.
python vokenization/  

Extracting Image Features

Extract image features for the list built above, using the code under vokenization/ The code first reads the image ids saved in data/vokenization/images/{IMAGE_SET}_ids.txt and locates the images. The features are saved under snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5. It finishes within 1 hour.


# Step 2, Extract features. 
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME 
bash scripts/extract_keys.bash 0 bert_resnext 

Benchmarking Cross-Modal Matching Models (Optional)

Before evaluating, please make sure that extracting_image_features and tokenization are completed.

We benchmark the performance of cross-modal matching models at large scale. The evaluation includes two metrics: diversity and retrieval performance.

Diversity (in vokenization/) ensures that the same token type is mapped to diverse images depending on its context (i.e., the sentence). Retrieval (in vokenization/) measures the correspondence between the token and the retrieved images.
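One simple way to quantify diversity is to count how many distinct images each token type is mapped to across a corpus. The sketch below does exactly that; it is an assumed proxy and not necessarily the metric implemented in the repository, and the helper name and example ids are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def distinct_images_per_token(
        pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct image ids retrieved for each token type.

    `pairs` holds one (token, retrieved_image_id) entry per token occurrence.
    """
    images: Dict[str, Set[str]] = defaultdict(set)
    for token, image_id in pairs:
        images[token].add(image_id)
    return {token: len(ids) for token, ids in images.items()}


# Three occurrences of "bank" in different sentences, one of "the".
pairs = [
    ("bank", "vg_nococo/1"),
    ("bank", "vg_nococo/2"),
    ("bank", "vg_nococo/1"),
    ("the", "vg_nococo/3"),
]
diversity = distinct_images_per_token(pairs)
print(diversity)
```

A model that ignored context would map every occurrence of a token to the same image, giving a count of 1 for every token type.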

We gather these two utilities into one script; the command is:

bash scripts/xmatching_benchmark.bash 0 bert_resnext

The Vokenization Process

After all these steps, we can start to vokenize the language corpus. The process loads the tokens saved in dataset_name.tokenizer_name.hdf5 and uses the line-split information in dataset_name.tokenizer_name.line.

The code is optimized, and an interrupted run can be continued by simply rerunning it. The vokens will be saved in snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5 by default. The file snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids contains the universal image ids for each voken, e.g., the image id vg_nococo/8 corresponds to the 8-th feature saved in snap/xmatching/bert_resnext/keys/vg_nococo.hdf5.
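A universal image id such as vg_nococo/8 can be split into its image set and its position with a few lines of Python. The sketch assumes the id always has the form {IMAGE_SET}/{index}, as in the example above; `parse_universal_id` is a hypothetical helper, and whether the position is counted from 0 or 1 should be checked against the repository code.

```python
from typing import Tuple


def parse_universal_id(universal_id: str) -> Tuple[str, int]:
    """Split an id like 'vg_nococo/8' into ('vg_nococo', 8).

    Assumes the format '{IMAGE_SET}/{index}'; splitting on the last '/'
    keeps any slashes inside the image-set name intact.
    """
    image_set, index = universal_id.rsplit("/", 1)
    return image_set, int(index)


image_set, index = parse_universal_id("vg_nococo/8")
# The image set names the key file; the index selects a feature in it.
key_file = "snap/xmatching/bert_resnext/keys/{}.hdf5".format(image_set)
print(key_file, index)
```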

Note: --tokenizer-name must be provided in the script.


  1. Wiki103 (around 1 hour on 4 Titan V)
    # Note: mp is the abbreviation for "multi-processing"
    # bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
  2. English Wikipedia (around 1 day on 4 Titan V)
    # bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
    bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext

The script will call vokenization/ to vokenize a corpus. The vokenization happens in vokenization/ and it uses vokenization/ to do nearest neighbor search (based on faiss).
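Conceptually, the nearest neighbor search picks, for each token vector, the image key with the best matching score. The brute-force sketch below stands in for the faiss index used by the repository and assumes the score is an inner product; the function name and vectors are made up, and a real run would search over all image keys in batches on GPU.

```python
from typing import List, Sequence


def nearest_voken(query: Sequence[float],
                  keys: List[Sequence[float]]) -> int:
    """Return the index of the key with the largest inner product.

    Brute-force stand-in for an approximate or exact faiss search.
    """
    best_index, best_score = -1, float("-inf")
    for index, key in enumerate(keys):
        score = sum(q * k for q, k in zip(query, key))
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy image keys (one row per candidate image) and one token vector.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
token_vec = [0.2, 0.9]
voken_index = nearest_voken(token_vec, keys)
print(voken_index)
```

The returned index is the position of the selected image in the key array, which is what the universal image ids make stable across experiments.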

Visually-Supervised Language Model (vlm)

Pre-Training with VLM

As discussed in Sec. 2 of the paper, we use the previously generated vokens to pre-train the model with visual supervision.


After the vokenization process of wiki103, we can run the model with the command:

# bash scripts/small_vlm_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small

It will call vlm/ and run a BERT-6Layers-512Hiddens model on the wiki103 dataset with the support of voken supervision. The snapshot will be saved to snap/vlm/wiki103_bert_small. We recommend running this Wiki103 experiment first since it finishes in a reasonable time (20 hours). The pure BERT pre-training option is also available later for comparison.

Note: by default, mixed-precision training is not used. To support mixed-precision pre-training, please install the nvidia/apex library with the commands:

git clone
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

After that, you can bring back the options --fp16 and --fp16_opt_level O2 in the script scripts/small_vlm_wiki103.bash. I recommend using --fp16_opt_level O2. Although O2 might be unstable, it saves a lot of memory: the max per-GPU batch size is 32 with O1 but 64 with O2.

English Wikipedia

After the vokenization process of English Wikipedia, we can run the model with the command:

# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base

It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia dataset with the support of voken supervision. The snapshot will be saved to snap/vlm/wiki_bert_base.

It takes around 3-5 days on 4 Titan V / RTX 2080 cards and around 5-7 days on 4 Titan Pascal / T4 cards. (This estimate is reliable since I have run experiments on all of these servers.) Titan V / 2080 / T4 have native support for mixed-precision training (enabled by the --fp16 option, which requires installing apex), which makes training much faster. Titan Pascal would also save some memory with the --fp16 option.

GLUE Evaluation

By default, we use the GLUE benchmark (e.g., SST, MRPC, QQP, MNLI, QNLI) as downstream tasks. Other tasks could be evaluated following the setup here by changing the option --model_name_or_path to the correct snapshot path snap/bert/wiki103.

Download GLUE dataset

This downloading script is copied from the huggingface transformers project. Since transformers is still under heavy development, API changes might affect the code. I have upgraded the code for compatibility with transformers==3.3.

python --data_dir data/glue --tasks all

Finetuning on GLUE Tasks

The pre-trained snapshots are evaluated by fine-tuning them on the GLUE benchmark. The code is modified from huggingface transformers.

Running GLUE evaluation for snapshots from different epochs:

# bash scripts/run_glue_epochs.bash $GPUS $SNAP_PATH --snaps $NUM_OF_SNAPS
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7

This evaluates 7 snapshots using GPUs 0,1,2,3. Setting --snaps to -1 evaluates all checkpoints. If you just want to evaluate the last (usually the best) snapshot, please use:

bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1

Showing the results

For all results saved under snap/ (whatever the directory names), running the following command will print out all the results.

python vlm/ 

It will print results like:

     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   54.51   84.72   87.18   52.32   90.02   88.36   87.16   81.92   82.57   78.75
     RTE    MRPC   STS-B    CoLA   SST-2    QNLI     QQP    MNLI MNLI-MM    GLUE
   58.12   82.76   84.45   26.74   89.56   84.40   86.52   77.56   77.99   74.23
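In the sample output above, the GLUE column is consistent with the unweighted mean of the nine task columns. The sketch below checks this arithmetic on the two rows shown; `glue_average` is a hypothetical helper, and whether the repository script computes the column in exactly this way is an assumption.

```python
from typing import List


def glue_average(task_scores: List[float]) -> float:
    """Unweighted mean of the per-task scores, rounded to two decimals."""
    return round(sum(task_scores) / len(task_scores), 2)


# Task columns in order:
# RTE, MRPC, STS-B, CoLA, SST-2, QNLI, QQP, MNLI, MNLI-MM
first_row = [54.51, 84.72, 87.18, 52.32, 90.02, 88.36, 87.16, 81.92, 82.57]
second_row = [58.12, 82.76, 84.45, 26.74, 89.56, 84.40, 86.52, 77.56, 77.99]

print(glue_average(first_row))   # 78.75, matching the GLUE column
print(glue_average(second_row))  # 74.23, matching the GLUE column
```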

BERT (As baselines)

We also provide pure language-model pre-training as baselines.


# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small

It will call vlm/ and run a BERT-6Layers-512Hiddens model on the wiki103 dataset with the masked language model objective only. The snapshot will be saved to snap/bert/wiki103_bert_small.

Alternatively, you can directly use the script small_wiki103_glue.bash to run GLUE evaluation after pre-training finishes.

bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small

English Wikipedia


# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki

With GLUE evaluation:

bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki

Pre-processed Data and Pre-trained Models


Wiki103 (100M tokens)

mkdir -p data/wiki103-cased
wget -P data/wiki103-cased
wget -P data/wiki103-cased
wget -P data/wiki103-cased

Wiki (2800 M tokens)

mkdir -p data/wiki-cased
wget -P data/wiki-cased
wget -P data/wiki-cased
wget -P data/wiki-cased



If you find our project useful, please cite this paper:

  title={Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},


I am grateful for the support from the Bloomberg Data Science Ph.D. Fellowship. We thank the reviewers, Yixin Nie, and Jie Lei for their helpful discussions. Parts of the code are built on huggingface transformers, facebook xlm, and faiss.


