README.md (new file)
# Vokenization

PyTorch code for the EMNLP 2020 paper "Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision".

Install the dependencies:
```shell script
pip install -r requirements.txt
```

## xmatching: Contextualized Cross-Modal Matching
In this module (corresponding to Sec. 3.2 of the paper),
we mainly want to learn a matching model
that "contextually" measures the alignment between words and images.
The term "contextual" emphasizes that the sentences (i.e., the context) are also taken into consideration.

### Download Data
1. Download MS COCO images:
```shell script
# MS COCO
mkdir -p data/mscoco
wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
```

2. Download captions (split following the LXMERT project):
```shell script
mkdir -p data/lxmert
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
```

### Training Models
The model supports a wide range of visual and language backbones.
For visual backbones, most models in torchvision are supported. You might need to handle the last FC layer specially,
because it is defined differently across backbones.
For language backbones, we utilize the models in [pytorch.transformers](https://github.com/huggingface/transformers).
The list of supported models is provided in `CoX/model.py` and could be extended following the guidelines in
[pytorch.transformers](https://github.com/huggingface/transformers).
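
As a rough illustration of how a torchvision backbone and a transformers backbone could be paired, here is a minimal sketch (the actual classes in `CoX/model.py` likely differ; the joint dimension of 64 matches the `dim64` in the snapshot names further below, but the projection heads, pooling, and loss here are simplified assumptions):

```python
import torch
import torch.nn as nn
import torchvision.models as models
from transformers import BertModel, BertTokenizer

# Visual backbone: drop the ImageNet classification head so it outputs
# a pooled feature vector. ResNet/ResNeXt expose the head as `.fc`;
# other torchvision models use `.classifier` or similar, which is the
# "last FC layer" caveat mentioned above.
visn_backbone = models.resnext101_32x8d(pretrained=True)
visn_dim = visn_backbone.fc.in_features      # 2048 for ResNeXt-101
visn_backbone.fc = nn.Identity()

# Language backbone from huggingface transformers.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lang_backbone = BertModel.from_pretrained("bert-base-uncased")
lang_dim = lang_backbone.config.hidden_size  # 768 for BERT-base

# Project both modalities into a shared joint space (64-d, as in `dim64`).
joint_dim = 64
visn_proj = nn.Linear(visn_dim, joint_dim)
lang_proj = nn.Linear(lang_dim, joint_dim)

# Forward pass on dummy inputs.
images = torch.randn(2, 3, 224, 224)
img_keys = visn_proj(visn_backbone(images))              # (2, 64)

input_ids = tokenizer.encode("a dog on the grass", return_tensors="pt")
token_feats = lang_proj(lang_backbone(input_ids)[0])     # (1, T, 64)
```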

Running Commands:
```bash
# Run the cross-modal matching model with single-machine multi-processing distributed training
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext
# Or you could fine-tune the model
# bash finetune_cox.bash
```

## vokenization: Contextualized Retrieval
This step is a bridge between the cross-modal (word-and-image) matching models (CoX) and the
language-only pre-training models (CoL).
The final goal is vokenization, which converts language tokens into related images
(we call them **vokens**).
Here we mainly provide preprocessing tools (i.e., feature extraction, tokenization, and vokenization) and
evaluation tools for the previously trained CoX models.
Overall, there are four different utilities in this stage; the pipeline is:
```
Extracting Image Features -----> Benchmarking the Matching Models (Optional) --> Vokenization
Tokenization ---------------/
```
### Evaluating CoX Model and Vokenization
The evaluation includes two different parts: benchmarking the CoX matching models and the vokenization process itself; both are detailed in the sections below.
```bash
bash scripts/cox_benchmarking.bash 0 bert_resnext
```


### Pure Language Data Preprocessing
We provide scripts to get the datasets "wiki103", "wiki", and "Book Corpus".
We denote them as "XX-cased" or "XX-uncased", where the suffix "cased" / "uncased" only indicates
the casing of the raw text.
#### Wiki103
```bash
bash data/wiki103/get_data_cased.sh
```

#### English Wikipedia
The scripts to download and process the wiki data are copied from [XLM](https://github.com/facebookresearch/XLM):
```bash
bash data/wiki/get_data_cased.bash en
```

Note: RoBERTa requires an untokenized version of the wiki data,
so please use the following command instead:
```bash
bash data/wiki/get_data_cased_untokenized.bash en
```

### Extracting Image Features
The image preprocessing first extracts the image features to build the keys for retrieval.

We first build a list of image indexes with `CoR/create_image_ids.py`.
It unifies the image ids across different experiments, so that the feature arrays stored in HDF5 can be indexed consistently.
The image ids are saved under a shared path called `LOCAL_DIR`, which is defined in `CoR/common.py`.
This path stores the meta info for retrieval; all experiments agree on this meta info,
so that different retrieval experiments never end up with different indexings.
The image ids are saved under `{LOCAL_DIR}/images` in files named `{IMAGE_SET}_ids.txt`.

> Note: The ids created by `create_image_ids.py` only fix the order of the images.
> The actual images in the dictionary are provided by `extract_keys.bash` and thus correspond to the
> `_paths.txt` files, because `extract_keys` filters out all broken and non-existing images.

Commands:
```bash
# Step 1, Build image orders.
python col/cor/build_image_orders.py
```
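
As a rough, combined sketch of the bookkeeping behind these two steps (building the id list and the filtered `_paths.txt`), assuming one id or path per line and filename-stem ids; the actual scripts split this work as described in the note above, and the `LOCAL_DIR` / `IMAGE_SET` values here are placeholders:

```python
import os
from PIL import Image

LOCAL_DIR = "data/vokenization"        # placeholder; the real value is set in CoR/common.py
IMAGE_SET = "coco_train"               # hypothetical image-set name
img_dir = "data/mscoco/images/train2014"

ids, paths = [], []
for fname in sorted(os.listdir(img_dir)):
    path = os.path.join(img_dir, fname)
    try:
        Image.open(path).verify()      # skip broken or truncated images
    except Exception:
        continue
    ids.append(os.path.splitext(fname)[0])
    paths.append(path)

os.makedirs(os.path.join(LOCAL_DIR, "images"), exist_ok=True)
with open(os.path.join(LOCAL_DIR, "images", f"{IMAGE_SET}_ids.txt"), "w") as f:
    f.write("\n".join(ids) + "\n")
with open(os.path.join(LOCAL_DIR, "images", f"{IMAGE_SET}_paths.txt"), "w") as f:
    f.write("\n".join(paths) + "\n")
```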
Extract the image features for the list built above, using `CoR/extract_vision_keys.py`.
The code first reads the image ids saved in `{LOCAL_DIR}/images/{IMAGE_SET}_ids.txt` and locates the images.
The features will be saved under `{LOAD}/keys/coco_minival.hdf5`.
Commands:
```bash
# Step 2, Extract features.
# bash extract_keys.bash $GPU_ID $MODEL_NAME
bash extract_keys.bash 0 bert_resnext
```
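
A hedged sketch of what this key-extraction step boils down to, assuming the image paths from a `{IMAGE_SET}_paths.txt` file, a ResNeXt backbone, and an HDF5 dataset named `keys` (the real `extract_vision_keys.py` additionally handles batching, GPUs, and the learned joint-space projection):

```python
import os
import h5py
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

LOAD = "snap/xmatching/bert_resnext"   # placeholder for the trained CoX snapshot dir
paths_file = "data/vokenization/images/coco_minival_paths.txt"  # placeholder LOCAL_DIR

with open(paths_file) as f:
    paths = [line.strip() for line in f]

# Visual backbone without its classification head (as in the matching model).
backbone = models.resnext101_32x8d(pretrained=True).eval()
backbone.fc = torch.nn.Identity()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

feats = []
with torch.no_grad():
    for path in paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
        feats.append(backbone(img))                                     # (1, 2048)
feats = torch.cat(feats).numpy()

# One row per image, in the same order as the id/path lists.
os.makedirs(os.path.join(LOAD, "keys"), exist_ok=True)
with h5py.File(os.path.join(LOAD, "keys", "coco_minival.hdf5"), "w") as f:
    f.create_dataset("keys", data=feats)
```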

### Tokenization
The language preprocessing next tokenizes the language corpus.
It splits the paragraphs into sentences and tokenizes each sentence into the CoL-model-specific tokens (e.g., RoBERTa tokenization).
It locally saves three files: `{dset_path}.{tokenizer_name}`, `{dset_path}.{tokenizer_name}.sent`,
and `{dset_path}.{tokenizer_name}.dcmt`.
`{dset_path}.{tokenizer_name}` saves the tokens; each line in this file contains the tokens of one sentence.
`{dset_path}.{tokenizer_name}.sent` saves the starting token index of each sentence; it has `S+1` lines, where `S` is the number of sentences.
Sentence `i` contains the tokens `token[sent[i]: sent[i+1]]`.
`{dset_path}.{tokenizer_name}.dcmt` saves the starting sentence index of each document; it has `D+1` lines, where `D` is the number of documents.
Document `i` contains the sentences indexed by `dcmt[i]: dcmt[i+1]`.
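
As a concrete illustration of these index files, here is a minimal reading sketch; `dset_path` and `tokenizer_name` are example values, and the token file is assumed to be whitespace-separated (whether it holds subword strings or integer ids does not matter for the slicing):

```python
# Example values only; point these at an already-tokenized corpus.
dset_path, tokenizer_name = "data/wiki103/wiki.train.raw", "roberta-base"
prefix = f"{dset_path}.{tokenizer_name}"

# One line per sentence; concatenating the lines gives the flat token stream.
with open(prefix) as f:
    tokens = [tok for line in f for tok in line.split()]

# S+1 offsets: sentence i spans tokens[sent[i]:sent[i+1]].
with open(prefix + ".sent") as f:
    sent = [int(x) for x in f]

# D+1 offsets: document j spans sentence indexes dcmt[j]:dcmt[j+1].
with open(prefix + ".dcmt") as f:
    dcmt = [int(x) for x in f]

first_sentence_tokens = tokens[sent[0]:sent[1]]
num_sentences_in_first_doc = dcmt[1] - dcmt[0]
```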

Lastly, `to_hdf5.py` saves the processed tokens locally to the HDF5 file `{dset_path}.{tokenizer_name}.hdf5`.
Sentences can be fetched directly from this HDF5 file using the indexes in `.sent`.

```bash
# Tokenization
python CoR/preprocess/tokenize_wiki103.py
python CoR/preprocess/tokenize_wiki_all.py
python CoR/preprocess/tokenize_bc.py

# Save tokens to hdf5 files.
python to_hdf5.py # May need to comment out some datasets.
```

New tokenization:
```bash
cd CoR/preprocess_bert/
bash tokenize_wiki.bash
```


### Benchmarking Matching Models
> Before evaluating, please make sure that the [Extracting Image Features](#extracting-image-features) and [Tokenization](#tokenization) steps are completed.

We benchmark the performance of the CoX matching models with large-scale image retrieval over the extracted keys,
using the command shown in [Evaluating CoX Model and Vokenization](#evaluating-cox-model-and-vokenization) above.

### The Vokenization Process
After all these steps, we can start to vokenize the language corpus.
It loads the tokens saved in `{corpus}.{tokenizer_name}.hdf5`
and uses the sentence info in `{corpus}.{tokenizer_name}.sent`.

The code is highly optimized, and an interrupted run can be resumed by simply rerunning it.
The vokens will be saved in `{load}/vokens/wiki.train.raw.vg_nococo.hdf5` by default.
The file `{load}/vokens/wiki.train.raw.vg_nococo` contains the line-split vokens.
If you want to change the vokens' output dir, please specify the option `--output`.

```bash
python CoR/vokenize_corpus.py \
--load /ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/robertal4_finetune_from_pretrained \
--corpus=/ssd-playpen/data/wiki103/wiki.train.raw \
--image-sets=vg_nococo \
--tokenizer-name=roberta-base # The tokenizer-name must be provided!
```

> Note: `--tokenizer-name` must be provided in the script.
> Moreover, please make sure that the tokens have been extracted as described in [tokenization](#tokenization).

For wiki and wiki103, use the following scripts:
#### Wiki103
First, modify `LOAD=/ssd-playpen/home/hTan/CoL/CoX/snap/pretrain/cc_hinge05_dim64_resxt101_robertal4` in `vokenize_wiki103.bash`
to point to the saved CoX model.
```bash
bash vokenize_wiki103.bash 0
```
#### Wiki (all)
```bash
bash mpvokenize_wiki.bash 0,1,2,3
```
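
Conceptually, the vokenizer runs a nearest-neighbor search from every contextual token embedding into the bank of image keys (with faiss when available, or the torch-based GPU indexing mentioned in `requirements.txt`). Below is a hedged sketch of that core step with random stand-in tensors; the real scoring may differ in normalization and batching:

```python
import torch
import torch.nn.functional as F

# Stand-in tensors:
#   img_keys:   (N, d) image features from extract_vision_keys.py, projected to the joint space
#   token_embs: (T, d) contextual token embeddings from the language branch of the CoX model
N, T, d = 50000, 128, 64
img_keys = F.normalize(torch.randn(N, d), dim=-1)
token_embs = F.normalize(torch.randn(T, d), dim=-1)

# Relevance score of every (token, image) pair; the argmax over images is the voken id,
# i.e. an index into {IMAGE_SET}_ids.txt.
scores = token_embs @ img_keys.t()     # (T, N)
voken_ids = scores.argmax(dim=-1)      # (T,)

# In the real pipeline, the voken ids for the whole corpus are what get written to
# {load}/vokens/wiki.train.raw.vg_nococo.hdf5.
```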



## vlm: Visually-Supervised Language Model

### RoBERTa Pre-Training

### GLUE Fine-Tuning
#### Evaluate the last checkpoint
Running
```bash
bash run_glue.bash 0 3
```
runs GLUE fine-tuning on GPU 0 for 3 epochs. Then use
```bash
python CoL/show_glue_results.py
```
to check the GLUE results.

#### Evaluate multiple checkpoints

Running
```bash
bash run_glue_epochs.bash 0,1,2,3 --snaps 7
```
assesses 7 snapshots on GPUs 0,1,2,3 (setting `--snaps -1` assesses all checkpoints). Then use
```bash
python CoL/show_glue_results_epochs.py
```
to check all the GLUE results.
requirements.txt (new file)
torch
#==1.4.0
torchvision
#==0.5.0
transformers
#==2.7.0
tensorboardX

# For GLUE evaluation
sklearn

# Faiss supports fast indexing.
# The code also has a torch-implemented GPU indexing, so do not worry if you cannot install faiss.
faiss-gpu>=1.6.3

# spaCy is used for sentence segmentation; the sentences are the input to the cross-modality matching model.
spacy

# A higher h5py version to support h5py.VirtualLayout
h5py>=2.10.0
xmatching/__init__.py (empty file added)