Video-guided Machine Translation

This repo contains the starter code for the VATEX Translation Challenge for Video-guided Machine Translation (VMT). The task is to translate a source-language description into the target language, using video information as additional spatiotemporal context.

VMT is introduced in our ICCV oral paper "VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research". VATEX is a new large-scale multilingual video description dataset that contains over 41,250 videos and 825,000 captions in both English and Chinese; half of these captions are English-Chinese translation pairs. For more details, please check the latest version of the paper:


  • Python 3.7
  • PyTorch 1.4 (1.0+)
  • nltk 3.4.5


1. Download corpus files and the extracted video features

First, under the vmt/ directory, download the train/val/test JSON files:


Then download the I3D video features from here for trainval and here for test

# create DIR/vatex_features (replace DIR with your own path) to store the large video features
mkdir -p DIR/vatex_features

wget -P DIR/vatex_features
unzip DIR/vatex_features/
wget -P DIR/vatex_features
unzip DIR/vatex_features/

cd vmt/
ln -s DIR/vatex_features data/vatex_features
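
The linking step above can also be scripted. The following is a minimal Python sketch, not part of the starter code: the helper name link_features is hypothetical, and it only mirrors the `ln -s` command while checking that the feature directory exists and that the link is not already taken.

```python
import os


def link_features(feature_dir, link_path):
    """Create link_path as a symlink to feature_dir, similar to `ln -s`.

    Hypothetical helper for illustration; the paths are whatever you
    chose for DIR/vatex_features and data/vatex_features.
    """
    feature_dir = os.path.abspath(feature_dir)
    if not os.path.isdir(feature_dir):
        raise FileNotFoundError(f"feature directory not found: {feature_dir}")

    if os.path.islink(link_path):
        # An existing link is fine only if it already points at feature_dir.
        if os.path.realpath(link_path) == os.path.realpath(feature_dir):
            return link_path
        raise FileExistsError(f"{link_path} already links somewhere else")
    if os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")

    # Make sure the parent directory (e.g. data/) exists before linking.
    os.makedirs(os.path.dirname(os.path.abspath(link_path)), exist_ok=True)
    os.symlink(feature_dir, link_path, target_is_directory=True)
    return link_path
```

Run from inside vmt/, a call such as link_features("DIR/vatex_features", "data/vatex_features") would have the same effect as the command above, with DIR replaced by your own path.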

2. Training

To train the baseline VMT model:


The default hyperparameters are set in configs.yaml.




Specify the model name in configs.yaml. The script will generate a JSON file for submission to the VMT Challenge on CodaLab.


The baseline VMT model achieves the following corpus-level BLEU scores. The numbers differ slightly from those in the paper because of different evaluation setups; for a fair comparison, please compare against the numbers reported here:

| Model  | EN -> ZH | ZH -> EN |
|--------|----------|----------|
| BLEU-4 | 31.1     | 24.6     |

On the evaluation server, we report cumulative corpus-level BLEU score (up to 4-gram) and each individual n-gram score for reference, shown as B-1, ..., B-4.
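
To make the B-1, ..., B-4 numbers concrete, here is a minimal pure-Python sketch of cumulative corpus-level BLEU with uniform n-gram weights and the standard brevity penalty. It is an illustration under stated assumptions, not the evaluation server's script: the function names are hypothetical, inputs are assumed to be already tokenized (the tokenization used for Chinese is not specified here), and scores are returned as fractions in [0, 1] rather than percentages. NLTK's `nltk.translate.bleu_score.corpus_bleu` offers a library implementation of the same metric.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_scores(references, hypotheses, max_n=4):
    """Return cumulative corpus-level BLEU-1 .. BLEU-max_n as a list.

    references: for each segment, a list of tokenized reference sentences.
    hypotheses: for each segment, one tokenized hypothesis sentence.
    """
    matched = [0] * max_n  # clipped n-gram matches, summed over the corpus
    total = [0] * max_n    # hypothesis n-grams, summed over the corpus
    hyp_len = ref_len = 0

    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngram_counts(r, n)  # element-wise maximum
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0:
        return [0.0] * max_n
    # Brevity penalty: penalize corpora whose hypotheses are too short.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)

    scores, log_sum, zero = [], 0.0, False
    for n in range(max_n):
        # Once any order has no matches, every higher cumulative score is 0.
        zero = zero or matched[n] == 0
        if zero:
            scores.append(0.0)
            continue
        log_sum += math.log(matched[n] / total[n])
        scores.append(bp * math.exp(log_sum / (n + 1)))
    return scores


if __name__ == "__main__":
    refs = [[["a", "man", "is", "playing", "a", "guitar"]]]
    hyps = [["a", "man", "is", "playing", "guitar"]]
    for order, score in enumerate(corpus_bleu_scores(refs, hyps), start=1):
        print(f"B-{order}: {score:.4f}")
```

Because the counts are pooled over the whole corpus before the precisions are computed, the result is generally not the average of per-sentence BLEU scores.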

Model performance is evaluated by cumulative BLEU-4 score in the challenge.


Please cite our paper if you use our code or dataset:

author = {Wang, Xin and Wu, Jiawei and Chen, Junkun and Li, Lei and Wang, Yuan-Fang and Wang, William Yang},
title = {VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}