Video Alignment

This repository contains code for several video alignment algorithms and can be used to reproduce some of the results of the paper LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers.

Facebook also open-sourced a C++ version of TMK, see this post for more information.


Use the environment.yml file to create a conda environment with the required packages:

git clone
cd videoalignment
conda env create --file=environment.yml
source activate videoalignment

Alternatively, you can install each package individually, with conda or pip:

# Create conda env
conda create -n videoalignment python=3.6 -y
source activate videoalignment

# Install required packages
conda install pytorch torchvision -c pytorch -y
conda install scikit-learn -y
conda install -c conda-forge ffmpeg -y
pip install imageio imageio-ffmpeg progressbar

How to reproduce the results

The fastest way to reproduce the results is to use the pre-computed features we provide. Once a dataset and its pre-computed features are set up correctly, you can run the evaluation as follows.

# valid dataset names are: `Climbing`, `Madonna`, `EVVE` and `VCDB`
# valid model names are: `TMK`, `TMK_Poullot` and `CTE`
python --dataset_test Climbing --model TMK --output_dir=${EVAL_DIR}

The script will run all models on all datasets.

Data setup

The datasets we have used for research are:

You can specify the path to the data by modifying the file

To use the code on your own dataset, write a VideoDataset class for it. The EVVE class is a good example to start with.

For each dataset, we expect the following directory tree:

root_dir (e.g. EVVE)
└───annotations (for EVVE) or annotation (VCDB)
│   │   some gt file or sub-folder with gt
│   │   ...
└───gt_climbing.align or gt_madonna.align (for Climbing and Madonna)
└───rmac_resnet34_29 (provided pre-computed features) 
│   │   video frame features.pkl
│   │   ...
└───rmac_resnet34_30 (optional, if you extracted features with another model)
│   │   video frame features.pkl
│   │   ...
│   │   file011.mp4 or sub-folder with videos
│   │   ...   
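
As a quick sanity check before running the evaluation, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `list_feature_files` is hypothetical, and it only assumes that pre-computed features are stored as .pkl files under `rmac_resnet34_29` inside the dataset root.

```python
from pathlib import Path


def list_feature_files(root_dir, feature_dir="rmac_resnet34_29"):
    """Return the sorted .pkl feature files found under root_dir/feature_dir.

    Hypothetical helper: it only checks the directory layout described above
    and does not open or validate the feature files themselves.
    """
    features = Path(root_dir) / feature_dir
    if not features.is_dir():
        raise FileNotFoundError(f"missing feature directory: {features}")
    # rglob also picks up feature files stored in sub-folders.
    return sorted(features.rglob("*.pkl"))
```

Calling `list_feature_files("/path/to/EVVE")` on an incomplete download raises an error early, instead of failing partway through an evaluation run.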

Expected results

In the following tables:

  • TMK_Poullot is the model referred to as TMK [26] in the paper
  • TMK (norm=feat_freq) is not in the paper
  • Frame descriptor is RMAC for all models
  • The Gist links show the expected output of the evaluation script.


See Table 2 in LAMV paper. Metric is percentage of queries localized better than a threshold (0.1s, 1s, 10s)

Model                  @0.1s   @1s     @10s    Gist
TMK_Poullot            9.13    36.52   64.78   link
TMK (norm=feat_freq)   33.91   70.00   76.96   link


See Table 2 in LAMV paper. Metric is percentage of queries localized better than a threshold (0.1s, 1s, 10s)

Model                  @0.1s   @1s     @10s    Gist
TMK_Poullot            0.00    3.37    22.47   link
TMK (norm=feat_freq)   14.61   28.09   34.83   link


See Table 3 in LAMV paper.

Model                  F1 Score   Gist
TMK_Poullot            66.02      link
TMK (norm=feat_freq)   67.30      link


See Table 4 in LAMV paper.

Model                                            mean mAP   Gist
TMK_Poullot                                      51.53      link
TMK (norm=feat_freq)                             53.43      link
TMK (norm=feat_freq) + Query expansion (N1=10)   58.36      link

Frame-level feature extraction

By design, the computation of frame-level features is separated from the rest of the code, and the core components are agnostic of the feature choice. Therefore, any frame-level feature can be used.

Our choice has been RMAC descriptors. They can be extracted from any layer of a Convolutional Neural Network, they are compact, they can be compared with dot products, and they work well in practice. We used ResNet-34 as our CNN and, after an exhaustive search over all its layers, found that layer 29 works best for Video Copy Detection. In the following, we assume layer 29 is used, even though most of the code can handle a different layer if needed.
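
To illustrate what "compared with dot products" means in practice, here is a small NumPy sketch. It assumes that frame descriptors are L2-normalized vectors (an assumption made for this example; the random arrays merely stand in for real RMAC features), in which case the dot product between two frames equals their cosine similarity.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


rng = np.random.default_rng(0)
d = 512                                           # illustrative descriptor size
frames_a = l2_normalize(rng.normal(size=(5, d)))  # stand-in for 5 frame descriptors
frames_b = l2_normalize(rng.normal(size=(7, d)))  # stand-in for 7 frame descriptors

# One matrix product gives all pairwise frame similarities, shape (5, 7).
similarity = frames_a @ frames_b.T
```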

The extraction of RMAC features is handled by

We provide some pre-computed features on the datasets Madonna, Climbing, EVVE and VCDB.

cd /path/to/madonna
# replace madonna (1.4GB) by climbing (620MB), EVVE (19GB) or VCDB (2.6GB)
wget && \

See the Data setup section for more details on the expected location of each file.

PCA+Whitening data

RMAC has a PCA+Whitening stage, which is trained on a dataset separate from the one used for evaluation. You can compute PCA+Whitening data on a dataset of your choice by calling the provided script with --train_pca.

However, if you do not need a specific dataset, you can use the PCA+W data we computed with ResNet-34/layer 29 on the entire YFCC100M dataset, which is large enough to give “good” PCA matrices.

To download the PCA weights:

mkdir -p ${PCA_PATH}
cd ${PCA_PATH}
wget && \
    unzip && \

The provided features have already been processed with this PCA matrix.
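
For readers unfamiliar with the technique, the following NumPy sketch shows generic PCA+Whitening: the statistics are learned on one set of descriptors and then applied to another. It is not the repository's implementation; the function names, the toy data and the final L2 normalization are assumptions made for illustration.

```python
import numpy as np


def fit_pca_whitening(train, out_dim, eps=1e-8):
    """Learn a mean vector and a whitening projection from training descriptors."""
    mean = train.mean(axis=0)
    centered = train - mean
    cov = centered.T @ centered / (len(train) - 1)
    eigval, eigvec = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:out_dim]     # keep the largest out_dim components
    proj = eigvec[:, order] / np.sqrt(eigval[order] + eps)
    return mean, proj


def apply_pca_whitening(x, mean, proj):
    """Center, project and whiten descriptors, then L2-normalize each row."""
    y = (x - mean) @ proj
    return y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), 1e-12)


rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 16)) * np.arange(1, 17)  # held-out "training" descriptors
mean, proj = fit_pca_whitening(train, out_dim=8)
test_whitened = apply_pca_whitening(rng.normal(size=(10, 16)), mean, proj)
```

Fitting on held-out data and only applying the result to the evaluation set mirrors the separation described above.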

Getting RMAC features for a video

If you want to compute the features yourself, you can use

python         \
    /path/to/video.ext          \
    ${PCA_PATH}/YFCC100M_pca    \
    /path/to/output/folder      \
    --fps=15                    \
    --resnet_level=29           \
    --workers=4


  • the first argument is the path to the video. Internally, it will use ffmpeg to read the video and get the raw frames from it.
  • the second argument is the “prefix” for PCA+W data. The script will add _mean_resnet34_29.t7 and _pca_Dvt_resnet34_29.t7 by itself.
  • the third argument is the directory where to output the features. This will save a .pkl file with frame-level features.
  • --fps specifies the frame rate at which the script samples frames from the video. Note that this is not the frame rate of the video itself. A good choice is 15 fps; whatever value you choose, use the same one for all the videos you want to compare, otherwise TMK won't work.
  • --workers is the number of threads that will be used for scaling and normalizing frames.
  • --b_s is the batch size
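
Because frames are sampled at a fixed --fps, a timestamp for each frame descriptor follows directly from its index. The sketch below assumes uniform sampling starting at time zero, so that frame i sits at i / fps seconds. The helper name is hypothetical, and the script's own timestamp convention may differ.

```python
def frame_timestamps(n_frames, fps=15.0):
    """Timestamps in seconds for n_frames sampled uniformly at `fps`, starting at 0."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in range(n_frames)]


# 45 frames sampled at 15 fps span just under three seconds of video.
timestamps = frame_timestamps(45, fps=15)
```

This also shows why mixing sampling rates is a problem: the same frame index would map to different times in different videos.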

Package structure and features

The computation of the descriptors, the matching of different descriptors, and performance evaluation are all managed by, even though they live in separate parts of the package. We will first take a quick look at those parts.


See the Data setup section to specify the path to your data, and the datasets we used.


Available models are implemented in

  • CTE: Circulant temporal encoding
  • TMK_Poullot: the original TMK by Poullot
  • TMK: the TMK with our normalization, but without the trainable part described in the paper.

All models share some common methods. In particular:

  • model.single_fv takes two inputs: a tensor of shape (b_s, T, d), where b_s is the batch size, T is the maximum length of the video, and d is the size of the frame-level descriptor; and a tensor of shape (b_s, T) containing the real-valued timestamps of each descriptor. It outputs the video descriptor (in the case of TMK, the TMK descriptor). Videos shorter than T can be zero-padded without affecting the final descriptor.
  • model.score_pair matches two descriptors, for a given list of offsets. In the most common case, this will be all the possible offsets in the range (-T, T) where T is the maximum length of the two videos.
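
The input shapes above imply that videos of different lengths must be brought to a common length T before batching. The following NumPy sketch shows one way to do that with zero-padding. `pad_batch` is a hypothetical helper, not part of the package, and the resulting arrays would still need converting to tensors (e.g. with torch.from_numpy) before being passed to single_fv.

```python
import numpy as np


def pad_batch(features, timestamps):
    """Zero-pad per-video arrays to shapes (b_s, T, d) and (b_s, T).

    features:   list of arrays, each of shape (n_frames_i, d)
    timestamps: list of arrays, each of shape (n_frames_i,)
    """
    b_s = len(features)
    T = max(f.shape[0] for f in features)
    d = features[0].shape[1]
    feats = np.zeros((b_s, T, d), dtype=np.float32)
    times = np.zeros((b_s, T), dtype=np.float32)
    for i, (f, t) in enumerate(zip(features, timestamps)):
        feats[i, : f.shape[0]] = f
        times[i, : t.shape[0]] = t
    return feats, times


videos = [np.ones((300, 512), dtype=np.float32), np.ones((400, 512), dtype=np.float32)]
stamps = [np.arange(300) / 15.0, np.arange(400) / 15.0]
feats, times = pad_batch(videos, stamps)   # shapes (2, 400, 512) and (2, 400)
```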


The code supports different evaluators:

  • Localization errors (Madonna/Climbing)
  • Segment-level precision/recall (as defined in the VCDB paper)
  • mAP for Event detection

These evaluators are called by the test method. Using them requires an appropriate dataset class to be defined, so if you plan to use this code for evaluation, you should first set up a dataset.
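
As an illustration of the localization metric reported in the tables (percentage of queries localized better than a threshold), here is a minimal sketch. It assumes that each query yields one alignment error in seconds and that "better than" means an absolute error less than or equal to the threshold. Both the helper and these conventions are assumptions, not the repository's evaluator.

```python
def localization_accuracy(errors_s, thresholds_s=(0.1, 1.0, 10.0)):
    """Percentage of queries whose alignment error is within each threshold."""
    if not errors_s:
        raise ValueError("at least one query error is required")
    return {
        t: 100.0 * sum(abs(e) <= t for e in errors_s) / len(errors_s)
        for t in thresholds_s
    }


# Four hypothetical queries with errors of 0.05 s, 0.5 s, 5 s and 50 s.
accuracy = localization_accuracy([0.05, 0.5, 5.0, 50.0])
# accuracy == {0.1: 25.0, 1.0: 50.0, 10.0: 75.0}
```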

Testing a model on a dataset, and evaluating its performance

All the components described above are called and arranged by, which also supports some command line arguments for the most frequent operations. For example, to run a model, test it on a dataset, and evaluate its performance, you just need to call:

python --dataset_test EVVE --model TMK_Poullot


  • --model is the name of the model (e.g. the name of the class in
  • --dataset_test is the test dataset (again, the name of the class in

Depending on the test dataset, different evaluators will be called. In the case of EVVE, the mAP will be computed. You can take a look at the test function in to see how this works.

What if I have my own data, and just want to run a model outside of this structure?

Here is a toy example to use the models on any data.

from typing import NamedTuple
import torch
from videoalignment.models import TMK_Poullot

class Args(NamedTuple):
    m: int = 16

model = TMK_Poullot(Args())

d = 512
n_frames_a = 300
frame_features_a = torch.zeros(1, n_frames_a, d).float()
timestamps_a = torch.zeros(1, n_frames_a).float()

n_frames_b = 400
frame_features_b = torch.zeros(1, n_frames_b, d).float()
timestamps_b = torch.zeros(1, n_frames_b).float()

tmk_fv_a = model.single_fv(frame_features_a, timestamps_a)
tmk_fv_b = model.single_fv(frame_features_b, timestamps_b)

offsets = torch.arange(-n_frames_b, n_frames_b).view(1, -1).float()
scores = model.score_pair(tmk_fv_a, tmk_fv_b, offsets)

  1. We load a model, in this case TMK_Poullot.
  2. Since all models expect to receive command-line arguments, we simulate that by building an Args class with the required parameters (here, m is the number of frequencies).
  3. We load the frame features and the timestamps of two videos (for the sake of simplicity, all zeros).
  4. We call single_fv to get the TMK features from the two videos. Notice that, although the two videos have different lengths, the two TMK feature vectors have the same size.
  5. We call score_pair to actually match the two videos, where offsets are all the deltas for which the TMK score is computed. Notice that you can specify different offsets for different videos in the batch. The last call returns a (1, 800) tensor containing the TMK score for every offset.

From here, you can:

  • Compute the maximum score over all offsets (i.e. torch.max(scores, 1)[0]), which can be used as a measure of similarity between the two videos.
  • Compute the argmax of the scores (i.e. torch.max(scores, 1)[1]), which gives you the predicted alignment point.
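
Note that the argmax is an index into the offsets tensor, so the offset value itself still has to be looked up. A minimal sketch, using NumPy arrays as stand-ins for the scores and offsets tensors of the toy example (the synthetic scores below are made up for illustration):

```python
import numpy as np

# Stand-ins for the tensors of the toy example: one pair of videos, 800 offsets.
offsets = np.arange(-400, 400, dtype=np.float32).reshape(1, -1)
scores = -np.abs(offsets - 37.0)          # synthetic scores peaking at offset 37

best_idx = scores.argmax(axis=1)          # index into the offsets array, per pair
best_score = scores.max(axis=1)           # similarity between the two videos
best_offset = np.take_along_axis(offsets, best_idx[:, None], axis=1)[:, 0]
```

With torch tensors, the same lookup can be written with torch.gather or plain indexing.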

The unit tests in should give you a good understanding of the API.



LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers

Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, Hervé Jégou. LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.


Temporal Matching Kernel with Explicit Feature Maps

Sébastien Poullot, Shunsuke Tsukatani, Anh Phuong Nguyen, Hervé Jégou, Shin'ichi Satoh. Temporal Matching Kernel with Explicit Feature Maps. ACM Multimedia 2015, Brisbane, Australia, October 2015. ACM Press, pp. 1-10. DOI: 10.1145/2733373.2806228. HAL: hal-01842277 (v1).


Circulant Temporal Encoding for Video Retrieval and Temporal Alignment

Matthijs Douze, Jérôme Revaud, Jakob J. Verbeek, Hervé Jégou, Cordelia Schmid. Circulant Temporal Encoding for Video Retrieval and Temporal Alignment. International Journal of Computer Vision.


Particular object retrieval with integral max-pooling of CNN activations (RMAC)

Giorgos Tolias, Ronan Sicre, Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations.


This repository is licensed under the CC BY-NC 4.0.

