Test-Time Zero-Shot Temporal Action Localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T³AL). In a nutshell, T³AL adapts a pre-trained Vision and Language Model (VLM) at inference time on a sample basis. T³AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T³AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T³AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
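To make the first step more concrete, here is a minimal sketch of how a video-level pseudo-label could be computed from frame features. The abstract only states that information from the entire video is aggregated; the mean pooling, the cosine similarity, and the names `mean_pool`, `cosine` and `video_pseudo_label` are assumptions made for illustration and may differ from the actual implementation in this repository.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector (assumed aggregation)."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[i] for frame in frame_features) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def video_pseudo_label(frame_features, class_embeddings):
    """Return the class name whose text embedding is closest to the pooled video vector."""
    video_vec = mean_pool(frame_features)
    return max(class_embeddings, key=lambda name: cosine(video_vec, class_embeddings[name]))


# Toy example with 2-dimensional features (hypothetical values, not real CoCa features).
frames = [[1.0, 0.0], [0.8, 0.2]]
classes = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
label = video_pseudo_label(frames, classes)
```

In this toy example the pooled vector is closer to the "run" embedding, so that class is selected as the pseudo-label.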

Setup

We recommend using a Linux machine with CUDA-compatible GPUs. We provide a Conda environment to configure the required libraries.

Clone the repo with:

git clone ...
cd T3AL

Conda

The environment can be installed and activated with:

conda create --name t3al python=3.8
conda activate t3al
pip install -r requirements.txt

Preparing Datasets

We recommend using pre-extracted CoCa features to accelerate inference. Please download the extracted features for the THUMOS14 and ActivityNet-v1.3 datasets from the links below:

Pre-extracted Features

Dataset	Link
THUMOS14	Google Drive
ActivityNet-v1.3	Google Drive

Then set the paths in the config file config/<dataset_name>.yaml, for example:

training:
  feature_path: '/path/to/Thumos14/features/'
  video_path: '/path/to/Thumos14/videos/'
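Before launching an evaluation it can help to check that the configured paths exist. The sketch below assumes the `training` section has already been loaded into a Python dict (for example with PyYAML's `yaml.safe_load`) and that it uses the `feature_path` and `video_path` keys shown above; the helper name `missing_paths` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_paths(training_cfg):
    """Return the config keys whose value is absent or is not an existing directory."""
    missing = []
    for key in ("feature_path", "video_path"):
        value = training_cfg.get(key)
        if value is None or not Path(value).is_dir():
            missing.append(key)
    return missing


# Example with the placeholder paths from the config snippet above.
cfg = {
    "feature_path": "/path/to/Thumos14/features/",
    "video_path": "/path/to/Thumos14/videos/",
}
problems = missing_paths(cfg)
if problems:
    print("Please fix these entries in the config file:", ", ".join(problems))
```

An empty list means both directories were found.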

Evaluation

To evaluate the method on a dataset of interest, choose the split and setting and run the following command:

python src/train.py experiment=tt_<dataset_name> data=<dataset_name> model.video_path=</path/to/data/> model.split=<split> model.setting=<setting> data.nsplit=0 exp_name=<exp_name>

We provide config files for the main method (tt_<dataset_name>), the training-free baseline (tf_<dataset_name>), and the baselines (baseline).
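When running several configurations, the command can be assembled programmatically. The sketch below mirrors the overrides of the evaluation command above. The helper name `build_eval_command` and its `prefix` parameter are hypothetical, and selecting the training-free baseline with `experiment=tf_<dataset_name>` is an assumption based on the config names listed here. The valid values for the dataset name, split and setting are not listed in this README, so the arguments in the example are placeholders only.

```python
import shlex


def build_eval_command(dataset_name, video_path, split, setting, exp_name,
                       prefix="tt", nsplit=0):
    """Build the argument list for src/train.py using the overrides shown in this README."""
    return [
        "python", "src/train.py",
        f"experiment={prefix}_{dataset_name}",
        f"data={dataset_name}",
        f"model.video_path={video_path}",
        f"model.split={split}",
        f"model.setting={setting}",
        f"data.nsplit={nsplit}",
        f"exp_name={exp_name}",
    ]


# Placeholder arguments; replace them with real values for your setup.
cmd = build_eval_command("DATASET", "/path/to/data/", "SPLIT", "SETTING", "my_run")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.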

Citation

Please consider citing our paper in your publications if the project helps your research.


benedettaliberatori/T3AL
