GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction

Authors: Samrudhdhi Rangrej, Kevin Liang, Tal Hassner, James Clark
Accepted to: WACV'23 (Paper)

[Figure: GliTr architecture overview]

An overview of our GliTr. GliTr consists of a frame-level spatial transformer $\mathcal{T}_f$ and causal temporal transformers $\mathcal{T}_c$ and $\mathcal{T}_l$. One training iteration requires $T$ forward passes through our model. Above, we show two consecutive forward passes at time $t \leq T-1$ and $t+1 \leq T$.

Forward pass $t$ (blue path): Given a new glimpse $g_t$, $\mathcal{T}_f$ extracts glimpse features $\hat{f}_t$. We append $\hat{f}_t$ to $\hat{f}_{1:t-1}$, i.e., the features extracted from $g_{1:t-1}$ during previous passes. Next, $\mathcal{T}_c$ predicts the label $\hat{y}_t$ from $\hat{f}_{1:t}$. Simultaneously, $\mathcal{T}_l$ predicts the next glimpse location $\hat{l}_{t+1}$ from $\hat{f}_{1:t}$.

Forward pass $t+1$ (orange path): Given the predicted location $\hat{l}_{t+1}$, we extract a glimpse $g_{t+1}$ at $\hat{l}_{t+1}$ from frame $x_{t+1}$ and then follow the same steps as the blue path. After $T$ forward passes, we compute the losses shown on the right. To obtain the targets $\tilde{y}_{1:T}$ and $\tilde{f}_{1:T}$ for spatial and temporal consistency, we use a separate pre-trained and fixed teacher model (shown on the left) that observes the complete frames $x_{1:T}$. To maintain stability, we stop gradients from $\mathcal{T}_l$ to $\mathcal{T}_f$.
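To make the rollout concrete, here is a minimal PyTorch-style sketch of the loop described above, assuming hypothetical module interfaces ($\mathcal{T}_f$ maps a glimpse to a feature vector; $\mathcal{T}_c$ and $\mathcal{T}_l$ map a feature history to class logits and a location) and a naive crop helper. It illustrates the two paths, not the repository's actual code:

```python
# Hypothetical sketch of GliTr's online glimpse rollout (not the repo's actual API).
import torch

def extract_glimpse(frame, loc, size):
    """Naive crop of a size x size patch per sample at normalized loc in [-1, 1]^2."""
    B, C, H, W = frame.shape
    crops = []
    for b in range(B):
        cy = int((loc[b, 0].clamp(-1, 1).item() + 1) / 2 * (H - size))
        cx = int((loc[b, 1].clamp(-1, 1).item() + 1) / 2 * (W - size))
        crops.append(frame[b, :, cy:cy + size, cx:cx + size])
    return torch.stack(crops)

def glitr_rollout(frames, T_f, T_c, T_l, glimpse_size=96):
    """frames: (B, T, C, H, W) full video; the model itself only sees the crops."""
    B, T = frames.shape[:2]
    feats, logits, locs = [], [], []
    loc = torch.zeros(B, 2)                     # assume the first glimpse is centered
    for t in range(T):
        g_t = extract_glimpse(frames[:, t], loc, glimpse_size)  # glimpse g_t
        feats.append(T_f(g_t))                  # glimpse features \hat{f}_t
        hist = torch.stack(feats, dim=1)        # history \hat{f}_{1:t}
        logits.append(T_c(hist))                # online label prediction \hat{y}_t
        loc = T_l(hist.detach())                # next location \hat{l}_{t+1}; detach()
        locs.append(loc)                        # stops gradients from T_l into T_f
    return torch.stack(logits, dim=1), torch.stack(locs, dim=1)
```

After the $T$ passes, the per-step predictions and stored features would be scored against the ground-truth labels and the teacher's targets $\tilde{y}_{1:T}$ and $\tilde{f}_{1:T}$ to form the classification and consistency losses.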

Requirements

  • numpy==1.19.2
  • torch==1.8.1
  • torchvision==0.9.1
  • wandb==0.12.9
  • timm==0.4.9
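If you are setting up the environment by hand, the pinned versions can be installed with a single pip command (assuming an existing Python environment; a CUDA build of torch may require the extra index options documented on the PyTorch site):

```bash
# Install the pinned dependencies listed above
pip install numpy==1.19.2 torch==1.8.1 torchvision==0.9.1 wandb==0.12.9 timm==0.4.9
```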

Datasets

Prepare both datasets (Something-Something V2 and Jester) by following the instructions for the Something-Something V2 dataset provided in the TSM repository.

Experiment Setup

Note: Create the following directories and set their paths in SSv2_Teacher.sh, Jester_Teacher.sh, SSv2_GliTr.sh, and Jester_GliTr.sh, as in the example after this list.

  • PRETRAINED_DIR="/absolute/path/to/directory/with/pretrained/weights/"
  • OUTPUT_DIR="/absolute/path/to/output/directory/"
  • DATA_DIR="/absolute/path/to/data/directory/"
  • LOG_DIR="/absolute/path/to/log/directory/"
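For reference, a filled-in configuration block at the top of each script could look like the following (the paths are placeholders to adapt, not defaults shipped with the repo):

```bash
# Example path configuration; edit these in each of the four scripts
PRETRAINED_DIR="/home/user/glitr/pretrained/"
OUTPUT_DIR="/home/user/glitr/output/"
DATA_DIR="/home/user/glitr/data/"
LOG_DIR="/home/user/glitr/logs/"
```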

Download the following pretrained models and store them in PRETRAINED_DIR (example renames are sketched after the list).

  • ViT-S/16 teacher weights from the ibot repository
    • Rename the file to ibot_vits_16_checkpoint_teacher.pth.
  • VideoMAE ViT-B (epoch 2400) fine-tuning weights for the Something-Something V2 dataset from the VideoMAE repository
    • Rename the file to videomae_ssv2_ep2400_vitB.pth.
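For example, the renames might look like this (the downloaded filenames on the left are assumptions; use whatever the ibot and VideoMAE releases actually serve):

```bash
cd "$PRETRAINED_DIR"
# Left-hand filenames are hypothetical; only the target names matter to GliTr
mv checkpoint_teacher.pth ibot_vits_16_checkpoint_teacher.pth
mv videomae_vitB_ssv2_ep2400.pth videomae_ssv2_ep2400_vitB.pth
```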

Training and Evaluation

  1. Run SSv2_Teacher.sh
  2. Run Jester_Teacher.sh (Set JESTER_PRETRAINED="/absolute/path/to/learnt/ssv2/teacher/weights/")
  3. Run SSv2_GliTr.sh (Set TEACHER_CHECKPOINT="/absolute/path/to/teacher/weights/")
  4. Run Jester_GliTr.sh (Set TEACHER_CHECKPOINT="/absolute/path/to/teacher/weights/")
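Put together, one full pass over the pipeline looks like this (checkpoint paths are set inside the scripts, per the notes above, before each run):

```bash
# Run in order; edit the indicated variable in each script first
bash SSv2_Teacher.sh      # 1. train the SSv2 teacher
bash Jester_Teacher.sh    # 2. JESTER_PRETRAINED -> learnt SSv2 teacher weights
bash SSv2_GliTr.sh        # 3. TEACHER_CHECKPOINT -> SSv2 teacher weights
bash Jester_GliTr.sh      # 4. TEACHER_CHECKPOINT -> Jester teacher weights
```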

Visualization


Glimpses selected by GliTr on the Something-Something V2 dataset. The complete frames are shown for reference only: GliTr does not observe full frames, only glimpses.

Acknowledgement

Our code is based on: deit, TSM, timm, AR-Net, catalyst, VideoMAE, STAM-Sequential-Transformers-Attention-Model

License

Please see LICENSE.md for more details.

Citation

If you find any part of our paper or this codebase useful, please consider citing our paper:

@inproceedings{rangrej2023glitr,
  title={GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction},
  author={Rangrej, Samrudhdhi B and Liang, Kevin J and Hassner, Tal and Clark, James J},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={3413--3423},
  year={2023}
}
