TSAM: Temporal Shift with Audio Modality

Overview

This code extends the TemporalShiftModule codebase.

We have implemented several extensions to TSM that adapt it for general-purpose video understanding (including emotion recognition):

  • incorporates the audio modality
  • extends the shift with a depth parameter (shifting not only neighboring but also distant frames)
  • uses weights pretrained on ImageNet-21K (INET21K)

Together, these extensions improve the performance of TSM on the Kinetics-400 and Something-Something-V1 datasets:

| Dataset | n-frame | TSM | TSAM |
| --- | --- | --- | --- |
| Kinetics-400 | 16 * 10 clips | 74.5% | 78.0% |
| Something-Something-V1 | 16 * 10 clips | 48.4% | 52.2% |

Table 1. Performance of TSAM vs. TSM on public datasets. In both cases, ResNet-50 was used as the backbone.

TSAM vs TSM

The figures below describe our extensions; more details can be found in the technical report.


Fig. 1 - Original TSM implementation: video segments are shifted by 1 segment.


Fig. 2 - TSAM implementation: video segments are shifted by 1, 2, ..., k segments. The audio segment is not shifted but is processed by the same backbone (weights are shared between video and audio segments).
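
For readers who prefer code to figures, below is a minimal PyTorch-style sketch of the idea in Fig. 2. All names here (TSAMSketch, feat_dim, the tensor shapes) are ours for illustration and are not the repository's actual API; in the real code the shift is inserted inside the backbone blocks, as configured in the "shift_temporal" section below.

```python
import torch
import torch.nn as nn


class TSAMSketch(nn.Module):
    """Minimal sketch of the Fig. 2 idea; names are illustrative, not the repo's API."""

    def __init__(self, backbone, num_classes, feat_dim=2048):
        super().__init__()
        self.backbone = backbone              # 2D CNN trunk (e.g. ResNet-50) returning feat_dim features
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, video, audio):
        # video: (N, T, 3, H, W) sampled frames; audio: (N, 3, H, W) spectrogram "image"
        n, t = video.shape[:2]
        # video frames pass through the backbone, where the temporal shift is applied
        # inside its blocks (see the "shift_temporal" section below)
        video_feat = self.backbone(video.flatten(0, 1)).reshape(n, t, -1)
        # the audio segment uses the *same* backbone weights but is never shifted
        audio_feat = self.backbone(audio).reshape(n, 1, -1)
        segments = torch.cat([video_feat, audio_feat], dim=1)   # T video + 1 audio segment
        return self.fc(segments.mean(dim=1))                    # average consensus over segments
```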

Prerequisites

The code is built with the following libraries:

Data Preparation

You need to prepare 2 config files: a data config file and a main config file (both described below).

Config files control all parameters for your run and are saved along with the results, so you can later easily match results to the parameters that produced them.

Config File Data (see example: example/videos4test):

First you need to set up the dataset folder and edit the data config file:

"data_dir": "videos4test",          <- full path to dataset_folder folder (see below DataSet Folder Structure)
"dir_videos": "videos",             <- relative to "data_dir" folder with videos
"dir_frames": "frames",             <- relative to "data_dir" folder with frames
"dir_audios": "audios",             <- relative to "data_dir" folder with audio (.wav files)
"file_train_list": "training.txt",  <- relative to "data_dir" file with labels for videos used for training (see example) 
"file_val_list": "validation.txt",  <- relative to "data_dir" file with labels for videos used for vaidation (see example)
"frames_tmpl": "{:05d}.jpg"         <- template for frame names

see example: training.txt file

see example: validation.txt file
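
The exact column layout is defined by those example files; as a rough illustration only (our assumption, not a guarantee), each line pairs a video id with its integer class label:

```
video_0001 0
video_0002 3
video_0003 1
```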

Recommended DataSet Folder Structure (see the videos4test folder):

├── dataset_folder
     │
     ├── Videos    <- Contains all videos (train, validation, test).
     │
     ├── Frames    <- Contains all frames extracted from the videos (train, validation, test).
     │
     ├── Audios    <- Contains all audio (.wav) files (train, validation, test).
     │
     ├── training.txt    <- List of videos (ids only) for training, with labels (labels must be consecutive integers 0, 1, ..., k).
     └── validation.txt  <- List of videos (ids only) for validation, with labels.

Attention: labels must be consecutive integers starting from 0 (0, 1, 2, ...).
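
If your raw annotations use arbitrary label values, a small helper such as the hypothetical one below (not part of this repository) can remap them to the required consecutive range before you write the list files:

```python
def remap_labels(raw_labels):
    """Map arbitrary hashable labels to consecutive integers 0..k-1.

    Hypothetical helper, not part of this repository: converts raw class
    names/ids into the consecutive integer labels that training.txt and
    validation.txt expect.
    """
    mapping = {label: i for i, label in enumerate(sorted(set(raw_labels)))}
    return [mapping[label] for label in raw_labels], mapping


# Example: ["dog", "cat", "dog", "bird"] -> [2, 1, 2, 0]
labels, mapping = remap_labels(["dog", "cat", "dog", "bird"])
```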

Config File:

The config file has the following subsections (a skeleton combining them is sketched after the list):

  • data_augmentation (specifies parameters for data augmentation)
  • model (specifies parameters for the model)
  • shift_temporal (specifies parameters for the temporal shift)
  • training_param (specifies parameters for training)
  • optimizer_param (specifies parameters for the optimizer and the learning schedule)
  • model_validation (specifies the number of sampled clips in the validation run)
  • results_folder (specifies the path to the output folder)
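
Putting the subsections together, a config file has roughly the skeleton below. The "model" and "shift_temporal" values are the ones documented in this README; the "..." entries are placeholders, so see config/kinetics400_config.json or config/example_config.json for complete, working files:

```json
{
  "data_augmentation": { "...": "augmentation parameters" },
  "model": {
    "video_segments": 8,
    "audio_segments": 1,
    "arch": "resnet50_timm",
    "dropout": 0.5
  },
  "shift_temporal": {
    "status": true,
    "f_div": 4,
    "shift_depth": 4,
    "n_insert": 2,
    "m_insert": 0
  },
  "training_param": { "...": "training parameters" },
  "optimizer_param": { "...": "optimizer and learning-schedule parameters" },
  "model_validation": { "...": "number of sampled clips for the validation run" },
  "results_folder": "results"
}
```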

Config File: "data_augmentation" :

In this section you can modify augmentation parameters.

Config File: "model" :

Specify model parameters:

"video_segments": 8,          <- number of frames to sample from the video
"audio_segments": 1 or 0,     <- should be 1 (include audio modality) or 0 (not include)
"arch": "resnet50_timm",      <- specify backbone 
"dropout": 0.5                <- specify dropout rate

Config File: "shift_temporal" :

Specify modelshift parameters:

  "status": true,             <- if true apply temporalshift
  "f_div": 4,                 <- propotion of features to be shifted (propotion = 1/f_div) 
  "shift_depth": 4,           <- shift depth (how many neighboring segments are affected by shift)
  "n_insert": 2,              <- specifies blocks in backbone to be shifted
  "m_insert": 0

Pretrained Models

To validate or predict with the pretrained model, you need to specify the input data path in the data config file (use the template from config/kinetics400_data.json).

To run validation in multi-clip mode with the model pretrained on Kinetics-400:

# test kinetics400
python main.py --data config/kinetics400_data.json --config config/kinetics400_config.json --validate kinetics400 --device 0,1,2,4
   

Similarly, to validate or predict with the model pretrained on Something-Something-V1, specify the input data path in the data config file (use the template from config/something_v1_data.json).

# test Something-Something V1
python main.py --data config/something_v1_data.json --config config/something_v1_config.json --validate something_v1 --device 0,1,2,4
   

Framing Videos

For video data pre-processing, you need ffmpeg to be installed.

Use our script to frame videos:

# frames the example videos4test videos into videos4test/frames
python3 video2frames.py --data config/example_data.json --fps 10

# --fps specifies the frames-per-second rate (default: 10)

# extracts .wav audio files from the example videos4test videos into videos4test/audios
python3 video2frames.py --data config/example_data.json --audio
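
If you want to check what the script produces for a single video, the extraction is roughly equivalent to the following ffmpeg calls (our approximation, with a placeholder VIDEO_ID and an assumed .mp4 extension, not the script's exact command lines):

```bash
# frames at 10 fps, named 00001.jpg, 00002.jpg, ... to match "frames_tmpl"
mkdir -p videos4test/frames/VIDEO_ID
ffmpeg -i videos4test/videos/VIDEO_ID.mp4 -vf fps=10 videos4test/frames/VIDEO_ID/%05d.jpg

# audio track extracted as a .wav file
ffmpeg -i videos4test/videos/VIDEO_ID.mp4 -vn videos4test/audios/VIDEO_ID.wav
```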

Train Model

To train a model on your own data, you need to specify the input data path in the data config file (use the template from config/kinetics400_data.json).

# to train on the kinetics400 dataset:
python main.py --data config/kinetics400_data.json --config config/kinetics400_config.json --device 0,1,2,4 --run_id "MyRunId"

# the output path is specified in the config file ("results_folder": "results"),
# so the results of this run are written to "results/MyRunId"

# to run an example training on the toy dataset (frame the videos first, see "Framing Videos" above):
python main.py --data config/example_data.json --config config/example_config.json --device 0,1,2,4 --run_id "MyRunId"