CATR: Combinatorial-Dependence Audio-Queried Transformer
for Audio-Visual Video Segmentation

This repo contains the official implementation of the ACM MM 2023 paper:

CATR: Combinatorial-Dependence Audio-Queried Transformer
for Audio-Visual Video Segmentation

Kexin Li, Zongxin Yang∗, Lei Chen, Yi Yang, Jun Xiao

Motivation

Environment Installation

The code was tested in a Conda environment with CUDA 11.7. Install Conda, then create and activate an environment as follows:

conda create -n catr python=3.8.17 pip -y

conda activate catr

PyTorch 2.0.0:

conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia

Note that you might have to change the pytorch-cuda version above to match your system's CUDA version.

Hugging Face transformers 4.11.3:

pip install transformers==4.11.3

Additional required packages:

pip install h5py wandb opencv-python protobuf av einops ruamel.yaml timm joblib

conda install -c conda-forge pandas matplotlib cython scipy cupy
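After installation, a quick check that every package can be found may save a failed training run later. The following is a minimal sketch and not part of the repository: the helper name find_missing is hypothetical, and the module list is an assumption derived from the install commands above (some import names differ from the package names, e.g. opencv-python is imported as cv2 and protobuf as google.protobuf).

```python
import importlib.util

# Import names assumed from the install commands above (not taken from the repo).
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "transformers",
    "h5py", "wandb", "cv2", "google.protobuf", "av", "einops",
    "ruamel.yaml", "timm", "joblib",
    "pandas", "matplotlib", "Cython", "scipy", "cupy",
]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing modules:", ", ".join(missing) if missing else "none")
    if "torch" not in missing:
        import torch
        # Reports the installed version and whether a CUDA device is visible.
        print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```

Run it inside the activated catr environment; any name it reports as missing can be reinstalled with the corresponding command above.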

Running Configuration

Parameter settings are divided into fixed parameters and adjustable parameters.

The fixed parameters for each dataset are set in configs/DATASET_NAME.yaml.

The adjustable parameters, listed in the following table, can be set directly from the command line.

Command	Description
-visual_backbone	visual backbone: resnet50 or pvt
-log_dir	path where training logs are saved
-config_path	path to the config file with the fixed parameters
-train_batch_size	training batch size per GPU
-val_batch_size	evaluation batch size per GPU
-max_epoches	maximum number of epochs to run
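To illustrate how the adjustable options in the table could sit on top of the fixed parameters, here is a minimal sketch using argparse. It is an illustration only: how main.py actually parses its arguments is not shown in this README, the functions build_parser and merge_settings are hypothetical, the rule that command-line values override fixed ones is an assumption, and the fixed dictionary stands in for the contents of configs/DATASET_NAME.yaml with placeholder values.

```python
import argparse


def build_parser():
    """Parser exposing the six adjustable options from the table above."""
    parser = argparse.ArgumentParser(description="Sketch of the adjustable options")
    parser.add_argument("-visual_backbone", choices=["resnet50", "pvt"],
                        help="visual backbone")
    parser.add_argument("-log_dir", type=str, help="path where training logs are saved")
    parser.add_argument("-config_path", type=str,
                        help="path to the config file with the fixed parameters")
    parser.add_argument("-train_batch_size", type=int, help="training batch size per GPU")
    parser.add_argument("-val_batch_size", type=int, help="evaluation batch size per GPU")
    parser.add_argument("-max_epoches", type=int, help="maximum number of epochs to run")
    return parser


def merge_settings(fixed, args):
    """Overlay the options that were actually given on the command line
    onto the fixed parameters (assumed precedence: command line wins)."""
    merged = dict(fixed)
    merged.update({key: value for key, value in vars(args).items() if value is not None})
    return merged


# Placeholder stand-in for the parsed YAML config; keys and values are illustrative.
fixed = {"train_batch_size": 1, "val_batch_size": 1}

args = build_parser().parse_args(["-visual_backbone", "pvt", "-train_batch_size", "2"])
settings = merge_settings(fixed, args)
print(settings)
```

Options that are not passed keep their value from the fixed parameters, so the command line only needs to carry what differs between runs.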

Data Preparation

MS (Fully-supervised Multiple-sound Source Segmentation): https://forms.gle/GKzkU2pEkh8aQVHN6
S4 (Semi-supervised Single-sound Source Segmentation): https://forms.gle/GKzkU2pEkh8aQVHN6
AVSS (Fully-supervised Audio-Visual Semantic Segmentation): https://forms.gle/15GZeDbVMe2z6YUGA

Pretrained Backbones

The pretrained backbones can be downloaded from here and placed in the pretrained_backbones directory.

Config File

The config file can be downloaded from here; the extraction code is 'tybr'.

Framework
