CAAM: Cross Attention is All for Multimodal

The video's image and audio data, as well as caption text data describing the video, are passed through an encoder that can represent each data type well to generate a feature value.And we map these feature values to the joint embedding space via a projection head. After that, a final embedding vector is generated by considering the correlation between modalities through a multi-modal fusion transformer. Finally, we put this embedding vector into a classifier and proceeded with a classification task to determine which of the 20 categories the video belongs to.

Model Architecture

Dataset

MSR VTT

Original dataset: download
Our dataset: download

MSR-VTT is a large dataset constructed for video captioning tasks, consisting of 10,000 video clips corresponding to 20 categories. Each video clip is annotated with 20 English sentences. The dataset used 6513 training sets (train-set) and 497 validation sets (Valuation-set).

We used the extracted feature value for learning.

Training

This has trained the model with default settings, RTX 4080 16GB GPU memory occupied, batch size 16, 200 epoch .

python train.py \
--we_path 'data/GoogleNews-vectors-negative300.bin' \
--data_path 'data/msrvtt_category_train.pkl'\
--val_data_path 'data/msrvtt_category_test.pkl' \
--save_path 'weights_classifier' \
--exp trial1 \
--epoch 200 \
--use_softmax True \
--use_cls_token False \
--token_projection 'projection_net' \
--num_classes 20 \
--batch_size 16 \
--device "0"

Result

# Cross Attention	Accuracy
3	55.95
6 (Fix K)	54.66
6	59.11

3 cross-attention experiments
6 cross-attention experiments (experiments with three modalities / keys, values fixed)
6 cross-attention (an experiment in which key values were alternately used in pairs of 2 modalities)

Evaluation

# Method	# Modality	Acc @1%	Acc @5%
ViT(video)	V	53.7	82.9
GRU	V	49.5	79.2
MCA-WF(GRU)	V	53.8	83.8
GRU	V + T	53.1	81.8
MCA-WF(GRU)	V + T	56.4	84.02
ViLT	V + T	55.4	83.9
MCA-WF(ViLT)	V + T	58.8	85.3
Ours-CA3	V + T + A	55.95	-
Ours-Fix K	V + T + A	54.66	-
Ours	V + T + A	59.11	-

Name		Name	Last commit message	Last commit date
Latest commit History 140 Commits
dataset		dataset
docs		docs
model		model
source		source
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
app.py		app.py
demo.py		demo.py
requirements.txt		requirements.txt
test.py		test.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CAAM: Cross Attention is All for Multimodal

Model Architecture

Dataset

MSR VTT

Training

Result

Evaluation

About

Releases

Packages

Contributors 5

Languages

deepdaiv-multimodal/24w-Tri-Modalities

Folders and files

Latest commit

History

Repository files navigation

CAAM: Cross Attention is All for Multimodal

Model Architecture

Dataset

MSR VTT

Training

Result

Evaluation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages