
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

This is the official code of the CVPR 2023 paper (Highlight presentation, acceptance rate: 2.5% of submitted papers) CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [CVPR Version] [arXiv Version].

!!! See Also

  • Awesome AI Sign Language Papers. If you are new to or interested in the AI sign language field, we highly recommend browsing this repository. We have comprehensively collected papers on AI Sign Language (SL) and, for easy searching and viewing, categorized them by different criteria (time, type of research, institution, etc.). Feel free to add content and submit updates.
  • Extension Work: The novel cross-modal transformation proposed in this work has been successfully applied to a protein design framework (an important cross-modal protein task in AI for life science), where it achieves excellent performance (e.g., MMDesign: Multi-Modality Transfer Learning for Generative Protein Design).
  • Stay tuned for more of our work related to this project!

News

  • 2023.03.21 -> This work was selected as a highlight by CVPR 2023 (Top 2.5% of submissions, 10% of accepted papers)

  • 2023.02.28 -> This work was accepted to CVPR 2023 (9,155 submissions, 2,360 accepted papers, 25.78% acceptance rate)

Proposed CVT-SLR Framework

(Figure: overview of the proposed CVT-SLR framework)

For more details, please refer to our paper.

Prerequisites

Dependencies

As a prerequisite, we suggest creating a fresh conda environment first. The reference Python dependency packages can be installed as follows:

(1) python==3.8.16

(2) torch==1.12.0+cu116, please see the PyTorch official website

(3) PyYAML==6.0

(4) tqdm==4.64.0

(5) opencv-python==4.2.0.32

(6) scipy==1.4.1

F.Y.I.: Not all of these packages are strictly required; adjust them according to your actual situation.

Besides, you must install ctcdecode==0.4 for beam search decoding; please see that repo for details. Run the following command to install ctcdecode:

cd ctcdecode && pip install .
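
For reference, here is a minimal sketch of how ctcdecode is typically used for beam search decoding. The gloss vocabulary, tensor shapes, and variable names below are illustrative assumptions, not the exact decoding code of this repo:

```python
import torch
from ctcdecode import CTCBeamDecoder

# Hypothetical gloss vocabulary; in practice it comes from the generated gloss dict.
vocab = ["<blank>", "MORGEN", "REGEN", "SONNE"]

decoder = CTCBeamDecoder(
    vocab,
    beam_width=10,
    blank_id=0,
    log_probs_input=True,  # we feed log-softmax outputs below
    num_processes=4,
)

# Dummy network output with shape (batch, time, vocab_size).
log_probs = torch.randn(1, 50, len(vocab)).log_softmax(-1)
beam_results, beam_scores, timesteps, out_lens = decoder.decode(log_probs)

# Best hypothesis of the first sample.
best = beam_results[0][0][: int(out_lens[0][0])]
print([vocab[int(i)] for i in best])
```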

Datasets

For data preparation, please download the phoenix2014 and phoenix2014T datasets in advance. After extracting, we suggest creating a soft link to the downloaded dataset.

For more details on data preparation and prerequisites, please refer to this repo. We are grateful for the foundation that their work has given us.

NB:
1) Please refer to the above-mentioned repo for extracting the datasets to the ./dataset directory.
2) Resize the original sign images from 210x260 to 256x256 for augmentation; the generated gloss dict and resized image sequences are saved in ./preprocess for your reference (a resizing sketch is shown below).
3) We did not use the sclite library for evaluation (it may be tricky to install) but instead use pure-Python evaluation tools; see ./evaluation.
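
As a rough illustration of step 2), the following sketch resizes frames with opencv-python. The directory layout and file pattern are assumptions for illustration only and do not reflect the repo's actual preprocessing script:

```python
import glob
import os

import cv2

SRC_DIR = "./dataset/phoenix2014/features/fullFrame-210x260px"  # assumed source layout
DST_DIR = "./preprocess/fullFrame-256x256px"                    # assumed output layout

for src_path in glob.glob(os.path.join(SRC_DIR, "**", "*.png"), recursive=True):
    img = cv2.imread(src_path)          # original sign frames are 210x260
    img = cv2.resize(img, (256, 256))   # resize to 256x256 for augmentation
    dst_path = os.path.join(DST_DIR, os.path.relpath(src_path, SRC_DIR))
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    cv2.imwrite(dst_path, img)
```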

Configuration Setting

According to your actual situation, update the configurations in ./configs/phoenix14.yaml and ./configs/cvtslt_eval_config.yaml. In particular, pay attention to hyper-parameters such as dataset_root, evaluation_dir, and work_dir.
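
For example, the fields above can be inspected and overridden with PyYAML as sketched below. The key names are taken from the paragraph above; the exact schema is defined by the repo's config files, so verify them against your copy:

```python
import yaml

CFG_PATH = "./configs/cvtslt_eval_config.yaml"

with open(CFG_PATH) as f:
    cfg = yaml.safe_load(f)

# Keys mentioned above; check that they exist in your config.
for key in ("dataset_root", "evaluation_dir", "work_dir"):
    print(key, "->", cfg.get(key))

# Example override (the path is a placeholder).
cfg["dataset_root"] = "./dataset/phoenix2014"
with open(CFG_PATH, "w") as f:
    yaml.safe_dump(cfg, f)  # note: rewriting drops comments/order from the YAML file
```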

Demo Evaluation

We provide pretrained CVT-SLR models for inference.

First, download the checkpoints to the ./trained_models directory from the links listed in the table below. Then, evaluate a pretrained model using one of the following scripts:

-> [Option 1] Using the AE-based configuration:

python run_demo.py --work-dir ./out_cvpr/cvtslt_2/ --config ./configs/cvtslt_eval_config.yaml --device 1 --load-weights ./trained_models/cvtslt_model_dev_19.87.pt --use_seqAE AE

Evaluation results: test 20.17%, dev 19.87%

-> [Option 2] Using the VAE-based configuration:

python run_demo.py --work-dir ./out_cvpr/cvtslt_1/ --config ./configs/cvtslt_eval_config.yaml --device 1 --load-weights ./trained_models/cvtslt_model_dev_19.80.pt --use_seqAE VAE

Evaluation results: test 20.06%, dev 19.80%

The updated evaluation results (WER %) and download links:

| Group | Models | Dev | Test | Trained Checkpoints |
|---|---|---|---|---|
| Group 1 (single-cue) | SubUNet | 40.8 | 40.7 | - |
| | Staged-Opt | 39.4 | 38.7 | - |
| | Align-iOpt | 37.1 | 36.7 | - |
| | DPD+TEM | 35.6 | 34.5 | - |
| | Re-Sign | 27.1 | 26.8 | - |
| | SFL | 26.2 | 26.8 | - |
| | DNF | 23.8 | 24.4 | - |
| | FCN | 23.7 | 23.9 | - |
| | VAC | 21.2 | 22.3 | - |
| | CMA | 21.3 | 21.9 | - |
| | SFL | 24.9 | 25.3 | - |
| | VL-SLT | 21.9 | 22.5 | - |
| | SMKD | 20.8 | 21.0 | - |
| Group 2 (multi-cue) | DNF | 23.1 | 22.9 | - |
| | STMC | 21.1 | 20.7 | - |
| | C2SLR | 20.5 | 20.4 | - |
| Group 3 (Ours) | CVT-SLR w/ AE | 19.87 | 20.17 | [Baidu] (pwd: k42q) or [GoogleDrive] |
| | CVT-SLR w/ VAE | 19.80 | 20.06 | [Baidu] (pwd: 0kga) or [GoogleDrive] |

NB: please refer to our paper for more details.

Visualization

  • Saliency Maps

(Figure: saliency map visualizations)

We visualize the key parts of the sign video frames that the model focuses on by using Grad-CAM. To implement this, you can use the open-source Python tool pytorch_grad_cam (i.e., `import pytorch_grad_cam`); a minimal usage sketch is given after this list.
  • Cross-modal Alignment Matrices

(Figure: cross-modal alignment matrices)

To generate the cross-modal alignment matrices, here are some hints:
import torch
# ret is the model output dict; each logits tensor has shape (time, 1, vocab_size)
a = ret["conv_logits"].squeeze(1)      # logits from the visual (conv) branch
b = ret["sequence_logits"].squeeze(1)  # logits from the sequential (textual) branch
T = 1                                  # temperature factor
simi_matrix = torch.softmax(T * (a @ b.T), dim=-1)
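
Below is a minimal Grad-CAM sketch with the pytorch_grad_cam package, as referenced in the Saliency Maps item above. The backbone, target layer, and inputs are illustrative assumptions (a generic ResNet applied to a single frame), not the exact visualization code used for CVT-SLR:

```python
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision.models import resnet18

# Assumed setup: a generic 2D CNN backbone and one preprocessed video frame.
model = resnet18(pretrained=True).eval()
target_layers = [model.layer4[-1]]

input_tensor = torch.randn(1, 3, 224, 224)                 # stand-in for a normalized frame
rgb_img = np.random.rand(224, 224, 3).astype(np.float32)   # stand-in for the raw frame in [0, 1]

cam = GradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor)             # saliency maps, shape (batch, H, W)
overlay = show_cam_on_image(rgb_img, grayscale_cam[0], use_rgb=True)
```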

Citation

If you find this repository useful, please consider citing:

@inproceedings{zheng2023cvt,
  title={Cvt-slr: Contrastive visual-textual transformation for sign language recognition with variational alignment},
  author={Zheng, Jiangbin and Wang, Yile and Tan, Cheng and Li, Siyuan and Wang, Ge and Xia, Jun and Chen, Yidong and Li, Stan Z},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23141--23150},
  year={2023}
}
