OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation (ECCV 2024)

Authors: Kwanyoung Kim*, Yujin Oh*, Jong Chul Ye
*Equal contribution

Abstract: The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, existing approaches remain limited in how closely they can align text embeddings with pixel embeddings using pre-trained CLIP knowledge. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce an extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
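For readers unfamiliar with Sinkhorn iterations, the following is a minimal NumPy sketch of entropy-regularized optimal transport between a handful of text prompt embeddings and a set of pixel embeddings. It is an illustration only and is not the MPS or MPSA implementation from this repository. The uniform marginals, the cosine-distance cost, the regularization strength `eps`, the iteration count `n_iters`, and the random stand-in embeddings are all assumptions made for the example.

```python
import numpy as np


def sinkhorn(cost, eps=0.5, n_iters=200):
    """Return an entropy-regularized transport plan for a cost matrix.

    Rows index text prompts and columns index pixels. Both marginals are
    assumed to be uniform, which is a simplification for this sketch.
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)  # mass assigned to each prompt (assumed uniform)
    nu = np.full(m, 1.0 / m)  # mass assigned to each pixel (assumed uniform)
    kernel = np.exp(-cost / eps)  # Gibbs kernel of the cost
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)    # rescale rows toward the prompt marginal
        v = nu / (kernel.T @ u)  # rescale columns toward the pixel marginal
    return u[:, None] * kernel * v[None, :]


# Random unit vectors stand in for CLIP text and pixel embeddings.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))
pixels = rng.normal(size=(32, 16))
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

cost = 1.0 - prompts @ pixels.T  # cosine distance, values in [0, 2]
plan = sinkhorn(cost)            # shape (4, 32), entries sum to 1
```

Each row of `plan` can be read as how one prompt distributes its mass over the pixels. Because the marginal constraints couple the rows, different prompts are pushed toward different pixels instead of all attending to the same region, which is the intuition behind using OT for multiple prompts.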

News

[2024.07.05] The official code is released.
[2024.07.04] Our paper is accepted to ECCV 2024.

Environment:

Install PyTorch:

conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch

Install required packages.

pip install -r requirements.txt

Downloading and preprocessing Dataset:

Follow the MMSegmentation dataset preparation guide: https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md

Preparing Pretrained CLIP model:

Download the pretrained CLIP model from https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt and save it as Path/to/pretrained/ViT-B-16.pt
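A corrupted or partial download of the checkpoint tends to surface only later as a confusing load error, so it can help to check the file first. The sketch below assumes that the 64-character hexadecimal component of the download URL is the SHA-256 digest of the file, which is how the official CLIP loader treats it to the best of our knowledge. The helper names `expected_sha256`, `file_sha256`, and `verify_checkpoint` are hypothetical and are not part of this repository.

```python
import hashlib

CLIP_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/"
    "ViT-B-16.pt"
)


def expected_sha256(url):
    """Return the second-to-last path component, assumed to be the digest."""
    return url.split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, url=CLIP_URL):
    """Return True if the file's digest matches the one embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)
```

For example, `verify_checkpoint("Path/to/pretrained/ViT-B-16.pt")` should return `True` for an intact download under the assumption above.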

Training (Inductive):

bash dist_train.sh configs/coco/vpt_seg_zero_vit-b_512x512_80k_12_100_multi.py Path/to/coco/induct
bash dist_train.sh configs/context/vpt_context_vpt.py Path/to/context/induct
bash dist_train.sh configs/voc12/vpt_seg_zero_vit-b_512x512_20k_12_10.py Path/to/voc12/induct

Training (Transductive):

bash dist_train.sh ./configs/coco/vpt_seg_zero_vit-b_512x512_40k_12_100_multi_st.py Path/to/coco/trans --load-from=Path/to/coco/induct/iter_40000.pth
bash dist_train.sh ./configs/context/vpt_context_st_vpt.py Path/to/context/trans --load-from=Path/to/context/induct/iter_20000.pth
bash dist_train.sh ./configs/voc12/vpt_seg_zero_vit-b_512x512_10k_12_10_st.py Path/to/voc12/trans --load-from=Path/to/voc12/induct/iter_10000.pth

Training (Fully supervised):

bash dist_train.sh configs/coco/vpt_seg_fully_vit-b_512x512_80k_12_100_multi.py Path/to/coco/full
bash dist_train.sh configs/context/vpt_context_fully.py Path/to/context/full
bash dist_train.sh configs/voc12/vpt_seg_fully_vit-b_512x512_20k_12_10.py Path/to/voc12/full

Pretrained models:

Dataset	Setting	Model Zoo
PASCAL VOC 2012	Inductive	[Google Drive]
PASCAL VOC 2012	Transductive	[Google Drive]
PASCAL VOC 2012	Fully	[Google Drive]
PASCAL CONTEXT	Inductive	[Google Drive]
PASCAL CONTEXT	Transductive	[Google Drive]
PASCAL CONTEXT	Fully	[Google Drive]
COCO Stuff 164K	Inductive	[Google Drive]
COCO Stuff 164K	Transductive	[Google Drive]
COCO Stuff 164K	Fully	[Google Drive]

Inference:

python test.py ./path/to/config ./path/to/model.pth --eval=mIoU
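The `--eval=mIoU` option reports mean intersection-over-union. As a reference for what that number means, here is a minimal NumPy sketch of the metric computed from a confusion matrix. It is not the MMSegmentation implementation used by `test.py`. The function name `mean_iou` and the `ignore_index=255` default are assumptions made for this example.

```python
import numpy as np


def mean_iou(pred, label, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the label."""
    pred = np.asarray(pred).ravel()
    label = np.asarray(label).ravel()
    keep = label != ignore_index  # drop pixels marked as "ignore"
    pred, label = pred[keep], label[keep]

    # confusion[i, j] counts pixels with ground truth i that were predicted as j
    confusion = np.bincount(
        label * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = intersection / union  # NaN for classes absent from both arrays
    return float(np.nanmean(iou))


# Tiny example: two classes, one ignored pixel.
label = [0, 0, 1, 1, 255]
pred = [0, 1, 1, 1, 0]
score = mean_iou(pred, label, num_classes=2)
```

In this example class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average, 7/12.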

Cross Dataset Inference:

CUDA_VISIBLE_DEVICES="0" python test.py ./configs/cross_dataset/coco-to-ade.py Path/to/coco/trans/iter_40000.pth --eval=mIoU
CUDA_VISIBLE_DEVICES="0" python test.py ./configs/cross_dataset/coco-to-context.py Path/to/coco/trans/iter_40000.pth --eval=mIoU
CUDA_VISIBLE_DEVICES="0" python test.py ./configs/cross_dataset/coco-to-voc.py Path/to/coco/trans/iter_40000.pth --eval=mIoU
CUDA_VISIBLE_DEVICES="0" python test.py ./configs/cross_dataset/context-to-coco.py Path/to/context/trans/iter_20000.pth --eval=mIoU
CUDA_VISIBLE_DEVICES="0" python test.py ./configs/cross_dataset/context-to-voc.py Path/to/context/trans/iter_20000.pth --eval=mIoU

Acknowledgement:

CLIP: https://github.com/openai/CLIP
Visual Prompt Tuning: https://github.com/KMnP/vpt
ZegOT: https://arxiv.org/abs/2301.12171
ZegCLIP: https://github.com/ZiqinZhou66/ZegCLIP
PLOT: https://github.com/CHENGY12/PLOT
Sinkformer: https://github.com/michaelsdr/sinkformers

Citation:

@misc{kim2024otsegmultipromptsinkhornattention,
      title={OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation}, 
      author={Kwanyoung Kim and Yujin Oh and Jong Chul Ye},
      year={2024},
      eprint={2403.14183},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2403.14183}, 
}
