This repository has been archived by the owner on Jul 1, 2024. It is now read-only.

Text as prompts #93

Open
peiwang062 opened this issue Apr 8, 2023 · 31 comments
Labels
enhancement (New feature or request)

Comments

@peiwang062

Thanks for releasing this wonderful work!
I saw the demo shows examples of using points and boxes as input prompts. Does the demo support text as a prompt?

@stefanjaspers

Following! Text prompting has been mentioned in the research paper but hasn't been released yet. Really looking forward to this feature because I need it for a specific use case.

@darvilabtech

Exactly, waiting for it to be released.

@HaoZhang990127

Thank you for your exciting work!

I also want to use text as a prompt to generate masks in my project. Right now I am using CLIPSeg to generate the masks, but it does not perform well on fine-grained semantics.

When do you plan to open-source the text-as-prompt code? What is the approximate timeline? Waiting for this amazing work.

@jy00161yang

following

1 similar comment
@eware-godaddy

following

@0xbitches

The paper mentions they used CLIP to handle text prompts:

> We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82].

The demo does not seem to allow textual inputs, though.
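For context, the "off-the-shelf text encoder" piece is easy to try on its own with open_clip; what was never released is how the resulting embedding gets mapped into SAM's prompt-embedding space, so the projection layer in this sketch is a purely hypothetical stand-in:

```python
import torch
import open_clip

# Off-the-shelf CLIP text encoder (the paper does not say which variant was used).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():
    tokens = tokenizer(["a photo of a cat"])
    text_emb = model.encode_text(tokens)      # shape (1, 512)

# Hypothetical: SAM's released prompt encoder works with 256-d sparse prompt tokens,
# and the learned text-to-prompt mapping was not published, so this projection is
# only a placeholder for illustration.
to_prompt_dim = torch.nn.Linear(text_emb.shape[-1], 256)
sparse_text_prompt = to_prompt_dim(text_emb)  # would be fed to the mask decoder
```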

@darvilabtech

darvilabtech commented Apr 9, 2023

@peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang
https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

@peiwang062
Author

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Yes, we could simply combine the two, but if SAM can do it better on its own, why would we need two models? We also don't know whether Grounding DINO is the bottleneck if we just feed its output to SAM.

@alexw994

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Why not use the output of SAM as a bounding box?

@narbhar

narbhar commented Apr 10, 2023

> > @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
>
> Why not use the output of SAM as a bounding box?

The current version of SAM, without the CLIP text encoder, only produces instances from point or bounding-box prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. In the meantime, if you have a promptable text-based object detector you can bridge this gap yourself, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.
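A minimal sketch of that detector-to-SAM bridge, assuming locally downloaded Grounding DINO and SAM checkpoints (the config/checkpoint file names below are placeholders for your local paths):

```python
import numpy as np
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("example.jpg")   # image_source: HxWx3 RGB numpy array
boxes, logits, phrases = predict(
    model=dino, image=image, caption="a dog",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to absolute xyxy for SAM.
h, w, _ = image_source.shape
boxes = boxes * torch.tensor([w, h, w, h])
xyxy = torch.stack(
    [boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
     boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2], dim=1
).numpy()

predictor.set_image(image_source)
masks = [
    predictor.predict(box=box, multimask_output=False)[0]  # (1, H, W) bool mask per box
    for box in xyxy
]
```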

@luca-medeiros

Put together a demo of grounded-segment-anything with Gradio for easier testing.
I tested using CLIP, OpenCLIP, and Grounding DINO. Grounding DINO performs much better. Less than 1 sec on an A100 for DINO+SAM. Maybe I'll add the CLIP versions as well.
https://github.com/luca-medeiros/lang-segment-anything

@alexw994

alexw994 commented Apr 10, 2023

> > > @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
> >
> > Why not use the output of SAM as a bounding box?
>
> The current version of SAM, without the CLIP text encoder, only produces instances from point or bounding-box prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. In the meantime, if you have a promptable text-based object detector you can bridge this gap yourself, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

If SAM just segments within a bounding box, I think many other methods could be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

@peiwang062
Author

> > > > @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
> > >
> > > Why not use the output of SAM as a bounding box?
> >
> > The current version of SAM, without the CLIP text encoder, only produces instances from point or bounding-box prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. In the meantime, if you have a promptable text-based object detector you can bridge this gap yourself, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.
>
> If SAM just segments within a bounding box, I think many other methods could be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

It should be able to support boxes, points, masks and text as prompts as the paper mentions, no?

@nikolausWest

following

1 similar comment
@yash0307

following

@9p15p

9p15p commented Apr 11, 2023

Following

1 similar comment
@fyuf

fyuf commented Apr 11, 2023

Following

@nikhilaravi added the enhancement (New feature or request) label on Apr 12, 2023
@Zhangwenyao1

following

@Eli-YiLi

Our work can achieve text-to-mask with SAM:
https://github.com/xmed-lab/CLIP_Surgery

This is our work on CLIP's explainability. It is able to guide SAM to achieve text-to-mask without manual points.

Besides, it is very simple, requiring no fine-tuning and using only the CLIP model itself.

Furthermore, it enhances many open-vocabulary tasks, such as segmentation, multi-label classification, and multimodal visualization.

This is the Jupyter demo:
https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb

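For anyone who wants to prototype the same idea without the repo, here is a rough sketch: take a text-image similarity heatmap (CLIP Surgery derives one from CLIP itself; the helper below is just a placeholder), pick the hottest locations as point prompts, and hand them to SAM. Only the segment_anything calls are real API; the helper, image, and checkpoint path are assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def text_similarity_heatmap(image_rgb: np.ndarray, text: str) -> np.ndarray:
    # Placeholder: CLIP Surgery computes an HxW text-image similarity map from CLIP;
    # substitute that (or any dense text-image similarity method) here.
    h, w, _ = image_rgb.shape
    return np.random.rand(h, w)

image_rgb = np.zeros((512, 512, 3), dtype=np.uint8)   # replace with a real RGB image
heatmap = text_similarity_heatmap(image_rgb, "a cat")

# Pick the top-k hottest locations as foreground point prompts for SAM.
k = 5
idx = np.argpartition(heatmap.ravel(), -k)[-k:]
ys, xs = np.unravel_index(idx, heatmap.shape)
point_coords = np.stack([xs, ys], axis=1).astype(np.float32)  # SAM expects (x, y)
point_labels = np.ones(len(point_coords), dtype=int)          # 1 = foreground

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # local checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)
masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=False)
```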

@zaojiahua

following

@FrancisDacian

@ignoHH

ignoHH commented Apr 21, 2023

following

2 similar comments
@bjccdsrlcr

following

@mydcxiao

following

@xuxiaoxxxx

+1

@daminnock

following

2 similar comments
@Alice1820

following

@freshman97

following

@zhangjingxian1998

waiting for it

@N-one

N-one commented Sep 15, 2023

following

1 similar comment
@moktsuiqin

following
