Text as prompts #93
Comments
Following! Text prompting has been mentioned in the research paper but hasn't been released yet. Really looking forward to this feature because I need it for a specific use case. |
Exactly, wait for it to be released |
Thank you for your exciting work! I also want to use text as a prompt to generate masks in my project. I am currently using CLIPSeg to generate the mask, but it does not perform well on fine-grained semantics. When do you plan to open-source the text-as-prompt code? What is the approximate timeline? Waiting for this amazing work. |
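For reference, a minimal sketch of the CLIPSeg route mentioned above, assuming the Hugging Face `transformers` checkpoint `CIDAS/clipseg-rd64-refined` (the commenter's exact setup may differ; the image path and prompt are placeholders):

```python
# Minimal CLIPSeg text-to-mask sketch using the Hugging Face checkpoint.
# Assumes: pip install transformers torch pillow; "image.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("image.jpg")
inputs = processor(text=["a cat"], images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# CLIPSeg predicts a low-resolution (352x352) logit map per prompt;
# threshold it to get a coarse binary mask for the text query.
mask = torch.sigmoid(outputs.logits).squeeze() > 0.5
```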
following |
The paper mentions that they used CLIP to handle text prompts:
The demo does not appear to accept text inputs, though. |
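For context, the paper describes encoding free-form text with an off-the-shelf CLIP text encoder and feeding the embedding to SAM's mask decoder as a prompt. Only the CLIP side can be sketched against public code; the hand-off to SAM is marked hypothetical below, because the released `segment_anything` package exposes no text path:

```python
# Sketch of the CLIP side of text prompting as described in the SAM paper.
# Assumes: pip install git+https://github.com/openai/CLIP.git
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)  # a CLIP text encoder

tokens = clip.tokenize(["a photo of a cat"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(tokens)  # shape (1, 768)

# Hypothetical: per the paper, this embedding would be passed to SAM's mask
# decoder as a sparse prompt token alongside point/box embeddings. The public
# SamPredictor API currently exposes no such argument.
```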
@peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang |
Yes, we could simply combine the two, but if SAM can do it better, why would we need two models? And if we just feed Grounding DINO's output into SAM, we can't tell whether Grounding DINO becomes the bottleneck. |
Why not use the output of SAM as the bounding box? |
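As a quick illustration of this suggestion, a tight box can be read directly off a SAM binary mask (a minimal numpy sketch, not part of the SAM codebase):

```python
import numpy as np

def mask_to_xyxy(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Tight bounding box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```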
The current version of SAM, without the CLIP text encoder, only produces instances from point or bounding-box prompts. SAM's output is therefore instances without any semantic information about the segmentation. With a text encoder, you can correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector, you can bridge this gap for the time being, so SAM's output is no longer generic instance segmentation; that is what Grounded-Segment-Anything helps to do. |
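A hedged sketch of the text-to-box-to-mask bridge described above, using Grounding DINO for text-conditioned boxes and SAM for the masks; the checkpoint paths, caption, and thresholds are placeholders:

```python
# Text -> box -> mask: Grounding DINO proposes boxes from a caption,
# SAM segments inside them. Paths, caption, and thresholds are placeholders.
import torch
from groundingdino.util import box_ops
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import SamPredictor, sam_model_registry

dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("image.jpg")  # (HxWx3 RGB array, model tensor)
boxes, logits, phrases = predict(
    model=dino,
    image=image,
    caption="a dog",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy for SAM.
# Assumes at least one box was detected.
h, w, _ = image_source.shape
xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

predictor.set_image(image_source)
masks, scores, _ = predictor.predict(box=xyxy[0].numpy(), multimask_output=False)
```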
Put together a demo of grounded-segment-anything with Gradio for better testing. |
If SAM just segments within a bounding box, I think many other methods can be used for this as well, e.g. BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg |
It should be able to support boxes, points, masks, and text as prompts, as the paper mentions, no? |
following |
Following |
following |
Our work can achieve text-to-mask with SAM. It is about CLIP's explainability and can guide SAM to produce a mask from text, without manual points. Besides, it is very simple: no fine-tuning at all, using only the CLIP model itself. Furthermore, it enhances many open-vocabulary tasks, such as segmentation, multi-label classification, and multimodal visualization. This is the Jupyter demo: |
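The specific explainability method isn't detailed in this thread, so the following is only a generic sketch of the "text to point to mask" idea: take the peak of some text-image relevance heatmap as a positive point prompt for SAM. `text_relevance_heatmap` is a hypothetical stand-in for whatever method produces the map:

```python
# Generic "text -> point -> mask" sketch: the argmax of a text-image relevance
# heatmap (e.g. from a CLIP explainability method) becomes a foreground point
# prompt for SAM. `text_relevance_heatmap` is hypothetical, not a real API.
import numpy as np
from segment_anything import SamPredictor

def text_relevance_heatmap(image: np.ndarray, text: str) -> np.ndarray:
    """Hypothetical: returns an HxW relevance map for `text` over `image`."""
    raise NotImplementedError

def text_to_mask(predictor: SamPredictor, image: np.ndarray, text: str):
    heatmap = text_relevance_heatmap(image, text)
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),  # SAM expects (x, y) order
        point_labels=np.array([1]),       # 1 = foreground point
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep the highest-scoring mask
```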
following |
You can try the result using this explorer extension. |
following |
+1 |
following |
waiting for it |
following |
Thanks for releasing this wonderful work!
I saw that the demo shows examples of using points and boxes as input prompts. Does the demo also support text as a prompt?