Add an example of few-shot memory bank model with MultiModal Feature Extraction #2822
Conversation
Thanks for the PR, performance looks good.
@Linuxdex, let's add a more detailed description of the memory cache structure.
@@ -0,0 +1,65 @@
# Use MultiModal Feature Extraction to Create a Few-shot Cache Adapter Model
I also feel that cache adapter is not a good name.
LGTM. We can revise the textual descriptions in another PR.
The memory bank follows the design of [Tip-Adapter](https://arxiv.org/pdf/2207.09519.pdf), which stores the image features of the few-shot training set and uses feature similarity to improve the performance of zero-shot CLIP. The stored features can also serve as the initialization of a trainable classifier. This ProtoNet-like design makes full use of the few-shot training information and leads to good performance [3]. We believe that the effectiveness of this design is not limited to CLIP; it can be applied broadly to few-shot classification tasks on images and texts.
The memory bank, a derivative application of Tip-Adapter, obtains diversified multi-modal features through MultiModal feature extraction. In this example, we first train a linear classifier on the multi-modal features to establish a baseline accuracy. Then, the similarity between the test features and the memory bank is combined with the baseline predicted probabilities. Finally, an additional linear adapter, initialized from the memory bank, is trained to assist few-shot classification.
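The combination this paragraph describes can be sketched in the Tip-Adapter style: turn the cosine similarity between test features and the stored keys into sharp non-negative weights, aggregate the stored one-hot labels, and blend the result into the baseline logits. A minimal NumPy sketch; the function and argument names are illustrative, not the PR's actual API:

```python
import numpy as np

def memory_bank_logits(test_feat, cache_keys, cache_values, base_logits, alpha, beta):
    """Blend memory-bank similarity logits into baseline logits (Tip-Adapter style).

    test_feat:    (N, D) L2-normalized test features
    cache_keys:   (D, M) L2-normalized stored few-shot features (memory bank keys)
    cache_values: (M, C) one-hot labels of the stored features
    base_logits:  (N, C) logits from the baseline classifier
    """
    # Cosine similarity between test features and stored keys: (N, M)
    affinity = test_feat @ cache_keys
    # Exponential activation sharpens the similarity into aggregation weights
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values  # (N, C)
    # alpha controls how much the memory bank contributes
    return base_logits + alpha * cache_logits
```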
We need to rephrase this paragraph with proper English.
The hyper-parameters `alpha` and `beta`, which weight the memory bank's contribution, are tuned via grid search on the validation set to attain the best performance.
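The grid search over `alpha` and `beta` can be sketched as follows: compute the validation-set affinity to the memory bank once, then pick the pair that maximizes validation accuracy. A hedged sketch assuming the same shapes as above; names are illustrative:

```python
import numpy as np
from itertools import product

def search_alpha_beta(val_feat, val_labels, cache_keys, cache_values,
                      base_logits, alphas, betas):
    """Return the (alpha, beta) pair with the best validation accuracy."""
    # The affinity does not depend on alpha/beta, so compute it once.
    affinity = val_feat @ cache_keys  # (N, M)
    best_alpha, best_beta, best_acc = None, None, -1.0
    for alpha, beta in product(alphas, betas):
        logits = base_logits + alpha * np.exp(-beta * (1.0 - affinity)) @ cache_values
        acc = float((logits.argmax(axis=1) == val_labels).mean())
        if acc > best_acc:
            best_alpha, best_beta, best_acc = alpha, beta, acc
    return best_alpha, best_beta, best_acc
```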
Same here. Need to revise the paragraph.
return features


def generate_clip_weights(args, classnames, template, predictor):
We can consider just introducing a variable called semantic_label_embedding and comparing the similarity between semantic_label_embedding and the embeddings of the labels stored in the memory bank.
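The reviewer's suggestion could look roughly like this: embed the class names once, then score them against the label embeddings held in the memory bank via cosine similarity. A hypothetical sketch; `semantic_label_embedding` and the stored-embedding matrix are assumed shapes, not the PR's actual code:

```python
import numpy as np

def label_similarity(semantic_label_embedding, stored_label_embeddings):
    """Cosine similarity between class-name embeddings and stored label embeddings.

    semantic_label_embedding: (num_classes, D) embeddings of the class names
    stored_label_embeddings:  (num_stored, D) label embeddings in the memory bank
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    a = semantic_label_embedding / np.linalg.norm(
        semantic_label_embedding, axis=-1, keepdims=True)
    b = stored_label_embeddings / np.linalg.norm(
        stored_label_embeddings, axis=-1, keepdims=True)
    return a @ b.T  # (num_classes, num_stored)
```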
This example provides a simple and clear way of implementing a memory-bank-powered few-shot learning model with AutoGluon MultiModal, following Tip-Adapter. The idea is to store `<feature, label>` pairs from the training data in a key-value memory bank. In the prediction phase, we compare the similarity between the test image features and the memory bank keys, and aggregate the prediction logits. The logits obtained via feature similarity are combined with the logits of a classification model that directly predicts the label from the features. Experiments show that adding a memory bank improves the performance of image, text, and image-text classification in the few-shot learning scenario.
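The key-value store described above can be sketched as a small container: keys are L2-normalized training features, values are one-hot labels, and prediction aggregates the values by similarity to the query. A minimal illustrative sketch; the class and method names are assumptions, not the example's actual API:

```python
import numpy as np

class MemoryBank:
    """Key-value store of <feature, label> pairs from the few-shot training set."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self._keys, self._labels = [], []

    def add(self, feature, label):
        # Key: L2-normalized feature. Value: the class id (one-hot at query time).
        self._keys.append(feature / np.linalg.norm(feature))
        self._labels.append(label)

    def logits(self, query, beta=5.5):
        keys = np.stack(self._keys)                       # (M, D)
        values = np.eye(self.num_classes)[self._labels]   # (M, C) one-hot
        q = query / np.linalg.norm(query)
        affinity = keys @ q                               # (M,) cosine similarity
        # Similar stored features vote for their labels with higher weight.
        return np.exp(-beta * (1.0 - affinity)) @ values  # (C,)
```

These similarity logits would then be added to a baseline classifier's logits, weighted by `alpha`, as in the example's description.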