RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter #10

I am facing the warmup_ratio/warmup_steps error even though I have passed it as a CLI parameter.

Comments
Can you try setting the warmup ratio to something different from 0? This line might fail because of that.
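For context, a minimal sketch of what a non-zero warmup looks like, assuming the script forwards these flags into HuggingFace `TrainingArguments` (how this repo actually parses them is an assumption):

```python
# A minimal sketch, assuming the training script builds HuggingFace
# TrainingArguments from the CLI flags; a warmup_ratio of 0 may be treated
# as "not set" and trigger the RuntimeError above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    warmup_ratio=0.1,    # non-zero warmup, as suggested above
    # warmup_steps=500,  # alternatively, an explicit step count
)
print(args.warmup_ratio)
```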
Thanks @vinid for replying. Can you suggest some metrics that measure the correctness or closeness of the embeddings learned by the model, so that they give a meaningful picture of model performance on the embedding task? Thanks again.
I think image retrieval is a nice task to evaluate the quality of the embeddings. See here: https://arxiv.org/abs/2204.03972
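As an illustration of that kind of evaluation, here is a hedged sketch of recall@k for text-to-image retrieval; the embedding matrices and their row-wise pairing are assumptions about your setup:

```python
# A sketch of image-retrieval evaluation with recall@k, assuming you already
# have L2-normalized image and text embeddings as NumPy arrays with matching
# rows (row i of each matrix describes the same product).
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 5) -> float:
    # Cosine similarity reduces to a dot product for normalized embeddings.
    sims = text_emb @ image_emb.T                     # (N, N) similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]          # indices of the k best images per text
    hits = (top_k == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())                         # fraction of queries whose match is in the top k
```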
Thanks again @vinid for your valuable reply. I just need your feedback on the approach I am taking to embed products using both image and text, and any suggestions for better ways to do it. Also, I will go through the paper, thanks for sharing.
Hi, I can chime in here :) You are following a standard approach to training dual-encoder architectures (like CLIP) with contrastive loss, so I think you are good to go (see the loss sketch after this comment). Another thing you might consider:

- Using the first token (the [CLS] token) of the text encoder as the sentence-level embedding.
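To make the dual-encoder setup concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in PyTorch; the temperature value and embedding shapes are assumptions, not this repo's actual implementation:

```python
# A minimal sketch of a CLIP-style symmetric contrastive loss, assuming
# `img` and `txt` are (batch, dim) embeddings from the two encoders and
# that row i of each tensor comes from the same image-text pair.
import torch
import torch.nn.functional as F

def clip_loss(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature                 # pairwise similarities
    targets = torch.arange(len(img), device=img.device)  # matching pairs sit on the diagonal
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```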
@g8a9 Also, I think training with ViT would need a large number of images to train well, right?
Since the two nets (ResNet50 and ViT) are easy to interchange if you start from pre-trained checkpoints, I would still test performance with both, regardless of the number of samples you have.
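For illustration, a hedged sketch of how the two backbones can be swapped behind a common interface using `timm`; the model names are standard `timm` identifiers, not necessarily the ones this repo uses:

```python
# A sketch of swapping the vision backbone between pre-trained checkpoints.
import timm

def build_image_encoder(backbone: str = "resnet50"):
    # num_classes=0 drops the classification head and returns pooled features,
    # so ResNet50 and ViT expose the same (batch, feature_dim) interface.
    return timm.create_model(backbone, pretrained=True, num_classes=0)

resnet_encoder = build_image_encoder("resnet50")
vit_encoder = build_image_encoder("vit_base_patch16_224")
```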
@g8a9 Thanks for answering. I tried training with ViT; it started to overfit, but I think if I add more data it should converge properly. Thanks again for your help.
Hi, sorry for the delay. The original CLIP paper shows how to use prompt engineering for zero-shot classification. If you have a set of categories (e.g., the classification labels in ImageNet), you can measure similarities between your image and a set of sentences like "A picture of a {category}". It is more complicated than this, but you can find better references in their paper.
Generally, yes, your model will be able to produce both text and image embeddings, as per the CLIP architecture.
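As a concrete illustration of that prompt-based zero-shot recipe, here is a hedged sketch using the public HuggingFace CLIP checkpoint as a stand-in for your own model; the image path and label set are made up:

```python
# A sketch of zero-shot classification in the spirit of the CLIP paper:
# score an image against one prompt per candidate category.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "car"]                       # illustrative category set
prompts = [f"A picture of a {label}" for label in labels]

image = Image.open("example.jpg")                    # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image            # image-text similarity scores
print(labels[logits.argmax().item()])                # label of the most similar prompt
```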