
RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter #10

Closed
karndeepsingh opened this issue Jun 30, 2022 · 9 comments
Labels: help wanted, question

Comments

karndeepsingh (Author) commented Jun 30, 2022

I am facing the warmup_ratio and warmup_steps error even though I have specified both in the CLI parameters.

!python run_hybrid_clip.py \
    --output_dir ${MODEL_DIR} \
    --text_model_name_or_path="bertin-project/bertin-roberta-base-spanish" \
    --vision_model_name_or_path="openai/clip-vit-base-patch32" \
    --tokenizer_name="bertin-project/bertin-roberta-base-spanish" \
    --train_file="/content/drive/MyDrive/train_dataset.json" \
    --validation_file="/content/drive/MyDrive/valid_dataset.json" \
    --do_train --do_eval \
    --num_train_epochs="40" --max_seq_length 96 \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" \
    --warmup_steps "0" \
    --warmup_ratio 0.0 \
    --weight_decay 0.1 \
    --overwrite_output_dir \
    --preprocessing_num_workers 32 \

loading weights file https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
Loading PyTorch weights from /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
PyTorch checkpoint contains 151,277,440 parameters.
Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing FlaxCLIPModel: {('text_model', 'embeddings', 'position_ids'), ('vision_model', 'embeddings', 'position_ids')}
- This IS expected if you are initializing FlaxCLIPModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FlaxCLIPModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of FlaxCLIPModel were initialized from the model checkpoint at openai/clip-vit-base-patch32.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FlaxCLIPModel for predictions without further training.
text_config_dict is None. Initializing the CLIPTextConfig with default values.
vision_config_dict is None. initializing the CLIPVisionConfig with default values.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:490: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
2022-06-30 11:27:58.559519: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Traceback (most recent call last):
  File "run_hybrid_clip.py", line 975, in <module>
    main()
  File "run_hybrid_clip.py", line 716, in main
    "You have to specify either the warmup_steps or warmup_ratio CLI parameter"
RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter
vinid (Contributor) commented Jul 1, 2022

Can you try setting the warmup ratio to something other than 0?

This line might fail because of that.
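
For context, the script presumably treats 0 as "not set", so passing both `--warmup_steps 0` and `--warmup_ratio 0.0` still trips the check. The guard probably looks something like this (a sketch only; `training_args` and `total_train_steps` are placeholder names, and the exact code in run_hybrid_clip.py may differ):

```python
# Sketch of the kind of guard that raises the error above; with both
# values left at 0, neither branch is taken and the RuntimeError fires.
if training_args.warmup_steps > 0:
    warmup_steps = training_args.warmup_steps
elif training_args.warmup_ratio > 0:
    warmup_steps = int(training_args.warmup_ratio * total_train_steps)
else:
    raise RuntimeError(
        "You have to specify either the warmup_steps or warmup_ratio CLI parameter"
    )
```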

karndeepsingh (Author) commented Jul 1, 2022

> Can you try setting the warmup ratio to something other than 0?
>
> This line might fail because of that.

Thanks @vinid for replying.
I would also like to know how I can demonstrate, in terms of a metric, the correctness of the embeddings the model has learned.

Can you suggest some metrics that measure the correctness or closeness of the learned embeddings, so they give a meaningful picture of model performance in terms of the embeddings it produces?

Thanks again.

vinid (Contributor) commented Jul 1, 2022

I think image retrieval is a nice task to evaluate the quality of the embeddings. See here: https://arxiv.org/abs/2204.03972
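
As a rough illustration of how such an evaluation can be scored (a generic sketch, not code from the paper), recall@k over held-out image–caption pairs can be computed directly from the two sets of embeddings:

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 5) -> float:
    """text_emb, image_emb: (N, d) L2-normalized embeddings, where row i of
    each array comes from the same held-out image-caption pair."""
    sims = text_emb @ image_emb.T                  # (N, N) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]       # k closest images per caption
    correct = np.arange(len(text_emb))[:, None]    # index of the matching image
    return float((top_k == correct).any(axis=1).mean())
```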

karndeepsingh (Author) commented Jul 1, 2022

> I think image retrieval is a nice task to evaluate the quality of the embeddings. See here: https://arxiv.org/abs/2204.03972

Thanks again @vinid for your valuable reply.
I have trained a CLIP model on images of different products (clothing, electronics, etc.) paired with their descriptions, which are in Spanish. I used a ResNet50 for image encoding and the Spanish BERTIN model for text encoding, and trained the model with a contrastive loss. I got good embeddings, but I want to know how to measure their quality and present that to the business.

I would also appreciate your feedback on the approach I am taking to embed products using image and text; if you can suggest better ways to do it, that would be great.

I will also go through the paper, thanks for sharing.
Waiting for your reply.
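
For reference, the symmetric contrastive (InfoNCE) objective that CLIP-style dual encoders are typically trained with looks roughly like this (a generic PyTorch sketch, not my exact training code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic CLIP-style symmetric InfoNCE loss over a batch of aligned
    (image, text) pairs; image_emb and text_emb are (B, d) projections."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2
```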

g8a9 (Contributor) commented Jul 4, 2022

Hi, I can chime in here :)

You are following a standard approach to training dual-encoder architectures (like CLIP) with a contrastive loss, so I think you are good to go. Other things you might consider:

  • use a ViT instead of a ResNet as the image encoder (recent architectures, including CLIP and CLIP Italian, achieved their best performance with vision transformers);
  • run a few initial training steps with the encoders fixed, learning only the weights of the projection layers (see backbone freezing; a sketch follows below);
  • I assume you are starting from pre-trained ResNet50 and BERTIN checkpoints; if that is not the case, you could try that.
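
For the backbone-freezing point, here is a rough PyTorch illustration; the DualEncoder class and its attribute names are placeholders, not code from any specific repo:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy stand-in for an image/text dual encoder with projection heads."""
    def __init__(self, dim=768, proj_dim=512):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)   # placeholder for ResNet/ViT
        self.text_encoder = nn.Linear(dim, dim)    # placeholder for BERTIN
        self.image_proj = nn.Linear(dim, proj_dim)
        self.text_proj = nn.Linear(dim, proj_dim)

model = DualEncoder()

# Phase 1: freeze both encoders, train only the projection heads.
for p in model.image_encoder.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)

# ... run a few warmup epochs over the projection layers ...

# Phase 2: unfreeze everything and fine-tune end to end.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```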

g8a9 added the help wanted and question labels Jul 4, 2022
karndeepsingh (Author) commented Jul 4, 2022

> Hi, I can chime in here :)
>
> You are following a standard approach to training dual-encoder architectures (like CLIP) with a contrastive loss, so I think you are good to go. Other things you might consider:
>
> • use a ViT instead of a ResNet as the image encoder (recent architectures, including CLIP and CLIP Italian, achieved their best performance with vision transformers);
> • run a few initial training steps with the encoders fixed, learning only the weights of the projection layers (see backbone freezing);
> • I assume you are starting from pre-trained ResNet50 and BERTIN checkpoints; if that is not the case, you could try that.

@g8a9
Thanks for replying.
I had one more question: if I train with a ViT, what would the dimension of the last hidden layer fed into the projection layers be?

Also, I think training with a ViT would need a larger number of images to progress well, right?

g8a9 (Contributor) commented Jul 4, 2022

Using the first token (the [CLS]) from the output of the vision encoder is a common choice with vision transformers. You can find traces of that in both the official CLIP implementation (https://github.com/openai/CLIP/blob/main/clip/model.py#L236) and the one we used from HF (the pooler_output here: https://huggingface.co/docs/transformers/main/model_doc/clip#transformers.FlaxCLIPVisionModel).

Since the two nets (ResNet50 and ViT) are easy to interchange if you start from pre-trained checkpoints, I would still test performance with both, regardless of the number of samples you have.
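
As a concrete sketch with the HF Flax vision encoder (for openai/clip-vit-base-patch32 the pooled output is 768-dimensional; the image path below is a placeholder):

```python
from PIL import Image
from transformers import CLIPProcessor, FlaxCLIPVisionModel

model = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_product.jpg")           # placeholder image path
inputs = processor(images=image, return_tensors="np")

outputs = model(pixel_values=inputs["pixel_values"])
pooled = outputs.pooler_output                   # shape (1, 768): the [CLS] token after layernorm
# This 768-d vector is what the projection layer maps into the shared
# image-text space, e.g. flax.linen's nn.Dense(features=projection_dim)(pooled).
```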

karndeepsingh (Author) commented Jul 4, 2022

> Using the first token (the [CLS]) from the output of the vision encoder is a common choice with vision transformers. You can find traces of that in both the official CLIP implementation (https://github.com/openai/CLIP/blob/main/clip/model.py#L236) and the one we used from HF (the pooler_output here: https://huggingface.co/docs/transformers/main/model_doc/clip#transformers.FlaxCLIPVisionModel).
>
> Since the two nets (ResNet50 and ViT) are easy to interchange if you start from pre-trained checkpoints, I would still test performance with both, regardless of the number of samples you have.

@g8a9 Thanks for answering. I tried training with a ViT; it started to overfit, but I think if I add more data it should converge properly.
One more thing I want to understand: once I have this model trained on my dataset, how can I use the embeddings for classification tasks on certain categories? Adding to this: when I say embeddings, will the model return embeddings for both image and text, or just images? And how can I use these embeddings for a further classification task on certain categories?
If you can guide me on this point as well, that would be great.

Thanks again for your help.

g8a9 (Contributor) commented Jul 20, 2022

Hi, sorry for the delay.

The original CLIP paper shows how to use prompt engineering for zero-shot classification. If you have a set of categories (e.g., the classification labels in ImageNet), you can measure similarities between your image and a set of sentences like "A picture of a {category}". It is more complicated than this, but you can find better references in their paper.

> Adding to this: when I say embeddings, will the model return embeddings for both image and text, or just images?

Generally, yes, your model will be able to produce both text and image embeddings, as per the CLIP architecture.
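
As a sketch of the zero-shot idea with the stock HF CLIP checkpoint (you would swap in your own fine-tuned model and Spanish prompts; the labels and image path below are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["shirt", "laptop", "shoes"]                      # placeholder product categories
prompts = [f"a picture of a {label}" for label in labels]
image = Image.open("some_product.jpg")                     # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)            # (1, num_labels)
print(labels[probs.argmax().item()])                       # predicted category
```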

vinid closed this as completed Jul 31, 2022