
RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter #10

Closed
karndeepsingh opened this issue Jun 30, 2022 · 9 comments
Labels: help wanted, question

Comments

karndeepsingh (Author) commented Jun 30, 2022

I am facing the warmup_ratio and warmup_steps error even though I have specified both in the CLI parameters.

!python run_hybrid_clip.py \
    --output_dir ${MODEL_DIR} \
    --text_model_name_or_path="bertin-project/bertin-roberta-base-spanish" \
    --vision_model_name_or_path="openai/clip-vit-base-patch32" \
    --tokenizer_name="bertin-project/bertin-roberta-base-spanish" \
    --train_file="/content/drive/MyDrive/train_dataset.json" \
    --validation_file="/content/drive/MyDrive/valid_dataset.json" \
    --do_train --do_eval \
    --num_train_epochs="40" --max_seq_length 96 \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" \
    --warmup_steps "0" \
    --warmup_ratio 0.0 \
    --weight_decay 0.1 \
    --overwrite_output_dir \
    --preprocessing_num_workers 32 \

loading weights file https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
Loading PyTorch weights from /root/.cache/huggingface/transformers/8a82711445c5200c2b4fd30739df371f5b3ce2d7e316418d58dd290bae1c1cc8.dabcc684421296ebcdafd583a4415c1757ae007787f2d0e17b87482d9b8cf197
PyTorch checkpoint contains 151,277,440 parameters.
Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing FlaxCLIPModel: {('text_model', 'embeddings', 'position_ids'), ('vision_model', 'embeddings', 'position_ids')}
- This IS expected if you are initializing FlaxCLIPModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FlaxCLIPModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of FlaxCLIPModel were initialized from the model checkpoint at openai/clip-vit-base-patch32.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FlaxCLIPModel for predictions without further training.
text_config_dict is None. Initializing the CLIPTextConfig with default values.
vision_config_dict is None. initializing the CLIPVisionConfig with default values.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:490: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
2022-06-30 11:27:58.559519: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Traceback (most recent call last):
  File "run_hybrid_clip.py", line 975, in <module>
    main()
  File "run_hybrid_clip.py", line 716, in main
    "You have to specify either the warmup_steps or warmup_ratio CLI parameter"
RuntimeError: You have to specify either the warmup_steps or warmup_ratio CLI parameter
vinid (Contributor) commented Jul 1, 2022

Can you try setting the warmup ratio to something other than 0?

This line might fail because of that.
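
For context, the script presumably treats 0 as "not set", so passing both `--warmup_steps 0` and `--warmup_ratio 0.0` still trips the check. The guard probably looks something like this (a sketch only; `training_args` and `total_train_steps` are placeholder names, and the exact code in run_hybrid_clip.py may differ):

```python
# Sketch of the kind of guard that raises the error above; with both
# values left at 0, neither branch is taken and the RuntimeError fires.
if training_args.warmup_steps > 0:
    warmup_steps = training_args.warmup_steps
elif training_args.warmup_ratio > 0:
    warmup_steps = int(training_args.warmup_ratio * total_train_steps)
else:
    raise RuntimeError(
        "You have to specify either the warmup_steps or warmup_ratio CLI parameter"
    )
```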

karndeepsingh (Author) commented Jul 1, 2022

> Can you try setting the warmup ratio to something other than 0?
>
> This line might fail because of that.

Thanks @vinid for replying.
I would also like to know how I can demonstrate, in terms of a metric, the correctness of the embeddings the model has learned.

Can you suggest some metrics that measure the correctness or closeness of the learned embeddings, so they give a meaningful picture of model performance in terms of the embeddings it produces?

Thanks again.

vinid (Contributor) commented Jul 1, 2022

I think image retrieval is a nice task to evaluate the quality of the embeddings. See here: https://arxiv.org/abs/2204.03972
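
As a rough illustration of how such an evaluation can be scored (a generic sketch, not code from the paper), recall@k over held-out image–caption pairs can be computed directly from the two sets of embeddings:

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 5) -> float:
    """text_emb, image_emb: (N, d) L2-normalized embeddings, where row i of
    each array comes from the same held-out image-caption pair."""
    sims = text_emb @ image_emb.T                  # (N, N) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]       # k closest images per caption
    correct = np.arange(len(text_emb))[:, None]    # index of the matching image
    return float((top_k == correct).any(axis=1).mean())
```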

karndeepsingh (Author) commented Jul 1, 2022

> I think image retrieval is a nice task to evaluate the quality of the embeddings. See here: https://arxiv.org/abs/2204.03972

Thanks again @vinid for your valuable reply.
I have trained a CLIP model on images of different products (clothing, electronics, etc.) paired with their descriptions, which are in Spanish. I used a ResNet50 for image encoding and the Spanish BERTIN model for text encoding, and trained the model with a contrastive loss. I got good embeddings, but I want to know how to measure their quality and present that to the business.

I would also appreciate your feedback on the approach I am taking to embed products using image and text; if you can suggest better ways to do it, that would be great.

I will also go through the paper, thanks for sharing.
Waiting for your reply.
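
For reference, the symmetric contrastive (InfoNCE) objective that CLIP-style dual encoders are typically trained with looks roughly like this (a generic PyTorch sketch, not my exact training code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic CLIP-style symmetric InfoNCE loss over a batch of aligned
    (image, text) pairs; image_emb and text_emb are (B, d) projections."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2
```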

g8a9 (Contributor) commented Jul 4, 2022

Hi, I can chime in here :)

You are following a standard approach to training dual-encoder architectures (like CLIP) with a contrastive loss, so I think you are good to go. Other things you might consider:

  • use a ViT instead of a ResNet as the image encoder (recent architectures, including CLIP and CLIP Italian, achieved their best performance with vision transformers);
  • run a few initial training steps with the encoders fixed, learning only the weights of the projection layers (see backbone freezing; a sketch follows below);
  • I assume you are starting from pre-trained ResNet50 and BERTIN checkpoints; if that is not the case, you could try that.
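
For the backbone-freezing point, here is a rough PyTorch illustration; the DualEncoder class and its attribute names are placeholders, not code from any specific repo:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy stand-in for an image/text dual encoder with projection heads."""
    def __init__(self, dim=768, proj_dim=512):
        super().__init__()
        self.image_encoder = nn.Linear(dim, dim)   # placeholder for ResNet/ViT
        self.text_encoder = nn.Linear(dim, dim)    # placeholder for BERTIN
        self.image_proj = nn.Linear(dim, proj_dim)
        self.text_proj = nn.Linear(dim, proj_dim)

model = DualEncoder()

# Phase 1: freeze both encoders, train only the projection heads.
for p in model.image_encoder.parameters():
    p.requires_grad = False
for p in model.text_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)

# ... run a few warmup epochs over the projection layers ...

# Phase 2: unfreeze everything and fine-tune end to end.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```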

g8a9 added the help wanted and question labels Jul 4, 2022
karndeepsingh (Author) commented Jul 4, 2022

> Hi, I can chime in here :)
>
> You are following a standard approach to training dual-encoder architectures (like CLIP) with a contrastive loss, so I think you are good to go. Other things you might consider:
>
> • use a ViT instead of a ResNet as the image encoder (recent architectures, including CLIP and CLIP Italian, achieved their best performance with vision transformers);
> • run a few initial training steps with the encoders fixed, learning only the weights of the projection layers (see backbone freezing);
> • I assume you are starting from pre-trained ResNet50 and BERTIN checkpoints; if that is not the case, you could try that.

@g8a9
Thanks for replying.
I had one more question: if I train with a ViT, what would the dimension of the last hidden layer fed into the projection layers be?

Also, I think training with a ViT would need a larger number of images to progress well, right?

g8a9 (Contributor) commented Jul 4, 2022

Using the first token (the [CLS]) from the output of the vision encoder is a common choice with vision transformers. You can find traces of that in both the official CLIP implementation (https://github.com/openai/CLIP/blob/main/clip/model.py#L236) and the one we used from HF (the pooler_output here: https://huggingface.co/docs/transformers/main/model_doc/clip#transformers.FlaxCLIPVisionModel).

Since the two nets (ResNet50 and ViT) are easy to interchange if you start from pre-trained checkpoints, I would still test performance with both, regardless of the number of samples you have.
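
As a concrete sketch with the HF Flax vision encoder (for openai/clip-vit-base-patch32 the pooled output is 768-dimensional; the image path below is a placeholder):

```python
from PIL import Image
from transformers import CLIPProcessor, FlaxCLIPVisionModel

model = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_product.jpg")           # placeholder image path
inputs = processor(images=image, return_tensors="np")

outputs = model(pixel_values=inputs["pixel_values"])
pooled = outputs.pooler_output                   # shape (1, 768): the [CLS] token after layernorm
# This 768-d vector is what the projection layer maps into the shared
# image-text space, e.g. flax.linen's nn.Dense(features=projection_dim)(pooled).
```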

karndeepsingh (Author) commented Jul 4, 2022

> Using the first token (the [CLS]) from the output of the vision encoder is a common choice with vision transformers. You can find traces of that in both the official CLIP implementation (https://github.com/openai/CLIP/blob/main/clip/model.py#L236) and the one we used from HF (the pooler_output here: https://huggingface.co/docs/transformers/main/model_doc/clip#transformers.FlaxCLIPVisionModel).
>
> Since the two nets (ResNet50 and ViT) are easy to interchange if you start from pre-trained checkpoints, I would still test performance with both, regardless of the number of samples you have.

@g8a9 Thanks for answering. I tried training with a ViT; it started to overfit, but I think if I add more data it should converge properly.
One more thing I want to understand: once I have this model trained on my dataset, how can I use the embeddings for classification tasks on certain categories? Adding to this: when I say embeddings, will the model return embeddings for both image and text, or just images? And how can I use these embeddings for a further classification task on certain categories?
If you can guide me on this point as well, that would be great.

Thanks again for your help.

g8a9 (Contributor) commented Jul 20, 2022

Hi, sorry for the delay.

The original CLIP paper shows how to use prompt engineering for zero-shot classification. If you have a set of categories (e.g., the classification labels in ImageNet), you can measure similarities between your image and a set of sentences like "A picture of a {category}". It is more complicated than this, but you can find better references in their paper.

> Adding to this: when I say embeddings, will the model return embeddings for both image and text, or just images?

Generally, yes, your model will be able to produce both text and image embeddings, as per the CLIP architecture.
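
As a sketch of the zero-shot idea with the stock HF CLIP checkpoint (you would swap in your own fine-tuned model and Spanish prompts; the labels and image path below are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["shirt", "laptop", "shoes"]                      # placeholder product categories
prompts = [f"a picture of a {label}" for label in labels]
image = Image.open("some_product.jpg")                     # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)            # (1, num_labels)
print(labels[probs.argmax().item()])                       # predicted category
```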

vinid closed this as completed Jul 31, 2022