Hello, do you think the linear layer used as the network's middle layer learns the text-language alignment ability from scratch? Also, have you tried training the framework when it is initialized from another well-pretrained feature extractor (one that has not been trained with CLIP)?
I cannot say with certainty how the intermediate layers behave, but I can confirm that fine-tuning is indeed very useful. You can therefore also try other feature extractors, such as DINO.