How can I extract features from a vision transformer? #72
I need to extract features from a vision transformer. How can I start?
Start with one of these:
Hi @mathshangw, you can take a look at Lines 94 to 138 in ba9edd1. Let me know if you have any questions.
Also:
@mathshangw, have you been able to extract features as you wanted? If yes, I will let you close the issue. Otherwise, feel free to ask more questions and I will try to help.
Sorry for the late reply. I tried the code, but are these the features of the images before classification? I mean, if I need to remove the output or classification layer to get the features, will it be the same?
In the case of Facebook's DINO (🤗's documentation), contrary to Microsoft's BEiT (🤗's documentation), yes. As I understood it, Mathilde froze the features and added a linear classification layer on top. However, if you look at other works, e.g. BEiT, the whole network is fine-tuned instead.

I think it is better to see the performance of the network with frozen features, because fine-tuning hides the effect of the pre-training. I hope Mathilde keeps this approach in the future, or at least offers both perspectives, and I wish others would do so as well. In a nutshell, if you want to use DINO, you can use the official implementation without worrying about a classification layer. Or you can use 🤗's implementation and just extract the [CLS] token as your feature. I have checked in a GitHub Gist that you should get similar results with both methods. The only thing you have to be wary of is the pre-processing, which 🤗 modified without giving a reason.
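For reference, a minimal sketch of the 🤗 route described above, assuming a recent transformers release (older versions expose ViTFeatureExtractor instead of ViTImageProcessor) and the facebook/dino-vits16 checkpoint:

```python
from PIL import Image
import requests
from transformers import ViTImageProcessor, ViTModel

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the DINO ViT-S/16 checkpoint together with its matching pre-processing.
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16')

inputs = processor(images=image, return_tensors='pt')
outputs = model(**inputs)

# The [CLS] token of the last layer is the image feature: shape (1, 384) for ViT-S.
cls_feature = outputs.last_hidden_state[:, 0]
```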
Thanks a lot for replying. So, excuse me, can I then use the model on the images I have to extract the features?
It is possible that your code is equivalent, but I would follow the README and use this:

```python
import torch

resnet50 = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')
```
I tried your code but got an error.
That is because you would need to install transformers.
I installed it using pip3 install transformers, but that didn't solve it.
If you want a minimal example without HuggingFace:

```python
import torch
import requests
from PIL import Image
from torchvision import transforms as pth_transforms

def get_image(url):
    # Download an image and open it with PIL.
    return Image.open(requests.get(url, stream=True).raw)

# Standard ImageNet pre-processing: resize, center-crop, normalize.
preprocess = pth_transforms.Compose([
    pth_transforms.Resize(256, interpolation=3),  # 3 = PIL.Image.BICUBIC
    pth_transforms.CenterCrop(224),
    pth_transforms.ToTensor(),
    pth_transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def get_features(model, image):
    # Add a batch dimension and run a forward pass.
    return model(preprocess(image).unsqueeze(0))

resnet50 = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
features = get_features(resnet50, get_image(url))
```
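In practice, you would also switch the model to eval mode and disable gradient tracking when extracting features. A small addition to the snippet above; the 2048-d shape in the comment assumes DINO's ResNet-50 hub model, whose classification head is replaced by an identity:

```python
resnet50.eval()  # freeze batch-norm statistics, disable dropout
with torch.no_grad():  # no gradients needed for inference
    features = get_features(resnet50, get_image(url))
print(features.shape)  # expected: torch.Size([1, 2048])
```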
Thanks a lot, but excuse me, how can I get the details about the pre-processing?
The details are in my minimal code snippet above: resize, center-crop, normalization. This is a very simple pre-processing.
Oh, it escaped my attention. Thanks a lot, I appreciate your help.
Hi, I am reopening this topic because I am a bit lost about which are the best features to extract for comparing images (looking for similar images, independent of viewpoint).

I am loading this model and then just running a forward pass; this returns a 384- (or 768-) dimensional feature. Is this feature the class token's activations, or does it come from somewhere else? If it is the class token, I think it would not be ideal, as it contains positional information, which is not the best for comparing images from different viewpoints. Also, I see that I am not using the teacher model's MLP head, which is used during training and outputs a 60k+-dimensional vector for comparison against the student branch. So, if I would like an image feature (with pseudo-semantic info, not positional) on the order of 2-3k dimensions, which would be the best place to get it from? Thanks.
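One way to explore this with the official implementation: DINO's vision transformer exposes a get_intermediate_layers method that returns the full token sequence of the last n blocks, so you can see where the default feature comes from and build larger descriptors. A sketch, assuming ViT-S/16 loaded via torch.hub:

```python
import torch

# Load the official DINO ViT-S/16 backbone.
vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits16.eval()

x = torch.randn(1, 3, 224, 224)  # stands in for a pre-processed image batch
with torch.no_grad():
    # Token sequence of the last block: shape (1, 1 + 196 patches, 384).
    tokens = vits16.get_intermediate_layers(x, n=1)[0]

    cls_feature = tokens[:, 0]              # the [CLS] token (384-d), the usual default
    patch_mean = tokens[:, 1:].mean(dim=1)  # mean of patch tokens, an alternative global descriptor

    # Concatenating the [CLS] tokens of the last 4 blocks, as DINO's eval_linear.py
    # does for ViT-S, gives a 4 * 384 = 1536-d feature, closer to the size asked about.
    last4 = torch.cat([t[:, 0] for t in vits16.get_intermediate_layers(x, n=4)], dim=-1)
```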