Is this the right way to do inference? #2

Closed
Suhail opened this issue Apr 17, 2023 · 22 comments
Labels: documentation (Improvements or additions to documentation)

@Suhail

Suhail commented Apr 17, 2023

I presume I don't need Normalize?

[screenshot: CleanShot 2023-04-17 at 12 39 54]

@Esbenthorius

Esbenthorius commented Apr 17, 2023

Not sure if it's correct, but I hope it helps.

import torch
from PIL import Image
import torchvision.transforms as T
import hubconf

dinov2_vits14 = hubconf.dinov2_vits14()

img = Image.open('meta_dog.png')

transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

img = transform(img)[:3].unsqueeze(0)

with torch.no_grad():
    features = dinov2_vits14(img, return_patches=True)[0]

print(features.shape)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(features)

pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255

plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))
plt.savefig('meta_dog_features.png')

In dinov2/models/vision_transformer.py line 290 add

def forward(self, *args, is_training=False, return_patches=False, **kwargs):
    ret = self.forward_features(*args, **kwargs)
    if is_training:
        return ret
    elif return_patches:
        return ret["x_norm_patchtokens"]
    else:
        return self.head(ret["x_norm_clstoken"])

input:
meta_dog

visualized features:

meta_dog_features

@patricklabatut
Contributor

@Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). Also, as noted in the model card, the model can use image sizes that are multiples of the patch size.

@Suhail
Author

Suhail commented Apr 17, 2023

> @Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). Also, as noted in the model card, the model can use image sizes that are multiples of the patch size.

Thanks! This is what I used:

image_transforms = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

Let me know if that's wrong though.

@jjennings955

> (quoting @Esbenthorius's code example above)

I found this helpful, but I would say instead of needing to modify the forward function, you can just do dino.forward_features(x)["x_norm_patchtokens"] yourself directly.
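
For reference, a minimal sketch of that approach (assuming the torch.hub entry point and an img tensor already prepared with the transform above):

import torch

# Load the ViT-S/14 backbone from torch.hub (downloads weights on first use).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

with torch.no_grad():
    out = model.forward_features(img)          # img: (1, 3, 224, 224) tensor
    patch_tokens = out["x_norm_patchtokens"]   # (1, 256, 384) for a 224x224 input, patch size 14
    cls_token = out["x_norm_clstoken"]         # (1, 384)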

@TimDarcet
Contributor

TimDarcet commented Apr 18, 2023

> (quoting the transform exchange between @patricklabatut and @Suhail above)

What you are doing is correct. What you get with the forward method is the CLS token. If you'd like the patch tokens, you can use forward_features, as noted by @jjennings955

@Suhail
Author

Suhail commented Apr 18, 2023

> (quoting the exchange above, including @TimDarcet's reply)

I think what I want is an embedding like CLIP that contains the features/understanding of the image. Is that what I'd get from forward_features?

@woctezuma

woctezuma commented Apr 18, 2023

If this is like DINO, either of the two features could be used as an image embedding.


Edit: You can see here how it is done in knn.py and log_regression.py, by simply calling model(samples).float():

features_rank = model(samples).float()

See:

dinov2/dinov2/eval/knn.py

Lines 260 to 264 in fc49f49

logger.info("Extracting features for train set...")
train_features, train_labels = extract_features(
    model, train_dataset, batch_size, num_workers, gather_on_cpu=gather_on_cpu
)
logger.info(f"Train features created, shape {train_features.shape}.")

train_features, train_labels = extract_features(
    model, train_dataset, batch_size, num_workers, gather_on_cpu=(train_features_device == _CPU_DEVICE)
)

dinov2/dinov2/eval/utils.py

Lines 114 to 122 in fc49f49

def extract_features_with_dataloader(model, data_loader, sample_count, gather_on_cpu=False):
    gather_device = torch.device("cpu") if gather_on_cpu else torch.device("cuda")
    metric_logger = MetricLogger(delimiter=" ")
    features, all_labels = None, None
    for samples, (index, labels_rank) in metric_logger.log_every(data_loader, 10):
        samples = samples.cuda(non_blocking=True)
        labels_rank = labels_rank.cuda(non_blocking=True)
        index = index.cuda(non_blocking=True)
        features_rank = model(samples).float()

@woctezuma

woctezuma commented Apr 18, 2023

Please note that linear.py adopts a different approach.

features = self.feature_model.get_intermediate_layers(
    images, self.n_last_blocks, return_class_token=True
)

See:

n_last_blocks_list = [1, 4]
n_last_blocks = max(n_last_blocks_list)
autocast_ctx = partial(torch.cuda.amp.autocast, enabled=True, dtype=autocast_dtype)
feature_model = ModelWithIntermediateLayers(model, n_last_blocks, autocast_ctx)
sample_output = feature_model(train_dataset[0][0].unsqueeze(0).cuda())

def forward(self, images):
    with torch.inference_mode():
        with self.autocast_ctx():
            features = self.feature_model.get_intermediate_layers(
                images, self.n_last_blocks, return_class_token=True
            )
    return features


It was also the case with DINO:

You could also do fancier stuff, e.g. "concatenate [CLS] token and GeM pooled patch tokens", as with DINO's copy detection.
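
For illustration, a rough sketch of that fancier option, assuming the forward_features output discussed above; the gem_pool helper is a generic GeM implementation written here, not code from this repo:

import torch

def gem_pool(patch_tokens, p=4.0, eps=1e-6):
    # Generalized mean (GeM) pooling over the patch dimension.
    # patch_tokens: (batch, num_patches, dim) -> (batch, dim)
    return patch_tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)

with torch.no_grad():
    out = model.forward_features(img)                        # model and img as in the earlier snippets
    cls_token = out["x_norm_clstoken"]                       # (batch, dim)
    gem_patches = gem_pool(out["x_norm_patchtokens"])        # (batch, dim)
    embedding = torch.cat([cls_token, gem_patches], dim=1)   # (batch, 2 * dim)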

@patricklabatut patricklabatut added the documentation Improvements or additions to documentation label Apr 18, 2023
@patricklabatut patricklabatut self-assigned this Apr 18, 2023
@Elsword016

Elsword016 commented Apr 19, 2023

How about this??

import torch
from PIL import Image
from torchvision import transforms

img = Image.open('')  # path to your image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = transform(img)
input_batch = input_tensor.unsqueeze(0).cuda()
with torch.no_grad():
    output = dinov2_vits14.get_intermediate_layers(input_batch)

The output is a tuple of intermediate feature maps. You can then select which features you want from the tuple and try k-means, etc.
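
As a rough sketch of that last step (the reshape assumes a 224x224 input with patch size 14, i.e. a 16x16 patch grid; the number of clusters is arbitrary):

from sklearn.cluster import KMeans

# Each element of the tuple has shape (1, num_patches, dim); take the first one.
patch_feats = output[0].squeeze(0).cpu().numpy()   # (256, 384) for ViT-S/14 at 224x224

kmeans = KMeans(n_clusters=4, n_init=10).fit(patch_feats)
segmentation = kmeans.labels_.reshape(16, 16)      # coarse cluster map over the patch grid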

@woctezuma

woctezuma commented Apr 19, 2023

Yes, get_intermediate_layers() allows different approaches. This is similar to what is done in linear.py as mentioned above.

You could also use GeM pooled patch tokens with this output, as in eval_copy_detection.py for DINO (v1).

@Suhail
Author

Suhail commented Apr 21, 2023

Sounds like this is all I need to do to get a features embedding: dino_emb = dinov2_vitg14(t_img.unsqueeze(0))
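
A minimal end-to-end sketch of that call for later readers (t_img is assumed to be a (3, 224, 224) tensor produced by a transform like the one above):

import torch

dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').eval()

with torch.no_grad():
    dino_emb = dinov2_vitg14(t_img.unsqueeze(0))   # (1, 1536) image-level (CLS token) embedding for ViT-g/14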

@patricklabatut
Contributor

Closing as this seems resolved (and using #53 to keep track of documentation needs on feature extraction).

@aaiguy

aaiguy commented May 3, 2023

Hello, how do I train a nearest-neighbors model on embeddings extracted with the DINOv2 model from images in different class folders, and then retrieve the most similar image for a query image?
I tried the approach below using sklearn's NearestNeighbors:

import torch
from sklearn.neighbors import NearestNeighbors 
import pickle
from PIL import Image
import torchvision.transforms as T
import os 
# import hubconf
import tqdm
from tqdm import tqdm_notebook
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
print('device:',device)
# dinov2_vits14 = hubconf.dinov2_vits14()
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')
dinov2_vits14.to(device)
def extract_features(filename):
    img = Image.open(filename)

    transform = T.Compose([
        T.Resize(224),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.5], std=[0.5]),
    ])

    img = transform(img)[:3].unsqueeze(0)

    with torch.no_grad():
        features = dinov2_vits14(img.to('cuda'))[0]

    # move back to CPU before converting to numpy
    return features.cpu().numpy()

extensions = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG']

def get_file_list(root_dir):
    file_list = []
    for root, directories, filenames in os.walk(root_dir):
        for filename in filenames:
            if any(ext in filename for ext in extensions):
                filepath = os.path.join(root, filename)
                if os.path.exists(filepath):
                    file_list.append(filepath)
                else:
                    print(filepath)
    return file_list

# path to your dataset
root_dir = 'image_folder' 
filenames = sorted(get_file_list(root_dir))
print('Total files :', len(filenames))
feature_list = []
for i in tqdm.tqdm(range(len(filenames))):
    feature_list.append(extract_features(filenames[i]))
pickle.dump(feature_list,open('dino-all-feature-list.pickle','wb'))
pickle.dump(filenames,open('dino-all-filenames.pickle','wb'))
neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute',metric='euclidean').fit(feature_list)
# Save the model to a file
with open('dino-all-neighbors2.pkl', 'wb') as f:
    pickle.dump(neighbors, f)

With the above DINOv2-based setup I get around 70% accuracy on the test data when retrieving images of the same class. Is there a way to improve my approach and increase the accuracy?

@woctezuma

woctezuma commented May 3, 2023

First, for k-NN classification, have a look at knn.py.

Second, after a quick look at your code, I would suggest trying a different metric, e.g. cosine instead of euclidean.
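
For example, a hedged sketch of that change (feature_list here is the list of per-image embeddings built in your script):

from sklearn.neighbors import NearestNeighbors

# Same index as before, but with cosine distance instead of euclidean.
neighbors = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(feature_list)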

Third, I believe you should use a different image pre-processing (cf. transform in your code). Copy the one used for DINOv2.

For further questions, I would suggest creating a separate GitHub issue.

@aaiguy

aaiguy commented May 3, 2023

Hey thanks, I will look into it.
Where can I find the transform used for DINOv2?

@woctezuma

woctezuma commented May 3, 2023

> Hey thanks, I will look into it. Where can I find the transform used for DINOv2?

It is mentioned above: #2 (comment)

transforms_list = [
    transforms.Resize(resize_size, interpolation=interpolation),
    transforms.CenterCrop(crop_size),
    MaybeToTensor(),
    make_normalize_transform(mean=mean, std=std),
]

It is similar to what you did but some values may differ, e.g.:

  • resizing to 256 resolution before center-cropping at 224 resolution,

resize_size: int = 256,
interpolation=transforms.InterpolationMode.BICUBIC,
crop_size: int = 224,
mean: Sequence[float] = IMAGENET_DEFAULT_MEAN,
std: Sequence[float] = IMAGENET_DEFAULT_STD,

  • normalizing with different mean and std.

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
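
Put together, an equivalent torchvision transform would look roughly like this (a sketch based on the defaults above, not code copied from the repo):

import torchvision.transforms as T

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
])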

@aaiguy

aaiguy commented May 5, 2023

> (quoting @Esbenthorius's code example above)

How do I visualize features like this?
I tried the following:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


test_img = r"image.png"

features = extract_features_new(test_img)

pca = PCA(n_components=3)
pca.fit(features)

pca_features = pca.transform(features)
pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
pca_features = pca_features * 255

plt.imshow(pca_features.reshape(16, 16, 3).astype(np.uint8))


With this I'm getting an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[133], line 11
      8 features = extract_features_new(test_img)
     10 pca = PCA(n_components=3)
---> 11 pca.fit(features)
     13 pca_features = pca.transform(features)
     14 pca_features = (pca_features - pca_features.min()) / (pca_features.max() - pca_features.min())
     
     
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.48167408 -2.6765716  -1.8200531  ... -2.971799    1.1348227
 -1.9918671 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.    

The feature shape is 1024; how would I fix this?

@woctezuma

woctezuma commented May 5, 2023

@XiaominLi1997

> @Suhail To generate features from the pretrained backbones, just use a transform similar to the standard one used for evaluating on image classification, with the typical ImageNet normalization mean and std (see what's used in the code). Also, as noted in the model card, the model can use image sizes that are multiples of the patch size.

Hi, it seems I can get a feature embedding of [1, 256, 384] for an image; after reshaping it to [1, 16, 16, 384] I can visualize the features. But how can I get a feature map with a larger resolution? I want finer information such as texture.

@purnasai

Hi @XiaominLi1997, use larger models:

  • feat_dim = 384 # vits14
  • feat_dim = 768 # vitb14
  • feat_dim = 1024 # vitl14
  • feat_dim = 1536 # vitg14

So you can use ViT-g/14, and also increase the input image size in multiples of 14, e.g. 518 px (i.e. 14-pixel patches × 37 patches per side), as in the sketch below.
Hope this helps.
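
A minimal sketch of that suggestion (the file name is a placeholder; shapes are for ViT-g/14 at a 518-pixel input, i.e. a 37x37 patch grid):

import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14').cuda().eval()

transform = T.Compose([
    T.Resize(518, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(518),                     # 518 = 14 * 37
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = transform(Image.open('your_image.png').convert('RGB')).unsqueeze(0).cuda()

with torch.no_grad():
    patch_tokens = model.forward_features(img)["x_norm_patchtokens"]   # (1, 1369, 1536)

feature_map = patch_tokens.reshape(1, 37, 37, 1536)   # higher-resolution feature map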

@ydove0324

> T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),

Why do you need to center with (0.485, 0.456, 0.406)? Is this mentioned anywhere?

@charchit7

@ydove0324 This is the standard ImageNet mean used for training. It's common practice.
