
How to use Depth embedding. #14

Open
softmurata opened this issue May 10, 2023 · 6 comments

@softmurata

Thanks for the great work!
I want to use the depth embeddings in ImageBind, but I cannot get good results...
Please explain how to use the depth embeddings.

・Run a depth estimator and create a depth image:

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")

encoding = feature_extractor(image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")

・After that, run inference with the following code:

import torch
from torchvision import transforms
from PIL import Image

def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None

    depth_outputs = []
    for depth_path in depth_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize((0.5,), (0.5,))  # if I use this normalization, I cannot get good results...
            ]
        )
        # load the saved depth map as a single-channel (grayscale) image
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")

        image = data_transform(image).to(device)
        depth_outputs.append(image)
    return torch.stack(depth_outputs, dim=0)


import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

・Output:

Vision x Depth:  tensor([[0.3444, 0.3040, 0.3516],
        [0.3451, 0.2363, 0.4186],
        [0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth:  tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
        [5.6266e-01, 4.3734e-01, 9.7014e-10],
        [4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio:  tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
        [1.5248e-02, 4.6171e-03, 9.8014e-01],
        [1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')

Please reply!

@ZrrSkywalker

Same question. The paper says that depth maps are transformed into disparity maps. Does this matter? @softmurata
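
For reference, here is a minimal sketch of that conversion, assuming a metric depth map in meters; the focal length and baseline below are placeholder Kinect-style values, not constants documented for ImageBind:

import numpy as np

def depth_to_disparity(depth_m, focal_length_px=518.86, baseline_m=0.075, max_depth_m=10.0):
    # Clamp depth to avoid division by zero, convert to disparity
    # (inversely proportional to depth), and rescale to [0, 1].
    depth_m = np.clip(depth_m, 1e-3, max_depth_m)
    disparity = (focal_length_px * baseline_m) / depth_m
    return (disparity / disparity.max()).astype(np.float32)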

@antonioo-c

antonioo-c commented May 31, 2023

Same question. I also tried the depth-map-to-disparity-map code by @imisra from here, but still did not get reasonable results.

@omaralvarez

omaralvarez commented Jun 10, 2023

I am also interested in how to use the depth embeddings properly; I am not getting good results either.

@StanLei52

Not sure if it is because the dog/car/bird cases do not appear in the training set of ImageBind.

@llziss4ai

We can use absolute depth in meters for inference with this repo.
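
A rough sketch of such a loader, assuming the depth maps are stored as 16-bit PNGs in millimeters (the file format, the 224 crop, and the function name below are assumptions for illustration, not part of this repo):

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

def load_absolute_depth(depth_paths, device):
    # Load each depth map, convert millimeters to meters, and resize/crop
    # to a single-channel 224x224 tensor.
    resize_crop = transforms.Compose(
        [
            transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
            transforms.CenterCrop(224),
        ]
    )
    outputs = []
    for path in depth_paths:
        depth_mm = np.asarray(Image.open(path), dtype=np.float32)   # H x W, millimeters
        depth_m = torch.from_numpy(depth_mm / 1000.0).unsqueeze(0)  # 1 x H x W, meters
        outputs.append(resize_crop(depth_m).to(device))
    return torch.stack(outputs, dim=0)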

@xiaos16

xiaos16 commented Mar 28, 2024

Hello @imisra, I would like to know how to preprocess the disparity map obtained from this code. Thanks!

I filtered samples from the 19 classes and got 34.51 top-1 accuracy on SUN RGB-D only.
