
How to use Depth embedding. #14

Open
softmurata opened this issue May 10, 2023 · 6 comments

@softmurata

Thanks for the great work!
I want to use the depth embeddings in ImageBind, but I cannot get good results...
Please explain how to use the depth embeddings.

・Run a depth estimator and create a depth image:

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")

encoding = feature_extractor(image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")

・After that, run inference with the following code:

import torch
from torchvision import transforms
from PIL import Image

def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None

    depth_outputs = []
    for depth_path in depth_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize((0.5,), (0.5,))  # if I use this normalization, I cannot get good results...
            ]
        )
        # load the saved depth map as a single-channel (grayscale) image
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")

        image = data_transform(image).to(device)
        depth_outputs.append(image)
    return torch.stack(depth_outputs, dim=0)


import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

・Output:

Vision x Depth:  tensor([[0.3444, 0.3040, 0.3516],
        [0.3451, 0.2363, 0.4186],
        [0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth:  tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
        [5.6266e-01, 4.3734e-01, 9.7014e-10],
        [4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio:  tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
        [1.5248e-02, 4.6171e-03, 9.8014e-01],
        [1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')

Please reply!

@ZrrSkywalker

Same question. The paper says that depth maps are transformed into disparity maps. Does this matter? @softmurata
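
For reference, here is a minimal sketch of that conversion, assuming a metric depth map in meters; the focal length and baseline below are placeholder Kinect-style values, not constants documented for ImageBind:

import numpy as np

def depth_to_disparity(depth_m, focal_length_px=518.86, baseline_m=0.075, max_depth_m=10.0):
    # Clamp depth to avoid division by zero, convert to disparity
    # (inversely proportional to depth), and rescale to [0, 1].
    depth_m = np.clip(depth_m, 1e-3, max_depth_m)
    disparity = (focal_length_px * baseline_m) / depth_m
    return (disparity / disparity.max()).astype(np.float32)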

@antonioo-c

antonioo-c commented May 31, 2023

Same question. I also tried the depth-map-to-disparity-map code by @imisra from here, but still did not get reasonable results.

@omaralvarez

omaralvarez commented Jun 10, 2023

I am also interested in how to use the depth embeddings properly; I am not getting good results either.

@StanLei52

Not sure if it is because the dog/car/bird cases do not appear in the training set of ImageBind.

@llziss4ai

We can use absolute depth in meters for inference with this repo.
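
A rough sketch of such a loader, assuming the depth maps are stored as 16-bit PNGs in millimeters (the file format, the 224 crop, and the function name below are assumptions for illustration, not part of this repo):

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

def load_absolute_depth(depth_paths, device):
    # Load each depth map, convert millimeters to meters, and resize/crop
    # to a single-channel 224x224 tensor.
    resize_crop = transforms.Compose(
        [
            transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
            transforms.CenterCrop(224),
        ]
    )
    outputs = []
    for path in depth_paths:
        depth_mm = np.asarray(Image.open(path), dtype=np.float32)   # H x W, millimeters
        depth_m = torch.from_numpy(depth_mm / 1000.0).unsqueeze(0)  # 1 x H x W, meters
        outputs.append(resize_crop(depth_m).to(device))
    return torch.stack(outputs, dim=0)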

@xiaos16

xiaos16 commented Mar 28, 2024

Hello @imisra, I would like to know how to preprocess the disparity map obtained from this code. Thanks!

I filtered samples from the 19 classes and got 34.51 top-1 accuracy on SUN RGB-D only.
