_The following code requires that the dinov2 folder from our GitHub repository be uploaded to your My Drive folder on Google Drive. Further instructions appear below._

_Run the first two code blocks. When the second block finishes, you will be asked to restart the session. Restart it, then continue with the next code block. You do not need to re-run the first two blocks after restarting._

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!pip install -U openmim

Collecting openmim
  Downloading openmim-0.3.9-py2.py3-none-any.whl.metadata (16 kB)
Collecting colorama (from openmim)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting model-index (from openmim)
  Downloading model_index-0.1.11-py3-none-any.whl.metadata (3.9 kB)
Collecting opendatalab (from openmim)
  Downloading opendatalab-0.0.10-py3-none-any.whl.metadata (6.4 kB)
Collecting ordered-set (from model-index->openmim)
  Downloading ordered_set-4.1.0-py3-none-any.whl.metadata (5.3 kB)
Collecting pycryptodome (from opendatalab->openmim)
  Downloading pycryptodome-3.21.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting openxlab (from opendatalab->openmim)
  Downloading openxlab-0.1.2-py3-none-any.whl.metadata (3.8 kB)
Collecting filelock~=3.14.0 (from openxlab->opendatalab->openmim)
  Downloading filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting oss2~=2.17.0 (from openxlab->opendatalab->openmim)
  Downloading oss

_The next two blocks require some time to run (1-3 minutes total). Run them, and while they are running, do the following:_

_Upload all images and ground truth depth maps to the /content directory. If you click on the Files tab to the left, you should be brought to the /content directory by default (you should see directories called .config and sample\_data). For example, your file system could look like /content/1.png, /content/1.jpg, /content/2.png, etc._

_This code is designed to work for one scene at a time, so only upload the data for one scene. Performance metrics for multiple scenes can be easily obtained by running this code separately for each scene, and then taking the average of the performance on each scene, weighted by the number of samples in each scene._
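_The weighted average across scenes described above can be sketched as follows. This is a minimal illustration, not part of the original pipeline: the helper name weighted\_scene\_average and the example values are hypothetical._

```python
def weighted_scene_average(per_scene_metric, per_scene_count):
    """Average per-scene metrics, weighting each scene by its number of samples."""
    if len(per_scene_metric) != len(per_scene_count):
        raise ValueError("need exactly one sample count per scene")
    total = sum(per_scene_count)
    if total == 0:
        raise ValueError("at least one scene must contain samples")
    weighted_sum = sum(m * n for m, n in zip(per_scene_metric, per_scene_count))
    return weighted_sum / total


# Hypothetical example: two scenes with MSE 0.20 (100 images) and 0.30 (300 images).
# (0.20 * 100 + 0.30 * 300) / 400 is approximately 0.275.
overall_mse = weighted_scene_average([0.20, 0.30], [100, 300])
print(overall_mse)
```

_Weighting by sample count gives the same result as pooling the per-image errors of every scene before averaging._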

_The outputs in this notebook were computed specifically on the scene basement\_0001a\_out, whereas the performance reported in the paper is averaged across all scenes, so the performance metrics here will not perfectly match those in the paper._

In [1]:
!mim install mmcv==1.5.0

Looking in links: https://download.openmmlab.com/mmcv/dist/cu121/torch2.5.0/index.html
Collecting mmcv==1.5.0
  Downloading mmcv-1.5.0.tar.gz (530 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 530.7/530.7 kB 11.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting addict (from mmcv==1.5.0)
  Downloading addict-2.4.0-py3-none-any.whl.metadata (1.0 kB)
Collecting yapf (from mmcv==1.5.0)
  Downloading yapf-0.43.0-py3-none-any.whl.metadata (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.8/46.8 kB 3.9 MB/s eta 0:00:00
Downloading addict-2.4.0-py3-none-any.whl (3.8 kB)
Downloading yapf-0.43.0-py3-none-any.whl (256 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 256.2/256.2 kB 15.0 MB/s eta 0:00:00
Building wheels for collected packages: mmcv
  Building wheel for mmcv (setup.py) ... done
  Created wheel for mm

In [2]:
import shutil

# Define the source and destination path
source_folder_path = '/content/gdrive/My Drive/dinov2'
destination_folder_path = '/content/dinov2'

# Copy the folder
shutil.copytree(source_folder_path, destination_folder_path)

'/content/dinov2'

_Run the next 5 blocks._

In [3]:
import os
from PIL import Image
import numpy as np
import torch
import torch.nn.functional as F
import matplotlib
import matplotlib.pyplot as plt
from pathlib import Path
import sys
import math
import itertools
from functools import partial
import mmcv
from collections import defaultdict
import cv2
from google.colab.patches import cv2_imshow

from torchvision import transforms

from dinov2.eval.depth.models import build_depther

In [4]:
class CenterPadding(torch.nn.Module):
    def __init__(self, multiple):
        super().__init__()
        self.multiple = multiple

    def _get_pad(self, size):
        new_size = math.ceil(size / self.multiple) * self.multiple
        pad_size = new_size - size
        pad_size_left = pad_size // 2
        pad_size_right = pad_size - pad_size_left
        return pad_size_left, pad_size_right

    @torch.inference_mode()
    def forward(self, x):
        pads = list(itertools.chain.from_iterable(self._get_pad(m) for m in x.shape[:1:-1]))
        output = F.pad(x, pads)
        return output


def create_depther(cfg, backbone_model, backbone_size, head_type):
    train_cfg = cfg.get("train_cfg")
    test_cfg = cfg.get("test_cfg")
    depther = build_depther(cfg.model, train_cfg=train_cfg, test_cfg=test_cfg)

    depther.backbone.forward = partial(
        backbone_model.get_intermediate_layers,
        n=cfg.model.backbone.out_indices,
        reshape=True,
        return_class_token=cfg.model.backbone.output_cls_token,
        norm=cfg.model.backbone.final_norm,
    )

    if hasattr(backbone_model, "patch_size"):
        depther.backbone.register_forward_pre_hook(lambda _, x: CenterPadding(backbone_model.patch_size)(x[0]))

    return depther

In [5]:
def make_depth_transform() -> transforms.Compose:
    return transforms.Compose([
        transforms.ToTensor(),
        lambda x: 255.0 * x[:3], # Discard alpha component and scale by 255
        transforms.Normalize(
            mean=(123.675, 116.28, 103.53),
            std=(58.395, 57.12, 57.375),
        ),
    ])


def render_depth(values, colormap_name="magma_r") -> Image.Image:
    min_value, max_value = values.min(), values.max()
    normalized_values = (values - min_value) / (max_value - min_value)

    colormap = matplotlib.colormaps[colormap_name]
    colors = colormap(normalized_values, bytes=True) # ((1)xhxwx4)
    colors = colors[:, :, :3] # Discard alpha component
    return Image.fromarray(colors)

_If you wish to change the size of the DINOv2 encoder, set BACKBONE\_SIZE accordingly._

In [6]:
BACKBONE_SIZE = "small" # in ("small", "base", "large" or "giant")


backbone_archs = {
    "small": "vits14",
    "base": "vitb14",
    "large": "vitl14",
    "giant": "vitg14",
}
backbone_arch = backbone_archs[BACKBONE_SIZE]
backbone_name = f"dinov2_{backbone_arch}"

backbone_model = torch.hub.load(repo_or_dir="facebookresearch/dinov2", model=backbone_name)
backbone_model.eval()
backbone_model.cuda()

Downloading: "https://github.com/facebookresearch/dinov2/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth" to /root/.cache/torch/hub/checkpoints/dinov2_vits14_pretrain.pth
100%|██████████| 84.2M/84.2M [00:00<00:00, 170MB/s]


DinoVisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
    (norm): Identity()
  )
  (blocks): ModuleList(
    (0-11): 12 x NestedTensorBlock(
      (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (attn): MemEffAttention(
        (qkv): Linear(in_features=384, out_features=1152, bias=True)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): LayerScale()
      (drop_path1): Identity()
      (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=384, out_features=1536, bias=True)
        (act): GELU(approximate='none')
        (fc2): Linear(in_features=1536, out_features=384, bias=True)
        (drop): Dropout(p=0.0, inplace=False)
      )
      (ls2): LayerScale()
      (drop_path2): Identity()
    )
  )
  (n

In [7]:
import urllib.request

import mmcv
from mmcv.runner import load_checkpoint


def load_config_from_url(url: str) -> str:
    with urllib.request.urlopen(url) as f:
        return f.read().decode()


HEAD_DATASET = "nyu" # in ("nyu", "kitti")
HEAD_TYPE = "dpt" # in ("linear", "linear4", "dpt")


DINOV2_BASE_URL = "https://dl.fbaipublicfiles.com/dinov2"
head_config_url = f"{DINOV2_BASE_URL}/{backbone_name}/{backbone_name}_{HEAD_DATASET}_{HEAD_TYPE}_config.py"
head_checkpoint_url = f"{DINOV2_BASE_URL}/{backbone_name}/{backbone_name}_{HEAD_DATASET}_{HEAD_TYPE}_head.pth"

cfg_str = load_config_from_url(head_config_url)
cfg = mmcv.Config.fromstring(cfg_str, file_format=".py")

model = create_depther(
    cfg,
    backbone_model=backbone_model,
    backbone_size=BACKBONE_SIZE,
    head_type=HEAD_TYPE,
)

load_checkpoint(model, head_checkpoint_url, map_location="cpu")
model.eval()
model.cuda()

load checkpoint from http path: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_dpt_head.pth


Downloading: "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_dpt_head.pth" to /root/.cache/torch/hub/checkpoints/dinov2_vits14_nyu_dpt_head.pth
100%|██████████| 160M/160M [00:04<00:00, 40.6MB/s]


DepthEncoderDecoder(
  (backbone): DinoVisionTransformer()
  (decode_head): DPTHead(
    align_corners=False
    (loss_decode): ModuleList(
      (0): SigLoss()
      (1): GradientLoss()
    )
    (conv_depth): HeadDepth(
      (head): Sequential(
        (0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): Interpolate()
        (2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU()
        (4): Conv2d(32, 1, kernel_size=(1, 1), stride=(1, 1))
      )
    )
    (relu): ReLU()
    (sigmoid): Sigmoid()
    (reassemble_blocks): ReassembleBlocks(
      (projects): ModuleList(
        (0): ConvModule(
          (conv): Conv2d(384, 48, kernel_size=(1, 1), stride=(1, 1))
        )
        (1): ConvModule(
          (conv): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1))
        )
        (2): ConvModule(
          (conv): Conv2d(384, 192, kernel_size=(1, 1), stride=(1, 1))
        )
        (3): ConvModule(
          (co

_Set num\_images to the number of images in the scene you uploaded._

In [8]:
num_images = 281

_Run this block to set up the speed tests._

In [40]:
# Speed Test Setup

import cv2
import matplotlib.pyplot as plt
import numpy as np
import time

%cd '/content/'

spatial_offset_pixels = 150
distance_threshold = 100
feature_algorithm = cv2.ORB_create(nfeatures=500)

image_height = 480
image_width = 640
original_dino_depth_map = torch.zeros((num_images, image_height, image_width))
updated_dino_depth_map = torch.zeros((num_images, image_height, image_width))
images = []
grayscale_images = []

for i in range(num_images):
    images.append(cv2.imread(str(i+1) + '.jpg'))
    grayscale_images.append(cv2.imread(str(i+1) + '.jpg', 0))

transform = make_depth_transform()

/content


_Run this block to test the speed of the Vanilla DINOv2 model._

In [50]:
# Speed Test - Vanilla Model

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for i in range(num_images):
    transformed_image = transform(images[i])
    batch = transformed_image.unsqueeze(0).cuda()

    with torch.no_grad():
        original_dino_depth_map[i] = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

end.record()
torch.cuda.synchronize()

avg_milliseconds_original = start.elapsed_time(end)/num_images
print("Vanilla DINOv2 Time for " + str(num_images) + " Forward Props = " + str(avg_milliseconds_original) + " milliseconds (" + str(1000/avg_milliseconds_original) + " Hz)")

Vanilla DINOv2 Time for 281 Forward Props = 141.99924933274022 milliseconds (7.0422907494161935 Hz)


_Run this block to test the speed of the Iteration 0 Pipeline. Set speed\_optimized=True for ORB Correspondence Speed and speed\_optimized=False for ORB Correspondence Accuracy._

In [60]:
# Speed Test - Iteration 0

speed_optimized = False

rho = 0.70

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for i in range(num_images):
    transformed_image = transform(images[i])
    batch = transformed_image.unsqueeze(0).cuda()

    with torch.no_grad():
        original_dino_depth_map[i] = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

    if i > 0:
        keypoints1, descriptors1 = feature_algorithm.detectAndCompute(images[i], None)
        keypoints2, descriptors2 = feature_algorithm.detectAndCompute(images[i-1], None)

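        # ORB descriptors are binary, so cv2.NORM_HAMMING is the conventional norm;
        # NORM_L2 is the setting used for the outputs shown in this notebook.
        bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)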
        matches = bf.match(descriptors1, descriptors2)

        if speed_optimized:
            matches = [m for m in matches if m.distance < distance_threshold]

        sum_tensor = torch.zeros_like(original_dino_depth_map[0])
        count_tensor = torch.zeros_like(original_dino_depth_map[0])

        for index in range(len(matches)):
            match = matches[index]
            img1_idx = match.queryIdx
            img2_idx = match.trainIdx

            (x1, y1) = keypoints1[img1_idx].pt
            (x2, y2) = keypoints2[img2_idx].pt

            x1 = int(x1)
            y1 = int(y1)
            x2 = int(x2)
            y2 = int(y2)

            y1_start = max(y1 - spatial_offset_pixels, 0)
            y1_end = min(y1 + spatial_offset_pixels, image_height - 1)
            x1_start = max(x1 - spatial_offset_pixels, 0)
            x1_end = min(x1 + spatial_offset_pixels, image_width - 1)

            y2_start = max(y2 - spatial_offset_pixels, 0)
            y2_end = min(y2 + spatial_offset_pixels, image_height - 1)
            x2_start = max(x2 - spatial_offset_pixels, 0)
            x2_end = min(x2 + spatial_offset_pixels, image_width - 1)

            if y1_start == 0:
                y2_end = min(y2_end, y2_start + y1_end - y1_start)
            if y2_start == 0:
                y1_end = min(y1_end, y1_start + y2_end - y2_start)
            if y1_end == image_height - 1:
                y2_start = max(y2_start, y2_end - y1_end + y1_start)
            if y2_end == image_height - 1:
                y1_start = max(y1_start, y1_end - y2_end + y2_start)
            if x1_start == 0:
                x2_end = min(x2_end, x2_start + x1_end - x1_start)
            if x2_start == 0:
                x1_end = min(x1_end, x1_start + x2_end - x2_start)
            if x1_end == image_width - 1:
                x2_start = max(x2_start, x2_end - x1_end + x1_start)
            if x2_end == image_width - 1:
                x1_start = max(x1_start, x1_end - x2_end + x2_start)

            # Extract valid subregion from original_dino_depth_map
            subregion1 = original_dino_depth_map[i, y1_start:y1_end+1, x1_start:x1_end+1]
            subregion2 = original_dino_depth_map[i-1, y2_start:y2_end+1, x2_start:x2_end+1]

            sum_tensor[y1_start:y1_end+1, x1_start:x1_end+1] += rho * subregion1 + (1 - rho) * subregion2
            count_tensor[y1_start:y1_end+1, x1_start:x1_end+1] += 1

        updated_dino_depth_map[i] = original_dino_depth_map[i].clone()
        updated_dino_depth_map[i][count_tensor != 0] = sum_tensor[count_tensor != 0] / count_tensor[count_tensor != 0]

end.record()
torch.cuda.synchronize()

avg_milliseconds_iteration_0 = start.elapsed_time(end)/num_images
print("Iteration 0 Time for " + str(num_images) + " Forward Props = " + str(avg_milliseconds_iteration_0) + " milliseconds (" + str(1000/avg_milliseconds_iteration_0) + " Hz, " + str(100*(avg_milliseconds_iteration_0-avg_milliseconds_original)/avg_milliseconds_original) + " percent change relative to Vanilla DINOv2)")

Iteration 0 Time for 281 Forward Props = 283.9788701067616 milliseconds (3.5213887555227297 Hz, 99.98617699825097 percent change relative to Vanilla DINOv2)
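
_The window clamping in the Iteration 0 cell above appears intended to keep the two patches the same size when a keypoint lies near an image border. One alternative way to get equally sized patches is to shrink each side of the window by the same amount around both keypoints. The sketch below illustrates that idea only: the helper name matched\_windows is hypothetical, it assumes both keypoints lie inside the image, and it is not the clamping logic used in the cell above, so it may not reproduce the reported numbers._

```python
def matched_windows(x1, y1, x2, y2, offset, height, width):
    """Return (rows1, cols1, rows2, cols2): equally sized slices around two keypoints.

    Each side of the window is limited by whichever keypoint is closer to that
    image border, so both windows keep their keypoint at the same relative position.
    Assumes 0 <= x < width and 0 <= y < height for both keypoints.
    """
    up = min(offset, y1, y2)
    down = min(offset, height - 1 - y1, height - 1 - y2)
    left = min(offset, x1, x2)
    right = min(offset, width - 1 - x1, width - 1 - x2)

    rows1 = slice(y1 - up, y1 + down + 1)
    cols1 = slice(x1 - left, x1 + right + 1)
    rows2 = slice(y2 - up, y2 + down + 1)
    cols2 = slice(x2 - left, x2 + right + 1)
    return rows1, cols1, rows2, cols2


# Hypothetical keypoints near opposite corners of a 480 x 640 frame.
rows1, cols1, rows2, cols2 = matched_windows(10, 5, 600, 470, offset=150, height=480, width=640)
print(rows1, cols1, rows2, cols2)
```

_The returned slices can index two depth maps directly, for example depth\_a[rows1, cols1] and depth\_b[rows2, cols2], and the two patches always have the same shape._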


_Run this block to test the speed of the Iteration 1 Pipeline._

In [52]:
# Speed Test - Iteration 1

rho = 0.83

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for i in range(num_images):
    transformed_image = transform(images[i])
    batch = transformed_image.unsqueeze(0).cuda()

    with torch.inference_mode():
        original_dino_depth_map[i] = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

    if i > 0:
        shift_x, shift_y = cv2.phaseCorrelate(np.float32(grayscale_images[i-1]), np.float32(grayscale_images[i]))[0]

        shift_x = round(shift_x)
        shift_y = round(shift_y)

        updated_dino_depth_map[i] = original_dino_depth_map[i].clone()

        if shift_x >= 0 and shift_y >= 0:
            updated_dino_depth_map[i, shift_y:, shift_x:] = rho * original_dino_depth_map[i, shift_y:, shift_x:] + (1 - rho) * original_dino_depth_map[i-1, :image_height-shift_y, :image_width-shift_x]
        elif shift_x >= 0 and shift_y < 0:
            updated_dino_depth_map[i, :image_height+shift_y, shift_x:] = rho * original_dino_depth_map[i, :image_height+shift_y, shift_x:] + (1 - rho) * original_dino_depth_map[i-1, -shift_y:, :image_width-shift_x]
        elif shift_x < 0 and shift_y >= 0:
            updated_dino_depth_map[i, shift_y:, :image_width+shift_x] = rho * original_dino_depth_map[i, shift_y:, :image_width+shift_x] + (1 - rho) * original_dino_depth_map[i-1, :image_height-shift_y, -shift_x:]
        else:
            updated_dino_depth_map[i, :image_height+shift_y, :image_width+shift_x] = rho * original_dino_depth_map[i, :image_height+shift_y, :image_width+shift_x] + (1 - rho) * original_dino_depth_map[i-1, -shift_y:, -shift_x:]

end.record()
torch.cuda.synchronize()

avg_milliseconds_iteration_1 = start.elapsed_time(end)/num_images
print("Iteration 1 Time for " + str(num_images) + " Forward Props = " + str(avg_milliseconds_iteration_1) + " milliseconds (" + str(1000/avg_milliseconds_iteration_1) + " Hz, " + str(100*(avg_milliseconds_iteration_1-avg_milliseconds_original)/avg_milliseconds_original) + " percent change relative to Vanilla DINOv2)")

Iteration 1 Time for 281 Forward Props = 145.62527802491104 milliseconds (6.866939679448628 Hz, 2.5535548315992265 percent change relative to Vanilla DINOv2)
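
_The four branches on the signs of shift\_x and shift\_y in the Iteration 1 cell above can be expressed with one pair of slices per axis. The sketch below is a minimal NumPy illustration of that blending step, not the code used to produce the timings: the helper name blend\_with\_shift is hypothetical, and the early return for shifts at least as large as the image is an added assumption._

```python
import numpy as np


def blend_with_shift(current, previous, shift_x, shift_y, rho):
    """Blend the overlapping region of two depth maps related by an integer shift.

    Pixels of `current` that have a counterpart in `previous` become
    rho * current + (1 - rho) * previous; all other pixels are left unchanged.
    """
    height, width = current.shape
    blended = current.copy()
    # Assumption: a shift at least as large as the image leaves no overlap to blend.
    if abs(shift_x) >= width or abs(shift_y) >= height:
        return blended

    # Region of the current frame that overlaps the shifted previous frame.
    cur_rows = slice(max(shift_y, 0), height + min(shift_y, 0))
    cur_cols = slice(max(shift_x, 0), width + min(shift_x, 0))
    # Matching region of the previous frame.
    prev_rows = slice(max(-shift_y, 0), height - max(shift_y, 0))
    prev_cols = slice(max(-shift_x, 0), width - max(shift_x, 0))

    blended[cur_rows, cur_cols] = (
        rho * current[cur_rows, cur_cols]
        + (1 - rho) * previous[prev_rows, prev_cols]
    )
    return blended


# Hypothetical 4 x 5 depth maps with a shift of +1 pixel in x and -1 pixel in y.
current = np.arange(20, dtype=float).reshape(4, 5)
previous = np.zeros((4, 5))
print(blend_with_shift(current, previous, shift_x=1, shift_y=-1, rho=0.5))
```

_For each sign combination the slices reduce to the same index ranges as the corresponding branch in the cell above._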


_Run this block to test the accuracy of the Iteration 0 Pipeline, as compared to the Vanilla DINOv2 model. Set speed\_optimized=True for ORB Correspondence Speed and speed\_optimized=False for ORB Correspondence Accuracy._

In [55]:
# Accuracy Test - Iteration 0

import cv2
import matplotlib.pyplot as plt
import numpy as np

%cd '/content/'

speed_optimized = False

spatial_offset_pixels = 150

rho = 0.70

image_height = 480
image_width = 640
ground_truth_depth = np.zeros((num_images, image_height, image_width))
original_dino_depth_map = torch.zeros((num_images, image_height, image_width))
updated_dino_depth_map = torch.zeros((num_images, image_height, image_width))
original_dino_mse = torch.zeros(num_images)
updated_dino_mse = torch.zeros(num_images)
difference = torch.zeros(num_images)
images = []

distance_threshold = 100

for i in range(num_images):
    images.append(cv2.imread(str(i+1) + '.jpg'))
    ground_truth_depth[i] = 10.0 * np.array(Image.open(str(i+1) + '.png')).astype(float)/255.0

feature_algorithm = cv2.ORB_create(nfeatures=500)
transform = make_depth_transform()

for i in range(num_images):
    transformed_image = transform(images[i])
    batch = transformed_image.unsqueeze(0).cuda()

    with torch.inference_mode():
        original_dino_depth_map[i] = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

    masked_true_depths = torch.tensor(ground_truth_depth[i][ground_truth_depth[i] > 0.0])
    original_masked_dino_depths = torch.tensor(original_dino_depth_map[i][ground_truth_depth[i] > 0.0])

    original_dino_mse[i] = F.mse_loss(original_masked_dino_depths, masked_true_depths)

    print("\nIteration 0 - Original Dino MSE for Image " + str(i) + " = " + str(original_dino_mse[i].item()))

    if i == 0:
        updated_dino_mse[i] = original_dino_mse[i]
        difference[i] = updated_dino_mse[i].item() - original_dino_mse[i].item()
    else:
        keypoints1, descriptors1 = feature_algorithm.detectAndCompute(images[i], None)
        keypoints2, descriptors2 = feature_algorithm.detectAndCompute(images[i-1], None)

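        # ORB descriptors are binary, so cv2.NORM_HAMMING is the conventional norm;
        # NORM_L2 is the setting used for the outputs shown in this notebook.
        bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)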
        matches = bf.match(descriptors1, descriptors2)

        if speed_optimized:
            matches = [m for m in matches if m.distance < distance_threshold]

        sum_tensor = torch.zeros_like(original_dino_depth_map[0])
        count_tensor = torch.zeros_like(original_dino_depth_map[0])

        for index in range(len(matches)):
            match = matches[index]
            img1_idx = match.queryIdx
            img2_idx = match.trainIdx

            (x1, y1) = keypoints1[img1_idx].pt
            (x2, y2) = keypoints2[img2_idx].pt

            x1 = int(x1)
            y1 = int(y1)
            x2 = int(x2)
            y2 = int(y2)

            y1_start = max(y1 - spatial_offset_pixels, 0)
            y1_end = min(y1 + spatial_offset_pixels, image_height - 1)
            x1_start = max(x1 - spatial_offset_pixels, 0)
            x1_end = min(x1 + spatial_offset_pixels, image_width - 1)

            y2_start = max(y2 - spatial_offset_pixels, 0)
            y2_end = min(y2 + spatial_offset_pixels, image_height - 1)
            x2_start = max(x2 - spatial_offset_pixels, 0)
            x2_end = min(x2 + spatial_offset_pixels, image_width - 1)

            if y1_start == 0:
                y2_end = min(y2_end, y2_start + y1_end - y1_start)
            if y2_start == 0:
                y1_end = min(y1_end, y1_start + y2_end - y2_start)
            if y1_end == image_height - 1:
                y2_start = max(y2_start, y2_end - y1_end + y1_start)
            if y2_end == image_height - 1:
                y1_start = max(y1_start, y1_end - y2_end + y2_start)
            if x1_start == 0:
                x2_end = min(x2_end, x2_start + x1_end - x1_start)
            if x2_start == 0:
                x1_end = min(x1_end, x1_start + x2_end - x2_start)
            if x1_end == image_width - 1:
                x2_start = max(x2_start, x2_end - x1_end + x1_start)
            if x2_end == image_width - 1:
                x1_start = max(x1_start, x1_end - x2_end + x2_start)

            # Extract valid subregion from original_dino_depth_map
            subregion1 = original_dino_depth_map[i, y1_start:y1_end+1, x1_start:x1_end+1]
            subregion2 = original_dino_depth_map[i-1, y2_start:y2_end+1, x2_start:x2_end+1]

            sum_tensor[y1_start:y1_end+1, x1_start:x1_end+1] += rho * subregion1 + (1 - rho) * subregion2
            count_tensor[y1_start:y1_end+1, x1_start:x1_end+1] += 1

        average = original_dino_depth_map[i].clone()
        average[count_tensor != 0] = sum_tensor[count_tensor != 0] / count_tensor[count_tensor != 0]

        updated_dino_depth_map[i] = average.clone()

        updated_masked_dino_depths = torch.tensor(updated_dino_depth_map[i][ground_truth_depth[i] > 0.0])

        updated_dino_mse[i] = F.mse_loss(updated_masked_dino_depths, masked_true_depths)

        print("Iteration 0 - Updated Dino MSE for Image " + str(i) + " = " + str(updated_dino_mse[i].item()))

        difference[i] = updated_dino_mse[i].item() - original_dino_mse[i].item()

        print("Iteration 0 - Difference = " + str(difference[i].item()))

print("\nIteration 0 - Mean Original DINO MSE Across " + str(num_images) + " Images = " + str(torch.mean(original_dino_mse).item()))
print("Iteration 0 - Mean Updated DINO MSE Across " + str(num_images) + " Images = " + str(torch.mean(updated_dino_mse).item()))
print("Iteration 0 - Mean Difference Across " + str(num_images) + " Images = " + str(torch.mean(difference).item()) + ", (" + str(torch.mean(difference).item()/torch.mean(original_dino_mse).item()*100) + " percent change relative to Vanilla DINOv2)")

/content





Iteration 0 - Original Dino MSE for Image 0 = 0.4600278437137604

Iteration 0 - Original Dino MSE for Image 1 = 0.33338451385498047




Iteration 0 - Updated Dino MSE for Image 1 = 0.33806440234184265
Iteration 0 - Difference = 0.004679888486862183

Iteration 0 - Original Dino MSE for Image 2 = 0.3002174198627472
Iteration 0 - Updated Dino MSE for Image 2 = 0.25804051756858826
Iteration 0 - Difference = -0.042176902294158936

Iteration 0 - Original Dino MSE for Image 3 = 0.23217007517814636
Iteration 0 - Updated Dino MSE for Image 3 = 0.20629426836967468
Iteration 0 - Difference = -0.02587580680847168

Iteration 0 - Original Dino MSE for Image 4 = 0.16918054223060608
Iteration 0 - Updated Dino MSE for Image 4 = 0.1607423573732376
Iteration 0 - Difference = -0.00843818485736847

Iteration 0 - Original Dino MSE for Image 5 = 0.1919342428445816
Iteration 0 - Updated Dino MSE for Image 5 = 0.1652613878250122
Iteration 0 - Difference = -0.026672855019569397

Iteration 0 - Original Dino MSE for Image 6 = 0.12380967289209366
Iteration 0 - Updated Dino MSE for Image 6 = 0.1227167546749115
Iteration 0 - Difference = -0.00109291
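
_The accuracy test above scores each frame with a masked mean squared error: the ground truth PNG is rescaled from [0, 255] to [0, 10], and pixels whose ground truth is 0 are excluded. The sketch below is a minimal NumPy illustration of that metric, assuming an 8-bit single-channel depth PNG already loaded as an array. The helper name masked\_mse and the toy values are hypothetical._

```python
import numpy as np


def masked_mse(predicted_depth, ground_truth_png, max_depth=10.0):
    """Mean squared error over pixels that have a non-zero ground truth depth."""
    # Rescale 8-bit values to depth units, as done when the notebook loads the PNGs.
    ground_truth = max_depth * ground_truth_png.astype(float) / 255.0
    valid = ground_truth > 0.0
    if not valid.any():
        raise ValueError("ground truth has no valid (non-zero) pixels")
    squared_error = (predicted_depth[valid] - ground_truth[valid]) ** 2
    return float(squared_error.mean())


# Toy 2 x 2 example: two pixels have no ground truth and are ignored.
ground_truth_png = np.array([[0, 255], [51, 0]], dtype=np.uint8)
predicted_depth = np.array([[5.0, 9.0], [3.0, 7.0]])
print(masked_mse(predicted_depth, ground_truth_png))
```

_Masking matters because a zero in the ground truth marks a missing measurement rather than a depth of zero, so including those pixels would inflate the error._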

_Run this block to test the accuracy of the Iteration 1 Pipeline, as compared to the Vanilla DINOv2 model._

In [48]:
# Accuracy Test - Iteration 1

import cv2
import matplotlib.pyplot as plt
import numpy as np

%cd '/content/'

rho = 0.83

image_height = 480
image_width = 640
ground_truth_depth = np.zeros((num_images, image_height, image_width))
original_dino_depth_map = torch.zeros((num_images, image_height, image_width))
updated_dino_depth_map = torch.zeros((num_images, image_height, image_width))
original_dino_mse = torch.zeros(num_images)
updated_dino_mse = torch.zeros(num_images)
difference = torch.zeros(num_images)
images = []
grayscale_images = []

for i in range(num_images):
    images.append(cv2.imread(str(i+1) + '.jpg'))
    grayscale_images.append(cv2.imread(str(i+1) + '.jpg', 0))
    ground_truth_depth[i] = 10.0 * np.array(Image.open(str(i+1) + '.png')).astype(float)/255.0

feature_algorithm = cv2.ORB_create(nfeatures=500)
transform = make_depth_transform()

for i in range(num_images):
    transformed_image = transform(images[i])
    batch = transformed_image.unsqueeze(0).cuda()

    with torch.no_grad():
        original_dino_depth_map[i] = model.whole_inference(batch, img_meta=None, rescale=True).squeeze()

    masked_true_depths = torch.tensor(ground_truth_depth[i][ground_truth_depth[i] > 0.0])
    original_masked_dino_depths = torch.tensor(original_dino_depth_map[i][ground_truth_depth[i] > 0.0])

    original_dino_mse[i] = F.mse_loss(original_masked_dino_depths, masked_true_depths)

    print("\nIteration 1 - Original Dino MSE for Image " + str(i) + " = " + str(original_dino_mse[i].item()))

    if i == 0:
        updated_dino_mse[i] = original_dino_mse[i]
        difference[i] = 0.0
    else:
        shift_x, shift_y = cv2.phaseCorrelate(np.float32(grayscale_images[i-1]), np.float32(grayscale_images[i]))[0]

        shift_x = round(shift_x)
        shift_y = round(shift_y)

        if np.abs(shift_x) < 100 and np.abs(shift_y) < 100:
            updated_dino_depth_map[i] = original_dino_depth_map[i].clone()
            if shift_x >= 0 and shift_y >= 0:
                updated_dino_depth_map[i, shift_y:, shift_x:] = rho * original_dino_depth_map[i, shift_y:, shift_x:] + (1 - rho) * original_dino_depth_map[i-1, :image_height-shift_y, :image_width-shift_x]
            elif shift_x >= 0 and shift_y < 0:
                updated_dino_depth_map[i, :image_height+shift_y, shift_x:] = rho * original_dino_depth_map[i, :image_height+shift_y, shift_x:] + (1 - rho) * original_dino_depth_map[i-1, -shift_y:, :image_width-shift_x]
            elif shift_x < 0 and shift_y >= 0:
                updated_dino_depth_map[i, shift_y:, :image_width+shift_x] = rho * original_dino_depth_map[i, shift_y:, :image_width+shift_x] + (1 - rho) * original_dino_depth_map[i-1, :image_height-shift_y, -shift_x:]
            else:
                updated_dino_depth_map[i, :image_height+shift_y, :image_width+shift_x] = rho * original_dino_depth_map[i, :image_height+shift_y, :image_width+shift_x] + (1 - rho) * original_dino_depth_map[i-1, -shift_y:, -shift_x:]

            updated_masked_dino_depths = torch.tensor(updated_dino_depth_map[i][ground_truth_depth[i] > 0.0])

            updated_dino_mse[i] = F.mse_loss(updated_masked_dino_depths, masked_true_depths)

            print("Iteration 1 - Updated Dino MSE for Image " + str(i) + " = " + str(updated_dino_mse[i].item()))

            difference[i] = updated_dino_mse[i].item() - original_dino_mse[i].item()

            print("Iteration 1 - Difference = " + str(difference[i].item()))
        else:
            updated_dino_mse[i] = original_dino_mse[i]
            difference[i] = 0.0

print("\nIteration 1 - Mean Original DINO MSE Across " + str(num_images) + " Images = " + str(torch.mean(original_dino_mse).item()))
print("Iteration 1 - Mean Updated DINO MSE Across " + str(num_images) + " Images = " + str(torch.mean(updated_dino_mse).item()))
print("Iteration 1 - Mean Difference Across " + str(num_images) + " Images = " + str(torch.mean(difference).item()) + ", (" + str(torch.mean(difference).item()/torch.mean(original_dino_mse).item()*100) + " percent change relative to Vanilla DINOv2)")

/content





Iteration 1 - Original Dino MSE for Image 0 = 0.4600278437137604

Iteration 1 - Original Dino MSE for Image 1 = 0.33338451385498047
Iteration 1 - Updated Dino MSE for Image 1 = 0.3407396674156189
Iteration 1 - Difference = 0.007355153560638428

Iteration 1 - Original Dino MSE for Image 2 = 0.3002174198627472
Iteration 1 - Updated Dino MSE for Image 2 = 0.27350690960884094
Iteration 1 - Difference = -0.02671051025390625

Iteration 1 - Original Dino MSE for Image 3 = 0.23217007517814636
Iteration 1 - Updated Dino MSE for Image 3 = 0.21855226159095764
Iteration 1 - Difference = -0.01361781358718872

Iteration 1 - Original Dino MSE for Image 4 = 0.16918054223060608
Iteration 1 - Updated Dino MSE for Image 4 = 0.15813253819942474
Iteration 1 - Difference = -0.011048004031181335

Iteration 1 - Original Dino MSE for Image 5 = 0.1919342428445816
Iteration 1 - Updated Dino MSE for Image 5 = 0.17343859374523163
Iteration 1 - Difference = -0.018495649099349976

Iteration 1 - Original Dino MSE fo