[request] Depth estimation documentation, training code and / or model weights #54
Comments
Can you give an expected timeframe on when depth estimation will be available? |
I'd also be interested to hear. |
I'd also be happy if you could share the semantic segmentation heads. The one that produces the results on the web demo. |
Would be excellent to obtain depth estimation output per image. Supportive of this enhancement! |
segmentation head similar to the demo please |
Also interested in acquiring depth info per image, really cool! |
Also very interested to have the depth estimation head model documentation (and model/weights if possible). |
@patricklabatut Thank you so much for the main code. |
Could you please release the segmentation part? |
Very interested and waiting for your release! |
Cool! |
very interested in releasing the depth estimation head |
Interested in depth estimation head as well (or any documentation on how to reproduce the results using provided models) |
Interested in the depth part also! |
@patricklabatut could you maybe shed some light on the decision to not release the depth estimation parts immediately? |
@patricklabatut amazing work! any approximate timeline on if/when a trained depth estimation head could be released? |
I would love to learn the news about the depth |
I would also appreciate an example code for depth estimation. Can't do much with the model's output embeddings yet. |
Very interested in the depth estimation code! I tried to add a linear head, but I don't know how to convert the (batch_size, num_of_tokens, feature_dim) tensor to (batch_size, 256, image_width, image_height) to get the paper's result on SUNRGBD. |
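A minimal sketch of the token-to-grid step asked about above, in plain Python so the indexing is explicit. It assumes the patch tokens are stored row-major over the patch grid and that the image sides are multiples of the 14-pixel patch size; `grid_shape` and `tokens_to_grid` are hypothetical helper names, not part of the DINOv2 code, and the 462x616 input size is chosen only for illustration.

```python
PATCH = 14  # patch size of the dinov2_vit*14 backbones


def grid_shape(img_h, img_w, patch=PATCH):
    """Return the (rows, cols) of the patch grid for an image of the given size."""
    if img_h % patch or img_w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return img_h // patch, img_w // patch


def tokens_to_grid(tokens, img_h, img_w):
    """Reshape one image's row-major token list (h*w entries) into h rows of w tokens."""
    h, w = grid_shape(img_h, img_w)
    if len(tokens) != h * w:
        raise ValueError(f"expected {h * w} tokens, got {len(tokens)}")
    return [tokens[r * w:(r + 1) * w] for r in range(h)]


# Dummy example: one feature value per token so positions are easy to check.
h, w = grid_shape(462, 616)
tokens = [[float(i)] for i in range(h * w)]
grid = tokens_to_grid(tokens, 462, 616)
```

The resulting grid is 14x smaller than the image on each side, so a per-pixel output at full resolution still needs an upsampling step after the head.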
Would appreciate greatly if your pre-trained depth estimator/optical flow model is released! |
Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task). |
Would be awesome if someone trained a depth estimation head on top of the provided backbone (dinov2_vitl14_pretrain.pth). Any thoughts on who/how, and an ETA? |
I would also like to request an estimated release date for the depth estimation pre-train head. Thank you. |
Two questions about the "DPT decoder" mentioned in 7.3 Dense Recognition Tasks-Depth estimation part. I searched the DPT source code: does the "DPT decoder" refer to its refinenet? If yes, I'm curious why you chose this decoder. Thank you! |
@patricklabatut - any updates on the depth estimation code? |
Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper. |
Hi how much RMSE did you get for depth estimation with DPT decoder? For NYUv2 or SUNRGBD? I am really interested in the results. Thank you very much! @emojilearning |
Hi @patricklabatut, thanks for releasing the code and starting this issue to track progress on depth estimation. I have tried to re-implement this but have not been successful (I was unable to achieve an RMSE below 0.52 for ViT-B/14). My re-implementation is based on the following quoted part from Sec. 7.4. There are many details missing that I filled in, but I cannot seem to get the performance reported. I hope that this can help others who also seem to be struggling with reproducing this number, and perhaps make it easy for the authors to highlight the key difference that would help us reproduce the depth probe. I am basing my experiments on this part describing the simplest setup
Below I detail my attempt based on the details provided in the paper:

Image extraction
I simply assumed that you were training at a resolution similar to NYU (480x640). I went down to 462x616, since both sides are multiples of 14 and the aspect ratio is kept. Depending on the setup, we might have augmentations or not. In the case of extracting dense features and training a layer, there might be no augmentations. Alternatively, we can keep the backbone frozen and train with image augmentations. I tried both. For augmentations, I used ColorJitter, RandomResizedCrop, Random Rotation (<= 10 degrees), and RandomHorizontalFlip. With the exception of jitter, those augmentations were applied to both images and depth.

Feature Extraction
The output tokens form a grid that is 14x smaller than the full image. You can get the patch tokens and the cls token from the output of dino and then reshape them into the correct shape as seen below. This results in an output of
import torch
import einops as E
vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").cuda()
ret = vit.forward_features(image)
patch_tok = ret["x_norm_patchtokens"]
cls_tok = ret["x_norm_clstoken"]
_, _, img_h, img_w = image.shape
patch_h, patch_w = img_h // 14, img_w // 14  # integer grid size; image sides must be multiples of 14
patch_tok = E.rearrange(patch_tok, "b (h w) c -> b c h w", h=patch_h, w=patch_w)
cls_tok = cls_tok[:, :, None, None].repeat(1, 1, patch_h, patch_w)  # broadcast cls token over the grid
output = torch.cat((patch_tok, cls_tok), dim=1)

Depth estimation
The paper states that they bilinearly upsample the features by a factor of 4 and then apply a linear layer. This leaves a resolution discrepancy of 3.5x. I tackled this by simply upsampling again to match the depth resolution. The linear layer is a simple 1x1 convolution applied to the grid that maps the features to a 256-dimensional vector representing the probabilities for each of the depth bins. I then apply the AdaBins uniform-bin baseline, which computes 256 depth values, one for each bin. The inner product of those two vectors is the output value. It is worth noting that both AdaBins and BinsFormer use adaptive bins for some minor performance gain; however, the difference in performance caused by bin choice is much smaller than the gap I observe relative to the paper.

Loss
This is where things get a bit confusing. The paper seems to suggest that they use BinsFormer with uniform bin size and 256 bins, as noted above. This is typically trained with the scale-invariant depth loss: the model estimates depth and the loss is then applied to that estimate. Using a classification loss, while possible, seemed like an odd choice. In that case, one would discretize the depth to 256 bins (I used a range of 0-10m) and then apply a cross-entropy loss. I tried both losses, and the scale-invariant loss does better.

Optimization
I used AdamW (default parameters) with a cosine schedule for learning rate decay. I split the training data randomly at the level of room types with a train-val split of 0.7:0.3. I trained for 20 epochs. Training for 100 epochs didn't seem to help much.

As I noted, I have tried several different variants and none of them could achieve the performance reported in the paper. I would greatly appreciate any feedback from the authors, either sharing their implementation or suggesting what might be different between the setup I described above and the setup used in the paper. Thank you! |
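The per-pixel head and loss described in the comment above can be sketched in plain Python for a single pixel. This is an illustration under stated assumptions, not the paper's or the commenter's implementation: it assumes a softmax turns the 256 logits into bin probabilities, that each uniform bin is represented by its center over a 0-10 m range, and that the scale-invariant loss takes the square-root form with a variance weight `lam` (0.85 is the AdaBins default as far as I know). All function names are hypothetical.

```python
import math


def uniform_bin_centers(d_min, d_max, n_bins):
    """Centers of n_bins equal-width depth bins covering [d_min, d_max]."""
    width = (d_max - d_min) / n_bins
    return [d_min + (i + 0.5) * width for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of logits (assumed normalization)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, centers):
    """Inner product of bin probabilities and bin centers for one pixel."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss over pixels with valid (positive) ground truth."""
    g = [math.log(p + eps) - math.log(t + eps)
         for p, t in zip(pred, target) if t > 0]
    if not g:
        raise ValueError("no valid ground-truth pixels")
    mean = sum(g) / len(g)
    mean_sq = sum(x * x for x in g) / len(g)
    # max(...) guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


centers = uniform_bin_centers(0.0, 10.0, 256)
flat_prediction = expected_depth([0.0] * 256, centers)  # uniform probabilities
```

With `lam=1.0` the loss ignores a global scale factor between prediction and target, which is one reason it behaves differently from a cross-entropy loss on discretized depth.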
Hi @mbanani, thanks for sharing research details. I also concentrate on depth estimation task based on dinov2 backbone and obtained an unexpected result. Above is my experience and opinion, thank you |
@
when a trained depth estimation head could be released? |
Same request here. I would really appreciate it if a depth estimation head were made available. |
same here. |
Hi folks, Just added support for DPT + DINOv2 in 🤗 Transformers: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/DPT_inference_notebook_(depth_estimation).ipynb. We've extended the DPT model (which is one of the best depth estimation decoders) to now also leverage DINOv2 as backbone. It can be created as follows:
Transferred all checkpoints to the hub: https://huggingface.co/models?pipeline_tag=depth-estimation&other=dinov2&sort=trending. |
@NielsRogge thanks for the support! Question ~ if I already have DINOv2 embeddings extracted, is there a way for me to run them through the depth estimation portion only? |
Hi @palol, yes that's possible, you could do it as follows:
Note that the |
@NielsRogge thanks for the solution. So this means that enough of the backbone has to be preserved to follow the "lin. 4" protocol. Do you have any support for the "lin. 1" protocol, that only uses the last layer of the frozen transformer? |