VLM³: Vision Language Models Are Native 3D Learners

Model: Coming Soon!

Summary

We show that standard VLMs are native 3D learners. We propose VLM³, which, without complex data augmentations or any architecture/loss change, enables standard VLMs to:

Surpass SpatialRGPT on object-level 3D understanding (both qualitative and quantitative questions in SpatialRGPT-bench), without using extra encoders;
Match UnidepthV2 and Moge-2 on metric depth estimation, improving the accuracy of DepthLM from 0.84 to 0.9;
Surpass DKM and RoMa on pixel correspondence estimation;
Match DepthAnything3 and surpass VGGT on camera pose estimation.

VLM³ opens up a new paradigm for simple and scalable 3D learning. You no longer need to spend a year designing:

complex models with different backbones, prediction heads, and routings;
complex losses for different prediction heads, with balancing weights for the different losses;
complex data augmentations such as image cropping, rotation, translation, and appearance augmentation.

All you need to do is collect data, and scale the training with a standard VLM!

Our findings provide a new perspective on what is and is not necessary for 3D vision:

Large models, task-specific architectures, losses, data augmentations, and even the regression formulation that underlies most SOTA 3D expert vision models are not necessary conditions for effective 3D learning.
A generalist foundation model (VLM) with a unified output domain (text), combined with data scaling, is sufficient.

Method Overview

Given the input images, VLM³ first resizes them so that they all share the same focal length (e.g., 1000 pixels). This resolves camera ambiguity without adding extra VLM encoders or modules. To refer to an object or pixel, VLM³ simply uses text coordinates, with the pixel range normalized (e.g., to [0, 2000) or [0, 1000)) along both the horizontal and vertical axes. This requires no architecture change or marker rendering, which makes VLM³ much more flexible and scalable. The model is trained with a standard VLM architecture and text-based supervised fine-tuning (SFT).
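The two preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual preprocessing code: the helper names `focal_resize_shape` and `normalize_pixel` are hypothetical, a single focal length in pixels (square pixels) is assumed to be known for each image, and flooring to integers is an assumed rounding convention. The 1000-pixel focal length and the [0, 2000) range are the example values from the text.

```python
from typing import Tuple

TARGET_FOCAL_PX = 1000.0  # example target focal length from the text
COORD_RANGE = 2000        # example normalized range [0, 2000) from the text


def focal_resize_shape(width: int, height: int, focal_px: float,
                       target_focal_px: float = TARGET_FOCAL_PX) -> Tuple[int, int]:
    """Return the (width, height) to resize to so that the focal length
    becomes target_focal_px. Resizing an image by a factor s also scales
    its focal length in pixels by s, so s = target_focal_px / focal_px."""
    new_w = max(1, round(width * target_focal_px / focal_px))
    new_h = max(1, round(height * target_focal_px / focal_px))
    return new_w, new_h


def normalize_pixel(x: float, y: float, width: int, height: int,
                    coord_range: int = COORD_RANGE) -> Tuple[int, int]:
    """Map pixel (x, y) to integer coordinates in [0, coord_range).
    Flooring is an assumption; the released code may round differently."""
    norm_x = min(int(x / width * coord_range), coord_range - 1)
    norm_y = min(int(y / height * coord_range), coord_range - 1)
    return norm_x, norm_y


# A 1920x1080 image with a (hypothetical) 1500-pixel focal length.
new_w, new_h = focal_resize_shape(1920, 1080, focal_px=1500.0)
# The image centre, expressed in normalized coordinates.
norm_x, norm_y = normalize_pixel(960, 540, 1920, 1080)
print(new_w, new_h, norm_x, norm_y)
```

Because the coordinates are relative to image width and height, the same normalized values refer to the same point before and after the focal-length resize.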

Results

Contact

Zhipeng Cai, Meta Inc, homepage: https://zhipengcai.github.io/, email: czptc2h at gmail dot com.

Quickstart

Install transformers to run inference with our model. The version specifier is quoted so that the shell does not treat >= as an output redirection.

pip install "transformers>=5.4.0"

Since VLM³ keeps the architecture of the base model (Qwen3-vl-4B), it can be called for inference in the same way as the original VLM.

Using 🤗 Transformers to Chat

The following code snippet shows how to use the chat model with transformers:

from transformers import AutoModelForImageTextToText, AutoProcessor

# default: Load the model on the available device(s)
model = AutoModelForImageTextToText.from_pretrained(
    "facebook/VLM3-depth", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("facebook/VLM3-depth")

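# Query point in normalised [0, 2000] coordinates (illustrative values: the image centre)
norm_x, norm_y = 1000, 1000

messages = [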
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/facebookresearch/VLM3/blob/main/sample_data/depth.jpeg",
            },
            {
                "type": "text",
                "text": (
                    f"Given this image, how far is the point at coordinates ({norm_x}, {norm_y}) "
                    "from the camera? The coordinates are in normalised [0, 2000] format relative "
                    "to image width and height. Output the thinking process in <think> </think> "
                    "and final answer (the meter number only, without the unit) in <answer> </answer> tags."
                ),
            },
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Please check this Cookbook for detailed examples on how to use different checkpoints for different tasks.
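The prompt in the snippet above asks the model to put the final answer, in meters and without the unit, inside <answer> </answer> tags. A minimal sketch of extracting that number from the decoded text follows. The helper `parse_depth_meters` is hypothetical and not part of this repository, the sample string is made up for illustration, and the sketch assumes the tags survive decoding as plain text.

```python
import re
from typing import Optional

# Non-greedy match of whatever sits between an <answer> ... </answer> pair.
_ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def parse_depth_meters(text: str) -> Optional[float]:
    """Return the number inside the last <answer> pair, or None if the
    tags are missing or their content is not a valid number."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


# Made-up model output, used only to exercise the parser.
sample = "<think>The point lies on the far wall.</think> <answer>3.2</answer>"
depth = parse_depth_meters(sample)
print(depth)
```

With the Quickstart snippet, the call would be `parse_depth_meters(output_text[0])`, since `batch_decode` returns one string per input sequence.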

Citation

@article{cai2026vlm3,
    title={VLM³: Vision Language Models Are Native 3D Learners},
    author={Cai, Zhipeng and Liu, Zhuang and Xiong, Yunyang and Liu, Zechun and Chandra, Vikas and Shi, Yangyang},
    journal={arXiv preprint arXiv:xxxx.yyyy},
    year={2026},
}

Related projects

This work is largely motivated by our previous project, DepthLM:

@article{cai2025depthlm,
    title={DepthLM: Metric Depth from Vision Language Models},
    author={Cai, Zhipeng and Yeh, Ching-Feng and Hu, Xu and Liu, Zhuang and Meyer, Gregory and Lei, Xinjie and Zhao, Changsheng and Li, Shang-Wen and Chandra, Vikas and Shi, Yangyang},
    journal={arXiv preprint arXiv:2509.25413},
    year={2025},
}

License

VLM³ is FAIR CC-BY-NC licensed, as found in the LICENSE file.
