Yuecheng Liu1, Junda Cheng1†, Longliang Liu1,2, Wenjing Liao1,2, Hanrui Cheng1,2, Yuzhou Wang1, Xin Yang1,3
†Corresponding Author
1HUST, 2Carizon, 3Optics Valley Laboratory
- [2026.05.18] 🤗🤗🤗 Evaluation datasets released on Hugging Face.
- [2026.05.16] 🤗🤗🤗 Hugging Face Gradio demos released.
- [2026.05.16] Add GPU memory adjustment schemes for inference and training.
- [2026.05.15] 🤗🤗🤗 Pre-trained weights released on Hugging Face.
- [2026.05.14] Add run_video_pointcloud for point cloud reconstruction.
- [2026.05.09] 🔥🔥🔥 GemDepth is out! It effectively recovers fine-grained details and achieves better 3D temporal consistency.
Welcome to the official repository for GemDepth!
GemDepth is a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency.
GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios.
Comparisons with state-of-the-art methods across four of the most widely used benchmarks.
git clone https://github.com/Yuechengliu919/GemDepth
cd GemDepth
conda create -n gemdepth python=3.10
conda activate gemdepth
pip install -r requirements.txt

| Model | Link |
|---|---|
| GemDepth | Download 🤗 |
The final structure should look like this:
GemDepth
├── checkpoint/
│   └── gemdepth.pth
├── configs/
├── model/
├── ...
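Before running inference, it can help to confirm that the weights sit where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `find_checkpoint` is hypothetical, and it assumes only the `checkpoint/gemdepth.pth` path shown in the tree.

```python
from pathlib import Path


def find_checkpoint(repo_root="."):
    """Return the path to checkpoint/gemdepth.pth under repo_root.

    Hypothetical helper (not part of GemDepth); it only checks the
    layout shown in the directory tree above.
    """
    path = Path(repo_root) / "checkpoint" / "gemdepth.pth"
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected pre-trained weights at {path}; "
            "download them and place them in the checkpoint/ folder."
        )
    return path
```

Calling `find_checkpoint()` from the repository root returns the path if the file is present and raises `FileNotFoundError` with a hint otherwise.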
import torch
from model.gemdepth import GemDepth
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
}
gemdepth = GemDepth(**model_configs[args.encoder])
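# Load the pre-trained weights on the CPU first, then move the model to DEVICE.
checkpoint = torch.load("./checkpoint/gemdepth.pth", map_location='cpu', weights_only=False)
gemdepth.load_state_dict(checkpoint, strict=True)
gemdepth = gemdepth.to(DEVICE).eval()
# args, video_path, and read_video_frames are not defined in this snippet;
# they are assumed to come from the inference script (e.g. evaluation/inference/run_video.py).
frames, target_fps = read_video_frames(video_path, args.max_len, args.target_fps, 1280)
depths, fps = gemdepth.infer_video_depth(frames, target_fps, input_size=args.input_size, device=DEVICE, fp32=args.fp32)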
# Only video depth output
python evaluation/inference/run_video.py --input_dir ./assets/example_videos --output_dir ./assets/example_result
# video depth & pointcloud output
python evaluation/inference/run_video_pointcloud.py --input_dir ./assets/example_videos --output_dir ./assets/example_result

Tips: If GPU memory is insufficient, you can adjust the inference settings in model/gemdepth.py. The default settings are:
INFER_LEN = 32
OVERLAP = 10
KEYFRAMES = [0, 12, 24, 25, 26, 27, 28, 29, 30, 31]
INTERP_LEN = 8

These settings require about 44 GB of GPU memory. You can reduce them as follows:
INFER_LEN = 16
OVERLAP = 6
KEYFRAMES = [0, 6, 12, 13, 14, 15]
INTERP_LEN = 4

These settings require about 25 GB of GPU memory. Alternatively:
INFER_LEN = 8
OVERLAP = 4
KEYFRAMES = [0, 3, 6, 7]
INTERP_LEN = 2

These settings require about 15 GB of GPU memory. You can adjust these parameters according to your available GPU memory.
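The three configurations above can be collected into a small lookup so that a preset is chosen from a memory budget. This is only an illustrative sketch: `PRESETS`, `pick_preset`, and `follows_listed_pattern` are hypothetical names that do not exist in the repository, and the memory figures are the approximate values quoted above, not measurements.

```python
# Presets copied from the three configurations listed above; the memory
# figures are the approximate values quoted in this README.
PRESETS = [
    {"approx_gb": 44, "INFER_LEN": 32, "OVERLAP": 10,
     "KEYFRAMES": [0, 12, 24, 25, 26, 27, 28, 29, 30, 31], "INTERP_LEN": 8},
    {"approx_gb": 25, "INFER_LEN": 16, "OVERLAP": 6,
     "KEYFRAMES": [0, 6, 12, 13, 14, 15], "INTERP_LEN": 4},
    {"approx_gb": 15, "INFER_LEN": 8, "OVERLAP": 4,
     "KEYFRAMES": [0, 3, 6, 7], "INTERP_LEN": 2},
]


def pick_preset(available_gb):
    """Return the largest listed preset that fits in available_gb, or None.

    Hypothetical helper: it only compares against the approximate
    figures above and does not measure actual memory use.
    """
    for preset in PRESETS:  # ordered from largest to smallest
        if preset["approx_gb"] <= available_gb:
            return preset
    return None


def follows_listed_pattern(preset):
    """Check two regularities visible in the listed presets (an observation,
    not a documented rule): every keyframe index lies inside the window, and
    the number of keyframes equals OVERLAP."""
    keyframes = preset["KEYFRAMES"]
    return (
        all(0 <= k < preset["INFER_LEN"] for k in keyframes)
        and len(keyframes) == preset["OVERLAP"]
    )
```

For example, `pick_preset(24)` returns the 15 GB configuration, whose values would then be copied by hand into model/gemdepth.py. If you write a custom configuration, `follows_listed_pattern` offers a quick check that it resembles the listed ones, with no guarantee that other combinations are supported.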
We provide an interactive Gradio interface for you to easily test GemDepth on your own videos without writing any code.
pip install -r demo/requirements.txt
python demo/app.py

Our Gradio-based interface allows you to upload videos, run video depth prediction and point cloud reconstruction, and interactively explore the 3D scene in your browser.
| Datasets | Link |
|---|---|
| Sintel | Download 🤗 |
| KITTI | Download 🤗 |
| Bonn | Download 🤗 |
| ScanNet | Download 🤗 |
You can download the evaluation datasets directly via the links above, or follow the preprocessing steps below.
Following VideoDepthAnything, download the raw datasets from the following links: Sintel, KITTI, Bonn, ScanNet.
pip install natsort
cd dataset/dataset_extract
python dataset_extract${dataset}.py

This script extracts the dataset to the dataset/dataset_extract/dataset folder and also generates the JSON file for the dataset.
python evaluation/inference/infer/infer.py \
--infer_path ${out_path} \
--json_file ${json_path} \
--datasets ${dataset}

Options:
- --infer_path: path to save the output results
- --json_file: path to the JSON file for the dataset, such as sintel_video.json, kitti_video_500.json, or scannet_video_tae.json
- --datasets: dataset name; choose from sintel, kitti, bonn, scannet
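When running inference over several datasets, the command above can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper, not part of the repository: `build_infer_command` and `DATASETS` are hypothetical names, and the wrapper uses only the three flags and four dataset names documented in the options above.

```python
# Dataset names accepted by --datasets, as listed in the options above.
DATASETS = ("sintel", "kitti", "bonn", "scannet")


def build_infer_command(infer_path, json_file, dataset):
    """Return the argv list for one inference run.

    Hypothetical wrapper around the command shown above; it adds no
    flags beyond --infer_path, --json_file, and --datasets.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; choose from {DATASETS}")
    return [
        "python", "evaluation/inference/infer/infer.py",
        "--infer_path", str(infer_path),
        "--json_file", str(json_file),
        "--datasets", dataset,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Rejecting unknown dataset names up front avoids starting a long run with a mistyped argument.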
## ~500 frames
python evaluation/eval/eval.py \
--infer_path ${pred_root} \
--benchmark_path ${benchmark_root} \
--datasets ${dataset}

To train GemDepth on a mix of datasets, run:
## stage1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage1
## stage2
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage2

Tips: If GPU memory is insufficient, you can adjust seq_len in the config file.
If you find our work useful in your research, please consider citing our paper:
@inproceedings{Liu2026GemDepthGF,
title={GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth},
author={Yuecheng Liu and Junda Cheng and Longliang Liu and Wenjing Liao and Hanrui Cheng and Yuzhou Wang and Xin Yang},
year={2026},
url={https://api.semanticscholar.org/CorpusID:288258595}
}
This project is based on VideoDepthAnything, VGGT, and DepthAnythingV2. We thank the original authors for their excellent work.

