Yuecheng919/GemDepth


GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Yuecheng Liu1, Junda Cheng1†, Longliang Liu1,2, Wenjing Liao1,2, Hanrui Cheng1,2, Yuzhou Wang1, Xin Yang1,3

† Corresponding Author
1HUST, 2Carizon, 3Optics Valley Laboratory

If you like our project, please give us a star ⭐ on GitHub for the latest updates!

Project Page Model Paper

🤗 Demo Video

📢 News

  • [2026.05.18] 🤗🤗🤗 Evaluation datasets released on Hugging Face.
  • [2026.05.16] 🤗🤗🤗 Hugging Face Gradio demos released.
  • [2026.05.16] Add GPU memory adjustment schemes for inference and training.
  • [2026.05.15] 🤗🤗🤗 Pre-trained weights released on Hugging Face.
  • [2026.05.14] Add run_video_pointcloud for point cloud reconstruction.
  • [2026.05.09] 🔥🔥🔥 GemDepth is out! It effectively recovers fine-grained details and achieves better 3D temporal consistency.

👋 Introduction

Welcome to the official repository for GemDepth!

GemDepth is a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency.

GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios.
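For intuition only, the idea of alternating between spatial and temporal attention can be sketched generically in NumPy. This is not GemDepth's ASTT: the function names are hypothetical, there are no learned projections, and the geometric embeddings produced by GEM are omitted. The sketch only shows how one token tensor can be attended within each frame and then across frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention without learned projections (illustration only).

    tokens: (..., L, C). Information is mixed along the L axis.
    """
    scale = tokens.shape[-1] ** -0.5
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens

def alternating_block(x):
    """x: (T, N, C) = frames, tokens per frame, channels."""
    # Spatial step: each frame attends over its own N tokens.
    x = x + self_attention(x)
    # Temporal step: each token position attends over the T frames.
    xt = np.swapaxes(x, 0, 1)          # (N, T, C)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (T, N, C)

# Hypothetical toy input: 4 frames, 6 tokens per frame, 8 channels.
x = np.random.default_rng(0).normal(size=(4, 6, 8))
y = alternating_block(x)
print(y.shape)
```

The output keeps the input shape, so blocks of this kind can be stacked; a real model would add projections, multiple heads, normalization, and MLP layers.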

[Figure: network architecture]

📝 Benchmark performance

[Figure: benchmark comparison]

Comparisons with state-of-the-art methods across four of the most widely used benchmarks.

⏳ Usage

Preparation

git clone https://github.com/Yuechengliu919/GemDepth
cd GemDepth
conda create -n gemdepth python=3.10
conda activate gemdepth
pip install -r requirements.txt

Model weights

Model Link
GemDepth Download 🤗

The final structure should look like this:

GemDepth
├── checkpoint/
├──── gemdepth.pth
├── configs/
├── model/
├── ...

Use our model

import torch
from model.gemdepth import GemDepth

# NOTE: `args`, `video_path`, and `read_video_frames` are assumed to be
# defined by the calling script; they are not created in this snippet.
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
}
gemdepth = GemDepth(**model_configs[args.encoder])
# Load the weights on CPU first, then move the model to the target device.
checkpoint = torch.load("./checkpoint/gemdepth.pth", map_location='cpu', weights_only=False)
gemdepth.load_state_dict(checkpoint, strict=True)
gemdepth = gemdepth.to(DEVICE).eval()
frames, target_fps = read_video_frames(video_path, args.max_len, args.target_fps, 1280)
depths, fps = gemdepth.infer_video_depth(frames, target_fps, input_size=args.input_size, device=DEVICE, fp32=args.fp32)
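To inspect the result, the predicted depths can be mapped to 8-bit grayscale frames. The sketch below assumes `depths` is array-like with shape (T, H, W), which the snippet above does not confirm. The helper name `depth_to_uint8` is hypothetical and not part of the repository. It normalizes with a single min/max over the whole clip, so brightness does not flicker from frame to frame.

```python
import numpy as np

def depth_to_uint8(depths):
    """Map a (T, H, W) depth sequence to uint8 using one global min/max."""
    depths = np.asarray(depths, dtype=np.float32)
    d_min = float(depths.min())
    d_max = float(depths.max())
    scale = max(d_max - d_min, 1e-8)  # avoid division by zero on constant input
    normalized = (depths - d_min) / scale
    return np.round(normalized * 255.0).astype(np.uint8)

# Hypothetical stand-in for model output: 2 frames of 2x2 depth values.
demo = np.array([[[0.5, 1.0], [2.0, 4.5]],
                 [[1.5, 3.0], [0.5, 2.5]]])
vis = depth_to_uint8(demo)
print(vis.dtype, vis.shape)
```

Per-frame normalization would use each frame's own range instead, which tends to hide temporal inconsistencies in the visualization.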

Running script on video

# Only video depth output
python evaluation/inference/run_video.py --input_dir ./assets/example_videos --output_dir ./assets/example_result
# Video depth and point cloud output
python evaluation/inference/run_video_pointcloud.py --input_dir ./assets/example_videos --output_dir ./assets/example_result

Tips: If GPU memory is insufficient, you can adjust the inference settings in model/gemdepth.py. The default settings are:

INFER_LEN = 32
OVERLAP = 10
KEYFRAMES = [0, 12, 24, 25, 26, 27, 28, 29, 30, 31]
INTERP_LEN = 8

which require about 44GB GPU memory. You can reduce them as follows:

INFER_LEN = 16
OVERLAP = 6
KEYFRAMES = [0, 6, 12, 13, 14, 15]
INTERP_LEN = 4

which require about 25GB GPU memory, or:

INFER_LEN = 8
OVERLAP = 4
KEYFRAMES = [0, 3, 6, 7]
INTERP_LEN = 2

which require about 15GB GPU memory. You can adjust these parameters according to your GPU memory.
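When writing a custom preset, it may help to check it against the pattern the three presets above share. The sketch below encodes only what can be observed in those presets: the number of keyframes equals OVERLAP, the first keyframe is 0, and the last INTERP_LEN keyframes are the final frames of the window. Whether model/gemdepth.py actually requires these properties is an assumption. The helper names `check_preset` and `pick_preset` are hypothetical.

```python
# Presets copied from the text above; gpu_gb is the approximate figure quoted there.
PRESETS = [
    {"gpu_gb": 44, "INFER_LEN": 32, "OVERLAP": 10,
     "KEYFRAMES": [0, 12, 24, 25, 26, 27, 28, 29, 30, 31], "INTERP_LEN": 8},
    {"gpu_gb": 25, "INFER_LEN": 16, "OVERLAP": 6,
     "KEYFRAMES": [0, 6, 12, 13, 14, 15], "INTERP_LEN": 4},
    {"gpu_gb": 15, "INFER_LEN": 8, "OVERLAP": 4,
     "KEYFRAMES": [0, 3, 6, 7], "INTERP_LEN": 2},
]

def check_preset(p):
    """Return True if a preset follows the pattern shared by the listed presets."""
    kf = p["KEYFRAMES"]
    # The last INTERP_LEN frame indices of the inference window.
    tail = list(range(p["INFER_LEN"] - p["INTERP_LEN"], p["INFER_LEN"]))
    return (
        len(kf) == p["OVERLAP"]
        and kf == sorted(set(kf))          # strictly increasing, no duplicates
        and kf[0] == 0
        and kf[-p["INTERP_LEN"]:] == tail
    )

def pick_preset(available_gb):
    """Pick the largest listed preset whose quoted memory fits, else None."""
    for p in PRESETS:  # ordered from largest to smallest
        if p["gpu_gb"] <= available_gb:
            return p
    return None

print(all(check_preset(p) for p in PRESETS))
print(pick_preset(24)["INFER_LEN"])
```

The memory figures are approximate, so leaving some headroom when choosing a preset is advisable.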

Interactive Demo

We provide an interactive Gradio interface for you to easily test GemDepth on your own videos without writing any code.

pip install -r demo/requirements.txt
python demo/app.py

Our Gradio-based interface allows you to upload videos, run video depth prediction and pointcloud reconstruction, and interactively explore the 3D scene in your browser.

✏️ Training Data

✈️ Evaluation

Prepare Evaluation Datasets

Datasets Link
Sintel Download 🤗
KITTI Download 🤗
Bonn Download 🤗
ScanNet Download 🤗

You can directly download the evaluation datasets via the link above, or follow the preprocessing steps below.

Following VideoDepthAnything, download the raw datasets from the following links: Sintel, KITTI, Bonn, ScanNet.

pip install natsort
cd dataset/dataset_extract
python dataset_extract_${dataset}.py

This script extracts the dataset to the dataset/dataset_extract/dataset folder and generates the JSON file for the dataset.

Run inference

python evaluation/inference/infer/infer.py \
    --infer_path ${out_path} \
    --json_file ${json_path} \
    --datasets ${dataset}

Options:

  • --infer_path: path to save the output results
  • --json_file: path to the json file for the dataset, like sintel_video.json, kitti_video_500.json, scannet_video_tae.json
  • --datasets: dataset name, choose from sintel, kitti, bonn, scannet
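To run inference on several datasets in a row, the command can be assembled programmatically. The sketch below uses only the script path and the three flags listed above. The helper `build_infer_command` and the example paths are hypothetical placeholders. It prints the command rather than executing it.

```python
import shlex

INFER_SCRIPT = "evaluation/inference/infer/infer.py"  # script path from the command above
DATASETS = ("sintel", "kitti", "bonn", "scannet")       # names accepted by --datasets

def build_infer_command(dataset, json_file, infer_path):
    """Build the argument list for one inference run."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; choose from {DATASETS}")
    return [
        "python", INFER_SCRIPT,
        "--infer_path", infer_path,
        "--json_file", json_file,
        "--datasets", dataset,
    ]

# Hypothetical paths; replace them with your own locations.
cmd = build_infer_command("sintel", "path/to/sintel_video.json", "outputs/sintel")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.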

Run evaluation

## ~500 frames
python evaluation/eval/eval.py \
    --infer_path ${pred_root} \
    --benchmark_path ${benchmark_root} \
    --datasets ${dataset}

✈️ Training

To train GemDepth on mixed datasets, run:

## stage1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage1
## stage2
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage2

Tips: If GPU memory is insufficient, you can adjust seq_len in the config file.

✈️ Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{Liu2026GemDepthGF,
  title={GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth},
  author={Yuecheng Liu and Junda Cheng and Longliang Liu and Wenjing Liao and Hanrui Cheng and Yuzhou Wang and Xin Yang},
  year={2026},
  url={https://api.semanticscholar.org/CorpusID:288258595}
}

Acknowledgements

This project is based on VideoDepthAnything, VGGT, and DepthAnythingV2. We thank the original authors for their excellent work.

About

【ICML 2026】GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
