A PyTorch implementation of the text-to-3D model DreamFusion, powered by the Stable Diffusion text-to-2D model.

ADVERTISEMENT: Please check out threestudio for recent improvements and better implementation in 3D content generation!

NEWS (2023.6.12):


Colab notebooks:

  • Instant-NGP backbone (-O): Instant-NGP Backbone

  • Vanilla NeRF backbone (-O2): Vanilla Backbone

Important Notice

This project is a work in progress and differs from the paper in many ways. The current generation quality does not match the results from the original paper, and many prompts still fail badly!

Notable differences from the paper

  • Since the Imagen model is not publicly available, we replace it with Stable Diffusion (implementation from diffusers). Unlike Imagen, Stable Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, the loss also needs to propagate back through the VAE's encoder, which adds extra training time.
  • We use the multi-resolution grid encoder to implement the NeRF backbone (implementation from torch-ngp), which enables much faster rendering (~10FPS at 800x800).
  • We use the Adan optimizer as default.
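To make the first point concrete, here is a minimal sketch of a score-distillation style update whose gradient is pushed back through an encoder into image space. It is not this repository's code: `toy_encoder`, `toy_noise_predictor`, `target_latent`, and the weighting `1 - alpha_bar` are all assumptions chosen so the example runs with the standard library only. The real pipeline uses Stable Diffusion's VAE and UNet.

```python
import math
import random


def toy_encoder(image, scale=0.5):
    """Hypothetical stand-in for the VAE encoder: average pixel pairs, then scale."""
    return [scale * (image[i] + image[i + 1]) / 2 for i in range(0, len(image), 2)]


def toy_encoder_backward(grad_latent, scale=0.5):
    """Chain rule through toy_encoder: each pixel gets scale/2 of its latent's gradient."""
    grad_image = []
    for g in grad_latent:
        grad_image += [g * scale / 2, g * scale / 2]
    return grad_image


def toy_noise_predictor(noisy_latent, alpha_bar, target_latent):
    """Hypothetical denoiser that assumes the clean latent is target_latent."""
    return [
        (zt - math.sqrt(alpha_bar) * tz) / math.sqrt(1 - alpha_bar)
        for zt, tz in zip(noisy_latent, target_latent)
    ]


def sds_image_gradient(image, target_latent, alpha_bar, rng):
    """Return w(t) * (predicted noise - true noise), back-propagated to the image."""
    latent = toy_encoder(image)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [
        math.sqrt(alpha_bar) * z + math.sqrt(1 - alpha_bar) * n
        for z, n in zip(latent, noise)
    ]
    noise_pred = toy_noise_predictor(noisy, alpha_bar, target_latent)
    weight = 1 - alpha_bar  # assumed weighting; other choices are possible
    grad_latent = [weight * (p - n) for p, n in zip(noise_pred, noise)]
    # The extra cost mentioned above: the gradient must also pass through the encoder.
    return toy_encoder_backward(grad_latent)


def optimise(image, target_latent, steps=200, lr=4.0, seed=0):
    """Plain gradient descent on the image using the gradient above."""
    rng = random.Random(seed)
    image = list(image)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)  # random noise level per step
        grad = sds_image_gradient(image, target_latent, alpha_bar, rng)
        image = [x - lr * g for x, g in zip(image, grad)]
    return image
```

In this toy setup the encoded image drifts toward `target_latent`, which plays the role of "what the diffusion prior wants to see". In the real setting the image comes from the NeRF renderer and the gradient continues into the NeRF parameters.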


git clone
cd stable-dreamfusion

Optional: create a python virtual environment

To avoid Python package conflicts, we recommend using a virtual environment, e.g., with conda or venv:

python -m venv venv_stable-dreamfusion
source venv_stable-dreamfusion/bin/activate # you need to repeat this step for every new terminal

Install with pip

pip install -r requirements.txt

Download pre-trained models

To use image-conditioned 3D generation, you need to download some pretrained checkpoints manually:

  • Zero-1-to-3 for diffusion backend. We use zero123-xl.ckpt by default, and it is hard-coded in guidance/
    cd pretrained/zero123
  • Omnidata for depth and normal prediction. These ckpts are hardcoded in
    mkdir pretrained/omnidata
    cd pretrained/omnidata
    # assume gdown is installed
    gdown '1Jrh-bRnJEjyMCS7f-WsaFlccfPjJPPHI&confirm=t' # omnidata_dpt_depth_v2.ckpt
    gdown '1wNxVO4vVbDEMEpnAi_jwQObf2MFodcBR&confirm=t' # omnidata_dpt_normal_v2.ckpt

To use DeepFloyd-IF, you need to accept the usage conditions on Hugging Face and log in with huggingface-cli login on the command line.

For DMTet, we port the pre-generated 32/64/128 resolution tetrahedron grids under tets. The 256 resolution one can be found here.

Build extension (optional)

By default, we use load to build the extension at runtime. We also provide the to build each extension:

cd stable-dreamfusion

# install all extension modules
bash scripts/

# if you want to install manually, here is an example:
pip install ./raymarching # install to python path (you still need the raymarching/ folder, since this only installs the built extension.)

Taichi backend (optional)

Use the Taichi backend for Instant-NGP. It achieves performance comparable to the CUDA implementation, and no CUDA build is required. Install Taichi with pip:

pip install -i taichi-nightly

Troubleshooting:

  • We assume you are working with the latest version of all dependencies. If you run into problems with a specific dependency, please try upgrading it first (e.g., pip install -U diffusers). If the problem persists, a bug report will be appreciated!
  • [F glutil.cpp:338] eglInitialize() failed Aborted (core dumped): this usually indicates a problem with the OpenGL installation. Try reinstalling the NVIDIA driver, or use nvidia-docker as suggested in #131 if you are using a headless server.
  • TypeError: xxx_forward(): incompatible function arguments: this happens when we have updated the CUDA source and you installed the extensions earlier. Try reinstalling the corresponding extension (e.g., pip install ./gridencoder).

Tested environments

  • Ubuntu 22 with torch 1.12 & CUDA 11.6 on a V100.


The first run will take some time to compile the CUDA extensions.

#### stable-dreamfusion setting

### Instant-NGP NeRF Backbone
# + faster rendering speed
# + less GPU memory (~16G)
# - need to build CUDA extensions (a CUDA-free Taichi backend is available)

## train with text prompt (with the default settings)
# `-O` equals `--cuda_ray --fp16`
# `--cuda_ray` enables instant-ngp-like occupancy grid based acceleration.
python --text "a hamburger" --workspace trial -O

# reduce stable-diffusion memory usage with `--vram_O`
# enable various vram savings (
python --text "a hamburger" --workspace trial -O --vram_O

# You can collect arguments in a file. You can override arguments by specifying them after `--file`. Note that quoted strings can't be loaded from .args files...
python --file scripts/res64.args --workspace trial_awesome_hamburger --text "a photo of an awesome hamburger"

# use CUDA-free Taichi backend with `--backbone grid_taichi`
python3 --text "a hamburger" --workspace trial -O --backbone grid_taichi

# choose stable-diffusion version (support 1.5, 2.0 and 2.1, default is 2.1 now)
python --text "a hamburger" --workspace trial -O --sd_version 1.5

# use a custom stable-diffusion checkpoint from hugging face:
python --text "a hamburger" --workspace trial -O --hf_key andite/anything-v4.0

# use DeepFloyd-IF for guidance (experimental):
python --text "a hamburger" --workspace trial -O --IF
python --text "a hamburger" --workspace trial -O --IF --vram_O # requires ~24G GPU memory

# we also support negative text prompt now:
python --text "a rose" --negative "red" --workspace trial -O

## after the training is finished:
# test (exporting 360 degree video)
python --workspace trial -O --test
# also save a mesh (with obj, mtl, and png texture)
python --workspace trial -O --test --save_mesh
# test with a GUI (free view control!)
python --workspace trial -O --test --gui

### Vanilla NeRF backbone
# + pure pytorch, no need to build extensions!
# - slow rendering speed
# - more GPU memory

## train
# `-O2` equals `--backbone vanilla`
python --text "a hotdog" --workspace trial2 -O2

# if CUDA OOM, try to reduce NeRF sampling steps (--num_steps and --upsample_steps)
python --text "a hotdog" --workspace trial2 -O2 --num_steps 64 --upsample_steps 0

## test
python --workspace trial2 -O2 --test
python --workspace trial2 -O2 --test --save_mesh
python --workspace trial2 -O2 --test --gui # not recommended, FPS will be low.

### DMTet finetuning

## use --dmtet and --init_with <nerf checkpoint> to finetune the mesh at higher resolution
python -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --init_with trial/checkpoints/df.pth

## init dmtet with a mesh to generate texture
# require install of cubvh: pip install git+
# remove --lock_geo to also finetune geometry, but performance may be bad.
python -O --text "a white bunny with red eyes" --workspace trial_dmtet_mesh --dmtet --iters 5000 --init_with ./data/bunny.obj --lock_geo

## test & export the mesh
python -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --test --save_mesh

## gui to visualize dmtet
python -O --text "a hamburger" --workspace trial_dmtet --dmtet --iters 5000 --test --gui

### Image-conditioned 3D Generation

## preprocess input image
# note: image-to-3D results depend on zero-1-to-3's capability. For best performance, the input image should contain a single front-facing object, have a square aspect ratio, and have <1024 pixel resolution. Check the examples under ./data.
# this exports `<image>_rgba.png`, `<image>_depth.png`, and `<image>_normal.png` to the directory containing the input image.
python <image>.png
python <image>.png --border_ratio 0.4 # increase border_ratio if the center object appears too large and results are unsatisfying.

## zero123 train
# pass in the processed <image>_rgba.png by --image and do NOT pass in --text to enable zero-1-to-3 backend.
python -O --image <image>_rgba.png --workspace trial_image --iters 5000

# if the image is not exactly front-view (elevation = 0), adjust default_polar (we use polar from 0 to 180 to represent elevation from 90 to -90)
python -O --image <image>_rgba.png --workspace trial_image --iters 5000 --default_polar 80

# by default we leverage monocular depth estimation to aid image-to-3d, but if you find the depth estimation inaccurate and harmful to the results, turn it off with:
python -O --image <image>_rgba.png --workspace trial_image --iters 5000 --lambda_depth 0

python -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --init_with trial_image/checkpoints/df.pth

## zero123 with multiple images
python -O --image_config config/<config>.csv --workspace trial_image --iters 5000

## render <num> images per batch (default 1)
python -O --image_config config/<config>.csv --workspace trial_image --iters 5000 --batch_size 4

# providing both --text and --image enables stable-diffusion backend (similar to make-it-3d)
python -O --image hamburger_rgba.png --text "a DSLR photo of a delicious hamburger" --workspace trial_image_text --iters 5000

python -O --image hamburger_rgba.png --text "a DSLR photo of a delicious hamburger" --workspace trial_image_text_dmtet --dmtet --init_with trial_image_text/checkpoints/df.pth

## test / visualize
python -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --test --save_mesh
python -O --image <image>_rgba.png --workspace trial_image_dmtet --dmtet --test --gui

### Debugging

# Can save guidance images for debugging purposes. These get saved in trial_hamburger/guidance.
# Warning: this slows down training considerably and consumes lots of disk space!
python --text "a hamburger" --workspace trial_hamburger -O --vram_O --save_guidance --save_guidance_interval 5 # save every 5 steps
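The `--file` and `-O` comments in the usage above can be illustrated with a small argparse sketch. This is not the repository's actual parser: `build_parser` and `parse_with_args_file` are hypothetical names, only a handful of flags are declared, and splitting the file on whitespace is an assumption. That assumption would also explain why quoted strings cannot be loaded from .args files, since a quoted prompt would be broken into separate tokens.

```python
import argparse


def build_parser():
    """A reduced, illustrative parser with only a few of the flags used above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", type=str, default=None, help="path to an .args file")
    parser.add_argument("--text", type=str, default=None)
    parser.add_argument("--workspace", type=str, default="workspace")
    parser.add_argument("--iters", type=int, default=10000)
    parser.add_argument("-O", action="store_true", help="shorthand for --cuda_ray --fp16")
    parser.add_argument("--cuda_ray", action="store_true")
    parser.add_argument("--fp16", action="store_true")
    return parser


def parse_with_args_file(argv):
    """Parse argv, merging in tokens from --file so that command-line values win."""
    parser = build_parser()
    opt = parser.parse_args(argv)
    if opt.file is not None:
        with open(opt.file) as f:
            # Naive whitespace split (assumed): no shell-style quoting is honoured.
            file_tokens = f.read().split()
        # File tokens come first, so later command-line values override them.
        opt = parser.parse_args(file_tokens + list(argv))
    if opt.O:
        # Expand the shorthand as described in the usage comments.
        opt.cuda_ray = True
        opt.fp16 = True
    return opt
```

For example, with an .args file containing `-O --iters 5000`, calling `parse_with_args_file(["--file", path, "--text", "a hamburger"])` yields `iters == 5000` with both `cuda_ray` and `fp16` enabled, while the prompt is still supplied on the command line.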

For example commands, check scripts.

For advanced tips and other developing stuff, check Advanced Tips.
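The `--default_polar` comment in the image-conditioned usage says that polar angles from 0 to 180 represent elevations from 90 to -90. Assuming the mapping is linear, the conversion is a one-liner. The helper names below are hypothetical and are not part of the repository.

```python
def polar_to_elevation(polar_deg):
    """Convert a polar angle in [0, 180] degrees to an elevation in [90, -90] degrees."""
    if not 0.0 <= polar_deg <= 180.0:
        raise ValueError("polar angle must be within [0, 180] degrees")
    return 90.0 - polar_deg


def elevation_to_polar(elevation_deg):
    """Inverse mapping: elevation in [-90, 90] degrees to polar in [180, 0] degrees."""
    if not -90.0 <= elevation_deg <= 90.0:
        raise ValueError("elevation must be within [-90, 90] degrees")
    return 90.0 - elevation_deg
```

Under this reading, `--default_polar 80` corresponds to a reference image taken from about 10 degrees above the horizon, and a perfectly front-facing image (elevation 0) corresponds to a polar angle of 90.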


Reproduce the paper CLIP R-precision evaluation

After the testing step in the usage section, a validation set containing projections from different angles is generated. Use it to test the R-precision between the prompt and the images (R=1).

python --text "a snake is flying in the sky" --workspace snake_HQ --latest ep0100 --mode depth --clip clip-ViT-B-16
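As a rough illustration of what R-precision with R=1 measures, the sketch below counts how often the true prompt is the top-1 match for a rendered view among a set of candidate prompts. It works on plain lists standing in for embeddings. The function names and the use of cosine similarity are assumptions, and in practice the embeddings would come from a CLIP model such as the one named in the command above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def r_precision_at_1(image_embeddings, text_embeddings, true_index):
    """Fraction of images whose most similar candidate text is the true prompt."""
    hits = 0
    for image in image_embeddings:
        sims = [cosine_similarity(image, text) for text in text_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == true_index)
    return hits / len(image_embeddings)


# Toy data: three candidate prompts and four rendered views (hypothetical vectors).
candidate_texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rendered_views = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.7, -0.2]]
score = r_precision_at_1(rendered_views, candidate_texts, true_index=0)
```

A higher score means more of the rendered views are recognisably about the prompt they were generated from.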


This work is based on an increasing list of amazing research works and open-source projects, thanks a lot to all the authors for sharing!

  • DreamFusion: Text-to-3D using 2D Diffusion

        author = {Poole, Ben and Jain, Ajay and Barron, Jonathan T. and Mildenhall, Ben},
        title = {DreamFusion: Text-to-3D using 2D Diffusion},
        journal = {arXiv},
        year = {2022},
  • Magic3D: High-Resolution Text-to-3D Content Creation

       title={Magic3D: High-Resolution Text-to-3D Content Creation},
       author={Lin, Chen-Hsuan and Gao, Jun and Tang, Luming and Takikawa, Towaki and Zeng, Xiaohui and Huang, Xun and Kreis, Karsten and Fidler, Sanja and Liu, Ming-Yu and Lin, Tsung-Yi},
       booktitle={IEEE Conference on Computer Vision and Pattern Recognition ({CVPR})},
  • Zero-1-to-3: Zero-shot One Image to 3D Object

        title={Zero-1-to-3: Zero-shot One Image to 3D Object},
        author={Ruoshi Liu and Rundi Wu and Basile Van Hoorick and Pavel Tokmakov and Sergey Zakharov and Carl Vondrick},
  • Perp-Neg: Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond

      title={Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond},
      author={Armandpour, Mohammadreza and Zheng, Huangjie and Sadeghian, Ali and Sadeghian, Amir and Zhou, Mingyuan},
      journal={arXiv preprint arXiv:2304.04968},
  • RealFusion: 360° Reconstruction of Any Object from a Single Image

        author = {Melas-Kyriazi, Luke and Rupprecht, Christian and Laina, Iro and Vedaldi, Andrea},
        title = {RealFusion: 360 Reconstruction of Any Object from a Single Image},
        year = {2023},
        url = {},
  • Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation

        title={Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation},
        author={Rui Chen and Yongwei Chen and Ningxin Jiao and Kui Jia},
        journal={arXiv preprint arXiv:2303.13873},
  • Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior

        title={Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior},
        author={Tang, Junshu and Wang, Tengfei and Zhang, Bo and Zhang, Ting and Yi, Ran and Ma, Lizhuang and Chen, Dong},
        journal={arXiv preprint arXiv:2303.14184},
  • Stable Diffusion and the diffusers library.

        title={High-Resolution Image Synthesis with Latent Diffusion Models},
        author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
        author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
        title = {Diffusers: State-of-the-art diffusion models},
        year = {2022},
        publisher = {GitHub},
        journal = {GitHub repository},
        howpublished = {\url{}}
  • The GUI is developed with DearPyGui.

  • Puppy image from :

  • Anya images from :


If you find this work useful, a citation will be appreciated via:

    Author = {Jiaxiang Tang},
    Year = {2022},
    Note = {},
    Title = {Stable-dreamfusion: Text-to-3D with Stable-diffusion}