<a href="https://colab.research.google.com/github/crhysc/jarvis-tools-notebooks/blob/master/cdvae_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inverse Design of Next-Generation Superconductors Using Data-Driven Deep Generative Models

# Tutorial: CDVAE, Crystal Diffusion Variational AutoEncoder



[Reference DOI](https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01260)

Authors: Charles "Rhys" Campbell (crc00042@mix.wvu.edu), Kamal Choudhary (kamal.choudhary@nist.gov)

# (1) INTRODUCTION AND MOTIVATION


# (2) INSTALLATION, CONFIGURATION, AND DEPENDENCIES


# Install Conda

In [1]:
!pip install -q condacolab
import condacolab, os, sys
condacolab.install()
print("Done")

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:17
🔁 Restarting kernel...
Done


# Install CDVAE

In [1]:
import os
%cd /content
if not os.path.exists('cdvae'):
  !git clone https://github.com/txie-93/cdvae.git
print("Done")

/content
Cloning into 'cdvae'...
remote: Enumerating objects: 197, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 197 (delta 24), reused 19 (delta 19), pack-reused 137 (from 1)
Receiving objects: 100% (197/197), 138.14 MiB | 18.83 MiB/s, done.
Resolving deltas: 100% (62/62), done.
Updating files: 100% (89/89), done.
Done


# Switch Colab Runtime to GPU
In the top menu next to the Colab logo, select **Runtime** -> **Change runtime type**, then choose any available GPU.

If this works, create the GPU-based conda environment.

If it fails because of usage limits, create the CPU-based conda environment instead.
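To confirm which branch applies, a quick check of the runtime can help. The sketch below is an assumption-laden helper, not part of CDVAE: it defines a hypothetical function `gpu_visible` that only tests whether the `nvidia-smi` tool is on the PATH and lists at least one device.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on PATH, so treat as CPU-only
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected" if gpu_visible() else "No GPU detected, use the CPU environment")
```

If the helper reports no GPU, follow the CPU instructions in the sections that follow.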



# Create **GPU**-based conda environment for CDVAE

#### Creating the **GPU** legacy env takes 7 minutes


In [None]:
%%time
%cd /content/cdvae
!mamba env create -p /usr/local/envs/cdvae_legacy -f env.yml
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install -c conda-forge "torchmetrics<0.8" --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install mkl=2024.0 --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    pip install "monty==2022.9.9"
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install -c conda-forge "pymatgen>=2022.0.8,<2023" --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    pip install -e .
print("Done")

In [None]:
!conda run -p /usr/local/envs/cdvae_legacy python -c "import sys; print(sys.version)"
# Confirms that the conda environment runs Python 3.8.*
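Beyond the Python version, the upper bounds pinned in the install cell can be spot-checked as well. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `installed_version` are hypothetical, and the comparison is a simplification that ignores pre-release suffixes. It is only meaningful when executed with the environment's own interpreter (for example, saved to a file and run through `conda run -p /usr/local/envs/cdvae_legacy python`).

```python
import re
from importlib import metadata  # standard library on Python 3.8+

# Upper bounds copied from the install cell: torchmetrics<0.8, pymatgen<2023.
UPPER_BOUNDS = {"torchmetrics": "0.8", "pymatgen": "2023"}


def version_tuple(version: str) -> tuple:
    """Turn '0.7.3' into (0, 7, 3), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


for name, bound in UPPER_BOUNDS.items():
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed in this interpreter")
    else:
        status = "ok" if version_tuple(found) < version_tuple(bound) else "too new"
        print(f"{name} {found} (needs <{bound}): {status}")
```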

# Create **CPU**-based conda environment for CDVAE

#### Creating the **CPU** legacy env takes 10 minutes


In [2]:
%%time
%cd /content/cdvae
!mamba env create -p /usr/local/envs/cdvae_legacy -f env.cpu.yml
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install -c conda-forge "torchmetrics<0.8" --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install mkl=2024.0 --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    pip install "monty==2022.9.9"
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    mamba install -c conda-forge "pymatgen>=2022.0.8,<2023" --yes
!conda run -p /usr/local/envs/cdvae_legacy --live-stream\
    pip install -e .
print("Done")

Streaming output truncated to the last 5000 lines.

pytorch-1.8.1        | 1.27 GB   | :  23% 0.22922791768668035/1 [00:14<01:53, 147.27s/it]
pytorch-1.8.1        | 1.27 GB   | :  23% 0.2310134244949461/1 [00:15<01:32, 119.73s/it]
pytorch-1.8.1        | 1.27 GB   | :  23% 0.23307454644811193/1 [00:15<01:13, 95.83s/it]
cudatoolkit-11.1.1   | 929.6 MB  | :  34% 0.3351571528579335/1 [00:15<00:50, 75.80s/it]
pytorch-1.8.1        | 1.27 GB   | :  24% 0.23583069789711272/1 [00:15<00:55, 72.57s/it]
pytorch-1.8.1        | 1.27 GB   | :  24% 0.23781992024726115/1 [00:15<00:50, 66.18s/it]
pytorch-1.8.1        | 1.27 GB   | :  24% 0.24016864061249663/1 [00:15<00:44, 59.11s/it]
pytorch-1.8.1        | 1.27 GB   | :  24% 0.24233761197018855/1 [00:15<00:41, 55.33s/it]
pytorch-1.8.1        | 1.27 GB   | :  25% 0.24503384708334153/1 [00:15<00:37, 49.16s/it]
pytorch-1.8.1        | 1.27 GB   | :  25% 0.24744248378442485/1 [00:15<00:35, 46.91s/it]
pytorch-1.8.1

# Install Dataset ETL dependencies


In [3]:
!conda run -p /usr/local/envs/cdvae_legacy \
    pip install pandas jarvis-tools

Collecting jarvis-tools
  Downloading jarvis_tools-2024.10.30-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting toolz>=0.9.0 (from jarvis-tools)
  Downloading toolz-1.0.0-py3-none-any.whl.metadata (5.1 kB)
Collecting xmltodict>=0.11.0 (from jarvis-tools)
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading jarvis_tools-2024.10.30-py2.py3-none-any.whl (4.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.2/4.2 MB 55.5 MB/s eta 0:00:00
Downloading toolz-1.0.0-py3-none-any.whl (56 kB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict, toolz, jarvis-tools
Successfully installed jarvis-tools-2024.10.30 toolz-1.0.0 xmltodict-0.14.2



# (3) DATASET ETL (Extract-Transform-Load)


# Download data pre-processor

The dataset is generated with this [script](https://github.com/JARVIS-Materials-Design/cdvae/blob/main/scripts/generate_data_cdvae.py), which lives in the JARVIS-Materials-Design repository. It compiles around 1000 structures and their superconducting critical temperatures into the train/validation/test CSV format that CDVAE requires for training.
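To make the target format concrete, here is a rough sketch of a train/val/test CSV export. It is not the JARVIS script: the column names (`material_id`, `cif`, `Tc_supercon`), the split fractions, and the helper names are all assumptions for illustration, so check `generate_data_cdvae.py` and `supercon.yaml` for the real ones. The toy records carry placeholder text instead of real CIF strings.

```python
import csv
import random
import tempfile
from pathlib import Path

# Assumed column names; confirm against generate_data_cdvae.py and supercon.yaml.
FIELDNAMES = ["material_id", "cif", "Tc_supercon"]


def split_records(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, carve out val and test, and use the rest for training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def write_splits(splits, out_dir):
    """Write one CSV per split (train.csv, val.csv, test.csv) into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out_dir / f"{name}.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
            writer.writeheader()
            writer.writerows(rows)


# Toy records with placeholder CIF text, for illustration only.
toy = [
    {"material_id": f"toy-{i}", "cif": "<CIF text>", "Tc_supercon": float(i)}
    for i in range(10)
]
splits = split_records(toy)
with tempfile.TemporaryDirectory() as tmp:
    write_splits(splits, tmp)
    print(sorted(path.name for path in Path(tmp).iterdir()))
```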

In [4]:
%cd /content/cdvae/scripts
!wget https://raw.githubusercontent.com/JARVIS-Materials-Design/cdvae/refs/heads/main/scripts/generate_data_cdvae.py

/content/cdvae/scripts
--2025-05-27 14:34:20--  https://raw.githubusercontent.com/JARVIS-Materials-Design/cdvae/refs/heads/main/scripts/generate_data_cdvae.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2947 (2.9K) [text/plain]
Saving to: ‘generate_data_cdvae.py’


2025-05-27 14:34:21 (37.2 MB/s) - ‘generate_data_cdvae.py’ saved [2947/2947]



# Run data pre-processor

In [5]:
!conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python generate_data_cdvae.py
print("Done")

Obtaining 3D dataset 76k ...
Reference:https://www.nature.com/articles/s41524-020-00440-1
Other versions:https://doi.org/10.6084/m9.figshare.6815699
100% 40.8M/40.8M [00:16<00:00, 2.52MiB/s]
Loading the zipfile...
Loading completed.
Using rest of the dataset except the test and val sets.
Done


# Move train/test/val data to the correct spot

In [6]:
%cd /content
%mkdir -p /content/cdvae/data/supercon
%mv /content/cdvae/scripts/train.csv /content/cdvae/data/supercon/
%mv /content/cdvae/scripts/val.csv /content/cdvae/data/supercon/
%mv /content/cdvae/scripts/test.csv /content/cdvae/data/supercon/
print("Done")

/content
Done
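Before training, it may be worth confirming that all three split files landed in the data directory. The helper below is a hypothetical convenience (the name `count_rows` is not part of CDVAE); it reports the number of data rows per split, or `None` for a missing file.

```python
import csv
from pathlib import Path


def count_rows(data_dir, splits=("train", "val", "test")):
    """Map each split name to its row count, or None if the CSV is missing."""
    counts = {}
    for split in splits:
        path = Path(data_dir) / f"{split}.csv"
        if not path.is_file():
            counts[split] = None
            continue
        with open(path, newline="") as handle:
            # DictReader consumes the header, so only data rows are counted.
            counts[split] = sum(1 for _ in csv.DictReader(handle))
    return counts


print(count_rows("/content/cdvae/data/supercon"))
```

A `None` entry means the corresponding `%mv` command did not find its source file, which usually indicates the pre-processor did not finish.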


# Pull the supercon Hydra config YAML from JARVIS

In [7]:
%cd /content/cdvae/conf/data/
!wget https://raw.githubusercontent.com/JARVIS-Materials-Design/cdvae/refs/heads/main/conf/data/supercon.yaml

/content/cdvae/conf/data
--2025-05-27 14:40:56--  https://raw.githubusercontent.com/JARVIS-Materials-Design/cdvae/refs/heads/main/conf/data/supercon.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1817 (1.8K) [text/plain]
Saving to: ‘supercon.yaml’


2025-05-27 14:40:56 (32.6 MB/s) - ‘supercon.yaml’ saved [1817/1817]



In [15]:
%cd /content/cdvae/
%rm -rf hydra_outputs/

/content/cdvae


# (4) TRAIN WITHOUT PROPERTY PREDICTOR

# If using **GPU**

In [43]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon

Traceback (most recent call last):
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 378, in _apply_overrides_to_config
    OmegaConf.update(cfg, key, value, merge=True)
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 725, in update
    root[key_] = {}
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 310, in __setitem__
    self._format_and_raise(
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/omegaconf/_utils.py", line 738, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/envs/cdvae_legacy/lib/python3.8/site-packages/omegaconf/_utils.py", line 716, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/envs/

# If using **CPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    train.pl_trainer.gpus=0

[2025-05-27 16:11:12,536][hydra.utils][INFO] - Instantiating <cdvae.pl_data.datamodule.CrystDataModule>
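The trailing `train.pl_trainer.gpus=0` argument is a Hydra command-line override: a dotted path into the nested config followed by a new value. The sketch below is a deliberately simplified stand-in for that idea, not Hydra's implementation; the function `apply_override` and the toy config values are assumptions. It shows why the path must start at a key that actually exists in the config.

```python
import copy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of config with one 'a.b.c=value' override applied."""
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    updated = copy.deepcopy(config)
    node = updated
    for key in keys[:-1]:
        if not isinstance(node.get(key), dict):
            raise KeyError(f"'{key}' is not a config section in '{override}'")
        node = node[key]
    if keys[-1] not in node:
        raise KeyError(f"'{keys[-1]}' is not an existing key in '{override}'")
    # Keep the toy parser small: only plain integers are converted.
    node[keys[-1]] = int(raw_value) if raw_value.isdigit() else raw_value
    return updated


# Toy config mirroring only the keys used in this notebook.
config = {"train": {"pl_trainer": {"gpus": 1}}}
print(apply_override(config, "train.pl_trainer.gpus=0"))

try:
    apply_override(config, "cfg.train.pl_trainer.gpus=0")
except KeyError as error:
    print("rejected:", error)
```

In this simplified model, a path with an extra leading segment such as `cfg.` is rejected because no top-level key of that name exists.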


# (5) TRAIN WITH PROPERTY PREDICTOR

# If using **GPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    model.predict_property=True

# If using **CPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    model.predict_property=True \
    train.pl_trainer.gpus=0
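The four training cells differ only in two optional overrides, so the command can also be assembled programmatically. This is a sketch under assumptions: the helpers `training_env` and `training_command` are hypothetical, and the paths and variable names are copied from the cells in this notebook. Overrides are appended after `cdvae.run` so that they reach the training script rather than `conda run`.

```python
import os

ENV_PATH = "/usr/local/envs/cdvae_legacy"
PROJECT_ROOT = "/content/cdvae"


def training_env() -> dict:
    """Copy the current environment and add the variables used in this notebook."""
    env = dict(os.environ)
    env.update({
        "PROJECT_ROOT": PROJECT_ROOT,
        "HYDRA_JOBS": f"{PROJECT_ROOT}/hydra_outputs",
        "HYDRA_FULL_ERROR": "1",
        "WABDB_DIR": f"{PROJECT_ROOT}/wandb_outputs",  # spelled as in the cells above
        "WANDB_ANONYMOUS": "allow",
    })
    return env


def training_command(use_gpu: bool = True, predict_property: bool = False) -> list:
    """Build the argument list; overrides always come after the module name."""
    command = [
        "conda", "run", "-p", ENV_PATH, "--live-stream",
        "python", "-u", "-m", "cdvae.run",
        "data=supercon", "expname=supercon",
    ]
    if predict_property:
        command.append("model.predict_property=True")
    if not use_gpu:
        command.append("train.pl_trainer.gpus=0")
    return command


print(" ".join(training_command(use_gpu=False, predict_property=True)))
# To launch from Python instead of a shell cell:
#   import subprocess
#   subprocess.run(training_command(use_gpu=False), env=training_env(), check=True)
```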

# (6) INFERENCE

# Reconstruction

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python scripts/evaluate.py --model_path MODEL_PATH --tasks recon
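`MODEL_PATH` is a placeholder for the directory that holds the trained checkpoint. As a rough aid, the sketch below searches the Hydra output folder for the most recently written `*.ckpt` file and returns its parent directory. The helper `latest_run_dir` is hypothetical, and it assumes that checkpoints are saved somewhere under `hydra_outputs`; verify the result before passing it to `evaluate.py`.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(hydra_jobs) -> Optional[Path]:
    """Return the folder of the newest *.ckpt under hydra_jobs, or None."""
    checkpoints = list(Path(hydra_jobs).rglob("*.ckpt"))
    if not checkpoints:
        return None
    newest = max(checkpoints, key=lambda path: path.stat().st_mtime)
    return newest.parent


run_dir = latest_run_dir("/content/cdvae/hydra_outputs")
print(run_dir if run_dir is not None else "No checkpoint found yet; train a model first.")
```

The printed directory can then be substituted for `MODEL_PATH` in the evaluation commands.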

# Generation

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python scripts/evaluate.py --model_path MODEL_PATH --tasks gen

# Optimization

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python scripts/evaluate.py --model_path MODEL_PATH --tasks opt

# (7) NEXT STEPS & REFERENCES