<a href="https://colab.research.google.com/github/crhysc/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/flowmm_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial**: FlowMM & FlowLLM



**Authors**: Charles "Rhys" Campbell (crc00042@mix.wvu.edu)

# TABLE OF CONTENTS

- Background and Central Goal
- Installation, Configuration, and Dependencies
- Dataset ETL
- Training
  - Manifolds
  - Unconditional Training
  - Conditional Training
- Inference
  - De Novo Generation / Unconditional Evaluation
  - Reconstruction / Conditional Evaluation
- Prerelaxation
- Prepare DFT
- Compute E above hull
- Compute corrected E above hull
- Compute Stable, Unique, and Novel (SUN) structures
- Next Steps & References

# (1) BACKGROUND AND CENTRAL GOAL


# Background
### FlowMM
**FlowMM** uses Riemannian flow matching to learn how to transform simple base noise into full periodic crystal structures by jointly modeling fractional atomic coordinates and lattice parameters on the manifold defined by crystal symmetries. It tackles both **Crystal Structure Prediction** (finding the stable arrangement for a known composition) and **De Novo Generation** (proposing entirely new materials), doing so with about three times fewer integration steps than comparable diffusion-based approaches.  

### FlowLLM
**FlowLLM** builds on FlowMM by swapping out the simple analytic noise prior for samples from a pretrained CrystalLLM (a LLaMA‐style model fine-tuned on crystal data). You generate initial “noisy” structures with the LLM, then use the same Riemannian flow-matching steps to refine those into accurate crystal geometries.


# Central Goal
Show readers how to install, train, and use FlowMM and FlowLLM.
  


# (2) INSTALLATION, CONFIGURATION, AND DEPENDENCIES


# Install Conda

In [None]:
!pip install -q condacolab
import condacolab, os, sys
condacolab.install()
print("Done")

**Note**: Colab and FlowMM pin different Python and CUDA versions. To work around this, most code in this notebook runs through `!conda run`, which starts a subprocess inside a conda environment that has the Python version FlowMM requires, independent of the Python version of the Colab kernel.
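
The same pattern can also be driven from Python rather than from `!` shell lines. The helper below is a minimal sketch, not part of FlowMM: the name `conda_run_cmd` is hypothetical, and it only assembles the argument list that the `!conda run -p ... --live-stream` cells in this notebook spell out by hand.

```python
import shlex


def conda_run_cmd(env_prefix, command, live_stream=True):
    """Build the argument list for `conda run -p <env_prefix> ... <command>`.

    `conda_run_cmd` is a hypothetical helper name used only for illustration.
    """
    cmd = ["conda", "run", "-p", env_prefix]
    if live_stream:
        # Stream the subprocess output as it is produced instead of buffering it.
        cmd.append("--live-stream")
    return cmd + list(command)


cmd = conda_run_cmd(
    "/usr/local/envs/flowmm_env",
    ["python", "-c", "import sys; print(sys.version)"],
)
# Show the equivalent shell line, safely quoted.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the environment exists.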

# Install FlowMM

In [None]:
import os
%cd /content
if not os.path.exists('flowmm'):
  !git clone https://github.com/facebookresearch/flowmm.git
print("Done")

# Load FlowMM submodules

In [None]:
%%bash
cd /content/flowmm
sed -i 's|git@github.com:bkmi/DiffCSP-official.git|https://github.com/bkmi/DiffCSP-official.git|' .gitmodules
sed -i 's|git@github.com:bkmi/cdvae.git|https://github.com/bkmi/cdvae.git|' .gitmodules
sed -i 's|git@github.com:facebookresearch/riemannian-fm.git|https://github.com/facebookresearch/riemannian-fm.git|' .gitmodules
git submodule sync
git submodule update --init --recursive
echo "Done"
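
The `sed` lines above rewrite SSH-style submodule URLs to HTTPS so that cloning works without an SSH key. As an illustration of that rewrite rule only, here is a small Python sketch; `ssh_to_https` is a hypothetical name and the regular expression assumes the `git@host:owner/repo.git` form used in `.gitmodules`.

```python
import re

# Matches URLs of the form git@<host>:<path>, e.g. git@github.com:bkmi/cdvae.git
_SSH_URL = re.compile(r"^git@(?P<host>[^:]+):(?P<path>.+)$")


def ssh_to_https(url):
    """Convert an SSH-style git URL to HTTPS; return other URLs unchanged."""
    match = _SSH_URL.match(url.strip())
    if match is None:
        return url
    return f"https://{match.group('host')}/{match.group('path')}"


for url in (
    "git@github.com:bkmi/DiffCSP-official.git",
    "git@github.com:bkmi/cdvae.git",
    "git@github.com:facebookresearch/riemannian-fm.git",
):
    print(ssh_to_https(url))
```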

# Switch Colab Runtime to GPU
At the top menu by the Colab logo, select **Runtime** -> **Change runtime type** -> **Any GPU**    

It is not necessary to run on GPU, but the code will complete faster.
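
To check which kind of runtime is active before choosing between the GPU and CPU commands later on, one option is to probe for `nvidia-smi`. This is a sketch using only the standard library; `gpu_available` is a hypothetical helper, and it assumes that a working `nvidia-smi -L` implies a usable GPU.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` lists the GPUs visible to the driver.
        result = subprocess.run([exe, "-L"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU runtime" if gpu_available() else "CPU runtime")
```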



# Create conda environment for FlowMM
Creating the conda environment takes about 12 minutes.


In [None]:
%%time
%cd /content/flowmm
!mamba env create -p /usr/local/envs/flowmm_env -f environment.yml
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    pip install -e .
print("Done")

In [None]:
!conda run -p /usr/local/envs/flowmm_env python -c "import sys; print(sys.version)"
# proves that conda is running python 3.9.*

# Install Other dependencies


# (3) DATASET ETL (Extract-Transform-Load)


# Download data pre-processor

Data was generated using this [script](https://github.com/crhysc/utilities/blob/main/supercon_preprocess.py). It compiles a set of around 1000 structures and their superconducting critical temperatures into the format required for FlowMM training.
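
The later cells expect separate `train.csv`, `val.csv`, and `test.csv` files. How the preprocessing script performs that split is not shown here, so the following is only a sketch of a reproducible split: the 80/10/10 fractions, the seed, and the name `split_records` are all assumptions made for illustration.

```python
import random


def split_records(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle records with a fixed seed and split them into train/val/test.

    The fractions and seed are illustrative defaults, not values taken
    from the preprocessing script.
    """
    items = list(records)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


# Stand-in for roughly 1000 structure records.
train, val, test = split_records(range(1000))
print(len(train), len(val), len(test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.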

In [None]:
!wget https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon_preprocess.py

# Run data pre-processor

In [None]:
!conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python supercon_preprocess.py
print("Done")

# Move train/test/val data to the correct spot

In [None]:
%cd /content
%mkdir /content/cdvae/data/supercon
%mv /content/cdvae/scripts/train.csv /content/cdvae/data/supercon/
%mv /content/cdvae/scripts/val.csv /content/cdvae/data/supercon/
%mv /content/cdvae/scripts/test.csv /content/cdvae/data/supercon/
print("Done")
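
The `%mkdir` and `%mv` magics above stop with an error if the target directory already exists or a file is missing, which makes the cell awkward to re-run. A re-runnable alternative can be sketched in Python; `move_splits` is a hypothetical helper, and the file names are the ones used in the cell above.

```python
import shutil
from pathlib import Path


def move_splits(src_dir, dst_dir, names=("train.csv", "val.csv", "test.csv")):
    """Move the split files that exist in src_dir into dst_dir.

    Creates dst_dir (and parents) if needed and returns the names moved.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        source = src / name
        if source.exists():
            shutil.move(str(source), str(dst / name))
            moved.append(name)
    return moved


# Example call, using the paths from the cell above:
# move_splits("/content/cdvae/scripts", "/content/cdvae/data/supercon")
```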

# Pull the supercon Hydra config YAML from JARVIS

**NOTE**: Each dataset that you want to use with CDVAE needs its own YAML config file (here `supercon.yaml`) located in `cdvae/conf/data/`.
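
Because a missing data config only surfaces once training is launched, it can help to check for the file first. The sketch below assumes the `<project root>/conf/data/<dataset>.yaml` layout described in the note; `data_config_path` and `require_data_config` are hypothetical helper names.

```python
from pathlib import Path


def data_config_path(project_root, dataset):
    """Return the expected location of a dataset's Hydra config file."""
    return Path(project_root) / "conf" / "data" / f"{dataset}.yaml"


def require_data_config(project_root, dataset):
    """Return the config path, or raise if the file does not exist."""
    path = data_config_path(project_root, dataset)
    if not path.is_file():
        raise FileNotFoundError(f"Missing Hydra data config: {path}")
    return path


print(data_config_path("/content/cdvae", "supercon"))
```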

In [None]:
%cd /content/cdvae/conf/data/
!wget https://raw.githubusercontent.com/JARVIS-Materials-Design/cdvae/refs/heads/main/conf/data/supercon.yaml

# (4) TRAIN WITHOUT PROPERTY PREDICTOR

# If using **GPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    model.num_noise_level=2 \
    data.train_max_epochs=2

# If using **CPU**

The only difference is that this command includes a Hydra command-line override that sets the number of GPUs to zero instead of one (the CDVAE default).

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    model.num_noise_level=2 \
    data.train_max_epochs=2 \
    train.pl_trainer.gpus=0

# (5) TRAIN WITH PROPERTY PREDICTOR

**NOTE**: The only difference between training with and without a property predictor is the `model.predict_property=True` Hydra override.
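
Since the four training commands in this notebook differ only in two overrides, the variations can be generated from two switches. This is a minimal sketch: `training_overrides` is a hypothetical helper, and the override strings are the ones that appear in the training cells.

```python
def training_overrides(dataset, expname, use_gpu=True, predict_property=False, extra=()):
    """Assemble Hydra command-line overrides for a training run."""
    overrides = [f"data={dataset}", f"expname={expname}"]
    if predict_property:
        # Train the property predictor alongside the generative model.
        overrides.append("model.predict_property=True")
    if not use_gpu:
        # Override the trainer's GPU count for CPU-only runtimes.
        overrides.append("train.pl_trainer.gpus=0")
    overrides.extend(extra)
    return overrides


cpu_args = training_overrides(
    "supercon",
    "supercon",
    use_gpu=False,
    predict_property=True,
    extra=["model.num_noise_level=2", "data.train_max_epochs=2"],
)
print(" ".join(["python", "-u", "-m", "cdvae.run", *cpu_args]))
```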

# If using **GPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    model.num_noise_level=2 \
    data.train_max_epochs=2 \
    model.predict_property=True

# If using **CPU**

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 HYDRA_FULL_ERROR=1 \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 WANDB_ANONYMOUS=allow \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python -u -m cdvae.run data=supercon expname=supercon \
    data.train_max_epochs=2 \
    model.num_noise_level=2 \
    model.predict_property=True \
    train.pl_trainer.gpus=0

# (6) INFERENCE

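Trained models are saved under `/content/cdvae/hydra_outputs/singlerun/YYYY-MM-DD/supercon/`, where `YYYY-MM-DD` is the date of the training run. The cells below use `2025-05-27`; replace it with the date of your own run, otherwise the commands will not find the model.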
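
Rather than editing the date by hand, the most recent run directory can be looked up. The sketch below assumes the `singlerun/YYYY-MM-DD/<expname>` layout described above; `latest_run_dir` is a hypothetical helper, and it relies on ISO-formatted dates sorting correctly as plain strings.

```python
import re
from pathlib import Path

_DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def latest_run_dir(singlerun_root, expname):
    """Return the newest <root>/<YYYY-MM-DD>/<expname> directory, or None."""
    root = Path(singlerun_root)
    if not root.is_dir():
        return None
    dated = sorted(
        p.name for p in root.iterdir() if p.is_dir() and _DATE_DIR.match(p.name)
    )
    # Walk from the newest date backwards until the experiment is found.
    for day in reversed(dated):
        candidate = root / day / expname
        if candidate.is_dir():
            return candidate
    return None


print(latest_run_dir("/content/cdvae/hydra_outputs/singlerun", "supercon"))
```

The returned path can then be passed as `--model_path` in the evaluation commands.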

# Reconstruction

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks recon

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_recon.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)
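
Pretty-printing a payload that holds large tensors can produce very long output. As an alternative, the nested structure can be summarised by key, type, and shape. The layout of the saved evaluation files is not documented here, so this sketch makes no assumption about specific keys; `summarize` is a hypothetical helper that treats any object with a `shape` attribute as an array.

```python
def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing a nested container without its values."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensors and arrays: report the type and shape only.
        return [f"{pad}{type(obj).__name__} shape={tuple(shape)}"]
    if isinstance(obj, dict):
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key}:")
            if depth + 1 < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        return [f"{pad}{type(obj).__name__} len={len(obj)}"]
    return [f"{pad}{type(obj).__name__}"]


# Small stand-in payload; with a real file, pass the object returned by torch.load.
example = {"a": [1, 2], "b": {"c": 1.0}}
print("\n".join(summarize(example)))
```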

# Generation

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks gen

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_gen.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)

# Optimization

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks opt

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_opt.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)

# (7) EVALUATION

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WANDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/compute_metrics.py \
    --root_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks recon gen opt

# (8) NEXT STEPS & REFERENCES

## Next Steps

1. **Hyperparameter exploration**  
   - Try different numbers of noise levels (`model.num_noise_level`) and training epochs to improve sample quality.

2. **Property-conditioned generation**  
   - Re-enable the property predictor (`model.predict_property=True`) and train with longer schedules to improve prediction accuracy.
   - After training, sample structures by specifying a target critical temperature and evaluate via DFT or empirical models.
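
For the hyperparameter exploration in step 1, the combinations of overrides can be enumerated up front. This is a sketch only: `override_grid` is a hypothetical helper, and the candidate values below are placeholders rather than recommended settings.

```python
from itertools import product


def override_grid(options):
    """Expand {override_key: [values]} into a list of Hydra override lists."""
    keys = list(options)
    return [
        [f"{key}={value}" for key, value in zip(keys, values)]
        for values in product(*(options[key] for key in keys))
    ]


# Placeholder search space; adjust the values to your compute budget.
grid = override_grid(
    {
        "model.num_noise_level": [2, 10, 50],
        "data.train_max_epochs": [2, 20],
    }
)
for overrides in grid:
    print(" ".join(overrides))
```

Each inner list can be appended to the training command, ideally with a distinct `expname` per run so that outputs do not overwrite each other.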


---

## References

- **Original CDVAE paper:**  
  Li _et al._, “Crystal Diffusion Variational Autoencoder for Inverse Materials Design,” _J. Phys. Chem. Lett._ 2023, DOI: [10.1021/acs.jpclett.3c01260](https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01260)

- **CDVAE GitHub repo:**  
  https://github.com/txie-93/cdvae

- **JARVIS-Materials-Design:**  
  https://github.com/JARVIS-Materials-Design/jarvis

- **Hydra configuration framework:**  
  https://hydra.cc

- **PyTorch Lightning:**  
  https://www.pytorchlightning.ai

- **condacolab:**  
  https://github.com/conda-incubator/condacolab

- **Mamba (fast conda):**  
  https://github.com/mamba-org/mamba

- **Jarvis-tools (data ETL):**  
  https://github.com/JARVIS-Materials-Design/jarvis-tools
