<a href="https://colab.research.google.com/github/crhysc/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/flowmm_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial**: FlowMM & FlowLLM



**Authors**: Charles "Rhys" Campbell (crc00042@mix.wvu.edu)

# TABLE OF CONTENTS

- Background and Central Goal
- Installation, Configuration, and Dependencies
- Dataset ETL
- Training
  - Manifolds
  - Unconditional Training
  - Conditional Training
- Inference
  - De Novo Generation / Unconditional Evaluation
  - Reconstruction / Conditional Evaluation
- Prerelaxation
- Prepare DFT
- Compute E above hull
- Compute corrected E above hull
- Compute Stable, Unique, and Novel (SUN) structures
- Next Steps & References

# (1) BACKGROUND AND CENTRAL GOAL


# Background
### FlowMM
**FlowMM** uses Riemannian flow matching to learn how to transform simple base noise into full periodic crystal structures by jointly modeling fractional atomic coordinates and lattice parameters on the manifold defined by crystal symmetries. It tackles both **Crystal Structure Prediction** (finding the stable arrangement for a known composition) and **De Novo Generation** (proposing entirely new materials), doing so with about three times fewer integration steps than comparable diffusion-based approaches.  
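
To make the flow-matching idea concrete for fractional coordinates, the sketch below builds the shortest periodic path between a noise sample and a target position on the flat torus [0, 1)^3. It is a minimal illustration under that assumption, not FlowMM's implementation: the function names are hypothetical, the target position is made up, and lattice parameters and atom types are left out.

```python
import random


def wrap(delta):
    """Map a coordinate difference to the shortest signed displacement in [-0.5, 0.5)."""
    return (delta + 0.5) % 1.0 - 0.5


def geodesic_velocity(x0, x1):
    """Constant velocity that carries x0 to x1 along the shortest periodic path."""
    return [wrap(b - a) for a, b in zip(x0, x1)]


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the path from x0 to x1, wrapped back into [0, 1)."""
    u = geodesic_velocity(x0, x1)
    return [(a + t * d) % 1.0 for a, d in zip(x0, u)]


# One training pair: uniform base noise and a hypothetical target position.
rng = random.Random(0)
x0 = [rng.random() for _ in range(3)]
x1 = [0.10, 0.50, 0.90]
t = rng.random()
x_t = interpolate(x0, x1, t)        # what a velocity network would receive as input
target = geodesic_velocity(x0, x1)  # what it would be trained to regress
print(x_t, target)
```

Wrapping the difference before interpolating is what keeps the path from crossing the whole unit cell when the two points sit on opposite sides of a periodic boundary.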

### FlowLLM
**FlowLLM** builds on FlowMM by swapping out the simple analytic noise prior for samples from a pretrained CrystalLLM (a LLaMA‐style model fine-tuned on crystal data). You generate initial “noisy” structures with the LLM, then use the same Riemannian flow-matching steps to refine those into accurate crystal geometries.


# Central Goal
Show readers how to install, train, and use FlowMM and FlowLLM.
  


# (2) INSTALLATION, CONFIGURATION, AND DEPENDENCIES


# Install Conda

In [1]:
!pip install -q condacolab
import condacolab, os, sys
condacolab.install()
print("Done")

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:14
🔁 Restarting kernel...
Done


**Note**: Colab and FlowMM pin different Python and CUDA versions. To work around this, most code in this notebook is run with `!conda run`, which starts a subprocess inside a conda environment that has the Python version FlowMM requires, independent of the Python version used by the Colab kernel.
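
The same pattern can be wrapped in a small helper when a command needs to be assembled from Python rather than typed into a `!` cell. This is only a sketch: `conda_run_command` is a hypothetical name, and the environment prefix is the one this notebook creates later.

```python
import shlex


def conda_run_command(env_prefix, *args):
    """Build a `conda run` command line that executes `args` inside `env_prefix`."""
    return ["conda", "run", "-p", env_prefix, "--live-stream", *args]


cmd = conda_run_command("/usr/local/envs/flowmm_env", "python", "--version")
print(shlex.join(cmd))
# To execute it from Python instead of a `!` cell (requires conda on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```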

# Install FlowMM

In [2]:
import os
%cd /content
if not os.path.exists('flowmm'):
  !git clone https://github.com/facebookresearch/flowmm.git
print("Done")

/content
Cloning into 'flowmm'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 205 (delta 23), reused 11 (delta 11), pack-reused 130 (from 1)
Receiving objects: 100% (205/205), 64.55 MiB | 16.26 MiB/s, done.
Resolving deltas: 100% (45/45), done.
Filtering content: 100% (14/14), 370.49 MiB | 77.36 MiB/s, done.
Done


# Load FlowMM submodules

In [1]:
%%bash
cd /content/flowmm
sed -i 's|git@github.com:bkmi/DiffCSP-official.git|https://github.com/bkmi/DiffCSP-official.git|' .gitmodules
sed -i 's|git@github.com:bkmi/cdvae.git|https://github.com/bkmi/cdvae.git|' .gitmodules
sed -i 's|git@github.com:facebookresearch/riemannian-fm.git|https://github.com/facebookresearch/riemannian-fm.git|' .gitmodules
git submodule sync
git submodule update --init --recursive
echo "Done"

Submodule path 'remote/DiffCSP-official': checked out '199539f8dbca31a3e08ae549b2876452ff5b4ead'
Submodule path 'remote/cdvae': checked out '5837952c3de298dc6ac41c600ba4cbb4b6d9b6ed'
Submodule path 'remote/riemannian-fm': checked out 'a90927909e7df7437a5895ff7174e7b356f8526e'
Done


Submodule 'remote/DiffCSP-official' (https://github.com/bkmi/DiffCSP-official.git) registered for path 'remote/DiffCSP-official'
Submodule 'remote/cdvae' (https://github.com/bkmi/cdvae.git) registered for path 'remote/cdvae'
Submodule 'remote/riemannian-fm' (https://github.com/facebookresearch/riemannian-fm.git) registered for path 'remote/riemannian-fm'
Cloning into '/content/flowmm/remote/DiffCSP-official'...
Cloning into '/content/flowmm/remote/cdvae'...
Cloning into '/content/flowmm/remote/riemannian-fm'...
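
The three `sed` commands above all perform the same substitution: an SSH-style GitHub URL becomes its HTTPS equivalent so that cloning works without SSH keys. A minimal Python sketch of that rewrite, with a hypothetical helper name, might look like this:

```python
import re

# Matches SSH-style GitHub remotes such as git@github.com:owner/repo.git
SSH_URL = re.compile(r"git@github\.com:(?P<path>\S+)")


def ssh_to_https(text):
    """Rewrite every git@github.com:<owner>/<repo> URL in `text` to its HTTPS form."""
    return SSH_URL.sub(r"https://github.com/\g<path>", text)


example = '[submodule "remote/cdvae"]\n\turl = git@github.com:bkmi/cdvae.git\n'
print(ssh_to_https(example))
```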


# Switch Colab Runtime to GPU
At the top menu by the Colab logo, select **Runtime** -> **Change runtime type** -> **Any GPU**    

It is not necessary to run on GPU, but the code will complete faster.
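
To confirm which kind of runtime is active, one option is to look for the `nvidia-smi` tool. The check below is a rough sketch that assumes a GPU runtime exposes `nvidia-smi` on the `PATH`; `gpu_visible` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` exists and lists at least one GPU, else False."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    # `nvidia-smi -L` prints one line per device, e.g. "GPU 0: ...".
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected" if gpu_visible() else "No GPU detected; running on CPU")
```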



# Create conda environment for FlowMM
Creating the conda environment takes about 20 minutes.


In [4]:
%%time
%cd /content/flowmm
!mamba env create -p /usr/local/envs/flowmm_env -f environment.yml
!conda run -p /usr/local/envs/flowmm_env --live-stream \
    pip install -e .
print("Done")

# Install Other dependencies


# (3) DATASET ETL (Extract-Transform-Load)


# Download data pre-processor

Data was generated using this [script](https://github.com/crhysc/utilities/blob/main/supercon_preprocess.py). It compiles a set of around 1000 structures and their superconducting critical temperatures into the format required for FlowMM training.

In [None]:
%cd /content/flowmm
!wget https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon_preprocess.py
%cat supercon_preprocess.py

# Run data pre-processor

In [None]:
!conda run -p /usr/local/envs/flowmm_env --live-stream \
    python supercon_preprocess.py
print("Done")

# Move train/test/val data to the correct spot

In [None]:
%cd /content
%mkdir -p /content/flowmm/data/supercon
%mv /content/flowmm/train.csv /content/flowmm/data/supercon/
%mv /content/flowmm/val.csv /content/flowmm/data/supercon/
%mv /content/flowmm/test.csv /content/flowmm/data/supercon/
print("Done")
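
Before training, it can help to verify that all three split files landed in the dataset directory. The following is a small sketch of such a check; `missing_splits` is a hypothetical helper, and the directory is the one created in the cell above.

```python
from pathlib import Path

SPLITS = ("train.csv", "val.csv", "test.csv")


def missing_splits(data_dir):
    """Return the split files that are absent from `data_dir`."""
    root = Path(data_dir)
    return [name for name in SPLITS if not (root / name).is_file()]


missing = missing_splits("/content/flowmm/data/supercon")
print("All splits present" if not missing else f"Missing: {missing}")
```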

# Pull the supercon Hydra config YAML from JARVIS

**NOTE**: Each dataset that you want to use with FlowMM needs its own config YAML located in `scripts_model/conf/data/`.

In [None]:
%cd /content/flowmm/scripts_model/conf/data/
!wget https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon.yaml
%cat supercon.yaml

# (4) TRAINING
# Manifolds


- FlowMM lets the user choose the manifolds via the keyword argument `model={atom_type_manifold}_{lattice_manifold}` when running `scripts_model/run.py`.

- The available atom-type and lattice manifold configurations can be found in `scripts_model/conf/model`.
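
As a rough illustration of the naming scheme, the sketch below splits a model name into its two parts and lists the config names found in the config directory. It assumes that the atom-type part contains no underscore and that configs are stored as `*.yaml` files directly in that directory; both helper names are hypothetical.

```python
from pathlib import Path


def split_model_name(name):
    """Split '{atom_type_manifold}_{lattice_manifold}' at the first underscore."""
    atom_type, _, lattice = name.partition("_")
    if not atom_type or not lattice:
        raise ValueError(f"expected '<atom_type>_<lattice>', got {name!r}")
    return atom_type, lattice


def available_models(conf_dir="scripts_model/conf/model"):
    """List config names (YAML file stems) in `conf_dir`; empty if none are found."""
    return sorted(p.stem for p in Path(conf_dir).glob("*.yaml"))


print(split_model_name("abits_params"))
print(available_models())
```

Run from `/content/flowmm`, `available_models()` gives the values that can be passed as `model=...`.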

# Unconditional Training

In [3]:
%cd /content/flowmm
!bash create_env_file.sh && \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m scripts_model.run data=supercon model=abits_params \
    train.pl_trainer.gpus=0

/content/flowmm



# Conditional Training

In [None]:
%cd /content/flowmm
!bash create_env_file.sh && \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m scripts_model.run data=supercon model=null_params \
    train.pl_trainer.gpus=0
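
The two training cells differ only in the `model=` override. The sketch below captures that mapping in one place, using only the overrides that appear in the cells above; `training_command` and `MODEL_FOR_MODE` are hypothetical names.

```python
import shlex

# Model config used by each training mode in this notebook.
MODEL_FOR_MODE = {
    "unconditional": "abits_params",
    "conditional": "null_params",
}


def training_command(mode, data="supercon", gpus=0):
    """Return the argument list for one training run as used in this notebook."""
    if mode not in MODEL_FOR_MODE:
        raise ValueError(f"mode must be one of {sorted(MODEL_FOR_MODE)}")
    return [
        "python", "-u", "-m", "scripts_model.run",
        f"data={data}",
        f"model={MODEL_FOR_MODE[mode]}",
        f"train.pl_trainer.gpus={gpus}",
    ]


for mode in MODEL_FOR_MODE:
    print(mode, "->", shlex.join(training_command(mode)))
```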

# (5) INFERENCE
# De Novo Generation / Unconditional Evaluation

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks recon

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_recon.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)
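
Printing a whole evaluation file can flood the output when it holds large tensors. As an alternative, the sketch below summarizes a nested object by key, container length, and array shape. `summarize` is a hypothetical helper; it makes no assumption about the file's actual layout and treats anything with a `.shape` attribute as an array.

```python
def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing nested containers without printing full arrays."""
    pad = "  " * depth
    if hasattr(obj, "shape"):  # torch.Tensor, numpy.ndarray, ...
        return [f"{pad}{type(obj).__name__} shape={tuple(obj.shape)}"]
    if isinstance(obj, dict) and depth < max_depth:
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key}:")
            lines.extend(summarize(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, (list, tuple)) and depth < max_depth:
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj:
            # Describe only the first element as a representative.
            lines.extend(summarize(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]


# Stand-in example; in the notebook, pass the object returned by torch.load.
print("\n".join(summarize({"coords": [[0.1, 0.2, 0.3]], "n_atoms": 1})))
```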

# Generation

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks gen

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_gen.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)

# Optimization

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python /content/cdvae/scripts/evaluate.py \
    --model_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks opt

In [None]:
import torch
from pprint import pprint
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_opt.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)

# (7) EVALUATION

In [None]:
!PROJECT_ROOT=/content/cdvae \
 HYDRA_JOBS=/content/cdvae/hydra_outputs \
 WABDB_DIR=/content/cdvae/wandb_outputs \
 conda run -p /usr/local/envs/cdvae_legacy --live-stream \
    python scripts/compute_metrics.py \
    --root_path /content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon \
    --tasks recon gen opt

# (8) NEXT STEPS & REFERENCES

## Next Steps

1. **Hyperparameter exploration**  
   - Try different numbers of noise levels (`model.num_noise_level`) and training epochs to improve sample quality.

2. **Property-conditioned generation**  
   - Re-enable the property predictor (`model.predict_property=True`) and train with longer schedules to improve prediction accuracy.
   - After training, sample structures by specifying a target critical temperature and evaluate via DFT or empirical models.


---

## References

- **Original CDVAE paper:**  
  Li _et al._, “Crystal Diffusion Variational Autoencoder for Inverse Materials Design,” _J. Phys. Chem. Lett._ 2023, DOI: [10.1021/acs.jpclett.3c01260](https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01260)

- **CDVAE GitHub repo:**  
  https://github.com/txie-93/cdvae

- **JARVIS-Materials-Design:**  
  https://github.com/JARVIS-Materials-Design/jarvis

- **Hydra configuration framework:**  
  https://hydra.cc

- **PyTorch Lightning:**  
  https://www.pytorchlightning.ai

- **condacolab:**  
  https://github.com/conda-incubator/condacolab

- **Mamba (fast conda):**  
  https://github.com/mamba-org/mamba

- **Jarvis-tools (data ETL):**  
  https://github.com/JARVIS-Materials-Design/jarvis-tools
