<a href="https://colab.research.google.com/github/crhysc/jarvis-tools-notebooks/blob/master/jarvis-tools-notebooks/flowmm_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial**: FlowMM & FlowLLM



**Authors**: Charles "Rhys" Campbell (crc00042@mix.wvu.edu)

# TABLE OF CONTENTS

- Background and Central Goal
- Installation, Configuration, and Dependencies
- Dataset ETL
- Training
  - Manifolds
  - Unconditional Training
  - Conditional Training
  - FlowLLM
- Inference
  - De Novo Generation / Unconditional Evaluation
  - Reconstruction / Conditional Evaluation
- Next Steps & References

# (1) BACKGROUND AND CENTRAL GOAL


# Background
### FlowMM
**FlowMM** uses Riemannian flow matching to learn how to transform simple base noise into full periodic crystal structures by jointly modeling fractional atomic coordinates and lattice parameters on the manifold defined by crystal symmetries. It tackles both **Crystal Structure Prediction** (finding the stable arrangement for a known composition) and **De Novo Generation** (proposing entirely new materials), doing so with about three times fewer integration steps than comparable diffusion-based approaches.  

### FlowLLM
**FlowLLM** builds on FlowMM by swapping out the simple analytic noise prior for samples from a pretrained CrystalLLM (a LLaMA-style model fine-tuned on crystal data). You generate initial "noisy" structures with the LLM, then use the same Riemannian flow-matching steps to refine those into accurate crystal geometries.


# Central Goal
Show viewers how to install, train, and use FlowMM and FlowLLM.
  


# (2) INSTALLATION, CONFIGURATION, AND DEPENDENCIES


# Install Conda

In [1]:
!pip install -q condacolab
import condacolab, os, sys
condacolab.install()
print("Done")

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:15
🔁 Restarting kernel...
Done


**Note**: Colab and FlowMM pin different Python and CUDA versions. To work around this, most code in this notebook is run through `!conda run`, which starts a subprocess inside a conda environment that has the Python version FlowMM requires, instead of using the Python version pinned by the Colab kernel.
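
As a rough illustration of this pattern, the sketch below builds the same kind of `conda run` command from Python. The helper name `conda_run_cmd` is hypothetical, and the environment path is the one this notebook creates in a later step. Nothing is executed here; the command is only printed.

```python
import shlex
import sys

# Environment prefix that this notebook creates in a later step.
ENV_PREFIX = "/usr/local/envs/flowmm_env"


def conda_run_cmd(args, env_prefix=ENV_PREFIX):
    """Return the argv list that runs `args` inside the conda environment."""
    return ["conda", "run", "-p", env_prefix, "--live-stream", *args]


# The kernel's interpreter is whatever Colab ships with.
print("Colab kernel Python:", sys.version.split()[0])
# The environment's interpreter is reached through `conda run`.
print("Env command:", shlex.join(conda_run_cmd(["python", "--version"])))
```

Comparing the two version strings once the environment exists is a quick way to confirm that a command really runs in the FlowMM environment.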

# Install FlowMM

In [1]:
import os
%cd /content
if not os.path.exists('flowmm'):
  !git clone https://github.com/facebookresearch/flowmm.git
print("Done")

/content
Cloning into 'flowmm'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 205 (delta 23), reused 11 (delta 11), pack-reused 130 (from 1)
Receiving objects: 100% (205/205), 64.55 MiB | 15.76 MiB/s, done.
Resolving deltas: 100% (45/45), done.
Filtering content: 100% (14/14), 370.49 MiB | 39.85 MiB/s, done.
Done


# Load FlowMM submodules

In [2]:
%%bash
cd /content/flowmm
sed -i 's|git@github.com:bkmi/DiffCSP-official.git|https://github.com/bkmi/DiffCSP-official.git|' .gitmodules
sed -i 's|git@github.com:bkmi/cdvae.git|https://github.com/bkmi/cdvae.git|' .gitmodules
sed -i 's|git@github.com:facebookresearch/riemannian-fm.git|https://github.com/facebookresearch/riemannian-fm.git|' .gitmodules
git submodule sync
git submodule update --init --recursive
echo "Done"

Submodule path 'remote/DiffCSP-official': checked out '199539f8dbca31a3e08ae549b2876452ff5b4ead'
Submodule path 'remote/cdvae': checked out '5837952c3de298dc6ac41c600ba4cbb4b6d9b6ed'
Submodule path 'remote/riemannian-fm': checked out 'a90927909e7df7437a5895ff7174e7b356f8526e'
Done


Submodule 'remote/DiffCSP-official' (https://github.com/bkmi/DiffCSP-official.git) registered for path 'remote/DiffCSP-official'
Submodule 'remote/cdvae' (https://github.com/bkmi/cdvae.git) registered for path 'remote/cdvae'
Submodule 'remote/riemannian-fm' (https://github.com/facebookresearch/riemannian-fm.git) registered for path 'remote/riemannian-fm'
Cloning into '/content/flowmm/remote/DiffCSP-official'...
Cloning into '/content/flowmm/remote/cdvae'...
Cloning into '/content/flowmm/remote/riemannian-fm'...
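
The `sed` calls in this step rewrite SSH-style submodule URLs to their HTTPS form before the submodules are synced and cloned. A minimal Python sketch of the same substitution follows; `ssh_to_https` is a hypothetical helper for illustration, not part of FlowMM.

```python
import re

# Matches SSH-style GitHub remotes such as git@github.com:owner/repo.git
_SSH_GITHUB = re.compile(r"git@github\.com:(\S+)")


def ssh_to_https(text: str) -> str:
    """Rewrite every SSH-style GitHub URL in `text` to its HTTPS form."""
    return _SSH_GITHUB.sub(r"https://github.com/\1", text)


line = "url = git@github.com:bkmi/cdvae.git"
print(ssh_to_https(line))
```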


# Switch Colab Runtime to GPU
In the top menu next to the Colab logo, select **Runtime** -> **Change runtime type** -> **Any GPU**.

A GPU is not required, but the code will finish faster with one.



# Create conda environment for FlowMM
Creating the conda environment takes about 20 minutes.


In [3]:
%%time
%cd /content/flowmm
!mamba env create -p /usr/local/envs/flowmm_env -f environment.yml
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    pip install uv
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    uv pip install "jarvis-tools>=2024.5" "pymatgen>=2024.1" pandas numpy tqdm
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    uv pip install -e . \
                   -e remote/riemannian-fm \
                   -e remote/cdvae \
                   -e remote/DiffCSP-official
print("Done")

Streaming output truncated to the last 5000 lines.

# Add `__init__.py` to manifm and reinstall

In [4]:
%cd /content/flowmm/
import os
if not os.path.exists('remote/riemannian-fm/manifm/__init__.py'):
    !wget -q https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/__init__.py
    !mv __init__.py /content/flowmm/remote/riemannian-fm/manifm/
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    pip install -e /content/flowmm/remote/riemannian-fm/
!conda run -p /usr/local/envs/flowmm_env --live-stream\
    python -c "import manifm; print('manifm version:', manifm.__version__)"

/content/flowmm
Obtaining file:///content/flowmm/remote/riemannian-fm
  Preparing metadata (setup.py) ... done
Installing collected packages: manifm
  Attempting uninstall: manifm
    Found existing installation: manifm 1.0.0
    Uninstalling manifm-1.0.0:
      Successfully uninstalled manifm-1.0.0
  DEPRECATION: Legacy editable install of manifm==1.0.0 from file:///content/flowmm/remote/riemannian-fm (setup.py develop) is deprecated. pip 25.3 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. Discussion can be found at https://github.com/pypa/pip/issues/11457
  Running setup.py develop for manifm
Successfully installed manifm-1.0.0
manifm version: 1.0.0
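
The existence check in the cell above decides whether the `__init__.py` download is needed. A hedged sketch of that check as a reusable function is shown below; `has_init` is a hypothetical name, and the path is the one used in this notebook.

```python
from pathlib import Path


def has_init(package_dir) -> bool:
    """Return True if `package_dir` contains an __init__.py file."""
    return (Path(package_dir) / "__init__.py").is_file()


# Prints False when the directory or the file does not exist.
print(has_init("/content/flowmm/remote/riemannian-fm/manifm"))
```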


# Install Other dependencies


# (3) DATASET ETL (Extract-Transform-Load)


# Download data pre-processor

The data is generated with this [script](https://github.com/crhysc/utilities/blob/main/supercon_preprocess.py). It compiles structures and their superconducting critical temperatures (around 1000 with `--max-size 1000`; the run below uses `--max-size 25`) into the format required for FlowMM training.

In [5]:
%cd /content/flowmm
import os
if not os.path.exists('supercon_preprocess.py'):
  !wget -q https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon_preprocess.py
%cat supercon_preprocess.py

/content/flowmm
#!/usr/bin/env python
"""
supercon_preprocess.py  –  Python 3.9 compatible

Example
-------
python supercon_preprocess.py \
    --dataset dft_3d --id-key jid --target Tc_supercon \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
    --seed 123 --max-size 1000
"""
from __future__ import annotations

import argparse, random, json, hashlib
from pathlib import Path
from typing import Optional, List, Tuple

import numpy as np
import pandas as pd
from tqdm import tqdm

from jarvis.db.figshare import data as jarvis_data
from jarvis.core.atoms import Atoms
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer


# ---------- helpers ----------------------------------------------------------
def canonicalise(pmg_struct: Structure, symprec: float = 0.1) -> Tuple[str, int, int]:
    """Return (cif_conv, spg_num, spg_num_conv).  Never raises."""
    try:
        sga = SpacegroupAnalyzer(pmg_struct, symprec=symprec)
        spg_num =

# Run data pre-processor

In [6]:
%cd /content/flowmm
!conda run -p /usr/local/envs/flowmm_env --live-stream \
    python supercon_preprocess.py \
        --dataset dft_3d \
        --id-key jid \
        --target Tc_supercon \
        --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
        --seed 123 \
        --max-size 25
print("Done")

/content/flowmm
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
Obtaining 3D dataset 76k ...
Reference:https://www.nature.com/articles/s41524-020-00440-1
Other versions:https://doi.org/10.6084/m9.figshare.6815699
100% 40.8M/40.8M [00:02<00:00, 18.4MiB/s]
Loading the zipfile...
Loading completed.
Downloading/JARVIS:   5% 3678/75993 [00:00<00:17, 4177.79it/s]
Collected 25 records (max-size=25)
✓ Wrote train.csv, val.csv, test.csv
hashes  train:2883830c37 val:a7b15dfc05 test:05f4f13f52
Done
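
The `--train-ratio`, `--val-ratio`, `--test-ratio`, and `--seed` flags describe a seeded split of the collected records. The script listing above is truncated, so the following is only a sketch of how such a split might work under those flags, not the script's actual implementation; `split_records` is a hypothetical name.

```python
import random


def split_records(records, train_ratio=0.8, val_ratio=0.1, seed=123):
    """Shuffle with a fixed seed, then cut into train/val/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


train, val, test = split_records(range(25))
print(len(train), len(val), len(test))
```

Assigning the remainder to the test split is an assumption made here so that no record is dropped when the ratios do not divide the dataset evenly.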


# Move train/test/val data to the correct spot

In [7]:
%cd /content
%mkdir -p /content/flowmm/data/supercon
%mv /content/flowmm/train.csv /content/flowmm/data/supercon/
%mv /content/flowmm/val.csv /content/flowmm/data/supercon/
%mv /content/flowmm/test.csv /content/flowmm/data/supercon/
print("Done")

/content
Done
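
Before moving on, it can help to confirm that all three split files landed in the data directory. The sketch below does that with a hypothetical helper, `missing_splits`; the directory and file names are the ones used in the cell above.

```python
from pathlib import Path


def missing_splits(data_dir, names=("train.csv", "val.csv", "test.csv")):
    """Return the split files that are not present in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


# An empty list means all three files are in place.
print(missing_splits("/content/flowmm/data/supercon"))
```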


# Pull the supercon Hydra config YAML from GitHub

In [8]:
%cd /content/flowmm/scripts_model/conf/data/
!wget https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon.yaml
%cat supercon.yaml

/content/flowmm/scripts_model/conf/data
--2025-05-30 14:14:52--  https://raw.githubusercontent.com/crhysc/utilities/refs/heads/main/supercon.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2359 (2.3K) [text/plain]
Saving to: ‘supercon.yaml’


2025-05-30 14:14:52 (13.5 MB/s) - ‘supercon.yaml’ saved [2359/2359]

dataset_name: supercon
dim_coords: 3
root_path: ${oc.env:PROJECT_ROOT}/data/supercon
prop: Tc_supercon
num_targets: 1
# prop: scaled_lattice
# num_targets: 6
niggli: true
primitive: false
graph_method: crystalnn
lattice_scale_method: scale_length
preprocess_workers: 30
readout: mean
max_atoms: 24
otf_graph: false
eval_model_name: supercon
tolerance: 0.1

use_space_group: false
use_pos_index: false
train_max_epochs: 1
early_stopping_patienc
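
The `root_path` entry uses the `${oc.env:PROJECT_ROOT}` interpolation, which OmegaConf resolves from the `PROJECT_ROOT` environment variable. The sketch below is a standard-library imitation of that lookup, intended only to show what the resolved value looks like; Hydra and OmegaConf perform the real resolution, and `resolve_oc_env` is a hypothetical helper.

```python
import os
import re

# Matches interpolations of the form ${oc.env:NAME}
_OC_ENV = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_oc_env(value, environ=None):
    """Replace each ${oc.env:NAME} in `value` with the variable's value."""
    environ = os.environ if environ is None else environ

    def _lookup(match):
        name = match.group(1)
        if name not in environ:
            raise KeyError(f"environment variable {name!r} is not set")
        return environ[name]

    return _OC_ENV.sub(_lookup, value)


# Assumed value for illustration: the repository root used in this notebook.
env = {"PROJECT_ROOT": "/content/flowmm"}
print(resolve_oc_env("${oc.env:PROJECT_ROOT}/data/supercon", env))
```

If `PROJECT_ROOT` is unset, the lookup fails, which is a useful thing to check when a run cannot find its dataset.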

# Modify FlowMM's hardcoded dataset list to accept our supercon dataset

First, open **Files** in the left sidebar and navigate to **/content/flowmm/src/flowmm/**. Click **cfg_utils.py** and, on line 15, add `"supercon"` to the *dataset_options* literal and delete all other strings in the tuple. This change is needed before running the code further down that generates the necessary affine stats YAML.

Next, open **Files** again and navigate to **/content/flowmm/src/flowmm/rfm/manifolds/**. Click **spd.py**, go to the `if __name__ == "__main__":` block, and make the following changes:

1. Uncomment lines 449 through 466 (this turns on `compute_stats`).
2. On line 468, set `compute_stats = True`.
3. On line 489, set `compute_stats = True` again.
4. On line 461, change `"std": std.cpu().tolist()` to `"logmap_std": std.cpu().tolist(),`.
5. On line 236, change the `"std"` string to `"logmap_std"`.

Finally, still in **spd.py**, replace all code from line 531 onward (line 531 is the comment `# do some testing for SPDNonIsotropicRandom`) with the following:

    pL_stats = OmegaConf.load(Path(__file__).parent / "spd_pLTL_stats.yaml")  # ← new line

    for dataset in tqdm(list(dataset_options.__args__)):
        mean_vec = torch.tensor(pL_stats[dataset]["mean"])           # now using pL_stats
        std_vec  = torch.tensor(pL_stats[dataset]["logmap_std"])     # correct key name

        # optional sanity check
        if mean_vec.ndim == 0:
            raise ValueError(
                f"Loaded mean for {dataset} is scalar (wrong YAML?); shape {mean_vec.shape}"
            )

        s = manifm_SPD(Riem_geodesic=True, Riem_norm=True)
        spd = SPDNonIsotropicRandom(mean_vec, std_vec)
        r   = spd.random_base(10, mean_vec.size(-1))
        lp  = spd.base_logprob(r)
        print(r, lp)

        r  = spd.random_base(3, 10, mean_vec.size(-1))
        lp = spd.base_logprob(r)
        print(r, lp)
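
The sanity check in the snippet above guards against loading a stats file whose entries have the wrong shape or key names. A hedged, dependency-free sketch of the same idea is given below. It works on a plain dict standing in for one dataset's entry, `check_stats_entry` is a hypothetical helper, and the numbers are made up for illustration.

```python
def check_stats_entry(entry, dataset="supercon"):
    """Raise if a dataset's stats entry lacks vector-valued 'mean'/'logmap_std'."""
    for key in ("mean", "logmap_std"):
        if key not in entry:
            raise KeyError(f"{dataset}: missing key {key!r}")
        value = entry[key]
        if not isinstance(value, (list, tuple)) or len(value) == 0:
            raise ValueError(f"{dataset}: {key!r} must be a non-empty list, got {value!r}")


# Made-up numbers, only to exercise the check.
check_stats_entry({"mean": [0.1, 0.2], "logmap_std": [1.0, 1.5]})
print("entry looks usable")
```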

# Generate SPD stats

In [18]:
%cd /content/flowmm
!bash create_env_file.sh && \
 echo "successfully ran create_env_file.sh" && \
 HYDRA_FULL_ERROR=1 \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m flowmm.rfm.manifolds.spd

/content/flowmm
successfully ran create_env_file.sh
calculate the overall stats of p(L) for each dataset
dataset='supercon': 100% 1/1 [00:01<00:00,  1.33s/it]
calculate the density atoms to volume
dataset='supercon': 100% 1/1 [00:00<00:00,  2.56it/s]
calculate the stats of p(L | N) for each dataset
dataset='supercon': 100% 1/1 [00:00<00:00,  2.36it/s]
  0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/envs/flowmm_env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/envs/flowmm_env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/content/flowmm/src/flowmm/rfm/manifolds/spd.py", line 545, in <module>
    r   = spd.random_base(10, mean_vec.size(-1))
  File "/content/flowmm/src/flowmm/rfm/manifolds/spd.py", line 252, in random_base
    self.logmap_std.expand(bsz, d).to(dtype=rnd_dtype, device=rnd_device)
  File "/usr/local/envs/flowmm_env/lib/python3.9/

# Create the required affine stats YAML for the dataset

In [None]:
%cd /content/flowmm
!bash create_env_file.sh && \
 echo "successfully ran create_env_file.sh" && \
 HYDRA_FULL_ERROR=1 \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m flowmm.model.standardize \
                 data=supercon

# (4) TRAINING
# Manifolds


- FlowMM lets the user choose the manifolds through the keyword argument   
`model={atom_type_manifold}_{lattice_manifold}`  
when using `scripts_model/run.py`.  

- The available atom type and lattice manifold configurations can be found in `scripts_model/conf/model`.
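
To see which values the `model=` argument can take, one option is to list the files in that config directory. The sketch below assumes the configs are `.yaml` files and that the repository lives at `/content/flowmm`, as in this notebook; `list_model_configs` is a hypothetical helper.

```python
from pathlib import Path


def list_model_configs(conf_dir="/content/flowmm/scripts_model/conf/model"):
    """Return the config names (file stems) found in the model config directory."""
    return sorted(path.stem for path in Path(conf_dir).glob("*.yaml"))


# Each printed name is a candidate value for `model=...`.
print(list_model_configs())
```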

# Unconditional Training

In [None]:
%cd /content/flowmm
!bash create_env_file.sh && \
 echo "successfully ran create_env_file.sh" && \
 HYDRA_FULL_ERROR=1 \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m scripts_model.run data=supercon model=abits_params

# Conditional Training

In [None]:
%cd /content/flowmm
!bash create_env_file.sh && \
 echo "successfully ran create_env_file.sh" && \
 HYDRA_FULL_ERROR=1 \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m scripts_model.run data=supercon model=null_params

# FlowLLM Training

In [None]:
%cd /content/flowmm
!bash create_env_file.sh && \
 echo "successfully ran create_env_file.sh" && \
 HYDRA_FULL_ERROR=1 \
 conda run -p /usr/local/envs/flowmm_env --live-stream \
    python -u -m scripts_model.run data=mp20_llama model=null_params \
      base_distribution_from_data=True
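
The three training cells above differ only in their Hydra overrides. The sketch below assembles those commands from their parts to make the differences explicit; `build_train_cmd` is a hypothetical helper, and the override values are the ones used in the cells above.

```python
import shlex


def build_train_cmd(data, model, **overrides):
    """Return the argv list for one training run with Hydra-style overrides."""
    cmd = ["python", "-u", "-m", "scripts_model.run", f"data={data}", f"model={model}"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


runs = {
    "unconditional": build_train_cmd("supercon", "abits_params"),
    "conditional": build_train_cmd("supercon", "null_params"),
    "flowllm": build_train_cmd("mp20_llama", "null_params", base_distribution_from_data=True),
}
for name, cmd in runs.items():
    print(f"{name}: {shlex.join(cmd)}")
```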

# (5) INFERENCE
# Unconditional Evaluation - De Novo Generation



In [None]:
%%bash
cd /content/flowmm
bash create_env_file.sh && echo "successfully ran create_env_file.sh"
# Placeholders: set these to your checkpoint, output subdirectory, and anneal slope.
ckpt=PATH_TO_CHECKPOINT
subdir=NAME_OF_SUBDIRECTORY_AT_CHECKPOINT
slope=SLOPE_OF_INFERENCE_ANTI_ANNEALING
# Prefix every step so that it runs inside the FlowMM environment.
run="conda run -p /usr/local/envs/flowmm_env --live-stream"
$run python scripts_model/evaluate.py generate ${ckpt} --subdir ${subdir} \
      --inference_anneal_slope ${slope} --stage test && \
$run python scripts_model/evaluate.py consolidate ${ckpt} --subdir ${subdir} && \
$run python scripts_model/evaluate.py old_eval_metrics ${ckpt} --subdir ${subdir} \
      --stage test && \
$run python scripts_model/evaluate.py lattice_metrics ${ckpt} --subdir ${subdir} \
      --stage test

In [None]:
import torch
from pprint import pprint
# Example path; replace it with the .pt file written by your own evaluation run.
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_recon.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)
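
Printing the whole loaded object can be hard to read when it holds large tensors. As an alternative, the sketch below reports only types, lengths, and shapes of a nested structure. `summarize` is a hypothetical helper, and it relies only on objects exposing a `.shape` attribute, so nothing is assumed about the actual contents of the file.

```python
def summarize(obj, name="data", depth=0, max_depth=3):
    """Return lines describing the type and shape/length of nested containers."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:  # tensors and arrays expose .shape
        return [f"{pad}{name}: {type(obj).__name__} shape={tuple(shape)}"]
    if isinstance(obj, dict):
        lines = [f"{pad}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines += summarize(value, str(key), depth + 1, max_depth)
        return lines
    if isinstance(obj, (list, tuple)):
        return [f"{pad}{name}: {type(obj).__name__} of length {len(obj)}"]
    return [f"{pad}{name}: {type(obj).__name__}"]


# Stand-in object; pass the result of torch.load(...) instead.
example = {"frac_coords": [[0.0, 0.5, 0.5]], "meta": {"stage": "test"}}
print("\n".join(summarize(example)))
```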

# Conditional Evaluation - Crystal Structure Prediction - Reconstruction

In [None]:
%%bash
cd /content/flowmm
bash create_env_file.sh && echo "successfully ran create_env_file.sh"
# Placeholders: set these to your checkpoint, output subdirectory, and anneal slope.
ckpt=PATH_TO_CHECKPOINT
subdir=NAME_OF_SUBDIRECTORY_AT_CHECKPOINT
slope=SLOPE_OF_INFERENCE_ANTI_ANNEALING
# Prefix every step so that it runs inside the FlowMM environment.
run="conda run -p /usr/local/envs/flowmm_env --live-stream"
$run python scripts_model/evaluate.py reconstruct ${ckpt} --subdir ${subdir} \
      --inference_anneal_slope ${slope} --stage test && \
$run python scripts_model/evaluate.py consolidate ${ckpt} --subdir ${subdir} && \
$run python scripts_model/evaluate.py old_eval_metrics ${ckpt} --subdir ${subdir} \
      --stage test && \
$run python scripts_model/evaluate.py lattice_metrics ${ckpt} --subdir ${subdir} \
      --stage test

In [None]:
import torch
from pprint import pprint
# Example path; replace it with the .pt file written by your own evaluation run.
path = "/content/cdvae/hydra_outputs/singlerun/2025-05-27/supercon/eval_recon.pt"
data = torch.load(path, map_location="cpu", weights_only=False)
pprint(data, width=120, indent=2)

# (6) NEXT STEPS & REFERENCES

# Next Steps

1. **Hyperparameter exploration**  
   - Try different numbers of noise levels (`model.num_noise_level`) and training epochs to improve sample quality.

2. **Property-conditioned generation**  
   - Re-enable the property predictor (`model.predict_property=True`) and train with longer schedules to improve prediction accuracy.
   - After training, sample structures by specifying a target critical temperature and evaluate via DFT or empirical models.


---

# References

- **Original CDVAE paper:**  
  Li _et al._, "Crystal Diffusion Variational Autoencoder for Inverse Materials Design," _J. Phys. Chem. Lett._ 2023, DOI: [10.1021/acs.jpclett.3c01260](https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01260)

- **CDVAE GitHub repo:**  
  https://github.com/txie-93/cdvae

- **JARVIS-Materials-Design:**  
  https://github.com/JARVIS-Materials-Design/jarvis

- **Hydra configuration framework:**  
  https://hydra.cc

- **PyTorch Lightning:**  
  https://www.pytorchlightning.ai

- **condacolab:**  
  https://github.com/conda-incubator/condacolab

- **Mamba (fast conda):**  
  https://github.com/mamba-org/mamba

- **Jarvis-tools (data ETL):**  
  https://github.com/JARVIS-Materials-Design/jarvis-tools
