A curated handoff repository for LMDB-based materials-property prediction pipelines developed during an AI-for-chemistry internship.
This directory contains three production-oriented training stacks that share the same high-level philosophy:
preprocess expensive structure features once, store them in LMDB, then train reproducibly on local machines or HPC clusters.
This repo is designed to help future developers quickly understand, reuse, and extend the project without reverse-engineering multiple codebases from scratch.
It includes:
- CGCNN_LMDB for CGCNN-based training on QMOF and ODAC-style data;
- MGT_LMDB for a Molecular Graph Transformer pipeline with tri-graph inputs;
- PMT_LMDB for a PMTransformer / MOFTransformer-style multimodal pipeline;
- results_hub for local checkpoint evaluation, CIF prediction, workflow analytics, and in-browser documentation;
- full PDF + TeX technical documentation for each pipeline;
- supporting Markdown runbooks for preprocessing, training, scripts, and containers;
- research / due-diligence notebooks kept in the root as auxiliary analysis artifacts.
| Project | Core model family | Current data focus | Main entrypoint | Full developer documentation |
|---|---|---|---|---|
| `CGCNN_LMDB/` | Crystal Graph Convolutional Neural Network | QMOF, ODAC23/OpenDAC-style shard LMDBs, plus additional ASR/HMOF job scripts | `CGCNN_LMDB/code/main.py` | PDF · TeX |
| `MGT_LMDB/` | Molecular Graph Transformer + ALIGNN-style message passing | QMOF and ODAC-style LMDB workflows | `MGT_LMDB/code/training.py` | PDF · TeX |
| `PMT_LMDB/` | PMTransformer / MOFTransformer wrapper | QMOF-style CIF-tree datasets packed into LMDB | `PMT_LMDB/code/trainer.py` | PDF · TeX |
| `results_hub/` | Local Results Hub app | Checkpoint registry, CIF prediction, workflow metric comparison, docs browser | `results_hub/server.py` | README · in-app docs |
If you are new to the repository, read in this order:
- this root README;
- the project-level README inside the target subdirectory;
- the corresponding PDF handoff document (or TeX source in `docs/tex/`);
- the operational Markdown guides for preprocessing / training / scripts / containers;
- optional: launch Results Hub when you want a UI for checkpoint inference, metrics comparison, or documentation browsing.
- PDF docs: `docs/CGCNN.pdf`, `docs/MGT.pdf`, `docs/PMT.pdf`
- TeX sources: `docs/tex/`
```
LMDB_Projects/
├── CGCNN_LMDB/
├── MGT_LMDB/
├── PMT_LMDB/
├── results_hub/
│   ├── docs/
│   ├── static/
│   ├── server.py
│   └── README.md
├── docs/
│   ├── CGCNN.pdf
│   ├── MGT.pdf
│   ├── PMT.pdf
│   └── tex/
│       ├── CGCNN_LMDB_FULL_DOCUMENTATION.tex
│       ├── MGT_LMDB_FULL_DOCUMENTATION.tex
│       └── PMT_LMDB_FULL_DOCUMENTATION.tex
├── Training_Results/
└── *.ipynb
```
Results Hub is the local browser application for the repository. It lets you:
- register or select CGCNN, MGT, and PMT checkpoints;
- upload one CIF file or a folder of CIF files and predict the checkpoint target without requiring ground-truth labels;
- compare unified `training_metrics.csv` files across named workflows;
- browse the Results Hub docs and the model runbooks from one UI.
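To illustrate the kind of workflow comparison the hub performs, here is a minimal stdlib sketch that picks the best validation MAE per named workflow from `training_metrics.csv`-style data. The column names (`epoch`, `val_mae`) and the inline sample data are assumptions for illustration, not the hub's actual schema.

```python
import csv
import io

# Hypothetical unified metrics files, one per named workflow.
workflows = {
    "cgcnn_qmof": "epoch,val_mae\n1,0.45\n2,0.31\n3,0.28\n",
    "mgt_qmof": "epoch,val_mae\n1,0.50\n2,0.33\n3,0.30\n",
}

# Best (lowest) validation MAE per workflow, for a side-by-side comparison.
best = {}
for name, text in workflows.items():
    rows = csv.DictReader(io.StringIO(text))
    best[name] = min(float(row["val_mae"]) for row in rows)

for name, mae in sorted(best.items(), key=lambda kv: kv[1]):
    print(f"{name}: best val MAE = {mae:.2f}")
```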
Start it from the repository root:
```
python -m results_hub.server
```

Then open http://127.0.0.1:8877.
Runtime files created by the app are stored under `results_hub/data/` (`models/`, `uploads/`, `evaluate_runs/`, and `workflows/`). These are local working artifacts, not source files.
- Project overview: `CGCNN_LMDB/README.md`
- Preprocessing: `CGCNN_LMDB/PREPROCESS_README.md`
- Training: `CGCNN_LMDB/TRAINING_README.md`
- Scripts: `CGCNN_LMDB/scripts/README.md`
- Container: `CGCNN_LMDB/container/README.md`
- Project overview: `MGT_LMDB/README.md`
- Preprocessing: `MGT_LMDB/PREPROCESS_README.md`
- Training: `MGT_LMDB/TRAINING_README.md`
- Scripts: `MGT_LMDB/scripts/README.md`
- Container: `MGT_LMDB/container/README.md`
- Project overview: `PMT_LMDB/README.md`
- Preprocessing: `PMT_LMDB/PREPROCESS_README.md`
- Training: `PMT_LMDB/TRAINING_README.md`
- Scripts: `PMT_LMDB/scripts/README.md`
- Container: `PMT_LMDB/container/README.md`
The notebooks are intentionally kept in the repository as supporting analysis / due-diligence artifacts.
They are not part of the maintained production pipeline, but they are still useful context for future developers:
- `CoreMOF_CSD_Modified_Due_Diligence.ipynb`
- `MOSAEC_DB_Full_Due_Diligence.ipynb`
- `odac23_is2r_analysis.ipynb`
- `odac25_ONLY_VALIDATION_SET_is2re_dataset_analysis.ipynb`
- `qmof_analysis.ipynb`
Per the final handoff scope, these notebooks were not re-audited here; the maintained focus is the code + Markdown/TeX documentation layers.
The repository was finalized with attention to:
- code/documentation consistency for Markdown + TeX handoff docs;
- a local Results Hub UI for evaluation, metrics review, and docs navigation;
- developer discoverability from the root of the repo;
- low-risk fixes for obvious path / default-value drift.
- pick one pipeline;
- read its TeX document end-to-end once;
- follow the Markdown runbooks for preprocessing and training;
- use `results_hub` when you need quick local inference on CIF files or a visual comparison of training metrics;
- treat notebooks as supplemental research context, not as the canonical production interface.