This is the official implementation of the paper "Enhancing Molecular Property Prediction with Chemical Priors by Fractional Denoising".
The environment is composed of the following packages and versions:
```
pytorch-lightning 1.8.6
torch 1.13.1+cu116
torch-cluster 1.6.0+pt113cu116
torch-geometric 2.3.0
torch-scatter 2.1.0+pt113cu116
torch-sparse 0.6.17+pt113cu116
torch-spline-conv 1.2.1+pt113cu116
torchmetrics 0.11.4
wandb 0.15.3
numpy 1.22.4
scikit-learn 1.2.2
scipy 1.8.1
deepchem 2.7.1
ogb 1.3.6
omegaconf 2.3.0
tqdm 4.66.2
```
The basic software environment includes Python 3.8, CUDA 11.6, and Ubuntu 20.04.2 (GCC 9.4.0-1ubuntu1~20.04.2) with Linux kernel 5.4.0-177-generic.
We ran all experiments on a server equipped with 8 NVIDIA A100-PCIE-40GB GPUs.
Additionally, we have uploaded a packed Conda environment to google drive. You can download the environment package and unzip it into the `envs` directory of your Conda installation.
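After activating the environment, a quick sanity check can confirm the pinned versions are in place. This is a minimal sketch; the package names simply follow the list above:

```python
# Sanity-check the unpacked environment against the pinned versions above.
import torch
import torch_geometric
import pytorch_lightning

print(torch.__version__)             # expected: 1.13.1+cu116
print(torch_geometric.__version__)   # expected: 2.3.0
print(pytorch_lightning.__version__) # expected: 1.8.6
print(torch.cuda.is_available())     # expected: True with CUDA 11.6
```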
To leverage Frad's fine-tuned model for predicting molecular quantum properties, follow these steps:
- Prepare the molecular SMILES in a file such as `smiles.lst`, with a `smiles` header followed by one SMILES string per line:

```
smiles
CC(C)n(c1c2CN(Cc3cn(C)nc3C)CC1)nc2-c1ncc(C)o1
C=CCN1C(SCc2nc(cccc3)c3[nH]2)=Nc(cccc2)c2C1=O
COc1ccc(Cn2c(C(O)=O)c(CNC3CCCC3)c3c2cccc3)cc1
O=C(CCc1ccc(CC(CC2)CCN2C(c2cscc2)=O)cc1)NC1CC1
Cc1nc(CCC2)c2c(N2CC(CN(C(C=C3)=O)N=C3c3ccncc3)C2)n1
CC(C1)OC(C)CN1c1nc(Nc2cc(OC)ccc2)c(cnn2C)c2n1
```
- Generate 3D coordinates for the molecular SMILES (a minimal RDKit sketch of this step is given after these steps):

```
python convert_smiles_pos.py --smiles_file=smiles.lst --output_file smiles_coord.lst
```

The generated coordinates and atom types for the input SMILES will be stored in `smiles_coord.lst`.
- Utilize the fine-tuned model for prediction. Download the fine-tuned model for either the gap property from this URL or the lumo property from this URL, then execute the following command. The prediction results will be stored in `results.csv`:

```
CUDA_VISIBLE_DEVICES=0 python scripts/test.py --conf examples/ET-QM9-FT_dw_0.2_long.yaml --dataset TestData --dataset-root smiles_coord.lst --train-size 1 --val-size 1 --layernorm-on-vec whitened --job-id gap{or lumo}_inference --dataset-arg gap{or lumo} --pretrained-model $finetuned-model --output-file results.csv
```
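For reference, below is a minimal sketch of what the SMILES-to-coordinates step could look like with RDKit. This is an assumption about `convert_smiles_pos.py`; the script's actual pipeline and output format may differ.

```python
# Assumed approach: RDKit-based 3D embedding. convert_smiles_pos.py is the
# authoritative implementation and may differ from this sketch.
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_coords(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    mol = Chem.AddHs(mol)                      # add explicit hydrogens
    AllChem.EmbedMolecule(mol, randomSeed=42)  # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)          # relax with the MMFF94 force field
    conf = mol.GetConformer()
    atom_types = [atom.GetSymbol() for atom in mol.GetAtoms()]
    coords = [(conf.GetAtomPosition(i).x,
               conf.GetAtomPosition(i).y,
               conf.GetAtomPosition(i).z) for i in range(mol.GetNumAtoms())]
    return atom_types, coords
```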
| Dataset | Reference |
|---|---|
| PCQM4Mv2 | OGB Stanford, Figshare |
| QM9 | Figshare |
| MD17 | SGDML |
| MD22 | SGDML |
| ISO17 | Quantum Machine |
| LBA | Zenodo |
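The repository consumes these datasets through its own loaders and configs (see `--dataset-root`). Purely as an illustration of the raw QM9 data, the torch-geometric version pinned above can fetch it directly; the path below is arbitrary:

```python
# Illustrative only: downloading and inspecting QM9 via torch-geometric 2.3.0.
# The repository's own data pipeline and splits are defined by its configs.
from torch_geometric.datasets import QM9

dataset = QM9(root="data/QM9")  # downloads ~130k molecules on first use
sample = dataset[0]
print(sample.z)    # atomic numbers
print(sample.pos)  # 3D coordinates, shape (num_atoms, 3)
print(sample.y)    # 19 regression targets, incl. homo/lumo/gap
```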
Additionally, we offer a download link for the processed fine-tuning data at the following URL: google drive
All pre-trained models are uploaded to Zenodo: Zenodo Link
Alternatively, individual pre-trained models can be accessed via Google Drive:
- Pretrained model for QM9: google drive
- Pretrained model for Force Prediction (MD17, MD22, ISO17): google drive
- Pretrained model for LBA: google drive
Below is the script for fine-tuning on the QM9 task. Be sure to replace `pretrain_model_path` with the actual model path. In this script, the subtask is set to `homo`, but it can be replaced with other subtasks as well.
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-QM9-FT_dw_0.2_long.yaml --layernorm-on-vec whitened --job-id frad_homo --dataset-arg homo --denoising-weight 0.1 --dataset-root $datapath --pretrained-model $pretrain_model_path
```
Below is the script for fine-tuning on the MD17 task. Replace `pretrain_model_path` with the actual model path. In this script, the subtask is set to `aspirin`, but it can be replaced with other subtasks such as `benzene`, `ethanol`, `malonaldehyde`, `naphthalene`, `salicylic_acid`, `toluene`, or `uracil` (a driver sketch for sweeping all subtasks follows the command).
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-MD17_FT-angle_9500.yaml --job-id frad_aspirin --dataset-arg aspirin --pretrained-model $pretrain_model_path --dihedral-angle-noise-scale 20 --position-noise-scale 0.005 --composition true --sep-noisy-node true --train-loss-type smooth_l1_loss
```
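To run all MD17 subtasks in sequence, a simple driver can wrap the command above. This is a sketch: the checkpoint path is a placeholder, and the flag set mirrors the command exactly.

```python
# Sketch: loop the MD17 fine-tuning command over every subtask.
import os
import subprocess

PRETRAINED = "/path/to/pretrain_model_path.ckpt"  # placeholder: replace

SUBTASKS = ["aspirin", "benzene", "ethanol", "malonaldehyde",
            "naphthalene", "salicylic_acid", "toluene", "uracil"]

env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}
for task in SUBTASKS:
    # One fine-tuning run per subtask, same flags as the command above.
    subprocess.run(
        ["python", "-u", "scripts/train.py",
         "--conf", "examples/ET-MD17_FT-angle_9500.yaml",
         "--job-id", f"frad_{task}",
         "--dataset-arg", task,
         "--pretrained-model", PRETRAINED,
         "--dihedral-angle-noise-scale", "20",
         "--position-noise-scale", "0.005",
         "--composition", "true",
         "--sep-noisy-node", "true",
         "--train-loss-type", "smooth_l1_loss"],
        env=env, check=True)
```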
Below is the script for fine-tuning on the MD22 task. Replace `pretrain_model_path` with the actual model path. In this script, the subtask is set to `AT-AT-CG-CG` via `--dataset-arg`, but it can be replaced with any of `AT-AT-CG-CG`, `AT-AT`, `Ac-Ala3-NHMe`, `DHA`, `buckyball-catcher`, `double-walled_nanotube`, or `stachyose`.
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-MD22.yaml --batch-size 32 --inference-batch-size 32 --num-epochs 100 --lr 1e-3 --log-dir md22-AT-AT-CG-CG --dataset-arg AT-AT-CG-CG --ngpus 1 --job-id md22-AT-AT-CG-CG --pretrained-model $pretrain_model_path --lr-schedule cosine_warmup --save-top-k 1 --save-interval 1 --test-interval 1 --seed 666 --md17 true --train-loss-type smooth_l1_loss
```
Below is the script for fine-tuning the ISO17 task.
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-ISO17.yaml --batch-size 256 --job-id iso17 --inference-batch-size 256 --pretrained-model $pretrain_model_path --num-epochs 50 --lr 2e-4 --log-dir iso-energy --ngpus 1 --save-top-k 1 --save-interval 1 --test-interval 1 --seed 666 --lr-schedule cosine_warmup --md17 true --train-loss-type smooth_l1_loss
```
Below is the script for fine-tuning the LBA task.
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-LBA-FT_long_f2d.yaml --layernorm-on-vec whitened --job-id LBA --dataset-root $LBA_DATA_PATH --pretrained-model $pretrain_model_path
```
Pre-training the model for QM9:
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-PCQM4MV2_dih_var0.04_var2_com_re.yaml --layernorm-on-vec whitened --job-id frad_pretraining --num-epochs 8
```
Pre-training the model for atomic force tasks (MD17, MD22, ISO17):
```
CUDA_VISIBLE_DEVICES=0 python -u scripts/train.py --conf examples/ET-PCQM4MV2_var0.4_var2_com_re_md17.yaml --layernorm-on-vec whitened --job-id frad_pretraining_force --num-epochs 8
```
- The above scripts pre-train the model using RN noise. To switch to VRN noise, add the option `--bat-noise true`. A conceptual sketch of the position noise involved follows this list.
- For the LBA task, we incorporate angular information into the molecular geometry embedding to better model the complexity of the input protein-ligand complex. Add the option `--model equivariant-transformerf2d` to apply the custom model for LBA.
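As a conceptual illustration of the coordinate ("position") noise controlled by flags such as `--position-noise-scale`, here is a minimal sketch. It is not the repository's implementation; Frad's dihedral-angle noise additionally perturbs rotatable torsions, which this sketch omits.

```python
# Conceptual sketch only: Gaussian coordinate noise for a denoising objective.
import torch

def add_position_noise(pos: torch.Tensor, scale: float = 0.005) -> torch.Tensor:
    """Add isotropic Gaussian noise to atomic coordinates of shape (num_atoms, 3)."""
    return pos + scale * torch.randn_like(pos)

coords = torch.randn(21, 3)         # stand-in for a molecular conformer
noisy = add_position_noise(coords)  # perturbed input for the denoising objective
```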