GCT-Plus

About the Project

This project builds on GCT, a conditional variational autoencoder (CVAE) with a Transformer architecture originally designed for property-based molecular generation. We extended GCT to support structure-based generation conditioned on the Bemis-Murcko scaffold, and we named this enhanced version GCT-Plus.

We trained GCT-Plus on approximately 1.58 million neutral molecules from the MOSES benchmarking platform, covering multiple tasks: unconditioned generation, property-based generation, structure-based generation, and property-structure-based generation. We also applied GCT-Plus to molecular interpolation, generating new molecules whose structures resemble those of two given starting molecules.
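For context, the Bemis-Murcko scaffold used as the structure condition keeps a molecule's ring systems and the linkers between them while stripping side chains. A minimal sketch of how it can be computed with RDKit (illustrative only, not one of this repository's scripts):

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example input
mol = Chem.MolFromSmiles(smiles)

# Bemis-Murcko scaffold: ring systems plus the linkers that connect them
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
print(Chem.MolToSmiles(scaffold))  # c1ccccc1 (the benzene core)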

Getting Started

(1) Clone the repository:

git clone https://github.com/chaoting-sun/GCT-Plus.git

(2) Create an environment:

cd GCT-Plus
conda env create -n gct-plus -f ./env.yml # create a new environment named gct-plus
conda activate gct-plus

(3) Download the Models:

# 1. unconditioned GCT
gdown https://drive.google.com/uc?id=1k8HxI-h3Z9ZfJM4HZMFfZEw8Rh8bMElf -O ./Weights/vaetf/vaetf1.pt

# 2. property-based GCT
gdown https://drive.google.com/uc?id=1D5g3TF3-eFB34SXpylERSa-6L1u_SR5d -O ./Weights/pvaetf/pvaetf1.pt

# 3. structure-based GCT
gdown https://drive.google.com/uc?id=1emVfSViCVWugPda1utYaIBenbRucH_j1 -O ./Weights/scavaetf/scavaetf1.pt

# 4. property-structure-based GCT

# selected properties: logP, tPSA, QED
gdown https://drive.google.com/uc?id=10ojI90-Wrc0RTWUgOfAea6VjRk_GIPVH -O ./Weights/pscavaetf/pscavaetf1.pt

# selected properties: logP, tPSA, SAS
gdown https://drive.google.com/uc?id=1gA-woAsdYpUsDo_jQAO1n3Nf7WJS6g-D -O ./Weights/pscavaetf/pscavaetf1_molgpt.pt

# 5. property-based Transformer
gdown https://drive.google.com/uc?id=1ICK-p9p3WA4eOZfw0zPkPCP2LRks9hEg -O ./Weights/ptf/ptf1.pt
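The property conditions above (logP, tPSA, QED, SAS) are standard molecular descriptors. A minimal RDKit sketch of how such values can be computed for a SMILES string (illustrative; the repository's preprocessing scripts define the exact procedure):

from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # example molecule
mol = Chem.MolFromSmiles(smiles)

logp = Crippen.MolLogP(mol)   # Wildman-Crippen logP
tpsa = Descriptors.TPSA(mol)  # topological polar surface area
qed = QED.qed(mol)            # quantitative estimate of drug-likeness
print(f"logP={logp:.2f}, tPSA={tpsa:.2f}, QED={qed:.2f}")
# SAS (synthetic accessibility) requires the sascorer module from RDKit's contrib directory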

(4) Run Multiple Tasks:

# unconditioned generation
Bashscript/infer/uc_sampling.sh

# property-based generation
Bashscript/infer/p_sampling.sh

# structure-based generation
Bashscript/infer/sca_sampling.sh

# property-structure-based generation
Bashscript/infer/psca_sampling.sh

# molecular interpolation (see the latent-space sketch after this list)
Bashscript/infer/mol_interpolation.sh

# visualize attention
Bashscript/infer/visualize_attention.sh
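For reference, mol_interpolation.sh decodes points that lie between the latent representations of two input molecules. A minimal sketch of linear interpolation in latent space, using hypothetical encode/decode helpers rather than the repository's actual API:

import numpy as np

def interpolate_latents(z_start, z_end, n_points=8):
    """Linearly interpolate between two latent vectors (endpoints excluded)."""
    alphas = np.linspace(0.0, 1.0, n_points + 2)[1:-1]
    return [(1.0 - a) * z_start + a * z_end for a in alphas]

# Hypothetical usage: encode() and decode() stand in for the trained GCT-Plus
# encoder/decoder, which map SMILES to latent vectors and back.
# z1, z2 = encode("CCO"), encode("c1ccccc1O")
# for z in interpolate_latents(z1, z2):
#     print(decode(z))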

Implementation

(1) Preprocess the data

Bashscript/preprocess/preprocess.sh
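Preprocessing for SMILES-based generators of this kind typically involves canonicalizing each SMILES and splitting it into tokens. The sketch below illustrates the idea with a simple regex tokenizer; it is only an assumption for illustration, and preprocess.sh defines the actual pipeline:

import re
from rdkit import Chem

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|@|%\d{2}|[A-Za-z]|[0-9]|[=#\-\+\(\)/\\\.])"
)

def canonicalize(smiles):
    """Return the RDKit canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def tokenize(smiles):
    """Split a SMILES string into model tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize(canonicalize("C(C)O")))  # ['C', 'C', 'O']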

(2) Re-train Models

# train a model for unconditioned generation
Bashscript/train/train_vaetf.sh

# train a model for property-based generation
Bashscript/train/train_pvaetf.sh

# train a model for structure-based generation
Bashscript/train/train_scavaetf.sh

# train a model for property-structure-based generation
Bashscript/train/train_pscavaetf.sh

(3) Model Selection

The best epoch for the unconditioned-generation model (vaetf) can be selected by running:

Bashscript/infer/model_selection.sh

Explanation

(1) What is the difference between "train0.py" and "train1.py"?

The primary distinction lies in the batching method used during training. "train0.py" adopts the same approach as GCT, grouping SMILES of similar lengths within each batch, whereas "train1.py" assigns SMILES to batches at random. We observed that the random batching used in GCT-Plus yields a smoother latent space, which improves molecular interpolation (the two strategies are sketched below).

Furthermore, "train0.py" only supports single-GPU training, whereas "train1.py" supports parallel training across multiple GPUs.
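A minimal sketch of the two batching strategies, using hypothetical helper names rather than the actual code in "train0.py" and "train1.py":

import random

def length_bucketed_batches(smiles_list, batch_size):
    """train0.py-style: sort by length so each batch holds similar-length SMILES."""
    ordered = sorted(smiles_list, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def random_batches(smiles_list, batch_size):
    """train1.py-style: shuffle first so lengths mix freely within a batch."""
    shuffled = random.sample(smiles_list, len(smiles_list))
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]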

(2) Why does "model_selection.py" only support GCT-Plus for unconditioned generation?

We employed the KL divergence metric, as defined in GuacaMol, to identify the optimal epoch. For each epoch, we calculated a score (S) that measures the similarity between the reference set (the MOSES test set) and the set generated by GCT-Plus. A higher S indicates more effective model learning.

We expected the S score to trace a concave curve over training epochs (rising, peaking, then falling) and selected the epoch with the highest score. In testing, only the unconditioned-generation GCT-Plus selected a reasonable epoch (37-38); the epochs chosen for the other tasks were unreasonably early. For the property-structure-based model, for example, the procedure favored the very first epoch.
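As a rough illustration of this kind of score for a single continuous property, the sketch below estimates exp(-KLD) from histogram densities; GuacaMol aggregates such terms over several descriptors, and this is not the repository's implementation:

import numpy as np
from scipy.stats import entropy

def kld_score(reference_values, generated_values, bins=50):
    """Return exp(-KLD) between two property distributions; 1.0 means identical."""
    lo = min(reference_values.min(), generated_values.min())
    hi = max(reference_values.max(), generated_values.max())
    p, _ = np.histogram(reference_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(generated_values, bins=bins, range=(lo, hi), density=True)
    eps = 1e-10  # avoid zero-count bins
    kld = entropy(p + eps, q + eps)  # KL(reference || generated)
    return float(np.exp(-kld))

# Hypothetical usage with per-molecule logP arrays for the MOSES test set and a generated sample:
# s = kld_score(np.array(ref_logp), np.array(gen_logp))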
