GCT-Plus

About the Project

This project builds on GCT, a conditional variational autoencoder (CVAE) with a Transformer architecture originally designed for property-based molecular generation. We extended GCT to support structure-based generation conditioned on the Bemis-Murcko scaffold, and we named this enhanced version GCT-Plus.

We trained GCT-Plus on approximately 1.58 million neutral molecules from the MOSES benchmarking platform, covering multiple tasks: unconditioned generation, property-based generation, structure-based generation, and property-structure-based generation. We also applied GCT-Plus to molecular interpolation, generating new molecules whose structures resemble those of two given starting molecules.
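For context, the Bemis-Murcko scaffold used as the structure condition keeps a molecule's ring systems and the linkers between them while stripping side chains. A minimal sketch of how it can be computed with RDKit (illustrative only, not one of this repository's scripts):

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example input
mol = Chem.MolFromSmiles(smiles)

# Bemis-Murcko scaffold: ring systems plus the linkers that connect them
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
print(Chem.MolToSmiles(scaffold))  # c1ccccc1 (the benzene core)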

Getting Started

(1) Clone the repository:

git clone https://github.com/chaoting-sun/GCT-Plus.git

(2) Create an environment:

cd GCT-Plus
conda env create -n gct-plus -f ./env.yml # create a new environment named gct-plus
conda activate gct-plus

(3) Download the Models:

# 1. unconditioned GCT
gdown https://drive.google.com/uc?id=1k8HxI-h3Z9ZfJM4HZMFfZEw8Rh8bMElf -O ./Weights/vaetf/vaetf1.pt

# 2. property-based GCT
gdown https://drive.google.com/uc?id=1D5g3TF3-eFB34SXpylERSa-6L1u_SR5d -O ./Weights/pvaetf/pvaetf1.pt

# 3. structure-based GCT
gdown https://drive.google.com/uc?id=1emVfSViCVWugPda1utYaIBenbRucH_j1 -O ./Weights/scavaetf/scavaetf1.pt

# 4. property-structure-based GCT

# selected properties: logP, tPSA, QED
gdown https://drive.google.com/uc?id=10ojI90-Wrc0RTWUgOfAea6VjRk_GIPVH -O ./Weights/pscavaetf/pscavaetf1.pt

# selected properties: logP, tPSA, SAS
gdown https://drive.google.com/uc?id=1gA-woAsdYpUsDo_jQAO1n3Nf7WJS6g-D -O ./Weights/pscavaetf/pscavaetf1_molgpt.pt

# 5. property-based Transformer
gdown https://drive.google.com/uc?id=1ICK-p9p3WA4eOZfw0zPkPCP2LRks9hEg -O ./Weights/ptf/ptf1.pt
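The property conditions above (logP, tPSA, QED, SAS) are standard molecular descriptors. A minimal RDKit sketch of how such values can be computed for a SMILES string (illustrative; the repository's preprocessing scripts define the exact procedure):

from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # example molecule
mol = Chem.MolFromSmiles(smiles)

logp = Crippen.MolLogP(mol)   # Wildman-Crippen logP
tpsa = Descriptors.TPSA(mol)  # topological polar surface area
qed = QED.qed(mol)            # quantitative estimate of drug-likeness
print(f"logP={logp:.2f}, tPSA={tpsa:.2f}, QED={qed:.2f}")
# SAS (synthetic accessibility) requires the sascorer module from RDKit's contrib directory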

(4) Run Multiple Tasks:

# unconditioned generation
Bashscript/infer/uc_sampling.sh

# property-based generation
Bashscript/infer/p_sampling.sh

# structure-based generation
Bashscript/infer/sca_sampling.sh

# property-structure-based generation
Bashscript/infer/psca_sampling.sh

# molecular interpolation (see the latent-space sketch after this list)
Bashscript/infer/mol_interpolation.sh

# visualize attention
Bashscript/infer/visualize_attention.sh
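For reference, mol_interpolation.sh decodes points that lie between the latent representations of two input molecules. A minimal sketch of linear interpolation in latent space, using hypothetical encode/decode helpers rather than the repository's actual API:

import numpy as np

def interpolate_latents(z_start, z_end, n_points=8):
    """Linearly interpolate between two latent vectors (endpoints excluded)."""
    alphas = np.linspace(0.0, 1.0, n_points + 2)[1:-1]
    return [(1.0 - a) * z_start + a * z_end for a in alphas]

# Hypothetical usage: encode() and decode() stand in for the trained GCT-Plus
# encoder/decoder, which map SMILES to latent vectors and back.
# z1, z2 = encode("CCO"), encode("c1ccccc1O")
# for z in interpolate_latents(z1, z2):
#     print(decode(z))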

Implementation

(1) Preprocess the data

Bashscript/preprocess/preprocess.sh
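Preprocessing for SMILES-based generators of this kind typically involves canonicalizing each SMILES and splitting it into tokens. The sketch below illustrates the idea with a simple regex tokenizer; it is only an assumption for illustration, and preprocess.sh defines the actual pipeline:

import re
from rdkit import Chem

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|@|%\d{2}|[A-Za-z]|[0-9]|[=#\-\+\(\)/\\\.])"
)

def canonicalize(smiles):
    """Return the RDKit canonical SMILES, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def tokenize(smiles):
    """Split a SMILES string into model tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize(canonicalize("C(C)O")))  # ['C', 'C', 'O']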

(2) Re-train Models

# train a model for unconditioned generation
Bashscript/train/train_vaetf.sh

# train a model for property-based generation
Bashscript/train/train_pvaetf.sh

# train a model for structure-based generation
Bashscript/train/train_scavaetf.sh

# train a model for property-structure-based generation
Bashscript/train/train_pscavaetf.sh

(3) Model Selection

The best epoch for the unconditioned-generation model (vaetf) can be selected by running:

Bashscript/infer/model_selection.sh

Explanation

(1) What is the difference between "train0.py" and "train1.py"?

The primary distinction lies in the batching method used during training. "train0.py" adopts the same approach as GCT, grouping SMILES of similar lengths within each batch, whereas "train1.py" assigns SMILES to batches at random. We observed that the random batching used in GCT-Plus yields a smoother latent space, which improves molecular interpolation (the two strategies are sketched below).

Furthermore, "train0.py" only supports single-GPU training, whereas "train1.py" supports parallel training across multiple GPUs.
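A minimal sketch of the two batching strategies, using hypothetical helper names rather than the actual code in "train0.py" and "train1.py":

import random

def length_bucketed_batches(smiles_list, batch_size):
    """train0.py-style: sort by length so each batch holds similar-length SMILES."""
    ordered = sorted(smiles_list, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def random_batches(smiles_list, batch_size):
    """train1.py-style: shuffle first so lengths mix freely within a batch."""
    shuffled = random.sample(smiles_list, len(smiles_list))
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]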

(2) Why does "model_selection.py" only support GCT-Plus for unconditioned generation?

We employed the KL divergence metric, as defined in GuacaMol, to identify the optimal epoch. For each epoch, we calculated a score (S) that measures the similarity between the reference set (the MOSES test set) and the set generated by GCT-Plus. A higher S indicates more effective model learning.

We expected the S score to trace a concave curve over training epochs (rising, peaking, then falling) and selected the epoch with the highest score. In testing, only the unconditioned-generation GCT-Plus selected a reasonable epoch (37-38); the epochs chosen for the other tasks were unreasonably early. For the property-structure-based model, for example, the procedure favored the very first epoch.
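As a rough illustration of this kind of score for a single continuous property, the sketch below estimates exp(-KLD) from histogram densities; GuacaMol aggregates such terms over several descriptors, and this is not the repository's implementation:

import numpy as np
from scipy.stats import entropy

def kld_score(reference_values, generated_values, bins=50):
    """Return exp(-KLD) between two property distributions; 1.0 means identical."""
    lo = min(reference_values.min(), generated_values.min())
    hi = max(reference_values.max(), generated_values.max())
    p, _ = np.histogram(reference_values, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(generated_values, bins=bins, range=(lo, hi), density=True)
    eps = 1e-10  # avoid zero-count bins
    kld = entropy(p + eps, q + eps)  # KL(reference || generated)
    return float(np.exp(-kld))

# Hypothetical usage with per-molecule logP arrays for the MOSES test set and a generated sample:
# s = kld_score(np.array(ref_logp), np.array(gen_logp))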
