# Exploring Generative Methodologies with Deepchem

By Debasish Mohanty

# Table of Contents:
1. [Introduction](#introduction)
2. [Generative Methods](#genmethods)
      - [BRICS](#brics)
      - [Neural Networks](#nns)
3. [Code Implementation with Deepchem](#codeimplement)
      - [Setup](#setup)
      - [BRICS Generation with SMILES](#bricssmiles)
      - [BRICS Generation with PSMILES](#bricspsmiles)
      - [LSTM Generators](#lstmimplement)
      - [Pretrained Implementation with PSMILES and Weighted Directed Graphs](#pretrainedpsmileswdg)
4. [Summary](#sum)
5. [References](#ref)

## Colab

This tutorial is designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Exploring_Generative_Methodologies_with_Deepchem.ipynb)

# 1. Introduction <a class="anchor" id="introduction"></a>

The development of new chemical compounds and materials, especially organic molecules and polymers, often relies on computational techniques for molecular design and discovery. Generative modeling in cheminformatics is a powerful approach that leverages various algorithms to predict and create novel molecules. In this tutorial, we will delve into the application of generative methodologies using the DeepChem library, focusing on fragment-based and deep learning-driven techniques for molecular generation. We begin by exploring BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures), a method for systematically fragmenting and reconstructing molecular structures based on chemically meaningful subunits. BRICS offers a rule-based approach that respects chemical stability and common retrosynthetic pathways, making it useful for creating feasible molecular candidates.

Next, we explore neural network-based methods, particularly recurrent neural networks (RNNs) like Long Short-Term Memory (LSTM), which have been widely used for sequence generation. In the context of molecular generation, LSTM models are trained to produce novel molecules by learning from patterns in SMILES (Simplified Molecular Input Line Entry System) representations of known compounds. This method excels in capturing complex structural patterns and generating novel molecules that fit within the chemical space learned from the training data. We also extend these methods to Polymer SMILES (PSMILES), a notation that represents repeating polymer structures, making it possible to model polymers using the same generative techniques. By incorporating weighted graphs and pretrained models, we enhance the model's ability to generate polymers. These generative and validation methods are used and analyzed in our research paper "Open-source Polymer Generative Pipeline" `[1]`. This tutorial provides an introduction to these molecular generative techniques, combining traditional rule-based methods with the flexibility of deep learning for chemical and polymer discovery.

# 2. Generative Methods <a class="anchor" id="genmethods"></a>

Generative methods in computational chemistry have revolutionized the design and discovery of new molecules by enabling the generation of novel structures with specific properties. Techniques such as BRICS, Neural Networks, Variational Autoencoders (VAEs), and Transformer-based models offer a diverse range of approaches for molecular generation. BRICS, a rule-based method, assembles molecules from fragments, ensuring chemical validity and synthetic feasibility. Neural networks, including LSTMs, VAEs, and Transformer models, learn complex patterns in molecular data to generate new compounds. While VAEs allow interpolation in latent spaces to design smooth transitions between molecular properties, some models leverage attention mechanisms to capture long-range dependencies within molecular structures, offering fine control over molecular features. Furthermore, diffusion models have emerged as a powerful probabilistic approach for generating high-quality molecules by reversing a noise-adding process.

However, for the purpose of exploring generative methodologies with DeepChem, we focus on BRICS and LSTM-based generators, which offer efficient and interpretable methods for fragment-based generation and sequential learning from molecular datasets, respectively. These methods are well suited to generating novel polymer structures and exploring chemical spaces in a controlled yet flexible manner, which is why they are the focus of this tutorial.

## BRICS <a class="anchor" id="brics"></a>

BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures) is a fragment-based generative method that assembles molecules by connecting predefined fragments. It is designed to mimic the way chemists think about breaking down molecules into smaller, chemically meaningful substructures. By combining these fragments, BRICS can generate novel molecular structures while ensuring synthetic feasibility. The strength of the BRICS method lies in its ability to generate chemically valid molecules that follow real-world synthesis rules, making it a useful tool for drug discovery and materials design. BRICS starts by decomposing molecules into fragments at specific, chemically significant bonds. It then combines these fragments in different ways to create new molecules. The method avoids chemically impossible or invalid bonds and ensures that the generated molecules adhere to certain synthetic constraints. The following example illustrates the idea.

Let's consider Styrene `(C6H5CH=CH2)` and Vinyl Acetate `(C=COC(C)=O)` as examples to illustrate the BRICS method.

In the case of Styrene, the molecule can be fragmented into two key parts:

- A benzene ring `(C6H5)`
- A vinyl group `(CH=CH2)`

For Vinyl Acetate `(C=COC(C)=O)`, the molecule can be fragmented into:

- An acetyl group `(*C(=O)C)`
- A vinyloxy group `(*OC=C)`

The BRICS fragmentation method breaks the molecules at strategic points while retaining bond breakage information, represented as "attachment points" `(*)` that indicate where these fragments can be recombined.

For instance, the benzene ring from Styrene can be recombined with the acetyl group from Vinyl Acetate, resulting in a new hypothetical molecule where the benzene ring is connected to an acetyl group. Alternatively, the vinyl group from Styrene could be combined with the vinyloxy group from Vinyl Acetate to form a novel structure. This recombination of molecular fragments in a controlled and logical way enables BRICS to generate a wide variety of potential new structures, providing a platform for exploring novel chemical compounds that could be of interest for further research or synthesis. The above process is illustrated in Figure-1.

![brics-image](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/assets/polymer_images/gen_tutorial_image.png?raw=1)

*Figure - 1: BRICS method illustrated for styrene and vinyl acetate*
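The fragmentation step can also be tried directly with RDKit's own `BRICS` module. The sketch below is an independent illustration and does not show how `BRICSGenerator` is implemented internally. It decomposes vinyl acetate, the molecule discussed above. Note that RDKit writes attachment points as numbered dummy atoms such as `[1*]` rather than a bare `*`.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# Vinyl acetate, written with the same SMILES as in the text above.
vinyl_acetate = Chem.MolFromSmiles("C=COC(C)=O")

# BRICSDecompose returns a set of fragment SMILES. Numbered dummy atoms
# (for example [1*] or [3*]) mark where a bond was broken, so each fragment
# remembers what kind of partner it can be reconnected to.
fragments = sorted(BRICS.BRICSDecompose(vinyl_acetate))
for fragment in fragments:
    print(fragment)
```

The numbers on the dummy atoms encode the chemical environment of the broken bond, which is what lets a BRICS-style builder reject recombinations that do not correspond to a sensible bond.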

## Neural Networks <a class="anchor" id="nns"></a>

Neural networks have significantly advanced the field of molecular generation by enabling the design of novel molecular structures based on learned patterns from large datasets. These techniques, which include Long Short-Term Memory (LSTM) networks, transformer-based models, and diffusion models, offer different advantages and challenges depending on the task at hand.

LSTMs are a type of recurrent neural network (RNN) designed to handle sequential data, making them well-suited for generating molecular sequences such as SMILES (Simplified Molecular Input Line Entry System) strings. By learning the dependencies between atoms and bonds in a molecule, LSTMs can generate realistic molecules one atom or bond at a time. For instance, given the initial part of a SMILES string, an LSTM can predict the next characters, continuing the sequence until the molecule is fully described. In molecular generation tasks, LSTMs are valuable for their ability to learn from chemical datasets with sequential representation.
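The character-by-character generation loop described above can be sketched without any neural network. In the toy example below a bigram count table stands in for the LSTM: it only looks at the previous character, whereas an LSTM conditions on the whole prefix. The training strings, the `^` and `$` boundary markers, and all function names are illustrative assumptions.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical start/end markers for this sketch


def fit_bigram_counts(smiles_list):
    """Count how often each character follows another in the training strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for smiles in smiles_list:
        tokens = [START] + list(smiles) + [END]
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def sample_string(counts, rng, max_len=40):
    """Generate one string by repeatedly sampling the next character."""
    generated = []
    current = START
    while len(generated) < max_len:
        choices = list(counts[current].keys())
        weights = list(counts[current].values())
        current = rng.choices(choices, weights=weights, k=1)[0]
        if current == END:  # the model decided the string is complete
            break
        generated.append(current)
    return "".join(generated)


training_smiles = ["CCO", "CCN", "CCCC", "CC(=O)O"]
counts = fit_bigram_counts(training_smiles)
rng = random.Random(0)
samples = [sample_string(counts, rng) for _ in range(5)]
print(samples)
```

Because the stand-in model has no memory beyond one character, it can easily emit unbalanced parentheses. Handling such long-range constraints is the motivation for a model with memory such as an LSTM.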

LSTMs operate using gates, which control the flow of information within the network:

1. **Forget Gate:** Decides which information from the previous time step should be discarded. It looks at the current input and the previous hidden state and outputs a value between 0 and 1. A value of 0 means "forget" and 1 means "keep."

2. **Input Gate:** Determines what new information should be added to the cell state. It uses the current input and the previous hidden state to decide which values to update in the memory.

3. **Cell State:** The "memory" of the LSTM. The cell state is updated by the input and forget gates and carries information across time steps. This allows the LSTM to "remember" long-term dependencies and keep important information.

4. **Output Gate:** Decides how much of the cell state is exposed as the next hidden state, using the current input and the previous hidden state. The hidden state is then passed to the next time step and also used for the output prediction.

These gates allow the LSTM to retain important information over long sequences while ignoring irrelevant data, which helps mitigate the vanishing gradient problem often encountered in traditional RNNs. In molecular generation, an LSTM network can be trained on molecular data (such as SMILES strings) to learn patterns of atoms, bonds, and functional groups. The LSTM can then generate new molecules by predicting the next character in the sequence, one step at a time, building up a full SMILES string. This sequential nature makes LSTMs effective for generating molecules iteratively, with each predicted atom or bond influencing the next. This encourages coherent output, although chemical validity is not guaranteed and generated strings should still be validated.
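The gate equations can be made concrete with a single LSTM step written in plain Python. This is a minimal sketch that uses scalar input, hidden state, and cell state. The weights in `params` are arbitrary illustrative values and do not come from a trained model. Real implementations use vectors and matrices, but the structure of the update is the same.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for scalar input, hidden state, and cell state."""

    def gate(name, activation):
        # Each gate has its own input weight, recurrent weight, and bias.
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # fraction of the old cell state to keep
    i = gate("input", sigmoid)         # fraction of the candidate to write
    g = gate("candidate", math.tanh)   # candidate values for the cell state
    o = gate("output", sigmoid)        # fraction of the cell state to expose

    c = f * c_prev + i * g             # updated cell state (the "memory")
    h = o * math.tanh(c)               # new hidden state
    return h, c


# Arbitrary illustrative weights: (input weight, recurrent weight, bias).
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, -0.2, 0.0),
    "candidate": (0.3, 0.6, 0.0),
    "output": (0.2, 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

Setting the forget gate close to 1 and the input gate close to 0 leaves the cell state almost unchanged from one step to the next, which is how information can be carried across many time steps.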

# 3. Code Implementation with DeepChem <a class="anchor" id="codeimplement"></a>

## Setup <a class="anchor" id="setup"></a>

In [None]:
# installation
! pip install rdkit deepchem[torch]

In [None]:
# all imports
from deepchem.models.torch_models import LSTMGenerator
from deepchem.models import BRICSGenerator
from deepchem.utils.poly_wd_graph_utils import PolyWDGStringValidator
from deepchem.data import NumpyDataset
import torch
from rdkit import Chem

## BRICS Generation with SMILES <a class="anchor" id="bricssmiles"></a>

DeepChem integrates BRICS functionality through the `BRICSGenerator` class. This class manages the underlying processes and generates unique candidate molecules, providing the results via a straightforward method. The features are outlined below.

In [2]:
sample_smiles = ['CC(=O)Oc1ccccc1C(=O)O', 'CC(=O)NC1=CC=C(O)C=C1']
brics_generator = BRICSGenerator()
generated_candidates, number_of_candidates = brics_generator.sample(sample_smiles)

In [3]:
for i, candidate in enumerate(generated_candidates):
    print(i+1,".", candidate)

1 . O=C(O)c1ccc(O)cc1
2 . O=C(O)c1ccccc1-c1ccc(O)cc1
3 . O=C(O)c1ccccc1C(=O)O
4 . CC(=O)Nc1ccccc1C(=O)O
5 . O=C(O)c1ccccc1Nc1ccc(O)cc1
6 . O=C(O)c1ccccc1Nc1ccccc1-c1ccc(O)cc1
7 . O=C(O)c1ccccc1Nc1ccccc1C(=O)O
8 . O=C(O)c1ccccc1-c1ccccc1-c1ccc(O)cc1
9 . O=C(O)c1ccccc1-c1ccccc1C(=O)O
10 . CC(=O)Nc1ccccc1-c1ccccc1C(=O)O
11 . O=C(O)c1ccccc1-c1ccccc1Nc1ccc(O)cc1
12 . CC(=O)Oc1ccccc1-c1ccccc1C(=O)O
13 . O=C(O)c1ccccc1-c1ccccc1Oc1ccc(O)cc1
14 . O=C(O)c1ccccc1-c1ccccc1-c1ccccc1C(=O)O
15 . O=C(O)c1ccccc1-c1ccccc1-c1ccccc1-c1ccc(O)cc1
16 . CC(=O)Oc1ccccc1C(=O)O
17 . O=C(O)c1ccccc1Oc1ccc(O)cc1
18 . O=C(O)c1ccccc1Oc1ccccc1C(=O)O
19 . O=C(O)c1ccccc1Oc1ccccc1-c1ccc(O)cc1
20 . CC(=O)OC(C)=O
21 . CC(=O)Oc1ccc(O)cc1
22 . CC(=O)Oc1ccccc1-c1ccc(O)cc1
23 . CC(=O)Oc1ccccc1-c1ccccc1-c1ccc(O)cc1
24 . CC(=O)Oc1ccccc1OC(C)=O
25 . CC(=O)Oc1ccccc1Oc1ccc(O)cc1
26 . CC(=O)Oc1ccccc1Nc1ccc(O)cc1
27 . CC(=O)Nc1ccccc1OC(C)=O
28 . Oc1ccc(-c2ccccc2Oc2ccccc2-c2ccc(O)cc2)cc1
29 . Oc1ccc(Nc2ccccc2Oc2ccc(O)cc2)cc1
30 . CC(=

In the above example, 49 new candidates are generated from two input SMILES strings.

## BRICS Generation with PSMILES <a class="anchor" id="bricspsmiles"></a>

By default, the `BRICSGenerator` class takes SMILES as input. We have also extended it to handle PSMILES and dendrimers (structures with more than two endpoints, `*`). These features can be used as follows.

In [4]:
# with polymer application
sample_psmiles = ['*CC(=O)CC*', '*c1ccc(CC*)cc1', '*CC(CC*)CC*']
# is_polymer determines whether the input is a polymer or not  
psmiles_generated_candidates, ps_number_of_candidates = brics_generator.sample(sample_psmiles, is_polymer=True)

In [5]:
for i, candidate in enumerate(psmiles_generated_candidates):
    print(i+1,".", candidate)

1 . [*]CCc1ccc([*])cc1
2 . [*]c1ccc(-c2ccc([*])cc2)cc1


In the above example, three PSMILES inputs are used to generate two unique PSMILES candidates.

In [6]:
# with dendrimer application
sample_dendrimers_source = [
        # Branched aliphatic cores
        '*C(=O)N(CC(=O)*)CC(=O)*',  # 3-point N-centered with amide bonds
        '*OC(=O)c1c(C(=O)O*)cc(C(=O)O*)cc1',  # 3-point aromatic with ester bonds
    ]
dendrimer_generated_candidates, den_number_of_candidates = brics_generator.sample(sample_dendrimers_source, is_polymer=True, is_dendrimer=True)

In [7]:
for i, candidate in enumerate(dendrimer_generated_candidates):
    print(i+1,".", candidate)

1 . O=C([*])CO[*]
2 . O=C([*])O[*]
3 . O=C(O[*])c1cc(O[*])ccc1O[*]
4 . [*]Oc1ccc(O[*])c(O[*])c1
5 . O=C(O[*])c1ccc(O[*])c(O[*])c1
6 . O=C(O[*])c1ccc(O[*])cc1O[*]
7 . O=C([*])N(C(=O)[*])C(=O)[*]
8 . O=C([*])CN(C(=O)[*])C(=O)[*]
9 . O=C([*])CN(CC(=O)[*])C(=O)[*]
10 . O=C([*])CN(CC(=O)[*])CC(=O)[*]


In the above example, 10 candidates are generated: 2 are PSMILES for polymers (two endpoints), while the other 8 can be described as dendrimers (more than two endpoints).
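The split between polymers and dendrimers can be checked by counting the `*` endpoints in each candidate. The helper below is a small sketch written for this tutorial (the name `count_endpoints` is not part of DeepChem), applied to the ten candidates printed above.

```python
def count_endpoints(psmiles: str) -> int:
    """Count attachment points, written as `*` or `[*]`, in a PSMILES string."""
    return psmiles.count("*")


# The ten candidates printed in the output above.
candidates = [
    "O=C([*])CO[*]",
    "O=C([*])O[*]",
    "O=C(O[*])c1cc(O[*])ccc1O[*]",
    "[*]Oc1ccc(O[*])c(O[*])c1",
    "O=C(O[*])c1ccc(O[*])c(O[*])c1",
    "O=C(O[*])c1ccc(O[*])cc1O[*]",
    "O=C([*])N(C(=O)[*])C(=O)[*]",
    "O=C([*])CN(C(=O)[*])C(=O)[*]",
    "O=C([*])CN(CC(=O)[*])C(=O)[*]",
    "O=C([*])CN(CC(=O)[*])CC(=O)[*]",
]

polymers = [c for c in candidates if count_endpoints(c) == 2]
dendrimers = [c for c in candidates if count_endpoints(c) > 2]
print(len(polymers), "polymer candidates")      # 2 polymer candidates
print(len(dendrimers), "dendrimer candidates")  # 8 dendrimer candidates
```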

## LSTM Generators <a class="anchor" id="lstmimplement"></a>

DeepChem provides a versatile LSTM generator that learns sequential string patterns from its input, making it compatible with various string representations such as SMILES, PSMILES, and weighted graphs. The `LSTMGenerator` class is used for both training and inference of the LSTM model. The following example demonstrates its application.

In [8]:
# training the model
generator = LSTMGenerator(model_dir="./lstm_generator_model")
dataset = NumpyDataset(["CCC"])
loss = generator.fit(dataset,
                        nb_epoch=3,
                        checkpoint_interval=1,
                        max_checkpoints_to_keep=1)

In [9]:
print("loss >>>", loss)

loss >>> 10.016533851623535


In [10]:
# inference
generated = generator.sample(1,max_len=10)
print("generated >>>", generated)

generated >>> ['UnderworldEnricoyaGracepronunciationndingpuritychantedrecommendationsRadar']


In the above example, the generated string is essentially random because the model was trained on a single SMILES string (`CCC`) for only three epochs.

In [21]:
restored_generator = LSTMGenerator()

restored_generator.load_from_pretrained(source_model=LSTMGenerator(),
                               model_dir="./lstm_generator_model")
random_gens = restored_generator.sample(3, max_len=10)

In [22]:
print("random generations >>>", random_gens)

random generations >>> ['##lakedecreasefailsHoganSaskatchewanTrumanhoarsegigglelieShadow', '##vageDC00CouldchangedContemporaryLaurentOperatingwarsencourage', 'mythologynanttransformatainherentadministeredeuxstoredinformaloak']


In this manner, we can load the model from a saved checkpoint and use it for inference.

As mentioned in the research paper `[1]`, we trained two models: one on PSMILES (1 million data points, 5 epochs) and one on weighted directed graphs (42 thousand data points, 50 epochs). In the next section of the tutorial, we will use those model checkpoints for inference. For more detail about the datasets and the training process, see our research paper "Open-Source Polymer Generative Pipeline".

## Pretrained Implementation with PSMILES and Weighted Directed Graphs <a class="anchor" id="pretrainedpsmileswdg"></a>

In the source research paper `[1]`, we trained LSTM models on PSMILES and weighted directed graph (WDG) string representations: 1 million PSMILES data points for 5 epochs and 42 thousand WDG data points for 50 epochs. Those model checkpoints are used to generate the following candidates. The checkpoints can be downloaded and loaded using the following code blocks.

In [11]:
# getting the pretrained model for LSTM with 1M PSMILES
! wget https://deepchemdata.s3.us-west-1.amazonaws.com/trained_models/PSMILES_LSTM_1M_5_epochs.pth

# getting the pretrained model for LSTM with 42K WDGraph
! wget https://deepchemdata.s3.us-west-1.amazonaws.com/trained_models/WDGraph_LSTM_42K_50_epochs.pth

--2025-02-21 21:17:24--  https://deepchemdata.s3.us-west-1.amazonaws.com/trained_models/PSMILES_LSTM_1M_5_epochs.pth
Resolving deepchemdata.s3.us-west-1.amazonaws.com (deepchemdata.s3.us-west-1.amazonaws.com)... 16.15.4.244, 3.5.162.13, 52.219.117.138, ...
Connecting to deepchemdata.s3.us-west-1.amazonaws.com (deepchemdata.s3.us-west-1.amazonaws.com)|16.15.4.244|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48343184 (46M) [binary/octet-stream]
Saving to: ‘PSMILES_LSTM_1M_5_epochs.pth’


2025-02-21 21:17:52 (1.84 MB/s) - ‘PSMILES_LSTM_1M_5_epochs.pth’ saved [48343184/48343184]

--2025-02-21 21:17:53--  https://deepchemdata.s3.us-west-1.amazonaws.com/trained_models/WDGraph_LSTM_42K_50_epochs.pth
Resolving deepchemdata.s3.us-west-1.amazonaws.com (deepchemdata.s3.us-west-1.amazonaws.com)... 52.219.112.49, 52.219.120.81, 52.219.113.170, ...
Connecting to deepchemdata.s3.us-west-1.amazonaws.com (deepchemdata.s3.us-west-1.amazonaws.com)|52.219.112.49|:443... connec

In [None]:
#PSMILES Generation
generator = LSTMGenerator()
ckpt = torch.load("./PSMILES_LSTM_1M_5_epochs.pth")
generator.model.load_state_dict(ckpt)

<All keys matched successfully>

In [15]:
psmile_samples = generator.sample(20, max_len = 500)

In [16]:
for i, candidate in enumerate(psmile_samples):
    print(i+1,".", candidate)

1 . [*]OC(=O)c1nc(-c2ccccc2)[nH]c1-c1ccc(S(=O)(=O)c2ccc(C([*])=O)cc2)cc1
2 . [*]C(=O)C(Cc1cccc(C2CC(C)(C)CC(C)(c3ccc4c(c3)C(=O)N([*])C4=O)C2)c1)(C(F)(F)F)C(F)(F)F
3 . [*]c1ccc2c(c1)nc([*])n2-c1ccc(-c2cccc(C3(C)CC(c4cccc(N5C(=O)c6ccc(C(=O)O)cc6C5=O)c4)CC(C)(C)C3)c2)cc1
4 . [*]c1ccc2c(c1)C(=O)N(c1ccc(N3C(=O)c4ccc(-c5ccc(C)c(-c6ccc7c(c6)C(=O)N([*])C7=O)c5)cc4C3=O)c(N3C(=O)c4ccccc4C3=O)c1)C2=O
5 . [*]CSC(=O)CCCCCCCCC(=O)N1C(=O)c2ccc([Si](C)(C)c3ccc(-c4ccc([*])cc4)cc3)cc2C1=O
6 . [*]c1nc2cc(Nc3ccc4c(c3)C(=O)N([*])C4=O)ccc2nc1NC=Nc1ccc(N=Cc2ccccc2NC(=O)CCCCCCCCCC(=O)OC)cc1
7 . [*]c1ccc(Oc2ccc([*])cc2-c2nnc(-c3ccccc3)o2)s1
8 . [*]CCCC(=O)Nc1ccc(N=Cc2ccc(-c3ccc(N=Nc4ccc(CCC[*])cc4)cc3)cc2)cc1
9 . [*]c1ccc(-c2ccc(-c3ccc(C)cc3-c3ccc4c(c3)C(=O)N(c3cccc(-c5cc([*])c6cccccc5-6)c3)C4=O)cc2)cc1
10 . [*]c1ccc(C#Cc2ccc(Oc3ccc([*])cc3-c3ccc(S(=O)(=O)c4ccc(-c5ccc(C#N)cc5)cc4)cc3)cc2)cc1
11 . [*]Oc1ccc(C(O)c2ccc([*])cc2)c(-c2ccccc2)c1
12 . [*]CCCCCCCCCCCCOC(=O)c1ccc(-c2ccc([Si](C)(C)c3ccc(N4C(=O)c5cc6c(cc5

In this manner, we can use the pretrained checkpoint from the research paper to generate hypothetical PSMILES candidates. However, not all candidates are chemically valid, so we can employ a validation step as follows.

In [17]:
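def validate_psmiles(smiles: str) -> bool:
    """Return True if RDKit can parse the string into a molecule."""
    try:
        # MolFromSmiles returns None for a string it cannot parse
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None
    except Exception:
        return False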

In [18]:
for psmiles in psmile_samples:
    print("value >>", psmiles)
    # swap the [*] endpoints for a placeholder atom ([At]) before parsing with RDKit
    psmiles = psmiles.replace("[*]", "[At]")
    print("validation result >>", validate_psmiles(psmiles))

value >> [*]OC(=O)c1nc(-c2ccccc2)[nH]c1-c1ccc(S(=O)(=O)c2ccc(C([*])=O)cc2)cc1
validation result >> True
value >> [*]C(=O)C(Cc1cccc(C2CC(C)(C)CC(C)(c3ccc4c(c3)C(=O)N([*])C4=O)C2)c1)(C(F)(F)F)C(F)(F)F
validation result >> True
value >> [*]c1ccc2c(c1)nc([*])n2-c1ccc(-c2cccc(C3(C)CC(c4cccc(N5C(=O)c6ccc(C(=O)O)cc6C5=O)c4)CC(C)(C)C3)c2)cc1
validation result >> True
value >> [*]c1ccc2c(c1)C(=O)N(c1ccc(N3C(=O)c4ccc(-c5ccc(C)c(-c6ccc7c(c6)C(=O)N([*])C7=O)c5)cc4C3=O)c(N3C(=O)c4ccccc4C3=O)c1)C2=O
validation result >> True
value >> [*]CSC(=O)CCCCCCCCC(=O)N1C(=O)c2ccc([Si](C)(C)c3ccc(-c4ccc([*])cc4)cc3)cc2C1=O
validation result >> True
value >> [*]c1nc2cc(Nc3ccc4c(c3)C(=O)N([*])C4=O)ccc2nc1NC=Nc1ccc(N=Cc2ccccc2NC(=O)CCCCCCCCCC(=O)OC)cc1
validation result >> True
value >> [*]c1ccc(Oc2ccc([*])cc2-c2nnc(-c3ccccc3)o2)s1
validation result >> True
value >> [*]CCCC(=O)Nc1ccc(N=Cc2ccc(-c3ccc(N=Nc4ccc(CCC[*])cc4)cc3)cc2)cc1
validation result >> True
value >> [*]c1ccc(-c2ccc(-c3ccc(C)cc3-c3ccc4c(c3)C(=O)N(c3

[21:26:07] SMILES Parse Error: ring closure 4 duplicates bond between atom 42 and atom 46 for input: '[At]CCCCCCCCCCCCOC(=O)c1ccc(-c2ccc([Si](C)(C)c3ccc(N4C(=O)c5cc6c(cc5C4=O)C4(C(C)C)c4ccccc4C4=C5OC(=O)C(CCC[At])C5C4)cc3)cc2)cc1'


We can observe that 1 of the 20 generated PSMILES is invalid.
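When many candidates are generated, it is convenient to summarize validity as a single count instead of reading the per-candidate output. The sketch below reuses the `[*]` to `[At]` substitution shown above. The function name `validity_rate` and the three example strings are hypothetical and are not outputs of the pretrained model.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages so only the summary is printed.
RDLogger.DisableLog("rdApp.*")


def validity_rate(psmiles_list):
    """Return (number of parseable strings, total number of strings)."""
    valid = 0
    for psmiles in psmiles_list:
        # Same placeholder substitution as in the validation loop above.
        mol = Chem.MolFromSmiles(psmiles.replace("[*]", "[At]"))
        if mol is not None:
            valid += 1
    return valid, len(psmiles_list)


# Hypothetical inputs: the last string has an unclosed ring and cannot be parsed.
examples = ["[*]CC[*]", "[*]c1ccc([*])cc1", "[*]C1CC[*]"]
valid, total = validity_rate(examples)
print(f"{valid}/{total} valid")  # 2/3 valid
```

Calling `validity_rate(psmile_samples)` on the generated list gives the same information as the loop above in one line.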

In a similar fashion, we can use the LSTM for weighted directed graph string representations as follows.

In [21]:
generator_wdg = LSTMGenerator()
ckpt = torch.load("./WDGraph_LSTM_42K_50_epochs.pth")
generator_wdg.model.load_state_dict(ckpt)

<All keys matched successfully>

In [33]:
wdg_gen_values = generator_wdg.sample(20, max_len = 500)

In [35]:
for i, candidate in enumerate(wdg_gen_values):
    print(i+1,".", candidate)

1 . [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1ncc([*:4])c2ccccc12|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
2 . [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1sc([*:4])c2c1OCCO2|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
3 . [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1cc([*:4])cc(CC(=O)O)c1|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
4 . [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1ccc([*:4])c(Cl)c1Cl|0.25|0.75|<1-2:0.375:0.375<1-1:0.375:0.375<2-2:0.375:0.375<3-4:0.375:0.375<3-3:0.375:0.375<4-4:0.125:0.125<1-3:0.125:0.125<1-4:0.125:0.125<2-3:0.125:0.125<2-4:0.125:0.125
5 . [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1sc([*:4])c(C(=O)O)c1C(=O

In this manner, 20 candidates in the weighted directed graph string format have been generated. We can apply a validation workflow to them as well. Validation of weighted directed graphs is implemented in DeepChem by the `PolyWDGStringValidator` class. The validator expects a slightly different format, chosen for its ease of featurization with graph-based neural networks. The difference is as follows.

The generated format -> ``[*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1cc2cc3sc([*:4])cc3c2n1``


The compatible format -> ``[1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1cc2cc3sc([4*])cc3c2n1``

We will use a regular expression to convert from the generated format to the compatible format, as follows.

In [37]:
import re

def convert_format(text):
    return re.sub(r'\[\*\:(\d+)\]', r'[\1*]', text)

converted_values = [convert_format(value) for value in wdg_gen_values]
print("raw format >>", wdg_gen_values[0])
print("converted_format >>",converted_values[0])

raw format >> [*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1ncc([*:4])c2ccccc12|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
converted_format >> [1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1ncc([4*])c2ccccc12|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25


We can observe in the above outputs that the format is slightly changed to align with the validator. In the code below, we validate the converted candidates.

In [38]:
validator = PolyWDGStringValidator()
for value in converted_values:
    print("value >>", value)
    print("validation result >>", validator.validate(value))

value >> [1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1ncc([4*])c2ccccc12|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
validation result >> True
value >> [1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1sc([4*])c2c1OCCO2|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
validation result >> True
value >> [1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1cc([4*])cc(CC(=O)O)c1|0.25|0.75|<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25
validation result >> True
value >> [1*]c1ccc2c(c1)S(=O)(=O)c1cc([2*])ccc1-2.[3*]c1ccc([4*])c(Cl)c1Cl|0.25|0.75|<1-2:0.375:0.375<1-1:0.375:0.375<2-2:0.375:0.375<3-4:0.375:0.375<3-3:0.375:0.375<4-4:0.125:0.125<1-3:0.125:0.125<1-4:0.125:0.125<2-3:0.125:0.125<2-4:0.125:0.

All 20 generated candidates are found to be valid, which suggests that the trained LSTM is effective at generating WDG strings.
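To see what a WDG string carries, it helps to split it into its parts. The parser below is a minimal sketch based only on the layout visible in the outputs above: monomer SMILES joined by `.`, then `|`-separated numeric fields, then a list of `<i-j:w1:w2` edge entries. Reading the numeric fields as monomer fractions and the two edge numbers as one weight per direction is an assumption made here for illustration. The function name `parse_wdg` is not part of DeepChem.

```python
def parse_wdg(wdg: str):
    """Split a WDG string into monomers, fractions, and weighted edges."""
    fields = wdg.split("|")
    monomers = fields[0].split(".")                 # one SMILES per monomer
    fractions = [float(value) for value in fields[1:-1]]
    edges = []
    for item in fields[-1].split("<"):
        if not item:                                # skip the empty leading piece
            continue
        endpoints, weight_a, weight_b = item.split(":")
        start, end = (int(index) for index in endpoints.split("-"))
        edges.append((start, end, float(weight_a), float(weight_b)))
    return monomers, fractions, edges


# First generated candidate from the output above.
example = (
    "[*:1]c1ccc2c(c1)S(=O)(=O)c1cc([*:2])ccc1-2.[*:3]c1ncc([*:4])c2ccccc12"
    "|0.25|0.75|"
    "<1-3:0.25:0.25<1-4:0.25:0.25<2-3:0.25:0.25<2-4:0.25:0.25<1-2:0.25:0.25"
    "<3-4:0.25:0.25<1-1:0.25:0.25<2-2:0.25:0.25<3-3:0.25:0.25<4-4:0.25:0.25"
)
monomers, fractions, edges = parse_wdg(example)
print(len(monomers), "monomers with fractions", fractions)
print(len(edges), "edges; first edge:", edges[0])
```

The same function works on the converted `[1*]` format, since only the monomer SMILES differ between the two notations.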

# 4. Summary <a id="sum"></a>

In this tutorial, we explored generative methodologies using the DeepChem library, focusing on BRICS and LSTM-based generators for molecular and polymer generations. We covered the following key points:

1. **Introduction to Generative Methods**: We discussed the importance of generative modeling in computational chemistry and introduced BRICS and neural network-based methods like LSTM for molecular generation.

2. **BRICS Generation**: We demonstrated how to use the `BRICSGenerator` class in DeepChem to generate novel molecules from SMILES and PSMILES inputs. We also extended the application to dendrimers and polymers.

3. **LSTM Generators**: We trained an LSTM model using the `LSTMGenerator` class and generated new molecular sequences. We also showed how to load pretrained models for PSMILES and weighted directed graphs (WDG) to generate new candidates.

4. **Validation**: We implemented validation workflows to ensure the chemical validity of generated PSMILES and WDG strings using RDKit and DeepChem's `PolyWDGStringValidator`.

By combining rule-based methods like BRICS with deep learning techniques such as LSTM, we can efficiently explore chemical spaces and generate novel molecular structures with potential applications in drug discovery and materials science.

# 5. References <a id="ref"></a>

1. [Mohanty, Debasish, et al. "Open-source Polymer Generative Pipeline." arXiv preprint arXiv:2412.08658 (2024).](https://arxiv.org/abs/2412.08658)

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.gg/cGzwCdrUqS) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

## Citing The Research Paper
This tutorial implements the work described in the research paper "Open-Source Polymer Generative Pipeline". If you find this work helpful, please consider citing it using the provided BibTeX.

In [None]:
@article{mohanty2024open,
  title={Open-source Polymer Generative Pipeline},
  author={Mohanty, Debasish and Shreyas, V and Palai, Akshaya and Ramsundar, Bharath},
  journal={arXiv preprint arXiv:2412.08658},
  year={2024}
}

## Citing This Tutorial
If you found this tutorial useful please consider citing it using the provided BibTeX.

In [None]:
@manual{Intro1,
 title={Exploring Generative Methodologies with Deepchem},
 organization={DeepChem},
 author={Mohanty, Debasish},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Exploring_Generative_Methodologies_with_Deepchem.ipynb}},
 year={2025},
}