# Generative Steps for novel PabB Inhibitors

This notebook contains the steps to use the scorer.py function together with chemsampler to sample and generate compounds optimised for PabB binding.

In [2]:
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.rdmolfiles import MolToPDBFile
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.ML.Descriptors import MoleculeDescriptors
from standardiser import standardise
import random
import string


DATAPATH = "../data"
RESULTSPATH = "../results"
SOURCEPATH = "../src"

## Initial dataset preparation

The reference_library.csv file was downloaded from https://github.com/ersilia-os/groverfeat/blob/main/data/reference_library.csv and added to the repo. It was then filtered loosely following Lipinski's rule of 5, with adjusted thresholds suited to lead compounds (molecular weight 250-450, logP 1-3, at most 5 H-bond acceptors, at most 4 H-bond donors and at most 5 rotatable bonds). More info can be found on the wiki page here: https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five#:~:text=Lipinski's%20rule%20states%20that%2C%20in,all%20nitrogen%20or%20oxygen%20atoms.

In [23]:
smiles_csv = os.path.join(DATAPATH, "smiles", "reference_library.csv")
filtered_smiles = os.path.join(DATAPATH, "smiles", "filtered_std_smiles.csv")
col_Names=["SMILES"]
df = pd.read_csv(smiles_csv, names=col_Names)

selected = []

for s in df["SMILES"]:
    mol = Chem.MolFromSmiles(s)
    if mol is None:
        continue  # skip SMILES that RDKit cannot parse
    # descriptors used by the lead-like filter
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hacceptors = Descriptors.NumHAcceptors(mol)
    hdonors = Descriptors.NumHDonors(mol)
    numrotatablebonds = Descriptors.NumRotatableBonds(mol)
    if (250 <= mw <= 450 and 1 <= logp <= 3 and hacceptors <= 5
            and hdonors <= 4 and numrotatablebonds <= 5):
        selected.append(s)

#this almost follows Lipinski's rule of 5, but for lead compounds (see wiki)
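
As a minimal sketch, the thresholds used in the filter above can be collected in one place and applied to precomputed descriptor values. The names `LeadLikeLimits` and `is_lead_like` are hypothetical and not part of this repo; the descriptor values in the example are illustrative only, and in the notebook they would come from the RDKit calls shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeadLikeLimits:
    """Thresholds copied from the filter cell above (hypothetical container)."""
    mw_min: float = 250.0
    mw_max: float = 450.0
    logp_min: float = 1.0
    logp_max: float = 3.0
    max_h_acceptors: int = 5
    max_h_donors: int = 4
    max_rotatable_bonds: int = 5


def is_lead_like(mw, logp, h_acceptors, h_donors, rotatable_bonds,
                 limits=LeadLikeLimits()):
    """Return True if the precomputed descriptor values pass every threshold."""
    return (
        limits.mw_min <= mw <= limits.mw_max
        and limits.logp_min <= logp <= limits.logp_max
        and h_acceptors <= limits.max_h_acceptors
        and h_donors <= limits.max_h_donors
        and rotatable_bonds <= limits.max_rotatable_bonds
    )


# Illustrative descriptor values, not taken from the reference library.
print(is_lead_like(mw=320.4, logp=2.1, h_acceptors=4, h_donors=2, rotatable_bonds=3))
print(is_lead_like(mw=520.0, logp=2.1, h_acceptors=4, h_donors=2, rotatable_bonds=3))
```

Keeping the limits in a single object makes it easier to relax or tighten one threshold in a later round without editing the loop itself.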



In [14]:
len(df["SMILES"]) #how many original smiles?

1999381

In [24]:
len(selected) #how many in our list?


201515

In [45]:
id_list = list(range(len(selected)))
len(id_list) # must create a list of 'IDs' for the ID column - this is required for the scorer.py


In [51]:
filtered_selected = os.path.join(DATAPATH, "smiles", "200k_for_chemsampler.csv")

data = {'ID': id_list, 'SMILES': selected}  # avoid shadowing the built-in dict

df = pd.DataFrame(data)

df.to_csv(filtered_selected, index=False) 

#saves a .csv file with a column for ID and a column for SMILES which will be used to run the first round of docking into PabB.

The aim of producing this dataset of ~200k lead-like compounds is to dock a large number of potential leads into PabB and obtain docking scores. Compounds whose docking scores indicate high affinity relative to the known binders can be combined with the known binders to seed chemsampler. We are looking for 50-100 compounds to use as a seed. Chemsampler will then sample and generate ~1 million compounds. These can be filtered, the remaining compounds re-docked, and further rounds of generation can follow.
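
The seed-selection step described above could look roughly like the sketch below. It rests on assumptions not stated in this notebook: that a more negative docking score means better predicted binding, and that "high affinity compared to the known binders" means scoring at least as well as the weakest known binder. The function name `select_seeds`, the SMILES strings and all scores are made up for illustration.

```python
def select_seeds(library_scores, known_binder_scores, max_seeds=100):
    """Pick library compounds that score at least as well as the known binders.

    library_scores: dict mapping SMILES -> docking score
    known_binder_scores: list of docking scores for the known binders
    Assumption: a more negative score means better predicted binding.
    """
    if not known_binder_scores:
        raise ValueError("need at least one known-binder score as a reference")
    # Assumed cutoff: the weakest (least negative) known binder.
    cutoff = max(known_binder_scores)
    hits = [(smi, score) for smi, score in library_scores.items() if score <= cutoff]
    hits.sort(key=lambda pair: pair[1])  # best (most negative) score first
    return hits[:max_seeds]


# Made-up scores for illustration only.
library = {
    "c1ccccc1O": -7.9,
    "CCOC(=O)c1ccccc1": -6.2,
    "CC(=O)Nc1ccc(O)cc1": -8.4,
    "CCN": -3.1,
}
known = [-7.5, -8.0]

seeds = select_seeds(library, known, max_seeds=100)
for smi, score in seeds:
    print(smi, score)
```

The `max_seeds` cap reflects the 50-100 seed compounds mentioned above; a stricter cutoff (for example the median known-binder score) could be swapped in if too many compounds pass.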