# FEgrow: An Open-Source Molecular Builder and Free Energy Preparation Workflow

**Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole**

## Overview
Configure the Active Learning

In [1]:
import pandas as pd
import prody
from rdkit import Chem

import fegrow
from fegrow import ChemSpace

from fegrow.testing import core_5R83_path, smiles_5R83_core_path, rec_5R83_path

In [2]:
# create the chemical space
cs = ChemSpace()
# we're not growing the scaffold, we're superimposing bigger molecules on it
cs.add_scaffold(Chem.SDMolSupplier(core_5R83_path)[0])
cs.add_protein(rec_5R83_path)

Dask can be watched on http://192.168.178.20:8989/status




In [3]:
# switch on the caching
cs.set_dask_caching()

In [4]:
# load 50k Smiles
data = pd.read_csv(smiles_5R83_core_path)

# take only the first 200
smiles = data.Smiles.to_list()[:200]

# here we add Smiles which should already have been matched
# to the scaffold (rdkit Mol.HasSubstructMatch)
cs.add_smiles(smiles)
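The scaffold pre-matching mentioned in the cell above can be sketched with RDKit alone. This is a minimal illustration, not the FEgrow workflow itself: the pyridine scaffold and the three candidate SMILES are made-up examples, not entries from the 5R83 dataset.

```python
from rdkit import Chem

# Illustrative scaffold and candidates (assumptions, not taken from the dataset).
scaffold = Chem.MolFromSmiles("c1ccncc1")  # pyridine
candidates = ["Cc1ccncc1", "c1ccccc1", "OCCc1cccnc1"]

# Keep only the SMILES that parse and contain the scaffold as a substructure.
matching = []
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None and mol.HasSubstructMatch(scaffold):
        matching.append(smi)

print(matching)
```

Only the two pyridine derivatives are kept; benzene has no aromatic nitrogen and is rejected.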

In [5]:
# manually configure 5 cases: assign a score and mark each as training data
cs.df.loc[0, ("score", "Training")] = 3.248, True
cs.df.loc[1, ("score", "Training")] = 3.572, True
cs.df.loc[2, ("score", "Training")] = 3.687, True
cs.df.loc[3, ("score", "Training")] = 3.492, True
cs.df.loc[4, ("score", "Training")] = 3.208, True

# Active Learning

**Warning:** Change the logger to see what is happening inside `ChemSpace.evaluate`. It produces too much information to print to the screen.
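One way to make the logger change mentioned above is through Python's standard `logging` module. The sketch below assumes that FEgrow emits its messages through a logger named `"fegrow"`; that name and the `fegrow_eval.log` file name are assumptions, so check the package source for the actual logger name.

```python
import logging
import os
import tempfile

# Assumption: the package logs through the standard library under the name "fegrow".
logger = logging.getLogger("fegrow")
logger.setLevel(logging.INFO)

# Write the records to a file instead of the notebook output,
# because the evaluation step can be very verbose.
log_path = os.path.join(tempfile.gettempdir(), "fegrow_eval.log")  # hypothetical file name
handler = logging.FileHandler(log_path, mode="w")
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.info("logging configured")
handler.flush()
```

The log file can then be followed from a terminal while the notebook cell is running.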

In [6]:
from fegrow.al import Model, Query

In [7]:
# This is the default configuration
cs.model = Model.gaussian_process()
cs.query = Query.Greedy()

cs.active_learning(2)

Unnamed: 0,Smiles,Mol,score,h,Training,enamine_searched,enamine_id
178,[H]c1nc([H])c(N2C([H])([H])C([H])([H])C([H])([...,<rdkit.Chem.rdchem.Mol object at 0x71e6b41d3d10>,,,False,False,
135,[H]c1nc([H])c(N2C([H])([H])C([H])([H])C([H])(O...,<rdkit.Chem.rdchem.Mol object at 0x71e6b41d2a40>,,,False,False,


In [8]:
cs.query = Query.UCB(beta=10)
cs.active_learning(2)

Unnamed: 0,Smiles,Mol,score,h,Training,enamine_searched,enamine_id,regression
178,[H]c1nc([H])c(N2C([H])([H])C([H])([H])C([H])([...,<rdkit.Chem.rdchem.Mol object at 0x71e6b41d3d10>,,,False,False,,1.431
179,[H]c1nc([H])c([C@]2([H])N(OC([H])([H])[H])C([H...,<rdkit.Chem.rdchem.Mol object at 0x71e6b41d3d80>,,,False,False,,1.48
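The two cells above select different molecules because the query strategies rank candidates differently. The toy sketch below shows the idea in plain Python; it does not call FEgrow or modAL. It assumes that a higher score is better and uses the usual form of the upper confidence bound, mean + beta * std. The molecule names and predictions are invented for illustration.

```python
# Hypothetical model predictions for four unscored molecules:
# name -> (predicted mean score, predicted standard deviation).
predictions = {
    "mol_a": (3.6, 0.05),
    "mol_b": (3.4, 0.10),
    "mol_c": (3.1, 0.30),
    "mol_d": (2.0, 0.08),
}


def greedy(mean, std):
    """Exploit only: rank by the predicted score."""
    return mean


def ucb(mean, std, beta=1.0):
    """Upper confidence bound: also reward uncertain predictions."""
    return mean + beta * std


def select(acquisition, n, **kwargs):
    """Return the n molecule names with the highest acquisition value."""
    ranked = sorted(
        predictions,
        key=lambda name: acquisition(*predictions[name], **kwargs),
        reverse=True,
    )
    return ranked[:n]


greedy_pick = select(greedy, 2)      # ['mol_a', 'mol_b']
ucb_pick = select(ucb, 2, beta=10)   # ['mol_c', 'mol_b']
```

With a large `beta`, the uncertain `mol_c` overtakes the confidently predicted `mol_a`, which is the exploration behaviour that UCB adds over a greedy query.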


In [9]:
# The query methods available in modAL.acquisition are made available. These include:
# Query.Greedy()
# Query.PI(tradeoff=0) - highest probability of improvement
# Query.EI(tradeoff=0) - highest expected improvement
# Query.UCB(beta=1) - highest upper confidence bound (employs modAL.models.BayesianOptimizer)

# Models include those from scikit-learn:
# Model.linear()
# Model.elastic_net()
# Model.random_forest()
# Model.gradient_boosting_regressor()
# Model.mlp_regressor()

# Model.gaussian_process()  # uses a TanimotoKernel by default, meaning that it
#                           # compares the fingerprints of all the training dataset
#                           # with the cases not yet studied, which can be expensive
#                           # computationally

cs.model = Model.linear()
cs.query = Query.Greedy()
cs.active_learning()

Unnamed: 0,Smiles,Mol,score,h,Training,enamine_searched,enamine_id,regression
15,[H]OC([H])([H])C([H])([H])c1c([H])nc([H])c([H]...,<rdkit.Chem.rdchem.Mol object at 0x71e6b41af530>,,,False,False,,2.505
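The comment on `Model.gaussian_process()` notes that a Tanimoto kernel compares every training fingerprint with every unstudied case. The sketch below illustrates why this scales with the product of the two set sizes. It is a plain-Python illustration rather than FEgrow's implementation, and the fingerprints are hypothetical sets of on-bit indices.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    # Convention chosen here: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


# Hypothetical fingerprints, each stored as the set of its on-bit indices.
train = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
pool = [frozenset({1, 2, 3}), frozenset({5, 6}), frozenset({7})]

# One similarity per (pool, train) pair: len(pool) * len(train) evaluations.
kernel = [[tanimoto(p, t) for t in train] for p in pool]

for row in kernel:
    print(row)
```

With a handful of training cases and tens of thousands of candidates the matrix is still manageable, but it grows with every active learning cycle as molecules move into the training set.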


### Search the Enamine database using sw.docking.org (check that it is online)
Please check whether you have permission to use this interface.
You will also need the pip package `pydockingorg`.

In [10]:
# search only molecules similar to the best molecule score-wise (n_best)
# and return up to 10 results per search
new_enamines = cs.add_enamine_molecules(n_best=1, results_per_search=10)

Querying Enamine REAL. 




Found 0 in 6.492062091827393
Enamine returned with 0 rows in 6.5s.
results Empty DataFrame
Columns: []
Index: []


type: 'DataFrame' object has no attribute 'hitSmiles'

In [None]:
new_enamines