# FEgrow: An Open-Source Molecular Builder and Free Energy Preparation Workflow

**Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole**

## Overview
Configure the Active Learning

In [None]:
import pandas as pd
import prody
from rdkit import Chem

import fegrow
from fegrow import ChemSpace

from fegrow.testing import core_5R83_path, smiles_5R83_core_path, rec_5R83_path

In [None]:
# create the chemical space
cs = ChemSpace()
# we're not growing the scaffold, we're superimposing bigger molecules on it
cs.add_scaffold(Chem.SDMolSupplier(core_5R83_path)[0])
cs.add_protein(rec_5R83_path)

In [None]:
# switch on the caching
cs.set_dask_caching()

In [None]:
# load 50k Smiles
data = pd.read_csv(smiles_5R83_core_path)

# take only the first 200
smiles = data.Smiles.to_list()[:200]

# here we add SMILES, which should already have been matched
# to the scaffold (RDKit Mol.HasSubstructMatch)
cs.add_smiles(smiles)

In [None]:
# manually assign scores to 5 cases and mark them as training data
cs.df.loc[0, ("score", "Training")] = 3.248, True
cs.df.loc[1, ("score", "Training")] = 3.572, True
cs.df.loc[2, ("score", "Training")] = 3.687, True
cs.df.loc[3, ("score", "Training")] = 3.492, True
cs.df.loc[4, ("score", "Training")] = 3.208, True

# Active Learning

## Warning! Please configure the logger to see what is happening inside ChemSpace.evaluate. There is too much information to print to the screen.
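
One way to follow the warning above is to route log records to a file with the standard `logging` module. This is a minimal sketch, not FEgrow's documented setup: the logger name `"fegrow"`, the helper `log_to_file`, and the file name `fegrow_al.log` are assumptions made for illustration, so check the package source for the logger name it actually uses.

```python
import logging
import os
import tempfile

# Assumption: FEgrow emits its messages through a logger named "fegrow".
# Verify the real logger name in the package before relying on this.
LOGGER_NAME = "fegrow"


def log_to_file(path, level=logging.INFO, name=LOGGER_NAME):
    """Send records from the named logger to a file instead of the screen."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    # Keep records away from the root logger's console handlers.
    logger.propagate = False
    return handler


# Hypothetical log location, chosen only for this example.
log_path = os.path.join(tempfile.gettempdir(), "fegrow_al.log")
handler = log_to_file(log_path)
logging.getLogger(LOGGER_NAME).info("active learning log started")
```

Writing to a file keeps the notebook output readable while preserving the full record for later inspection.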

In [None]:
from fegrow.al import Model, Query

In [None]:
# This is the default configuration
cs.model = Model.gaussian_process()
cs.query = Query.Greedy()

cs.active_learning(2)

In [None]:
cs.query = Query.UCB(beta=10)
cs.active_learning(2)
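
To illustrate how the two query strategies used above differ, here is a small standalone sketch. It does not call FEgrow or modAL. It assumes that higher scores are better, that the greedy strategy ranks candidates by the predicted mean alone, and that UCB ranks them by mean plus `beta` times the predicted standard deviation. The function names and the numbers are hypothetical.

```python
def greedy_scores(means):
    """Greedy acquisition: rank by the predicted mean only."""
    return list(means)


def ucb_scores(means, stds, beta=1.0):
    """Upper confidence bound: reward both a high mean and high uncertainty."""
    return [m + beta * s for m, s in zip(means, stds)]


def pick_top(scores, n):
    """Return the indices of the n highest-scoring candidates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Made-up predictions for four candidate molecules.
means = [3.2, 3.6, 3.4, 3.0]
stds = [0.1, 0.05, 0.3, 0.9]

greedy_pick = pick_top(greedy_scores(means), 2)
ucb_pick = pick_top(ucb_scores(means, stds, beta=10), 2)
```

With these numbers, the greedy pick is the two candidates with the highest predicted means (indices 1 and 2), while a large `beta` pushes UCB toward the most uncertain candidates (indices 3 and 2). A larger `beta` therefore favors exploration over exploitation.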

In [None]:
# The query methods in modAL.acquisition are available. These include:
# Query.Greedy()
# Query.PI(tradeoff=0) - highest probability of improvement
# Query.EI(tradeoff=0) - highest expected improvement
# Query.UCB(beta=1) - highest upper confidence bound (employs modAL.models.BayesianOptimizer)

# Models include the scikit-learn regressors:
# Model.linear()
# Model.elastic_net()
# Model.random_forest()
# Model.gradient_boosting_regressor()
# Model.mlp_regressor()

# Model.gaussian_process()  # uses a TanimotoKernel by default, meaning that it
#                           # compares the fingerprints of all the training dataset
#                           # with the cases not yet studied, which can be expensive
#                           # computationally

cs.model = Model.linear()
cs.query = Query.Greedy()
cs.active_learning()
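
The comment on `Model.gaussian_process()` notes that the Tanimoto kernel compares every training fingerprint with every unstudied case. The sketch below shows why that scales with the product of the two set sizes. It is a plain-Python illustration rather than FEgrow's `TanimotoKernel`: fingerprints are represented as sets of "on" bit indices, and the function names and example data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: treated as identical here (a convention choice).
        return 1.0
    return len(a & b) / union


def kernel_matrix(train, pool):
    """Compare every training fingerprint with every pool fingerprint.

    The cost is len(train) * len(pool) similarity evaluations.
    """
    return [[tanimoto(t, p) for p in pool] for t in train]


# Made-up fingerprints: two training molecules, three unstudied ones.
train = [{1, 2, 3}, {2, 3, 4}]
pool = [{1, 2, 3}, {5}, {3, 4}]

K = kernel_matrix(train, pool)
```

Here `K` has one row per training molecule and one column per pool molecule, so 200 SMILES with 5 training cases already need 5 × 195 comparisons, and the count grows quickly as more cases are scored.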