# 9: Pre-evaluated CS50K with Active Learning

**Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole**

## Overview
An active learning (AL) study in which precomputed Gnina scores are looked up instead of being recomputed for each selected molecule.

In [None]:
import pandas as pd
import prody
from rdkit import Chem

import fegrow
from fegrow import ChemSpace

from fegrow.testing import core_5R83_path, smiles_5R83_path

In [None]:
# create the chemical space
cs = ChemSpace()
# we're not growing the scaffold, we're superimposing bigger molecules on it
cs.add_scaffold(Chem.SDMolSupplier(core_5R83_path)[0])
# we can ignore the protein as the values have been pre-computed
cs.add_protein(None)

In [None]:
# switch on the caching
# I set it here to 6GB of RAM
cs.set_dask_caching(6e9)

In [None]:
# load 50k Smiles
oracle = pd.read_csv(smiles_5R83_path)

# remove rows with cnnaffinity == 0, which was used to signal structures that were too big
oracle = oracle[oracle.cnnaffinity!=0]

# here we add Smiles that should already have been matched
# to the scaffold (RDKit Mol.HasSubstructMatch)
smiles = oracle.Smiles.to_list()
cs.add_smiles(smiles)

# Active Learning

## Warning! Change the logging level to see what is happening inside ChemSpace.evaluate. The output is too verbose to print to the screen by default.

```python
import logging
logging.basicConfig(encoding='utf-8', level=logging.DEBUG)
```

In [None]:
from fegrow.al import Model, Query

In [None]:
# This is the default configuration
# cs.model = Model.gaussian_process()
cs.model = Model.linear()
cs.query = Query.Greedy()

In [None]:
# we will use the previously computed scores for this AL study:
# instead of evaluating each molecule, we look up its value in the table
def oracle_look_up(scaffold, h, smiles, *args, **kwargs):
    # returns (mol, data): no molecule is built, only the score is looked up
    return None, {"score": oracle[oracle.Smiles == smiles].iloc[0].cnnaffinity}
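
The look-up above filters the whole table on every call. A minimal sketch of a dictionary-based alternative follows, assuming each `Smiles` value appears only once. The two rows of data and the name `oracle_look_up_fast` are hypothetical, and the table is parsed with the standard library only so that the snippet runs on its own.

```python
import csv
import io

# Hypothetical two-row stand-in for the real table; only the column
# names (Smiles, cnnaffinity) are taken from the notebook.
toy_csv = io.StringIO("Smiles,cnnaffinity\nCCO,4.9\nCCN,5.3\n")

# Build the SMILES -> score mapping once, so each look-up is a dict access.
score_by_smiles = {
    row["Smiles"]: float(row["cnnaffinity"]) for row in csv.DictReader(toy_csv)
}


def oracle_look_up_fast(scaffold, h, smiles, *args, **kwargs):
    # Same return shape as oracle_look_up: (mol, data), with no molecule built.
    return None, {"score": score_by_smiles[smiles]}


mol, data = oracle_look_up_fast(None, None, "CCN")
print(mol, data)
```

With the DataFrame loaded in this notebook, an equivalent mapping could be built with `dict(zip(oracle.Smiles, oracle.cnnaffinity))`.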

In [None]:
# the first cycle will take more time
for cycle in range(20):
    # select 200 molecules per cycle
    selections = cs.active_learning(200)
    res = cs.evaluate(selections, full_evaluation=oracle_look_up)
    
    print(
        f"AL{cycle:2d}. "
        f"Mean: {res.score.mean():.2f}, "
        f"Max: {res.score.max():.2f}, "
        f">4.8: {sum(res.score > 4.8):3d}, "
        f">5.0: {sum(res.score > 5.0):3d}, "
        f">5.2: {sum(res.score > 5.2):3d}, "
        f">5.4: {sum(res.score > 5.4):3d}, "
    )
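
To make the select, evaluate, retrain cycle above more concrete, here is a toy sketch of a greedy active learning loop in plain Python. It is not FEgrow's implementation: the pool, the single feature per item, the hidden scores, the least-squares `fit_line` helper and the random first batch are all assumptions made for illustration.

```python
import random

random.seed(0)

# Hypothetical pool: each item has one feature x and a hidden score
# that stands in for the precomputed oracle value.
pool = {i: random.uniform(0.0, 10.0) for i in range(200)}
hidden = {i: 0.5 * x + random.gauss(0.0, 0.3) for i, x in pool.items()}


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


evaluated = {}  # item -> looked-up score
batch = 10

for cycle in range(5):
    remaining = [i for i in pool if i not in evaluated]
    if not evaluated:
        # No training data yet: start from a random batch (an assumption here).
        chosen = random.sample(remaining, batch)
    else:
        a, b = fit_line([pool[i] for i in evaluated], list(evaluated.values()))
        # Greedy query: take the items with the highest predicted score.
        chosen = sorted(remaining, key=lambda i: a * pool[i] + b, reverse=True)[:batch]
    scores = [hidden[i] for i in chosen]  # the "oracle" look-up
    evaluated.update(zip(chosen, scores))
    print(f"AL{cycle:2d}. Mean: {sum(scores) / batch:.2f}, Max: {max(scores):.2f}")
```

A greedy query always exploits the current model's best predictions, which is why the per-cycle mean is expected to rise quickly after the first batch and then fall as the best candidates are used up.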