# Getting started

This tutorial is a quick introduction to evaluating your generative model with the docking benchmark. You will learn how to:

1. Evaluate generated molecules using docking
2. Store the results
3. Calculate basic metrics from the results

## Before you begin

1. **Make sure your environment is prepared**. The easiest way is to create a new conda environment and set it up with the installation script located at `docking_benchmark/install_conda_env.sh`.

2. To use the proteins and datasets provided with the benchmark, you have to **download additional data**. Download [this zip](https://drive.google.com/open?id=1HJNgHBWE2eZc2gsHQhqay-V17GaviIxQ), unpack it, and set the `DOCKING_BENCHMARK_DATA` environment variable to the unpacked `data` directory. For example, on Linux, run `export DOCKING_BENCHMARK_DATA=/path/to/data`.

After you complete the steps above, you are ready to evaluate your model.
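Before moving on, it can help to confirm that the environment variable is visible from Python. The snippet below is a minimal sketch using only the standard library; the helper name `check_data_dir` is hypothetical and is not part of the benchmark package.

```python
import os
from pathlib import Path


def check_data_dir(env_var="DOCKING_BENCHMARK_DATA"):
    """Return the data directory as a Path, or raise with a readable message.

    Hypothetical helper for this tutorial; the benchmark itself may read
    the variable differently.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; export it before running the benchmark")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"{env_var} points to {path}, which is not a directory")
    return path
```

Calling `check_data_dir()` at the top of a notebook should fail early with a clear message if the data was not downloaded or the variable was set in a different shell.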

## Model evaluation

Let's assume your model is wrapped in a Python class with a `generate_molecule` method that returns a molecule in SMILES format.

In [1]:
import random

class RandomMoleculeGenerator:
    def __init__(self):
        self.smiles = [
            'C1CCCC12CC(=O)N(C(=O)C2)CCCC[N@@H+](CCC)[C@H](C3)COc(c34)cccc4OC',
            'C1CCCC12CC(=O)N(C(=O)C2)OCCC[N@@H+](CCC)[C@@H](C3)COc(c34)cccc4OC',
            'c1cc(F)ccc1C(=O)Nc(cc2)cc(c23)n(nc3)[C@H]4CC[N@H+](C)CC4',
            'c1cc(F)ccc1C(=O)NCCCC[N@@H+](CCC)[C@H](C2)COc(c23)cccc3OC',
            'c1cc(I)ccc1CCC[N@H+](CC2)CC[C@@H]2COc(c(c3c45)cccn3)nc4c(F)ccc5',
        ]
    
    def generate_molecule(self):
        return random.choice(self.smiles)

The class above generates a molecule by picking a random SMILES string from the `smiles` list. We will use it as a trivial generative model.
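The only thing the rest of this tutorial relies on is a `generate_molecule` method that returns a SMILES string. As a sketch of that contract, the block below writes it down as a `typing.Protocol` and adds a seeded variant of the toy generator so that runs are repeatable. Both `MoleculeGenerator` and `SeededGenerator` are hypothetical names introduced here for illustration; the benchmark does not require them.

```python
import random
from typing import Protocol, runtime_checkable


@runtime_checkable
class MoleculeGenerator(Protocol):
    """Hypothetical interface: anything with generate_molecule() -> SMILES string."""

    def generate_molecule(self) -> str:
        ...


class SeededGenerator:
    """Toy model that picks from a fixed pool using its own seeded RNG."""

    def __init__(self, smiles, seed=0):
        self.smiles = list(smiles)
        self._rng = random.Random(seed)  # private RNG, independent of the global one

    def generate_molecule(self) -> str:
        return self._rng.choice(self.smiles)


gen = SeededGenerator(["CCO", "c1ccccc1", "CC(=O)O"], seed=42)
# Structural check only: it verifies that the method exists, not what it returns.
assert isinstance(gen, MoleculeGenerator)
```

Two `SeededGenerator` instances created with the same pool and seed produce the same sequence of molecules, which makes small experiments easier to compare.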

In [2]:
model = RandomMoleculeGenerator()

With our model in place, we can evaluate the generated molecules. First, we need to pick a protein to dock the molecules to. The benchmark provides four proteins out of the box: 5HT1B, 5HT2B, ACM2 and CYP2D6. We will use 5HT1B. Use the `get_proteins` function from the `docking_benchmark.data.proteins` package to access the provided proteins.

In [3]:
from docking_benchmark.data.proteins import get_proteins

protein = get_proteins()['5ht1b']



A protein is represented by a simple `Protein` class. To learn more about it, including how to access fine-tuning datasets, see the [Proteins and datasets notebook](proteins-and-datasets.ipynb). The most important thing for now is that the `Protein` class has a `dock_smiles_to_protein` method. Let's see how it works by docking a single molecule.

In [4]:
smiles = model.generate_molecule()
result = protein.dock_smiles_to_protein(smiles)
result

{'intramolecular_energy': -0.9224400000000001,
 'docking_score': -8.98301,
 'gauss(o=0__w=0.8__c=8)': 171.069404,
 'repulsion(o=0__c=8)': 1.972782,
 'hydrophobic(g=0__b=2.5__c=8)': 118.68772999999999,
 'non_dir_h_bond(g=-0.6__b=0__c=8)': 0.543306,
 'num_tors_div': 0.0}

The method returns a dictionary with the docking results from SMINA. `docking_score` is the most important value, but the dictionary also contains the individual components of the docking score.
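SMINA scores are energy-like, so a lower (more negative) `docking_score` indicates a better predicted pose. The sketch below separates the overall score from its components and picks the best of several results; `split_result` and `best_result` are hypothetical helpers written for this tutorial, and the second score is an illustrative value rather than a real docking run.

```python
# Result dictionary copied from the docking output shown above.
result = {
    "intramolecular_energy": -0.9224400000000001,
    "docking_score": -8.98301,
    "gauss(o=0__w=0.8__c=8)": 171.069404,
    "repulsion(o=0__c=8)": 1.972782,
    "hydrophobic(g=0__b=2.5__c=8)": 118.68772999999999,
    "non_dir_h_bond(g=-0.6__b=0__c=8)": 0.543306,
    "num_tors_div": 0.0,
}


def split_result(result):
    """Separate the overall score from the remaining entries."""
    components = {k: v for k, v in result.items() if k != "docking_score"}
    return result["docking_score"], components


def best_result(results):
    """Return the result with the lowest (most negative) docking score."""
    return min(results, key=lambda r: r["docking_score"])


score, components = split_result(result)
other = {"docking_score": -7.5}  # illustrative value, not a real docking run
best = best_result([result, other])
```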

## Storing results

To store docking results, you can use the `OptimizedMolecules` and `OptimizedMolecules.Builder` classes. `OptimizedMolecules` is a simple wrapper around a pandas `DataFrame` that provides methods for building the results `DataFrame` and calculating metrics. To create an `OptimizedMolecules` object, use the `OptimizedMolecules.Builder` class. To learn more about both classes, see the [Optimized Molecules notebook](optimized-molecules.ipynb).

In [5]:
from docking_benchmark.data.results import OptimizedMolecules

results_builder = OptimizedMolecules.Builder()

for _ in range(5):
    smiles = model.generate_molecule()
    docking_result = protein.dock_smiles_to_protein(smiles)
    results_builder.append(smiles, **docking_result)

The code above appends the calculated docking results to the builder. All values returned by SMINA are passed to the builder through dictionary unpacking (`**docking_result`). Once generation is finished, call the `build()` method to create the `OptimizedMolecules` object.
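A generator like the toy one above can return the same SMILES more than once, and each docking call takes time. The following sketch docks each distinct SMILES only once and reuses the stored result for repeats. It assumes that reusing a score for a repeated molecule is acceptable for your evaluation, which may not hold if you want independent docking runs. `dock_with_cache` and `fake_dock` are hypothetical names, and `fake_dock` only stands in for `protein.dock_smiles_to_protein`.

```python
def dock_with_cache(generate, dock, n_samples):
    """Call generate() n_samples times, docking each distinct SMILES once.

    Returns a list of (smiles, result) pairs in generation order.
    """
    cache = {}
    pairs = []
    for _ in range(n_samples):
        smiles = generate()
        if smiles not in cache:
            cache[smiles] = dock(smiles)  # expensive call happens only here
        pairs.append((smiles, cache[smiles]))
    return pairs


def fake_dock(smiles):
    """Stand-in for a real docking call; the score is meaningless."""
    return {"docking_score": -float(len(smiles))}


stream = iter(["CCO", "CCO", "c1ccccc1"])
pairs = dock_with_cache(lambda: next(stream), fake_dock, 3)
```

With the objects from this tutorial, the same loop would be driven by `dock_with_cache(model.generate_molecule, protein.dock_smiles_to_protein, 5)`, followed by `results_builder.append(smiles, **docking_result)` for each returned pair.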

In [6]:
results = results_builder.build()

## Interpreting the results

Raw results may not be informative enough to assess performance. That's why `OptimizedMolecules` provides methods for manipulating the results. For example, you may want to calculate the internal diversity of the generated molecules

In [7]:
results.internal_diversity()

0.8517809216212529

or retrieve the three molecules with the best docking score

In [8]:
results.get_first_n(3, by_column='docking_score')

| SMILES | `intramolecular_energy` | `docking_score` | `gauss(o=0__w=0.8__c=8)` | `repulsion(o=0__c=8)` | `hydrophobic(g=0__b=2.5__c=8)` | `non_dir_h_bond(g=-0.6__b=0__c=8)` | `num_tors_div` |
|---|---|---|---|---|---|---|---|
| `c1cc(I)ccc1CCC[N@H+](CC2)CC[C@@H]2COc(c(c3c45)cccn3)nc4c(F)ccc5` | -1.78302 | -11.50125 | 180.496812 | 1.989348 | 182.214124 | 0.338416 | 0.0 |
| `c1cc(F)ccc1C(=O)NCCCC[N@@H+](CCC)[C@H](C2)COc(c23)cccc3OC` | -1.104402 | -9.545828 | 169.344056 | 1.30336 | 129.408504 | 0.579836 | 0.0 |
| `c1cc(F)ccc1C(=O)Nc(cc2)cc(c23)n(nc3)[C@H]4CC[N@H+](C)CC4` | -0.647638 | -8.385558 | 135.181176 | 1.28366 | 104.082174 | 0.315988 | 0.0 |
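For intuition about the internal diversity value computed earlier, one common definition is one minus the mean pairwise Tanimoto similarity between molecular fingerprints. The sketch below implements that definition on plain Python sets. It is not confirmed to be what `internal_diversity()` computes, and the character-bigram "fingerprint" is a crude stand-in with no chemical meaning; a real implementation would use proper molecular fingerprints.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def internal_diversity(feature_sets):
    """One minus the mean pairwise similarity; needs at least two items."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


def bigrams(smiles):
    """Toy 'fingerprint': the set of adjacent character pairs in the string."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


diversity = internal_diversity([bigrams(s) for s in ["CCOC", "c1ccccc1", "CC(=O)O"]])
```

Under this definition, a set of identical molecules scores 0 and a set with no shared features scores 1, so higher values mean a more varied set.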


You can also access the underlying `DataFrame` through the `molecules` property of the class. This lets you calculate, for example, the average docking score and the average value of each component.

In [9]:
results.molecules.mean()

intramolecular_energy                -1.178353
docking_score                        -9.810879
gauss(o=0__w=0.8__c=8)              161.674015
repulsion(o=0__c=8)                   1.525456
hydrophobic(g=0__b=2.5__c=8)        138.568267
non_dir_h_bond(g=-0.6__b=0__c=8)      0.411413
num_tors_div                          0.000000
dtype: float64

More methods for interpreting results are available. `OptimizedMolecules` can also be exported to a .csv file. You may also use binary serialization, which is useful if you want to load the results into a Jupyter notebook later for further analysis. [See Optimized Molecules notebook for more details](optimized-molecules.ipynb).

## Summary

You should now know how to use the package to assess your model's performance. The basic workflow is to generate molecules, dock them, and use `OptimizedMolecules.Builder` to build an `OptimizedMolecules` object. You can then interpret the results either with the built-in `OptimizedMolecules` methods or by working directly with the underlying pandas `DataFrame`.