# Intro

This notebook will show you how to dock and score molecules using the asapdiscovery-docking module. 

This docking pipeline primarily focuses on the use-case for a structure-enabled drug discovery program, in which we have crystal structures of early molecules to use for *reference-based* docking. 

To this end, we have implemented an API that wraps the OpenEye POSIT docking algorithm, which, through its use of the HYBRID and SHAPEFIT algorithms, enables reference-based docking.

## The scope of this guide

This guide will show you how to dock and score molecules. For the *extremely* necessary precursor step of loading and prepping data, please see [protein_and_ligand_prep](%protein_and_ligand_prep.ipynb).

# Data

We will use files from our test suite, since these molecules have already been prepped for docking.

In [1]:
from asapdiscovery.data.testing.test_resources import fetch_test_file
from asapdiscovery.data.schema.complex import PreppedComplex
from asapdiscovery.data.schema.ligand import Ligand
prepped_complex = PreppedComplex.from_oedu_file(
    fetch_test_file("Mpro-P2660_0A_bound-prepped_receptor.oedu"),
    ligand_kwargs={"compound_name": "test"},
    target_kwargs={"target_name": "test", "target_hash": "mock_hash"},
)
ligand = Ligand.from_sdf(
    fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
)

# Docking

As with any scientific endeavour, it's important to consider why you want to run docking and what you expect to get out of this.

| Context           | Goal                                                                     | Considerations                                                                                                                                                                                 |
|-------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Hit-to-lead       | Reference-based docking of 100s-1000s of molecules                       | High-throughput, low false-positives                                                                                                                                                           |
| Lead-Optimization | Reference-based docking of 10s-100s of molecules                         | High accuracy, low false-positives. You probably know generally where the molecules should bind, and the maximum common substructures of the molecules should probably be very closely aligned |
| Research          | Generate protein-ligand complexes for downstream analyses or ML training | High-throughput; inaccurate poses can be tolerated                                                                                                                                             |


## Examining the POSITDocker

There are a *ton* of choices we can make for running docking, which will not be enumerated here. But in order to get a flavor for the options, we can examine the class attributes of the POSITDocker:

In [2]:
from asapdiscovery.docking.openeye import POSITDocker

In [3]:
docker = POSITDocker()

In [4]:
docker.dict()

{'type': 'POSITDocker',
 'relax': <POSIT_RELAX_MODE.NONE: 0>,
 'posit_method': <POSIT_METHOD.ALL: 15>,
 'use_omega': True,
 'omega_dense': False,
 'num_poses': 1,
 'allow_low_posit_prob': False,
 'low_posit_prob_thresh': 0.1,
 'allow_final_clash': False,
 'allow_retries': True}

We can also look at the `.dock` method to see which arguments it accepts:

In [5]:
docker.dock?

We can see that we need:
1) a list of DockingInputBase objects
2) an output directory
3) and some dask options

Currently, we have 2 kinds of DockingInputBase objects implemented:
1) a complex-ligand pair (DockingInputPair)
2) a one-to-many ligand:complexes object (DockingInputMultiStructure)
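
To make the difference between the two input shapes concrete, here is a minimal sketch that uses stand-in dataclasses rather than the real asapdiscovery classes. `FakePair`, `FakeMultiStructure`, and the string identifiers are hypothetical; only the one-to-one versus one-to-many shape is taken from the description above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FakePair:
    """Stand-in for a one-ligand, one-complex docking input (hypothetical)."""
    ligand: str
    complex: str


@dataclass(frozen=True)
class FakeMultiStructure:
    """Stand-in for a one-ligand, many-complexes docking input (hypothetical)."""
    ligand: str
    complexes: tuple


def all_pairs(ligands, complexes):
    """Cross every ligand with every complex: one input per combination."""
    return [FakePair(lig, cpx) for lig, cpx in product(ligands, complexes)]


def per_ligand(ligands, complexes):
    """Group all complexes under each ligand: one input per ligand."""
    return [FakeMultiStructure(lig, tuple(complexes)) for lig in ligands]


# Made-up identifiers standing in for Ligand and PreppedComplex objects.
ligands = ["lig_A", "lig_B"]
complexes = ["Mpro_x1", "Mpro_x2", "Mpro_x3"]

pairs = all_pairs(ligands, complexes)
multi = per_ligand(ligands, complexes)
print(len(pairs), len(multi))
```

With two ligands and three structures, the pairwise form yields one input per combination, while the multi-structure form yields one input per ligand.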

In [6]:
from asapdiscovery.docking.docking import DockingInputPair, DockingInputMultiStructure

## Running simple docking 

### First we generate docking input

In [7]:
input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)

In [8]:
docker = POSITDocker() # let's just use defaults for now

In [9]:
results = docker.dock([input_pair]) # we won't use dask or write an output, takes ~30 s on a MacBook Pro

This returns a list of POSITDockingResults objects!

In [10]:
result = results[0]

In [11]:
result.write_docking_files("docking_test")

# Scoring

We separate the *pose prediction* and *scoring* parts of docking, which lets us try out a number of different scoring approaches on the same poses.
To this end, we have written a few "scorer" classes, including:
1. A traditional docking scorer: ChemGauss4Scorer
2. A score which tries to capture information about the potential for the binding site to evolve: FINTScorer
3. A few ML scorers: GATScorer, E3NNScorer, SchnetScorer
4. And finally, a MetaScorer which can run all the other ones

In [12]:
from asapdiscovery.docking.scorer import ChemGauss4Scorer, FINTScorer, GATScorer, E3NNScorer, SchnetScorer, MetaScorer

## Targets

Several of our scorers require target-specific information. We can find out the targets that the repo "knows about" like so:

In [13]:
from asapdiscovery.data.services.postera.manifold_data_validation import TargetTags

In [14]:
TargetTags.get_values()

['DENV-NS2B-NS3pro',
 'EV-A71-Capsid',
 'EV-A71-3Cpro',
 'MERS-CoV-Mpro',
 'EV-D68-Capsid',
 'SARS-CoV-2-Mpro',
 'SARS-CoV-2-Mac1',
 'SARS-CoV-2-N-protein',
 'EV-D68-3Cpro',
 'ZIKV-NS2B-NS3pro']

Since we're working with a known target, we can store it in a variable and use it throughout.

In [15]:
target = TargetTags("SARS-CoV-2-Mpro")

## The ChemGauss4 scorer

In [16]:
chemgauss_scorer = ChemGauss4Scorer()

In [17]:
scores = chemgauss_scorer.score(results)
scores

[Score(score_type=<ScoreType.chemgauss4: 'chemgauss4'>, score=-11.651097297668457, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.arbitrary: 'arbitrary'>)]

We can see this returns a list of `Score` objects. If we want a dataframe instead, we can pass `return_df=True`:

In [18]:
scores_df = chemgauss_scorer.score(results, return_df=True)
scores_df

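|   | score_type | score | compound_name | smiles | ligand_identifiers | ligand_inchikey | target_name | target_identifiers | complex_ligand_smiles | probability | units |
|---|------------|-------|---------------|--------|--------------------|-----------------|-------------|--------------------|-----------------------|-------------|-------|
| 0 | chemgauss4 | -11.651097 | test | c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)C... |  | SNQHLYWSRJDLGK-FQEVSTJZSA-N | test |  | CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c... | 0.18 | ScoreUnits.arbitrary |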
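
Chemgauss4 scores are lower-is-better, so once many ligands have been scored, a natural next step is to sort ascending, optionally dropping poses with a low POSIT probability first. The sketch below uses plain dictionaries with made-up values in place of real `Score` objects; `rank_by_score` and `min_probability` are hypothetical names and not part of asapdiscovery.

```python
# Made-up rows that mimic the fields shown in the Score output above.
scores = [
    {"compound_name": "cmpd_1", "score": -11.65, "probability": 0.18},
    {"compound_name": "cmpd_2", "score": -8.20, "probability": 0.75},
    {"compound_name": "cmpd_3", "score": -13.02, "probability": 0.05},
]


def rank_by_score(rows, min_probability=0.0):
    """Sort rows from best (most negative) to worst Chemgauss4 score.

    Rows whose pose probability is below `min_probability` are dropped first.
    """
    kept = [row for row in rows if row["probability"] >= min_probability]
    return sorted(kept, key=lambda row: row["score"])


ranked = rank_by_score(scores)
ranked_confident = rank_by_score(scores, min_probability=0.1)

for row in ranked_confident:
    print(row["compound_name"], row["score"], row["probability"])
```

The cutoff of 0.1 here is only an illustration; the right value depends on how much you trust the reference structures for your series.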


## FINTScore

For the FINT score, we need fitness data, which means we can only work on a target for which we have vendored fitness data. To check which targets those are, we can use:

In [19]:
from asapdiscovery.data.metadata.resources import targets_with_fitness_data
targets_with_fitness_data

[<TargetTags.SARS-CoV-2-Mpro: 'SARS-CoV-2-Mpro'>,
 <TargetTags.SARS-CoV-2-Mac1: 'SARS-CoV-2-Mac1'>,
 <TargetTags.SARS-CoV-2-N-protein: 'SARS-CoV-2-N-protein'>,
 <TargetTags.ZIKV-NS2B-NS3pro: 'ZIKV-NS2B-NS3pro'>]

In [20]:
fint_scorer = FINTScorer(target=target)

In [21]:
scores = fint_scorer.score(results)
scores

2024-04-25 13:15:34,860 [INFO] [plipcmd.py:124] plip.plipcmd: Protein-Ligand Interaction Profiler (PLIP) 2.3.0
2024-04-25 13:15:34,860 [INFO] [plipcmd.py:125] plip.plipcmd: brought to you by: PharmAI GmbH (2020-2021) - www.pharm.ai - hello@pharm.ai
2024-04-25 13:15:34,860 [INFO] [plipcmd.py:126] plip.plipcmd: please cite: Adasme,M. et al. PLIP 2021: expanding the scope of the protein-ligand interaction profiler to DNA and RNA. Nucl. Acids Res. (05 May 2021), gkab294. doi: 10.1093/nar/gkab294
2024-04-25 13:15:34,860 [INFO] [plipcmd.py:49] plip.plipcmd: starting analysis of tmp_complex.pdb
2024-04-25 13:15:35,212 [INFO] [plipcmd.py:165] plip.plipcmd: finished analysis, find the result files in /var/folders/cf/d42qwdmd5_g5s63r3g_vgx2c0000gn/T/tmpu5zcqa_b/


[Score(score_type=<ScoreType.FINT: 'FINT'>, score=1.0, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.arbitrary: 'arbitrary'>)]

## ML Scorers

Right now our ML scorers are target-specific, so we need to instantiate them with the target specified. This means we need to use the right `TargetTags` value.

In [22]:
from asapdiscovery.ml.models import ASAPMLModelRegistry

In [23]:
ASAPMLModelRegistry.get_implemented_model_types()

['e3nn', 'schnet', 'GAT']

In [24]:
from asapdiscovery.docking.scorer import MLModelScorer
ml_scorers = [MLModelScorer.from_latest_by_target_and_type(target, model_type) 
           for model_type in ASAPMLModelRegistry.get_implemented_model_types()]
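
Because the list comprehension follows the order returned by `get_implemented_model_types()`, picking a scorer by position is easy to get wrong. A small sketch of keying the scorers by model type instead, with strings standing in for the real scorer objects (the `scorers_by_type` name is our own):

```python
# Order as shown in the output of get_implemented_model_types() above.
model_types = ["e3nn", "schnet", "GAT"]

# Strings stand in for scorer objects; the real list would hold MLModelScorer instances.
scorers = [f"<scorer for {model_type}>" for model_type in model_types]

# Pair each model type with its scorer so lookups no longer depend on list position.
scorers_by_type = dict(zip(model_types, scorers))

gat_scorer = scorers_by_type["GAT"]
print(gat_scorer)
```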



In [25]:
e3nn_scores = ml_scorers[0].score(results)  # index 0 is the e3nn model, following the order of the model types above

In [26]:
e3nn_scores

[Score(score_type=<ScoreType.e3nn: 'e3nn'>, score=3.7135009765625, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.pIC50: 'pIC50'>)]
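
The ML scores are reported in pIC50 units. Assuming the usual definition, pIC50 = -log10(IC50 in mol/L), a short conversion helps put the number in context. The helper names below are our own and not part of asapdiscovery.

```python
import math


def pic50_to_ic50_molar(pic50):
    """Convert pIC50 to IC50 in mol/L, assuming pIC50 = -log10(IC50 / M)."""
    return 10.0 ** (-pic50)


def ic50_molar_to_pic50(ic50_molar):
    """Convert IC50 in mol/L back to pIC50."""
    return -math.log10(ic50_molar)


pic50 = 3.7135009765625  # the predicted score shown above
ic50_um = pic50_to_ic50_molar(pic50) * 1e6  # mol/L -> micromolar
print(f"pIC50 {pic50:.2f} corresponds to an IC50 of about {ic50_um:.0f} uM")
```

A predicted pIC50 of about 3.7 therefore corresponds to an IC50 in the range of a couple of hundred micromolar, i.e. a weak predicted binder.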

## MetaScorer

We can use the MetaScorer to run all the scoring for us and combine everything into a dataframe we can save easily

In [27]:
scorers = [chemgauss_scorer, fint_scorer, *ml_scorers]

In [28]:
metascorer = MetaScorer(scorers=scorers)

In [29]:
scores_df = metascorer.score(results, return_df=True)

2024-04-25 13:15:51,214 [INFO] [plipcmd.py:124] plip.plipcmd: Protein-Ligand Interaction Profiler (PLIP) 2.3.0
2024-04-25 13:15:51,214 [INFO] [plipcmd.py:125] plip.plipcmd: brought to you by: PharmAI GmbH (2020-2021) - www.pharm.ai - hello@pharm.ai
2024-04-25 13:15:51,214 [INFO] [plipcmd.py:126] plip.plipcmd: please cite: Adasme,M. et al. PLIP 2021: expanding the scope of the protein-ligand interaction profiler to DNA and RNA. Nucl. Acids Res. (05 May 2021), gkab294. doi: 10.1093/nar/gkab294
2024-04-25 13:15:51,214 [INFO] [plipcmd.py:49] plip.plipcmd: starting analysis of tmp_complex.pdb
2024-04-25 13:15:51,547 [INFO] [plipcmd.py:165] plip.plipcmd: finished analysis, find the result files in /var/folders/cf/d42qwdmd5_g5s63r3g_vgx2c0000gn/T/tmp6tpvur6e/


In [30]:
scores_df

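| score_type | docking-confidence-POSIT | target_identifiers | SMILES | complex_ligand_smiles | docking-structure-POSIT | INCHIKEY | ligand_id | ligand_identifiers | fitness-score-FINT | computed-GAT-pIC50 | docking-score-POSIT | computed-E3NN-pIC50 | computed-SchNet-pIC50 |
|------------|--------------------------|--------------------|--------|-----------------------|-------------------------|----------|-----------|--------------------|--------------------|--------------------|---------------------|---------------------|-----------------------|
| 0 | 0.18 |  | c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)C... | CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c... | test | SNQHLYWSRJDLGK-FQEVSTJZSA-N | test |  | 1.0 | 4.912028 | -11.651097 | 3.713501 | -1.627446 |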


Under the hood, this calls `asapdiscovery.docking.scorer.Score._combine_and_pivot_scores_df` to return the scores as a dataframe. As of version 0.4, it uses `asapdiscovery.docking.scorer._SCORE_MANIFOLD_ALIAS` to rename the columns to the standard names that the ASAPDiscovery Consortium agreed upon. You can examine which column name corresponds to which score here:

In [31]:
from asapdiscovery.docking.scorer import _SCORE_MANIFOLD_ALIAS
_SCORE_MANIFOLD_ALIAS

{<ScoreType.chemgauss4: 'chemgauss4'>: 'docking-score-POSIT',
 <ScoreType.FINT: 'FINT'>: 'fitness-score-FINT',
 <ScoreType.GAT: 'GAT'>: 'computed-GAT-pIC50',
 <ScoreType.schnet: 'schnet'>: 'computed-SchNet-pIC50',
 <ScoreType.e3nn: 'e3nn'>: 'computed-E3NN-pIC50',
 <ScoreType.INVALID: 'INVALID'>: None,
 'target_name': 'docking-structure-POSIT',
 'compound_name': 'ligand_id',
 'smiles': 'SMILES',
 'ligand_inchikey': 'INCHIKEY',
 'probability': 'docking-confidence-POSIT'}
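
To illustrate the idea of combining, pivoting, and renaming, here is a plain-Python sketch that turns long-format score rows into one wide row per ligand. This is not the library's implementation: `pivot_scores` is a hypothetical helper, and `ALIAS` copies only a few entries from the mapping above, keyed by plain strings instead of `ScoreType` members.

```python
from collections import defaultdict

# A subset of the alias mapping shown above, with string keys for simplicity.
ALIAS = {
    "chemgauss4": "docking-score-POSIT",
    "FINT": "fitness-score-FINT",
    "e3nn": "computed-E3NN-pIC50",
}

# Long format: one row per (ligand, score type), values taken from the outputs above.
long_rows = [
    {"compound_name": "test", "score_type": "chemgauss4", "score": -11.651097},
    {"compound_name": "test", "score_type": "FINT", "score": 1.0},
    {"compound_name": "test", "score_type": "e3nn", "score": 3.713501},
]


def pivot_scores(rows, alias):
    """Collect all scores for a ligand into one dict, renaming score types via `alias`."""
    wide = defaultdict(dict)
    for row in rows:
        # Fall back to the raw score type if no alias is defined for it.
        column = alias.get(row["score_type"], row["score_type"])
        wide[row["compound_name"]][column] = row["score"]
    return [{"ligand_id": name, **columns} for name, columns in wide.items()]


wide_rows = pivot_scores(long_rows, ALIAS)
print(wide_rows)
```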

# A few side notes

## Dask

We make heavy use of Dask throughout our code, which helps automate parallel processing and provides a nice dashboard for monitoring the progress of large-scale docking efforts. Due to the way in which Dask automates error handling, this has occasionally led to situations where our code behaves differently depending on whether Dask is enabled. We have tried to stamp out any instances of this, but if you find another, please open an issue!

## Target-specific workflows

We have wrapped our library code in the `asapdiscovery-workflows` module, which puts everything together in a command-line interface (CLI). Unfortunately, as of version 0.4, these workflows only work with the targets specified for ASAP. We plan to change this in version 0.5.

To find out which targets can be passed to these workflows, you can use this:

In [32]:
from asapdiscovery.data.services.postera.manifold_data_validation import TargetTags

In [33]:
TargetTags.get_values()

['DENV-NS2B-NS3pro',
 'EV-A71-Capsid',
 'EV-A71-3Cpro',
 'MERS-CoV-Mpro',
 'EV-D68-Capsid',
 'SARS-CoV-2-Mpro',
 'SARS-CoV-2-Mac1',
 'SARS-CoV-2-N-protein',
 'EV-D68-3Cpro',
 'ZIKV-NS2B-NS3pro']