# Docking and Scoring
## Intro

This notebook shows how to dock and score molecules using the `asapdiscovery-docking` module.

The docking pipeline is aimed primarily at structure-enabled drug discovery programs, in which crystal structures of early molecules are available for *reference-based* docking.

To this end, we have implemented an API that wraps the OpenEye POSIT docking algorithm, which enables reference-based docking through its use of the HYBRID and SHAPEFIT algorithms.

### The scope of this guide

This guide shows how to dock and score molecules. For the essential precursor step of loading and preparing data, please see [protein_and_ligand_prep](%protein_and_ligand_prep.ipynb).

## Data

We will use files from the package's test resources, since these molecules have already been prepped for docking.

In [6]:
from asapdiscovery.data.testing.test_resources import fetch_test_file
from asapdiscovery.data.schema.complex import Complex, PreppedComplex
from asapdiscovery.data.schema.ligand import Ligand

# Load a receptor that has already been prepped for docking
prepped_complex = PreppedComplex.from_oedu_file(
    fetch_test_file("Mpro-P2660_0A_bound-prepped_receptor.oedu"),
    ligand_kwargs={"compound_name": "test"},
    target_kwargs={"target_name": "test", "target_hash": "mock_hash"},
)
# Load the ligand we want to dock
ligand = Ligand.from_sdf(
    fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
)

## Docking with the POSITDocker

Docking exposes many options, which we will not enumerate here. To get a flavor of what is available, we can examine the fields of the `POSITDocker`:

In [7]:
from asapdiscovery.docking.openeye import POSITDocker

In [8]:
docker = POSITDocker()

In [9]:
docker.dict()

{'type': 'POSITDocker',
 'relax': <POSIT_RELAX_MODE.NONE: 0>,
 'posit_method': <POSIT_METHOD.ALL: 15>,
 'use_omega': True,
 'omega_dense': False,
 'num_poses': 1,
 'allow_low_posit_prob': False,
 'low_posit_prob_thresh': 0.1,
 'allow_final_clash': False,
 'allow_retries': True}
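
To see at a glance which options a customised docker changes, the dictionary above can be compared against another configuration. The sketch below is plain Python on hand-copied dictionaries and does not call into asapdiscovery: the enum values are written as strings for illustration, and `changed_options` is a hypothetical helper name.

```python
# Defaults copied by hand from the docker.dict() output above.
# Assumption: enum members are written as plain strings for illustration.
DEFAULTS = {
    "type": "POSITDocker",
    "relax": "POSIT_RELAX_MODE.NONE",
    "posit_method": "POSIT_METHOD.ALL",
    "use_omega": True,
    "omega_dense": False,
    "num_poses": 1,
    "allow_low_posit_prob": False,
    "low_posit_prob_thresh": 0.1,
    "allow_final_clash": False,
    "allow_retries": True,
}


def changed_options(config: dict, defaults: dict = DEFAULTS) -> dict:
    """Return {option: (default, new_value)} for every option that differs."""
    return {
        key: (defaults[key], value)
        for key, value in config.items()
        if key in defaults and defaults[key] != value
    }


# Example: a configuration that overrides num_poses and omega_dense.
custom = {**DEFAULTS, "num_poses": 50, "omega_dense": True}
print(changed_options(custom))
```

With a real docker, the same comparison could be run on `POSITDocker().dict()` and the `.dict()` of a customised instance.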

`POSITDocker.dock()` takes:
1) a list of DockingInputBase objects
2) an output directory (omitted in the examples below)
3) some dask options (also omitted in the examples below)

Currently, two kinds of DockingInputBase objects are implemented:
1) a complex-ligand pair (DockingInputPair)
2) a one-to-many ligand:complexes object (DockingInputMultiStructure)

### Running simple docking 

In [11]:
from asapdiscovery.docking.docking import DockingInputPair
input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)

In [12]:
docker = POSITDocker() # let's just use defaults for now

In [13]:
results = docker.dock([input_pair]) # we won't use dask or write an output, takes ~30 s on a Macbook Pro

This returns a list of POSITDockingResults objects!

In [14]:
result = results[0]

In [15]:
result.write_docking_files("docking_test")

## Scoring

We decouple the *pose prediction* and *scoring* parts of docking, so that different scoring approaches can be tried on the same docked poses.
To this end, we have written a few "scorer" classes, including:
1. A traditional docking scorer: ChemGauss4Scorer
2. A scorer that tries to capture information about the potential for the binding site to evolve: FINTScorer
3. A few ML scorers: GATScorer, E3NNScorer, SchnetScorer
4. A MetaScorer, which can run all of the others

In [11]:
from asapdiscovery.docking.scorer import ChemGauss4Scorer, FINTScorer, MetaScorer

### Targets

Several of our scorers require target-specific information. We can find out the targets that the repo "knows about" like so:

In [17]:
from asapdiscovery.data.services.postera.manifold_data_validation import TargetTags

In [18]:
TargetTags.get_values()

['EV-D68-Capsid',
 'SARS-CoV-2-N-protein',
 'EV-A71-Capsid',
 'DENV-NS2B-NS3pro',
 'MERS-CoV-Mpro',
 'SARS-CoV-2-Mac1',
 'EV-A71-3Cpro',
 'EV-D68-3Cpro',
 'SARS-CoV-2-Mpro',
 'ZIKV-NS2B-NS3pro']

Since we're working with a known target, we can set it as a variable and use it throughout.

In [19]:
target = TargetTags("SARS-CoV-2-Mpro")

### ChemGauss4 Scorer

In [20]:
chemgauss_scorer = ChemGauss4Scorer()

In [21]:
scores = chemgauss_scorer.score(results)
scores

[Score(score_type=<ScoreType.chemgauss4: 'chemgauss4'>, score=-11.651097297668457, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.arbitrary: 'arbitrary'>)]

We can see that this returns a list of `Score` objects. If we want a dataframe instead, we can pass `return_df=True`:

In [22]:
scores_df = chemgauss_scorer.score(results, return_df=True)
scores_df

| | score_type | score | compound_name | smiles | ligand_identifiers | ligand_inchikey | target_name | target_identifiers | complex_ligand_smiles | probability | units |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chemgauss4 | -11.651097 | test | `c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)C...` | | SNQHLYWSRJDLGK-FQEVSTJZSA-N | test | | `CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c...` | 0.18 | ScoreUnits.arbitrary |


### FINTScore

For the FINT score, we need fitness data, which means we can only work on a target for which we have vendored fitness data. To check which targets those are, we can use:

In [23]:
from asapdiscovery.data.metadata.resources import targets_with_fitness_data
targets_with_fitness_data

[<TargetTags.SARS-CoV-2-Mpro: 'SARS-CoV-2-Mpro'>,
 <TargetTags.SARS-CoV-2-Mac1: 'SARS-CoV-2-Mac1'>,
 <TargetTags.SARS-CoV-2-N-protein: 'SARS-CoV-2-N-protein'>,
 <TargetTags.ZIKV-NS2B-NS3pro: 'ZIKV-NS2B-NS3pro'>]

In [24]:
fint_scorer = FINTScorer(target=target)

In [25]:
scores = fint_scorer.score(results)
scores

2024-04-25 14:27:29,906 [INFO] [plipcmd.py:124] plip.plipcmd: Protein-Ligand Interaction Profiler (PLIP) 2.3.0
2024-04-25 14:27:29,907 [INFO] [plipcmd.py:125] plip.plipcmd: brought to you by: PharmAI GmbH (2020-2021) - www.pharm.ai - hello@pharm.ai
2024-04-25 14:27:29,907 [INFO] [plipcmd.py:126] plip.plipcmd: please cite: Adasme,M. et al. PLIP 2021: expanding the scope of the protein-ligand interaction profiler to DNA and RNA. Nucl. Acids Res. (05 May 2021), gkab294. doi: 10.1093/nar/gkab294
2024-04-25 14:27:29,907 [INFO] [plipcmd.py:49] plip.plipcmd: starting analysis of tmp_complex.pdb
2024-04-25 14:27:30,313 [INFO] [plipcmd.py:165] plip.plipcmd: finished analysis, find the result files in /var/folders/cf/d42qwdmd5_g5s63r3g_vgx2c0000gn/T/tmpeg8i2kvy/


[Score(score_type=<ScoreType.FINT: 'FINT'>, score=1.0, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.arbitrary: 'arbitrary'>)]
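
Because FINT scoring needs fitness data, a quick membership check before constructing a `FINTScorer` can catch unsupported targets early. The sketch below works on target names copied by hand from the outputs in this guide. `supports_fint` is a hypothetical helper, and real code would presumably compare `TargetTags` members against `targets_with_fitness_data` rather than strings.

```python
# Target names copied from the TargetTags.get_values() output in this guide.
ALL_TARGETS = [
    "EV-D68-Capsid",
    "SARS-CoV-2-N-protein",
    "EV-A71-Capsid",
    "DENV-NS2B-NS3pro",
    "MERS-CoV-Mpro",
    "SARS-CoV-2-Mac1",
    "EV-A71-3Cpro",
    "EV-D68-3Cpro",
    "SARS-CoV-2-Mpro",
    "ZIKV-NS2B-NS3pro",
]

# Names copied from the targets_with_fitness_data output above.
FITNESS_TARGETS = [
    "SARS-CoV-2-Mpro",
    "SARS-CoV-2-Mac1",
    "SARS-CoV-2-N-protein",
    "ZIKV-NS2B-NS3pro",
]


def supports_fint(target: str) -> bool:
    """Hypothetical helper: True if fitness data is listed for this target."""
    return target in FITNESS_TARGETS


# Targets for which a FINT score would not be available.
no_fitness = [t for t in ALL_TARGETS if not supports_fint(t)]
print(no_fitness)
```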

### ML Scorers

Right now our ML scorers are target-specific, so we need to instantiate them with a target. This means we need to use the right `TargetTags` value.

In [26]:
from asapdiscovery.ml.models import ASAPMLModelRegistry

In [27]:
ASAPMLModelRegistry.get_implemented_model_types()

['schnet', 'e3nn', 'GAT']

In [28]:
from asapdiscovery.docking.scorer import MLModelScorer
ml_scorers = [MLModelScorer.from_latest_by_target_and_type(target, model_type) 
           for model_type in ASAPMLModelRegistry.get_implemented_model_types()]



In [29]:
schnet_scores = ml_scorers[0].score(results)  # index 0 is the 'schnet' model, per the list above

In [30]:
schnet_scores

[Score(score_type=<ScoreType.schnet: 'schnet'>, score=-1.627445936203003, compound_name='test', smiles='c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)Cl)O[C@H]4CC(=O)N4', ligand_identifiers=None, ligand_inchikey='SNQHLYWSRJDLGK-FQEVSTJZSA-N', target_name='test', target_identifiers=None, complex_ligand_smiles='CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c5cc(ccc5C1=O)Cl', probability=0.18000000715255737, units=<ScoreUnits.pIC50: 'pIC50'>)]
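
The scorers in `ml_scorers` follow the order returned by `get_implemented_model_types()`, so position 0 is the SchNet model here, as the `score_type` in the output above shows. A minimal sketch of keying the scorers by model type, so that lookups do not depend on list order, could look like this. The placeholder strings stand in for real scorer objects and are purely illustrative.

```python
# Model types in the order shown by get_implemented_model_types() above.
model_types = ["schnet", "e3nn", "GAT"]

# Placeholders for the scorer objects; in the notebook these would be the
# entries of `ml_scorers`, which were built in the same order as `model_types`.
placeholder_scorers = [f"<scorer for {model_type}>" for model_type in model_types]

# Key each scorer by its model type so lookups do not depend on list position.
scorers_by_type = dict(zip(model_types, placeholder_scorers))

print(scorers_by_type["GAT"])
```

In the notebook, `dict(zip(ASAPMLModelRegistry.get_implemented_model_types(), ml_scorers))` would build the same mapping from the real objects.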

### MetaScorer

We can use the MetaScorer to run all of the scorers for us and combine the results into a dataframe that is easy to save.

In [31]:
scorers = [chemgauss_scorer, fint_scorer, *ml_scorers]

In [32]:
metascorer = MetaScorer(scorers=scorers)

In [33]:
scores_df = metascorer.score(results, return_df=True)

2024-04-25 14:27:45,986 [INFO] [plipcmd.py:124] plip.plipcmd: Protein-Ligand Interaction Profiler (PLIP) 2.3.0
2024-04-25 14:27:45,986 [INFO] [plipcmd.py:125] plip.plipcmd: brought to you by: PharmAI GmbH (2020-2021) - www.pharm.ai - hello@pharm.ai
2024-04-25 14:27:45,986 [INFO] [plipcmd.py:126] plip.plipcmd: please cite: Adasme,M. et al. PLIP 2021: expanding the scope of the protein-ligand interaction profiler to DNA and RNA. Nucl. Acids Res. (05 May 2021), gkab294. doi: 10.1093/nar/gkab294
2024-04-25 14:27:45,986 [INFO] [plipcmd.py:49] plip.plipcmd: starting analysis of tmp_complex.pdb
2024-04-25 14:27:46,345 [INFO] [plipcmd.py:165] plip.plipcmd: finished analysis, find the result files in /var/folders/cf/d42qwdmd5_g5s63r3g_vgx2c0000gn/T/tmp_op47i4g/


In [34]:
scores_df

| score_type | target_identifiers | docking-structure-POSIT | ligand_id | ligand_identifiers | INCHIKEY | docking-confidence-POSIT | SMILES | complex_ligand_smiles | fitness-score-FINT | computed-GAT-pIC50 | docking-score-POSIT | computed-E3NN-pIC50 | computed-SchNet-pIC50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | test | test | | SNQHLYWSRJDLGK-FQEVSTJZSA-N | 0.18 | `c1ccc2c(c1)c(cc(=O)[nH]2)C(=O)NCCOc3cc(cc(c3)C...` | `CNC(=O)CN1C[C@]2(CCN(C2=O)c3cncc4c3cc(cc4)Cl)c...` | 1.0 | 4.912028 | -11.651097 | 3.713501 | -1.627446 |


Under the hood, this calls `asapdiscovery.docking.scorer.Score._combine_and_pivot_scores_df` to return the scores in a dataframe. As of version 0.4, that function uses `asapdiscovery.docking.scorer._SCORE_MANIFOLD_ALIAS` to rename the columns to the standard names that the ASAPDiscovery Consortium agreed upon. You can examine which column name corresponds to which score here:

In [35]:
from asapdiscovery.docking.scorer import _SCORE_MANIFOLD_ALIAS
_SCORE_MANIFOLD_ALIAS

{<ScoreType.chemgauss4: 'chemgauss4'>: 'docking-score-POSIT',
 <ScoreType.FINT: 'FINT'>: 'fitness-score-FINT',
 <ScoreType.GAT: 'GAT'>: 'computed-GAT-pIC50',
 <ScoreType.schnet: 'schnet'>: 'computed-SchNet-pIC50',
 <ScoreType.e3nn: 'e3nn'>: 'computed-E3NN-pIC50',
 <ScoreType.INVALID: 'INVALID'>: None,
 'target_name': 'docking-structure-POSIT',
 'compound_name': 'ligand_id',
 'smiles': 'SMILES',
 'ligand_inchikey': 'INCHIKEY',
 'probability': 'docking-confidence-POSIT'}
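
To make the renaming concrete, here is a minimal sketch of a pivot-and-rename step written with plain dictionaries. It is not the library's implementation: `pivot_and_rename` is a hypothetical function, the alias table is a hand-copied subset of the one above with `ScoreType` members written as strings, and the score values are rounded from the earlier outputs.

```python
from collections import defaultdict

# Subset of the alias table above, with ScoreType members written as strings.
ALIAS = {
    "chemgauss4": "docking-score-POSIT",
    "FINT": "fitness-score-FINT",
    "schnet": "computed-SchNet-pIC50",
    "compound_name": "ligand_id",
}

# Long-form records: one per (ligand, score type).
records = [
    {"compound_name": "test", "score_type": "chemgauss4", "score": -11.651097},
    {"compound_name": "test", "score_type": "FINT", "score": 1.0},
    {"compound_name": "test", "score_type": "schnet", "score": -1.627446},
]


def pivot_and_rename(records, alias):
    """Collapse long-form score records into one aliased row per ligand."""
    rows = defaultdict(dict)
    for record in records:
        row = rows[record["compound_name"]]
        row[alias["compound_name"]] = record["compound_name"]
        row[alias[record["score_type"]]] = record["score"]
    return list(rows.values())


print(pivot_and_rename(records, ALIAS))
```

Each score type becomes its own column under its alias, which is why the combined dataframe above has one row for the single docked ligand.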

## Advanced Topics: Selectors

The ASAPDiscovery Consortium regularly operates in a regime where we have many experimental structures to choose from as references for docking. To speed up the choice of structures, we have implemented a series of Selector objects that take a set of ligands and a set of complexes and choose which ligand-complex pairs to use for docking.

In [36]:
from asapdiscovery.data.operators.selectors.selector_list import StructureSelector

In [37]:
StructureSelector.get_values()

['MCSSelector',
 'PairwiseSelector',
 'LeaveOneOutSelector',
 'LeaveSimilarOutSelector',
 'SelfDockingSelector']

In [38]:
from asapdiscovery.data.services.fragalysis.fragalysis_reader import FragalysisFactory
all_mpro_fns = [
    "metadata.csv",
    "aligned/Mpro-x11041_0A/Mpro-x11041_0A_bound.pdb",
    "aligned/Mpro-x1425_0A/Mpro-x1425_0A_bound.pdb",
    "aligned/Mpro-x11894_0A/Mpro-x11894_0A_bound.pdb",
    "aligned/Mpro-x1002_0A/Mpro-x1002_0A_bound.pdb",
    "aligned/Mpro-x10155_0A/Mpro-x10155_0A_bound.pdb",
    "aligned/Mpro-x0354_0A/Mpro-x0354_0A_bound.pdb",
    "aligned/Mpro-x11271_0A/Mpro-x11271_0A_bound.pdb",
    "aligned/Mpro-x1101_1A/Mpro-x1101_1A_bound.pdb",
    "aligned/Mpro-x1187_0A/Mpro-x1187_0A_bound.pdb",
    "aligned/Mpro-x10338_0A/Mpro-x10338_0A_bound.pdb",
]
# Fetch every test file; metadata.csv comes first, so its parent is the directory to load
all_paths = [fetch_test_file(f"frag_factory_test/{fn}") for fn in all_mpro_fns]
parent_dir = all_paths[0].parent
ff = FragalysisFactory.from_dir(parent_dir)
complexes = ff.load()



In [39]:
ligands = [c.ligand for c in complexes]  # avoid shadowing the built-in `complex`

### What the selectors do

#### SelfDockingSelector

In [40]:
selector = StructureSelector('SelfDockingSelector').selector_cls()

In [41]:
pairs = selector.select(ligands, complexes)



In [42]:
len(pairs)

10

In [43]:
all(pair.complex.ligand == pair.ligand for pair in pairs)

True

#### PairwiseSelector

In [44]:
selector = StructureSelector('PairwiseSelector').selector_cls()

In [45]:
pairs = selector.select(ligands, complexes)

In [46]:
len(pairs)

100
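
The two counts seen so far (10 and 100) follow directly from the pairing rules. The sketch below reproduces them with toy stand-ins and `itertools.product`. It illustrates the selection logic only and is not the selectors' actual implementation. The ligand and complex names are made up.

```python
from itertools import product

# Toy stand-ins: ten ligand names, and ten complexes that each record
# the name of their own bound ligand.
ligands = [f"lig{i}" for i in range(10)]
complexes = [{"name": f"cplx{i}", "ligand": f"lig{i}"} for i in range(10)]

# Self-docking: keep only pairs where the ligand is the complex's own ligand.
self_pairs = [
    (lig, c["name"]) for lig, c in product(ligands, complexes) if c["ligand"] == lig
]

# Pairwise: keep every ligand-complex combination.
all_pairs = [(lig, c["name"]) for lig, c in product(ligands, complexes)]

print(len(self_pairs), len(all_pairs))  # prints: 10 100
```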

#### MCSSelector

In [47]:
selector = StructureSelector('MCSSelector').selector_cls()

In [48]:
pairs = selector.select(ligands, complexes, n_select=5)



In [49]:
len(pairs)

50
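
The count of 50 is consistent with keeping the `n_select=5` best-matching complexes for each of the 10 ligands. A rough sketch of that "top n per ligand" idea follows. It rests on loud assumptions: `difflib.SequenceMatcher` on SMILES strings is only a crude stand-in for a real maximum common substructure comparison, `select_top_n` is a hypothetical function, and the SMILES are toy values.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude string similarity, standing in for a real MCS comparison."""
    return SequenceMatcher(None, a, b).ratio()


def select_top_n(ligands, complexes, n_select):
    """For each ligand, keep the n_select complexes with the most similar bound ligand."""
    pairs = []
    for lig in ligands:
        ranked = sorted(
            complexes, key=lambda c: similarity(lig, c["ligand"]), reverse=True
        )
        pairs.extend((lig, c["name"]) for c in ranked[:n_select])
    return pairs


ligands = ["CCO", "CCN", "c1ccccc1", "CCCC"]  # toy SMILES
complexes = [{"name": f"cplx{i}", "ligand": smi} for i, smi in enumerate(ligands)]

pairs = select_top_n(ligands, complexes, n_select=2)
print(len(pairs))  # 4 ligands x 2 complexes each = 8
```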

#### LeaveSimilarOutSelector

This one is useful for scientific benchmarking: it filters out pairs where the ligand is a stereoisomer, tautomer, protonation-state isomer, etc. of the complex ligand. It can take a while, though, because it has to do a `len(ligands) * len(complexes)` pairwise comparison of all those chemical possibilities.

In [50]:
selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()

In [51]:
pairs = selector.select(ligands, complexes)



In [52]:
len(pairs)

90
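
The count of 90 is what you get when 10 of the 100 combinations are dropped, which suggests that in this test set only each ligand's own complex was filtered out. The sketch below illustrates the filtering idea on toy data. It is an assumption-laden stand-in: `flat_key` is a hypothetical normalisation that only strips stereo markers from SMILES strings, whereas a real comparison would also need to handle tautomers and protonation states.

```python
from itertools import product


def flat_key(smiles: str) -> str:
    """Toy normalisation: drop stereo markers so stereoisomers compare equal."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


# Toy SMILES: the first two are stereoisomers of each other.
ligands = ["C[C@H](N)O", "C[C@@H](N)O", "CCO"]
complexes = [{"name": f"cplx{i}", "ligand": smi} for i, smi in enumerate(ligands)]

# Keep a pair only if the ligand differs from the complex ligand after normalisation.
pairs = [
    (lig, c["name"])
    for lig, c in product(ligands, complexes)
    if flat_key(lig) != flat_key(c["ligand"])
]

print(len(pairs))  # 9 combinations minus 5 "similar" ones = 4
```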

## Advanced Topics: Multi-Structure Docking


Some docking protocols (e.g., POSIT) will accept multiple receptor structures and choose for themselves which to dock to. For these docking protocols, we pass a different kind of input:

In [17]:
from asapdiscovery.docking.docking import DockingInputMultiStructure
from asapdiscovery.data.operators.selectors.selector_list import StructureSelector

In [18]:
cached_dus = {
    "Mpro-x1002": "du_cache/Mpro-x1002_0A_bound.oedu",
    "Mpro-x0354": "du_cache/Mpro-x0354_0A_bound.oedu",
}
prepped_complexes = [
    PreppedComplex.from_oedu_file(
        fetch_test_file(cached_du),
        ligand_kwargs={"compound_name": "test"},
        target_kwargs={"target_name": name, "target_hash": "mock_hash"},
    )
    for name, cached_du in cached_dus.items()
]
ligand = Ligand.from_sdf(
    fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
)

Suppose we had obtained this subset of ligand-complex pairs from the selector logic above. That would look something like this:

In [19]:
selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()
pairs = selector.select([ligand], prepped_complexes)
len(pairs)

2

We then collapse these pairs into a single MultiStructure set:

In [20]:
inputs = DockingInputMultiStructure.from_pairs(pairs) # Returns a list since multiple sets could be generated

If we already knew exactly what we wanted to do, we could just create the set directly:

In [21]:
alternate_inputs = DockingInputMultiStructure(ligand=ligand, complexes=prepped_complexes)

We can see that the two are equivalent in this case:

In [22]:
inputs[0] == alternate_inputs

True
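
One way to picture the collapse from pairs to multi-structure sets is as a group-by on the ligand. The sketch below uses hypothetical stand-in classes (`Pair`, `MultiStructure`) and a hypothetical `collapse_pairs` function. It assumes that grouping happens per ligand, which would explain why a list is returned, but it is not the library's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical stand-in for a ligand-complex docking pair."""
    ligand: str
    complex: str


@dataclass(frozen=True)
class MultiStructure:
    """Hypothetical stand-in for a one-ligand, many-complexes input."""
    ligand: str
    complexes: tuple


def collapse_pairs(pairs):
    """Group pairs by ligand, keeping complexes in first-seen order."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.ligand].append(pair.complex)
    return [MultiStructure(lig, tuple(cs)) for lig, cs in grouped.items()]


pairs = [Pair("test", "Mpro-x1002"), Pair("test", "Mpro-x0354")]
inputs = collapse_pairs(pairs)
print(inputs[0] == MultiStructure("test", ("Mpro-x1002", "Mpro-x0354")))  # prints: True
```

With pairs for several ligands, the same function would return one set per ligand.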

Now we run docking as before:

In [24]:
docker = POSITDocker() # let's just use defaults for now

In [25]:
results = docker.dock(inputs) # we won't use dask or write an output, takes ~3 minutes on a Macbook Pro

In [26]:
result = results[0]

In [27]:
result.write_docking_files("multi_structure_docking_test")

Since we passed in multiple structures, we don't know in advance which one was actually used. We can find this out by examining the results:

In [28]:
result.input_pair.complex.target.target_name

'Mpro-x0354'

## Advanced Topics: Multi-pose Docking

Note: this functionality was added recently. Please open an issue if you encounter problems :)

We'll use the same docking scheme as above.

In [63]:
from asapdiscovery.docking.docking import DockingInputPair
prepped_complex = PreppedComplex.from_oedu_file(
    fetch_test_file("Mpro-P2660_0A_bound-prepped_receptor.oedu"),
    ligand_kwargs={"compound_name": "test"},
    target_kwargs={"target_name": "test", "target_hash": "mock_hash"},
)
ligand = Ligand.from_sdf(
    fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
)
input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)

In [64]:
docker = POSITDocker(num_poses=50) # we set the number of poses when we create the docker

In [65]:
results = docker.dock([input_pair]) # we won't use dask or write an output, takes ~1 min on a Macbook Pro

NOTE: As of version 0.3.1, we can generate multi-pose docking results, but nothing distinguishes them from one another, and when you run `result.write_docking_files`, the results for each ligand will overwrite each other. This should be fixed by 0.4.0.

In [66]:
len(results)

26

In [67]:
print([result.probability for result in results])

[0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.23999999463558197, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.18000000715255737, 0.05000000074505806, 0.05000000074505806, 0.05000000074505806]
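
Many poses share the same probability, so a quick tally can be easier to read than the raw list. The sketch below uses only the standard library and a hand-built list with the values above rounded to two decimals. In the notebook, `[result.probability for result in results]` would supply the real values.

```python
from collections import Counter

# Probabilities from the output above, rounded to two decimals.
probabilities = [0.24] * 9 + [0.18] * 14 + [0.05] * 3

# Tally how many poses share each probability, highest probability first.
counts = Counter(probabilities)
for prob, n in sorted(counts.items(), reverse=True):
    print(f"{prob:.2f}: {n} poses")

# Index of the first pose with the highest probability.
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(best_index)
```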


## A few side notes: Dask and Target Specific Workflows

### Dask

We make heavy use of Dask throughout our code, which helps automate parallel processing and provides a nice dashboard for evaluating the progress of large scale docking efforts. Due to the way in which Dask automates error handling, this has occasionally led to situations where the behaviour of our code is different depending on whether you have enabled Dask. We have tried to stamp out any instances of this, but if you find another, please make an issue!
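
As a generic illustration of how error handling can differ between serial and parallel execution, the sketch below uses the standard library's `concurrent.futures` as a stand-in. It is not Dask and makes no claim about how Dask or asapdiscovery behave. `dock_one` is a hypothetical task that fails for one input.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_one(name: str) -> str:
    """Hypothetical task that fails for one input."""
    if name == "bad":
        raise ValueError(f"docking failed for {name}")
    return f"{name}: ok"


inputs = ["a", "bad", "b"]

# Serial: the first failure stops the loop, so "b" is never attempted.
serial_done = []
try:
    for name in inputs:
        serial_done.append(dock_one(name))
except ValueError:
    pass

# Parallel: every task is submitted; a failure only surfaces when its
# result is requested, so the remaining tasks still complete.
parallel_done, parallel_errors = [], []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(dock_one, name) for name in inputs]
    for future in futures:
        try:
            parallel_done.append(future.result())
        except ValueError as err:
            parallel_errors.append(str(err))

print(serial_done)
print(parallel_done, parallel_errors)
```

The same inputs give a different set of completed tasks depending on the execution mode, which is the kind of discrepancy worth reporting if you see it.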

### Target-specific workflows

We have wrapped our library code in the `asapdiscovery-workflows` module, which puts everything together in a command-line interface (CLI). Unfortunately, as of version 0.4, these workflows only work with the targets specified for ASAP. We plan to change this in version 0.5.

To find out which targets can be passed to these workflows, you can use this:

In [29]:
from asapdiscovery.data.services.postera.manifold_data_validation import TargetTags

In [30]:
TargetTags.get_values()

['DENV-NS2B-NS3pro',
 'SARS-CoV-2-N-protein',
 'SARS-CoV-2-Mpro',
 'EV-D68-Capsid',
 'ZIKV-NS2B-NS3pro',
 'SARS-CoV-2-Mac1',
 'MERS-CoV-Mpro',
 'EV-A71-3Cpro',
 'EV-D68-3Cpro',
 'EV-A71-Capsid']