# Selection of Top 10 Candidates

Having successfully performed the docking simulations, we can now select the top 10 candidates.

As discussed in `2_redocking`, Vina writes all poses for a ligand into a single `.pdbqt` file, with the highest-ranked pose written first.
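To make the file layout concrete, the sketch below pulls the score of the first pose out of PDBQT text using only the standard library. It assumes the usual Vina output layout, in which each pose sits in a `MODEL`/`ENDMDL` block and carries a `REMARK VINA RESULT:` line whose first number is the affinity in kcal/mol. The function `first_pose_score` and the sample text are hypothetical and are not part of the `pdbqt` helper module used in this notebook.

```python
def first_pose_score(pdbqt_text):
    """Return the affinity (kcal/mol) of the first pose, or None if absent."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # fields after the colon: affinity, RMSD lower bound, RMSD upper bound
            return float(line.split(":", 1)[1].split()[0])
    return None


# Hypothetical two-pose output, truncated to the lines that matter here.
sample = """MODEL 1
REMARK VINA RESULT:    -8.3      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:    -7.9      1.842      2.507
ENDMDL
"""

print(first_pose_score(sample))
```

Because the poses are written in rank order, stopping at the first matching line is enough to obtain the best score without parsing the rest of the file.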

First, we read the names of the `.pdbqt` files produced by the batched virtual screening procedure into a DataFrame. The file names follow the format `<mol_id>_out.pdbqt`.

In [1]:
import pandas as pd
from pathlib import Path

df = pd.DataFrame([str(i) for i in list(Path("./").rglob("*.pdbqt"))])
df.columns = ['file']

We again use the convenience function `get_first_mol_from_pdbqt` to read the first (top-ranked) pose in a `.pdbqt` file and return its SMILES string and docking score. We also define a function to extract the molecule ID from the file name,

In [2]:
import re
import sys
sys.path.append('..')
from pdbqt import get_first_mol_from_pdbqt

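def extract_id(file_name):
    # ids are just numbers; search only the file name so that digits in
    # parent directory names are not mistaken for the id
    finds = re.findall(r"([0-9]+)", Path(file_name).name)
    return finds[0]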

and apply this to the DataFrame.

In [3]:
df[['smiles', 'score']] = df['file'].apply(lambda x: pd.Series(get_first_mol_from_pdbqt(x, return_type='smiles')))
df['id'] = df['file'].apply(extract_id)  # molecule ID parsed from the file name

Picking the top 10 binders directly by docking score may give us molecules that are too chemically similar. Our goal is to suggest chemically diverse binders for the lead optimization step. To select chemically diverse compounds, we follow a procedure similar to the one used in the data preparation step.

Morgan fingerprints are calculated for the molecules, the pairwise Tanimoto similarity is computed, and Butina clustering is performed. In the data preparation step, the centroid of each cluster (the molecule most similar to all other molecules in the cluster) was selected. Here, we instead select the molecule with the best (lowest) docking score in each cluster as its representative.
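As a rough illustration of the distance input that Butina clustering consumes, the sketch below computes Tanimoto similarity on toy fingerprints represented as sets of on-bit indices, then flattens the pairwise distances in lower-triangle order (row 1, then row 2, and so on). The helper names `tanimoto` and `condensed_distances` and the toy fingerprints are hypothetical; the notebook itself uses RDKit's `BulkTanimotoSimilarity` on real Morgan fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def condensed_distances(fps):
    """Distances 1 - similarity for every pair (i, j) with j < i, row by row."""
    dists = []
    for i in range(1, len(fps)):
        dists.extend(1 - tanimoto(fps[i], fps[j]) for j in range(i))
    return dists


# Three toy fingerprints: the first two share most bits, the third shares none.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
print(condensed_distances(fps))
```

In this toy set the first two fingerprints share three of five distinct bits, giving a distance of 0.4, while the third has no bits in common with either and sits at the maximum distance of 1.0 from both.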

In [4]:
from rdkit import Chem  # needed for Chem.MolFromSmiles below
from rdkit.Chem import AllChem
morgan = AllChem.GetMorganGenerator(radius=2, fpSize=512)

def morgan_fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return morgan.GetFingerprint(mol)

df['morgan'] = df['smiles'].apply(morgan_fp)

In [5]:
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina

def get_cluster_representative(df, cutoff):
    fps = list(df['morgan'])  # plain list, so the indexing below is positional
    n_fps = len(fps)

    # condensed lower-triangle distance matrix (1 - Tanimoto similarity)
    dists = []
    for i in range(1, n_fps):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend([1 - x for x in sims])

    clusters = Butina.ClusterData(data=dists, nPts=n_fps, distThresh=cutoff, isDistData=True)

    # choose one representative per cluster: the molecule with the best
    # (lowest) docking score
    selected_idx = []
    for cl in clusters:
        # cl holds row positions; idxmin returns the matching index label,
        # which also covers single-member clusters
        best = df.iloc[list(cl)]['score'].idxmin()
        selected_idx.append(best)

    return df.loc[selected_idx]

Finally, the cluster representatives are sorted by docking score, and the 10 with the best (lowest) scores are selected.

In [6]:
leads = get_cluster_representative(df, 0.4).sort_values(by='score', ascending=True).head(10)
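The selection logic can also be followed on toy data without RDKit or pandas. The sketch below assumes that clusters arrive as tuples of positional indices (the shape `Butina.ClusterData` returns) and that lower scores are better, as with Vina affinities. The function `pick_leads` and the scores are made up for illustration.

```python
def pick_leads(clusters, scores, n):
    """Best-scoring member of each cluster, then the n best of those."""
    # lowest (most negative) score wins within each cluster
    representatives = [min(cl, key=lambda i: scores[i]) for cl in clusters]
    # rank the representatives themselves and keep the top n
    return sorted(representatives, key=lambda i: scores[i])[:n]


scores = [-7.1, -9.4, -8.0, -6.5, -8.8]   # hypothetical docking scores
clusters = [(0, 1, 2), (3,), (4,)]        # hypothetical clusters of positions
print(pick_leads(clusters, scores, n=2))
```

Molecule 2 scores better than molecule 4 would need to in order to be dropped, yet it is still excluded here: it shares a cluster with the better-scoring molecule 1, which is the intended trade of raw score for chemical diversity.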

In [7]:
leads[['smiles', 'score']].to_csv('top10.csv', index=False)