This repo contains:
- Code to run the experiments in the dockstring paper.
- The numerical results used in the paper.
- Code to generate most figures / tables in the paper.
This code is not super polished, but should be sufficient to reproduce the results. In particular, we note that this code does not use the dockstring benchmark API which we added after running our experiments. We recommend you use this for new code.
Everything should run correctly if you follow the instructions below. Please contact us or raise a github issue if anything doesn't work for you.
There are 2 datasets used in the benchmarking:
- Our subset of ExCAPE, downloaded from figshare (around 0.2 Gb)
- The ZINC 20 dataset, downloaded from their website (around 60 Gb).
First, download the dockstring dataset (instructions here)
and place the contents in a folder called data
(optionally could be a simlink).
In particular, the following files should be present:
./data/dockstring-dataset.tsv
./data/cluster_split.tsv
Some of the dockstring tasks require logP
and QED
values.
To avoid recomputing these, the following commands can be run to make an augmented version of the dockstring dataset with these extra columns:
bash scripts/dockstring-add-extra-props.sh
Some of the scripts below use the file generated by this command, so if you want to reproduce the experiments you may need to run this command.
Unfortunately, the terms of use of ZINC prohibit us from redistributing "large subsets of ZINC", so at the moment we are unable to post a link to the dataset. You should be able to download molecules directly from ZINC and assemble them into a csv file like the following:
zinc_id,smiles
7,C=CCc1ccc(OCC(=O)N(CC)CC)c(OC)c1
10,C[C@@]1(c2ccccc2)OC(C(=O)O)=CC1=O
11,COc1cc(Cc2cnc(N)nc2N)cc(OC)c1N(C)C
12,O=C(C[S@@](=O)C(c1ccccc1)c1ccccc1)NO
17,CCC[S@](=O)c1ccc2[nH]/c(=N\C(=O)OC)[nH]c2c1
18,C=CCN1C(=O)[C@@H](CC(C)C)NC1=S
22,C=C(C)CNc1ccc([C@H](C)C(=O)O)cc1
23,C=CCc1ccccc1OC[C@H](O)CNC(C)C
24,O[C@@H]1C=C2CCN3Cc4cc5c(cc4[C@@H]([C@H]23)[C@H]1O)OCO5
[...]
Our file has the molecules in sorted order (by ZINC ID) and contains 997597004 lines.
Its md5sum
is e65b33ddf9e7178aba732e396e72fe54
.
Alternatively, feel free to contact us give you the dataset.
We know it's annoying to have data available upon request to the authors,
but our hands are somewhat tied on this issue...
This file should be accessible from data/zinc/zinc20-sorted.csv
in order for the scripts to work correctly
(using a symlink if needed).
Different parts of the code have different requirements. The basic requirements are:
- python 3.7+ (tested on 3.7)
- scipy
- numpy
- pandas
- rdkit
- scikit-learn
- joblib
For specific tasks:
- deepchem (for deepchem regression models)
- gpytorch and botorch (for Gaussian process regression and BO)
- matplotlib (for plotting)
- xgboost (for xgboost regression model)
- Ensure that the dockstring dataset is downloaded (see Data Preparation)
- Ensure your python environment is set up for the models you want to run.
- Run the scripts in
src/regression
. Bash scripts to supply the correct arguments to the python scripts are located inexperiments/regression
. For example, to run all experiments for the lasso method, runbash experiments/regression/lasso.sh
. This will produce json output files in the./results
directory with train and test set metrics. These scripts can be modified to dock different proteins, use a different train/test split, etc. There are similar scripts inexperiments/regression-toy
for QED and logP. - To reproduce the regression results table in the paper,
run the script
python scripts/results_regression.py
(various arguments can be provided to change the table formatting). Since we don't explicitly control random seeds the results probably won't match exactly but they should be very similar.
- Ensure that the dockstring ExCAPE dataset is downloaded (see Data Preparation)
- Ensure Python Environment is set up.
- Run whichever tasks you want. For example,
bash experiments/virtual_screening/training/ridge.sh
. The trained model weights will be dumped into theresults
folder.
- Ensure that the ZINC dataset is downloaded (see Data Preparation)
- Split up the ZINC dataset into chunks of size 1M by running:
bash experiments/virtual_screening/zinc-predictions/split_zinc.sh
- Get the model weights, either by training new models or by copying in the old weights from
official_results
. For example, you can runmkdir -p results && cp -r official_results/virtual-screening results/
- Get model predictions, using the
pred_csv
argument to feed in the file to be predicted. For example, to run predictions for the first split of ZINC, runpred_csv="data/zinc/split-size-1000000/zinc-split000000000.csv" bash experiments/virtual_screening/zinc-predictions/ridge.sh
(all on one line). - Find the top predictions and put them into 1 file. This can be done by running
bash experiments/virtual_screening/zinc-predictions/topn_predictions.sh
, which will find the 5000 lowest docking score and output them to a csv. - Find their actual docking scores using the
scripts/dock_tsv.py
script. To do this automatically, runbash experiments/virtual_screening/zinc-predictions/score_top_predictions.sh
. Alternetively, you can split the files usingscripts/split_smiles_csv.sh
and score the predictions in parallel. - To reproduce the results in the paper, run:
python scripts/results_virtual_screening.py
Note that one baseline (the nearest neighbour baseline) was added afterwards and is therefore a bit different.
Code for it is under: scripts/virtual_screening_nearest_neighbours
.
Unfortunately we do not have the a full pipeline of bash scripts to reproduce these results,
since they were done quickly.
- Ensure that the dockstring dataset is downloaded (see Data Preparation)
- Ensure your python environment is set up for the models you want to run.
- Run the scripts in
src/mol_opt
. Bash scripts to supply the correct arguments to the python scripts are located inexperiments/molopt
. For example, to run all experiments for the graph GA method, runbash experiments/molopt/graph_ga.sh
. This will produce json output files in the./results
directory with the new SMILES and their scores. Note that these scripts can take up to 100 hours to run! These scripts can be modified to test different objectives, use a different budget, or different dataset, etc. Here are some useful notes about the scripts:- you may want to only run a specific experiment, which can be done by supplying an experiment index.
For example,
expt_idx=12 bash experiments/molopt/graph_ga.sh
- you may also want to reduce the
num_cpus
argument if you don't have 8 CPUs (but there is little point to increasing it beyond 8 due to how VINA works). - Because the scripts take a long time to run, by default they leave extensive log files, and a much larger result file at the end. These have been omitted from the repo and can be deleted after the run has done.
- you may want to only run a specific experiment, which can be done by supplying an experiment index.
For example,
- To reproduce the results in the paper, run:
python scripts/results_molopt.py
This code is not actively developed, but we set up pre-commit to handle code formatting and some other checks. If you want to contribute to this repo (e.g. by opening a pull request), please install pre-commit and run it on your local version before committing anything:
conda install pre-commit # if not already installed
pre-commit install