Ensemble Optimizer (EnOpt) is a fast, accessible tool that streamlines ensemble-docking and consensus-score analysis. EnOpt takes as input a matrix of docking scores from an ensemble virtual screen, organized as compounds (rows) X protein conformations (columns). It uses simple, interpretable machine learning to identify the most predictive subensembles and to compute an ensemble composite score.
Before using EnOpt, ensure that you have installed a Python environment with all necessary packages (e.g., NumPy, Pandas, and SciPy). We have provided a conda specification file to make it easier to set up such an environment:
conda create --name [environment name] --file conda_spec_file.txt
To print a guide with all standard options and their usage:
python ensemble_optimizer.py --help
An example of the simplest use of EnOpt:
python ensemble_optimizer.py -f [input file matrix]
The input CSV file containing the ensemble docking score matrix (required):
-f INPUT_FILE
A file containing the names of known ligands, separated by commas:
-l KNOWN_LIGS, --knownLigs KNOWN_LIGS
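As a rough illustration of the two input files described above, the following sketch writes a small score matrix and a known-ligand list using only the Python standard library. The compound and conformation names, the file names, the header row, and the use of the first column for compound names are all assumptions made for this example; check them against your own data and the EnOpt help output before relying on them.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical docking scores (more negative = stronger binding).
conformations = ["conf_1", "conf_2", "conf_3"]
scores = {
    "ligand_A": [-9.1, -7.4, -8.8],
    "ligand_B": [-8.3, -8.9, -7.0],
    "decoy_1": [-6.2, -5.9, -6.5],
    "decoy_2": [-5.1, -6.8, -5.7],
}
known_ligands = ["ligand_A", "ligand_B"]


def write_inputs(out_dir):
    """Write a score-matrix CSV and a comma-separated known-ligand file."""
    out_dir = Path(out_dir)
    matrix_path = out_dir / "scores.csv"
    ligands_path = out_dir / "known_ligands.txt"

    with matrix_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        # Assumed layout: a header row of conformation names,
        # then one row per compound (name first, scores after).
        writer.writerow(["compound"] + conformations)
        for name, row in scores.items():
            writer.writerow([name] + row)

    # Known-ligand names on a single line, separated by commas.
    ligands_path.write_text(",".join(known_ligands) + "\n")
    return matrix_path, ligands_path


matrix_path, ligands_path = write_inputs(tempfile.mkdtemp())
print(matrix_path.read_text())
print(ligands_path.read_text())
```

Files like these would then be passed with the -f and -l options, for example: python ensemble_optimizer.py -f scores.csv -l known_ligands.txt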
A JSON file containing all user-specified EnOpt input parameters, as an alternative to the command line input:
--json_input JSON_INPUT
The prefix of the output file:
--outFile OUT_FILE
The number of known ligands to include in interactive output:
--top_known_out TOP_KNOWN_OUT
The number of unknowns (compounds that are not known ligands) to include in interactive output:
--top_unknown_out TOP_UNKNOWN_OUT
The scoring scheme to use for combining scores across conformations:
--scoringScheme SCORING_SCHEME
(One of "eA", "eB", "rA", or "rB". "eA" uses the average score across all conformations in the ensemble. "eB" uses the best score across all conformations. "rA" uses the average of the score rank for each conformation. "rB" uses the best-ranked score across all conformations. Default: eA.)
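The four schemes can be sketched with NumPy as follows. This is an illustrative reading of the descriptions above, not EnOpt's own code: the function name composite_scores is hypothetical, "rB" is interpreted as the best (lowest) rank a compound achieves in any conformation, more negative scores are assumed to mean stronger binding, and ties are broken by row order for simplicity.

```python
import numpy as np


def composite_scores(scores, scheme="eA"):
    """Combine a (compounds x conformations) matrix into one value per compound.

    Assumes more negative scores mean stronger binding, so the "best" score
    is the minimum and rank 1 is the most negative score in a conformation.
    """
    scores = np.asarray(scores, dtype=float)
    # Rank compounds within each conformation (column); 1 = best.
    # A stable sort breaks ties by row order (a simplification).
    order = scores.argsort(axis=0, kind="stable")
    ranks = order.argsort(axis=0) + 1

    if scheme == "eA":  # average score across conformations
        return scores.mean(axis=1)
    if scheme == "eB":  # best score across conformations
        return scores.min(axis=1)
    if scheme == "rA":  # average rank across conformations
        return ranks.mean(axis=1)
    if scheme == "rB":  # best rank across conformations
        return ranks.min(axis=1)
    raise ValueError(f"unknown scoring scheme: {scheme!r}")


# Three hypothetical compounds docked into three conformations.
example = np.array([
    [-9.0, -7.0, -8.0],
    [-8.0, -9.0, -7.0],
    [-6.0, -6.0, -6.0],
])
for scheme in ("eA", "eB", "rA", "rB"):
    print(scheme, composite_scores(example, scheme))
```

In this toy matrix the first two compounds tie under "eA", "eB", and "rB" but are separated by "rA", which shows how the choice of scheme can change the final ordering.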
Whether to compute weights optimized using tree models:
--weightedScore
(If known ligands are provided, EnOpt uses them for optimization. Otherwise, it falls back on score rankings, which is not recommended.)
Whether higher (more positive) scores describe stronger binding:
--invertScoreSign
(The sign convention depends on the docking program used; for example, smina uses more negative scores to represent stronger binding. Default: False, meaning that more negative scores represent stronger binding.)
Method to determine weighted scores:
--optimizationMethod OPT_METHOD
(One of "RF" (random forest) or "XGB" (gradient-boosted trees). Default: RF.)
Number of top conformations to include in the "best subensemble":
--topConformations TOPN_CONFS
(Default: 3)
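To give a feel for how a tree model might single out a "best subensemble", the sketch below trains a scikit-learn random forest to separate known ligands from other compounds and keeps the conformations with the highest feature importance. This is an assumption-laden stand-in: how EnOpt actually derives its weights is not described here, the helper top_conformations is hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_conformations(scores, is_ligand, names, top_n=3, seed=0):
    """Rank conformations by random-forest feature importance.

    Returns the names of the top_n most important conformations.
    Feature importance is used here as a proxy for a conformation's
    predictive value; EnOpt's own weighting may differ.
    """
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(scores, is_ligand)
    order = np.argsort(model.feature_importances_)[::-1]
    return [names[i] for i in order[:top_n]]


# Synthetic screen: 200 compounds x 5 conformations.
rng = np.random.default_rng(0)
scores = rng.normal(loc=-6.0, scale=1.0, size=(200, 5))

# Treat the first 40 compounds as known ligands.
is_ligand = np.zeros(200, dtype=bool)
is_ligand[:40] = True

# Make the third conformation strongly favor the known ligands.
scores[:40, 2] -= 3.0

names = [f"conf_{i + 1}" for i in range(5)]
best = top_conformations(scores, is_ligand, names)
print(best)
```

Because only the third conformation carries signal in this synthetic data, it should rank first; the remaining slots are filled by conformations whose importance reflects noise, which is a reminder that a top-N subensemble is only as informative as the known ligands behind it.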
Whether to perform hyperparameter optimization for tree models:
--hyperparam
(Default: False, meaning that default tree model parameters will be used.)
Optional JSON file containing user-provided parameters for optimization:
--tree_params TREE_PARAMS
(If not provided, default hyperparameter optimization options will be used.)
Find more tools for analysis of protein-ligand binding at https://durrantlab.pitt.edu/durrant-lab-software/.
For questions, suggestions, or problems with the tool, contact Roshni Bhatt at rob108@pitt.edu.
This work was supported by the National Institutes of Health (1R01GM132353-01) and the University of Pittsburgh's Center for Research Computing, RRID:SCR_022735 (supported by NSF award OAC-2117681). We would like to thank Yogindra Raghav for his contributions in generating initial proof-of-concept code. We also thank Darian Yang for assistance in collating and pruning ideas.