The pharmaceutical industry relies on quantitative structure-activity relationship (QSAR) models to predict a quantified biological response of a molecule from its descriptors, which are properties of the molecule under study. These descriptors vary in complexity, ranging from simple molecular weight measures to complex geometric features. Drug discovery is a time-consuming and expensive process. A major purpose of QSAR models is to accelerate the discovery of molecular drug candidates by reducing experimental work, and ultimately to bring a drug to market faster. Thanks to recent advances in machine learning and hardware capabilities, Deep Neural Networks (DNNs) are a promising tool for predicting biological activity, such as receptor binding or enzyme inhibition, from molecular descriptors.
This project implements a DNN based on the architecture and parameters described in the following paper:
Ma, J., Sheridan, R.P., Liaw, A., Dahl, G.E. and Svetnik, V., 2015. Deep neural nets as a method for quantitative structure–activity relationships. Journal of Chemical Information and Modeling, 55(2), pp.263-274.
The data used for training and evaluating the model can be downloaded from the paper's supplementary section.
Both the training and test data are structured in such a way that each row represents a molecule. There is a single column called "Act" that represents the biological activity that is to be predicted. The rest of the columns are molecular descriptors.
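To make the layout above concrete, the following is a minimal sketch of separating the "Act" target from the descriptor columns. It assumes the data is comma-separated with a header row; the sample rows, the descriptor names `D_1` and `D_2`, and the helper `split_features_and_target` are hypothetical and are not taken from this project's preprocessor.

```python
import csv
import io

# Hypothetical miniature dataset: one row per molecule, an "Act" column,
# and two made-up descriptor columns (D_1, D_2).
SAMPLE_CSV = """Act,D_1,D_2
4.5,0,3
6.1,2,0
5.0,1,1
"""


def split_features_and_target(csv_text, target="Act", id_columns=()):
    """Split CSV text into descriptor names, feature rows, and target values.

    `id_columns` is an assumption: it lets the caller skip any non-numeric
    identifier columns a real file might carry.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    descriptor_names = [
        name for name in reader.fieldnames
        if name != target and name not in id_columns
    ]
    features, targets = [], []
    for row in reader:
        targets.append(float(row[target]))
        features.append([float(row[name]) for name in descriptor_names])
    return descriptor_names, features, targets


descriptor_names, X, y = split_features_and_target(SAMPLE_CSV)
```

Here `X` holds one list of descriptor values per molecule and `y` holds the matching activities, which is the shape a regression model typically expects.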
- Docker
- Pipenv
Docker Installation Documentation
Pipenv Installation Documentation
- Install dependencies via Pipenv
- Build Docker image based on Dockerfile
make build
Specify the dataset of interest and its location. For example:
make preprocess DATASET=NK1 DATA=~/Documents/qsar/
Specify the dataset of interest and its location. Optionally, override the batch size and number of epochs set in the Makefile. For example:
make train DATASET=NK1 DATA=~/Documents/qsar/ BATCH_SIZE=64 EPOCHS=128
Specify the dataset of interest and its location. For example:
make evaluate DATASET=NK1 DATA=~/Documents/qsar/
The metric used to evaluate the model is R2, the squared Pearson correlation coefficient between predicted and observed activities. According to Ma et al., a model with an R2 as low as 0.30 is still useful: QSAR is used to prioritize a large number of molecular compounds, so the accuracy of the prediction for any single molecule is less important. The paper recommends setting the number of epochs as high as possible (within hardware limits) to increase R2. The trade-off is time and resources versus a higher R2.
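As an illustration of the metric, here is a minimal sketch that computes R2 as the squared Pearson correlation between observed and predicted activities, the definition used by Ma et al. The function name `squared_pearson` and the sample values are hypothetical; the project's own evaluation code may compute the metric differently.

```python
def squared_pearson(observed, predicted):
    """Return the squared Pearson correlation between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of squared deviations and of cross-deviations; the 1/n factors cancel.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("R2 is undefined when either sequence is constant")
    return (cov * cov) / (var_o * var_p)


# Made-up activities for four molecules.
r2 = squared_pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Because the correlation is squared, this value measures how well the predictions rank and track the observed activities rather than their absolute error, which fits the prioritization use case described above.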
Pytest
make test
Flake8 is the chosen linter
make lint
Thank you to Ma, J. et al. for the clear description of the DNN architecture and for the supplementary data.
- NVIDIA Docker image for GPU based Training
- Error handling if weights aren't available for a dataset
- Tests around the Preprocessor