SolTranNet Paper

The data sets and scripts used to generate the figures for SolTranNet.

SolTranNet is a fork of the Molecule Attention Transformer, whose implementation can be found here.

NOTICE

It was brought to our attention by Rajendra Joshi that the ESOL dataset provided in the Molecule Attention Transformer repository is mean-centered and normalized. We have updated the code in Making_Figures.ipynb to account for this. Additionally, we provide the unnormalized ESOL data in data/esol/delaney-processed.csv. Notably, since our implementation of SolTranNet is trained on unnormalized data, we had to adjust the model's predictions to compare with the original MAT implementation.

This changes the RMSE of the Deployed SolTranNet model in Table 2 (and in Table S6) from 2.99 (normalized) to 0.361 (unnormalized). It also changes the RMSE of the ESOL training, AqSol testing row of Table S6 from 3.33 (normalized) to 1.67 (unnormalized).
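To illustrate why the scale matters, here is a minimal sketch of converting between normalized and unnormalized values. It assumes the normalization is a z-score (subtract the mean, divide by the standard deviation); the notice does not spell out the exact transform. The function names and the toy numbers are hypothetical and are not taken from Making_Figures.ipynb or the ESOL data.

```python
import math
from statistics import mean, pstdev


def normalize(values, mu, sigma):
    """Map raw values onto the assumed z-score scale."""
    return [(v - mu) / sigma for v in values]


def unnormalize(values, mu, sigma):
    """Map z-scored values back onto the raw scale."""
    return [v * sigma + mu for v in values]


def rmse(pred, true):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Toy log-solubility labels and predictions (made-up numbers, NOT ESOL data).
true_raw = [-0.77, -3.30, -2.06, -7.87, -1.33]
pred_raw = [-1.00, -3.00, -2.50, -7.00, -1.50]

# Assumed normalization statistics, taken from the labels.
mu = mean(true_raw)
sigma = pstdev(true_raw)

true_norm = normalize(true_raw, mu, sigma)
pred_norm = normalize(pred_raw, mu, sigma)

rmse_raw = rmse(pred_raw, true_raw)      # both on the raw scale
rmse_norm = rmse(pred_norm, true_norm)   # both on the normalized scale
rmse_mixed = rmse(pred_raw, true_norm)   # scales mixed: not a meaningful number

print(f"raw: {rmse_raw:.3f}  normalized: {rmse_norm:.3f}  mixed: {rmse_mixed:.3f}")
```

Under this assumed transform, the raw-scale RMSE equals the normalized RMSE multiplied by the standard deviation, so an RMSE is only comparable between models when predictions and labels share one scale.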

Requirements

PyTorch 1.7 -- compiled with CUDA
pandas 0.25.3+
RDKit 2020.03.1dev1
CUDA 10.1 or 10.2
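
A quick way to compare an environment against this list is sketched below. The distribution names are assumptions: PyTorch is usually registered as "torch", and RDKit may appear as "rdkit" or under another name depending on how it was installed, so a missing entry does not necessarily mean the package is absent. The helper names are hypothetical.

```python
from importlib import metadata

# Assumed distribution names; adjust to match how the packages were installed.
PACKAGES = ("torch", "pandas", "rdkit")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not found}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available():
    """Return True only if PyTorch imports and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not found'}")
    print(f"CUDA available: {cuda_available()}")
```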

Recreating the Figures for the Paper

We have provided the predictions that we used to generate the figures in soltrannet_data.tar.gz

NOTE -- we provided 2 versions of each training data file -- 1 for 2D and 1 for 3D. They are identical in content except for the name. This is because during training, the first time a datafile is loaded, the resulting embeddings get saved, which will differ if a 2D or 3D conformer is used to generate the embedding.

tar -xzf soltrannet_data.tar.gz

After extracting, you can follow along with Making_Figures.ipynb for the code we used to generate the figures and tables in the paper.

Re-training Linear ML models

Install the qsar-tools repository from here
After qsar-tools is installed, use the trainlinearmodel.py script there to train a new version of the model.

NOTE -- we trained LASSO, elastic net, PLS, and ridge models and have provided the requisite fingerprints for each fold of our scaffold-CCV split of AqSolDB, the full AqSolDB, and our Independent Set.

python3 $QSARTOOLSDIR/trainlinearmodel.py -o data/training_data/linear_ml/full_aqsol_rdkit2048fp.model --lasso --maxiter 100000 data/training_data/linear_ml/full_aqsol_rdkit2048fp.gz

After that model has finished training, you can use it to produce a predictions file:

python3 $QSARTOOLSDIR/applylinearmodel.py data/training_data/linear_ml/full_aqsol_rdkit2048fp.model data/training_data/linear_ml/independent_sol_rdkit2048fp.gz > data/training_data/linear_ml/full_aqsol_lasso_rdkit2048fp_ind_test.predictions

Recreating our SolTranNet sweeps

Our sweeps assume that you have compiled PyTorch with CUDA enabled and have a GPU available.

We provide a way to generate jobs for a grid search with write_jobs.py. For the CCV splits, each fold would be run through the script as shown below.

NOTE -- The default arguments will recreate our architecture sweep. Run write_jobs.py with --help to see all available options. If you want a 3D version, run again but leave out the --twod option.

NOTE -- If you want to train a full model, use the --trainfile and --testfile options to write_jobs.py. This will change your output from training. Be careful not to overwrite!

python3 write_jobs.py --prefix data/training_data/aqsol_scaf_2d --fold 0 --twod --outname grid_search_2D_ccv0_training.cmds
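
For intuition, a job file for a grid search can be produced by taking the Cartesian product of hyperparameter values and emitting one train.py command per combination. The sketch below is not the contents of write_jobs.py: the function name build_commands and the grid values are hypothetical, and it only uses train.py flags that appear elsewhere in this README.

```python
import itertools
import shlex


def build_commands(prefix, fold, grid, twod=True):
    """Return one train.py command string per hyperparameter combination."""
    names = sorted(grid)  # fixed order so the output is reproducible
    commands = []
    for combo in itertools.product(*(grid[name] for name in names)):
        args = ["python3", "train.py", "--prefix", prefix, "--fold", str(fold)]
        for name, value in zip(names, combo):
            args += [f"--{name}", str(value)]
        if twod:
            args.append("--twod")
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical grid; the real sweep values are defined by write_jobs.py.
grid = {"heads": [8, 16], "dmodel": [512, 1024], "dropout": [0, 0.1]}
commands = build_commands("data/training_data/aqsol_scaf_2d", 0, grid)

for command in commands:
    print(command)
```

Each printed line is an independent training job, so the lines can be dispatched one per GPU or queue slot.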

NOTE -- if you have a Weights & Biases account, you can set up a sweep there and pass it as an argument to write_jobs.py to log the results there as well.

Each line of the output file from (1), the write_jobs.py step, trains one version of SolTranNet using train.py.

NOTE -- predict.py REQUIRES the model name of the saved model from (1) in order to function properly! DO NOT CHANGE THE NAMES OF THE MODELS

python3 train.py --help

As an example, we will assume the following command going forward.

python3 train.py --prefix data/training_data/aqsol_scaf_2d --fold 0 --datadir sweep --epochs 100 --lr 0.04 --loss huber --dropout 0 --ldist 0 --lattn 0.25 --Ndense 1 --heads 16 --dmodel 1024 --nstacklayers 16 --seed 420 --dynamic 0 --twod

After training is complete, you will need to run predict.py with the corresponding test set file and the trained model weights. The model weights are saved in the directory you specify with --datadir in (2), the train.py step.

To see all of the options available to you, run the following:

python3 predict.py --help

Continuing our example from earlier, there should be a sweep directory with some files in it as the result of training. The weights file is the .model file in said directory, with the following format:

{datadir}/{prefix}_{fold}_drop{dropout}_ldist{ldist}_lattn{lattn}_Ndense{Ndense}_heads{heads}_dmodel{dmodel}_nsl{nstacklayers}_epochs{epochs}_dyn{dynamic}_seed{seed}_trained.model
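
Because predict.py depends on this name, it can help to build the expected path programmatically rather than typing it by hand. The helper below is a hypothetical sketch, not part of the repository. It assumes that {prefix} in the file name is the final path component of the --prefix argument, which is inferred from the example paths in this README, and it takes every value as a string exactly as typed on the command line, because how train.py formats numbers (for example 0 versus 0.0) is not confirmed here.

```python
import os


def model_path(datadir, prefix, fold, dropout, ldist, lattn, ndense, heads,
               dmodel, nstacklayers, epochs, dynamic, seed):
    """Build the expected weights path from the train.py argument values."""
    # Assumption: only the last component of --prefix appears in the name.
    base = os.path.basename(prefix)
    name = (
        f"{base}_{fold}_drop{dropout}_ldist{ldist}_lattn{lattn}"
        f"_Ndense{ndense}_heads{heads}_dmodel{dmodel}_nsl{nstacklayers}"
        f"_epochs{epochs}_dyn{dynamic}_seed{seed}_trained.model"
    )
    return f"{datadir}/{name}"


# Values copied from the example train.py command in this README.
path = model_path(
    datadir="sweep", prefix="data/training_data/aqsol_scaf_2d", fold="0",
    dropout="0", ldist="0", lattn="0.25", ndense="1", heads="16",
    dmodel="1024", nstacklayers="16", epochs="100", dynamic="0", seed="420",
)
print(path)
```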

NOTE -- If your training job utilized the --twod option, you MUST pass it again here.

python3 predict.py -m sweep/aqsol_scaf_2d_0_drop0_ldist0_lattn0.25_Ndense1_heads16_dmodel1024_nsl16_epochs100_dyn0_seed420_trained.model -i data/training_data/aqsol_scaf_2d_test0.csv -o my_model_fold0.predictions --twod --stats

This will output a predictions file, as well as display the Pearson R-squared and RMSE of the predicted values.
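
For reference, the two statistics can be computed as sketched below using their standard definitions. This is not the code behind the --stats option, whose implementation may differ in detail; the function names and the toy values are hypothetical.

```python
import math


def rmse(pred, true):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson_r_squared(pred, true):
    """Square of the Pearson correlation coefficient."""
    n = len(true)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return (cov * cov) / (var_p * var_t)


# Toy values for illustration only.
true_vals = [1.0, 2.0, 3.0]
pred_vals = [1.0, 2.0, 4.0]

print(f"Pearson R^2: {pearson_r_squared(pred_vals, true_vals):.4f}")
print(f"RMSE: {rmse(pred_vals, true_vals):.4f}")
```

Note that Pearson R-squared measures linear correlation only: predictions that are offset or rescaled relative to the true values can still score close to 1 while having a large RMSE, which is why both numbers are reported.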
