
# HAG-Net

*(figure: HAG-Net structure)*

## Contents

- [Introduction](#introduction)
- [Dependency](#dependency)
- [Repo Structure](#repo-structure)
- [Data Preprocessing](#data-preprocessing)
- [Data Format](#data-format)
- [Training](#training)
  - [Class Weighting](#class-weighting)
  - [Sample Weighting](#sample-weighting)
  - [Early Stop](#early-stop)
  - [Resume Training](#resume-training)
  - [Multi-fold Cross Validation](#multi-fold-cross-validation)
  - [Hyperparameter Tuning](#hyperparameter-tuning)
  - [Comparative Re-run](#comparative-re-run)
- [Cite Us](#cite-us)

## Introduction

Code for "Enhance Information Propagation for Graph Neural Network by Heterogeneous Aggregations"; also a lite version of the package for ligand-based virtual screening.

## Dependency

| Package | Description |
| --- | --- |
| python | cpython, >= 3.6.0 |
| pytorch | >= 1.4.0 |
| pytorch_ext | install via `pip install git+https://github.com/david-leon/Pytorch_Ext.git` |
| numpy, scipy, matplotlib, psutil | |
| rdkit | required by `ligand_based_VS_data_preprocessing.py` and `data_loader.py`; install via `conda install -c rdkit rdkit` |
| torch-scatter | >= 2.0.4 |
| sklearn | required by `metrics.py` |
| optuna | required by `grid_search.py` |
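
Based on the table above, one possible environment setup is sketched below (versions unpinned for brevity; note that `torch-scatter` wheels must match your installed PyTorch/CUDA version):

```
conda install -c rdkit rdkit
pip install torch numpy scipy matplotlib psutil scikit-learn optuna torch-scatter
pip install git+https://github.com/david-leon/Pytorch_Ext.git
```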

## Repo Structure

| Folder/File | Description |
| --- | --- |
| `docs` | documentation |
| `ligand_based_VS_data_preprocessing.py` | data preprocessing |
| `ligand_based_VS_train.py` | training |
| `ligand_based_VS_model.py` | production model definitions |
| `config.py` | experiment configurations |
| `data_loader.py` | batch data loader |
| `graph_ops.py` | generic graph operations |
| `metrics.py` | performance metric definitions |
| `grid_search.py` | hyperparameter tuning via grid search |
| `util` | misc utility scripts |

## Data Preprocessing

For input data preprocessing, use `ligand_based_VS_data_preprocessing.py`. Check

```
python ligand_based_VS_data_preprocessing.py --help
```

for argument details.

| arg | Description |
| --- | --- |
| `-csv_file` | input csv file, comma-separated, UTF-8 encoded |
| `-smiles_col_name` | column name for SMILES, default = `cleaned_smiles` |
| `-label_col_name` | column name for labels, default = `label` |
| `-id_col_name` | column name for sample IDs, default = `ID` |
| `-sample_weight_col_name` | column name for sample weights, default = `sample weight` |
| `-task` | `classification` / `prediction`. If your dataset is for inference, set `-task` to `prediction` |
| `-feature_ver` | featurizer version, default = 2. 2 means graph output, 1 means kekulized SMILES output, and 0 means DeepChem's fingerprint |
| `-use_chirality` | whether to consider chirality, for feature ver 0 and scaffold analysis |
| `-addH` | whether to add hydrogen atoms back when canonicalizing molecules |
| `-partition_ratio` | fraction of samples used for testing/validation; any value <= 0 or >= 1 disables dataset partitioning, default = 0. Samples are always shuffled before partitioning |
| `-atom_dict_file` | the atom dict file used for training; it is also used here for abnormal sample checking |
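
For example, a typical preprocessing run on a hypothetical `my_data.csv`, producing graph features and holding out 10% of the samples, might look like:

```
python ligand_based_VS_data_preprocessing.py -csv_file my_data.csv -task classification -feature_ver 2 -partition_ratio 0.1
```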

During preprocessing, each input SMILES string is validated; if it is invalid, a runtime warning is raised. Two log files are saved alongside the preprocessing output file, named `<dataset>_preprocessing_stdout@<time_stamp>.txt` and `<dataset>_preprocessing_stderr@<time_stamp>.txt`. Check any warning/error messages in `<dataset>_preprocessing_stderr@<time_stamp>.txt` and other runtime info in `<dataset>_preprocessing_stdout@<time_stamp>.txt`.

## Data Format

Preprocessed data, as well as the trainset & testset used for training & evaluation, are saved into `.gpkl` files containing a dict with the following format:

| key | Required | Description |
| --- | --- | --- |
| `features` | Y | extracted features. When `-feature_ver` is 1, this field is the same as `SMILESs` below |
| `labels` | Y | sample ground-truth labels, = `None` if there is no ground truth |
| `SMILESs` | Y | original SMILES strings for each sample |
| `IDs` | Y | sample IDs, = `None` if there is no sample ID data. If `None`, an ID will be generated automatically for each sample from its index number (starting from 0) during training or evaluation |
| `sample_weights` | Y | sample weights, = `None` if there is no sample weight data |
| `class_distribution` | N | class distribution statistics if sample labels are given, else = `None` |
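
As a minimal sketch of consuming such a file, assuming `.gpkl` is a plain Python pickle of the dict above (the repo's actual serialization may differ, e.g. a compressed pickle):

```python
import pickle

# Load a preprocessed dataset (file name is hypothetical)
with open("trainset.gpkl", "rb") as f:
    data = pickle.load(f)

features       = data["features"]        # extracted features (graphs when -feature_ver = 2)
labels         = data["labels"]          # ground-truth labels, or None for inference data
smiles         = data["SMILESs"]         # original SMILES strings
ids            = data["IDs"]             # sample IDs, or None (auto-generated during training)
sample_weights = data["sample_weights"]  # per-sample weights, or None
print(f"{len(smiles)} samples loaded")
```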

## Training

For model training, use `ligand_based_VS_train.py`. Check

```
python ligand_based_VS_train.py --help
```

for argument details.

| arg | Description |
| --- | --- |
| `-device` | int, -1 means CPU, >= 0 means GPU |
| `-model_file` | file path of initial model weights, default = `None` |
| `-model_ver` | model version, default = `4v4` |
| `-class_num` | class number, default = 2 |
| `-optimizer` | `sgd`/`adam`/`adadelta`; default is an SGD optimizer with lr = 1e-2 and momentum = 0.1. You can change the learning rate via the `-lr` arg |
| `-trainset` | file path of the train set, saved in `.gpkl` format as in the data preprocessing step; refer to [Data Format](#data-format) for details |
| `-testset` | either 1) file path of the test set, saved in `.gpkl` format as in the data preprocessing step (refer to [Data Format](#data-format) for details), or 2) a float in range (0, 1.0), used as the partition ratio. Note: in multi-fold cross validation this input is ignored |
| `-batch_size`, `-batch_size_min` | the max & min sample number for a training batch. Set them to different values to enable variable training batch size; this helps mitigate the batch-size effect if your model's performance is correlated with the size of the training batch |
| `-batch_size_test` | sample number for a test batch; this size is fixed |
| `-ER_start_test` | float, ER threshold below which the periodic test will be executed, for the classification task |
| `-max_epoch_num` | max number of training epochs |
| `-save_model_per_epoch_num` | float in (0.0, 1.0], default = 1.0; a checkpoint model file is saved after training this portion of all training samples |
| `-test_model_per_epoch_num` | float in (0.0, 1.0], default = 1.0; one round of testing is executed after training this portion of all training samples |
| `-save_root_folder` | root folder for saving training results, default = `None`. For plain training, a `train_results` folder under the current directory is used; for multi-fold cross validation, a folder named `train_results\<m>_fold_<task>_model_<model_ver>@<time_stamp>` is used, in which `<m>` is the number of folds, `<task>` = `classification`, `<model_ver>` is the model version string, and `<time_stamp>` marks when the run started; for grid search, a folder named `train_results\grid_search_<model_ver>@<time_stamp>` is used |
| `-save_folder` | folder under `save_root_folder` in which the results of one `train()` run are actually saved, default = `None`; a folder named `[<prefix>]_ligand_VS_<task>_model_<model_ver>_pytorch@<time_stamp>` is used, in which `<prefix>` is specified by the `-prefix` arg below |
| `-train_loader_worker` | number of parallel loader workers for the train set, default = 1; increase this value if data loading takes more time than training one batch |
| `-test_loader_worker` | number of parallel loader workers for the test set, default = 1; increase this value if data loading takes more time than testing one batch |
| `-weighting_method` | method for weighting the different classes, default = 0; refer to [Class Weighting](#class-weighting) for details |
| `-prefix` | a convenience string to help you differentiate between runs, default = `None`, in which case the prefix `[model_<model_ver>]` is used. This prefix is placed at the beginning of each on-screen output line during training |
| `-config` | str, the `CONFIG` class name within `config.py` whose config set will be used for the training run. Default = `None`, in which case the `model_<model_ver>_config` set is used automatically |
| `-select_by_aupr` | true / false; if true, the best model is selected by the AuPR metric on the test set, otherwise by the ER metric |
| `-select_by_corr` | true / false; if true, the best model is selected by the CoRR metric on the test set, otherwise by the RMSE metric |
| `-verbose` | int, verbosity level; the higher the value, the more concise the on-screen output |
| `-alpha` | list of floats in range (0.0, 1.0], default = (0.01, 0.02, 0.05, 0.1), specifying the top-alpha portions of the test set at which the relative enrichment factor is calculated |
| `-alpha_topk` | list of ints, default = (100, 200, 500, 1000), specifying the top-k samples of the test set at which the relative enrichment factor is calculated |
| `-score_levels` | list of floats in range (0.0, 1.0), default = (0.98, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5); the relative enrichment factor is calculated for samples with predicted score above each level |
| `-early_stop` | default = 0; set it to a value > 0 to enable early stopping, refer to [Early Stop](#early-stop) for details |
| `-mfold` | an integer or the file path of an `mfold_indexs.gpkl`, default = `None`; refer to [Multi-fold Cross Validation](#multi-fold-cross-validation) for details |
| `-grid_search` | default = 0; set it to a value > 0 to enable hyperparameter tuning, refer to [Hyperparameter Tuning](#hyperparameter-tuning) for details |
| `-srcfolder` | folder path for re-runs, refer to [Comparative Re-run](#comparative-re-run) for details, default = `None` |
| `-lr` | learning rate for the optimizer |
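
For example, a plain training run on GPU 0 with a 20% test split might look like the following (file name and values are illustrative; all flags are from the table above):

```
python ligand_based_VS_train.py -device 0 -trainset trainset.gpkl -testset 0.2 -model_ver 4v4 -max_epoch_num 100 -early_stop 5 -prefix demo
```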

### Class Weighting

For classification task training, 3 different class weighting methods are supported (a minimal sketch follows this list):

1. method 0: the default method, uniform weighting; set input arg `-weighting_method 0`
2. method 1: class weight equals the inverse of the class distribution of the training samples; set input arg `-weighting_method 1`
3. method 2: class weight equals the inverse of the square root of the class distribution of the training samples; set input arg `-weighting_method 2`
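
The sketch below illustrates the three schemes, assuming `class_distribution` holds the fraction of training samples per class (illustrative, not the repo's exact code):

```python
import numpy as np

class_distribution = np.array([0.9, 0.1])  # hypothetical imbalanced 2-class train set

w0 = np.ones_like(class_distribution)      # method 0: uniform weighting
w1 = 1.0 / class_distribution              # method 1: inverse of class distribution
w2 = 1.0 / np.sqrt(class_distribution)     # method 2: inverse of sqrt of class distribution
```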

### Sample Weighting

- Note that, except for the loss metric, the metrics calculated on the train set do not take sample weights into account, and the test set does not support sample weights at all. For meaningful test metric results, group test samples with the same weight together and run tests separately on test sets with different weights.

### Early Stop

- Enable early stopping with input arg `-early_stop m`, in which `m` is the maximum number of consecutive test rounds without improvement in the selection metric; after that many rounds the training process is stopped (see the sketch below). Set `-early_stop 0` to disable early stopping.
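
A minimal sketch of this patience logic (illustrative, not the repo's exact code; assumes a lower metric such as ER is better):

```python
m = 5                                # hypothetical -early_stop value
best, stale_rounds = float("inf"), 0
for metric in test_metric_stream:    # hypothetical iterable, one value per periodic test round
    if metric < best:
        best, stale_rounds = metric, 0   # improvement: reset the counter
    else:
        stale_rounds += 1
        if stale_rounds >= m:            # m consecutive rounds without improvement
            break                        # stop training
```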

### Resume Training

- For plain training, there are two alternative ways to resume an interrupted training process:
  1. start a new training run with input arg `-model_file`; this effectively initializes the model with weights from the model file
  2. or, start a new training run with input arg `-save_folder`; the training will retrieve a) the last run epoch and b) the latest model file saved in `save_folder` (logs and model files will still be saved into this folder). In this case the input arg `-model_file` is ignored.

### Multi-fold Cross Validation

- Enable multi-fold cross validation with input arg `-mfold m`, in which `m` is the number of folds, or the path of a saved `mfold_indexs.gpkl` file if you want to reuse the same sample splitting indices across runs.
- When multi-fold cross validation is activated, the input train set is divided into `m` parts; at each fold one part is used as the test set and the other parts as the train set (see the sketch after this list). The test set specified by input arg `-testset` must be `None` in this mode.
- Training results & logs are saved in a separate folder for each fold.
- Multi-fold cross validation is mainly designed for small-data scenarios; `m` is recommended to be in the range [3, 5].
- Parallel running: multi-fold cross validation supports local parallel running
  1. First start a multi-fold training as usual and note the `save_root_folder` of this run
  2. Start one or more additional multi-fold trainings with all settings the same as the first run, and explicitly set the input arg `-save_root_folder` to the root folder above
  3. "Local" means the parallel synchronization is done through the local file system. In HPC or cloud environments, different training processes can run on different computing nodes as long as they share the same file system.
  4. The arg `-shuffle_trainset` only affects the first training process.
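
The split itself can be pictured as below (a sketch of the described behavior, not the repo's exact implementation behind `mfold_indexs.gpkl`):

```python
import numpy as np

def mfold_indices(n_samples, m, seed=0):
    """Shuffle sample indices and split them into m roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), m)

folds = mfold_indices(1000, 5)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    # fold k: train on train_idx, test on test_idx
```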

### Hyperparameter Tuning

- Automatic hyperparameter tuning is supported via grid search. Enable it with arg `-grid_search n`, in which `n` is the maximum number of trials; set it to 0 to disable grid search (a minimal optuna sketch follows this list).
- Grid search and multi-fold cross validation must not be activated simultaneously.
- The best hyperparameter combination will be saved into a `best_trial_xxx.json` file under the training `save_root_folder`.
- Currently, hyperparameter search spaces for `model_4v4` and `model_4v7` are defined; check the details in `grid_search.py`. Note these search spaces are quite large, defined mainly for model structure optimization, not for per-dataset tuning. If you intend to do per-dataset tuning, narrow these search spaces down dramatically; contact the authors if you need help modifying the default search space definitions.
- Resume running: to resume an interrupted grid search, start a new grid search process with input arg `-save_root_folder` set to the folder of the previous run.
- Parallel running: grid search supports local parallel running
  1. First, start a grid search as usual and note the `save_root_folder` of this run; remember to reduce the number of trials specified by input arg `-grid_search` accordingly (roughly total_planned_trials / parallel_runs)
  2. Start one or more additional grid searches with all settings the same as the first run (except for `-grid_search`, which you can set to any appropriate number), and explicitly set the input arg `-save_root_folder` to the root folder above
  3. "Local" means the parallel synchronization is done through the local file system. In HPC or cloud environments, different grid search processes can run on different computing nodes as long as they share the same file system.
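
Since `grid_search.py` depends on optuna, a grid search over a toy space can be sketched as follows (the search space and the `train_and_evaluate` helper are hypothetical; the real spaces for `model_4v4`/`model_4v7` live in `grid_search.py`):

```python
import optuna

search_space = {"lr": [1e-3, 1e-2, 1e-1], "batch_size": [32, 64, 128]}  # hypothetical space

def objective(trial):
    lr = trial.suggest_categorical("lr", search_space["lr"])
    batch_size = trial.suggest_categorical("batch_size", search_space["batch_size"])
    return train_and_evaluate(lr, batch_size)  # hypothetical helper returning a test metric

# GridSampler enumerates the full grid; 3 x 3 = 9 trials here
study = optuna.create_study(sampler=optuna.samplers.GridSampler(search_space))
study.optimize(objective, n_trials=9)
print(study.best_trial.params)
```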

### Comparative Re-run

- For convenient re-runs for comparison purposes, set `-srcfolder` to the folder path of a previous run; the training program will load the trainset/testset as well as the `mfold_indexs.gpkl` file from there.
- For plain training, the trainset and testset will be loaded from `-srcfolder` for the new run.
- For multi-fold cross validation, both `trainset.gpkl` and `mfold_indexs.gpkl` will be loaded from `-srcfolder` for the new run.
- For bundle training, the order of each run as well as the trainset/testset for each run will be loaded from `-srcfolder`, and the new bundle training will be executed in the same order.

## Cite Us

If you find this work useful, please cite:

```
@article{leng2021enhance,
  title={Enhance Information Propagation for Graph Neural Network by Heterogeneous Aggregations},
  author={Leng, Dawei and Guo, Jinjiang and others},
  journal={arXiv preprint arXiv:2102.04064},
  year={2021}
}
```
