atom and bond features #137

fhvermei · 2021-02-05T19:28:53Z

Enable addition of bond features on top of atom features (as implemented by Yanfei). Also add the option to implement those for the validation and test set. Additionally have an option to overwrite the default atom or bond features used for MPNN. Default scaling of the features with an option to omit scaling.

These additional options were requested by one of the members of the consortium.

fhvermei · 2021-02-05T19:32:41Z

2 doubts I have:

will this still allow old checkpoint files to be read correctly?
does this still run ok with cuda?

lgtm-com · 2021-02-05T20:34:50Z

This pull request introduces 1 alert when merging 4efedc6 into 8258fcb - view on LGTM.com

new alerts:

1 for Unused import

chemprop/data/data.py

chemprop/data/utils.py

chemprop/features/featurization.py

chemprop/utils.py

chemprop/train/make_predictions.py

hesther

Thanks Florence, these are some great changes! I have just reviewed the code: I think it is mostly fine, I have commented on a number of minor issues on:

variable nomenclature (I think we should use bond_features instead of bond_descriptors in all variable and function names)
a suggestion to ensure backwards compatibility reading from checkpoints concerning the scalers, as well as preventing the error of using scaling in training but not during prediction
a few minor typos in doc strings or error messages

We should probably also update the readme and docs files to describe the new functionalities.

I have furthermore ran a number of tests (which ran fine apart from the comments above), and will rerun them once we have the mentioned changes implemented. I will also test this with CUDA then.

chemprop/args.py

chemprop/features/featurization.py

lgtm-com · 2021-02-09T21:07:10Z

This pull request introduces 1 alert when merging 0083ca0 into 96efc6d - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2021-02-10T17:48:23Z

This pull request introduces 1 alert when merging 969ea46 into 96efc6d - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2021-02-10T18:40:20Z

This pull request introduces 1 alert when merging e4c5bc1 into 96efc6d - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2021-02-10T21:34:56Z

This pull request introduces 1 alert when merging c19136d into 96efc6d - view on LGTM.com

new alerts:

1 for Unused import

chemprop/features/utils.py

fhvermei · 2021-02-10T22:15:50Z

I think I fixed all the comments for now :) Let me know if there are more changes required!
(I did some tests but not as extensively as I did the first time)

And thank you for taking the time to look through this!

fhvermei · 2021-02-10T22:45:37Z

I also adjusted the ReadMe. Did not add info on using atom/bond features paths for separate validation and test sets, is that necessary?
Are the docs updated automatically?

hesther · 2021-02-11T15:09:07Z

The docs update automatically only in parts namely for all the function and class descriptions. However, there is a 'tutorial' part that would need to be updated manually (it is similar to what's in the README, but rst instead of md format): docs/source/tutorial.rst. I will run some tests on the updated code now.

hesther

Just tested on cpu and gpu, seems to run fine.

cjmcgill · 2021-02-13T16:42:00Z

@fhvermei sorry, my changes to smiles_processing has caused a conflict for you in run_training. When you resolve the conflict, please make sure to add smiles_columns=args.smiles_columns to the arguments you use for get_data.

The new arguments allow to have seperate atom and bond features paths in case separate test or validation data sets are used

The separate bond fetaures are read in the same manner as atom features. The order of the bonds in the separate file must match the RDKit bond order. The new bond features are added to the default bond features that are used for the D-MPNN.

This check validates if the number of extra atom or bond features matches the number of atoms or bonds in the molecules and throws an error if this is not the case. This check is done in data.py and not before to avoid the computational cost of making additional RDKit molecule graph structures.

To validate if the length of the atom descriptors matched the number of atoms in the molecule. An error is thrown if this is not the case.

Make consistently use of bond_descriptors for now. Even though bond_features would make more sense, there are some places in the code where this would seem weird. (might fix this later)

Validate if extra atom or bond features are needed and provided if the test and validation data are given with a separate path.

Validate if extra atom or bond features are needed and provided if the validation and test data are provided by a separate path.

This new feature allows to overwrite the default atom or bond features that are used in the D-MPNN. Before, without this argument, the extra provided features are used to concatenate to the default fetaures.

The extra atom and bond features that are provided by the user are now scaled by default. Arguments can be used to disable the scaling.

The data reading in interpret.py is not yet adjusted to allow extra atom or bond features. An error is thrown when the interpret.py function is used with additional atom or bond features.

After previous bug fix it makes more sense to use the naming bond features instead of bond descriptors. Also make changes to arguments that are only valid for atom features and not for both features and descriptors.

The changes will result in conflicts for reading of old model checkpoint files. Extra checks are added such that old and new checkpoint files can be used.

Check if scaling is done the same way for training and prediction.

Arguments were not passes correctly in case there are only atom messages.

Some arguments that only apply to training are changed to TrainArgs. Extra checks that were present for predicting are now removed.

mliu49 · 2021-02-18T19:24:26Z

Thanks for rebasing and rewording the commits! I will go ahead and merge since it has been reviewed and tested by others.

fhvermei requested review from swansonk14, mliu49 and hesther February 5, 2021 19:28