Adding grover featurizer #3138

arunppsg · 2022-12-13T07:41:19Z

In this PR, I am adding the Grover featurizer to the suite of featurizers in deepchem.

tonydavis629

Looks good. I think we should figure out how to integrate the one-hot encoder into the moleculer_featurizer one-hot encoder, since there are already two one-hot encoders in deepchem. I also had a few questions that a comment or a little more detail would clarify.

tonydavis629 · 2022-12-23T14:40:11Z

deepchem/feat/molecule_featurizers/grover_featurizer.py

+    ]
+}
+
+_BOND_FDIM = 14


No atom feature dim needed? It's not the same as len(_ATOM_FEATURES), and it may be necessary when we do batching.

arunppsg · 2023-01-16

In the latest change, I reused utilities from the DMPNN featurizer. This removed the need for BOND_FDIM as well as ATOM_FDIM in the grover featurizer. The grover featurizer is similar to DMPNNFeaturizer; in the longer run, I think the two should be unified.
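To illustrate the reviewer's point that the atom feature dimension differs from `len(_ATOM_FEATURES)`, here is a minimal sketch assuming each feature is one-hot encoded over its list of allowed values, plus one slot for unknown values. The feature table and the helper names (`ATOM_FEATURES`, `one_hot`, `atom_feature_dim`, `featurize_atom`) are hypothetical and are not the values or utilities used in deepchem.

```python
from typing import Dict, List, Sequence

# Hypothetical feature table: illustrative values only, not deepchem's lists.
ATOM_FEATURES: Dict[str, List[int]] = {
    "atomic_num": [1, 6, 7, 8],
    "degree": [0, 1, 2, 3, 4],
    "formal_charge": [-1, 0, 1],
}


def one_hot(value: int, choices: Sequence[int]) -> List[int]:
    """One-hot encode `value` over `choices`; the last slot marks unknown values."""
    encoding = [0] * (len(choices) + 1)
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding


def atom_feature_dim(features: Dict[str, List[int]]) -> int:
    """Total vector length: one slot per allowed value plus one unknown slot per feature."""
    return sum(len(choices) + 1 for choices in features.values())


def featurize_atom(atom: Dict[str, int]) -> List[int]:
    """Concatenate the one-hot encodings of every feature in table order."""
    vector: List[int] = []
    for name, choices in ATOM_FEATURES.items():
        vector.extend(one_hot(atom[name], choices))
    return vector


carbon = {"atomic_num": 6, "degree": 4, "formal_charge": 0}
vector = featurize_atom(carbon)
```

In this sketch the table has 3 entries but the feature vector has 15 elements, which is why a separate dimension constant (or a shared utility that computes it) is useful when allocating batched arrays.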

tonydavis629 · 2022-12-23T14:51:37Z

deepchem/feat/molecule_featurizers/grover_featurizer.py

+  as input and computes the following sets of features:
+    1. a molecular graph from the input molecule
+    2. functional groups which are used **only** during pretraining
+    3. additional features which can **only** be used during finetuning


Why can't the optional features be used in pretraining?

arunppsg · 2023-01-16

The pretraining task depends only on a specific set of features from the molecule: the functional groups, the atom vocabulary, and the bond vocabulary. Hence, the optional features are not required in pretraining. Note: the atom vocabulary and bond vocabulary are generated at training time and are not included in this PR.

tonydavis629 · 2022-12-23T15:12:11Z

deepchem/feat/molecule_featurizers/rdkit_descriptors.py

  (The implementation for normalization is based on `RDKit2DNormalized()` method
  in 'descriptastorus' library.)

-  The neural network architecture requires that the features are appropriately scaled to prevent
-  features with large ranges from dominating smaller ranged features, as well as preventing
-  issues where features in the training set are not drawn from the same sample distribution as
-  features in the testing set. To prevent these issues, a large sample of molecules is used to fit
-  cumulative density functions (CDFs) to all features.
-
-  CDFs were used as opposed to simpler scaling algorithms mainly because CDFs have the useful
-  property that 'each value has the same meaning: the percentage of the population observed below
-  the raw feature value.'
+  When the `is_normalized` option is set as True, descriptor values are normalized across the sample
+  by fitting a cumulative density function. CDFs were used as opposed to simpler scaling algorithms
+  mainly because CDFs have the useful property that 'each value has the same meaning: the percentage
+  of the population observed below the raw feature value.'

  Warning: Currently, the normalizing cdf parameters are not available for BCUT2D descriptors.
  (BCUT2D_MWHI, BCUT2D_MWLOW, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW)

-  Attributes
-  ----------
-  descriptors: List[str]
-    List of RDKit descriptor names used in this class.
-
  Note


Maybe add some info about enabling custom descriptors.

arunppsg · 2023-01-16

I will do that in a separate PR that improves the overall docs.
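For readers unfamiliar with the CDF normalization described in the docstring above, a minimal sketch follows. It assumes an empirical CDF built from a reference sample, which is a simplification swapped in for illustration and not necessarily how the `descriptastorus` implementation fits its distributions. The function names `fit_empirical_cdf` and `cdf_normalize` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def fit_empirical_cdf(reference: Sequence[float]) -> List[float]:
    """'Fit' an empirical CDF by sorting the reference sample once."""
    return sorted(reference)


def cdf_normalize(value: float, sorted_reference: List[float]) -> float:
    """Return the fraction of the reference sample at or below `value`, in [0, 1]."""
    return bisect_right(sorted_reference, value) / len(sorted_reference)


# A reference sample with one large outlier (made-up numbers).
reference = fit_empirical_cdf([1.0, 2.0, 3.0, 4.0, 100.0])
normalized = [cdf_normalize(v, reference) for v in (0.0, 3.0, 1000.0)]
```

Every normalized value then carries the same meaning regardless of the raw descriptor's range: the share of the reference population observed at or below it. The outlier at 100.0 does not compress the other values the way min-max scaling would.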

rbharath

LGTM. Good to merge in once @tonydavis629's comments on the thread have been addressed.

arunppsg · 2023-01-16T18:15:59Z

A note to self: I need to fix the mypy error and a failing unit test before merging.

arunppsg · 2023-01-20T11:21:17Z

Fixed the mypy error and the failing unit test. Going to merge this in, @rbharath, as it has been hanging around for some time now.

arunppsg marked this pull request as draft December 13, 2022 07:41

arunppsg force-pushed the grover_feat branch 2 times, most recently from e9b43a0 to 5d52672 Compare December 23, 2022 05:20

arunppsg marked this pull request as ready for review December 23, 2022 06:56

tonydavis629 requested changes Dec 23, 2022

arunppsg force-pushed the grover_feat branch from 5d52672 to 19b53d6 Compare December 30, 2022 03:51

arunppsg force-pushed the grover_feat branch from 19b53d6 to 651c279 Compare January 9, 2023 11:37

rbharath approved these changes Jan 16, 2023

arunppsg force-pushed the grover_feat branch from 651c279 to e68fcc3 Compare January 17, 2023 19:05

arunppsg added 5 commits January 20, 2023 15:03

added grover featurizer

e32eafe

added repr for str, int, float attributes

db557ff

test for grover featurizer

9b0bed1

minor fix

27620f7

mypy fixes

7b2c2e6

arunppsg force-pushed the grover_feat branch from e68fcc3 to 7b2c2e6 Compare January 20, 2023 09:55

arunppsg merged commit ba12d6e into deepchem:master Jan 20, 2023

arunppsg deleted the grover_feat branch April 11, 2023 09:20
